APPLICATION OF SUPERVISED LEARNING
DEEP LEARNING
Question
True or False: In practice we often use multi-headed self-attention. This intuitively works because it is more useful to learn multiple simpler transformations into distinct subspaces than one complicated transformation into a richer subspace.
(A) True
(B) False
(C) Either A or B
(D) None of the above

Correct answer: (A) True. You might imagine a situation in which our word embedding somehow captures whether a word is a noun or a verb. Then it might be easier to learn long-range noun-noun and verb-verb relations if we learn two distinct, simpler subspaces (by using 2 attention heads) than one richer subspace, where the MLP would have to disentangle these notions. Additionally, multiple heads might make it easier for the transformed value space to amplify certain semantically distinct properties, which can be leveraged by subsequent layers.
Explanation:
Detailed explanation-1: Multi-head attention is a module that runs an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension.
Detailed explanation-2: Another suggested advantage of multi-head attention is training stability: it requires fewer layers than single-head attention when using the same number of subspaces.
Detailed explanation-3: Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
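For reference, here is a minimal sketch of multi-head self-attention that mirrors the intuition in the question and in detailed explanation-1: each head attends within its own smaller subspace, and the per-head outputs are concatenated and passed through a final linear projection. This is an illustrative NumPy sketch; the function names, dimensions, and random weights are assumptions for demonstration, not part of the question.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace

    # Project into query/key/value spaces, then split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)

    # Concatenate the per-head outputs and mix them with a final linear map.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Tiny usage example with 2 heads, echoing the noun/verb intuition above
# (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 5, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8)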