APPLICATION OF SUPERVISED LEARNING
DEEP LEARNING
Question
True or False: In practice we often use multi-headed self-attention. This intuitively works because it is more useful to learn multiple simpler transformations into distinct subspaces than one complicated transformation into a richer subspace.
(A) True
(B) False
(C) Either A or B
(D) None of the above

Correct answer: (A) True. You might imagine a situation in which our word embedding somehow captures whether a word is a noun or a verb. Then it might be easier to learn long-range noun-noun and verb-verb relations if we learn two distinct, simpler subspaces (by using 2 attention heads) than one richer subspace, where the MLP would have to disentangle these notions. Additionally, multiple heads might make it easier for the transformed value space to amplify certain semantically distinct properties, which can be leveraged by subsequent layers.
Explanation:
Detailed explanation-1: Multi-head attention is a module that runs an attention mechanism several times in parallel. The independent attention outputs are then concatenated and linearly transformed into the expected dimension.
Detailed explanation-2: Another suggested advantage of multi-head attention is training stability: it requires fewer layers than single-head attention when using the same number of subspaces.
Detailed explanation-3: Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
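For reference, here is a minimal sketch of multi-head self-attention that mirrors the intuition in the question and in detailed explanation-1: each head attends within its own smaller subspace, and the per-head outputs are concatenated and passed through a final linear projection. This is an illustrative NumPy sketch; the function names, dimensions, and random weights are assumptions for demonstration, not part of the question.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace

    # Project into query/key/value spaces, then split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    def split_heads(M):
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq, d_head)

    # Concatenate the per-head outputs and mix them with a final linear map.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

# Tiny usage example with 2 heads, echoing the noun/verb intuition above
# (dimensions chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 8, 5, 2
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (5, 8)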