APPLICATION OF SUPERVISED LEARNING
DEEP LEARNING
Question
We are creating a Transformer using multi-headed attention, such that input embeddings of dimension 128 match the output shape of our self-attention layer. If we use multi-headed attention with 4 heads, what dimensionality will the outputs of each head have?
32
64
128
512
Explanation:
Detailed explanation-1: -The model dimension is divided evenly among the heads: with input embeddings (and self-attention outputs) of dimension 128 and 4 heads, each head produces outputs of dimension 128 / 4 = 32, so the correct answer is 32.
Detailed explanation-2: -In the Transformer, the Attention module repeats its computations multiple times in parallel; each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters N ways and passes each split independently through a separate head, so each head works on only a slice of the full embedding dimension.
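A minimal sketch of this split, assuming PyTorch (the tensor names and shapes below are illustrative, not part of the question): with d_model = 128 and num_heads = 4, each head receives and outputs 32-dimensional vectors.

import torch

d_model = 128                     # dimension of the input embeddings
num_heads = 4                     # number of attention heads
head_dim = d_model // num_heads   # 128 / 4 = 32 dimensions per head

batch, seq_len = 2, 10
x = torch.randn(batch, seq_len, d_model)

# Project to queries (the same reshaping applies to keys and values),
# then split the 128-dimensional projection into 4 heads of 32 dims each.
w_q = torch.nn.Linear(d_model, d_model, bias=False)
q = w_q(x)                                        # (batch, seq_len, 128)
q = q.view(batch, seq_len, num_heads, head_dim)   # (batch, seq_len, 4, 32)
q = q.transpose(1, 2)                             # (batch, 4, seq_len, 32)

print(q.shape[-1])  # 32 -> per-head output dimensionality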