APPLICATION OF SUPERVISED LEARNING
DEEP LEARNING
Question
We are creating a Transformer using multi-headed attention, such that input embeddings of dimension 128 match the output shape of our self-attention layer. If we use multi-headed attention with 4 heads, what dimensionality will the outputs of each head have?
32
64
128
512
Explanation:
Detailed explanation-1: -The model dimension is divided evenly among the heads: with input embeddings (and self-attention outputs) of dimension 128 and 4 heads, each head produces outputs of dimension 128 / 4 = 32, so the correct answer is 32.
Detailed explanation-2: -In the Transformer, the Attention module repeats its computations multiple times in parallel; each of these is called an Attention Head. The Attention module splits its Query, Key, and Value parameters N ways and passes each split independently through a separate head, so each head works on only a slice of the full embedding dimension.
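A minimal sketch of this split, assuming PyTorch (the tensor names and shapes below are illustrative, not part of the question): with d_model = 128 and num_heads = 4, each head receives and outputs 32-dimensional vectors.

import torch

d_model = 128                     # dimension of the input embeddings
num_heads = 4                     # number of attention heads
head_dim = d_model // num_heads   # 128 / 4 = 32 dimensions per head

batch, seq_len = 2, 10
x = torch.randn(batch, seq_len, d_model)

# Project to queries (the same reshaping applies to keys and values),
# then split the 128-dimensional projection into 4 heads of 32 dims each.
w_q = torch.nn.Linear(d_model, d_model, bias=False)
q = w_q(x)                                        # (batch, seq_len, 128)
q = q.view(batch, seq_len, num_heads, head_dim)   # (batch, seq_len, 4, 32)
q = q.transpose(1, 2)                             # (batch, 4, seq_len, 32)

print(q.shape[-1])  # 32 -> per-head output dimensionality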