MACHINE LEARNING

APPLICATION OF SUPERVISED LEARNING

DEEP LEARNING

Question
The time complexity of scaled dot-product attention, with queries Q of shape N×d, keys K of shape M×d, and values V of shape M×l, is
A
O(NM(d+l))
B
O(NM^2dl)
C
O(NM^2d^2l)
D
O(NMd^2l)
Explanation:
Correct answer: A. Given two matrices A (N×M) and B (M×L), multiplying them has time complexity O(NML). Applying this to the question: the QK^T multiplication costs O(NMd); the softmax then takes M operations per row (a sum along the row followed by a division) across N rows, i.e. O(NM); finally, multiplying the resulting N×M attention matrix by V costs O(NMl). In total this gives O(NM(d + l + 1)) = O(NM(d + l)), which is choice A.
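To make the shape bookkeeping concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function name and the toy shapes (N=4, M=6, d=8, l=5) are illustrative assumptions, not part of the original question.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, d) queries, K: (M, d) keys, V: (M, l) values
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # QK^T: O(NMd), gives an (N, M) matrix
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: O(NM)
    return weights @ V                             # (N, M) times (M, l): O(NMl)

# Toy shapes: N=4 queries, M=6 keys/values, d=8, l=5
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 5)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 5)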

Detailed explanation-1: -If we assume that q and k are d_k-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k; this is why the dot products are scaled by 1/√d_k in scaled dot-product attention.
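A quick numerical sanity check of that claim (this simulation is an added illustration, not part of the original explanation): draw many independent q, k with unit-variance components and compare the empirical variance of q · k with d_k, and of the scaled version with 1.

import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((100_000, d_k))   # components: mean 0, variance 1
k = rng.standard_normal((100_000, d_k))
dots = (q * k).sum(axis=1)                # q . k for each of the 100,000 sample pairs

print(round(dots.var(), 1))                    # close to d_k = 64
print(round((dots / np.sqrt(d_k)).var(), 2))   # close to 1 after scaling by 1/sqrt(d_k)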

Detailed explanation-2: -In summary, self-attention allows a transformer model to attend to different parts of the same input sequence, while cross-attention allows it to attend to different parts of another sequence (for example, the encoder output).
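The difference is easiest to see in the shapes. Below is a small sketch of both cases; the learned projection matrices are omitted for brevity, and the sequence lengths 7 and 12 are arbitrary assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8
dec = np.random.randn(7, d)    # "target" sequence of length 7
enc = np.random.randn(12, d)   # "source" sequence of length 12

# Self-attention: queries, keys and values all come from the same sequence.
self_out = softmax(dec @ dec.T / np.sqrt(d)) @ dec    # (7, d)

# Cross-attention: queries from one sequence, keys/values from the other.
cross_out = softmax(dec @ enc.T / np.sqrt(d)) @ enc   # (7, d)

print(self_out.shape, cross_out.shape)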

Detailed explanation-3: -While the two (additive attention and dot-product attention) are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
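As a rough illustration (a hypothetical micro-benchmark, not from the original text): both score functions cost O(NMd), but the dot-product version reduces to a single matrix multiplication, while additive (Bahdanau-style) attention materialises a large intermediate tensor and applies a nonlinearity, so it is typically much slower in practice.

import time
import numpy as np

N, M, d = 256, 256, 64
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((N, d)), rng.standard_normal((M, d))
W1, W2, v = rng.standard_normal((d, d)), rng.standard_normal((d, d)), rng.standard_normal(d)

t0 = time.perf_counter()
# Additive scores: v . tanh(W1 q + W2 k) over all (query, key) pairs,
# which builds an (N, M, d) intermediate tensor.
scores_add = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :]) @ v   # (N, M)
t1 = time.perf_counter()
# Dot-product scores: one optimised matrix multiplication.
scores_dot = Q @ K.T / np.sqrt(d)                                       # (N, M)
t2 = time.perf_counter()

print(scores_add.shape, scores_dot.shape)
print(f"additive: {t1 - t0:.4f}s   dot-product: {t2 - t1:.6f}s")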
