APPLICATION OF SUPERVISED LEARNING
DEEP LEARNING
Question
The time complexity of scaled dot-product attention, with queries Q (Nxd), keys K (Mxd), and values V (Mxl), is
O(NM(d+l))
O(NM^2dl)
O(NM^2d^2l)
O(NMd^2l)

Correct answer: O(NM(d+l)). See Notes: given two matrices A (NxM) and B (MxL), multiplying them has time complexity O(NML). Applying this to the question: the QK^T multiplication, with Q (Nxd) and K (Mxd), costs O(NMd); the row-wise softmax takes M operations (a sum along the row and a division) for each of the N rows, i.e. O(NM); finally, multiplying the resulting NxM matrix with V (Mxl) costs O(NMl). Altogether this gives O(NM(d+l+1)) = O(NM(d+l)).
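To make the counting concrete, here is a minimal NumPy sketch of scaled dot-product attention with the cost of each step annotated. The shapes N, M, d, l mirror the question; the function name and example values are illustrative, and the 1/sqrt(d) scaling follows the standard formulation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (N, d) queries, K: (M, d) keys, V: (M, l) values -> (N, l) output."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)                 # (N, M) score matrix: O(NMd)
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize exp; does not change softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax: O(NM)
    return weights @ V                              # (N, M) @ (M, l): O(NMl)

# Tiny example with the shapes used in the question.
N, M, d, l = 4, 6, 8, 5
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d))
K = rng.normal(size=(M, d))
V = rng.normal(size=(M, l))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 5)
```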
Explanation:
Detailed explanation-1: If we assume that q and k are d_k-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, q · k = ∑_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.
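A quick simulation (illustrative, not from the source) confirms this and motivates the 1/sqrt(d_k) scaling: with standard-normal components, the variance of q · k grows like d_k, while q · k / sqrt(d_k) keeps its variance close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
for d_k in (8, 64, 512):
    q = rng.normal(size=(n_samples, d_k))  # components: mean 0, variance 1
    k = rng.normal(size=(n_samples, d_k))
    dots = np.einsum("nd,nd->n", q, k)     # n_samples independent dot products q . k
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
    # variance of q . k is roughly d_k; dividing by sqrt(d_k) brings it back to ~1
```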
Detailed explanation-2: In summary, self-attention allows a transformer model to attend to different parts of the same input sequence, while cross-attention (encoder-decoder attention) allows it to attend to different parts of another sequence.
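A sketch of the difference, assuming the scaled_dot_product_attention function from the snippet above and illustrative projection matrices Wq, Wk, Wv: self-attention builds queries, keys, and values from the same sequence x, whereas cross-attention takes its queries from another sequence y (e.g. decoder states) and its keys and values from x (e.g. encoder outputs).

```python
# Assumes scaled_dot_product_attention(Q, K, V) from the earlier snippet.
# Wq, Wk, Wv are illustrative (d_model x d) projection matrices.

def self_attention(x, Wq, Wk, Wv):
    # Q, K, V all derived from the same sequence x: (T, d_model).
    return scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)

def cross_attention(y, x, Wq, Wk, Wv):
    # Queries from y (e.g. decoder states), keys/values from x (e.g. encoder outputs).
    return scaled_dot_product_attention(y @ Wq, x @ Wk, x @ Wv)
```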
Detailed explanation-3: While the two (additive attention and dot-product attention) are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
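As a rough illustration of why the dot-product form maps so well onto optimized matrix-multiplication routines (names and shapes here are illustrative; additive attention refers to the Bahdanau-style variant): the dot-product scores are a single matmul, while additive scores need a broadcast sum, a tanh, and an extra projection over an (N, M, h) intermediate tensor.

```python
import numpy as np

def dot_product_scores(Q, K):
    # Single optimized matmul: (N, d) @ (d, M) -> (N, M).
    return Q @ K.T

def additive_scores(Q, K, W1, W2, v):
    # score(q_i, k_j) = v . tanh(W1^T q_i + W2^T k_j)
    # Materializes an (N, M, h) tensor before reducing to (N, M).
    hidden = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :])
    return hidden @ v

N, M, d, h = 4, 6, 8, 16
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(N, d)), rng.normal(size=(M, d))
W1, W2, v = rng.normal(size=(d, h)), rng.normal(size=(d, h)), rng.normal(size=h)
print(dot_product_scores(Q, K).shape, additive_scores(Q, K, W1, W2, v).shape)  # (4, 6) (4, 6)
```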