MACHINE LEARNING

APPLICATION OF SUPERVISED LEARNING

DEEP LEARNING

Question
The time complexity of scaled dot-product attention, with queries Q of shape N×d, keys K of shape M×d, and values V of shape M×l, is
A
O(NM(d+l))
B
O(NM^2dl)
C
O(NM^2d^2l)
D
O(NMd^2l)
Explanation:
Correct answer: A. Given two matrices A (N×M) and B (M×L), multiplying them has time complexity O(NML). Applying this to the question: the QK^T multiplication costs O(NMd); the softmax then takes M operations per row (a sum along the row followed by a division) across N rows, i.e. O(NM); finally, multiplying the resulting N×M attention matrix by V costs O(NMl). In total this gives O(NM(d + l + 1)) = O(NM(d + l)), which is choice A.
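To make the shape bookkeeping concrete, here is a minimal NumPy sketch of scaled dot-product attention; the function name and the toy shapes (N=4, M=6, d=8, l=5) are illustrative assumptions, not part of the original question.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (N, d) queries, K: (M, d) keys, V: (M, l) values
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # QK^T: O(NMd), gives an (N, M) matrix
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: O(NM)
    return weights @ V                             # (N, M) times (M, l): O(NMl)

# Toy shapes: N=4 queries, M=6 keys/values, d=8, l=5
Q = np.random.randn(4, 8)
K = np.random.randn(6, 8)
V = np.random.randn(6, 5)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 5)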

Detailed explanation-1: -If we assume that q and k are d_k-dimensional vectors whose components are independent random variables with mean 0 and variance 1, then their dot product, q · k = Σ_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k; this is why the dot products are scaled by 1/√d_k in scaled dot-product attention.
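A quick numerical sanity check of that claim (this simulation is an added illustration, not part of the original explanation): draw many independent q, k with unit-variance components and compare the empirical variance of q · k with d_k, and of the scaled version with 1.

import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((100_000, d_k))   # components: mean 0, variance 1
k = rng.standard_normal((100_000, d_k))
dots = (q * k).sum(axis=1)                # q . k for each of the 100,000 sample pairs

print(round(dots.var(), 1))                    # close to d_k = 64
print(round((dots / np.sqrt(d_k)).var(), 2))   # close to 1 after scaling by 1/sqrt(d_k)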

Detailed explanation-2: -In summary, self-attention allows a transformer model to attend to different parts of the same input sequence, while cross-attention allows it to attend to different parts of another sequence (for example, the encoder output).
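The difference is easiest to see in the shapes. Below is a small sketch of both cases; the learned projection matrices are omitted for brevity, and the sequence lengths 7 and 12 are arbitrary assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8
dec = np.random.randn(7, d)    # "target" sequence of length 7
enc = np.random.randn(12, d)   # "source" sequence of length 12

# Self-attention: queries, keys and values all come from the same sequence.
self_out = softmax(dec @ dec.T / np.sqrt(d)) @ dec    # (7, d)

# Cross-attention: queries from one sequence, keys/values from the other.
cross_out = softmax(dec @ enc.T / np.sqrt(d)) @ enc   # (7, d)

print(self_out.shape, cross_out.shape)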

Detailed explanation-3: -While the two (additive attention and dot-product attention) are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
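As a rough illustration (a hypothetical micro-benchmark, not from the original text): both score functions cost O(NMd), but the dot-product version reduces to a single matrix multiplication, while additive (Bahdanau-style) attention materialises a large intermediate tensor and applies a nonlinearity, so it is typically much slower in practice.

import time
import numpy as np

N, M, d = 256, 256, 64
rng = np.random.default_rng(0)
Q, K = rng.standard_normal((N, d)), rng.standard_normal((M, d))
W1, W2, v = rng.standard_normal((d, d)), rng.standard_normal((d, d)), rng.standard_normal(d)

t0 = time.perf_counter()
# Additive scores: v . tanh(W1 q + W2 k) over all (query, key) pairs,
# which builds an (N, M, d) intermediate tensor.
scores_add = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :]) @ v   # (N, M)
t1 = time.perf_counter()
# Dot-product scores: one optimised matrix multiplication.
scores_dot = Q @ K.T / np.sqrt(d)                                       # (N, M)
t2 = time.perf_counter()

print(scores_add.shape, scores_dot.shape)
print(f"additive: {t1 - t0:.4f}s   dot-product: {t2 - t1:.6f}s")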
