nprofile1qy2hwumn8ghj7un9d3shjtnddaehgu3wwp6kyqpqc9m22hkc5h6zgrwkz48crhcpw6vch2rf6j97746ugl3neys86jeqcr59k6 As I understand it, masking is used to obtain causality so each token only attends to the tokens preceding it (and itself).

Each token emits a query and a key, and the code computes all pairwise dot products between queries and keys (using matmul) to create the attention matrix. This is then 'masked' by subtracting a large number from all the anti-causal logits in the matrix (everything above the leading diagonal if rows correspond to queries), so those entries end up with essentially zero weight after the softmax.
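To make that concrete, here is a minimal sketch of the masking step in plain Python. It is not the code under discussion: the function name `causal_attention_weights`, the constant `big`, and the toy queries and keys are all made up for illustration, and the usual scaling by the square root of the key dimension is left out to keep it short.

```python
import math

def causal_attention_weights(queries, keys, big=1e9):
    """Softmax over masked query-key logits (rows = queries, columns = keys)."""
    n = len(queries)
    # All pairwise dot products: logits[i][j] = queries[i] . keys[j]
    logits = [[sum(q * k for q, k in zip(queries[i], keys[j])) for j in range(n)]
              for i in range(n)]
    # Mask: subtract a large number above the leading diagonal (key j comes after query i)
    for i in range(n):
        for j in range(i + 1, n):
            logits[i][j] -= big
    # Row-wise softmax; the masked logits underflow to a weight of 0.0
    weights = []
    for row in logits:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical toy inputs: 3 tokens, 2-dimensional queries and keys
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
weights = causal_attention_weights(queries, keys)
```

Note that the full n by n set of dot products is still computed before the mask is applied, which is where the wasted work comes from.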

So the first token attends only to itself, the second attends to the first and itself, and so on. In general this uses only about half of the attention matrix: n(n+1)/2 of the n² entries for n tokens. So the rows for the early tokens are very wasteful, those in the middle are a little bit wasteful and those near the end are barely wasteful at all.
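The counting behind that can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic; the helper names `causal_usage` and `row_waste` are invented here and n = 4 is an arbitrary example size.

```python
def causal_usage(n):
    """Fraction of the n x n attention matrix that survives the causal mask."""
    used = n * (n + 1) // 2  # row i keeps i + 1 entries (0-indexed), summed over all rows
    return used / (n * n)

def row_waste(n):
    """Fraction of each row that is masked out, from the first token to the last."""
    return [(n - 1 - i) / n for i in range(n)]

usage = causal_usage(4)   # 10 of 16 entries are used
waste = row_waste(4)      # first row wastes the most, last row wastes nothing
```

As n grows, the used fraction tends towards one half from above.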

But maybe I have misunderstood your question...?

Tom_Drummond on Nostr: nprofile1q…r59k6 As I understand it, masking is used to obtain causality so each ...