2024-11-19 02:53:22
Tom_Drummond on Nostr, in reply to nprofile1q…r59k6 and nprofile1q…w4xpp:

Ah - I think I understand your point now.

If in practice tokens only ever attend to recent tokens, then full attention matrices are very inefficient. I believe some implementations use banded (lower-triangular band) attention masks, i.e. rolling context windows, to address this. But that would eliminate the occasional long-distance reference that a full-fat transformer permits; maybe that matters…? A sketch of the banded-mask idea is below.
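To make the banded idea concrete, here is a minimal single-head sketch in PyTorch; this is my own illustration, not any particular library's implementation, and the names `banded_causal_mask`, `windowed_attention`, and the `window` parameter (the band width) are hypothetical. For clarity it still materializes the full score matrix and only masks it; a real kernel would compute just the band, cutting cost from O(n²) to O(n · window).

```python
import torch

def banded_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: token i may attend to token j
    # only if j <= i (causal) and i - j < window (at most `window` back).
    i = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    return (j <= i) & (i - j < window)

def windowed_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       window: int) -> torch.Tensor:
    # q, k, v: (seq_len, d) single-head tensors, for simplicity.
    seq_len, d = q.shape
    scores = (q @ k.T) / d ** 0.5                      # full (n, n) logits
    mask = banded_causal_mask(seq_len, window)
    scores = scores.masked_fill(~mask, float("-inf"))  # outside band -> -inf
    return torch.softmax(scores, dim=-1) @ v           # weighted sum of values

# Usage: each position mixes only itself and the 3 tokens before it.
q = k = v = torch.randn(16, 8)
out = windowed_attention(q, k, v, window=4)
```

Note the trade-off this makes explicit: anything more than `window` positions back is simply invisible to a layer, which is exactly the lost long-distance reference. Stacking layers recovers some of it indirectly, since the effective receptive field grows by roughly `window` per layer.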
Author Public Key: npub1g4ss3v8573z4aus0ytsq3ewah9cphk5p48ffnfz5hp8htsc9ndjqphtq97