2024-11-19 02:53:22
Tom_Drummond on Nostr, in reply to nprofile1q…r59k6 and nprofile1q…w4xpp:

Ah - I think I understand your point now.

If in practice tokens only ever attend to recent tokens, then full attention matrices are very inefficient. I believe some implementations use banded (lower-triangular band) attention masks, i.e. rolling context windows, to address this. But that would eliminate the occasional long-distance reference that a full-fat transformer permits; maybe that matters…? A sketch of the banded-mask idea is below.
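To make the banded idea concrete, here is a minimal single-head sketch in PyTorch; this is my own illustration, not any particular library's implementation, and the names `banded_causal_mask`, `windowed_attention`, and the `window` parameter (the band width) are hypothetical. For clarity it still materializes the full score matrix and only masks it; a real kernel would compute just the band, cutting cost from O(n²) to O(n · window).

```python
import torch

def banded_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: token i may attend to token j
    # only if j <= i (causal) and i - j < window (at most `window` back).
    i = torch.arange(seq_len).unsqueeze(1)   # (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)   # (1, seq_len)
    return (j <= i) & (i - j < window)

def windowed_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       window: int) -> torch.Tensor:
    # q, k, v: (seq_len, d) single-head tensors, for simplicity.
    seq_len, d = q.shape
    scores = (q @ k.T) / d ** 0.5                      # full (n, n) logits
    mask = banded_causal_mask(seq_len, window)
    scores = scores.masked_fill(~mask, float("-inf"))  # outside band -> -inf
    return torch.softmax(scores, dim=-1) @ v           # weighted sum of values

# Usage: each position mixes only itself and the 3 tokens before it.
q = k = v = torch.randn(16, 8)
out = windowed_attention(q, k, v, window=4)
```

Note the trade-off this makes explicit: anything more than `window` positions back is simply invisible to a layer, which is exactly the lost long-distance reference. Stacking layers recovers some of it indirectly, since the effective receptive field grows by roughly `window` per layer.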
Author Public Key: npub1g4ss3v8573z4aus0ytsq3ewah9cphk5p48ffnfz5hp8htsc9ndjqphtq97