Reading time: 8 min
Prerequisites: Know what a save file is uwu
Survival rate: 100% (your GPU lives)
The Problem (Why You Should Care)
You know how AI forgets stuff in long conversations?
You: "My name is Dave"
... 100 messages later ...
You: "What's my name?"
AI: "I don't have access to that information"
You: *dies inside*
That’s the quadratic curse.
The MMO Raid Analogy
Imagine you’re raid leading in an MMO:
SMALL RAID (10 players):
- Everyone checks everyone's health
- 10 × 10 = 100 checks
- Your PC: fine
MEDIUM RAID (100 players):
- Everyone checks everyone's health
- 100 × 100 = 10,000 checks
- Your PC: sweating
MEGA RAID (10,000 players):
- Everyone checks everyone's health
- 10k × 10k = 100,000,000 checks
- Your PC: literal fire
- You: dead
- Raid: wiped
That’s your GPU running transformer attention.
The Math (Don’t Panic)
TRANSFORMER ATTENTION:
"Every token looks at every other token"
10 tokens: 10 × 10 = 100 ✅
100 tokens: 100 × 100 = 10k ✅
1k tokens: 1k × 1k = 1M 😰
10k tokens: 10k × 10k = 100M 🔥
100k tokens: 100k × 100k = 10B 💀
This is O(n²)
The bigger n gets, the FASTER it explodes
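don't trust me? here's a throwaway sketch you can paste into any Python prompt. it's not a model, it just multiplies the counts out:

```python
# not a model, just the counting from above: every token looks at every token
for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"{n:>7} tokens -> {n * n:>18,} pairwise lookups")
```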
Why It Happens: Everyone Talks to Everyone
DISCORD SERVER ANALOGY:
Small server (10 people):
- Everyone can DM everyone
- 10 × 10 = 100 possible DMs
- manageable
Huge server (10,000 people):
- Everyone DMing everyone???
- 10k × 10k = 100M DMs
- Discord would literally die
Transformers = everyone DMing everyone
That's why context windows have limits
The Three Weapons
there are three ways to break the curse:
WEAPON 1: Associativity (reorder the math)
WEAPON 2: Delta Rule (update memory, not rebuild)
WEAPON 3: State Space (fixed-size notebook)
All three give you O(n) instead of O(n²)
let’s look at each uwu
Weapon 1: Associativity (The Parentheses Trick)
remember PEMDAS? the parentheses don’t change the answer, but they change how much work you do:
NORMAL ATTENTION:
(Q × Kᵀ) × V
↓
build n×n matrix first
↓
O(n²) 💀
LINEAR ATTENTION:
Q × (Kᵀ × V)
↓
build d×d matrix first (d is small!)
↓
O(n) ✅
it’s the exact same multiplication, just grouped differently!
GAME ANALOGY:
BAD: Re-roll your full damage calc against every enemy, for EVERY enemy
     1000 enemies = 1000 × 1000 = 1M rolls
GOOD: Calculate your damage modifier ONCE
apply to each enemy
1000 enemies = 1000 rolls
Same result. Way less work.
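wanna see the trick in actual code? here's a tiny numpy sketch (random data, no softmax, and d=64 is just an assumed head size), not how any real library implements it:

```python
import numpy as np

n, d = 2_000, 64                 # 2k tokens, assumed head size of 64
Q = np.random.randn(n, d)
K = np.random.randn(n, d)
V = np.random.randn(n, d)

# quadratic grouping: the intermediate is an n x n monster
slow = (Q @ K.T) @ V             # intermediate shape: (2000, 2000)

# linear grouping: the intermediate is a tiny d x d matrix
fast = Q @ (K.T @ V)             # intermediate shape: (64, 64)

print(np.allclose(slow, fast))   # True: same answer, way less work
```

the catch: real attention puts a softmax between Q and Kᵀ, which is exactly the part linear attention has to drop or approximate (that's the "loses some precision" you'll see later).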
Weapon 2: Delta Rule (The Git Amend)
instead of rebuilding everything from scratch, you just apply the diff:
NORMAL ATTENTION:
- Read ALL previous messages
- Recalculate EVERYTHING
- O(n²)
DELTA ATTENTION:
- Look at current memory
- What's NEW? (the delta)
- Update only that
- O(n)
GIT ANALOGY:
BAD: git clone the entire repo every commit
1000 commits = download 1000 repos
GOOD: git pull (just get the delta)
1000 commits = 1000 small downloads
That's delta attention!
It only cares about what CHANGED.
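here's what that memory update can look like in code. this is a hand-wavy numpy sketch of the classic delta rule (Widrow-Hoff style) with a d × d state matrix and a made-up write strength beta, not any specific model's exact formula:

```python
import numpy as np

d = 64
S = np.zeros((d, d))        # fixed-size memory: keys in, values out
beta = 0.5                  # assumed write strength

def delta_update(S, k, v, beta):
    error = v - S @ k                      # what the memory gets WRONG about this key
    return S + beta * np.outer(error, k)   # patch only that delta

for _ in range(1_000):                     # stream 1000 tokens past the memory
    k = np.random.randn(d); k /= np.linalg.norm(k)
    v = np.random.randn(d)
    S = delta_update(S, k, v, beta)        # one O(d²) touch per token -> O(n) total

print(S.shape)                             # (64, 64), no matter how long the stream
```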
Weapon 3: State Space (The Notebook)
this is what RWKV does:
TRANSFORMER:
- KV cache grows with context
- 1M tokens = 1M cached tokens
- memory explodes
STATE SPACE (RWKV):
- fixed-size "notebook"
- 1M tokens = same notebook size
- memory constant
SAVE FILE ANALOGY:
TRANSFORMER save:
- Records EVERY input you ever made
- 100 hours = huge save file
- Eventually too big to load
RWKV save:
- Just saves current state
- 100 hours = same save size
- Always loads fast
Same game. Different save system.
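in code, the "notebook" idea is just a recurrence over a fixed-size state. here's a toy numpy sketch (a generic decay-and-write update with a made-up decay factor, NOT RWKV-7's actual formulas):

```python
import numpy as np

d = 64
state = np.zeros(d)               # the whole "save file"
decay = 0.95                      # assumed forgetting factor

def step(state, x, decay):
    return decay * state + (1 - decay) * x   # old notes fade, the new note gets written

for _ in range(100_000):          # 100k tokens stream past...
    x = np.random.randn(d)
    state = step(state, x, decay)

print(state.nbytes, "bytes")      # ...and the save file is still 512 bytes
```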
The Numbers That Matter
@ 100,000 TOKENS:
TRANSFORMER:
- Operations: 100k × 100k = 10 BILLION
- Memory: gigabytes (and growing)
- Hardware: A100 cluster
LINEAR (any of the 3 weapons):
- Operations: 100k × 64 = 6.4 MILLION
- Memory: megabytes (constant)
- Hardware: your gaming rig
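those numbers aren't magic, they're just the two multiplications (64 here is the assumed state size d from earlier):

```python
n, d = 100_000, 64
print(f"quadratic: {n * n:,} ops")   # 10,000,000,000
print(f"linear:    {n * d:,} ops")   # 6,400,000
```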
Which Weapon When?
ASSOCIATIVITY (Linear Transformer):
+ Drop-in replacement for attention
+ Same training as transformers
- Loses some precision (no softmax)
DELTA RULE (Delta Attention):
+ Learns what to remember/forget
+ Very long context
- Needs special training
STATE SPACE (RWKV, Mamba):
+ Infinite context (theoretically)
+ Fast inference
+ Can edit memory (v7)
- Different architecture entirely
The Visualization
TRANSFORMER (Quadratic):
┌─────────────────────────────┐
│ T1 ↔ T2 ↔ T3 ↔ T4 ↔ T5 │
│ ↕ ↕ ↕ ↕ ↕ │
│ T6 ↔ T7 ↔ T8 ↔ T9 ↔ T10 │
│ │
│ Everyone connects to all │
│ 10 tokens = 100 connections│
│ n tokens = n² connections │
└─────────────────────────────┘
LINEAR (Any Weapon):
┌─────────────────────────────┐
│ │
│ T1 → [MEMORY] ← T2 │
│ T3 → [MEMORY] ← T4 │
│ T5 → [MEMORY] ← T6 │
│ │
│ Everyone talks to memory │
│ 10 tokens = 10 updates │
│ n tokens = n updates │
└─────────────────────────────┘
TL;DR
| Thing | Quadratic | Linear |
|---|---|---|
| Cost | O(n²) | O(n) |
| 100k tokens | 10 billion ops | 6 million ops |
| Memory | grows forever | stays constant |
| Analogy | everyone DMs everyone | everyone posts to #general |
| Hardware | datacenter | gaming rig |
The Catch (Honest Section)
Why isn't everyone using linear attention?
1. Transformers are well-understood
2. Linear loses some precision
3. Training dynamics are different
4. Ecosystem (llama.cpp) still catching up
5. Some tasks NEED quadratic precision
But for long context?
The curse IS breakable.
The weapons work.
What We Ship
RWKV-7 "Goose" in eldr.ᚲ:
- Uses state space + delta rule
- O(n) attention
- 48 tok/s on GPU
- Infinite context (theoretically)
See: uwu-rwkv7-101 for details
You Survived!
now you understand:
- O(n²) = everyone DMing everyone (slow)
- O(n) = everyone posting to #general (fast)
- 3 weapons exist: associativity, delta rule, state space
- the curse was never inevitable - just the first design
your GPU thanks you uwu
rune.みんな