uwu-attention-101

2026-01-29 12:00 1038 words 6 min read

The O(n²) attention curse explained through video games - and the weapons that break it. From quadratic hell to linear heaven.

Reading time: 6 min · Prerequisites: Know what a save file is uwu · Survival rate: 100% (your GPU lives)


The Problem (Why You Should Care)

You know how AI forgets stuff in long conversations?

You: "My name is Dave"
... 100 messages later ...
You: "What's my name?"
AI:  "I don't have access to that information"
You: *dies inside*

That’s the quadratic curse.


The MMO Raid Analogy

Imagine you’re raid leading in an MMO:

SMALL RAID (10 players):
- Everyone checks everyone's health
- 10 × 10 = 100 checks
- Your PC: fine

MEDIUM RAID (100 players):
- Everyone checks everyone's health
- 100 × 100 = 10,000 checks
- Your PC: sweating

MEGA RAID (10,000 players):
- Everyone checks everyone's health
- 10k × 10k = 100,000,000 checks
- Your PC: literal fire
- You: dead
- Raid: wiped

That’s your GPU running transformer attention.


The Math (Don’t Panic)

TRANSFORMER ATTENTION:
"Every token looks at every other token"

10 tokens:     10 × 10 = 100         ✅
100 tokens:    100 × 100 = 10k       ✅
1k tokens:     1k × 1k = 1M          😰
10k tokens:    10k × 10k = 100M      🔥
100k tokens:   100k × 100k = 10B     💀

This is O(n²)
The bigger n gets, the FASTER it explodes
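
wanna watch the explosion happen? here's a minimal numpy sketch of vanilla softmax attention (single head, no masking, toy sizes) - the `scores` matrix is the thing that blows up:

```python
import numpy as np

n, d = 1_000, 64                     # n tokens, d-dim head (toy sizes)
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

scores = Q @ K.T / np.sqrt(d)        # shape (n, n)  <- the quadratic part
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
out = weights @ V                    # shape (n, d)

print(scores.shape)   # (1000, 1000) -- double n, and this matrix quadruples
```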

Why It Happens: Everyone Talks to Everyone

DISCORD SERVER ANALOGY:

Small server (10 people):
- Everyone can DM everyone
- 10 × 10 = 100 possible DMs
- manageable

Huge server (10,000 people):
- Everyone DMing everyone???
- 10k × 10k = 100M DMs
- Discord would literally die

Transformers = everyone DMing everyone
That's why context windows have limits

The Three Weapons

there are three ways to break the curse:

WEAPON 1: Associativity (reorder the math)
WEAPON 2: Delta Rule (update memory, not rebuild)
WEAPON 3: State Space (fixed-size notebook)

All three give you O(n) instead of O(n²)

let’s look at each uwu


Weapon 1: Associativity (The Parentheses Trick)

remember PEMDAS? parentheses matter:

NORMAL ATTENTION:
    (Q × Kᵀ) × V
    builds the n×n matrix first
    O(n²) 💀

LINEAR ATTENTION:
    Q × (Kᵀ × V)
    builds a d×d matrix first (d = head dimension, stays small)
    O(n) ✅

it’s the same multiplication, just different order!

GAME ANALOGY:

BAD: Roll for EVERY enemy, THEN sum damage
     1000 enemies = 1000 × 1000 = 1M rolls

GOOD: Calculate your damage modifier ONCE
      apply to each enemy
      1000 enemies = 1000 rolls

Same result. Way less work.
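
here's the parentheses trick as a minimal numpy sketch. note there's no softmax - that's exactly the precision trade-off linear attention makes:

```python
import numpy as np

n, d = 2_000, 64                 # keep n modest so the n×n version fits in RAM
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

quadratic = (Q @ K.T) @ V        # builds an n×n matrix first: O(n²·d)
linear    = Q @ (K.T @ V)        # builds a d×d matrix first:  O(n·d²)

print(np.allclose(quadratic, linear))   # True -- same math, different order
```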

Weapon 2: Delta Rule (The Git Amend)

instead of rebuilding everything, just update the diff:

NORMAL ATTENTION:
- Read ALL previous messages
- Recalculate EVERYTHING
- O(n²)

DELTA ATTENTION:
- Look at current memory
- What's NEW? (the delta)
- Update only that
- O(n)

GIT ANALOGY:

BAD: git clone the entire repo every commit
     1000 commits = download 1000 repos

GOOD: git pull (just get the delta)
      1000 commits = 1000 small downloads

That's delta attention!
It only cares about what CHANGED.
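
here's the delta rule as a toy numpy sketch (simplified on purpose - real delta-rule layers like RWKV-7's add decay, gating, and a learned write strength, but the "only write what changed" core looks like this):

```python
import numpy as np

d = 8                              # key/value dimension (toy size)
S = np.zeros((d, d))               # fixed-size memory: maps keys -> values

def delta_update(S, k, v, beta=0.5):
    """Write only the part of v the memory currently gets wrong."""
    recalled = S @ k                       # what memory predicts for key k
    error = v - recalled                   # the delta: new info minus old recall
    return S + beta * np.outer(error, k)   # rank-1 fix, O(d²) per token

rng = np.random.default_rng(0)
for _ in range(1_000):                     # stream 1000 tokens: cost is n·d², not n²
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)                 # unit-norm keys keep the update stable
    v = rng.standard_normal(d)
    S = delta_update(S, k, v)

print(S.shape)   # (8, 8) -- the memory never grew
```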

Weapon 3: State Space (The Notebook)

this is what RWKV does:

TRANSFORMER:
- KV cache grows with context
- 1M tokens = 1M cached tokens
- memory explodes

STATE SPACE (RWKV):
- fixed-size "notebook"
- 1M tokens = same notebook size
- memory constant

SAVE FILE ANALOGY:

TRANSFORMER save:
- Records EVERY input you ever made
- 100 hours = huge save file
- Eventually too big to load

RWKV save:
- Just saves current state
- 100 hours = same save size
- Always loads fast

Same game. Different save system.
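
the two save systems as a toy sketch (the decay-and-write recurrence here is a stand-in for the idea, not the actual RWKV-7 update):

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)

kv_cache = []                 # transformer save: keeps every token's K and V
state = np.zeros((d, d))      # state-space save: one fixed-size notebook

for _ in range(10_000):
    k = rng.standard_normal(d)
    v = rng.standard_normal(d)
    kv_cache.append((k, v))                    # memory: O(n) and still climbing
    state = 0.99 * state + np.outer(v, k)      # decay old info, write new: O(d²)

print(len(kv_cache))   # 10000 entries (and growing)
print(state.shape)     # (64, 64) -- same size as at token 1
```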

The Numbers That Matter

@ 100,000 TOKENS:

TRANSFORMER:
- Operations: 100k × 100k = 10 BILLION
- Memory: gigabytes (and growing)
- Hardware: A100 cluster

LINEAR (any of the 3 weapons):
- Operations: 100k × 64 = 6.4 MILLION
- Memory: megabytes (constant)
- Hardware: your gaming rig
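
quick sanity check on those numbers (assuming a 64-dim state, which is where the 64 above comes from):

```python
n, d = 100_000, 64
print(f"quadratic: {n * n:,} ops")   # quadratic: 10,000,000,000 ops
print(f"linear:    {n * d:,} ops")   # linear:    6,400,000 ops
```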

Which Weapon When?

ASSOCIATIVITY (Linear Transformer):
+ Drop-in replacement for attention
+ Same training as transformers
- Loses some precision (no softmax)

DELTA RULE (Delta Attention):
+ Learns what to remember/forget
+ Very long context
- Needs special training

STATE SPACE (RWKV, Mamba):
+ Infinite context (theoretically)
+ Fast inference
+ Can edit memory (v7)
- Different architecture entirely

The Visualization

TRANSFORMER (Quadratic):
┌─────────────────────────────┐
│  T1 ↔ T2 ↔ T3 ↔ T4 ↔ T5   │
│   ↕    ↕    ↕    ↕    ↕    │
│  T6 ↔ T7 ↔ T8 ↔ T9 ↔ T10  │
│                             │
│  Everyone connects to all   │
│  10 tokens = 100 connections│
│  n tokens = n² connections  │
└─────────────────────────────┘

LINEAR (Any Weapon):
┌─────────────────────────────┐
│                             │
│  T1 → [MEMORY] ← T2        │
│  T3 → [MEMORY] ← T4        │
│  T5 → [MEMORY] ← T6        │
│                             │
│  Everyone talks to memory   │
│  10 tokens = 10 updates     │
│  n tokens = n updates       │
└─────────────────────────────┘

TL;DR

| Thing | Quadratic | Linear |
|---|---|---|
| Cost | O(n²) | O(n) |
| 100k tokens | 10 billion ops | 6 million ops |
| Memory | grows forever | stays constant |
| Analogy | everyone DMs everyone | everyone posts to #general |
| Hardware | datacenter | gaming rig |

The Catch (Honest Section)

Why isn't everyone using linear attention?

1. Transformers are well-understood
2. Linear loses some precision
3. Training dynamics are different
4. Ecosystem (llama.cpp) still catching up
5. Some tasks NEED quadratic precision

But for long context?
The curse IS breakable.
The weapons work.

What We Ship

RWKV-7 "Goose" in eldr.ᚲ:
- Uses state space + delta rule
- O(n) attention
- 48 tok/s on GPU
- Infinite context (theoretically)

See: uwu-rwkv7-101 for details

You Survived!

now you understand:

  • O(n²) = everyone DMing everyone (slow)
  • O(n) = everyone posting to #general (fast)
  • 3 weapons exist: associativity, delta rule, state space
  • the curse was never inevitable - just the first design

your GPU thanks you uwu


