Reading time: 6 min · Prerequisites: None. We got you. · Survival rate: 100% (these beasts are friendly)
The Quest (Why You Should Care)
You’ve heard of Mamba. You’ve heard of Delta Attention.
Different papers. Different teams. Different names.
But here’s the secret nobody told you:
They're the same beast wearing different skins.
Once you see it, you can’t unsee it.
The Hunt Board Notice
QUEST TYPE: CAPTURE - DO NOT SLAY
"Two rare beasts have been sighted. MAMBA in the
western servers (Tri Dao, Dec 2023), DELTA in the
eastern datacenters (Kimi, Nov 2024).
Guild research indicates they may be related species.
Capture both for study. High reward."
The Two Beasts: First Impressions
| | 🐍 MAMBA | △ DELTA LINEAR |
|---|---|---|
| Born | Dec 2023 | Nov 2024 |
| Origin | Western papers | Eastern papers |
| Creator | Albert Gu & Tri Dao | Kimi/Moonshot |
| llama.cpp | ✅ supported | Issue #16930 |
| Status | Community favorite | New meta incoming |
They LOOK different. Different equations. Different variable names.
But watch how they fight…
The Scary Runes (Side by Side)
🐍 Mamba’s Rune
h_t = A·h_{t-1} + B·x_t
y_t = C·h_t
△ Delta’s Rune
S_t = S_{t-1} + β_t (v_t - S_{t-1} k_t) k_t^T
o_t = S_t · q_t
See? Totally different! …right?
Let’s bonk these runes.
Tamed Versions
🐍 Mamba Says
"I have a memory (h).
Every step, I update it:
new_memory = A × old_memory + B × new_input
A = how much to KEEP from before
B = how much the new stuff matters
Then I output through C."
Even simpler: “Blend old memory with new input. Output the result.”
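Here's Mamba's tamed version as a toy Python sketch. The names `keep`, `add`, and `out` are mine, and real Mamba makes those weights depend on the current token; this only shows the shape of the update:

```python
def mamba_style_scan(xs, keep=0.8, add=0.2, out=1.0):
    """Toy Mamba-flavored recurrence: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t."""
    h = 0.0
    ys = []
    for x in xs:
        h = keep * h + add * x   # blend old memory (A) with new input (B)
        ys.append(out * h)       # read the memory out through C
    return ys

print(mamba_style_scan([1.0, 2.0, 3.0]))
```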
△ Delta Says
"I have a memory (S).
Every step, I update it:
new_memory = old_memory + β × (new_stuff - old_memory)
β = how much to update (0 to 1)
(new_stuff - old_memory) = THE DELTA (what's different!)
Then I answer queries with q."
Even simpler: “Figure out what’s NEW. Move toward it a little bit.”
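And here's Delta's move as a toy NumPy sketch. One honest caveat: the tamed version above nudges the whole memory toward the new stuff, while the full rune only applies the correction along the current key k_t (it edits the one slot that key addresses). Function names are mine, not from the paper:

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule update: S_t = S_{t-1} + beta * (v - S k) k^T."""
    prediction = S @ k                      # what memory currently says about key k
    error = v - prediction                  # THE DELTA: what's different
    return S + beta * np.outer(error, k)    # nudge memory toward v, along k only

def delta_rule_read(S, q):
    """Answer a query: o_t = S_t q_t."""
    return S @ q

d = 4
S = np.zeros((d, d))
k = v = q = np.ones(d) / 2
S = delta_rule_step(S, k, v, beta=0.5)
print(delta_rule_read(S, q))
```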
The Cousin Revelation
Now watch this:
MAMBA: new = A × old + B × new_input
"blend these two things"
DELTA: new = old + β × (target - old)
"move old toward target"
Rewrite Mamba slightly (treat A and B as blend weights and normalize them so they sum to 1):
MAMBA: new = (1-α) × old + α × new_input
       where α = B/(A+B)
DELTA: new = (1-β) × old + β × new_stuff
THEY’RE THE SAME FORMULA.
+--------------------------------------------+
|                                            |
|  Both are just:                            |
|                                            |
|  new = (keep_this_much × old)              |
|      + (add_this_much × new_stuff)         |
|                                            |
|  Different names. Same idea.               |
|  THEY'RE COUSINS.                          |
|                                            |
+--------------------------------------------+
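Don't take the box's word for it. Here's a quick numeric check (toy scalar memory, blend weights normalized to sum to 1, so A = 1-α and B = α): both update rules trace out exactly the same states:

```python
def mamba_update(old, x, A, B):
    return A * old + B * x            # "blend these two things"

def delta_update(old, x, beta):
    return old + beta * (x - old)     # "move old toward target"

alpha = beta = 0.2
h = s = 0.0
for x in [1.0, 4.0, 2.0]:
    h = mamba_update(h, x, A=1 - alpha, B=alpha)
    s = delta_update(s, x, beta)
    print(h, s)                       # identical at every step
```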
The Family Tree
                 RECURRENCE
          "update memory each step"
                     |
         +-----------+-----------+
         |                       |
     RNN/LSTM               STATE SPACE
    (old school)            (new school)
                                 |
                     +-----------+-----------+
                     |                       |
                 🐍 MAMBA               △ DELTA
              "blend formula"       "delta formula"
                     |                       |
                     +-----------+-----------+
                                 |
                          SAME ANCESTOR:
                  "fixed memory, linear compute"
What They Share (The Family Traits)
BOTH beasts have:
- O(n) compute → no quadratic curse!
- Fixed memory size → bounded, predictable
- Content-aware gate → they LEARN what matters
- Selective updates → "bouncer" logic
- Linear scaling → 1M tokens? No problem
Different skins.
Different variable names.
SAME FAMILY.
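Want to feel the "fixed memory, linear compute" trait directly? A toy comparison (all names and sizes are mine): the recurrent beasts carry one fixed-size state no matter how many tokens stream past, while a vanilla-attention KV cache grows with every token:

```python
import numpy as np

def recurrent_state_bytes(num_tokens, d=16):
    """Mamba/Delta style: one d x d state, updated in place. O(1) memory, O(n) compute."""
    S = np.zeros((d, d))
    for _ in range(num_tokens):
        k = np.random.randn(d)
        k /= np.linalg.norm(k)               # normalized key keeps the toy update stable
        v = np.random.randn(d)
        S += 0.1 * np.outer(v - S @ k, k)    # same-size state, every single step
    return S.nbytes

def kv_cache_bytes(num_tokens, d=16):
    """Vanilla attention: keep every key and value around. Memory grows with n."""
    return 2 * num_tokens * d * 8            # float64 bytes

for n in (100, 10_000):
    print(n, recurrent_state_bytes(n), kv_cache_bytes(n))
```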
The Everyday Analogy
Think of updating your phone:
MAMBA APPROACH:
- Take 80% of old apps
- Add 20% new apps
- Blend = new phone state
- "Keep most, add some"
DELTA APPROACH:
- Look at difference between old and target
- Move 20% toward target
- Result = new phone state
- "Move toward what I want"
SAME RESULT. Different mental model.
Why This Matters
HUNTER'S INSIGHT:
If you understand ONE of these beasts,
you understand BOTH.
- Learn Mamba first (more tutorials exist)
- When Delta drops fully, you're ready
- Same concepts, different notation
- Master the family, not just one member
DON'T:
- Wait for "the winner" to emerge
- Learn them as separate things
- Get confused by different notation
DO:
- See the family resemblance
- Learn the shared concepts
- Adapt to either instantly
Practical Status
🐍 MAMBA - Ready Now:
- Candle (Rust) ✅
- llama.cpp ✅
- HuggingFace ✅
- PyTorch native ✅
- GO USE IT
△ DELTA - Coming Soon:
- Kimi API (proprietary) ✅
- llama.cpp Issue #16930 (in progress)
- Open weights: kimi-k2 (2025)
- Community: catching up
- LEARN MAMBA, BE READY FOR DELTA
TL;DR
| Aspect | 🐍 Mamba | △ Delta |
|---|---|---|
| Core idea | Blend old + new | Add the difference |
| Gate name | Δ (discretization step) | β (update rate) |
| Memory | h (hidden state) | S (state matrix) |
| Complexity | O(n) | O(n) |
| Memory usage | Fixed | Fixed |
| Family | Linear attention | Linear attention |
| Usable now? | YES | Coming soon |
Key insight: Learn one, understand both. They’re cousins.
You Survived!
You now understand the linear attention family better than most researchers who only read one paper.
The beasts looked different because:
- Different authors
- Different notation conventions
- Different marketing
But now you see:
- Same ancestor (recurrence)
- Same goal (fixed memory, linear compute)
- Same mechanism (gated state update)
The beasts are family. Nobody told you.
rune.みんな ᚦ bonk - we bonk the scary math so you don’t have to