uwu-rwkv7-101

2026-01-29 14:30 967 words 5 min read

RWKV-7 "Goose" explained like a video game - your notebook that can use an eraser. Linear attention in Rust.

Reading time: 10 min
Prerequisites: Know what a save file is uwu
Survival rate: 100% (the goose is friendly)


The Problem (Why You Should Care)

Imagine you’re playing an RPG and the game tracks EVERY input you’ve ever made.

TRANSFORMER = that annoying save system:
- saves EVERY button press
- save file grows FOREVER
- 1M inputs = your SSD dies
- checking old saves = slower and slower

RWKV = normal save file:
- fixed size notebook
- overwrites old stuff
- 1M inputs = same size as 1
- constant speed forever

but v7 has something special…


The Video Game Analogy

v6 = Notebook with Pen Only

YOUR INVENTORY NOTEBOOK (v6):

  ┌─────────────────────────┐
  │ sword    [fade...fade]  │  ← old notes fade
  │ potion   [fading...]    │
  │ NEW: helmet             │  ← can only ADD
  └─────────────────────────┘

Problem: wrote "10 potions" but you used 3
Can't fix it! It just fades away eventually.

v7 = Notebook with Pen AND Eraser

YOUR INVENTORY NOTEBOOK (v7):

  ┌─────────────────────────┐
  │ sword    [fade...fade]  │  ← old notes fade
  │ potion: 10 → 7          │  ← CAN EDIT!
  │ NEW: helmet             │  ← can add
  └─────────────────────────┘

The eraser = "delta rule"
You can CORRECT mistakes, not just wait for them to fade!

Even Simpler: Git Analogy

v6 = append-only log
  - git add, git add, git add
  - can't amend
  - old stuff just gets buried

v7 = full git
  - git add
  - git commit --amend  ← THIS IS THE DELTA RULE
  - git add

v7 can REWRITE history (safely)

The Delta Rule = Game Trainer

you know those game trainers that modify memory?

NORMAL GAME:
- state = whatever happened

GAME WITH TRAINER:
- state = whatever happened
- trainer: "wait HP is wrong, fixing..."
- state = corrected value

RWKV-7:
- memory = fade old stuff + ADD new stuff
- delta rule: "wait that's wrong, fixing..."
- memory = corrected value

the model is basically running cheat engine on itself during inference uwu


The Formula (Don’t Panic)

OLD (v6):
notebook = fade × old_page + new_note

NEW (v7):
notebook = fade × old_page + ERASER(old_page) + new_note
            ↑                 ↑                   ↑
         keep some         fix mistakes        add new

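that middle part (ERASER) is basically one tiny training step on the notebook, done while the model is running: look at what the page says now, compare it to what it should say, nudge it toward the right answer.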

like if your game learned your playstyle mid-session and adjusted.
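here's the potion example from earlier as a minimal sketch. it uses the classic textbook delta rule on a tiny two-slot page, which is an assumption for illustration: RWKV-7's real update is a generalized version with learned gates, and the names `Page`, `read`, `write_additive` and `write_delta` are made up for this post.

```rust
/// One page of the notebook: a single value spread across two key slots.
type Page = [f32; 2];

/// Look something up: dot product of the page with the key.
fn read(page: &Page, key: &[f32; 2]) -> f32 {
    page[0] * key[0] + page[1] * key[1]
}

/// Pen only (v6 style): the new note is added on top of whatever is there.
fn write_additive(page: &mut Page, key: &[f32; 2], value: f32) {
    for j in 0..2 {
        page[j] += value * key[j];
    }
}

/// Pen and eraser (classic delta rule): read what the page currently says
/// for this key, then write only the difference to the new value.
fn write_delta(page: &mut Page, key: &[f32; 2], value: f32) {
    let current = read(page, key);
    for j in 0..2 {
        page[j] += (value - current) * key[j];
    }
}
```

with a unit key like `[1.0, 0.0]` for "potion", writing 10 and then 7 with `write_additive` leaves the page reading 17 (both notes stacked), while `write_delta` leaves it reading 7.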


Why O(n) vs O(n²) Matters

Transformer = Looking at EVERYONE’s inventory

Party of 4: check 4 × 4 = 16 comparisons
Party of 100: check 100 × 100 = 10,000 comparisons
Party of 1M: check 1M × 1M = 💀💀💀

"Hey does anyone have a potion?"
*checks literally everyone, every time*

RWKV = Just Check Your Own Notebook

Party of 4: check 1 notebook = 1 lookup
Party of 100: check 1 notebook = 1 lookup
Party of 1M: check 1 notebook = 1 lookup

"Do I have a potion?"
*checks personal notebook*
"yeah it says 7 here"

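the party math above, as a tiny sketch. the function names are made up, and it counts the plain n × n case like the list does (causal masking roughly halves it, but it still grows with the square).

```rust
/// Pairwise attention: every token is compared against every token.
fn attention_comparisons(n: u64) -> u64 {
    n * n
}

/// Recurrent state: one fixed-size notebook update per token.
fn recurrent_updates(n: u64) -> u64 {
    n
}
```

so a 1M-token sequence is 1,000,000 notebook updates versus 1,000,000,000,000 pairwise comparisons.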
Real Numbers That Matter

TESTED ON GPU:
- Model: RWKV-7 2.9B "Goose"
- Speed: 48 tokens/sec
- Memory: fixed state size
- Context: no hard length limit (the state never grows, though old details still fade)

TRANSFORMER EQUIVALENT:
- would need 48GB+ for same context
- slows down with longer input
- hard limit on context

The State = Your .git Folder

you don't see .git but it's THERE
tracking everything
fixed size (mostly)
contains full history (compressed)

RWKV state = same concept
- hidden tensor [B, H, K, V]
- fixed size always
- contains "compressed history"
- you don't see it but model uses it
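to make "fixed size always" concrete, here is a rough size calculator. everything in it is an assumption for illustration: f32 storage, one state per layer, a KV cache that stores keys and values for every past token, and function names invented for this post.

```rust
/// Bytes for one recurrent state of shape [B, H, K, V], stored as f32.
/// Note there is no sequence length anywhere in this formula.
fn state_bytes(b: usize, h: usize, k: usize, v: usize) -> usize {
    b * h * k * v * std::mem::size_of::<f32>()
}

/// Bytes for a transformer KV cache stored as f32:
/// one key and one value vector per past token, per layer.
fn kv_cache_bytes(layers: usize, seq_len: usize, d_model: usize) -> usize {
    2 * layers * seq_len * d_model * std::mem::size_of::<f32>()
}
```

with made-up shapes, `state_bytes(1, 32, 64, 64)` is 512 KiB no matter how long the input is, while `kv_cache_bytes(32, 1_000, 2_048)` is already 500 MiB and doubles every time the context doubles.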

v_first = Cross-Save Data

some games let later save files access early game data:

LAYER 0 (early game):
  v_first = "remembered the tutorial sword"

LAYER 15 (late game):
  v = current_value + gate × (v_first - current_value)
  (gate = 0: keep the current value, gate = 1: use v_first)

  "hey remember that tutorial sword?
   it's relevant again for this boss"

this gives deep layers a direct line back to what layer 0 saw, so early info doesn't get lost on the way up the stack
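the blend is just a per-channel lerp. a minimal sketch, assuming the gate is already a number between 0 and 1 (in the real model it comes out of a learned, sigmoid-squashed projection; `blend_v_first` is a name invented here):

```rust
/// Per-channel blend between this layer's value and layer 0's value.
/// gate = 0 keeps the current value, gate = 1 swaps in v_first.
fn blend_v_first(v: &[f32], v_first: &[f32], gate: &[f32]) -> Vec<f32> {
    v.iter()
        .zip(v_first)
        .zip(gate)
        .map(|((&cur, &first), &g)| cur + g * (first - cur))
        .collect()
}
```

for example, blending `[1.0, 2.0]` with `v_first = [5.0, 6.0]` and gates `[0.0, 1.0]` gives `[1.0, 6.0]`: the first channel ignores the tutorial sword, the second one grabs it.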


The Implementation

// the actual loop is just:
for t in 0..seq_len {
    // 1. Fade (things get less important over time)
    state = decay * state;

    // 2. Delta rule (FIX mistakes)
    state = state + correction_term;  // ← THE MAGIC

    // 3. Add new (write new notes)
    state = state + new_key_value;

    // 4. Read (what do we know?)
    output[t] = read_from(state);
}

that’s it. that’s the whole thing uwu
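if you want to actually run those four steps, here is a toy version in plain Rust with a 2×2 state. it is a sketch under loud assumptions: the eraser is the simple "wipe whatever is stored under this key" delta rule, not RWKV-7's gated version, and `Step`, `matvec` and `run` are names made up for this post, not the Candle API.

```rust
const D: usize = 2;
type State = [[f32; D]; D]; // rows = value channels, columns = key channels

/// Everything the loop needs for one token (hypothetical layout).
struct Step {
    decay: [f32; D], // per key channel, in (0, 1]
    key: [f32; D],
    value: [f32; D],
    query: [f32; D],
}

/// state × x: what the notebook says for key or query `x`.
fn matvec(s: &State, x: &[f32; D]) -> [f32; D] {
    let mut out = [0.0; D];
    for i in 0..D {
        for j in 0..D {
            out[i] += s[i][j] * x[j];
        }
    }
    out
}

fn run(steps: &[Step]) -> Vec<[f32; D]> {
    let mut state: State = [[0.0; D]; D];
    let mut outputs = Vec::with_capacity(steps.len());
    for step in steps {
        // 1. Fade: scale each key column by its decay.
        for i in 0..D {
            for j in 0..D {
                state[i][j] *= step.decay[j];
            }
        }
        // 2. Delta rule: erase whatever is currently stored under this key.
        let stale = matvec(&state, &step.key);
        for i in 0..D {
            for j in 0..D {
                state[i][j] -= stale[i] * step.key[j];
            }
        }
        // 3. Add new: write value ⊗ key.
        for i in 0..D {
            for j in 0..D {
                state[i][j] += step.value[i] * step.key[j];
            }
        }
        // 4. Read: query the updated state.
        outputs.push(matvec(&state, &step.query));
    }
    outputs
}
```

feed it "potion = 10" and then "potion = 7" under the same unit key with decay 1.0, and querying that key reads 10 after the first step and 7 after the second. the state stays 2×2 the whole time.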


The Decay Formula

DECAY = how fast old stuff fades

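too fast (e.g. 0.1 everywhere): forgets EVERYTHING
too slow (stuck at 1.0 everywhere): never lets go of GARBAGE

v7 range: 0.545 to 1.0, learned per channel
- bounded decay
- no channel can explode (never above 1.0) or get wiped in one step (never below 0.545)
- goldilocks zone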

like a game that auto-clears old quest markers but not too aggressively
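where does 0.545 come from? as far as I understand the reference formulation, the raw decay parameter goes through a sigmoid and then `exp(-e^(-0.5) × sigmoid(raw))`. treat that formula as an assumption on my part rather than something read off this post's code; `bounded_decay` is a made-up name.

```rust
/// Squash an unbounded raw decay parameter into roughly (0.545, 1.0).
/// Assumed form: decay = exp(-e^(-0.5) * sigmoid(raw)).
fn bounded_decay(raw: f32) -> f32 {
    let sigmoid = 1.0 / (1.0 + (-raw).exp());
    let max_rate = (-0.5f32).exp(); // about 0.6065
    (-max_rate * sigmoid).exp()
}
```

a very negative `raw` pushes the decay toward 1.0 (keep almost everything), a very positive one pushes it toward exp(-0.6065) ≈ 0.545 (fade as fast as allowed), and nothing can leave that band.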


TL;DR

Thing        Analogy
State        Your notebook / .git folder
Decay        Old notes fading
Delta rule   Eraser / git amend
v_first      Cross-save data
O(n)         Just check your notebook
O(n²)        Ask everyone at the party

What We’re Releasing

RWKV-7 "Goose" for Candle (Rust):
- Full implementation
- GPU accelerated
- 48 tok/s tested
- Works with HuggingFace weights

Files:
- candle-transformers/src/models/rwkv_v7.rs
- candle-examples/examples/rwkv7/

Run it:
cargo run --example rwkv7 --release --features cuda -- \
  --model /path/to/rwkv7 \
  --prompt "The meaning of life is"

You Survived!

now you understand:

  • state = notebook (fixed size, tracks everything)
  • delta rule = eraser (can fix mistakes)
  • O(n) = just check yourself (not everyone else)
  • v_first = cross-save data (early info for later)

the goose flies in Rust uwu



rune.みんな

© 2024 - 2026 rune.みんな