Reading time: 8 min
Prerequisites: Know what a photo is uwu
Survival rate: 100% (the math is just rounding)
The Problem (Why You Should Care)
Your model is too thicc for your GPU.
THE PROBLEM:
- Model: 7B parameters
- Each param: 32 bits = 4 bytes
- Total: 7B × 4 = 28 GB
YOUR GPU:
- VRAM: 8 GB
- Model: won't fit
- You: sad
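if you want to sanity-check that arithmetic yourself, here's a toy Rust snippet (illustrative only, not part of the release):

```rust
// Toy sanity check of the memory math above.
fn model_bytes(params: u64, bits_per_param: u64) -> u64 {
    params * bits_per_param / 8
}

fn main() {
    // 7B params × 32 bits = 28 GB. Your GPU: 8 GB. You: sad.
    let gb = model_bytes(7_000_000_000, 32) as f64 / 1e9;
    println!("model: {gb} GB, your GPU: 8 GB"); // model: 28 GB, your GPU: 8 GB
}
```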
what if we made each parameter smaller?
The JPEG Analogy
you know how JPEG works?
ORIGINAL PHOTO:
- 24 bits per pixel
- 10 megapixel = 30 MB
- perfect quality
JPEG PHOTO:
- ~2 bits per pixel (effectively)
- 10 megapixel ≈ 2.5 MB
- looks basically the same
JPEG = throw away stuff your eyes won't notice
NF4 = JPEG for neural networks
ORIGINAL WEIGHTS:
- 32 bits per weight
- 7B weights = 28 GB
- perfect precision
NF4 WEIGHTS:
- 4 bits per weight
- 7B weights = 3.5 GB
- works basically the same
NF4 = throw away precision your model won't notice
Why It Works: The Bell Curve
neural network weights aren’t random. they follow a pattern:
WEIGHT DISTRIBUTION:
▲ lots of weights
│       ╭───╮
│     ╭─╯   ╰─╮
│    ╭╯       ╰╮
│   ╭╯         ╰╮
│ ╭╯             ╰╮
└──────────────────▶
 -1       0       +1
Most weights are near ZERO
Few weights are at extremes
this is called a “normal distribution” or bell curve
The Photo Album Analogy
imagine you’re organizing photos:
BAD ORGANIZATION:
- 16 folders, equally spaced
- folders: -1.0, -0.875, -0.75, ... +1.0
- most folders EMPTY (no photos that extreme)
- waste of folders!
GOOD ORGANIZATION (NF4):
- 16 folders, placed WHERE THE PHOTOS ARE
- lots of folders near center (where photos cluster)
- few folders at edges (rare photos)
- all folders used equally!
NF4 puts its “folders” (quantization levels) where the weights actually are.
The 16 Magic Values
NF4 LOOKUP TABLE:
(where we're allowed to round to)
NEGATIVE SIDE:          POSITIVE SIDE:
-1.0000 (extreme)       +0.0796
-0.6962                 +0.1609
-0.5251                 +0.2461
-0.3949                 +0.3379
-0.2844                 +0.4407
-0.1848                 +0.5626
-0.0911                 +0.7230
 0.0000 (center)        +1.0000 (extreme)
Notice: more values clustered near zero!
these aren’t random. they’re chosen so that roughly 6.25% of normally distributed weights (1 in 16) fall into each bin.
How It Works: 3 Steps
Step 1: Find Your Range (Like Auto-Brightness)
your camera auto-adjusts brightness, right?
"this scene ranges from shadow to highlight"
NF4 does the same:
for each chunk of 64 weights:
range = biggest absolute value
"these 64 weights range from -0.5 to +0.5"
Step 2: Normalize (Scale to Standard Range)
like converting miles to kilometers
put everything on the same scale
normalized = weight / range
now all weights are between -1 and +1
Step 3: Round to Nearest Folder
original: 0.27
nearest folder: 0.2461 (index 10)
store: just the number 10 (4 bits!)
4 bits = 16 possible values = perfect
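the 3 steps above can be sketched in a few lines of Rust. this is illustrative only — the real kernel batches blocks of 64 and packs the indices, and `quantize_block` is our name for the sketch, not the release's API:

```rust
// A minimal sketch of the 3-step block quantization described above.
// `quantize_block` is an illustrative name, not the release's API.

const NF4_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
];

/// Quantize one block of weights: returns (4-bit indices, scale).
fn quantize_block(weights: &[f32]) -> (Vec<u8>, f32) {
    // Step 1: range = biggest absolute value in the block
    let scale = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let indices = weights
        .iter()
        .map(|&w| {
            // Step 2: normalize into [-1, +1]
            let normalized = if scale > 0.0 { w / scale } else { 0.0 };
            // Step 3: round to the nearest "folder" in the table
            let mut best = 0u8;
            let mut best_err = f32::INFINITY;
            for (i, &level) in NF4_TABLE.iter().enumerate() {
                let err = (normalized - level).abs();
                if err < best_err {
                    best_err = err;
                    best = i as u8;
                }
            }
            best
        })
        .collect();
    (indices, scale)
}

fn main() {
    let (indices, scale) = quantize_block(&[1.0, 0.27, -0.5, 0.0]);
    // 0.27 rounds to 0.2461 → index 10, exactly as in the example above
    println!("indices = {indices:?}, scale = {scale}"); // indices = [15, 10, 2, 7], scale = 1
}
```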
The Storage Trick: Packing
two 4-bit values fit in one byte!
NORMAL:
value1 = 1010 (4 bits)
value2 = 0011 (4 bits)
stored separately = 2 bytes
PACKED:
byte = [1010][0011] = one byte!
stored together = 1 byte
50% extra savings just from packing uwu
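the packing itself is just bit-shifts. a tiny sketch (our helper names, not the release's API), using the exact nibbles from the example above:

```rust
/// Pack two 4-bit indices into one byte (high nibble first).
fn pack(hi: u8, lo: u8) -> u8 {
    debug_assert!(hi < 16 && lo < 16);
    (hi << 4) | lo
}

/// Unpack one byte back into its two 4-bit indices.
fn unpack(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0x0F)
}

fn main() {
    let byte = pack(0b1010, 0b0011); // [1010][0011] = one byte
    println!("{byte:#010b}"); // 0b10100011
    assert_eq!(unpack(byte), (0b1010, 0b0011)); // round-trips losslessly
}
```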
The Inventory Analogy
NORMAL INVENTORY:
- slot 1: "Iron Sword +5 of Fire"
- slot 2: "Healing Potion (Greater)"
- each slot = full description = lots of space
NF4 INVENTORY:
- slot 1: "item #7"
- slot 2: "item #12"
- lookup table knows what #7 and #12 mean
- each slot = just a number = tiny!
you're storing REFERENCES, not full items
Real Numbers
BEFORE NF4:
- Qwen3-4B: 8 GB (BF16)
- Load time: 4 seconds
- Your 8GB GPU: barely fits
AFTER NF4:
- Qwen3-4B: 2 GB (NF4)
- Load time: 1 second
- Your 8GB GPU: runs easy with room to spare!
TESTED:
- 35 tokens/sec on GPU
- Answers correct (2+2=4, checked uwu)
- Thinking mode works
The Implementation
```rust
// The 16 sacred values
const NF4_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
];

// To use a weight:
fn dequantize(index: u8, scale: f32) -> f32 {
    NF4_TABLE[index as usize] * scale
}

// That's it. That's the whole thing.
```
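putting lookup and packing together, one packed byte turns back into two weights. `dequantize_byte` is our name for this sketch, not the release's API:

```rust
const NF4_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
];

fn dequantize(index: u8, scale: f32) -> f32 {
    NF4_TABLE[index as usize] * scale
}

/// One packed byte (high nibble first) → two weights, using the
/// block's scale. Illustrative helper, not the release's API.
fn dequantize_byte(byte: u8, scale: f32) -> (f32, f32) {
    (dequantize(byte >> 4, scale), dequantize(byte & 0x0F, scale))
}

fn main() {
    // nibbles 10 and 7 → table entries 0.2461 and 0.0, scaled by 0.5
    let (a, b) = dequantize_byte((10u8 << 4) | 7, 0.5);
    println!("{a} {b}");
}
```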
GPU Speedup
the real trick: keep quantized weights ON the GPU
BAD PATH:
CPU [compressed] → copy → GPU → decompress → use
                   ^^^^         ^^^^^^^^^^
                   slow         slow
GOOD PATH:
GPU [compressed already] → decompress → use
    ^^^^^^^^^^^^^^^^^^^^
    already there!
zero copies = maximum speed
TL;DR
| Thing | Analogy |
|---|---|
| NF4 | JPEG for AI weights |
| 16 values | Photo album folders |
| Bell curve | Where weights naturally cluster |
| Scale | Auto-brightness per chunk |
| Packing | Two items per slot |
| 8× smaller | Same quality (basically) |
What We’re Releasing
NF4 Quantization for Candle (Rust):
- Pure Rust CPU implementation
- CUDA kernel for GPU
- Zero-copy GPU storage
- Works with any safetensors model
Files:
- candle-core/src/nf4/
- candle-transformers/src/models/qwen3_generic.rs
- candle-examples/examples/nf4-qwen3-chat/
Run it:
cargo run --example nf4-qwen3-chat --release --features cuda -- \
--model /path/to/qwen3 \
--prompt "Hello!"
You Survived!
now you understand:
- NF4 = JPEG for AI (lossy but good enough)
- 16 values = smart folders (placed where weights are)
- bell curve = why it works (most weights near zero)
- packing = bonus savings (2 values per byte)
the model fits in VRAM now uwu
rune.みんな