uwu-nf4-quant-101

2026-01-29 14:00 963 words 5 min read

NF4 quantization explained like JPEG for AI - shrink 8x, keep the vibe. Now in Rust with CUDA.

Reading time: 5 min · Prerequisites: Know what a photo is uwu · Survival rate: 100% (the math is just rounding)


The Problem (Why You Should Care)

Your model is too thicc for your GPU.

THE PROBLEM:
- Model: 7B parameters
- Each param: 32 bits = 4 bytes
- Total: 7B × 4 = 28 GB

YOUR GPU:
- VRAM: 8 GB
- Model: won't fit
- You: sad

what if we made each parameter smaller?


The JPEG Analogy

you know how JPEG works?

ORIGINAL PHOTO:
- 24 bits per pixel
- 10 megapixel = 30 MB
- perfect quality

JPEG PHOTO:
- ~2 bits per pixel (effectively)
- 10 megapixel = 2 MB
- looks basically the same

JPEG = throw away stuff your eyes won't notice

NF4 = JPEG for neural networks

ORIGINAL WEIGHTS:
- 32 bits per weight
- 7B weights = 28 GB
- perfect precision

NF4 WEIGHTS:
- 4 bits per weight
- 7B weights = 3.5 GB
- works basically the same

NF4 = throw away precision your model won't notice

Why It Works: The Bell Curve

neural network weights aren’t random. they follow a pattern:

WEIGHT DISTRIBUTION:

              ▲ lots of weights
              │     ╭───╮
              │   ╭─╯   ╰─╮
              │  ╭╯       ╰╮
              │ ╭╯         ╰╮
              │╭╯           ╰╮
              └──────────────────▶
             -1     0     +1

Most weights are near ZERO
Few weights are at extremes

this is called a “normal distribution” or bell curve
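wanna see the pile-up yourself? here's a tiny self-contained toy (mine, not from the release): fake 100k "weights" with Box-Muller over a cheap xorshift PRNG, bucket them, and watch the middle dominate.

// sketch: most bell-curve samples land near zero (no crates needed)
fn main() {
    // xorshift64: a quick-and-dirty uniform in [0, 1)
    let mut state: u64 = 0x9E3779B97F4A7C15;
    let mut uniform = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        (state >> 11) as f64 / (1u64 << 53) as f64
    };
    let mut bins = [0u32; 8]; // 8 buckets spanning [-1, +1)
    for _ in 0..100_000 {
        // Box-Muller: two uniforms -> one normal sample (std dev 0.25 here)
        let (u1, u2) = (uniform().max(1e-12), uniform());
        let w = 0.25 * (-2.0 * u1.ln()).sqrt() * (std::f64::consts::TAU * u2).cos();
        let b = (((w + 1.0) / 2.0 * 8.0) as usize).min(7);
        bins[b] += 1;
    }
    println!("{bins:?}"); // the two middle buckets dominate: weights hug zero
}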


The Photo Album Analogy

imagine you’re organizing photos:

BAD ORGANIZATION:
- 16 folders, equally spaced
- folders: -1.0, -0.875, -0.75, ... +0.875 (16 equal steps)
- most folders EMPTY (no photos that extreme)
- waste of folders!

GOOD ORGANIZATION (NF4):
- 16 folders, placed WHERE THE PHOTOS ARE
- lots of folders near center (where photos cluster)
- few folders at edges (rare photos)
- all folders used equally!

NF4 puts its “folders” (quantization levels) where the weights actually are.


The 16 Magic Values

NF4 LOOKUP TABLE:
(where we're allowed to round to)

NEGATIVE SIDE:          POSITIVE SIDE:
-1.0000 (extreme)       +0.0796
-0.6962                 +0.1609
-0.5251                 +0.2461
-0.3949                 +0.3379
-0.2844                 +0.4407
-0.1848                 +0.5626
-0.0911                 +0.7230
 0.0000 (center)        +1.0000 (extreme)

Notice: more values clustered near zero!

these aren't random. they're quantiles of the bell curve, scaled into [-1, +1], so each of the 16 bins catches roughly the same share of typical weights (~6.25% each).


How It Works: 3 Steps

Step 1: Find Your Range (Like Auto-Brightness)

your camera auto-adjusts brightness, right?
"this scene ranges from shadow to highlight"

NF4 does the same:
for each chunk of 64 weights:
  range = biggest absolute value

"these 64 weights range from -0.5 to +0.5"

Step 2: Normalize (Scale to Standard Range)

like converting miles to kilometers:
put everything on the same scale

normalized = weight / range

now all weights are between -1 and +1

Step 3: Round to Nearest Folder

original: 0.27
nearest folder: 0.2461 (index 10)
store: just the number 10 (4 bits!)

4 bits = 16 possible values = perfect
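
all three steps, one tiny sketch in plain Rust (my own illustration, not the released code; the lookup table is the same one shown in the implementation section below):

// sketch: quantize one 64-weight chunk (steps 1-3)
const NF4_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
];

fn quantize_chunk(chunk: &[f32; 64]) -> (f32, [u8; 64]) {
    // step 1: find the range (auto-brightness): the biggest absolute value
    let scale = chunk.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let mut indices = [0u8; 64];
    for (i, &w) in chunk.iter().enumerate() {
        // step 2: normalize into [-1, +1]
        let x = if scale > 0.0 { w / scale } else { 0.0 };
        // step 3: round to the nearest folder
        let mut best = 0u8;
        let mut best_dist = f32::INFINITY;
        for (j, &v) in NF4_TABLE.iter().enumerate() {
            let dist = (v - x).abs();
            if dist < best_dist {
                best_dist = dist;
                best = j as u8;
            }
        }
        indices[i] = best;
    }
    (scale, indices) // store: one f32 scale + 64 four-bit indices per chunk
}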

The Storage Trick: Packing

two 4-bit values fit in one byte!

NORMAL:
value1 = 1010  (4 bits)
value2 = 0011  (4 bits)
stored separately = 2 bytes

PACKED:
byte = [1010][0011] = one byte!
stored together = 1 byte

50% extra savings just from packing uwu
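
in code, packing is just two shifts and a mask (a tiny sketch of my own, matching the diagram above):

// sketch: two 4-bit indices, one byte
fn pack(hi: u8, lo: u8) -> u8 {
    ((hi & 0x0F) << 4) | (lo & 0x0F) // [hi nibble][lo nibble]
}

fn unpack(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0x0F)
}

// pack(0b1010, 0b0011) == 0b1010_0011: exactly the byte in the diagram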

The Inventory Analogy

NORMAL INVENTORY:
- slot 1: "Iron Sword +5 of Fire"
- slot 2: "Healing Potion (Greater)"
- each slot = full description = lots of space

NF4 INVENTORY:
- slot 1: "item #7"
- slot 2: "item #12"
- lookup table knows what #7 and #12 mean
- each slot = just a number = tiny!

you're storing REFERENCES, not full items

Real Numbers

BEFORE NF4:
- Qwen3-4B: 8 GB (BF16)
- Load time: 4 seconds
- Your 8GB GPU: barely fits

AFTER NF4:
- Qwen3-4B: 2 GB (NF4)
- Load time: 1 second
- Your 8GB GPU: runs easy with room to spare!

TESTED:
- 35 tokens/sec on GPU
- Answers correct (2+2=4, checked uwu)
- Thinking mode works

The Implementation

// The 16 sacred values
const NF4_TABLE: [f32; 16] = [
    -1.0, -0.6962, -0.5251, -0.3949,
    -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379,
    0.4407, 0.5626, 0.7230, 1.0,
];

// To use a weight: look up its 4-bit index, re-apply the chunk's scale
fn dequantize(index: u8, scale: f32) -> f32 {
    NF4_TABLE[index as usize] * scale
}

// That's it. That's the whole thing.
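
and to walk a whole packed chunk, glue the two ideas together (my own sketch around the fn above; nibble order here is an assumption, check the released code for the real layout):

// sketch: unpack + dequantize one 64-weight chunk (32 packed bytes)
fn dequantize_chunk(packed: &[u8; 32], scale: f32) -> [f32; 64] {
    let mut out = [0.0f32; 64];
    for (i, &byte) in packed.iter().enumerate() {
        out[2 * i] = dequantize(byte >> 4, scale);       // high nibble (assumed first)
        out[2 * i + 1] = dequantize(byte & 0x0F, scale); // low nibble
    }
    out
}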

GPU Speedup

the real trick: keep quantized weights ON the GPU

BAD PATH:
CPU [compressed] → copy → GPU → decompress → use
                   ^^^^         ^^^^^^^^^^
                   slow         slow

GOOD PATH:
GPU [compressed already] → decompress → use
     ^^^^^^^^^^^^^^^^^^^
     already there!

zero copies = maximum speed
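
in Candle terms, "already there" just means the packed bytes live in a U8 tensor on the CUDA device (a sketch assuming candle-core built with the cuda feature; upload_packed is a hypothetical helper name, not from the released code):

use candle_core::{Device, Result, Tensor};

// sketch: one host-to-device copy at load time, then the bytes never move
fn upload_packed(packed: Vec<u8>) -> Result<Tensor> {
    let device = Device::new_cuda(0)?; // the only copy that ever happens
    let n = packed.len();
    Tensor::from_vec(packed, n, &device) // U8 tensor, resident in VRAM
    // the custom CUDA kernel later expands nibble -> table value -> scaled
    // float entirely on-chip, so nothing crosses PCIe per token
}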

TL;DR

Thing        Analogy
NF4          JPEG for AI weights
16 values    Photo album folders
Bell curve   Where weights naturally cluster
Scale        Auto-brightness per chunk
Packing      Two items per slot
8× smaller   Same quality (basically)

What We’re Releasing

NF4 Quantization for Candle (Rust):
- Pure Rust CPU implementation
- CUDA kernel for GPU
- Zero-copy GPU storage
- Works with any safetensors model

Files:
- candle-core/src/nf4/
- candle-transformers/src/models/qwen3_generic.rs
- candle-examples/examples/nf4-qwen3-chat/

Run it:
cargo run --example nf4-qwen3-chat --release --features cuda -- \
  --model /path/to/qwen3 \
  --prompt "Hello!"

You Survived!

now you understand:

  • NF4 = JPEG for AI (lossy but good enough)
  • 16 values = smart folders (placed where weights are)
  • bell curve = why it works (most weights near zero)
  • packing = bonus savings (2 values per byte)

the model fits in VRAM now uwu


