Reading time: 10 min
Prerequisites: Know what character classes are uwu
Survival rate: 100% (your model stays intact)
The Problem (Why You Should Care)
You trained a model. It’s pretty good at coding. But now you want it to also be good at creative writing. And technical docs. And casual chat.
OPTION 1: Train 4 separate models
- 4 × 7B params = 4 × 14 GB (BF16) = 56 GB on disk
- load/unload each one? slow
- blend them? impossible
OPTION 2: Fine-tune ONE model on everything
- "jack of all trades, master of none"
- forgets old stuff when learning new stuff
- catastrophic forgetting is real
OPTION 3: LoRA MoE
- 1 base model + tiny expert add-ons
- blend at runtime with any weights
- switch personalities instantly
option 3 is the cheat code uwu
The RPG Class Analogy
imagine your character has a base class and equippable specializations:
BASE CLASS: Adventurer (7B params)
- knows general stuff
- decent at everything
- not exceptional at anything
SPECIALIZATIONS (LoRA experts):
┌──────────────┬─────────────┬─────────────┐
│ code_expert │ chat_expert │ docs_expert │
│ 32 MB │ 32 MB │ 32 MB │
│ +coding │ +friendly │ +technical │
│ +debugging │ +casual │ +formal │
└──────────────┴─────────────┴─────────────┘
You can BLEND them:
- 70% code + 30% chat = friendly coder
- 50% code + 50% docs = technical developer
- 100% chat = pure casual mode
The Math (Don’t Panic)
NORMAL FINE-TUNING:
weights_new = weights_old + big_update
- need to store entire model
- 7B params = 14GB (BF16)
LoRA FINE-TUNING:
weights_new = weights_old + (B × A) × scale
                             ↑
                      tiny matrices!
A: (rank × in_features)  = 32 × 4096 = 131k params
B: (out_features × rank) = 4096 × 32 = 131k params
Total: 262k params per layer
Full model: 7,000,000,000 params
LoRA adapter: 8,000,000 params (0.1%!)
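if you want to poke at those shapes, here's a tiny numpy sketch (rank 32, hidden 4096 as above; the scale value is a made-up alpha/rank, not from any real config):

```python
import numpy as np

rank, hidden = 32, 4096   # shapes from the example above

# PEFT-style factors: A is (rank, in_features), B is (out_features, rank)
A = np.random.randn(rank, hidden).astype(np.float32) * 0.01
B = np.zeros((hidden, rank), dtype=np.float32)   # standard LoRA init: B starts at zero
scale = 2.0                                      # alpha / rank (hypothetical alpha = 64)

delta_W = (B @ A) * scale    # (4096, 4096) -- same shape as the base weight it nudges

print(A.size + B.size)       # 262144 trainable params for this one module
print(delta_W.shape)         # (4096, 4096), but you never store this densely at train time
```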
The Guitar Pedal Analogy
YOUR GUITAR (base model):
- makes sound
- has its own tone
PEDAL BOARD (LoRA experts):
┌─────────┐ ┌─────────┐ ┌─────────┐
│OVERDRIVE│ │ DELAY │ │ CHORUS │
│ (code) │ │ (chat) │ │ (docs) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└─────┬─────┴─────┬─────┘
│ │
[BLEND] [OUTPUT]
Turn knobs to blend effects:
- overdrive 70%, delay 30% = crunchy with echo
- all three 33% each = full palette
- one pedal 100% = pure effect
the pedals don’t replace your guitar. they modify the signal.
How MoE Blending Works
FOR EACH LAYER, FOR EACH MODULE:

    # input tensor: x (shape: [batch, seq, hidden])
    base_output = base_module.forward(x)

    expert_deltas = []
    for expert, weight in zip(experts, weights):
        if weight > 0.001:                  # skip negligible experts
            delta = expert.forward(x)       # small LoRA forward
            expert_deltas.append(delta * weight)

    final_output = base_output + sum(expert_deltas)
that’s it. add the weighted deltas to the base output.
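here's the same idea as a self-contained numpy toy (not the eldr.ᚲ implementation, just the concept): one base matmul, plus two skinny matmuls per active expert, summed with the blend weights:

```python
import numpy as np

class LoraExpert:
    """One LoRA adapter for a single linear layer: delta(x) = (x @ A.T) @ B.T * scale."""
    def __init__(self, in_features, out_features, rank=32, scale=2.0):
        self.A = np.random.randn(rank, in_features).astype(np.float32) * 0.01
        self.B = np.random.randn(out_features, rank).astype(np.float32) * 0.01
        self.scale = scale

    def forward(self, x):
        # two skinny matmuls, never a dense (out x in) delta matrix
        return (x @ self.A.T) @ self.B.T * self.scale

def moe_linear(x, W_base, experts, weights, eps=1e-3):
    """Base linear output plus the weighted sum of per-expert LoRA deltas."""
    out = x @ W_base.T
    for expert, w in zip(experts, weights):
        if w > eps:                      # skip negligible experts
            out = out + w * expert.forward(x)
    return out

# toy usage: batch=1, seq=4, hidden=4096, three experts blended 0.7 / 0.3 / 0.0
hidden = 4096
x = np.random.randn(1, 4, hidden).astype(np.float32)
W_base = np.random.randn(hidden, hidden).astype(np.float32) * 0.01
experts = [LoraExpert(hidden, hidden) for _ in range(3)]
y = moe_linear(x, W_base, experts, weights=[0.7, 0.3, 0.0])
print(y.shape)   # (1, 4, 4096)
```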
The Inventory Slot Analogy
TRADITIONAL LOADOUT:
┌─────────────────────────────────────┐
│ SLOT 1: Warrior Build (full model) │ 14 GB
│ SLOT 2: Mage Build (full model) │ 14 GB
│ SLOT 3: Healer Build (full model) │ 14 GB
└─────────────────────────────────────┘
Total: 42 GB, can only use ONE at a time
MoE LOADOUT:
┌─────────────────────────────────────┐
│ BASE: Adventurer (always loaded) │ 14 GB
│ ├─ warrior.lora (equipped) │ 32 MB
│ ├─ mage.lora (equipped) │ 32 MB
│ └─ healer.lora (equipped) │ 32 MB
└─────────────────────────────────────┘
Total: ~14.1 GB, can use ALL at once!
blend = 60% warrior + 30% mage + 10% healer
= tanky battle-mage with light heals
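quick sanity check on the loadout math, using the same 14 GB / 32 MB numbers:

```python
base_gb = 14.0                        # one 7B model in BF16
adapter_gb = 32 / 1024                # one 32 MB LoRA expert

traditional = 3 * base_gb             # three full fine-tunes
moe = base_gb + 3 * adapter_gb        # one base + three equipped experts

print(traditional)                    # 42.0
print(round(moe, 2))                  # 14.09  (~14.1 GB)
```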
Real Numbers
TESTED ON QWEN3-4B + 4 EXPERTS:
BASE MODEL (no LoRA):
Speed: 67.50 tok/s
Output: generic
2 EXPERTS (32 MB each):
Speed: 44.52 tok/s
Memory: +64 MB
Output: blended style
4 EXPERTS (32 MB each):
Speed: 21.64 tok/s
Memory: +128 MB
Output: rich blend of all 4
tradeoff: more experts = richer blend but slower
(more matmuls per token)
The Visualization
SINGLE FINE-TUNED MODEL:
┌─────────────────────────────────┐
│ MODEL │
│ (fixed personality) │
│ │
│ input → [layers] → output │
└─────────────────────────────────┘
LoRA MoE MODEL:
┌─────────────────────────────────────────────┐
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │exp_1 │ │exp_2 │ │exp_3 │ │exp_4 │ │
│ │ 0.3 │ │ 0.3 │ │ 0.2 │ │ 0.2 │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │
│ └────┬────┴────┬────┴────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ input → [BASE] + [Σ deltas] → output │
│ │
└─────────────────────────────────────────────┘
TL;DR
| Thing | Traditional | LoRA MoE |
|---|---|---|
| Storage per style | 14 GB | 32 MB |
| Blend at runtime | impossible | trivial |
| Switch styles | reload model | change weights |
| Overhead | 0% | grows with expert count (see Real Numbers) |
| Memory | N × model size | 1 model + N × tiny |
The Catch (Honest Section)
Why isn't everyone using LoRA MoE?
1. Need to train separate LoRA experts first
2. More experts = slower inference
3. Routing logic adds complexity
4. Not all tasks benefit from blending
5. Quality depends on expert training
But when you need multiple personalities?
This is the way.
What We Ship
LoRA MoE in eldr.ᚲ:
- PeftLoRA loader (PEFT-format compatible)
- StaticRouter (fixed weights)
- WeightedRouter (adjustable)
- EncoderRouter (dynamic, input-based)
- Full Qwen3 integration
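rough idea of what each router does, sketched in Python (this is NOT eldr.ᚲ's actual API; names and signatures here are purely illustrative):

```python
import numpy as np

# StaticRouter idea: the blend is fixed once and never changes.
def static_router(fixed_weights):
    return lambda _inputs: fixed_weights

# WeightedRouter idea: same blend-everywhere behavior, but adjustable at runtime.
class WeightedRouter:
    def __init__(self, weights):
        self.weights = weights
    def set_weights(self, weights):
        self.weights = weights          # e.g. flip from [0.7, 0.3] to [0.3, 0.7] mid-session
    def route(self, _inputs):
        return self.weights

# EncoderRouter idea: score the input against each expert, softmax into blend weights.
def encoder_router(input_embedding, expert_embeddings):
    scores = expert_embeddings @ input_embedding     # one similarity per expert
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                           # weights sum to 1
```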
Run it:
cargo run --example lora-moe --release --features cuda -- \
--model /path/to/qwen3 \
--experts /path/to/expert1,/path/to/expert2 \
--weights 0.7,0.3 \
--prompt "Hello!"
You Survived!
now you understand:
- LoRA = tiny add-on (0.1% of model size)
- MoE = blend multiple add-ons (weighted sum)
- runtime switching (no model reload)
- it’s like RPG class specs (equip, blend, go)
your model has multiple personalities now uwu
rune.みんな