Reading time: 10 min
Prerequisites: Know what character classes are uwu
Survival rate: 100% (your model stays intact)
The Problem (Why You Should Care)
You trained a model. It’s pretty good at coding. But now you want it to also be good at creative writing. And technical docs. And casual chat.
OPTION 1: Train 4 separate models
- 4 × 7B params = 4 × 14 GB (BF16) = 56 GB on disk
- load/unload each one? slow
- blend them? impossible
OPTION 2: Fine-tune ONE model on everything
- "jack of all trades, master of none"
- forgets old stuff when learning new stuff
- catastrophic forgetting is real
OPTION 3: LoRA MoE
- 1 base model + tiny expert add-ons
- blend at runtime with any weights
- switch personalities instantly
option 3 is the cheat code uwu
The RPG Class Analogy
imagine your character has a base class and equippable specializations:
BASE CLASS: Adventurer (7B params)
- knows general stuff
- decent at everything
- not exceptional at anything
SPECIALIZATIONS (LoRA experts):
┌──────────────┬─────────────┬─────────────┐
│ code_expert │ chat_expert │ docs_expert │
│ 32 MB │ 32 MB │ 32 MB │
│ +coding │ +friendly │ +technical │
│ +debugging │ +casual │ +formal │
└──────────────┴─────────────┴─────────────┘
You can BLEND them:
- 70% code + 30% chat = friendly coder
- 50% code + 50% docs = technical developer
- 100% chat = pure casual mode
The Math (Don’t Panic)
NORMAL FINE-TUNING:
weights_new = weights_old + big_update
- need to store entire model
- 7B params = 14GB (BF16)
LoRA FINE-TUNING:
weights_new = weights_old + (B × A) × scale
                             ↑
                      tiny matrices!
A: (rank × in_features)  = 32 × 4096 = 131k params
B: (out_features × rank) = 4096 × 32 = 131k params
Total: 262k params per layer
Full model: 7,000,000,000 params
LoRA adapter: 8,000,000 params (0.1%!)
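if you want to poke at those shapes, here's a tiny numpy sketch (rank 32, hidden 4096 as above; the scale value is a made-up alpha/rank, not from any real config):

```python
import numpy as np

rank, hidden = 32, 4096   # shapes from the example above

# PEFT-style factors: A is (rank, in_features), B is (out_features, rank)
A = np.random.randn(rank, hidden).astype(np.float32) * 0.01
B = np.zeros((hidden, rank), dtype=np.float32)   # standard LoRA init: B starts at zero
scale = 2.0                                      # alpha / rank (hypothetical alpha = 64)

delta_W = (B @ A) * scale    # (4096, 4096) -- same shape as the base weight it nudges

print(A.size + B.size)       # 262144 trainable params for this one module
print(delta_W.shape)         # (4096, 4096), but you never store this densely at train time
```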
The Guitar Pedal Analogy
YOUR GUITAR (base model):
- makes sound
- has its own tone
PEDAL BOARD (LoRA experts):
┌─────────┐ ┌─────────┐ ┌─────────┐
│OVERDRIVE│ │ DELAY │ │ CHORUS │
│ (code) │ │ (chat) │ │ (docs) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└─────┬─────┴─────┬─────┘
│ │
[BLEND] [OUTPUT]
Turn knobs to blend effects:
- overdrive 70%, delay 30% = crunchy with echo
- all three 33% each = full palette
- one pedal 100% = pure effect
the pedals don’t replace your guitar. they modify the signal.
How MoE Blending Works
FOR EACH LAYER, FOR EACH MODULE:

    # input tensor: x (shape: [batch, seq, hidden])
    base_output = base_module.forward(x)

    expert_deltas = []
    for expert, weight in zip(experts, weights):
        if weight > 0.001:                  # skip negligible experts
            delta = expert.forward(x)       # small LoRA forward
            expert_deltas.append(delta * weight)

    final_output = base_output + sum(expert_deltas)
that’s it. add the weighted deltas to the base output.
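here's the same idea as a self-contained numpy toy (not the eldr.ᚲ implementation, just the concept): one base matmul, plus two skinny matmuls per active expert, summed with the blend weights:

```python
import numpy as np

class LoraExpert:
    """One LoRA adapter for a single linear layer: delta(x) = (x @ A.T) @ B.T * scale."""
    def __init__(self, in_features, out_features, rank=32, scale=2.0):
        self.A = np.random.randn(rank, in_features).astype(np.float32) * 0.01
        self.B = np.random.randn(out_features, rank).astype(np.float32) * 0.01
        self.scale = scale

    def forward(self, x):
        # two skinny matmuls, never a dense (out x in) delta matrix
        return (x @ self.A.T) @ self.B.T * self.scale

def moe_linear(x, W_base, experts, weights, eps=1e-3):
    """Base linear output plus the weighted sum of per-expert LoRA deltas."""
    out = x @ W_base.T
    for expert, w in zip(experts, weights):
        if w > eps:                      # skip negligible experts
            out = out + w * expert.forward(x)
    return out

# toy usage: batch=1, seq=4, hidden=4096, three experts blended 0.7 / 0.3 / 0.0
hidden = 4096
x = np.random.randn(1, 4, hidden).astype(np.float32)
W_base = np.random.randn(hidden, hidden).astype(np.float32) * 0.01
experts = [LoraExpert(hidden, hidden) for _ in range(3)]
y = moe_linear(x, W_base, experts, weights=[0.7, 0.3, 0.0])
print(y.shape)   # (1, 4, 4096)
```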
The Inventory Slot Analogy
TRADITIONAL LOADOUT:
┌─────────────────────────────────────┐
│ SLOT 1: Warrior Build (full model) │ 14 GB
│ SLOT 2: Mage Build (full model) │ 14 GB
│ SLOT 3: Healer Build (full model) │ 14 GB
└─────────────────────────────────────┘
Total: 42 GB, can only use ONE at a time
MoE LOADOUT:
┌─────────────────────────────────────┐
│ BASE: Adventurer (always loaded) │ 14 GB
│ ├─ warrior.lora (equipped) │ 32 MB
│ ├─ mage.lora (equipped) │ 32 MB
│ └─ healer.lora (equipped) │ 32 MB
└─────────────────────────────────────┘
Total: ~14.1 GB, can use ALL at once!
blend = 60% warrior + 30% mage + 10% healer
= tanky battle-mage with light heals
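quick sanity check on the loadout math, using the same 14 GB / 32 MB numbers:

```python
base_gb = 14.0                        # one 7B model in BF16
adapter_gb = 32 / 1024                # one 32 MB LoRA expert

traditional = 3 * base_gb             # three full fine-tunes
moe = base_gb + 3 * adapter_gb        # one base + three equipped experts

print(traditional)                    # 42.0
print(round(moe, 2))                  # 14.09  (~14.1 GB)
```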
Real Numbers
TESTED ON QWEN3-4B + 4 EXPERTS:
BASE MODEL (no LoRA):
Speed: 67.50 tok/s
Output: generic
2 EXPERTS (32 MB each):
Speed: 44.52 tok/s
Memory: +64 MB
Output: blended style
4 EXPERTS (32 MB each):
Speed: 21.64 tok/s
Memory: +128 MB
Output: rich blend of all 4
tradeoff: more experts = richer blend but slower
(more matmuls per token)
The Visualization
SINGLE FINE-TUNED MODEL:
┌─────────────────────────────────┐
│ MODEL │
│ (fixed personality) │
│ │
│ input → [layers] → output │
└─────────────────────────────────┘
LoRA MoE MODEL:
┌─────────────────────────────────────────────┐
│ │
│ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │exp_1 │ │exp_2 │ │exp_3 │ │exp_4 │ │
│ │ 0.3 │ │ 0.3 │ │ 0.2 │ │ 0.2 │ │
│ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │ │
│ └────┬────┴────┬────┴────┬────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ input → [BASE] + [Σ deltas] → output │
│ │
└─────────────────────────────────────────────┘
TL;DR
| Thing | Traditional | LoRA MoE |
|---|---|---|
| Storage per style | 14 GB | 32 MB |
| Blend at runtime | impossible | trivial |
| Switch styles | reload model | change weights |
| Overhead | 0% | grows with expert count (see Real Numbers) |
| Memory | N × model size | 1 model + N × tiny |
The Catch (Honest Section)
Why isn't everyone using LoRA MoE?
1. Need to train separate LoRA experts first
2. More experts = slower inference
3. Routing logic adds complexity
4. Not all tasks benefit from blending
5. Quality depends on expert training
But when you need multiple personalities?
This is the way.
What We Ship
LoRA MoE in eldr.ᚲ:
- PeftLoRA loader (PEFT-format compatible)
- StaticRouter (fixed weights)
- WeightedRouter (adjustable)
- EncoderRouter (dynamic, input-based)
- Full Qwen3 integration
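rough idea of what each router does, sketched in Python (this is NOT eldr.ᚲ's actual API; names and signatures here are purely illustrative):

```python
import numpy as np

# StaticRouter idea: the blend is fixed once and never changes.
def static_router(fixed_weights):
    return lambda _inputs: fixed_weights

# WeightedRouter idea: same blend-everywhere behavior, but adjustable at runtime.
class WeightedRouter:
    def __init__(self, weights):
        self.weights = weights
    def set_weights(self, weights):
        self.weights = weights          # e.g. flip from [0.7, 0.3] to [0.3, 0.7] mid-session
    def route(self, _inputs):
        return self.weights

# EncoderRouter idea: score the input against each expert, softmax into blend weights.
def encoder_router(input_embedding, expert_embeddings):
    scores = expert_embeddings @ input_embedding     # one similarity per expert
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                           # weights sum to 1
```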
Run it:
cargo run --example lora-moe --release --features cuda -- \
--model /path/to/qwen3 \
--experts /path/to/expert1,/path/to/expert2 \
--weights 0.7,0.3 \
--prompt "Hello!"
You Survived!
now you understand:
- LoRA = tiny add-on (0.1% of model size)
- MoE = blend multiple add-ons (weighted sum)
- runtime switching (no model reload)
- it’s like RPG class specs (equip, blend, go)
your model has multiple personalities now uwu
rune.みんな