Edge ML: Quantization, Pruning, Distillation
Running ML on phones, IoT, and embedded hardware means making models 10-100x smaller. Three techniques do the heavy lifting.
Three techniques
Edge devices have small memory and limited power. Big models don't fit. Three techniques shrink models for the edge: quantisation (reduce precision), pruning (remove weights), distillation (train a smaller model to mimic a larger one). Each has trade-offs; combining them lets useful models run on phones, laptops, embedded devices.
The size-reduction math. A 7B-parameter model in FP16 is 14GB. Q4 quantisation: 3.5GB. Add pruning at 30%: about 2.5GB. Distill to a 3B student instead of shrinking the 7B directly: 1.5GB at Q4, before any pruning. The steps compose; the cumulative reduction makes models that wouldn't fit fit comfortably.
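The arithmetic above can be reproduced with a few lines. This is a minimal sketch, not a sizing tool: the function name `weight_size_gb` is made up for illustration, 1GB is taken as 1e9 bytes, and it assumes pruned weights are not stored at all and that quantisation scales and file metadata add nothing (real files come out somewhat larger).

```python
def weight_size_gb(params_billions, bits_per_weight, kept_fraction=1.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: ignores quantisation scales, zero-points and metadata,
    and treats pruned weights as taking no space.
    """
    params = params_billions * 1e9 * kept_fraction
    return params * bits_per_weight / 8 / 1e9


fp16_7b = weight_size_gb(7, 16)                          # FP16 baseline
q4_7b = weight_size_gb(7, 4)                             # 4-bit weights
q4_7b_pruned = weight_size_gb(7, 4, kept_fraction=0.7)   # 30% of weights removed
q4_3b = weight_size_gb(3, 4)                             # 3B student at 4 bits

for name, gb in [("7B FP16", fp16_7b), ("7B Q4", q4_7b),
                 ("7B Q4, 30% pruned", q4_7b_pruned), ("3B Q4", q4_3b)]:
    print(f"{name}: {gb:.2f} GB")
```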
The quality cost. Each shrinking step costs some quality. Quantisation: typically 1-3 percentage points on benchmarks. Pruning: 2-5 points. Distillation: depends on student-teacher gap; 5-15 points typical. Combined: 8-20 points. The product question is whether the smaller model is "good enough" for the use case.
The use-case fit. Edge ML works for use cases where 80% of cloud-model quality is acceptable. Conversational assistants, basic Q&A, code completion, translation, voice transcription. Doesn't work for use cases requiring frontier capability, complex reasoning, technical analysis, edge-of-knowledge tasks.
Quantisation
Reduce weight precision from FP16 (16 bits) to INT8 (8 bits) to INT4 (4 bits) or lower. Modern quantisation (GPTQ, AWQ, the GGUF Q-series) preserves quality reasonably well at INT8 (1-2 point drop) and INT4 (2-4 point drop typical, sometimes higher). The lower the precision, the larger the quality loss.
The mechanics. Quantisation maps the floating-point weight distribution to a smaller set of integer levels. Per-channel or per-group quantisation (different scales for different weight slices) preserves more quality than naive uniform quantisation. The choice of scaling and grouping is what separates good quantisation libraries from bad ones.
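To see why grouping matters, here is a toy sketch comparing one scale for a whole weight row against one scale per group of four. It uses plain symmetric round-to-nearest quantisation, which is much simpler than what GPTQ or AWQ do; the weight values, group size and function names are all invented for illustration.

```python
def quantise_symmetric(values, bits=4):
    """Symmetric round-to-nearest quantisation with a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantise(q, scale):
    return [qi * scale for qi in q]


def quantise_grouped(values, group_size, bits=4):
    """Quantise each group with its own scale, return dequantised values."""
    out = []
    for start in range(0, len(values), group_size):
        q, scale = quantise_symmetric(values[start:start + group_size], bits)
        out.extend(dequantise(q, scale))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weight row: four small weights, then four large ones.
weights = [0.01, -0.02, 0.015, -0.005, 0.9, -0.7, 0.8, -0.6]

q, scale = quantise_symmetric(weights)
per_tensor = dequantise(q, scale)
per_group = quantise_grouped(weights, group_size=4)

err_tensor = mean_abs_error(weights, per_tensor)
err_group = mean_abs_error(weights, per_group)
print(f"one scale: {err_tensor:.4f}, per-group scales: {err_group:.4f}")
```

With a single scale, the large weights set the step size and the four small weights all round to zero. Per-group scales keep them distinguishable, at the cost of storing one extra scale per group.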
The precision tiers. FP16: baseline. INT8: 2x smaller, ~1-2 point quality drop. INT4: 4x smaller, ~2-4 point quality drop. INT3/INT2: roughly 5-8x smaller (16/3 ≈ 5.3x, 16/2 = 8x) but quality drops sharply (5-15 points), only justified for severe size constraints.
The activation quantisation. Beyond weights, activations can also be quantised. Activation quantisation requires calibration on representative data; without good calibration, quality drops fast. Most edge deployments quantise weights aggressively but keep activations in FP16 or INT8.
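A rough illustration of why calibration data matters. The sketch picks an INT8 scale from a high percentile of the activation magnitudes seen during calibration, then measures the error on "deployment" activations. Percentile clipping is one common choice among several, and the Gaussian activations, the 99.9 percentile and all names here are assumptions made for the example.

```python
import random


def calibrate_scale(calibration_activations, bits=8, percentile=99.9):
    """Choose a scale from a percentile of |activation| in the calibration set."""
    magnitudes = sorted(abs(a) for a in calibration_activations)
    idx = min(len(magnitudes) - 1, int(len(magnitudes) * percentile / 100))
    clip = magnitudes[idx]
    qmax = 2 ** (bits - 1) - 1
    return clip / qmax if clip > 0 else 1.0


def fake_quantise(x, scale, bits=8):
    """Quantise then dequantise one value, clipping to the integer range."""
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def mean_abs_error(values, scale):
    return sum(abs(v - fake_quantise(v, scale)) for v in values) / len(values)


random.seed(0)
# Synthetic stand-ins for recorded activations (assumption, not real data).
representative = [random.gauss(0, 2.0) for _ in range(10_000)]
narrow = [random.gauss(0, 0.2) for _ in range(10_000)]   # unrepresentative
deployment = [random.gauss(0, 2.0) for _ in range(1_000)]

err_representative = mean_abs_error(deployment, calibrate_scale(representative))
err_narrow = mean_abs_error(deployment, calibrate_scale(narrow))
print(f"representative calibration: {err_representative:.4f}")
print(f"narrow calibration:         {err_narrow:.4f}")
```

The narrow calibration set produces a clipping range far below what deployment actually sees, so most activations saturate and the error is much larger.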
The format choice. GGUF (llama.cpp) is the standard for CPU inference of quantised models. For GPU edge (NVIDIA Jetson, etc), AWQ and GPTQ are common. Match the format to your inference runtime; mismatches mean re-quantising.
Pruning
Set some weights to zero, then re-train (or re-tune) on the remaining sparse model. Structured pruning (remove whole channels or attention heads) preserves dense-matrix-multiply efficiency. Unstructured pruning removes individual weights; more compression but harder to accelerate.
The structured-pruning advantage. Removing whole channels keeps the matrix shape regular. The pruned model is just a smaller dense model; standard hardware accelerates it normally. This is the practical approach for production.
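A small sketch of what "just a smaller dense model" means, using a two-layer block stored as nested lists. Hidden units are ranked by the L2 norm of their incoming weights, which is only one possible importance score; the function name and the toy matrices are illustrative.

```python
def l2_norm(row):
    return sum(w * w for w in row) ** 0.5


def prune_hidden_channels(w1, w2, keep_fraction):
    """Drop the weakest hidden units from a two-layer block.

    w1: hidden x in  (each row feeds one hidden unit)
    w2: out x hidden (each column reads one hidden unit)
    Importance here is the L2 norm of the w1 row (an assumption).
    """
    hidden = len(w1)
    n_keep = max(1, round(hidden * keep_fraction))
    ranked = sorted(range(hidden), key=lambda i: l2_norm(w1[i]), reverse=True)
    keep = sorted(ranked[:n_keep])
    w1_small = [w1[i] for i in keep]
    w2_small = [[row[i] for i in keep] for row in w2]
    return w1_small, w2_small, keep


w1 = [[0.9, -0.8, 0.7],
      [0.01, 0.02, -0.01],
      [0.5, 0.6, -0.4],
      [0.0, 0.03, 0.01]]
w2 = [[1.0, 2.0, 3.0, 4.0],
      [5.0, 6.0, 7.0, 8.0]]

w1_small, w2_small, kept = prune_hidden_channels(w1, w2, keep_fraction=0.5)
print(kept)                                 # indices of surviving hidden units
print(len(w1_small), len(w2_small[0]))      # new hidden width in both layers
```

Both matrices shrink along the hidden dimension and stay dense, so no sparse kernels are needed to benefit.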
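The unstructured-pruning trade-off. Removing individual weights creates sparse matrices. Commodity hardware runs these no faster than the dense original. Hardware sparsity support usually expects a fixed pattern instead, such as NVIDIA's 2:4 sparsity (two of every four consecutive weights zeroed), which is semi-structured rather than fully unstructured. Without sparse acceleration, unstructured pruning gives no speedup; at best it saves storage when the weights are kept in a compressed sparse format.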
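For reference, the 2:4 pattern keeps two weights out of every four consecutive ones. A minimal sketch of building that pattern by magnitude follows; it only produces the zeroed weights, not the compressed storage format the hardware consumes, and the function name is made up.

```python
def two_four_prune(row):
    """In each block of 4 consecutive weights, keep the 2 largest by magnitude."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        block = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(block[i]), reverse=True)[:2]
        pruned.extend(w if i in keep else 0.0 for i, w in enumerate(block))
    return pruned


row = [0.1, -0.9, 0.3, 0.05, 0.7, 0.2, -0.6, 0.0]
sparse_row = two_four_prune(row)
print(sparse_row)
```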
The pruning workflow. Train (or use pre-trained) model. Identify low-magnitude weights (or low-importance channels via attribution). Remove them. Fine-tune the pruned model to recover lost capacity. Iterate; each round of prune-then-finetune retains more capability than one-shot pruning.
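The prune-then-finetune loop can be sketched as below. This is a skeleton under assumptions: weights are a flat list, importance is plain magnitude, sparsity ramps linearly over the rounds, and `finetune` is a hypothetical callback you would replace with real training that keeps masked weights at zero.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = round(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def iterative_prune(weights, target_sparsity, rounds, finetune):
    """Ramp sparsity up over several rounds, fine-tuning after each one."""
    for r in range(1, rounds + 1):
        sparsity = target_sparsity * r / rounds
        weights = magnitude_prune(weights, sparsity)
        mask = [w != 0.0 for w in weights]
        weights = finetune(weights, mask)   # must leave masked weights at zero
    return weights


def placeholder_finetune(weights, mask):
    """Stand-in for real fine-tuning: returns surviving weights unchanged."""
    return [w if keep else 0.0 for w, keep in zip(weights, mask)]


weights = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1, -0.3, 0.01, 0.6, -0.2]
pruned = iterative_prune(weights, target_sparsity=0.5, rounds=5,
                         finetune=placeholder_finetune)
print(pruned)
```

With a real `finetune`, the surviving weights shift between rounds, so later rounds rank importance on the recovered model rather than the original one.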
The realistic ratios. 30-50% weight removal is achievable with minimal quality loss when structured. 70%+ removal causes substantial quality degradation. The "easy" pruning is in the bottom 30-40%; aggressive pruning has diminishing returns.
Distillation
Train a small "student" model to match the outputs of a large "teacher" model. The student learns from the teacher's probability distributions, not just hard labels. Distillation can transfer surprising amounts of capability: a 7B distilled student often outperforms a 7B trained from scratch by 5-10 points on benchmarks.
The mechanics. The teacher produces output distributions over tokens (softmax probabilities). The student is trained to match these distributions, not just to predict the teacher's top-1 choice. The "soft labels" carry more information than hard labels; the student learns better generalisation patterns.
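A toy comparison of soft and hard labels. The soft loss here is KL divergence between temperature-softened teacher and student distributions, scaled by the temperature squared; the temperature and that scaling are a common convention, not something stated above, and the logits are invented.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def hard_label_loss(student_logits, teacher_logits):
    """Cross-entropy against the teacher's top-1 token only."""
    target = max(range(len(teacher_logits)), key=lambda i: teacher_logits[i])
    return -math.log(softmax(student_logits)[target])


teacher = [4.0, 3.5, -2.0]      # token 1 is nearly as plausible as token 0
student_a = [4.0, 3.5, -2.0]    # agrees with the teacher on everything
student_b = [4.0, -2.0, 3.5]    # same top-1, wrong runner-up

hard_a = hard_label_loss(student_a, teacher)
hard_b = hard_label_loss(student_b, teacher)
soft_a = soft_label_loss(student_a, teacher)
soft_b = soft_label_loss(student_b, teacher)
print(f"hard: {hard_a:.4f} vs {hard_b:.4f}")
print(f"soft: {soft_a:.4f} vs {soft_b:.4f}")
```

The hard-label loss cannot tell the two students apart, because both put the same probability on the teacher's top choice. The soft-label loss penalises student B for ranking the alternatives wrongly, which is the extra information the soft labels carry.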
The teacher choice. The teacher should be substantially larger than the student (5-20x typically). Same-size teacher and student offer minimal benefit. Choose the largest teacher you can afford to run during distillation; distillation compute is one-time, the benefits are permanent.
The data choice. Distillation needs lots of input prompts. Use synthetic prompts generated by another model, real prompts from logs (anonymised), or curated diverse prompts. The data diversity determines what capabilities transfer; narrow data produces narrow students.
The compute cost. Distillation requires running the teacher on every training example, then training the student on the teacher's outputs. Roughly 2-5x the training compute of training the student from scratch. The cost is high but one-time; the resulting model is reusable.
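One way to sanity-check that range is a per-token FLOP estimate. The sketch assumes the usual rules of thumb of about 2 FLOPs per parameter for a forward pass and about 6 per parameter for a training step, the same number of tokens for both runs, and no caching of teacher outputs. All of these are assumptions, and the function name is illustrative.

```python
def distillation_compute_ratio(teacher_params, student_params):
    """(teacher forward + student training) / (student training alone), per token.

    Assumes ~2*N FLOPs per token for a forward pass and ~6*N for a
    training step (forward plus backward).
    """
    teacher_forward = 2 * teacher_params
    student_train = 6 * student_params
    return (teacher_forward + student_train) / student_train


for multiple in (5, 10, 20):
    ratio = distillation_compute_ratio(multiple * 3e9, 3e9)
    print(f"teacher {multiple}x student: {ratio:.2f}x from-scratch compute")
```

Under these assumptions the ratio is 1 + (teacher size / student size) / 3, so teachers up to roughly 12x the student stay inside the 2-5x range, and larger teachers push above it unless their outputs are cached and reused.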
Combining them
The pipeline that works in production: distill from a large teacher to a small student → prune the student → quantise the pruned student. The savings compound at each step. A 70B teacher can compress to ~1B effective parameters that run on a phone; by the per-step costs above, budget for a combined quality cost closer to 8-20 points than to a few percent.
The order matters. Distill first (it's the biggest reduction, sets the baseline). Prune second (refines the distilled model). Quantise last (final size reduction for deployment, applied after the steps that still need training in higher precision). Reordering loses some efficiency; the standard order is the one that has held up in practice.
The compounding math. 7B → 3B (distill, ~5% quality cost) → ~2B (prune 30%, ~3% cost) → ~1GB at Q4 (quantise, ~2% cost; 2B weights at 4 bits each is about 1GB). Total: 14GB → ~1GB, ~10% quality cost. The shrunk model fits on most edge devices; the quality is acceptable for many edge use cases.
The framework support. Hugging Face Transformers + bitsandbytes for quantisation, neural_compressor for pruning, distil-whisper-style scripts for distillation. The tooling is mature; the recipes are well-documented; the work is mostly tuning hyperparameters for your specific model and target.
Common antipatterns
Quantising without measuring quality. Some models tolerate Q4 well; others lose lots. Always evaluate on YOUR task before deploying.
Pruning without fine-tuning. One-shot pruning is destructive; fine-tuning recovers most of the lost quality. The fine-tune is essential, not optional.
Distilling on a narrow dataset. The student learns the dataset's distribution; OOD performance suffers. Use diverse, broad data for distillation.
Stacking compression too aggressively. Diminishing returns set in. 70% size reduction is usually achievable; 95% is rarely worth the quality cost.
What to do this week
Three moves. (1) For your current largest deployed model, measure how much you'd save with INT4 quantisation. Weight memory drops about 4x versus FP16, often for a 2-4 point quality loss; measure the latency and cost gain on your own hardware rather than assuming it. (2) If you have edge deployment plans, build the size budget first (memory limit, latency limit). The budget tells you which compression techniques you need. (3) Test a Q4-quantised version of your largest model on the actual deployment hardware. The performance ceiling is what determines product scope.
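For move (2), a memory budget can be turned into a shortlist of model size and precision combinations. This sketch covers only the memory side: the 20% runtime overhead (KV cache, activations, runtime) is a placeholder assumption you should replace with measurements, latency is not modelled, and the function names and candidate sizes are illustrative.

```python
def fits_budget(params_billions, bits, budget_gb, overhead_fraction=0.2):
    """True if weights plus an assumed runtime overhead fit the memory budget."""
    weights_gb = params_billions * bits / 8      # 1e9 params * bits / 8 bytes
    return weights_gb * (1 + overhead_fraction) <= budget_gb


def candidate_configs(budget_gb, sizes_b=(7, 3, 1), bit_widths=(16, 8, 4)):
    """List (params in billions, bits per weight) pairs that fit the budget."""
    return [(p, b) for p in sizes_b for b in bit_widths
            if fits_budget(p, b, budget_gb)]


options = candidate_configs(budget_gb=4)
print(options)
```

With a 4GB budget and these assumptions, a 7B model does not fit even at 4 bits, which points to distillation or pruning rather than quantisation alone.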