
What are large language models and small language models, and what does that mean for frontend engineers?

LLMs (Large Language Models) are very large neural networks trained on massive text corpora, which makes them capable across a broad range of language tasks. SLMs (Small Language Models) are compact models with far fewer parameters that trade some capability for lower latency, lower cost, on-device or edge deployment, and better privacy.


Both are language models — neural networks (almost always transformers) that predict and generate text. The difference is scale, and scale drives a whole set of tradeoffs.

LLM — Large Language Model

  • Large — billions to hundreds of billions of parameters; trained on huge, broad datasets.
  • Capable & general — strong reasoning, broad knowledge, good at complex/open-ended tasks, multi-step instructions, code.
  • Costly — heavy compute/memory; usually run in the cloud on GPU/TPU clusters. Higher latency and per-call cost.
  • Examples: GPT-4/Claude/Gemini-class models.

SLM — Small Language Model

  • Small — millions to a few billion parameters; often distilled, quantized, or fine-tuned from or alongside larger models.
  • Efficient — low latency, low cost, can run on-device / at the edge (phone, browser, embedded), works offline, keeps data local (privacy).
  • Narrower — less general knowledge and reasoning depth; best when fine-tuned for a specific task/domain (classification, extraction, a focused assistant) where it can match or beat a general LLM at a fraction of the cost.
  • Examples: Phi, Gemma, Llama small variants, on-device models.
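To make the size gap concrete, weight memory can be roughly estimated as parameters × bits per parameter ÷ 8. The sketch below assumes illustrative model sizes (3B and 70B parameters) that are not tied to any specific model, and it ignores activations, KV cache, and runtime overhead, so the results are lower bounds.

```typescript
// Rough weight-memory estimate: parameters × bits per parameter / 8.
// Ignores activations, KV cache, and runtime overhead (lower bound only).
function weightMemoryGB(parameters: number, bitsPerParameter: number): number {
  const bytes = parameters * (bitsPerParameter / 8);
  return bytes / 1e9; // decimal gigabytes
}

// Hypothetical model sizes, chosen for illustration only.
const slmFp16GB = weightMemoryGB(3e9, 16);  // 3B parameters at 16-bit precision
const slmInt4GB = weightMemoryGB(3e9, 4);   // same model quantized to 4-bit
const llmFp16GB = weightMemoryGB(70e9, 16); // 70B parameters at 16-bit precision

console.log({ slmFp16GB, slmInt4GB, llmFp16GB });
```

Under these assumptions the quantized 3B model needs about 1.5 GB for weights, which is plausible for a phone or browser tab, while the 70B model needs about 140 GB and realistically stays on server hardware.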

The tradeoff table

               | LLM                                 | SLM
---------------+-------------------------------------+--------------------------------------------------
Parameters     | billions to hundreds of billions    | millions to a few billion
Capability     | broad, strong reasoning             | narrower, task-focused
Latency / cost | higher                              | low
Deployment     | cloud, GPU clusters                 | edge / on-device / cheap servers
Privacy        | data leaves the device              | can stay local
Best for       | complex, open-ended, general tasks  | specific tasks, real-time, offline, cost-sensitive

How to think about choosing

It's not "bigger is better" — it's fit:

  • Complex, varied, reasoning-heavy → LLM.
  • Well-defined, high-volume, latency/cost/privacy-sensitive → SLM (often fine-tuned).
  • Common real architecture: an SLM handles the routine cases on-device/cheaply and routes hard cases to an LLM (a cascade/router pattern).
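The cascade/router pattern from the last bullet can be sketched as follows. This is a minimal illustration, not a reference implementation: the `SmallModel` and `LargeModel` function shapes, the confidence score, and the 0.8 threshold are all assumptions, and real small models may not expose a usable confidence value at all.

```typescript
// Hypothetical shapes: both model calls are injected, so no specific runtime is assumed.
type SmallModel = (input: string) => Promise<{ text: string; confidence: number }>;
type LargeModel = (input: string) => Promise<string>;

type Route = "slm" | "llm";

interface RoutedAnswer {
  text: string;
  servedBy: Route;
}

// Pure routing decision, kept separate so it is easy to unit test.
function pickRoute(confidence: number, minConfidence: number): Route {
  return confidence >= minConfidence ? "slm" : "llm";
}

async function cascade(
  input: string,
  small: SmallModel,
  large: LargeModel,
  minConfidence = 0.8, // assumed threshold; tune it against real traffic
): Promise<RoutedAnswer> {
  // Always try the cheap model first.
  const draft = await small(input);
  if (pickRoute(draft.confidence, minConfidence) === "slm") {
    return { text: draft.text, servedBy: "slm" };
  }
  // Low confidence: pay for the larger model only on the hard cases.
  return { text: await large(input), servedBy: "llm" };
}

// Stub models for illustration: short inputs count as "routine".
const stubSmall: SmallModel = async (input) => ({
  text: `slm:${input}`,
  confidence: input.length < 20 ? 0.95 : 0.3,
});
const stubLarge: LargeModel = async (input) => `llm:${input}`;

cascade("reset password", stubSmall, stubLarge).then((answer) =>
  console.log(answer.servedBy),
);
```

Injecting both models as functions keeps the routing logic independent of where each model runs, so the same code works whether the small model is in the browser or on a cheap server.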

Why this matters for a frontend/product engineer

It's a real product decision. An SLM running in-browser (WebGPU) or on-device gives low-latency, private, offline-capable inference with no per-call cost once the model has been downloaded, which suits autocomplete, classification, or a lightweight assistant. An LLM API gives you more capability but adds latency, cost, and a network dependency. Knowing the spectrum lets you pick the right one, or combine them.
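On the frontend, this choice often becomes a small capability check at runtime. The sketch below is one possible policy under stated assumptions: `chooseBackend`, `Capabilities`, and the `needsDeepReasoning` flag are hypothetical names, and the presence of `navigator.gpu` only indicates that the WebGPU API is exposed, not that a usable adapter or enough memory is available.

```typescript
interface Capabilities {
  hasWebGPU: boolean; // WebGPU API is exposed (a usable adapter is not guaranteed)
  online: boolean;    // browser reports a network connection
}

type Backend = "on-device" | "remote-api" | "unavailable";

// Hypothetical policy: prefer the local SLM for routine work,
// prefer the remote LLM when the task needs deeper reasoning.
function chooseBackend(caps: Capabilities, needsDeepReasoning: boolean): Backend {
  if (needsDeepReasoning && caps.online) return "remote-api";
  if (caps.hasWebGPU) return "on-device";
  return caps.online ? "remote-api" : "unavailable";
}

// Reads navigator.gpu and navigator.onLine defensively so the code
// also runs in environments where they are missing (for example Node).
function detectCapabilities(): Capabilities {
  const nav = (globalThis as unknown as {
    navigator?: { gpu?: unknown; onLine?: boolean };
  }).navigator;
  return {
    hasWebGPU: nav?.gpu !== undefined,
    online: nav?.onLine ?? true, // assume online when the flag is absent
  };
}

console.log(chooseBackend(detectCapabilities(), false));
```

Keeping the policy a pure function of `Capabilities` makes it testable without a browser, and a production version would likely also confirm that an adapter can actually be requested before committing to the on-device path.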

Follow-up questions

  • When would you pick an SLM over an LLM?
  • What are distillation and quantization, and how do they produce SLMs?
  • What's a model router/cascade and why use one?
  • What are the tradeoffs of running a model on-device vs via an API?

Common mistakes

  • Assuming bigger is always better — ignoring latency, cost, and privacy.
  • Thinking SLMs are just 'worse LLMs' rather than a different fit.
  • Overlooking on-device/edge deployment as an option.
  • Not considering a cascade (SLM + LLM) architecture.

Performance considerations

  • SLMs offer low latency, low/zero per-call cost, and on-device/offline operation; LLMs add network latency and per-token cost. Routing routine cases to an SLM and hard cases to an LLM optimizes both cost and quality.
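The cost side of that routing claim is simple arithmetic: in a cascade every request pays the SLM cost, and only the escalated fraction also pays the LLM cost. The sketch below uses made-up prices and an assumed 10% escalation rate purely for illustration; none of the numbers come from a real provider.

```typescript
interface CostInputs {
  requests: number;          // total requests in the period
  llmCostPerRequest: number; // average LLM API cost per request
  slmCostPerRequest: number; // 0 for on-device inference
  escalationRate: number;    // fraction of requests routed to the LLM (0..1)
}

// Every request goes straight to the LLM.
function llmOnlyCost(c: CostInputs): number {
  return c.requests * c.llmCostPerRequest;
}

// Every request hits the SLM first; only escalated ones also hit the LLM.
function cascadeCost(c: CostInputs): number {
  return c.requests * (c.slmCostPerRequest + c.escalationRate * c.llmCostPerRequest);
}

// Hypothetical numbers for illustration only.
const scenario: CostInputs = {
  requests: 1_000_000,
  llmCostPerRequest: 0.002,
  slmCostPerRequest: 0,
  escalationRate: 0.1,
};

console.log({
  llmOnly: llmOnlyCost(scenario),
  cascade: cascadeCost(scenario),
});
```

With these assumed inputs the cascade costs roughly a tenth of the LLM-only setup. The saving shrinks as the escalation rate rises, and the quality side of the tradeoff still has to be measured separately.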

Edge cases

  • Tasks where a fine-tuned SLM beats a general LLM.
  • Offline or privacy-constrained environments forcing on-device models.
  • Cost-sensitive high-volume workloads.

Real-world examples

  • An in-browser SLM (WebGPU) doing instant private autocomplete or classification.
  • A cascade: on-device SLM handles common queries, escalates complex ones to a cloud LLM.

Senior engineer discussion

Seniors present it as a capability/cost/latency/privacy spectrum rather than a hierarchy, note that SLMs come from distillation/quantization/fine-tuning, and emphasize fit-for-purpose plus cascade/router architectures. For a product role they connect it to concrete frontend choices — on-device WebGPU inference vs an LLM API.
