The Generalist Is Not Always the Right Choice
The default assumption in LLM-backed products is that bigger is better. Use the frontier model, prompt it well, ship it. That works. It also leaves money, latency, and privacy on the table when the agent only needs to do one thing.
Small language models — the 1B to 8B parameter range, usually fine-tuned for a narrow task — have quietly become the right answer for a growing class of production workloads. Not because they are smarter than frontier models, but because on a specific, bounded task, a well-tuned small model outperforms a frontier model at 1/50th the inference cost, with latency measured in milliseconds instead of seconds, and can be deployed in places a frontier model cannot go.
The shift that made this practical was the maturation of post-training techniques. LoRA fine-tuning is now routine. Distillation pipelines work. The tooling around preparing training data from production traces is good enough that a team can go from "we need a small model for this" to "it's serving traffic" in a week.
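Why LoRA made fine-tuning routine comes down to parameter counts. A back-of-envelope sketch (the 4096-dimensional layer and rank 8 are assumed illustrative values, not drawn from any particular model):

```python
def lora_params(d: int, k: int, r: int) -> int:
    # LoRA freezes the original d x k weight matrix and trains two
    # low-rank factors instead: B (d x r) and A (r x k).
    return r * (d + k)

full_update = 4096 * 4096                 # trainable params for a full update of one layer
lora_update = lora_params(4096, 4096, 8)  # the same layer with a rank-8 adapter
print(full_update // lora_update)         # 256x fewer trainable parameters
```

That reduction in trainable parameters is what turns fine-tuning from a GPU-cluster project into a single-card job.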
What Small Models Win At
The jobs where small models reliably beat frontier models in production:
Classification-heavy tasks. Intent detection, document routing, content moderation. The task has a bounded output space. A fine-tuned 3B model will match or beat a 100B+ model on a classification task it was trained for, because the training process optimized for that specific decision boundary.
Structured extraction. Pulling fields from documents, invoices, medical records, support tickets. The output is a known schema, the input is within a specific domain, and the task is repetitive. Small models fine-tuned on a few thousand examples hit precision targets that generalists only match when heavily prompted.
Routing within agentic systems. The first-stage classifier in a router that decides which tool to call, which agent to hand off to, or which retrieval strategy to use. This runs on every user request. Running it on a frontier model is paying a 50x premium for a decision a small model makes just as well.
High-volume similarity and embedding-like work. Quality scoring, duplicate detection, content matching. These are throughput-bound jobs where cost and latency dominate.
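The routing job above reduces to confidence-gated dispatch: a minimal sketch in which `small_model_classify` is a hypothetical stand-in for a call to a fine-tuned local classifier, and the 0.80 threshold is an assumed tuning parameter.

```python
from dataclasses import dataclass

@dataclass
class Route:
    label: str
    confidence: float

def small_model_classify(text: str) -> Route:
    # Stand-in for a fine-tuned 3B classifier; a real system would
    # call a local inference endpoint here, not keyword-match.
    if "invoice" in text.lower():
        return Route("billing", 0.97)
    return Route("general", 0.55)

CONFIDENCE_FLOOR = 0.80  # below this, escalate rather than guess

def route(text: str) -> str:
    decision = small_model_classify(text)
    if decision.confidence >= CONFIDENCE_FLOOR:
        return decision.label       # cheap path: small model decides alone
    return "frontier_fallback"      # ambiguous input: pay for the big model
```

The fallback branch is the point: the small model handles the bulk of traffic, and only low-confidence inputs ever reach the frontier model.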
What They Do Not Win At
Reasoning across long contexts, synthesis of disparate information, open-ended writing, and any task where the input distribution drifts significantly from the training data. Those remain frontier-model territory.
The pattern that works in production is not "replace the frontier model" but "wrap the frontier model with small models at the edges." The frontier model handles the reasoning-intensive core. Small models handle the classification, routing, extraction, and scoring around it. Cost drops because the small models absorb the high-volume traffic. Latency drops because the small models return instantly. Quality often improves because each small model was tuned for its specific job.
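The cost claim is easy to make concrete. Assuming the small models run at the 1/50th price mentioned earlier and absorb 90% of calls (the 90% share is an assumed figure for illustration):

```python
frontier_cost = 1.0               # normalized cost per call
slm_cost = frontier_cost / 50     # the 1/50th inference-cost figure
absorbed = 0.90                   # share of traffic the small models handle

blended = absorbed * slm_cost + (1 - absorbed) * frontier_cost
print(round(blended, 3))          # 0.118: roughly an 88% cost reduction
```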
The Training Data Problem
The bottleneck for small-model deployment is not the tooling. It is the training data. Teams that have been running a frontier model in production for six months have gold-mine data — every request, the model's response, whether the user accepted or rejected it — and most of them are not using it.
The pattern: pick a task where the frontier model is being over-used. Log the inputs and outputs for 30 days. Have the product team (or a labeling pipeline) review the outputs for correctness. The resulting dataset of input → validated output is training data for a small model fine-tuned on the same task. It usually takes 5,000 to 20,000 examples to get a small model to match frontier quality on a narrow task. Fewer if the task is very constrained.
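That pipeline can be sketched in a few lines. The trace field names (`input`, `output`, `review`) are assumed for illustration; any logging schema with a human-validated verdict works the same way.

```python
import json

def build_finetune_set(traces, min_examples=5000):
    """Turn reviewed production traces into JSONL fine-tuning data.

    Each trace is assumed to be a dict with 'input', 'output', and a
    'review' verdict added by the labeling pass.
    """
    examples = [
        {"prompt": t["input"], "completion": t["output"]}
        for t in traces
        if t.get("review") == "approved"  # keep only validated pairs
    ]
    if len(examples) < min_examples:
        raise ValueError(
            f"only {len(examples)} approved examples; "
            f"a narrow task usually needs ~{min_examples}"
        )
    return "\n".join(json.dumps(e) for e in examples)
```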
The trap is using the frontier model's outputs as training labels without review. The small model will faithfully learn the frontier model's mistakes, and those mistakes harden into failure modes that were not in the original system. Review is not optional; it is how you avoid inheriting the bugs you were trying to eliminate.
Deployment Options That Change the Math
Small models unlock deployment patterns that frontier models cannot match.
On-device inference. A 3B model quantized to 4-bit runs on commodity laptops and on newer phones. That changes the product story for anything privacy-sensitive. No request ever leaves the device. Latency is bounded by local compute. Offline operation is automatic. Frontier models cannot do this, and will not be able to for years.
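The on-device claim follows from simple memory arithmetic for the weights alone (ignoring activations and KV cache, which add overhead on top):

```python
def weights_gib(params_billion: float, bits_per_param: int) -> float:
    # Memory for the weights: parameter count times bytes per parameter.
    return params_billion * 1e9 * (bits_per_param / 8) / 2**30

fp16 = weights_gib(3, 16)   # ~5.6 GiB: tight on an 8 GB laptop
q4 = weights_gib(3, 4)      # ~1.4 GiB: fits comfortably alongside the OS
print(round(fp16, 1), round(q4, 1))
```

The 4x shrink from fp16 to 4-bit is what moves a 3B model from "server-class" to "runs on the hardware your users already own."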
Edge deployment. A small model running in a regional data center is closer to the user than any frontier API. That matters for anything real-time — voice applications, interactive UIs, low-latency agentic loops. Round-trip times drop from 500ms to under 50ms.
On-premise deployment. For regulated industries, the question is not whether you want to self-host — it is whether you can. A small model on your own hardware is a fundamentally different compliance posture than a call to a third-party API. The same institution that cannot ship a GPT-integrated product can ship one with an SLM running in their own environment.
Every one of those patterns opens a market that frontier-model systems cannot enter.
The Vertical Agent Pattern
The architecture that keeps emerging in our work: a vertical agent for a specific industry or function is a small model fine-tuned on that vertical's data, wrapped in structured tools that pull from that vertical's systems, orchestrated by a lightweight planner that may or may not be a frontier model.
The small model at the core is what makes the economics work. Fine-tuning captures the domain vocabulary, the typical workflows, the regulatory constraints specific to that industry. A generalist frontier model can approximate this with long prompts, but the prompt grows with every edge case until it is fighting the model's own priors. Fine-tuning folds those priors into the weights where they belong.
A medical documentation agent runs on a model fine-tuned on medical scribing data, tool-connected to the EHR, orchestrated by whatever reasoning layer the product team chose. A legal research agent runs on a model fine-tuned on case law and legal reasoning patterns. A field-service dispatch agent runs on a model fine-tuned on work-order data and scheduling constraints. The shape is the same; the weights change.
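The "same shape, different weights" point can be expressed as a single configuration type. The weight paths and tool names below are hypothetical placeholders, not real artifacts:

```python
from dataclasses import dataclass

@dataclass
class VerticalAgent:
    model: str          # fine-tuned SLM weights for the vertical
    tools: list[str]    # structured connectors into the vertical's systems
    planner: str        # orchestration layer; frontier or local

medical = VerticalAgent("weights/med-scribe-3b", ["ehr_lookup", "icd_codes"], "frontier")
legal = VerticalAgent("weights/case-law-7b", ["case_search", "docket_fetch"], "local")

# The agent shape is identical across verticals; only the
# weights, connectors, and planner choice differ.
assert type(medical) is type(legal)
```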
When Not to Bother
If your product is early, your traffic is low, and the task is still in discovery, stay on the frontier model. Fine-tuning a small model requires training data, and the fastest way to get training data is to run the frontier model in production and capture what it does. Skip that step and you are fine-tuning on synthetic data, which rarely reflects real usage.
If your task is open-ended or the input distribution is unbounded, small models will disappoint. Anything that looks like "respond to whatever the user asks" does not compress into a fine-tuned small model.
If your volume is low — a few thousand calls a month — the cost savings do not justify the operational overhead of a fine-tuned model. Frontier API is simpler and good enough.
The sweet spot is a production system with at least one high-volume, narrow, well-understood task sitting inside it. That task is where a small model pays for itself in weeks. If you want an assessment of which tasks in your system fit the pattern, we scope that work and return a shortlist. Small models are not a replacement for frontier models. They are how the mature version of a frontier-model system operates.