Small Models Are Catching Up. Your AI Strategy Should Notice.

Do not buy model hype. Benchmark the work your company actually does, then route each job to the cheapest model that reliably clears the bar.

May 2026 · 12 min
cost per task, model routing, open-weight AI
Figure: Model routing board comparing frontier, small proprietary, and open-weight models by task quality and cost

The benchmark is not which model is smartest

For the last few years, the default enterprise AI pattern has been simple: send the work to the most capable proprietary model available, then worry about the economics later. That was defensible when smaller models were mostly useful for routing, extraction, classification, or demos. It is much less defensible now.

The lower end of the model spectrum has changed. Smaller proprietary models, hosted open-weight models, and models that can run on consumer or prosumer hardware are becoming good enough for real, scoped business work. Not all work. Not frontier-level reasoning. Not high-risk synthesis. But enough daily work that the economics deserve serious attention.

The useful question is not whether a model can beat the best model in the world. The useful question is whether it can clear the quality bar for a specific job at a materially lower cost per accepted task.

Start with the work

Most AI conversations still over-index on model prestige. A new model drops, a leaderboard shifts, a few demos circulate, and teams start asking whether they should move everything to the latest release. That is an expensive way to run AI.

The better way is to start with the work people actually ask models to do every day: drafting internal emails, summarizing meetings, classifying customer requests, extracting fields from documents, writing first-pass reports, checking spreadsheet logic, running narrow financial analysis, creating agent subroutines, transforming data, researching internal documents, and generating recommendations.

Those are different jobs. They have different risk profiles, error tolerance, latency requirements, and cost ceilings. Treating them as if they all require the same model is not sophistication. It is a lack of architecture.

A cost-per-task benchmark changes the conversation

I have been running a benchmark across 100 real-world financial-analysis interactions. These are not abstract leaderboard puzzles. They are examples of the kind of work people actually ask models to do: interpret intent, perform calculations, reason through a financial question, and return something consistent enough to be useful.

For this category of task, I generally treat 94% accuracy or better as the threshold where a model becomes viable without excessive review overhead. That threshold is not universal. A marketing rewrite might tolerate a lower bar. A regulated financial recommendation might require a much higher bar plus human review. But for this benchmark, 94% is the approximate line where I start paying serious attention.

The proprietary leaders performed well, as they should. The surprise was that smaller and open-weight models were close enough to matter. A locally run Gemma 4 26B A4B configuration crossed the same practical viability threshold that I use for this financial-analysis workflow. It did not beat the leader, but it may not need to.

Cost per token is the wrong economic unit

A provider rate card tells you what one million input or output tokens cost. It does not tell you what a completed business task costs. That gap matters because different models do not consume tokens the same way. One model might solve a task with a short answer. Another might produce a long answer. One might require retries. Another might follow a schema correctly on the first attempt.
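To make the gap between rate card and task cost concrete, here is a minimal sketch of the arithmetic. The function name and every number in it are hypothetical placeholders, not results from my benchmark, and it assumes each retry consumes the same tokens as the first attempt.

```python
def model_cost_per_task(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
    attempts: int = 1,
) -> float:
    """Model spend for one task, counting every attempt it took.

    Simplifying assumption: each attempt uses the same number of tokens.
    """
    per_attempt = (
        input_tokens * input_rate_per_million
        + output_tokens * output_rate_per_million
    ) / 1_000_000
    return per_attempt * attempts


# Hypothetical model with a low rate card, long answers, and one retry.
low_rate_model = model_cost_per_task(2_000, 6_000, 1.0, 4.0, attempts=2)

# Hypothetical model with a higher rate card, short answers, no retry.
high_rate_model = model_cost_per_task(2_000, 800, 3.0, 15.0, attempts=1)

print(f"low rate card:  ${low_rate_model:.3f} per task")
print(f"high rate card: ${high_rate_model:.3f} per task")
```

With these made-up inputs, the model with the cheaper rate card ends up costing more per task, because verbosity and retries outweigh the per-token discount.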

Cost per accepted task is the more useful metric: model cost plus tool cost plus infrastructure cost plus retry cost plus review cost plus failure cost, divided by accepted outputs.
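The definition above translates directly into code. This is a sketch, not a finished accounting model: the class and field names are my own illustrative choices, the dollar figures are invented, and in practice the hard part is attributing review and failure costs honestly.

```python
from dataclasses import dataclass


@dataclass
class TaskCosts:
    """Total spend for a batch of tasks, in one currency unit."""
    model: float           # API or inference spend
    tools: float           # search, code execution, other tool calls
    infrastructure: float  # hardware, power, hosting, monitoring
    retries: float         # spend on re-running failed attempts
    review: float          # human time spent checking outputs
    failures: float        # cost of errors that got through

    def total(self) -> float:
        return (
            self.model + self.tools + self.infrastructure
            + self.retries + self.review + self.failures
        )


def cost_per_accepted_task(costs: TaskCosts, accepted_outputs: int) -> float:
    """All costs for the batch divided by the outputs that were accepted."""
    if accepted_outputs <= 0:
        raise ValueError("accepted_outputs must be positive")
    return costs.total() / accepted_outputs


# Invented batch of 100 tasks for two hypothetical models.
cheap_but_reviewed = TaskCosts(
    model=2.0, tools=1.0, infrastructure=0.0,
    retries=1.0, review=40.0, failures=6.0,
)
pricey_but_clean = TaskCosts(
    model=30.0, tools=1.0, infrastructure=0.0,
    retries=0.0, review=5.0, failures=0.0,
)

print(cost_per_accepted_task(cheap_but_reviewed, accepted_outputs=80))  # 0.625
print(cost_per_accepted_task(pricey_but_clean, accepted_outputs=96))    # 0.375
```

In this invented batch, the model with the far smaller API bill is the more expensive one per accepted task once review time and failures are counted.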

A cheap model that requires constant review is not cheap. An expensive model that solves a high-risk task correctly on the first pass may be worth it. A local model with zero marginal API token cost is not free if it needs hardware, power, monitoring, and engineering. And paying for a frontier model is not rational if it spends all day rewriting routine internal emails.

Open-weight models are no longer only classification engines

For years, I wanted small models to be useful for more than classification. Most of the time, they were not. They could route, label, summarize shallowly, and extract fields. But as soon as the task required multi-step reasoning, calculation discipline, or domain context, the gap showed up quickly.

That is changing. Gemma, Qwen, Llama, and other open-weight families are increasingly credible building blocks for scoped business workflows. They are not universally reliable, and they still need evaluation. But they now create a middle layer between frontier APIs and local inference that is too weak to trust.

That middle layer matters because the enterprise model portfolio no longer has to be one premium model for everything. A company can route email drafting, classification, extraction, report drafting, calculation checks, and agent subroutines to different tiers based on benchmark performance and risk.

Benchmark the boring work

The biggest AI savings will not come from arguing about which model is most impressive. They will come from benchmarking the boring work your company does every day: emails, summaries, extractions, spreadsheet explanations, internal analyses, report drafts, customer request classifications, financial calculations, and agent subroutines.

Once you know the jobs, you can measure the models. Once you measure the models, you can route the work. Once you route the work, the economics change.
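The measure-then-route loop can be sketched in a few lines: for each job, keep only the models that clear that job's quality bar, then pick the one with the lowest cost per accepted task. Everything here is illustrative. The model labels, task names, accuracy figures, and costs are placeholders, and only the 94% bar for financial analysis comes from my own benchmark threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    model: str
    task: str
    accuracy: float                # share of outputs accepted on your benchmark
    cost_per_accepted_task: float  # fully loaded cost, not just token spend


def route(
    task: str, min_accuracy: float, results: list[BenchmarkResult]
) -> Optional[str]:
    """Return the cheapest model that clears the bar, or None if none does."""
    eligible = [
        r for r in results if r.task == task and r.accuracy >= min_accuracy
    ]
    if not eligible:
        return None  # no model qualifies: escalate to a human or raise the budget
    return min(eligible, key=lambda r: r.cost_per_accepted_task).model


# Placeholder benchmark results, not real measurements.
results = [
    BenchmarkResult("frontier", "financial_analysis", 0.98, 0.40),
    BenchmarkResult("small-proprietary", "financial_analysis", 0.96, 0.12),
    BenchmarkResult("local-open-weight", "financial_analysis", 0.95, 0.05),
    BenchmarkResult("frontier", "email_drafting", 0.99, 0.30),
    BenchmarkResult("local-open-weight", "email_drafting", 0.93, 0.02),
]

# Quality bars per job; the 0.90 bar for email is a hypothetical example.
bars = {"financial_analysis": 0.94, "email_drafting": 0.90}

for task, bar in bars.items():
    print(task, "->", route(task, bar, results))
```

Returning None instead of falling back to the most expensive model is a deliberate choice in this sketch: if nothing clears the bar, that is a signal to add human review or revisit the task, not to route silently.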

Small models are not catching up because they are suddenly better than frontier models. They are catching up because many tasks never needed frontier models in the first place.