Towards cheaper, better, faster language models

How language models are getting cheaper, better, and faster

We take for granted that technology gets cheaper, better, faster. But how exactly is this happening for language models?


Cheaper

Today's models are the most expensive language models will ever be. Costs are falling at every step of the development cycle: training, inference, and fine-tuning.

I think about training costs through the lens of scaling laws, which describe the relationship between compute, dataset size, and model size. Of these three variables, compute is the most expensive. Early in our understanding of language models, scaling laws suggested that the best way to increase performance was to throw more compute at the problem. It turns out, however, that for a given compute budget, increasing dataset size can lead to a more performant model. In other words, we can derive a compute-optimal training recipe. The days of simply throwing more compute at our models are over. For the vast majority of economically useful tasks, this points to a more affordable cost structure.
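The idea of a compute-optimal recipe can be made concrete with a back-of-the-envelope sketch. This assumes the common approximation that training cost is about 6 FLOPs per parameter per token, and the Chinchilla-style rule of thumb of roughly 20 training tokens per parameter; both constants are illustrative, not a training recipe.

```python
import math

def compute_optimal_allocation(compute_budget_flops):
    """Split a FLOP budget between parameters (N) and tokens (D).

    Assumes C ~= 6 * N * D and the rough heuristic D ~= 20 * N.
    Solving C = 6 * N * (20 * N) gives N = sqrt(C / 120).
    """
    tokens_per_param = 20  # illustrative rule of thumb
    n_params = math.sqrt(compute_budget_flops / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23 FLOP budget
params, tokens = compute_optimal_allocation(1e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e9:.0f}B tokens")
```

The takeaway is the shape of the tradeoff: under a fixed budget, a smaller model trained on more tokens can beat a larger model trained on fewer.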

Training costs are fixed: you incur them once in a model's lifecycle. Inference costs, on the other hand, are ongoing and scale with usage, so for language models to become more affordable, inference costs must also fall. I have been paying attention to two trends. The first is hardware-related: in August of this year, NVIDIA announced its new GH200 chips, optimized for inference. The GH200 has far more memory, allowing you to centralize inference operations on a single chip cost-efficiently. There has also been a ton of recent non-hardware work to optimize inference cost. The key variables affecting inference costs are the number of input tokens and output tokens, so you want to maximize the insight conveyed per token: the fewer tokens it takes for you or a language model to convey information, the better. I like this paper, which shows how prompt adaptation (finding the shortest prompt that works) and selectively routing specific queries to certain language model APIs can save costs.
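A minimal sketch of the routing idea: send cheap-looking queries to a cheap model and everything else to an expensive one. The model names, prices, and the length-based heuristic are all hypothetical stand-ins; a real router would train a classifier on query difficulty.

```python
# Hypothetical models and per-token prices, for illustration only.
PRICE_PER_1K_TOKENS = {"small-model": 0.0005, "large-model": 0.03}

def estimate_tokens(text):
    # Rough rule of thumb: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def route(prompt):
    """Route short prompts to the cheap model, the rest to the big one.

    Length is a crude stand-in for query difficulty.
    """
    model = "small-model" if estimate_tokens(prompt) < 50 else "large-model"
    cost = estimate_tokens(prompt) / 1000 * PRICE_PER_1K_TOKENS[model]
    return model, cost

model, cost = route("Translate 'hello' to French.")
print(model)  # a short prompt lands on the cheap model
```

Even this toy version captures the economics: if most traffic is simple, most traffic gets the cheap price.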

The final cost bucket for language model developers is fine-tuning. Language models become valuable, reusable assets when we can train them once and fine-tune them cheaply on downstream tasks. However, full fine-tuning is memory-intensive and computationally costly. To solve this, researchers freeze the vast majority of model parameters and fine-tune only a small number of extra parameters. This technique achieves results comparable to full fine-tuning while also preserving the model's original capabilities, avoiding catastrophic forgetting.
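One popular instance of this idea is low-rank adaptation (LoRA), which can be sketched in a few lines of NumPy. The frozen weight matrix stays untouched; only two small matrices are trained. Shapes and the rank here are illustrative.

```python
import numpy as np

d_in, d_out, rank = 512, 512, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))        # pretrained weight, frozen
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable, small
B = np.zeros((d_out, rank))                   # trainable, starts at zero

def adapted_forward(x):
    # Output = frozen path plus low-rank update B @ (A @ x)
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted model initially matches the base model,
# so the original capabilities are the starting point, not collateral damage.
assert np.allclose(adapted_forward(x), W @ x)

# Trainable parameters: 2 * rank * d instead of d * d
print(d_in * d_out, rank * d_in + d_out * rank)  # 262144 vs 8192
```

Training 8K parameters instead of 262K (and, at real scale, millions instead of billions) is what makes cheap per-task fine-tuning possible.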

Every step of language model development is getting cheaper. These savings will eventually flow down to consumers through lower prices.


Better

Multimodality and hallucination mitigation are the most consequential recent improvements to language models. We've given language models eyes and ears while reducing their tendency to spew nonsense.

Multimodal models outperform text-only models on a variety of tasks. This is intuitive: a model trained on text, images, audio, and video has richer signal to learn from than a model trained on a single modality. Multimodal models also allow for higher-bandwidth communication. Imagine prompting a model with any combination of text, images, audio, or video and getting a response back in any of those formats.

Language models have also gotten better at avoiding hallucinations. Retrieval augmented generation (RAG) is the most popular technique for getting language models to provide accurate and helpful information: it retrieves knowledge from an existing dataset and uses it to guide the generation of a model's response. A complementary, product-focused approach is to couple RAG with source citations in the generated response. Finally, as with most language model problems, a decent amount of prompt engineering gets you far. Chain-of-thought reasoning (where you ask a model to reason step-by-step) has emerged as a common pattern for reducing hallucination.
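The RAG loop above can be sketched end to end. This toy version retrieves by word overlap, a stand-in for the embedding search a real system would use; the documents and prompt template are made up for illustration.

```python
# Toy corpus; a real system would index embeddings, not raw strings.
DOCS = [
    "The GH200 is an NVIDIA chip optimized for inference.",
    "Retrieval augmented generation grounds answers in a dataset.",
    "Distillation trains a small model to mimic a large one.",
]

def retrieve(query, k=1):
    # Score each document by shared words with the query (crude but illustrative).
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(DOCS, key=overlap, reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    # Retrieved passages are prepended so the model answers from them
    # (and can cite them) rather than from parametric memory alone.
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above, citing it."
    )

print(build_prompt("what does retrieval augmented generation do"))
```

The citation requirement in the template is the product-focused piece: the model is asked to show its sources, which makes hallucinations easier to spot.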


Faster

The usefulness of any computing primitive is bounded by input and output (I/O): how fast you can read from it and write to it. For language models, inference latency matters a great deal, since many use cases are interactive.

I am particularly excited by the work focused on reducing the memory footprint of language models, the most significant contributor to latency. The better we get at model compression, the faster we can run inference and the more easily we can run large language models on consumer hardware, eliminating network latency entirely. This is an area of active research. Quantization compresses a model by using fewer bits to represent its parameters: for example, 16 or 4 bits per weight instead of 32. Distillation "steals" knowledge from a large model (the teacher) to train a small model (the student), typically by using the teacher's outputs as "soft labels" to guide the student. In traditional training, a model learns from hard labels (the ground truth); in distillation, the student learns to mimic the behavior of the teacher.
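Quantization is easy to demonstrate concretely. A minimal sketch of symmetric 8-bit quantization: weights are stored as int8 plus one float scale, cutting memory roughly 4x versus float32 at the cost of a small rounding error.

```python
import numpy as np

def quantize_int8(w):
    # Map the float range [-max|w|, max|w|] onto the int8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats from the int8 codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(w)

print(w.nbytes, q.nbytes)  # int8 storage is a quarter of float32: 4096 1024
# Rounding error is bounded by half a quantization step
print(float(np.abs(w - dequantize(q, scale)).max()) <= scale / 2 + 1e-6)
```

Real schemes (4-bit, per-channel scales, outlier handling) are more elaborate, but the core trade of precision for memory is exactly this.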

You can already use these techniques today and run quantized 70B parameter models on your laptop with tools like LM Studio.

What does this mean for startups?

All startup opportunities exist on a technology curve. As a given technology gets better, the magnitude of possibilities gets bigger. This was true for the internet, PC, and mobile revolutions. It will also be true of language models and AI more broadly.

Each bucket of improvement described above by itself is exciting. But when we combine all these, we start to dream up possibilities that compound on one another. What will you build with a low-latency, non-hallucinating, multimodal language model that costs almost nothing?
