“Should we fine-tune a model or build a RAG system?” is one of the most common questions we hear at the start of an AI engagement. The honest answer is: in 90% of cases, you should start with RAG — and most teams never need fine-tuning at all.
Here’s how to think about the choice without getting lost in the hype.
What each one actually does
Retrieval-Augmented Generation (RAG) teaches the model what to know. You store your documents, search them at query time, and feed the relevant chunks into the prompt alongside the user’s question. The model’s behaviour doesn’t change — only its inputs do.
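In code, the whole pattern fits in one function. Here’s a minimal sketch, assuming the OpenAI Python SDK and a hypothetical `search_index` helper standing in for whatever store you use (pgvector, Elasticsearch, a vector database):

```python
from openai import OpenAI

client = OpenAI()

def answer_with_rag(question: str) -> str:
    # Retrieval: search_index is a stand-in for your own search layer;
    # it returns the top-k chunks most relevant to the question.
    chunks = search_index(question, top_k=5)
    context = "\n\n".join(f"[{c.source}] {c.text}" for c in chunks)

    # Generation: the model's weights are untouched; only the prompt
    # changes from query to query.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The retrieval step is where the engineering effort lives; the model itself is interchangeable.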
Fine-tuning teaches the model how to behave. You train it on examples of the inputs you want it to handle and the outputs you want it to produce. The behaviour changes, but the model still doesn’t gain real-time knowledge.
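To make the contrast concrete, here is what fine-tuning data looks like in the chat-style JSONL format most providers accept. The two records below are invented; in practice you’d need hundreds or thousands:

```python
import json

# Each record pairs an input with the exact output the model should
# learn to produce. Both examples are invented for illustration.
examples = [
    {
        "messages": [
            {"role": "system", "content": "Extract the invoice as JSON."},
            {"role": "user", "content": "Invoice #1042 from Acme, due 2024-03-01, total $1,250.00"},
            {"role": "assistant", "content": '{"invoice_id": "1042", "vendor": "Acme", "due": "2024-03-01", "total_usd": 1250.0}'},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "Extract the invoice as JSON."},
            {"role": "user", "content": "Globex invoice 77, $300 flat, payable by June 5th 2024"},
            {"role": "assistant", "content": '{"invoice_id": "77", "vendor": "Globex", "due": "2024-06-05", "total_usd": 300.0}'},
        ]
    },
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Notice that nothing here adds knowledge: the model learns the mapping from messy text to a schema, not any new facts.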
That distinction is the whole game. Once you internalise it, the choice is usually obvious.
Use RAG when:
- Your knowledge changes. Documentation gets updated, prices change, contracts get signed. RAG handles this naturally — you re-index.
- You need citations. RAG can show the user which document an answer came from; a fine-tuned model has no mechanism for this, because its knowledge is baked into the weights (see the sketch after this list).
- You’re worried about hallucinations. Grounding the answer in retrieved context dramatically reduces fabricated information.
- You don’t have thousands of training examples. Fine-tuning needs data; RAG just needs documents.
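The citation point is worth making concrete. Because retrieval happens in your own code, the sources are known before the model even runs, so you can return them alongside the answer. A sketch, reusing the hypothetical `search_index` from earlier plus a hypothetical `generate` helper (the generation half of the previous sketch, factored to accept pre-retrieved chunks):

```python
def answer_with_sources(question: str) -> dict:
    # Retrieve once, use the chunks for both the prompt and the citations.
    chunks = search_index(question, top_k=5)
    answer = generate(question, chunks)  # hypothetical: builds the prompt and calls the model
    # The source list comes from retrieval, not from the model, so every
    # citation points at a real document the user can open.
    return {
        "answer": answer,
        "sources": [{"title": c.source, "snippet": c.text[:200]} for c in chunks],
    }
```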
Use fine-tuning when:
- You need a very specific output format the base model struggles to produce reliably (a domain-specific JSON schema, for example).
- You need a particular tone or style that prompting alone cannot achieve consistently.
- You’re trying to compress a long, repetitive prompt into a smaller model to cut latency or cost, and you have evaluation data to prove the smaller model still works (launching such a job is sketched after this list).
- You need to teach the model a skill (a new reasoning pattern, a translation between formats), not knowledge.
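If you do land in fine-tuning territory, the job itself is the easy part; the training data and the evaluations are the work. A sketch assuming the OpenAI fine-tuning API and the `train.jsonl` file from earlier (other providers have equivalents):

```python
from openai import OpenAI

client = OpenAI()

# Upload the training file, then start the job. Model names and
# fine-tuning availability change; check your provider's current docs.
upload = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=upload.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id)  # poll the job; when it finishes you get a new model id
```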
Use both when:
You have a high-volume, latency-sensitive use case where a fine-tuned smaller model retrieves relevant context via RAG and answers. This is what production AI systems usually look like at scale — but it is the third step, not the first.
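A sketch of that combined shape, reusing the hypothetical pieces from above. The point to notice is how short the prompt gets once format and tone live in the weights:

```python
def answer_fast(question: str) -> str:
    chunks = search_index(question, top_k=3)
    context = "\n\n".join(c.text for c in chunks)
    # The fine-tuned small model already knows the output format and
    # tone, so the prompt carries only the context and the question.
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme::abc123",  # hypothetical fine-tuned model id
        messages=[{"role": "user", "content": f"{context}\n\nQ: {question}"}],
    )
    return response.choices[0].message.content
```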
The order we recommend
1. Baseline: a frontier model (Claude, GPT, Gemini) with a well-engineered prompt. Measure how good “good enough” actually is.
2. Add RAG: ground the answers in your data. This is where most quality gains come from.
3. Add evaluations: a test set of representative queries with expected behaviours, run automatically on every prompt change (a minimal harness is sketched after this list).
4. Optimise: caching, smaller models, hybrid search, reranking. Most teams find more wins here than in fine-tuning.
5. Fine-tune (maybe): only if you’ve exhausted the previous steps and the cost or latency math still doesn’t work.
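Step 3 deserves a sketch of its own, because “add evaluations” sounds abstract until you see how little code a first version needs. The queries and checks below are invented placeholders; the structure is the point:

```python
# A first evaluation harness: representative queries with expected
# behaviours, run on every prompt change. Substring checks are crude
# but catch regressions surprisingly well; swap in stricter checks
# (JSON validation, LLM-as-judge) as the system matures.
TEST_CASES = [
    {"query": "What is your refund window?", "must_contain": "30 days"},
    {"query": "Do you support single sign-on?", "must_contain": "SAML"},
]

def run_evals(answer_fn) -> float:
    passed = 0
    for case in TEST_CASES:
        answer = answer_fn(case["query"])
        if case["must_contain"].lower() in answer.lower():
            passed += 1
        else:
            print(f"FAIL: {case['query']!r}")
    score = passed / len(TEST_CASES)
    print(f"{passed}/{len(TEST_CASES)} passed ({score:.0%})")
    return score

# run_evals(answer_with_rag)  # wire into CI so every prompt change is scored
```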
The most common mistake
The mistake we see most often is teams jumping straight to fine-tuning because it sounds more sophisticated. It almost always costs more, takes longer, and produces worse results than a well-tuned RAG pipeline on a frontier model, at least until you’ve validated the use case end-to-end.
Start simple. Measure honestly. Add complexity only when the data demands it.
If you’re trying to decide which approach fits your use case, we’re happy to take a look — even a 30-minute conversation usually clarifies the right path.