Most RAG discussions go straight to retrieval.

Vector database. Embedding model. Chunk size. Top k. Reranking. Hybrid search.

All of that is important.

Bad retrieval will break the system.

But once RAG leaves the demo, retrieval gets too much attention.

A lot of production RAG fails before retrieval. Some fails after retrieval. And some fails because nobody can tell what actually broke.

The system found a chunk but the chunk was damaged during ingestion.

The system had the latest policy in the source system but the index still had the old policy.

The system retrieved the right context but the model used it in the wrong way.

The system looked good in the demo but regressed on real questions nobody had tested.

That is why production RAG is mostly not about retrieval.

Retrieval is one part. The real work is around it.

The hard parts are ingestion, data lifecycle, generation control and evaluation.

What ingestion destroys, retrieval cannot recover

Ingestion decides what meaning survives into the model’s context. Retrieval can only search what ingestion preserved.

This is where many RAG systems start breaking.

Not at the vector database. Not at top k. Not at the prompt.

Much earlier. At ingestion.

Most teams treat ingestion like a setup task.

Parse the docs. Split them into chunks. Create embeddings. Store them somewhere.

Done.

But ingestion is not just a setup task.

It is where the original document becomes the version the model actually sees.

If the document had meaning in tables, headings, layout, hierarchy, code blocks or surrounding context, ingestion either keeps that meaning or destroys it.

A simple example.

User asks:

Does the Enterprise plan support audit log export?

The source document has a feature comparison table:

Feature            Starter    Pro        Enterprise
Audit log export   No         No         Yes
SSO                No         Yes        Yes
Data retention     30 days    90 days    Custom

The answer is obvious.

Enterprise supports audit log export.

Retrieval may even find the right page. So from outside, the system looks fine.

But during ingestion, the table gets flattened into this:

Audit log export No No Yes SSO No Yes Yes Data retention 30 days 90 days Custom

Now the chunk still looks relevant.

It has “audit log export.” It has “Enterprise.” It has “Yes.”

But the relationship is damaged.

Which “Yes” belongs to which plan? Which value belongs to which feature? Where did the column headers go?

The model is now trying to reconstruct structure that the ingestion pipeline already threw away.
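
For contrast, here is roughly what a structure-preserving ingestion step could have produced instead. This is a minimal sketch in plain Python, assuming your parser already hands you the headers and rows; the point is only that each row becomes a chunk that can stand on its own.

```python
# Minimal sketch: serialize each table row with its column headers attached,
# so a single chunk still knows which "Yes" belongs to which plan.
# Assumes the parser already extracted headers and rows; parsing is out of scope.

def serialize_table(headers: list[str], rows: list[list[str]]) -> list[str]:
    label_col, value_cols = headers[0], headers[1:]
    chunks = []
    for row in rows:
        label, values = row[0], row[1:]
        pairs = ", ".join(f"{col}: {val}" for col, val in zip(value_cols, values))
        chunks.append(f"{label_col} '{label}' - {pairs}")
    return chunks

headers = ["Feature", "Starter", "Pro", "Enterprise"]
rows = [
    ["Audit log export", "No", "No", "Yes"],
    ["SSO", "No", "Yes", "Yes"],
    ["Data retention", "30 days", "90 days", "Custom"],
]

for chunk in serialize_table(headers, rows):
    print(chunk)
# Feature 'Audit log export' - Starter: No, Pro: No, Enterprise: Yes
# Feature 'SSO' - Starter: No, Pro: Yes, Enterprise: Yes
# Feature 'Data retention' - Starter: 30 days, Pro: 90 days, Enterprise: Custom
```

Each chunk now answers the Enterprise question on its own. The fix happened before retrieval ever ran.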

And this is the trap.

People will debug retrieval.

Maybe the embedding model is weak. Maybe top k should be higher. Maybe reranking is needed. Maybe the prompt should say “answer only from context” more strongly.

But the real bug happened before all of that.

The system found the right document.

It just found a broken version of the document.

A better embedding model cannot recover missing column headers. A reranker cannot restore broken table structure. A stricter prompt cannot bring back relationships that were removed from the chunk.

The information may exist in the original doc but it no longer exists in the form the model receives.

This is not a small edge case. If even 20% of your important answers depend on structure, naive ingestion can silently cap answer quality before retrieval even runs.

Tables are only one version of the problem. The same thing happens with API docs, where a parameter gets separated from the description that explains it. With policy docs, where the exception ends up in a different chunk from the rule. With code examples, where the explanation gets split away from the code it was explaining.

The source document had meaning. The ingested chunk lost it.

The model is still expected to answer as if nothing was lost.

That is not a retrieval problem. That is an ingestion problem.

So before tuning embeddings, reranking or prompts, read the actual text sent to the model.

If a human cannot answer the question from that chunk alone, the model is starting with damaged context.
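
That spot check can be very small: for a handful of real questions, print the exact chunk text your pipeline would put in the prompt and ask whether a person could answer from it. A minimal sketch, where retrieve is a stand-in for whatever your own retrieval step actually is:

```python
# Minimal sketch of a context audit: print exactly what the model will see.
# `retrieve` is a placeholder for whatever your pipeline actually calls.

def retrieve(question: str) -> list[str]:
    # Stand-in. In a real system this hits your vector store / reranker.
    return ["Audit log export No No Yes SSO No Yes Yes Data retention 30 days 90 days Custom"]

questions = [
    "Does the Enterprise plan support audit log export?",
    "What is our refund policy for annual plans?",
]

for question in questions:
    print(f"\n=== {question}")
    for i, chunk in enumerate(retrieve(question), start=1):
        print(f"--- chunk {i} ---")
        print(chunk)
    # The test: could a person answer the question from these chunks alone?
```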

Retrieval can only find what ingestion preserved.

The index has to follow the data, not the other way around

A RAG index is not a dump of documents.

It is a moving copy of your business state.

And that copy can drift.

A source document changes, but the old chunk is still in the index.

A pricing page gets updated, but the stale version still ranks higher.

A user loses access to a document, but the indexed chunk is still retrievable.

A policy gets deleted, but the assistant can still answer from it.

This is where production RAG becomes messy.

In demos, the corpus is static. You upload a few PDFs, index them once and ask questions.

In production, documents change every day.

New versions come in. Old versions should disappear. Permissions change. Ownership changes. Some systems sync every few minutes. Some sync once a day. Some fail silently.

Now imagine a user asks:

What is our refund policy for annual plans?

The company changed the policy last week from 30 days to 14 days.

The source system has the new policy but the RAG index still has both chunks.

One says 30 days. One says 14 days.

Both look relevant, both sound official and both came from company docs.

Which one should the system trust?

This is not a retrieval problem in the usual sense. Retrieval may find both.

The harder problem is knowing which source is current, allowed and authoritative.

That means the index needs lifecycle rules.

When a document changes, the old version should not keep answering.

When a document is deleted, its chunks should not survive forever.

When permissions change, access should be enforced at query time. Not only during ingestion.

When two sources disagree, the system needs a source of truth rule.
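
A minimal sketch of what those rules can look like as data, assuming each chunk carries a document id, a version and the groups allowed to see it. The field names and the in-memory index are illustrative, not any particular vector database's API.

```python
# Minimal sketch: lifecycle rules as chunk metadata plus a few operations.
# The in-memory list stands in for whatever index you actually use;
# field names are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    version: int
    text: str
    allowed_groups: set[str]

index: list[Chunk] = []

def upsert_document(doc_id: str, chunks: list[Chunk]) -> None:
    """Re-ingest: old versions of a document must not keep answering."""
    global index
    index = [c for c in index if c.doc_id != doc_id] + chunks

def delete_document(doc_id: str) -> None:
    """Deletion in the source has to propagate to the index."""
    global index
    index = [c for c in index if c.doc_id != doc_id]

def visible_chunks(user_groups: set[str]) -> list[Chunk]:
    """Permissions are enforced at query time, not only at ingestion."""
    return [c for c in index if c.allowed_groups & user_groups]

def prefer_current(candidates: list[Chunk]) -> Chunk:
    """Source-of-truth rule: when versions of the same doc collide, newest wins."""
    return max(candidates, key=lambda c: c.version)

upsert_document("refund-policy", [
    Chunk("refund-policy", 2, "Refunds on annual plans: 14 days.", {"support", "sales"}),
])
hits = visible_chunks({"support"})
print(prefer_current(hits).text)  # -> Refunds on annual plans: 14 days.
```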

The hard part is that your index behaves like a cache.

Not in the technical sense. In the architectural sense.

It is a derived copy of source systems that keep changing. Docs change, policies change, permissions change, pricing changes. Old versions disappear from the source but still survive in the index.

Cache invalidation was already hard when the cache was only serving values.

RAG makes it harder because now the cache is answering questions on behalf of the company.

When that lifecycle is not handled, the assistant becomes dangerous in a very boring way.

Not because the model made something up, but because the index still had the old answer.

This is why freshness, deletion, versioning and permissions are not backend details.

They are answer quality features.

If the index does not follow the data lifecycle, the model is not answering from your business reality.

It is answering from an old snapshot of it.

Generation needs product control, not just prompting

Even when retrieval works, the answer can still be wrong.

This part gets underestimated.

Teams assume that once the right chunks are in context, the model will naturally do the right thing. Not always.

The model may answer beyond the evidence. It may merge two unrelated chunks. It may ignore the fresh source and use the older one. It may give a direct answer when it should ask a follow up. It may sound confident when the context is weak.

That is why generation cannot be treated as “just write a better prompt.”

The model is not just writing text. It is deciding what the product does next.

A simple example.

User asks:

Can I delete this customer’s data permanently?

The system retrieves two chunks.

One chunk says admins can delete customer data.

Another chunk says permanent deletion requires compliance approval for enterprise accounts.

Both are relevant.

A bad answer says:

Yes, admins can delete customer data.

A better answer says:

Not directly. For enterprise accounts, permanent deletion requires compliance approval.

The difference is not retrieval.

The difference is control.

The system needs rules for when to answer, when to qualify, when to refuse, when to ask for more information and when to escalate.

Especially in workflows where the cost of a wrong answer is high.

In support, a wrong answer becomes a refund. In legal, a wrong answer becomes a liability. In internal ops, a wrong answer quietly spreads because nobody fact checks the assistant.

In low risk cases, a slightly incomplete answer may be acceptable.

In high risk cases, the system should not “try its best.”

It should stay inside evidence, cite the source and say when it does not know.

That is product behavior.

Not prompt decoration.
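
A minimal sketch of what that control can look like, assuming the model is asked to return a structured verdict that product code then enforces. The schema, the risk tiers and the fallback messages are illustrative assumptions, not a standard.

```python
# Minimal sketch: generation returns a structured verdict, and product code
# decides what actually ships. Schema, risk tiers and messages are assumptions
# for illustration.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class Verdict:
    action: Literal["answer", "clarify", "refuse", "escalate"]
    text: str
    citations: list[str] = field(default_factory=list)

HIGH_RISK_TOPICS = {"data deletion", "refunds", "legal"}

def enforce(verdict: Verdict, topic: str) -> str:
    if verdict.action == "refuse":
        return "I don't have enough information in our documentation to answer that."
    if verdict.action == "escalate":
        return "This needs a human review. I've flagged it."
    # High-risk answers must carry cited evidence, or they do not ship.
    if topic in HIGH_RISK_TOPICS and verdict.action == "answer" and not verdict.citations:
        return "I can't confirm this from our documentation, so I'm escalating it."
    return verdict.text

verdict = Verdict(
    action="answer",
    text="Not directly. For enterprise accounts, permanent deletion requires compliance approval.",
    citations=["data-retention-policy#enterprise"],
)
print(enforce(verdict, topic="data deletion"))
```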

So the question is not only:

Did we retrieve the right context?

The better question is:

What is the model allowed to do with that context?

Because production RAG is not just about finding information.

It is about controlling the answer path after information is found.

Evaluation is where teams are blind

Most RAG teams do not know if the system got better.

They only know if the last demo looked good.

That is a problem.

Because RAG has too many moving parts.

You change the chunk size. You change the embedding model. You add reranking. You rewrite the prompt. You increase top k. You add metadata filters. You change the sync pipeline.

The answer improves for one question. Everyone feels good.

But ten other questions quietly got worse.

This is why manual testing is dangerous.

You ask 20 sample questions. The answers look fine. The citations look believable. The model sounds confident. So the change ships.

But production does not fail on your favorite 20 questions. It fails on the ugly ones.

Ambiguous questions. Stale documents. Missing context. Conflicting sources. Permission sensitive answers. Questions where the right answer is “not enough information.”

This is why evals matter.

Not as a nice to have.

Because otherwise you do not know what you are breaking.

A RAG eval should not only ask:

Is the final answer good?

That is too late.

You need to test the whole path.

Did retrieval find the right source?

Did the final context include the evidence?

Did the model use that evidence correctly?

Did it refuse when the evidence was missing?

Each of these can fail separately. Each one sends you debugging in a different direction. And yes, this is how RAG finds you at 3 AM.

Retrieval can work but context construction can drop the useful chunk.

Context can be correct but the model can ignore it.

The answer can be right but the citation can point to the wrong source.

The answer can sound helpful but it should have refused.

That is why “the answer looks good” is not an evaluation strategy.

You need a small but serious golden set.

Real user questions. Known good sources. Expected answer behavior. Hard negatives. Permission sensitive cases. Old policy versus new policy cases. Questions where the system should ask a follow up.
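
A minimal sketch of one golden case and a staged check over it, assuming your pipeline lets you inspect retrieval, context and the final answer separately. The field names and stage hooks are illustrative.

```python
# Minimal sketch: one golden case, checked stage by stage, so a failure points
# at retrieval, context construction, grounding or behavior instead of a vague
# "the answer looked wrong". Field names and hooks are assumptions.

case = {
    "question": "What is our refund policy for annual plans?",
    "expected_source": "refund-policy-v2",
    "must_contain": "14 days",
    "must_not_contain": "30 days",
    "expected_behavior": "answer",  # or "refuse", or "clarify"
}

def evaluate(case, retrieved_ids, context_text, answer_text, behavior):
    return {
        "retrieval_hit": case["expected_source"] in retrieved_ids,
        "evidence_in_context": case["must_contain"] in context_text,
        "answer_grounded": (case["must_contain"] in answer_text
                            and case["must_not_contain"] not in answer_text),
        "behavior_correct": behavior == case["expected_behavior"],
    }

# Each False sends you to a different part of the system.
print(evaluate(
    case,
    retrieved_ids=["refund-policy-v2", "refund-policy-v1"],
    context_text="Refunds on annual plans are available within 14 days of renewal.",
    answer_text="Annual plans can be refunded within 14 days.",
    behavior="answer",
))
```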

And you need to run it every time you change the system.

Not only when you change the model.

A prompt change can break grounding. A chunking change can break tables. A reranker change can bury the right source. A sync change can bring stale data back. A metadata filter can remove the only useful evidence.

Without evals, every improvement is mostly a guess.

This is the part that separates a demo from a system.

A demo needs a few good answers.

A production system needs to know when it is wrong, where it is wrong and whether the latest change made it worse.

Retrieval is only one part of that.

Evaluation is what keeps ingestion honest. Evaluation is what catches lifecycle drift. Evaluation is what proves generation is staying inside the evidence.

Without evals, you are not improving a RAG system.

You are editing it and hoping.

The real work is around retrieval

Retrieval matters.

Bad retrieval will break the system.

But production RAG usually does not fail only because the vector database picked the wrong chunk.

It fails because the chunk was bad before retrieval touched it.

It fails because the index drifted away from the source data.

It fails because the model used the right context in the wrong way.

It fails because nobody had evals strong enough to catch the regression.

That is why “improve retrieval” is often too narrow.

The real system is bigger than that.

You have to preserve meaning during ingestion. You have to keep the index aligned with the business. You have to control what the model is allowed to do. You have to evaluate every step, not just the final answer.

RAG in production is not just a retrieval problem.

It is a data problem.

It is a product behavior problem.

It is an evaluation problem.

Retrieval may find the chunk.

Production decides whether that chunk is current, allowed, structured, sufficient and safe to use.

That is where the real work starts.