Shubham Kumar Nayak

Production RAG on Azure: A Practical Engineering Playbook

28 Aug 2025

RAG, Azure, AI, Architecture, Evaluation

A practical playbook for building RAG systems on Azure with better retrieval quality, evaluation, observability, cost control, and human-review workflows.

RAG demos are easy. Production RAG is a systems problem.

The chat UI is rarely the difficult part. The hard parts are messy documents, weak retrieval, unclear evaluation, latency, cost, permissions, and the question nobody asks early enough: how will we know the answer is trustworthy?

This is the practical playbook I use when thinking about Retrieval-Augmented Generation systems on Azure.


Start with the workflow, not the model

Before choosing models or vector databases, I want to understand the real workflow:

  • Who asks the question?
  • What source documents are allowed?
  • What makes an answer correct?
  • Should the answer cite sources?
  • When should the system refuse to answer?
  • Does a human need to review the output?
  • What happens when the document source changes?

RAG is useful only when it fits the business process around it.

For internal tools, that may mean a review screen. For customer-facing flows, that may mean stricter guardrails and source citations. For operational workflows, it may mean connecting the answer to downstream actions through APIs or MCP-style tools.


Retrieval quality comes first

If retrieval is weak, prompting will not save the system.

I usually start with a small evaluation set:

  • real user questions
  • expected source documents
  • expected answer shape
  • unacceptable answer examples
  • edge cases where the system should say "I do not know"

Then I test retrieval before generation.

The goal is to know whether the right context is reaching the model. If the model never sees the right source, the final answer will be polished but unreliable.
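One way to test retrieval on its own is to score it against the expected source documents from the evaluation set. The sketch below is a minimal illustration, not a full harness: `EvalCase`, `recall_at_k`, and `evaluate_retrieval` are hypothetical names, and `retrieve` stands in for whatever function returns ranked document IDs in your system.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One row of the evaluation set (hypothetical shape)."""
    question: str
    # Leave empty for "I do not know" cases where no document should match.
    expected_doc_ids: set = field(default_factory=set)


def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected documents that appear in the top-k retrieved IDs."""
    if not expected_ids:
        raise ValueError("recall is undefined when no documents are expected")
    top_k = set(retrieved_ids[:k])
    return len(top_k & expected_ids) / len(expected_ids)


def evaluate_retrieval(cases, retrieve, k=5):
    """Average recall@k over the cases that have expected documents.

    `retrieve` is an assumed callable: question -> ranked list of document IDs.
    """
    scores = [
        recall_at_k(retrieve(case.question), case.expected_doc_ids, k)
        for case in cases
        if case.expected_doc_ids
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Refusal cases (empty `expected_doc_ids`) are skipped here on purpose; they need a separate check on the generated answer rather than a retrieval score.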


Chunking should follow the document shape

Chunking is not just splitting text every few hundred tokens.

Good chunks preserve meaning by carrying their context with them:

  • section headings
  • document titles
  • page numbers
  • source type
  • owner or department
  • effective dates
  • permissions metadata

For policy documents, contracts, reports, and operational manuals, metadata is often as important as the text itself.
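As a rough illustration of chunking that follows document shape, the sketch below splits on headings and attaches metadata to every chunk. The `Chunk` fields, the `chunk_by_heading` function, and the `is_heading` callback are all assumptions for the example; real documents usually need a format-specific parser to find headings.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A chunk plus the metadata needed to filter and cite it (hypothetical shape)."""
    text: str
    document_title: str
    section_heading: str
    source_type: str
    allowed_groups: list


def chunk_by_heading(lines, document_title, source_type, allowed_groups, is_heading):
    """Start a new chunk at every heading so each chunk stays within one section."""
    chunks = []
    heading = ""
    buffer = []

    def flush():
        # Emit the buffered section body, skipping empty sections.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(
                Chunk(body, document_title, heading, source_type, list(allowed_groups))
            )
        buffer.clear()

    for line in lines:
        if is_heading(line):
            flush()
            heading = line.strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

Long sections would still need a secondary split by size; the point is that the size-based split happens inside a section, never across two.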

On Azure AI Search or a similar search layer, I prefer hybrid search for many business systems: keyword search for exact terms, vector search for semantic match, and filters for permission or document type.
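To make the hybrid idea concrete, here is a toy in-memory sketch that filters two ranked lists by permission and merges them with reciprocal rank fusion (RRF), the kind of rank merging Azure AI Search documents for hybrid queries. It does not call the Azure SDK: `hybrid_search`, `doc_groups`, and the constant `k=60` are assumptions for illustration, and a real service applies the filter and fusion server-side.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def hybrid_search(keyword_ranking, vector_ranking, doc_groups, user_groups, top=3):
    """Apply a permission filter to both rankings, then fuse them.

    `doc_groups` maps document ID -> groups allowed to read it (hypothetical).
    """
    user_groups = set(user_groups)

    def allowed(doc_id):
        return bool(set(doc_groups.get(doc_id, [])) & user_groups)

    keyword = [d for d in keyword_ranking if allowed(d)]
    vector = [d for d in vector_ranking if allowed(d)]
    return reciprocal_rank_fusion([keyword, vector])[:top]
```

Filtering before fusion matters: a document the user cannot read should never influence the ranking, let alone reach the prompt.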


Make observability part of the design

A RAG system should log enough to be debugged without exposing sensitive data unnecessarily.

The important traces are:

  • user query
  • retrieved document IDs
  • retrieval scores
  • selected chunks
  • model used
  • latency by stage
  • token usage
  • answer acceptance or rejection
  • feedback from users or reviewers

This is where Azure Application Insights, Log Analytics, and structured application logs become valuable.

Without observability, every quality issue becomes a guessing game.
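A minimal sketch of such a trace, using only the Python standard library, might look like the following. `RagTrace` and its field names are hypothetical; the JSON line it emits could be shipped to whichever log sink you use. Logging a query ID instead of the raw query text is an assumption made here to limit sensitive data in logs.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("rag.trace")


class RagTrace:
    """Collects one structured record per request (hypothetical helper)."""

    def __init__(self, query_id, model):
        self.record = {"query_id": query_id, "model": model, "latency_ms": {}}

    @contextmanager
    def stage(self, name):
        """Time a pipeline stage and store its latency in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["latency_ms"][name] = round(elapsed_ms, 2)

    def set(self, **fields):
        """Attach fields such as retrieved IDs, scores, or token usage."""
        self.record.update(fields)

    def emit(self):
        """Write the record as a single JSON log line and return it."""
        line = json.dumps(self.record, sort_keys=True)
        logger.info(line)
        return line
```

Usage is one `with trace.stage("retrieval"):` block per stage, a `trace.set(...)` call for document IDs, scores, and token counts, and a single `trace.emit()` at the end of the request.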


Cost control is architecture

AI cost is not just a billing concern. It changes the architecture.

I usually think in tiers:

  • cheap embeddings unless quality proves otherwise
  • cached answers for repeated low-risk queries
  • smaller models for extraction and classification
  • stronger models only for high-value reasoning
  • background jobs for slow or expensive processing
  • rate limits and budget alerts

RAG systems should be useful and financially predictable.

If a workflow cannot afford the cost per answer, it is not production-ready yet.
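The cost-per-answer check can be written down as simple arithmetic. In the sketch below, the tier names and per-token prices are made-up placeholders, not real Azure pricing, and `choose_tier` is a hypothetical routing rule that mirrors the tiers listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    input_price_per_1k: float   # placeholder price per 1,000 input tokens
    output_price_per_1k: float  # placeholder price per 1,000 output tokens


# Hypothetical tiers; substitute the prices from your own rate card.
SMALL = ModelTier("small-model", 0.10, 0.40)
STRONG = ModelTier("strong-model", 2.00, 8.00)


def cost_per_answer(tier, input_tokens, output_tokens):
    """Token cost of one answer for the given tier."""
    return (
        (input_tokens / 1000) * tier.input_price_per_1k
        + (output_tokens / 1000) * tier.output_price_per_1k
    )


def choose_tier(task):
    """Route extraction and classification to the small model, the rest to the strong one."""
    return SMALL if task in {"extraction", "classification"} else STRONG


def within_budget(tier, input_tokens, output_tokens, budget_per_answer):
    """True if one answer on this tier fits the per-answer budget."""
    return cost_per_answer(tier, input_tokens, output_tokens) <= budget_per_answer
```

With these placeholder prices, a request with 4,000 input tokens and 500 output tokens costs about 0.6 units on the small tier and 12 units on the strong tier, which is the kind of gap that makes routing an architectural decision.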


Human review is not a weakness

Many useful AI systems should not auto-complete the final action.

For real workflows, I like patterns such as:

  • draft generated answer
  • show cited sources
  • highlight missing evidence
  • allow user edits
  • store final approved output
  • feed approved corrections back into evaluation data

This turns AI into a useful assistant instead of an uncontrolled decision-maker.

It also makes the system easier to trust inside companies.
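The review pattern above can be sketched as a small data structure. `ReviewItem` and `corrections_for_eval` are hypothetical names, and treating "no cited sources" as missing evidence is a simplifying assumption; a real system would check evidence per claim.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReviewItem:
    """A drafted answer waiting for human approval (hypothetical shape)."""
    question: str
    draft_answer: str
    cited_source_ids: list
    final_answer: Optional[str] = None

    @property
    def missing_evidence(self):
        # Simplification: flag drafts that cite no sources at all.
        return not self.cited_source_ids

    def approve(self, edited_answer=None):
        """Store the reviewer's edit, or the draft unchanged if none was made."""
        self.final_answer = (
            edited_answer if edited_answer is not None else self.draft_answer
        )

    @property
    def was_corrected(self):
        return self.final_answer is not None and self.final_answer != self.draft_answer


def corrections_for_eval(items):
    """Turn reviewer corrections into new evaluation cases."""
    return [
        {
            "question": item.question,
            "expected_answer": item.final_answer,
            "expected_doc_ids": item.cited_source_ids,
        }
        for item in items
        if item.was_corrected
    ]
```

Only corrected items flow back into the evaluation data here; unchanged approvals could be kept too, as positive examples.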


Engineering takeaway

Production RAG is not "add embeddings and a chatbot."

It is retrieval discipline, evaluation, observability, cost control, and workflow design.

That is why I connect RAG work to broader solution architecture. The value is not only in generating text. The value is in building a system that can answer from real sources, explain itself, stay within budget, and improve over time.
