November 20, 2025

Preventing Context Loss in RAG Pipelines with Azure AI Search: A Semantic Chunking and Retrieval Strategy


RAG, short for Retrieval-Augmented Generation, is the secret sauce behind many AI systems that actually know what they’re talking about.
Instead of relying only on a language model’s memory, RAG lets the model search for relevant facts and use them as context when generating responses. It’s like giving your AI assistant a reading assignment before it answers your question.

Sounds great, right?

Well… almost.

Because there’s one sneaky issue that ruins the magic: context loss.
You ask a question like “Explain how AI evolved in the 1940s and 50s,” and the model gives you:

  • Just half the answer.
  • Or skips the definition of an important term.
  • Or mixes up two unrelated paragraphs.

This happens when the chunks of information fed into the model are either:

  • Too small to be meaningful, or
  • Too isolated to carry the full picture.

Today, we’re going to fix that.
We’ll build a smarter RAG pipeline using Azure AI Search, and along the way you’ll learn how to:

  • Chop up documents semantically (not just every 500 tokens)
  • Retrieve passages using both keywords and vector similarity
  • Stitch back the right context (even when your query didn’t know it needed it)

By the end, you’ll have a clean, modular setup that’s ready to power any LLM app that needs rich, relevant context without losing the thread.
Let’s start with what actually goes wrong and why it happens more often than you think.

The Problem: Context Loss in RAG Pipelines

On paper, a RAG setup sounds simple:
Break your documents into parts → Search through them → Provide the context to your model → Get a fact-based answer.

But in practice, there's a common issue that quietly sneaks in:
You lose the context right when it matters most.
Let’s say you’re indexing a long research doc. Somewhere in there, a paragraph says:


“This mechanism is a variation of Hebbian theory, which we introduced in the previous section.”

And now a user asks:
“What is Hebbian theory?”

Guess what?
Your retriever grabs the chunk containing that line, but not the previous section that actually explains what Hebbian theory is.

Here’s why this happens so often:

Fixed-size chunking:

Most pipelines split documents every N tokens (say, 500–800). That’s easy for machines, but brutal for meaning:

  • Sentences get cut mid-way.
  • Tables get sliced in half.
  • References point to nowhere.

Shallow retrieval:

RAG systems often rely on:

  • Keyword matches (BM25)
  • Or a single vector field (semantic similarity)

Both are good, but not enough on their own:

  • Keywords might miss reworded passages.
  • Vectors might pull something conceptually close… but not specific enough. 

Context isolation:

Even when you retrieve the right chunk, it might need its neighbors:

  • The chunk before might define a term.
  • The chunk after might finish the logic.
  • And they’re often left out entirely.

Most RAG pipelines are good at fetching passages,
but not great at reconstructing context.

Now let’s fix that without rewriting your whole stack. 

The Solution Strategy: Keep Your Context, Serve Better Answers

To solve the context-loss problem, we use a combination of semantic chunking, hybrid search, and smart indexing, all powered by Azure OpenAI and Azure AI Search.

Here’s the game plan broken down:

Step 1: Semantic Chunking (Not Just Slicing Text)

We split your documents by meaning, not just fixed size. That means paragraphs that “belong together” stay together, preserving the flow of thought.
This preserves semantic integrity, so the model sees the whole story.

Step 2: Index with Azure AI Search

Once we’ve chunked the content, we store it in a searchable index. Each chunk gets its own embedding and metadata (source URI, headings, position in doc, etc.).

Why this matters:

  • You get fast semantic search with vector support
  • Plus, keyword fallback when needed (hybrid search FTW!)

Step 3: Hybrid Retrieval = Vector + Keyword 

When the user asks a question, we combine:

  • Vector similarity: Find semantically close matches
  • BM25 keyword matching: Catch exact terms (e.g., "Turing Test")
  • Neighbor expansion: Fetch the previous and next chunks for continuity

Together, this improves precision and recall: the model sees more relevant chunks, grounded in the user's intent.

Step 4: Feed to the Model as Context

We pass the top-k matching chunks to Azure OpenAI as context in your prompt.

This gives your model:

  • Enough signal to answer clearly
  • No noise from unrelated data
  • A better shot at staying grounded

Let's jump into the implementation.

Prerequisites & Setup: 

Before we dive into code, let’s make sure we’ve got all the tools and ingredients ready. Think of this as your RAG recipe checklist.

Python Packages to Install


pip install azure-search-documents openai langchain-openai langchain-experimental python-docx tiktoken tenacity python-dotenv


Environment Variables:

Create a .env file with your credentials (never hardcode in scripts!):
AZURE_OPENAI_API_KEY=""
AZURE_OPENAI_ENDPOINT=""
AZURE_OPENAI_API_VERSION=""
AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"

AZURE_SEARCH_ENDPOINT=""
AZURE_SEARCH_API_KEY=""
AZURE_SEARCH_INDEX_NAME="my-index-name"
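These values need to be loaded into the process environment before any of the snippets below run. A one-liner with python-dotenv (already in the install list) takes care of that; call it once at the top of your script:

from dotenv import load_dotenv

# Read the .env file and populate os.environ so that
# the os.getenv(...) calls in the following snippets work
load_dotenv()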

Step 1: Semantic Chunking with Azure OpenAI

Before we send anything to a vector index, we need to split our text into smaller, meaningful chunks: not just by paragraph or sentence, but by semantic boundaries (where the topic naturally shifts). That’s where SemanticChunker shines!

1. Setup Azure OpenAI Embeddings

from langchain_openai.embeddings import AzureOpenAIEmbeddings
import os

def get_azure_embeddings():
    """
    Creates an embedding client for Azure OpenAI
    Returns:
        AzureOpenAIEmbeddings: LangChain embedding object
    """
    return AzureOpenAIEmbeddings(
        azure_deployment=os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT"),
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    )

2. Semantic Chunking with LangChain

from langchain_experimental.text_splitter import SemanticChunker

def chunk_text_semantically(text: str, embeddings) -> list:
    """
    Splits long text into semantically meaningful chunks using Azure OpenAI embeddings.
    
    Args:
        text (str): Full document text
        embeddings: An AzureOpenAIEmbeddings object
    
    Returns:
        list: A list of Document chunks
    """
    splitter = SemanticChunker(
        embeddings=embeddings,
        breakpoint_threshold_type="percentile",   # How aggressive to split
        breakpoint_threshold_amount=95.0,         # Top 5% breakpoint
        min_chunk_size=120                        # Avoid tiny chunks
    )
    
    return splitter.create_documents([text])

Example Usage:

embeddings = get_azure_embeddings()
chunks = chunk_text_semantically(doc_text, embeddings)

print(f"Total chunks created: {len(chunks)}")
print("Sample Chunk:\n", chunks[0].page_content[:500])

Step 2: Indexing Chunks into Azure AI Search

Azure AI Search doesn’t just take your text and call it a day; you need to prepare it right. Each chunk becomes a document with fields like id, content, and embedding.

Here’s how we do it, step by step.

1. Define Your Index Schema (if not already created)

from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchField, SearchFieldDataType,
    VectorSearch, VectorSearchProfile, HnswAlgorithmConfiguration, HnswParameters
)

def build_search_index_schema(index_name: str) -> SearchIndex:
    return SearchIndex(
        name=index_name,
        fields=[
            # 'id' is filterable so we can look up neighboring chunks by key later
            SimpleField(name="id", type="Edm.String", key=True, filterable=True),
            SearchableField(name="content", type="Edm.String"),
            SimpleField(name="chunk_id", type="Edm.Int32"),
            SimpleField(name="doc_id", type="Edm.String", filterable=True),
            SimpleField(name="source_uri", type="Edm.String"),
            SimpleField(name="prev_id", type="Edm.String"),
            SimpleField(name="next_id", type="Edm.String"),
            SimpleField(name="page_no", type="Edm.Int32", filterable=True),
            # Vector fields must be SearchField (SimpleField is never searchable)
            SearchField(
                name="embedding",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=1536,  # matches text-embedding-3-small
                vector_search_profile_name="default-vector-profile",
            ),
        ],
        vector_search=VectorSearch(
            algorithms=[
                HnswAlgorithmConfiguration(
                    name="default-vector-config",
                    parameters=HnswParameters(m=4, ef_construction=400),
                )
            ],
            profiles=[
                VectorSearchProfile(
                    name="default-vector-profile",
                    algorithm_configuration_name="default-vector-config",
                )
            ],
        ),
    )
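If the index doesn’t exist yet, you can create (or update) it from this schema with the SearchIndexClient. A quick sketch, reusing the environment variables from the setup section:

from azure.search.documents.indexes import SearchIndexClient
from azure.core.credentials import AzureKeyCredential
import os

index_client = SearchIndexClient(
    endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_API_KEY")),
)

# create_or_update_index is idempotent: it creates the index if it's missing,
# or updates the schema if it already exists
index_schema = build_search_index_schema(os.getenv("AZURE_SEARCH_INDEX_NAME"))
index_client.create_or_update_index(index_schema)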

2. Format Chunks with prev_id and next_id

def format_chunks_for_indexing(chunks: list, doc_id: str, source_uri: str) -> list:
    # Azure AI Search document keys only allow letters, digits, underscores, dashes, and equals signs
    safe_doc_id = doc_id.replace("#", "_")
    formatted = []
    for i, chunk in enumerate(chunks):
        formatted.append({
            "id": f"{safe_doc_id}_{i}",
            "doc_id": doc_id,
            "chunk_id": i,
            "source_uri": source_uri,
            "page_no": chunk.metadata.get("page", None),
            "content": chunk.page_content,
            # Neighbor links use the same sanitized id scheme as "id"
            "prev_id": f"{safe_doc_id}_{i-1}" if i > 0 else None,
            "next_id": f"{safe_doc_id}_{i+1}" if i < len(chunks) - 1 else None
        })
    return formatted

3. Embed and Upload to Azure AI Search

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

def index_chunks_to_azure(chunks: list, embedding_fn, search_client: SearchClient):
    for chunk in chunks:
        chunk["embedding"] = embedding_fn(chunk["content"])
    search_client.upload_documents(documents=chunks)
    print(f"Uploaded {len(chunks)} chunks to Azure Search")

Putting It All Together:

# 1. Setup SearchClient
search_client = SearchClient(
    endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
    index_name=os.getenv("AZURE_SEARCH_INDEX_NAME"),
    credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_API_KEY"))
)

# 2. Setup embeddings
embedding_model = get_azure_embeddings()
embedding_fn = lambda text: embedding_model.embed_query(text)

# 3. Format and push
doc_id = "ai_intro_doc"
formatted_chunks = format_chunks_for_indexing(chunks, doc_id, "document-path")
index_chunks_to_azure(formatted_chunks, embedding_fn, search_client)

Step 3: Semantic Retrieval with Context-Aware Expansion

Once your semantic chunks are indexed, it's time to make them useful. A great RAG system doesn’t just match keywords; it understands meaning and respects structure. That’s why we use hybrid search.
We'll:
  1. Embed the user query (for semantic search)
  2. Perform hybrid search: text + vector
  3. Pull neighboring chunks via `prev_id` and `next_id` to prevent context loss
  4. Format results for your model prompt

1. Embed the User Query

We’ll use the same embedding model to turn the query into a vector, so we can find the closest semantic matches in the index.

def get_query_embedding(query: str, embedding_model) -> list:
    return embedding_model.embed_query(query)

2. Perform Hybrid Search in Azure AI Search

Azure AI Search supports sending both a search_text (keyword) query and a vector query in the same request, which is exactly what hybrid search needs:

from azure.search.documents.models import VectorizedQuery

def hybrid_search(query: str, query_vector: list, search_client, k: int = 5):
    # Vector half of the hybrid query, targeting the 'embedding' field
    vector_query = VectorizedQuery(vector=query_vector, k_nearest_neighbors=k, fields="embedding")

    results = search_client.search(
        search_text=query,              # BM25 keyword half
        vector_queries=[vector_query],  # semantic half
        select=["id", "content", "doc_id", "prev_id", "next_id"],
        top=k,
    )
    return list(results)

3. Expand Results with Prev/Next Context

def fetch_with_context(results, search_client):
    related_ids = set()
    for r in results:
        related_ids.add(r["id"])
        if r.get("prev_id"):
            related_ids.add(r["prev_id"])
        if r.get("next_id"):
            related_ids.add(r["next_id"])

    # Fetch the hits plus their neighbors in one filtered query
    # (this is why the 'id' field is marked filterable in the schema)
    filter_expr = " or ".join([f"id eq '{rid}'" for rid in related_ids])
    expanded_results = search_client.search(
        search_text="*",
        filter=filter_expr,
        select=["id", "content", "doc_id", "chunk_id"],
    )
    return list(expanded_results)

Putting It All Together

query = "What were the key milestones in early AI history?"
query_vector = get_query_embedding(query, embedding_model)

top_chunks = hybrid_search(query, query_vector, search_client, k=4)
contextual_chunks = fetch_with_context(top_chunks, search_client)

# Sort by doc_id/chunk_id to preserve reading order
contextual_chunks = sorted(contextual_chunks, key=lambda c: (c["doc_id"], c["chunk_id"]))

# Display sample
for chunk in contextual_chunks:
    print(f"\n{chunk['id']}\n{chunk['content'][:300]}...")

Step 4: Stitch Chunks, Prompt the Model (The RAG Finale)

Once we’ve retrieved the best-matching chunks, including their neighbors, it’s time to give them to the model.

But wait, it’s not just “Top 3 chunks → Dump into prompt.”
We make sure the chunks are:

  • Deduplicated (no repeats)
  • Sorted (in reading order)
  • Joined with separators (so the model can distinguish them)

def prepare_prompt_context(chunks: list) -> str:
    """
    Takes the expanded search results (hits plus their neighbors),
    deduplicates them, sorts them into reading order, and joins them
    into prompt-ready context.
    """
    seen = set()
    selected = []

    # Deduplicate: a neighbor of one hit may also be a hit itself
    for chunk in chunks:
        if chunk["id"] not in seen:
            selected.append(chunk)
            seen.add(chunk["id"])

    # Sort by document and position to maintain reading order
    selected.sort(key=lambda c: (c["doc_id"], c["chunk_id"]))

    # Join with clear separators so the model can tell chunks apart
    return "\n---\n".join(chunk["content"] for chunk in selected)

You can now take the returned string and plug it into your LLM prompt like so:

prompt = f"""You are an expert assistant. Use the following context to answer clearly and accurately.

{prepare_prompt_context(contextual_chunks)}

Question: {query}
Answer:"""

Result: Your model sees a coherent slice of the source doc, complete with the lead-in, the answer, and the follow-up. No more broken thoughts!
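To close the loop, here’s a minimal sketch of sending that prompt to an Azure OpenAI chat deployment with the openai package. The AZURE_OPENAI_CHAT_DEPLOYMENT variable is an assumption, not part of the setup above; point it at whatever chat model deployment you have:

from openai import AzureOpenAI
import os

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)

response = client.chat.completions.create(
    model=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT"),  # assumed env var: your chat deployment name
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # keep the answer close to the provided context
)

print(response.choices[0].message.content)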

Wrapping Up: From Documents to Grounded Answers

Preventing context loss isn’t just a nice-to-have in Retrieval-Augmented Generation (RAG). It’s the difference between vague answers… and useful ones.

By combining:

  • Semantic chunking - keeps ideas together
  • Smart indexing - stores structure and meaning
  • Hybrid retrieval - balances precision and recall
  • Neighbor-aware context - completes the narrative

we make Azure AI Search and Azure OpenAI work together like a dream team.

This approach isn’t just scalable; it’s grounded, relevant, and ready for production RAG applications.

Whether you're building internal knowledge assistants, research bots, or customer-facing copilots, preserving context is your secret weapon.

If you have any questions, you can reach out to our SharePoint Consulting team here.
