Optimizing Retrieval-Augmented Generation with Advanced Chunking Techniques: A Comparative Study

In this first research report from the R&D Lab at Antematter, we explore a variety of chunking strategies—including Spacy, NLTK, Semantic, Recursive, and context-enriched chunking—to demonstrate their impact on the performance of language models in processing complex queries. By experimenting with these different methodologies and evaluating their effectiveness through quantitative analysis, we aim to identify the most efficient techniques for optimizing RAG-based systems.

18 October 2024

In the field of natural language processing (NLP), Retrieval-Augmented Generation (RAG) stands as a cutting-edge method, combining the strengths of retrieval and generation models to produce responses that are both accurate and contextually relevant to user queries. This fusion enables RAG-based applications to leverage textual data more effectively, ensuring that responses are grounded in the most pertinent information available.

To facilitate this process, chunking plays a pivotal role. This technique involves breaking down a large corpus of text into smaller, more manageable segments. Such segmentation is key to enhancing the relevance of content retrieved during similarity searches.

By optimizing these chunks for context, they can be seamlessly integrated into the framework of a Large Language Model (LLM), enhancing the model's ability to generate coherent and contextually appropriate responses.

However, the challenge lies in determining the optimal size for these chunks. If the segments are too large or too small, it can lead to issues such as incomplete information, fragmented sentences, or even a loss of semantic coherence. These issues can, in turn, cause the LLM to produce incorrect responses or even fabricate information—a phenomenon known as "hallucination".

To mitigate these risks, it's critical to employ dynamic chunking strategies that can adjust the size of chunks based on the context, rather than relying on a one-size-fits-all approach.

One effective strategy to enhance chunking is to increase the chunk or overlap size, which can significantly improve the model's performance by providing it with a more comprehensive view of the text.

However, this approach is not always feasible due to computational or memory limitations. An alternative solution is the use of context-enriched chunks, which append summaries of documents to each segment, thereby enriching the information available to the LLM without excessively increasing the computational load.

In our research report, we explore a variety of chunking strategies—including Spacy, NLTK, Semantic, Recursive, and context-enriched chunking—to demonstrate their impact on the performance of language models in processing complex queries.

By experimenting with these different methodologies and evaluating their effectiveness through quantitative analysis, we aim to identify the most efficient techniques for optimizing RAG-based systems.

This comparison sheds light on the significance of selecting the right chunking strategy to ensure that the retrieved information aligns well with the user's query, thereby improving the language model's understanding and response accuracy. Through this detailed examination, we provide insights into best practices for enhancing the functionality of RAG-based applications, ultimately contributing to the advancement of natural language processing technology.

Our analysis reveals that semantic chunking has proven to be the most effective strategy for ensuring coherent information within chunks and outperforming other strategies in each category. However, it's important to note that its effectiveness may vary based on the dataset and scenarios, and it may not always be the most appropriate approach to use in certain cases.

Chunking

Chunking is a technique that breaks down large pieces of text into smaller units known as chunks.

Importance of Chunking

Chunking is crucial when dealing with extensive data collections in LLM applications. By chunking, we can feed the LLM only the most relevant information, which helps us avoid hallucinations because the information retrieved via RAG grounds the LLM's generation. Effective chunking improves the relevance of the information loaded into the LLM's context window during generation, and this in turn helps the LLM generate more accurate responses.

Chunking Strategies

We aim to evaluate various chunking strategies based on the relevant evaluation metrics. By understanding the strengths and weaknesses of each, we can determine the most suitable approach for specific scenarios.

NLTK

  • "NLTKTextSplitter" utilizes NLTK Tokenizers to segment text into "lists" of substrings, streamlining the process of forming coherent and insightful chunks. Its sentence tokenizer uses predetermined rules based on punctuation marks and other language-specific patterns to identify sentence boundaries.

Spacy

  • "SpacyTextSplitter" utilizes Spacy’s tokenizer to segment text and generate "Doc" objects with the discovered segment boundaries, enabling better context preservation in the resulting chunks. Its sentence tokenizer uses a pre-trained statistical model to handle diverse text inputs and languages while adapting to various writing styles and conventions.

Semantic

  • "SemanticChunker" dynamically selects the breakpoint between paragraphs or sentences based on embedding similarity. This ensures that each "chunk" contains semantically cohesive sentences.

Recursive

  • "RecursiveCharacterTextSplitter" involves iteratively breaking down the text into smaller chunks using predefined separators. This process aims to achieve uniformity in segment size, although the exact sizes may vary.

Context-Enriched

  • "Context-enriched" chunking involves, breaking down information into meaningful segments and adding helpful summaries. With neatly organized sections and detailed descriptions, the model navigates through the information landscape. It uses query transformation to explore various phrasings and retrieves documents that align with the contextual understanding embedded within each query. It's not just about the keywords, but also the core content.

Windowed Summarization

  • We employed an additive pre-processing technique that enriches each text chunk with a windowed summary of the previous few chunks. To explore the impact of different contextual scopes, we made the window size adjustable, allowing experimentation with various values; here, window size refers to the number of preceding text chunks used to create a summarized context for each chunk. The process was straightforward: for each chunk, labeled cn, and assuming a window size of n = 3, we combined it with summaries of the two preceding chunks (cn-1 and cn-2), created a summary of this combined content, marked as k, and appended k to cn; we chose not to generate a summary for cn alone. A code sketch of this procedure follows below.

    Here is a high-level overview of the context-enriched chunking approach.
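
    Below is a minimal code sketch of one plausible reading of this windowed-summarization procedure; the summarize() helper is hypothetical and stands in for whatever LLM summarization call is used.

    ```python
    # Sketch of windowed summarization: each chunk c_n is enriched with a summary ("k")
    # built from the summaries of its preceding chunks plus c_n itself.
    # summarize() is a hypothetical helper wrapping an LLM summarization call.
    from typing import Callable, List

    def enrich_chunks(
        chunks: List[str],
        summarize: Callable[[str], str],
        window_size: int = 3,
    ) -> List[str]:
        enriched = []
        for n, c_n in enumerate(chunks):
            # Summaries of the previous (window_size - 1) chunks, e.g. c_{n-1} and c_{n-2}.
            prev_summaries = [summarize(c) for c in chunks[max(0, n - (window_size - 1)):n]]
            if prev_summaries:
                # Summarize the combination of c_n and the preceding summaries ("k"),
                # then append it to c_n; c_n on its own is never summarized.
                k = summarize("\n".join(prev_summaries + [c_n]))
                enriched.append(c_n + "\n\n[context summary] " + k)
            else:
                enriched.append(c_n)  # first chunk has no preceding context
        return enriched
    ```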

Experimental Setup:

To start the process, we implemented a chunking script to segment a given document into smaller parts, called chunks, using a specific strategy. These chunks were then fed into the context of the Mixtral MOE model for response generation. A GPT-3.5 model was then used to assess the quality of the responses against the queries, providing a more comprehensive evaluation of how well the response generated by a particular strategy aligns with the query. We chose a distinct evaluation model to ensure a thorough examination and validation of the generated responses, leading to a more reliable evaluation process; using different models for response generation and evaluation is also standard practice, since a model struggles to evaluate its own work accurately. Let's walk through the steps involved in the experiment one by one:

Import Documents

We used a dataset of Paul Graham's essays.
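
A minimal loading sketch using LlamaIndex's SimpleDirectoryReader; the local path is a placeholder and the import layout assumes a recent (>= 0.10) release.

```python
# Minimal sketch: load the Paul Graham essays from a local folder.
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("data/paul_graham_essays").load_data()  # placeholder path
document_text = "\n\n".join(doc.text for doc in documents)
```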

Perform Chunking

We then perform different types of chunking on a given document. The script takes the text and the chosen chunking strategy as parameters and applies the relevant chunking strategy.
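
A condensed sketch of such a dispatcher; strategy names and default sizes are illustrative, and the semantic and context-enriched strategies are omitted here because they additionally require an embedding model and a summarizer.

```python
# Minimal dispatcher sketch: apply the requested chunking strategy to a text.
from langchain_text_splitters import (
    NLTKTextSplitter,
    RecursiveCharacterTextSplitter,
    SpacyTextSplitter,
)

def chunk_text(text: str, strategy: str, chunk_size: int = 512, overlap: int = 50):
    splitters = {
        "nltk": NLTKTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap),
        "spacy": SpacyTextSplitter(chunk_size=chunk_size),
        "recursive": RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=overlap),
    }
    return splitters[strategy].split_text(text)

chunks = chunk_text(document_text, strategy="recursive")
```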

Embeddings Generation

At the time of our experiments, a state-of-the-art embedding model on Hugging Face's MTEB (Massive Text Embedding Benchmark) leaderboard was the "BGE" model, developed by the Beijing Academy of Artificial Intelligence (BAAI). It is a pretrained transformer model that can be used for various natural language processing tasks, such as text classification, question answering, and semantic retrieval. The model is trained on a massive text corpus and ranks highly on the MTEB.

Embeddings are dense vectors that capture the semantic meaning of a document. We used the open-source BGE model for the embeddings and Mixtral MOE as the LLM, building the search index after embedding the documents.
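
A minimal index-construction sketch via LlamaIndex, assuming a BGE embedding model from Hugging Face; the model variant, package layout, and top-k value are assumptions rather than the exact configuration used.

```python
# Minimal sketch: embed the chunks with a BGE model and build a vector search index.
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")  # assumed variant
# Settings.llm would point at the Mixtral MOE endpoint used for response generation.

index = VectorStoreIndex.from_documents([Document(text=c) for c in chunks])
query_engine = index.as_query_engine(similarity_top_k=3)  # illustrative top-k
```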

Questions Generation

We utilized the "RagDatasetGenerator" from the LlamaIndex dataset utilities to generate evaluation questions and then passed them as queries to Mixtral MOE for response generation. Mixtral MOE was also used to generate questions covering different cases, including questions targeting the context-enriched chunks.
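
A short sketch of question generation with RagDatasetGenerator; the import path and the questions-per-chunk value are assumptions based on recent LlamaIndex releases.

```python
# Sketch: generate evaluation questions from the loaded documents.
# The LLM used for generation (Mixtral MOE in our setup) is taken from Settings.llm.
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

generator = RagDatasetGenerator.from_documents(
    documents,
    num_questions_per_chunk=2,  # illustrative value
)
eval_questions = generator.generate_questions_from_nodes()
```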

Performance Evaluation

Finally, we rigorously assessed the different chunking strategies utilizing OpenAI's "GPT-3.5" model. We evaluated on three key metrics, with a scoring sketch following the list:

  • Context Relevancy: Measures the relevance of the retrieved information to the query context.
  • Answer Relevancy: Measures the relevance and precision of the generated answer to the original question.
  • Faithfulness: Assesses the presence of hallucinations or inaccuracies in the generated answer.
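
The exact evaluation prompts are not reproduced here; the sketch below uses LlamaIndex's GPT-3.5-backed evaluators as stand-ins for the metrics above, so the class choices and fields accessed are assumptions rather than our precise pipeline.

```python
# Sketch: score a generated response with GPT-3.5 acting as the judge model.
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI

judge = OpenAI(model="gpt-3.5-turbo")
faithfulness = FaithfulnessEvaluator(llm=judge)   # hallucination / grounding check
relevancy = RelevancyEvaluator(llm=judge)         # query-response relevancy check

response = query_engine.query(question)  # question: one item from the generated eval set
print(faithfulness.evaluate_response(query=question, response=response).passing)
print(relevancy.evaluate_response(query=question, response=response).passing)
```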

Results and Discussions:

In our case study, we categorized the input queries into three levels of complexity (Easy, Medium, and Hard) to ensure a comprehensive evaluation of the performance of the different chunking strategies. This approach allowed us to assess each strategy's ability to handle queries of varying complexity and to identify its strengths and weaknesses. By using multiple metrics, such as Faithfulness (F), Answer Relevancy (AR), and Context Relevancy (CR), we were able to provide a detailed and accurate comparison of the different strategies and their effectiveness in generating relevant and accurate responses. This approach can also help in selecting the most appropriate strategy based on the complexity of the input queries and the desired outcomes.

A sample evaluation of queries across these categories is shown below:

Here is the detailed table, containing responses, contexts, and relevance scores against the given queries:

The results of the sample evaluation show that:

  • Spacy chunking outperforms the others on all three given queries, achieving the highest scores. The reason is that Spacy's tokenizer uses a statistical model trained on diverse text, enabling it to accurately identify boundaries and extract meaningful chunks from the text.
  • Recursive chunking exhibits the lowest context relevancy score on one query. This subpar performance stems from its inability to account for contextual similarity between text segments.
  • Semantic chunking performs well overall but shows the lowest context relevancy score on the medium-complexity query, reflecting its limited ability to handle diverse text inputs.
  • Context-enriched chunking performs better than recursive chunking but still falls short at capturing context: although each chunk carries summaries of previous chunks, the current chunk itself can still lack semantic coherence.

Easy

The table below shows the average evaluation metric scores for various chunking strategies in the easy category, which consists of questions that are simple to understand.

Semantic chunking performs the best context-aware chunking for easy input queries in our dataset: not only is its answer relevancy score the highest, but the context relevancy scores also reveal that the other strategies failed to align with the context even for these simple queries.

Medium

Medium category questions are more complex than easy questions and require a chunking strategy that preserves high contextual information to fully grasp the underlying context. The same evaluation metrics are used to provide insights into the performance of each strategy concerning how well they align with the context relevance of the answers provided, and their faithfulness to the original query.

The table below shows the average scores obtained within the medium category:

Semantic performs the best context-aware chunking for intermediate-level complexity queries.

Hard

The Hard category contains the most challenging and complex queries that require significant analysis and interpretation. They lack explicit prompts, making it harder to generate accurate and relevant responses. Evaluating the effectiveness of chunking strategies against hard difficulty questions provides insights into their ability to handle complex and nuanced input queries effectively.

The following table displays the average metrics scores across different chunking strategies within the hard category:

Although semantic chunking performs the best context-aware chunking in the hard category, it still faces challenges in retrieving the proper context in some cases, as its scores remain modest.

Analysis

Semantic chunking has proven to be the most effective strategy for ensuring coherent information within chunks and outperforming other strategies in each category. However, it's important to note that its effectiveness may vary based on the dataset and scenarios, and it may not always be the most appropriate approach to use in certain cases. Therefore, it's essential to evaluate the performance of different chunking strategies carefully, taking into account various factors such as the complexity of the queries, the type of data, and the desired outcomes. By doing so, it's possible to identify the best approach to use in each scenario and generate accurate and relevant responses that align with the context of the query.

Strengths and Weaknesses:

  • Semantic chunking offers high context preservation and ensures each chunk contains coherent information, but requires more computational resources.
  • Context-enriched chunking preserves the context but not fully, as chunks may end in mid-sentence, leading to a lack of semantic coherence in the current chunk.
  • NLTK and Spacy capture linguistic structures effectively, but the identified sentence boundaries might not always be accurate for structured text, and the variable chunk sizes can be challenging to manage in some models. Spacy's tokenizer uses a statistical model that handles diverse text inputs, whereas the NLTK tokenizer relies on grammar rules and less advanced algorithms for handling complex patterns and relationships between words.
  • Recursive chunking provides simple and uniform chunk size but does not preserve the context effectively, as chunks may end in mid-sentence, leading to a lack of semantic coherence.

Conclusion:

To sum up, our in-depth analysis of different chunking strategies reveals that hardcoding the chunking strategy or chunk size is not the best approach. Instead, it's crucial to experiment with various strategies and evaluate their performance using metrics tailored to specific scenarios. Naive chunking methods performed comparatively poorly due to the absence of comprehensive conceptual understanding within each chunk.

Our research emphasizes the dynamic nature of chunking strategies and the need for ongoing optimization to enhance the performance of RAG-based applications. Incorporating document summaries as context shows an improvement over recursive chunking but still falls short of the other context-aware strategies. The exploration of chunking methods has revealed their significant impact on RAG performance; however, achieving optimal performance requires striking a balance between context preservation, semantic coherence, and computational efficiency.

Thus, ongoing optimization efforts are essential to unlock the full potential of RAG-based applications in information retrieval and response generation. By continuously exploring and refining chunking techniques, we can overcome these challenges and enhance the efficacy of RAG-based applications across various scenarios.