Chapter 9. Chunk Optimizations

  • Book Excerpt from "RAG Optimization: Accurate and Efficient LLM Applications"
  • by David Spuler and Michael Sharpe

Data Requirements

To get the full benefit of an LLM that’s specialized to your particular type of business, you need some proprietary data. You can use this data in either a fine-tuning or RAG project, and it offers more specialized results for your users, whether they’re external customers or internal staff.

Assuming your RAG project needs data, which it does, there are various extra development tasks:

  • Data inventory — finding it.
  • Data cleaning
  • Access permissions review
  • Legal review
  • Data chunking
  • Chunk ingesting

Another aspect of project planning is the need to repeat all of the above over the RAG product’s lifetime. Data changes over time, which means that data cleaning, data ingestion, and potentially legal review all need to be periodic, ongoing processes.

RAG Data Management

When compiling data for a RAG implementation, you do have to go to some lengths to make sure your database of content has the data needed to answer questions. However, it does not need to be a great number of samples: even a single hit on one chunk of data is enough for the LLM to form an answer. Conversely, if a search of the text-based index or the vector index does not return good hits, any RAG system will produce poor results.

Also, unlike fine-tuning data, RAG does not require question-and-answer style content, but only documents that can be searched. The RAG chunks can be based on reference materials, FAQs, or summary materials, and don’t need to be conversational.

The data is very important to a successful RAG project. Conceptually, RAG systems are not much different from running a Google-style search over your own data, except that at the end you have something that can produce eloquent writing better than 90% of the population.

Data Inventory

There are several major data problems that cause strain in many commercial AI projects:

    (a) Finding good data.

    (b) Cleaning it up!

    (c) Structuring it logically.

    (d) Formatting it.

All of these issues are bottlenecks that are often underestimated, especially since the people pushing for AI projects tend to be tech staff, who only care about code.

Some of the aspects of the data that need to be considered:

  • Quality of the data
  • Formats (input and output)
  • Authorship
  • Legal license rights

Data Formats

With regard to data formats, they could be anything, but common examples include:

  • HTML pages of your public website (or internal intranet)
  • Database records
  • Emails
  • PDF files
  • Microsoft Word document files
  • Plain text
  • Free-form text (e.g., user questions and staff answers in a trouble ticketing support system)

What is Good Data?

Higher quality data is better for fine-tuning, and is also one of the ways that Small Language Models (SLMs) have improved. Some of the issues to consider in terms of data quality include:

  • User-generated content versus professionally created content.
  • Completeness
  • Accuracy
  • Tone of writing (e.g., casual versus formal)
  • Reading level
  • Use of complex jargon
  • Up-to-date
  • Harmless, safe, and non-toxic

To put it more succinctly, would the content actually contain answers that would be helpful to users of your LLM project? This may depend on whether they’re internal staff or your external customers.

Generally speaking, good data for an LLM project is:

    (a) professionally written,

    (b) fully-owned by the company, and

    (c) written with the general public as the intended audience.

A good example would be the customer “data sheets” about your products, which are either glossy brochures in PDF or “white papers” with technical details. That sort of data would be great to train a user support chatbot on how to answer customer questions about your products.

Hence, the first stop on your quest for good data: the marketing department.

Data Cleaning

Be careful if you load up a USB drive full of PDFs from the marketing server, or set up your Linux box spidering the entire corporate intranet. Might not be such a great idea.

Also, if you’ve gathered some dodgy data, don’t expect the LLM to save your bacon! The AI engines are really dumb about this kind of stuff, and won’t recognize that they shouldn’t regurgitate all this out in answers to the general public. It’s kind of like having your kids at show-and-tell announce that you only cook microwave TV dinners at home.

Some of the issues with cleaning of internal proprietary data include:

  • Confidential data (all sorts of things!)
  • Source code
  • Bank account numbers
  • Internal discussions (e.g., developers cussing at support staff in the trouble ticket database).
  • Individual names, email addresses, or other personally-identifiable information.
  • Out-of-date information
  • Irrelevant information
  • Cuss words
  • Sensitive topics (many!)

But those are only the exciting problems. A lot of the issues are much more mundane:

  • Typos
  • Badly formatted documents
  • Poorly written content (e.g., in emails or trouble tickets)
  • Incomplete data
  • Just Plain Wrong (JPW) data

So, the main thing is that you have to very carefully curate the sources of all the information, and then run a lot of scans on it.
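
To make this concrete, below is a minimal sketch of a first-pass automated scan, assuming the chunks are available as plain text strings. The regex patterns and the tiny word list are illustrative placeholders only; a real cleaning pipeline needs far more sophisticated PII detection, plus human review of anything flagged.

    import re

    # Placeholder patterns for a first-pass scan; real pipelines need many more.
    PATTERNS = {
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
        "long_digit_run": re.compile(r"\b\d{8,}\b"),  # possible account numbers
        "cuss_words": re.compile(r"\b(damn|crap)\b", re.IGNORECASE),  # tiny placeholder list
    }

    def scan_chunk(text: str) -> dict:
        """Return any suspicious matches found in one chunk of source text."""
        return {name: pat.findall(text) for name, pat in PATTERNS.items() if pat.search(text)}

    # Flag chunks for human review rather than silently ingesting them.
    for chunk in ["Contact me at jane@example.com about ticket 4821.",
                  "Wire the refund to account 12345678901."]:
        hits = scan_chunk(chunk)
        if hits:
            print("REVIEW:", hits)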

Data Access Permissions

Another aspect of data cleaning is to ensure that the data should really be shown to end users. Are there any restrictions or permissions required to view the data? I have seen RAG projects that end up giving users more access to the data than they would have had in the original source. This can be a security issue where your RAG application divulges confidential or sensitive information in some contexts.

Open Source Data

Should you use open source data in your business application? There are plenty of curated data sets of AI training data available on the internet. These are free and mostly have permissive open licenses that permit commercial usage. Sounds great!

Firstly, what’s the point? If it’s a publicly available data set, it’s probably already been included in the training set of whatever pre-trained foundation model you’re going to use, whether it’s commercial or open source. Foundation models are data-hungry beasts, and everyone in the AI industry knows this trick!

Secondly, open source data is probably only generic data, too. The whole idea of either RAG or fine-tuning is to find some special data, stored on a dusty mag tape hidden away in a cupboard down on the shipping dock floor. Then you can fine-tune your magic data into a fancy fine-tuned model, that has genius-level intelligence about “sewer solvents” or whatever you’re selling to your customers, and then you are the AI hero.

This is the data economy now. I read somewhere that it was the companies with the most data that would win at AI. Or maybe it was the companies with the most patents, I forget.

In any case, using open source data is not necessarily a great idea. There are probably situations where it’s useful, such as to fill RAG gaps or to fine-tune a model to be more likely to follow the information or style of a specialized set of data. Or if your boss demands that you get some data because of some reason or other, then use free data by all means. Here’s a tip: there are various companies that sell data, too.

Legal Issues with Data

Data is a good place to start the initial conversation with the legal department. Some of the legal concerns with regard to data being available for use with an LLM include:

  • Ownership — was the data internally generated and thereby fully owned by the business, or is it subject to a third-party content license?
  • Licensed rights — what does a third-party content license actually permit for third-party data?
  • Copyrighted data — it’s a currently unresolved issue with regard to non-licensed copyrighted data being used in the training data set of an LLM. It’s also a highly controversial issue at the moment, with active lawsuits.
  • User-generated content — does the user license and associated privacy policy allow you to use the data? Or do you want the policy to disclose that you won’t?
  • Minor-created content — are all the users who agreed to your website user terms actually old enough to do so? Can you identify such content to handle it separately? Even if you have rights to such content, would you want to use it?
  • Proof of license acceptance — if your user license supposedly permits your use of user data for training models, has anything documenting the user’s acceptance actually been retained? And what is required?
  • Mixed copyrighted data — some data may be a mix of user-created content and copyrighted data from other sources, such as where a user uploads an excerpt from a published book.
  • Open source license compliance — if some or all of a data set is open source, there are still some complicated compliance aspects, such as attribution and supplying a copy of the license, even for superficially simple licenses like the MIT License or the Apache 2 License.
  • Copyleft-licensed content — if the content has a “copyleft” or “share-alike” license, such as Wikipedia or CC-BY-SA, can you use it without onerous obligations attaching to your other intellectual property? It’s an unclear legal issue whether copyleft licenses attach to an LLM trained using that data.
  • International data — can data sets from overseas be moved between countries according to the applicable terms and privacy policies, and also the overarching legal systems of the government where it is located?
  • Synthetic data — if you’re using “synthetic data” created by some other LLM, what are the legal issues surrounding that? What data sets were used to train the other LLM?
  • Photos, images, and multimedia — don’t assume that you have unlimited rights to images, photos, or video just because they appear in company articles or multimedia materials. Many such media may have been licensed from clipart or stock photography websites, with very complex license terms that are often restrictive with onerous penalties for non-compliance.

There is, of course, a huge gray cloud of unclear boundaries hanging over all of these legal issues, some of which are currently making their way through the courts in various countries. All that data is publicly available on the web for anyone to consume for free, but does that necessarily mean it can be used for training? Where an author has not stated that you cannot use their content for AI, this does not imply that they have granted permission, either. Five years ago, nobody had a clue that data like this could possibly be used for AI, and now we’re dealing with it.

Anyway, your only response to the gray fog and that awful list should be to immediately enrol in a law degree. There are already plenty of these copyright lawsuits to keep your book of business full, and the AI patent lawsuits will be ramping up in a year or two. The only job that pays more than AI IP attorney is training a trillion-parameter model.

Chunk Size

Even with long-context models available, it is still important to pick a good chunk size. If a vector database is used, every chunk is represented by a vector embedding, which is a concise numerical summary of the key concepts in the chunk. Ideally, there should not be very many concepts in a single chunk.

If there are too many concepts in a chunk, it will often appear as a candidate chunk, but will rarely rank among the top chunks. If a chunk is too small, it will likely not have enough content to allow the LLM to form a good answer.

The context size of the LLM is important, too. The user is likely to have a short conversation after the initial query, so the system needs to ensure there is enough room left in the context window for the LLM to keep track of the whole conversation.

If the RAG system uses too much of the context window with chunks, some history may be lost, and the conversation will be disjointed. Things will not flow well, simply because the LLM will forget what it has already responded with.
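
As a rough illustration of budgeting the window, here is a minimal sketch; the window size and reserve numbers are example assumptions only, not recommendations.

    def chunk_budget(context_window: int, history_reserve: int,
                     answer_reserve: int, chunk_tokens: int) -> int:
        """How many retrieved chunks fit once tokens are reserved for the
        conversation history and the generated answer."""
        available = context_window - history_reserve - answer_reserve
        return max(0, available // chunk_tokens)

    # Example: an 8K window keeping 2K for history and 1K for the answer
    # leaves room for ten 500-token chunks.
    print(chunk_budget(8192, 2048, 1024, 500))  # -> 10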

This is less of an issue now, as LLM context windows are far bigger than the 4K to 8K tokens they were initially released with. It’s common to see 32K to 128K context windows now, and the latest models can be even bigger, up to a million tokens and beyond.

So, what is the optimal size of a chunk? The sweet spot is between 250 and 500 tokens, though it does depend a little on the type of answer the RAG system is trying to provide; a minimal chunking sketch follows the size breakdown below.

    Small Chunks (50-200 tokens) — good for short Q&A, but more chunks will need to be given to the LLM, and some chunks may be of little value.

    Medium Chunks (200-500 tokens) — balanced and good for most use cases; a good trade-off between accuracy and cost.

    Large Chunks (500+ tokens) — useful for providing longer and broader answers, but can lead to less precision.
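
Here is the chunking sketch mentioned above: a minimal size-targeted chunker that splits on whitespace tokens. The token counting is a rough approximation only; a real pipeline would count tokens with the model’s own tokenizer and prefer document boundaries such as paragraphs or headings.

    def chunk_text(text: str, target_tokens: int = 400) -> list:
        """Split text into chunks of roughly target_tokens whitespace-separated
        words, as a crude approximation of real tokenizer counts."""
        words = text.split()
        return [" ".join(words[i:i + target_tokens])
                for i in range(0, len(words), target_tokens)]

    document = "example sentence about the product " * 300  # stand-in source text
    chunks = chunk_text(document, target_tokens=400)         # medium-sized chunks
    print(len(chunks), "chunks")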

A lot of this guidance on chunk size was created based on older models with smaller context windows. Does it still apply today? It does, because the idea of a chunk is to capture a topic. Bigger chunks mean more topics get captured. Any retriever trying to rank chunks by their appropriateness for answering a user’s question will prefer chunks that fully match the question over partial matches.

But surely it is possible to take advantage of bigger context windows? It is: more chunks can be provided, whether additional matches or chunks “related” to the matched chunk. This gives the LLM more fodder to generate a good answer.

Overlapping Chunks

Overlapping of chunk text from multiple chunks is a longstanding accuracy optimization for RAG retrieval. However, recent advances in context window sizes, length generalization (needle-in-a-haystack answering), and vector lookup accuracy have made overlapping chunk text less important. The overlapping chunk technique is considered somewhat old and naïve right now, but it’s the starting point for all subsequent theory, so it deserves some attention.

In the early days of RAG, when taking the initial content and breaking it into chunks, it was often advised to overlap the chunks. The idea here is that nearby text in a document is likely to be related. So, if chunk X is selected as a candidate for the answer, then chunks X-1 and X+1 are likely related to the matched chunk. By overlapping the text of X-1, X, and X+1, there is a good chance that all three chunks will be retrieved and provided to the LLM.

The initial literature and techniques all spoke about overlapping chunks to ensure that, if chunks X and X+1 were related, the overlap would allow both to be included when retrieved. The basic idea was that retrieving chunk X effectively gives a bonus to the adjacent chunks X-1 and X+1.
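
A minimal sketch of that classic sliding-window approach is below; the chunk size and overlap are placeholder values.

    def overlapping_chunks(tokens: list, size: int = 400, overlap: int = 50) -> list:
        """Sliding-window chunking: each chunk repeats the final `overlap`
        tokens of the previous chunk, so adjacent content lands in both."""
        step = size - overlap
        return [tokens[i:i + size]
                for i in range(0, max(1, len(tokens) - overlap), step)]

    tokens = ("the quick brown fox " * 250).split()      # 1,000 stand-in tokens
    print([len(c) for c in overlapping_chunks(tokens)])  # -> [400, 400, 300]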

It’s not clear that this is compelling any more, though. If chunks X and X+1 are simply adjacent, and you are concerned that they are related, then either the vector embeddings will reveal that anyway, or you could modify the ranking algorithm so that if chunk X is retrieved and the vector of X+1 is close to the vector of X (but perhaps not close enough on its own), then X+1 is included anyway.

Given that, why bother trying to overlap text? It just makes chunking more complex.

Indeed, another practical problem is that overlapped chunking turns out not to be all that easy to do. Ideally, when creating a chunk, the chunk should also match some sort of boundary in the document being chunked. Perhaps the chunks correspond to paragraphs, or divs in a web page, or tickets in a ticketing system, and so on.

In many situations, processing the source data top to bottom in sequence does not lead to an ideal ordering of chunks for RAG accuracy, so the previous and next chunks may not be the best ones to provide as extra content. For example, if the source is a website, perhaps a chunk from a hyperlink in the matched chunk makes more sense.

Also troubling is what to do with short paragraphs, or a paragraph that is already bigger than a chunk, but only by a small delta. Finding a chunk that aligns with a document boundary, also overlaps other chunks, and is of the optimal size for the application can get tricky.

Furthermore, if it makes sense that chunks X-1 and X+1 are relevant to chunk X, then why not just index the disjoint chunks with their vectors? Whenever chunk X matches, combine it with X-1 and X+1 directly and provide the bigger merged chunk, or all three chunks, to the LLM.
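
A minimal sketch of that alternative is below, assuming the disjoint chunks are stored in document order so that neighbors can be looked up by index.

    def expand_with_neighbors(chunks: list, hit_index: int, window: int = 1) -> str:
        """Keep chunks disjoint; when chunk X is retrieved, splice in its
        stored neighbors X-1 and X+1 instead of overlapping their text."""
        lo = max(0, hit_index - window)
        hi = min(len(chunks), hit_index + window + 1)
        return "\n".join(chunks[lo:hi])

    chunks = ["Intro to Product X.", "Specs for Product X.", "Pricing for Product X."]
    print(expand_with_neighbors(chunks, hit_index=1))  # merges all three chunks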

Overall, the use of overlapping chunk text may not be worth the effort with modern LLMs. Other approaches, such as graph-based RAG, may be a better way to provide a general association between multiple chunks. The graph mechanism does something similar, but instead of looking for blocks adjacent to each other in reading order, it retrieves blocks near each other in the graph. More detail on graph RAG is covered in Chapter 16.

Chunk Pre-Summarization

The idea of pre-summarization is to use the LLM to summarize each chunk in a batch mode during development. The summarization process becomes another step in the chunking pipeline.

Summarization of chunks is an optimization with dual aims of improving accuracy and speed. Using an LLM summary can lead to better semantic keywords for chunk lookups, or to more accurate pre-written answers. The cost of answering user queries is also reduced if the summaries have fewer tokens than the initial chunks.

There are multiple ways to use an LLM summary of a chunk. The summary can be used in a RAG architecture with these modifications:

  • With the chunk — combining both into one chunk.
  • Instead of the chunk — a more-readable chunk to use.
  • Query expansion — as a source of embedding keywords.
  • Dual index — summaries in a separate vector database.

The last idea is the most expensive: it effectively involves having two indexes over the chunks, one based on the chunk text itself and a second based on the summary of the chunk. We can run both vector database queries in parallel and let the reranker sort out what goes to the LLM.
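
As a sketch of the dual-index idea, the following runs both lookups concurrently and hands the combined candidates to the reranker. The chunk_index, summary_index, and rerank objects are placeholder interfaces, not any particular vector database library.

    from concurrent.futures import ThreadPoolExecutor

    def dual_index_retrieve(query_vec, chunk_index, summary_index, rerank, top_k=5):
        """Query the raw-chunk index and the summary index in parallel,
        then let the reranker decide what goes to the LLM."""
        with ThreadPoolExecutor(max_workers=2) as pool:
            chunk_hits = pool.submit(chunk_index.search, query_vec, top_k)
            summary_hits = pool.submit(summary_index.search, query_vec, top_k)
            candidates = chunk_hits.result() + summary_hits.result()
        return rerank(candidates)[:top_k]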

Matching the user question against the summary of each chunk might yield some additional content for a more accurate answer. Any other metadata from the source content can be useful to include in the summary, too. For example, the source being chunked may be known to be about Product X of a company’s offerings because the source page is under the Product X documentation directory. The hierarchy of the source information could be divided by many criteria, not just product name.

References

Research papers on chunking and RAG data management include:

  1. Chen, T., Wang, H., Chen, S., Yu, W., Ma, K., Zhao, X., & Yu, D., 2024, Dense X retrieval: What retrieval granularity should we use? In Proc. of EMNLP 2024 (pp. 15159–15177). Association for Computational Linguistics. https://ar5iv.labs.arxiv.org/html/2312.06648
  2. Li, S., Stenzel, L., Eickhoff, C., & Bahrainian, S. A., 2025, Enhancing retrieval-augmented generation: A study of best practices. In Proc. of the International Conference on Computational Linguistics (COLING) (pp. 6705–6714). ACL. https://aclanthology.org/2025.coling-main.449.pdf
  3. Zhong, Z., Liu, H., Cui, X., Zhang, X., & Qin, Z., 2024, Mix-of-Granularity: Optimize the chunking granularity for retrieval-augmented generation, arXiv preprint arXiv:2406.00456, https://arxiv.org/html/2406.00456v1
  4. Zhang, W., & Zhang, J., 2025, Hallucination mitigation for retrieval-augmented large language models: A review, Mathematics, 13(5), 856. https://doi.org/10.3390/math13050856 https://www.mdpi.com/2227-7390/13/5/856
  5. Jimeno-Yepes, A., You, Y., Milczek, J., Laverde, S., & Li, L., 2024, Financial report chunking for effective retrieval augmented generation, arXiv preprint arXiv:2402.05131, https://arxiv.org/html/2402.05131v3
  6. Wang, X., Wang, Z., Gao, X., Zhang, F., Wu, Y., Xu, Z., Shi, T., Wang, Z., Li, S., Qian, Q., Yin, R., Lv, C., Zheng, X., & Huang, X., 2024, Searching for best practices in retrieval-augmented generation, arXiv preprint arXiv:2407.01219, https://arxiv.org/abs/2407.01219 Project: https://github.com/FudanDNN-NLP/RAG (Attempts to optimize the entire RAG system, including the various options for different RAG modules in the RAG pipeline, such as optimal methods for chunking, retrieval, embedding models, vector databases, prompt compression, reranking, repacking, summarizers, and other components.)
  7. Murallie, T., 2024, How to achieve near human-level performance in chunking for RAGs: The costly yet powerful splitting technique for superior RAG retrieval, Towards Data Science, https://towardsdatascience.com/agentic-chunking-for-rags-091beccd94b1
  8. June, F., 2024, Kotaemon unveiled: Innovations in RAG framework for document QA: PDF parsing, GraphRAG, agent-based reasoning, and insights, https://ai.gopubby.com/kotaemon-unveiled-innovations-in-rag-framework-for-document-qa-0b6d67e4b9b7
  9. Akkiraju, R., Xu, A., Bora, D., Yu, T., An, L., et al., 2024, FACTS about building retrieval augmented generation-based chatbots, NVIDIA Research, arXiv preprint arXiv:2407.07858, https://arxiv.org/abs/2407.07858
  10. Smith, B., & Troynikov, A., 2024, Evaluating chunking strategies for retrieval, Chroma Technical Report, https://research.trychroma.com/evaluating-chunking https://github.com/brandonstarxel/chunking_evaluation
  11. Li, S., Stenzel, L., Eickhoff, C., & Bahrainian, S. A., 2025, Enhancing retrieval-augmented generation: A study of best practices, arXiv preprint arXiv:2501.07391, https://arxiv.org/abs/2501.07391 https://github.com/ali-bahrainian/RAG_best_practices (Examines RAG best practices such as model size, prompt wording, chunk size, knowledge base size, and more.)
  12. Filimonov, S., 2025, Ingesting millions of PDFs and why Gemini 2.0 changes everything, https://www.sergey.fyi/articles/gemini-flash-2
  13. Neeser, A., Latimer, K., Khatri, A., Latimer, C., & Ramakrishnan, N., 2025, QuOTE: Question-Oriented Text Embeddings, arXiv preprint arXiv:2502.10976, https://arxiv.org/abs/2502.10976 (Augmenting RAG chunks with additional information, such as questions the chunk might answer.)

 
