
The Evolution of Chunking Strategies in Retrieval-Augmented Generation
1. Introduction to Chunking in RAG
Chunking is the essential process of dividing large documents into smaller, manageable segments called "chunks" for efficient retrieval and processing within Retrieval-Augmented Generation (RAG) systems [1]. This technique is crucial for optimizing RAG performance by enabling more precise matching between queries and relevant text, reducing noise, and enhancing efficiency by allowing faster processing of smaller units [1]. Well-designed chunks are vital for preserving logical coherence and context, balancing specificity with the necessary surrounding information [1].

The choice of chunking strategy is foundational to an effective RAG system, directly influencing retrieval quality, model efficiency, and the system's ability to capture relevant context [4]. Poor chunking can lead to fragmented information, excessive context loss, or the inclusion of irrelevant details, undermining overall performance [6].

2. Evolution of Chunking Strategies and Methods


The evolution of chunking strategies reflects a continuous effort to balance context preservation, retrieval precision, and computational efficiency,
moving from simple, fixed rules to more adaptive and intelligent mechanisms.

2.1. Foundational Approaches


Early RAG systems typically employed straightforward chunking methods, prioritizing simplicity:

● Fixed-size Chunking: The most common and straightforward approach, dividing text into uniform segments based on a predetermined character, word, or token count [8]. To mitigate context fragmentation, an "overlap" feature is often introduced, repeating a certain number of tokens from the end of one chunk at the start of the next [1] (a minimal code sketch of this and recursive splitting follows this list).
○ Advantages: Simplicity, efficiency, consistency, low computational requirements [1].
○ Disadvantages: Context fragmentation (splitting sentences or logical units), inflexibility to varying content density, potential information loss, and sub-optimal performance for heterogeneous content [1]. It can lead to a "granularity mismatch" where critical information is inadvertently split [10].
● Recursive Character Text Splitting: A more adaptive approach that breaks text into chunks using a hierarchical order of predefined delimiters (e.g., paragraphs, then sentences, then spaces) [1]. It aims to preserve natural language boundaries and adapt to document structures [12].
○ Advantages: Preserves natural language boundaries, adapts flexibly to document structure [12].
○ Disadvantages: May produce very small or uneven chunks, and individual chunks might lack comprehensive global context [12].
● Sentence-based Chunking: Splits documents at natural sentence boundaries, grouping a defined number of sentences per chunk [3].
○ Advantages: Ensures chunks contain coherent ideas, preserves semantic integrity, easily adjustable [3].
● Paragraph-based Chunking: Divides text into chunks based on paragraph boundaries, with each paragraph forming a distinct chunk [3].
○ Advantages: Preserves logical structure and flow, ensuring each chunk represents a complete thought [3].
● Document-based Chunking: Treats an entire document as a single chunk or divides it minimally to preserve its complete structure and context [1].
○ Advantages: Full context preservation, ideal for highly structured texts like legal or medical documents [1].
○ Disadvantages: Scalability issues for very large documents, reduced efficiency, and limited specificity in retrieval [1].
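A minimal sketch of the first two strategies above: fixed-size splitting with overlap, and recursive splitting over a delimiter hierarchy. The function names, default sizes, and separators are illustrative assumptions, not taken from any particular library.

```python
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into fixed-size word windows, repeating `overlap` words across boundaries."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than the chunk size"
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the document
    return chunks

def recursive_chunks(text: str, max_chars: int = 800,
                     separators: tuple = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    out, buffer = [], ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate          # keep packing pieces into the current chunk
        else:
            if buffer:
                out.append(buffer)
            # the piece alone may still be too long: fall back to finer separators
            out.extend(recursive_chunks(piece, max_chars, finer))
            buffer = ""
    if buffer:
        out.append(buffer)
    return out
```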

2.2. Context-Aware and Structured Chunking


These strategies leverage document structure and semantic meaning to create more intelligent chunks:

● Markdown-header-based Chunking: Splits text based on markdown headers, using them as contextual information for each chunk [6].
○ Advantages: Aligns with the author's logical organization, maintains coherent context within sections, can boost accuracy [12].
○ Disadvantages: Can miss content spanning multiple headers or perform poorly with inconsistent header usage [12].
● Semantic Chunking: Identifies natural breakpoints by embedding each sentence and calculating the cosine distance between consecutive sentence embeddings [6]. Splits occur where semantic similarity is low, grouping semantically similar sentences [2] (a minimal sketch follows this list).
○ Advantages: Creates context-aware chunks, improves retrieval accuracy by maintaining semantic integrity [3].
○ Disadvantages: Can produce uneven chunks, requires the computational cost of embedding the corpus, and needs a complex setup for measuring semantic shifts [3].
● Agentic Chunking: An experimental approach where an LLM determines optimal document splitting based on semantic meaning and content structure (e.g., paragraph types, section headings) [7]. These "actionable" chunks are optimized for specific purposes such as answering a question or summarizing [1].
○ Advantages: Goal-oriented, potentially more efficient for specific tasks [1].
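A minimal sketch of the semantic-chunking recipe described above: embed the sentences, then start a new chunk wherever the cosine similarity between neighbouring sentences drops below a threshold. The `embed` callable is a placeholder for whichever sentence-embedding model is in use, and the threshold is an assumption to be tuned.

```python
import numpy as np
from typing import Callable, List

def semantic_chunks(sentences: List[str],
                    embed: Callable[[List[str]], np.ndarray],
                    threshold: float = 0.6) -> List[str]:
    """Group sentences into chunks, splitting where adjacent-sentence similarity is low."""
    if not sentences:
        return []
    vectors = embed(sentences)  # expected shape: (n_sentences, dim)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vectors[i - 1] @ vectors[i])  # cosine similarity of neighbours
        if similarity < threshold:      # semantic breakpoint: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```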

2.3. Dynamic and Learnable Chunking


Recent advancements focus on dynamically determining chunk boundaries and selecting relevant segments, often through learnable mechanisms:

● Dynamic Chunking and Selection (DCS): Proposed to improve LLM comprehension of long texts by adaptively dividing contexts into variable-length chunks based on semantic similarity between adjacent sentences [13]. It then uses a question-aware classifier to select the most relevant chunks [14] (a toy sketch of the two stages follows this list).
○ Problem Addressed: Fragmentation of logical structure and difficulty in grasping semantic connections caused by fixed chunking [14].
○ Impact: Consistently outperforms strong baselines on QA benchmarks, maintains robustness across a wide range of input lengths (up to 256k tokens), and has a minimalist design for ease of implementation [13].
● H-Net Dynamic Chunking: Introduces a novel dynamic chunking (DC) mechanism that automatically learns content- and context-dependent segmentation strategies jointly with language model pre-training [15]. It replaces conventional tokenization with an end-to-end hierarchical network [15].
○ Problem Addressed: Limitations of fixed-vocabulary tokenization and handcrafted pre-processing heuristics [15].
○ Impact: Matches the efficiency of tokenized pipelines while substantially improving modeling ability, showing increased robustness and data efficiency in languages and modalities with weaker tokenization heuristics (e.g., Chinese, code, DNA sequences) [15].
Note: its primary focus is on LM pre-training, not RAG over an external corpus [16].
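A toy sketch of the two DCS stages as summarised above: adaptive, variable-length chunks from adjacent-sentence similarity, followed by question-aware selection. The learned classifier from the paper is replaced here by a simple similarity ranker, so this illustrates the idea rather than the published method; `embed`, the threshold, and `top_k` are assumptions.

```python
import numpy as np
from typing import Callable, List

def dynamic_chunk_and_select(sentences: List[str], question: str,
                             embed: Callable[[List[str]], np.ndarray],
                             split_threshold: float = 0.55, top_k: int = 3) -> List[str]:
    """Stage 1: cut variable-length chunks where adjacent sentences diverge semantically.
    Stage 2: keep the chunks most relevant to the question (stand-in for DCS's classifier)."""
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    groups, current = [], [0]
    for i in range(1, len(sentences)):
        if float(vecs[i - 1] @ vecs[i]) < split_threshold:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)
    # Question-aware selection: rank chunks by mean similarity to the question embedding.
    q = embed([question])[0]
    q = q / np.linalg.norm(q)
    ranked = sorted(groups, key=lambda idx: -float(np.mean(vecs[idx] @ q)))
    return [" ".join(sentences[i] for i in idx) for idx in ranked[:top_k]]
```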

2.4. Multimodal Chunking


Expanding RAG to include diverse data types like images, audio, and video presents unique chunking challenges [17]:

● Challenges: Data heterogeneity (PDFs with scanned tables, images, audio), the need for custom preprocessing, modality-specific chunking strategies, and different embedding techniques for each modality [17]. Preserving context across modalities (e.g., a table in a PDF within an email thread) is particularly difficult [17].
● Approaches:
○ Text-Only RAG: Converts all multimodal data into text (via OCR, captioning, Speech-to-Text) and uses a standard text-based RAG pipeline [18]. Simple and efficient, but it can lose crucial context during conversion [18].
○ Text Retrieval with Vision-Language Model (VLM): Converts non-text data to text for retrieval, but during generation provides both the retrieved text and the original multimodal content to a VLM for richer responses [18]. Preserves more context but adds computational overhead [18].
○ Modality-Specific Processing: Involves a router to detect file types, modality-specific extractors (e.g., pdfminer for PDFs, pandas for tables, Whisper for audio), custom preprocessors and chunkers (e.g., semantic chunking for text, row-based for tables, layout block grouping for PDFs), and different embedders for each modality [17]. Hybrid indexes and modality-aware routing are used to manage diverse data effectively [17]. A skeletal sketch of such a router appears below.
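A skeletal sketch of the modality-specific pipeline described above: a router maps file types to per-modality chunkers. The chunker bodies are deliberately left as stubs; a real pipeline would plug in extractors such as pdfminer, pandas, or Whisper at those points.

```python
from pathlib import Path
from typing import Callable, Dict, List

def chunk_plain_text(path: Path) -> List[str]:
    # simplest case: paragraph-based chunking of a plain-text file
    return [p for p in path.read_text(encoding="utf-8").split("\n\n") if p.strip()]

def chunk_table_rows(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. load with pandas and emit row-group chunks")

def chunk_pdf_layout(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. extract layout blocks and group them")

def chunk_audio(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. transcribe with a speech-to-text model, then chunk the text")

# Router: file extension -> modality-specific chunker.
CHUNKERS: Dict[str, Callable[[Path], List[str]]] = {
    ".txt": chunk_plain_text, ".md": chunk_plain_text,
    ".csv": chunk_table_rows,
    ".pdf": chunk_pdf_layout,
    ".wav": chunk_audio, ".mp3": chunk_audio,
}

def route_and_chunk(path: Path) -> List[str]:
    chunker = CHUNKERS.get(path.suffix.lower())
    if chunker is None:
        raise ValueError(f"no chunker registered for {path.suffix!r}")
    return chunker(path)
```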

3. Pain Points and Challenges in Chunking


Despite advancements, several persistent challenges continue to plague chunking strategies:
● Context Fragmentation / Edge-Bleed: Critical facts and semantically coherent information units are inadvertently split across fixed chunk boundaries, leading to incomplete retrieval and inaccurate responses [6]. This is particularly problematic in structured content like code, where semantic or syntactic blocks are broken [20].
● Granularity Mismatch: Pre-defined chunk sizes cannot adapt to the varying specificity of user queries, leading to retrieved contexts that are either too broad (flooding the LLM with excess, irrelevant information) or too narrow (missing crucial context) [6].
● Cost of Change / Re-embedding Overhead: Document updates often necessitate re-running computationally intensive splitting, embedding, and indexing pipelines, which is expensive and time-consuming [19]. Even minor edits can cause many chunks to shift, leading to unnecessary recomputation [19].
● Citation Fragmentation: In complex documents like medical texts or legal documents, elements like table headers and values or tight clinical phrasing can be split across chunks, leading to LLMs citing incomplete information [19].
● Context Window Limitations: Retrieved documents consume a significant portion of the LLM's finite context window, leading to "token overhead" [6]. Overly large chunks can dilute relevance, while too many retrieved chunks can overwhelm and distract the LLM.

4. Evaluation of Chunking Strategies


Robust evaluation is crucial for assessing the reliability and utility of chunking strategies within RAG systems.

4.1. Key Metrics for Chunking Quality


Evaluation metrics focus on the quality and relevance of the retrieved context, which is directly influenced by chunking (a small computational sketch follows this list):

● Context Relevance: Quantifies the proportion of retrieved text chunks that are pertinent to the input query. This metric helps assess how effectively chunk size and retrieval parameters are configured [24].
● Context Sufficiency: Evaluates whether the retrieved context contains all the information necessary to produce the ideal output for a given input [24].
● Contextual Recall: Measures the extent to which all undisputed facts from the expected output can be directly attributed to the retrieved chunks [24].
● Contextual Precision: Assesses whether relevant chunks are ranked higher than irrelevant ones [24].
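A small sketch of how these retrieval-side metrics can be computed once relevance labels exist. `relevant` is assumed to be a set of ground-truth relevant chunk identifiers, recall is simplified to the chunk level, and Contextual Precision is approximated with an average-precision-style score.

```python
from typing import List, Set

def context_relevance(retrieved: List[str], relevant: Set[str]) -> float:
    """Proportion of retrieved chunks that are pertinent to the query."""
    return sum(c in relevant for c in retrieved) / max(len(retrieved), 1)

def contextual_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of the ground-truth relevant chunks that were actually retrieved."""
    return sum(c in relevant for c in retrieved) / max(len(relevant), 1)

def contextual_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Rank-aware score: rewards relevant chunks being ranked above irrelevant ones."""
    hits, score = 0, 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)
```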

4.2. Benchmarks Relevant to Chunking


While no principled benchmark exists solely for dynamic chunking, several benchmarks evaluate RAG performance in ways that highlight chunking's impact:

● DRAGON (Dynamic RAG Benchmark On News): The first dynamic benchmark for Russian-language RAG systems on a changing news corpus [29]. It addresses the static nature of most benchmarks and the challenges posed by continuously evolving information, which directly relates to the "cost of change" problem in chunking [29].
● GraphRAG-Bench: Evaluates GraphRAG models on hierarchical knowledge retrieval and deep contextual reasoning [29]. While focused on graphs, its assessment of hierarchical knowledge retrieval implicitly evaluates how the underlying chunking and structuring affect relational reasoning [16].
● Long-context reading comprehension datasets: DCS was evaluated on 12 diverse long-context reading comprehension datasets, including single-hop and multi-hop QA tasks, and tested on significantly longer inputs (up to 256k tokens) [13].

4.3. Best Practices for Chunking Evaluation


● Optimize Chunk Size: Test different chunk sizes to find the sweet spot that fits within model token limits while preserving context and semantic flow [6]. Chunks between 100–300 words often work well for many tasks (a sketch of a simple size/overlap sweep follows this list).
● Analyze Document Structure: Choose a chunking strategy based on the document type (structured, unstructured) and use case [3].
● Use Hybrid Approaches: Combining chunking strategies (e.g., semantic + recursive, paragraph + sliding window, agentic + embedding-based) can yield better results than using a single method [3].
● Test and Iterate: Continuous testing and iterative adjustments are essential to achieve optimal performance [3].
● Consider Chunk Boundaries, Size, and Overlap: These factors can significantly affect retrieval performance [32].
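A sketch of the "test and iterate" loop implied above: sweep chunk size and overlap, then compare a retrieval metric on held-out queries. `build_index`, `retrieve`, and `context_recall` are stand-ins for whatever indexing stack and metric are actually in use.

```python
from itertools import product

def sweep_chunking(documents, eval_queries, build_index, retrieve, context_recall,
                   sizes=(100, 200, 300), overlaps=(0, 20, 50)):
    """Grid-search chunk size and overlap; return the best setting and all scores."""
    results = {}
    for size, overlap in product(sizes, overlaps):
        index = build_index(documents, chunk_size=size, overlap=overlap)
        scores = [context_recall(retrieve(index, q), q) for q in eval_queries]
        results[(size, overlap)] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```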

5. Future Directions and Open Problems in Chunking


The field is actively exploring advanced methods to optimize chunking and address its remaining challenges:

● Truly End-to-End Differentiable Chunking: A key challenge is making inherently discrete operations like chunk boundary decisions fully differentiable, enabling seamless end-to-end training and optimization [15]. Techniques like Gumbel-Softmax are being explored for this [34] (a toy sketch follows this list).
● Reinforcement Learning for Segmentation: RL is emerging as a paradigm for optimizing segmentation, treating it as a sequential decision-making task in which an agent learns to segment sequences based on reinforcement signals [36]. This has direct implications for developing more adaptive and intelligent chunking mechanisms [37].
● Adaptive Contextualization: Beyond merely retrieving relevant chunks, future work asks how RAG systems can intelligently synthesize and re-contextualize retrieved information to optimally suit the LLM's input requirements and the specific nuances of a query [22].
● Unified Multimodal Representations: Developing more effective and efficient methods to represent and retrieve information across vastly different modalities (text, image, audio, video) without losing crucial cross-modal context remains a complex problem [17]. Modality-specific chunking and hybrid indexing are the current approaches [17].
● LLM-based Chunk-level Metadata: Generating chunk-specific contextual explanations, such as summaries or extracted entities, and prepending them to the chunk can significantly improve retrieval accuracy by providing richer context [6].
● Addressing Citation Fragmentation: Specific strategies are needed for handling structured elements like tables or tight phrasing in specialized domains, to prevent information from being split across chunks [19].
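A toy PyTorch sketch of the Gumbel-Softmax idea for differentiable boundary decisions mentioned above. Only the gating mechanism is shown; the retriever, losses, and training loop are omitted, and the module is an illustration rather than any published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryGate(nn.Module):
    """For each pair of adjacent sentence embeddings, emit a discrete split / no-split
    decision whose gradient flows via the Gumbel-Softmax straight-through trick."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 2)  # logits for [keep-merged, split-here]

    def forward(self, sent_embs: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
        # sent_embs: (n_sentences, dim) -> neighbour pairs: (n_sentences - 1, 2 * dim)
        pairs = torch.cat([sent_embs[:-1], sent_embs[1:]], dim=-1)
        logits = self.scorer(pairs)
        # hard=True: one-hot decisions in the forward pass, soft gradients in the backward pass
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)
        return decisions[:, 1]  # 1.0 wherever a chunk boundary is placed

# usage sketch: boundaries = BoundaryGate(dim=384)(torch.randn(10, 384))
```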
The following overview summarizes key aspects of chunking and splitting strategies in Retrieval-Augmented Generation (RAG), including their evolution, major papers and tools, their impact, and relevant metrics. It also addresses the problems of "edge bleed," "overuse of tokens for context," and the need for "query-adaptive splitting."

The Crucial Role and Evolution of Chunking in RAG


Chunking, the process of dividing large documents into smaller, manageable pieces, is a foundational and often underestimated step in RAG. Its
primary purpose is to prepare the knowledge base for efficient retrieval, ensuring that the retriever can identify relevant information precisely without
overwhelming the Large Language Model (LLM) with excessive context.

However, the "right" way to chunk is highly context-dependent and represents a significant pain point in current RAG systems. The core challenges
revolve around:
1.​ Edge Bleed: This occurs when critical facts or an answer's components are split across chunk boundaries, leading to incomplete or
fragmented context retrieval. The retriever might bring back only half the evidence, causing the LLM to provide inaccurate or incomplete
answers. This is a prevalent issue, particularly in domains like legal, medical, and codebases where precise information and
cross-referencing are vital.
2.​ Granularity Mismatch: Different queries require different levels of detail. Some questions need a single clause, while others demand a
whole section or even cross-document synthesis. Static chunking struggles to provide this adaptive granularity, leading to either "missing
answers" (too fine-grained chunking for a broad query) or "flooding the generator with excess context" (too coarse-grained chunking for a
specific query).
3.​ Overuse of Tokens/Context Budget: LLMs have strict token limits and processing large context windows incurs significant computational
costs (higher API calls, increased latency) and can even degrade performance (the "needle-in-a-haystack" problem where the LLM struggles
to find relevant information amidst noise). Efficient chunking is critical for minimizing the context sent to the LLM, directly impacting
cost-effectiveness and latency, especially in high-volume enterprise applications.
4.​ Cost of Change/Re-indexing: Documents evolve. Re-running computationally expensive split → embed → index pipelines for every minor
document update is costly and time-consuming. Current incremental update methods often operate at the whole-chunk level, meaning a
small change within a chunk still necessitates re-embedding the entire chunk.

Each entry below lists its Contribution, the Gap Addressed, its Impact/Citations (approximate, as of mid-2025), and the Metrics & Benchmarks typically used.

Traditional/Rule-Based Chunking

● Fixed-Size Chunking (Early/Basic)
○ Contribution: Splits text into uniformly sized segments (characters, words, or tokens), often with overlap to retain context.
○ Gap Addressed: Simplicity; fitting content into the context window.
○ Impact/Citations: Ubiquitous as a baseline due to ease of implementation; high usage, but not tied to a single paper.
○ Metrics & Benchmarks: Context Precision, Context Recall, F1, Exact Match (for downstream QA); evaluated on standard QA datasets.
● Recursive Character Text Splitter (LangChain, LlamaIndex utility)
○ Contribution: Splits text using a list of separators (e.g., "\n\n", "\n", " ") recursively until chunks meet a desired size, preserving structural integrity.
○ Gap Addressed: Breaking context mid-sentence or mid-paragraph; better semantic and structural preservation than fixed-size.
○ Impact/Citations: Widely adopted in RAG frameworks, high practical impact; no specific foundational paper, but embedded in many RAG system descriptions.
○ Metrics & Benchmarks: Same as fixed-size; also chunk size distribution.
● Sentence/Paragraph-Level Chunking (Early/Basic NLP)
○ Contribution: Divides documents into individual sentences or paragraphs.
○ Gap Addressed: Granularity at natural semantic breaks.
○ Impact/Citations: Foundational in NLP, often used in initial RAG pipelines.
○ Metrics & Benchmarks: Context Precision, Recall, F1.
● Semantic Chunking (e.g., LlamaIndex's SemanticSplitterNodeParser)
○ Contribution: Uses embeddings to group semantically similar sentences/sub-sequences, forming chunks based on context rather than arbitrary size; often involves similarity thresholds.
○ Gap Addressed: Semantic coherence; avoiding splitting related content.
○ Impact/Citations: Growing adoption, but still heuristic in boundary determination; ~dozens to hundreds of citations for specific implementations.
○ Metrics & Benchmarks: Correctness, Relevancy (response & source node), Cosine Similarity, Average Chunk Length.
● Layout-Aware Chunking (e.g., Amazon Textract, Unstructured.io)
○ Contribution: Segments documents by considering visual and structural elements such as headings, subheadings, tables, and paragraphs, preserving inherent structure.
○ Gap Addressed: Handling complex, structured documents (PDFs, web pages) where visual layout is critical to meaning.
○ Impact/Citations: High practical relevance for document processing; ~hundreds for tools/papers such as Docugami and Unstructured.io.
○ Metrics & Benchmarks: Accuracy of structured extraction, downstream QA.
● Windowed Summarization Chunking (e.g., LlamaIndex's ContextWindowNodeParser)
○ Contribution: Enriches each chunk with summaries of previous/subsequent chunks or a sliding window of context to improve continuity.
○ Gap Addressed: Maintaining broader context and continuity across chunks.
○ Impact/Citations: Emerging technique; ~dozens.
○ Metrics & Benchmarks: Context Relevance, downstream QA metrics.
● Agentic Chunking / LLM-Based Chunking (Emerging)
○ Contribution: Uses an LLM to dynamically determine optimal split points and chunk content based on semantic understanding and query relevance (e.g., "summarize this section for X purpose").
○ Gap Addressed: Highly adaptive semantic segmentation.
○ Impact/Citations: Early research; high potential but often expensive/slow; ~dozens.
○ Metrics & Benchmarks: Recall, Precision, IoU (Information Overlap/Underlap).

Dynamic & Query-Adaptive Chunking Research

● Dynamic Chunking & Selection (DCS)
○ Contribution: Dynamically selects chunks (based on pre-computed sentence-level semantic similarity splits) at retrieval time using a question-aware classifier.
○ Gap Addressed: Selects relevant segments, but boundaries stay fixed during pre-processing.
○ Impact/Citations: ~100-200.
○ Metrics & Benchmarks: QA F1/EM.
● MacRAG
○ Contribution: Hierarchical RAG framework with top-down offline indexing (pre-indexed slices at multiple granularities) and bottom-up multi-scale adaptive retrieval; merges pre-defined levels.
○ Gap Addressed: Granularity mismatch, enabling query-aware expansion; still relies on pre-made slices.
○ Impact/Citations: ~50-100.
○ Metrics & Benchmarks: QA F1/EM, efficiency.
● H-Net "Dynamic Chunking"
○ Contribution: Fully differentiable boundary routing inside a language model during pre-training, learning content/context-dependent segmentation.
○ Gap Addressed: Replaces tokenization with learned chunking within the LM; not designed for external-corpus RAG or query-time adaptation.
○ Impact/Citations: ~50-100.
○ Metrics & Benchmarks: LM performance metrics.
● Financial Report Chunking (2024)
○ Contribution: Proposes chunking based on structural elements (titles, tables) in financial documents, aiming for optimal chunk size without manual tuning.
○ Gap Addressed: Domain-specific structure preservation, reducing noise in financial RAG.
○ Impact/Citations: Newer; ~dozens.
○ Metrics & Benchmarks: Chunking efficiency (number of chunks), QA accuracy.
● GraphRAG / Knowledge Graph RAG (e.g., Microsoft GraphRAG, 2024)
○ Contribution: Indexes information in structured knowledge graphs (entities, relations) and retrieves relevant subgraphs/nodes; chunks are often created during KG construction.
○ Gap Addressed: Limitations of pure vector search, structured reasoning, multi-hop questions.
○ Impact/Citations: High and rapidly growing; thousands for graph neural networks in NLP generally, ~hundreds for specific GraphRAG frameworks.
○ Metrics & Benchmarks: Graph construction quality, multi-hop QA, fact verification.

Metrics for Chunking


● Context Precision@k: Measures how many of the top-k retrieved chunks are actually relevant, i.e., the ratio of relevant chunks within the top-k results. Gap addressed: relevance of retrieved context. Status: standard RAG evaluation. Formula: P@k = (number of relevant chunks in top-k) / k.
● Context Recall@k: Measures the proportion of relevant chunks from the entire knowledge base that are captured within the top-k retrieved results. Gap addressed: completeness of retrieved context. Status: standard RAG evaluation. Formula: R@k = (number of relevant chunks in top-k) / (total number of relevant chunks in the corpus).
● F1 Score (for retrieval): Harmonic mean of Context Precision and Recall. Gap addressed: balanced measure of retrieval quality. Status: standard RAG evaluation. Formula: F1 = 2·P·R / (P + R).
● Mean Average Precision (MAP): Overall view of ranking quality across multiple queries, considering rank position. Gap addressed: ranking quality. Status: common in information retrieval.
● Normalized Discounted Cumulative Gain (NDCG): Evaluates ranked retrieval results, giving higher weight to relevant chunks that appear higher in the ranking. Gap addressed: ranked-list quality. Status: common in information retrieval.
● Edge-Coverage@k (proposed): Fraction of gold evidence tokens captured within the retrieved context; specifically targets edge-bleed by measuring whether critical answer-straddling information is retrieved. Gap addressed: a direct measure of edge-bleed, specific to query-adaptive chunking. Status: novel. (A sketch of this metric and Context Budget follows this list.)
● Context Budget (proposed): Total tokens shipped to the LLM (post-chunking and retrieval). Gap addressed: efficiency, cost, latency, token overuse. Status: novel as a specific chunking-driven metric. Formula: total token count.
● QA Exact Match / F1: End-to-end evaluation of the RAG system's answer accuracy against ground truth. Gap addressed: overall system performance. Status: standard QA metric.
● IoU (Information Overlap/Underlap): Measures the overlap between retrieved chunks and the ideal ground-truth chunks. Gap addressed: chunk quality, precision, token efficiency. Status: used in some chunking evaluations.
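A small sketch of the two proposed metrics above, Edge-Coverage@k and Context Budget, with whitespace tokenization standing in for a real tokenizer.

```python
from typing import List, Set

def edge_coverage_at_k(retrieved_chunks: List[str], gold_evidence_tokens: Set[str], k: int) -> float:
    """Fraction of gold evidence tokens that appear anywhere in the top-k retrieved chunks.
    High Context Recall with low Edge-Coverage@k is a signature of edge-bleed."""
    if not gold_evidence_tokens:
        return 1.0
    retrieved_tokens = set(" ".join(retrieved_chunks[:k]).split())
    return len(gold_evidence_tokens & retrieved_tokens) / len(gold_evidence_tokens)

def context_budget(retrieved_chunks: List[str], k: int) -> int:
    """Total number of tokens shipped to the LLM after retrieval."""
    return sum(len(chunk.split()) for chunk in retrieved_chunks[:k])
```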

Benchmarks for Chunking


● HotpotQA, Musique, NQ: Standard multi-hop and single-hop QA datasets; can be adapted for chunking evaluation. Gap addressed: general QA performance. Status: widely used. Metrics: QA F1/EM.
● EdgeQA-Long (proposed): A custom split of existing QA datasets in which gold spans purposely straddle random sentence/chunk boundaries. Gap addressed: specifically designed to evaluate edge-bleed and the ability to cover answer-straddling information. Status: novel. Metrics: Edge-Coverage@k, QA F1/EM.
● LegalBench-CrossSec (proposed): Long contracts with cross-clause answers, to evaluate chunking in the legal domain, where context often spans multiple, non-contiguous sections. Status: novel. Metrics: Edge-Coverage@k, QA F1/EM.
● DRAGON (2025): A dynamic benchmark with a regularly updated corpus of Russian news. Gap addressed: robustness to document evolution and dynamic changes. Status: newer, growing impact. Metrics: standard RAG metrics, stability over time.
● XRAG & GraphRAG-Bench: Evaluate structured and knowledge-graph-based RAG. Gap addressed: structured-data handling, multi-hop reasoning. Status: emerging. Metrics: specific metrics for graph structures.
The core task is to decide whether to merge two adjacent micro-chunks, chunk_i and chunk_{i+1}, based on the user's query.

The Equations
First, let's simplify the notation for clarity:
● q: the query vector.
● e_i: the embedding vector for the first micro-chunk.
● e_{i+1}: the embedding vector for the second micro-chunk.
1. Original FuseNet Equation:
$p_{\mathrm{merge}} = \sigma\big(q^{\top} W_1 e_i + e_i^{\top} W_2 e_{i+1}\big)$
This equation has two main parts:
○ Query-Chunk Relevance: $q^{\top} W_1 e_i$ measures how relevant the first chunk is to the query.
○ Chunk-Chunk Cohesion: $e_i^{\top} W_2 e_{i+1}$ measures how semantically similar the two chunks are to each other.
It then simply adds these two scores together.
2. Proposed Bahdanau-style Equation:
$p_{\mathrm{merge}} = \sigma\big(v^{\top} \tanh(W_q q + W_i e_i + W_{i+1} e_{i+1})\big)$
This is different: it first projects all three vectors into a common space using trainable weight matrices ($W_q$, $W_i$, $W_{i+1}$), adds them up, passes the sum through a non-linear function (tanh), and then uses a final vector $v$ to compute a single score. This is also known as "additive attention."

The Example: Step-by-Step


Let's use a simplified 2D vector space for our example.

The Setup:
● Query: "What is the termination clause?"
○ Let's say its vector is q = [2, 1]. It is strong on "legal terms" (x-axis) and weak on "financials" (y-axis).
● Chunk i (e_i): "...the agreement is terminated upon..."
○ Its vector is e_i = [3, 0]. Very strong on legal terms, neutral on financials.
● Chunk i+1 (e_{i+1}): "...a fee of $500 will be applied."
○ Its vector is e_{i+1} = [0, 3]. Neutral on legal terms, very strong on financials.

Which is Better?

Intuitively, we should SPLIT. The query is about the termination clause, and while e_i is relevant, e_{i+1} is about a financial penalty, which is a different topic.

Let's see how the equations handle this. For simplicity, assume the W matrices are identity matrices and v = [1, 1].

Calculation with the Original Equation

1. Query-Chunk Relevance: q · e_i = [2, 1] · [3, 0] = (2 × 3) + (1 × 0) = 6.
○ Interpretation: High score. The first chunk is very relevant to the query.
2. Chunk-Chunk Cohesion: e_i · e_{i+1} = [3, 0] · [0, 3] = (3 × 0) + (0 × 3) = 0.
○ Interpretation: Zero score. The chunks are completely unrelated (orthogonal in this case).
3. Final Score: 6 + 0 = 6.
○ Problem: The final score is high, suggesting a MERGE. The high relevance of the first chunk completely overshadowed the fact that the second chunk is irrelevant. The equation fails to penalize the join effectively.

Calculation with the Bahdanau-style Equation
1. Project & Add: We add the three vectors:
○ q + e_i + e_{i+1} = [2, 1] + [3, 0] + [0, 3] = [5, 4].
2. Non-linearity (tanh): Apply tanh element-wise to the result.
○ tanh([5, 4]) ≈ [0.999, 0.999]. The tanh squashes the values into the range (-1, 1), capturing the direction of the combined vector [1].
3. Final Score: Compute the dot product with v:
○ [1, 1] · [0.999, 0.999] ≈ 1.998.

Now, let's consider a case where a MERGE is correct. Imagine e_{i+1} was "...thirty days written notice." Its vector might be e'_{i+1} = [2.5, 0].
1. Project & Add:
○ q + e_i + e'_{i+1} = [2, 1] + [3, 0] + [2.5, 0] = [7.5, 1].
2. Non-linearity (tanh):
○ tanh([7.5, 1]) ≈ [0.999, 0.761].
3. Final Score:
○ [1, 1] · [0.999, 0.761] ≈ 1.76.

Notice the scores are different. The key is that the trainable weights (the W matrices and v) will learn to produce a high score only when all three vectors (q, e_i, e_{i+1}) align in the way that signifies a correct merge.
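The worked example above, reproduced numerically with identity weight matrices and v = [1, 1].

```python
import numpy as np

q      = np.array([2.0, 1.0])   # "What is the termination clause?"
e_i    = np.array([3.0, 0.0])   # "...the agreement is terminated upon..."
e_bad  = np.array([0.0, 3.0])   # "...a fee of $500 will be applied."  (should NOT merge)
e_good = np.array([2.5, 0.0])   # "...thirty days written notice."     (should merge)
v      = np.array([1.0, 1.0])

# Original score with identity W1, W2: relevance + cohesion.
print(q @ e_i + e_i @ e_bad)           # 6.0 + 0.0 = 6.0 -> high, wrongly favours a merge
# Bahdanau-style scores with identity projections.
print(v @ np.tanh(q + e_i + e_bad))    # ~1.998 (split case)
print(v @ np.tanh(q + e_i + e_good))   # ~1.76  (merge case)
```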

Conclusion: Which is Better?


The Bahdanau-style (additive attention) equation is superior for this task.
●​ Holistic Decision: It forces the model to consider the query, the first chunk, and the second chunk simultaneously. The final
decision is based on the interaction of all three, rather than separate, independent scores that are just added together.
●​ More Expressive Power: The original equation is constrained. If query-chunk relevance is very high, it can force a merge even if
the chunks themselves are dissimilar. The Bahdanau-style approach, with its non-linearity and joint projection, can learn much more
complex decision boundaries. It can learn that a merge is only good if both chunks are individually relevant to the query and they
are semantically cohesive.
●​ Robustness: It's less likely to be fooled by one part of the equation generating an extreme value. The tanh function helps moderate
the scores before the final decision, leading to a more stable learning process.
The one pain-point that still hurts the RAG world
➡ We still don’t know how to choose query-specific chunk boundaries at retrieval-time without paying the brutal cost of re-embedding
or re-indexing the whole corpus.

1. Why the problem is real (and still unsolved)


● Edge-bleed: facts spill across chunk edges; the retriever brings back half the evidence.
○ Today's best attempt: semantic / log-prob splitters such as LGMGC split where the EOS log-prob spikes, giving static coherent spans. (arXiv)
○ What still breaks: if the user later asks a question whose answer straddles that edge, recall drops; no splitter can foresee every future query.
● Granularity mismatch: some questions need a clause, others a whole section.
○ Today's best attempt: Mix-of-Granularity routers choose a pre-computed fine vs. coarse chunk at retrieval time. (arXiv)
○ What still breaks: the router can only pick from the two or three sizes you indexed; it can't carve a brand-new boundary inside an existing chunk.
● Multimodal / layout cues: tables spanning pages, code blocks, figures.
○ Today's best attempt: Vision-Guided Chunking and cAST handle layout and ASTs, but still freeze boundaries at ingest. (arXiv, arXiv)
○ What still breaks: the first query about "cell B12 on page 7" forces retrieval of a massive block around the table; precision tanks.
● Cost of change: documents evolve; re-running heavy split → embed → index pipelines is expensive.
○ Today's best attempt: EraRAG & GraphRAG add incremental updates, but only at the whole-chunk level. (arXiv, Medium)
○ What still breaks: if the right fix is "insert a new boundary in the middle", you're back to full re-embedding.

In short, all mainstream research fixes the segmentation once at ingest time. The moment a user asks an unforeseen, highly-focused
question, the pipeline either (a) misses the answer, or (b) floods the generator with excess context.

2. Why this is publishable material

1.​ No principled benchmark exists. Papers like Rethinking Chunk Size or Chroma’s evaluation analyse only static chunk sets.
(arXiv, research.trychroma.com)​

2. Theoretical gap. The retrieval objective is query-conditioned recall & precision, yet chunking is treated as a query-agnostic pre-processing step. The optimisation is mis-aligned, which makes this ripe ground for new theory.
3. Practical impact. A 10-20 % jump in end-to-end QA accuracy is still on the table in legal, medical, and codebase settings where edge-bleed is rampant; every vendor complains about it in blog posts, but no peer-reviewed solution exists. (Medium, IBM Developer)

3. Concrete research directions nobody has cracked


● Token-level "micro-index" + on-the-fly boundary fusion
○ Sketch: Pre-compute embeddings every n tokens; at query time a small router network decides where to cut/join before passing spans to ranking/generation. (A toy version appears after this list.)
○ Novelty lever: Achieves query-adaptive boundaries without re-embedding; reminiscent of dynamic frame pooling in video but unused in NLP chunking.
● Differentiable boundary gating
○ Sketch: Treat boundary decisions as latent Bernoulli gates; train end-to-end with REINFORCE or a straight-through estimator, using answer F1 as the reward.
○ Novelty lever: Makes chunking learnable jointly with retrieval; current work (MoG, LGMGC) uses heuristics or two-stage pipelines.
● Uncertainty-guided refinement loop
○ Sketch: Use the generator's token-level entropy to ask "did I get enough context?"; if not, slide the window and fetch the missing neighbouring micro-chunks.
○ Novelty lever: Marries log-prob splitting and post-hoc confidence estimation into a closed loop, which hasn't been formalised yet.
● Version-stable hashing of boundaries
○ Sketch: Learn boundary positions that minimise both retrieval loss and a "boundary churn" penalty when docs update.
○ Novelty lever: Addresses costly re-indexing, a pain companies voice but academia hasn't modelled.
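A toy version of the first idea above (token-level micro-index plus query-time boundary fusion); the stride, threshold, and greedy merge rule are assumptions for illustration, not a published algorithm.

```python
import numpy as np
from typing import Callable, List, Tuple

def build_micro_index(tokens: List[str], embed: Callable[[List[str]], np.ndarray],
                      stride: int = 32) -> Tuple[List[List[str]], np.ndarray]:
    """Ingest-time step, done once: embed every `stride`-token micro-chunk."""
    micro = [tokens[i:i + stride] for i in range(0, len(tokens), stride)]
    vecs = embed([" ".join(m) for m in micro])
    return micro, vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def fuse_at_query_time(query_vec: np.ndarray, micro: List[List[str]], vecs: np.ndarray,
                       threshold: float = 0.4) -> List[str]:
    """Query-time step: greedily merge runs of query-relevant micro-chunks into retrieval
    units, so chunk boundaries adapt to the query without re-embedding the corpus."""
    q = query_vec / np.linalg.norm(query_vec)
    relevant = (vecs @ q) >= threshold
    spans, start = [], None
    for i, keep in enumerate(list(relevant) + [False]):  # trailing False flushes the last run
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            spans.append(" ".join(token for m in micro[start:i] for token in m))
            start = None
    return spans
```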

Any of these could yield:

●​ an algorithm → an open-source prototype​

●​ a dynamic-chunk benchmark (e.g., modify LongBench so answers purposely cross arbitrary sentence edges)​

●​ ablation vs. fixed, semantic, MoG, LGMGC baselines.​

4. Framing the paper

1.​ Title idea: “Query-Adaptive Differentiable Chunking for Retrieval-Augmented Generation.”​

2. Problem formalisation: maximise the expected answer accuracy
$\mathbb{E}_{q\sim\mathcal{D}}\big[\mathrm{F1}\big(q,\, G(R(S_{\theta}(D), q))\big)\big]$,
where $S_{\theta}$ is a learnable splitter, $R$ the retriever, and $G$ the generator.

3.​ Dataset & metrics: introduce Edge-Coverage@k (fraction of gold evidence tokens captured) and Context Budget (tokens sent to
the LLM).​

4.​ Baselines: fixed 512-tok + 20 % overlap, LGMGC, MoG, Vision-Guided, Late-Chunking retrieval.​

5. Results to target: +8–12 F1 on MultihopRAG while cutting context tokens by 30–40 %.

6.​ Ablations: router depth, micro-token stride, boundary churn under doc revisions.​

TL;DR
Static chunking is a brittle heuristic; we still lack cost-efficient, query-adaptive splitting.​
Solving it—e.g., via differentiable boundary gating or token-level micro-indexes—has clear room for a publishable
contribution and would plug one of the most complained-about holes in production RAG systems.
