
The Evolution of Chunking Strategies in Retrieval-Augmented Generation
1. Introduction to Chunking in RAG
Chunking is the essential process of dividing large documents into smaller, manageable segments called "chunks" for efficient retrieval and processing within Retrieval-Augmented Generation (RAG) systems [1]. This technique is crucial for optimizing RAG performance by enabling more precise matching between queries and relevant text, reducing noise, and enhancing efficiency by allowing faster processing of smaller units [1]. Well-designed chunks are vital for preserving logical coherence and context, balancing specificity with the necessary surrounding information [1].

The choice of chunking strategy is foundational to an effective RAG system, directly influencing retrieval quality, model efficiency, and the system's ability to capture relevant context [4]. Poor chunking can lead to fragmented information, excessive context loss, or the inclusion of irrelevant details, undermining overall performance [6].

2. Evolution of Chunking Strategies and Methods


The evolution of chunking strategies reflects a continuous effort to balance context preservation, retrieval precision, and computational efficiency,
moving from simple, fixed rules to more adaptive and intelligent mechanisms.

2.1. Foundational Approaches


Early RAG systems typically employed straightforward chunking methods, prioritizing simplicity:

● Fixed-size Chunking: The most common and straightforward approach, dividing text into uniform segments based on a predetermined character, word, or token count [8]. To mitigate context fragmentation, an "overlap" feature is often introduced, repeating a certain number of tokens from the end of one chunk at the start of the next [1] (a minimal code sketch of this and recursive splitting follows this list).
○ Advantages: Simplicity, efficiency, consistency, low computational requirements [1].
○ Disadvantages: Context fragmentation (splitting sentences or logical units), inflexibility to varying content density, potential information loss, and sub-optimal performance for heterogeneous content [1]. It can lead to a "granularity mismatch" where critical information is inadvertently split [10].
● Recursive Character Text Splitting: A more adaptive approach that breaks text into chunks using a hierarchical order of predefined delimiters (e.g., paragraphs, then sentences, then spaces) [1]. It aims to preserve natural language boundaries and adapt to document structures [12].
○ Advantages: Preserves natural language boundaries, adapts flexibly to document structure [12].
○ Disadvantages: May produce very small or uneven chunks, and individual chunks might lack comprehensive global context [12].
● Sentence-based Chunking: Splits documents at natural sentence boundaries, grouping a defined number of sentences per chunk [3].
○ Advantages: Ensures chunks contain coherent ideas, preserves semantic integrity, easily adjustable [3].
● Paragraph-based Chunking: Divides text into chunks based on paragraph boundaries, with each paragraph forming a distinct chunk [3].
○ Advantages: Preserves logical structure and flow, ensuring each chunk represents a complete thought [3].
● Document-based Chunking: Treats an entire document as a single chunk or divides it minimally to preserve its complete structure and context [1].
○ Advantages: Full context preservation, ideal for highly structured texts like legal or medical documents [1].
○ Disadvantages: Scalability issues for very large documents, reduced efficiency, and limited specificity in retrieval [1].
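A minimal sketch of the first two strategies above: fixed-size splitting with overlap, and recursive splitting over a delimiter hierarchy. The function names, default sizes, and separators are illustrative assumptions, not taken from any particular library.

```python
from typing import List

def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into fixed-size word windows, repeating `overlap` words across boundaries."""
    assert 0 <= overlap < chunk_size, "overlap must be smaller than the chunk size"
    words = text.split()
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the document
    return chunks

def recursive_chunks(text: str, max_chars: int = 800,
                     separators: tuple = ("\n\n", "\n", ". ", " ")) -> List[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    out, buffer = [], ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate          # keep packing pieces into the current chunk
        else:
            if buffer:
                out.append(buffer)
            # the piece alone may still be too long: fall back to finer separators
            out.extend(recursive_chunks(piece, max_chars, finer))
            buffer = ""
    if buffer:
        out.append(buffer)
    return out
```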

2.2. Context-Aware and Structured Chunking


These strategies leverage document structure and semantic meaning to create more intelligent chunks:

● Markdown-header-based Chunking: Splits text based on markdown headers, using them as contextual information for each chunk [6].
○ Advantages: Aligns with the author's logical organization, maintains coherent context within sections, can boost accuracy [12].
○ Disadvantages: Can miss content spanning multiple headers or perform poorly with inconsistent header usage [12].
● Semantic Chunking: Identifies natural breakpoints by embedding each sentence and calculating the cosine distance between consecutive sentence embeddings [6]. Splits occur where semantic similarity is low, grouping semantically similar sentences [2] (a minimal sketch follows this list).
○ Advantages: Creates context-aware chunks, improves retrieval accuracy by maintaining semantic integrity [3].
○ Disadvantages: Can produce uneven chunks, requires the computational cost of embedding the corpus, and needs a complex setup for measuring semantic shifts [3].
● Agentic Chunking: An experimental approach where an LLM determines optimal document splitting based on semantic meaning and content structure (e.g., paragraph types, section headings) [7]. These "actionable" chunks are optimized for specific purposes such as answering a question or summarizing [1].
○ Advantages: Goal-oriented, potentially more efficient for specific tasks [1].
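A minimal sketch of the semantic-chunking recipe described above: embed the sentences, then start a new chunk wherever the cosine similarity between neighbouring sentences drops below a threshold. The `embed` callable is a placeholder for whichever sentence-embedding model is in use, and the threshold is an assumption to be tuned.

```python
import numpy as np
from typing import Callable, List

def semantic_chunks(sentences: List[str],
                    embed: Callable[[List[str]], np.ndarray],
                    threshold: float = 0.6) -> List[str]:
    """Group sentences into chunks, splitting where adjacent-sentence similarity is low."""
    if not sentences:
        return []
    vectors = embed(sentences)  # expected shape: (n_sentences, dim)
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vectors[i - 1] @ vectors[i])  # cosine similarity of neighbours
        if similarity < threshold:      # semantic breakpoint: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```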

2.3. Dynamic and Learnable Chunking


Recent advancements focus on dynamically determining chunk boundaries and selecting relevant segments, often through learnable mechanisms:

● Dynamic Chunking and Selection (DCS): Proposed to improve LLM comprehension of long texts by adaptively dividing contexts into variable-length chunks based on semantic similarity between adjacent sentences [13]. It then uses a question-aware classifier to select the most relevant chunks [14] (a toy sketch of the two stages follows this list).
○ Problem Addressed: Fragmentation of logical structure and difficulty in grasping semantic connections caused by fixed chunking [14].
○ Impact: Consistently outperforms strong baselines on QA benchmarks, maintains robustness across a wide range of input lengths (up to 256k tokens), and has a minimalist design for ease of implementation [13].
● H-Net Dynamic Chunking: Introduces a novel dynamic chunking (DC) mechanism that automatically learns content- and context-dependent segmentation strategies jointly with language model pre-training [15]. It replaces conventional tokenization with an end-to-end hierarchical network [15].
○ Problem Addressed: Limitations of fixed-vocabulary tokenization and handcrafted pre-processing heuristics [15].
○ Impact: Matches the efficiency of tokenized pipelines while substantially improving modeling ability, showing increased robustness and data efficiency in languages and modalities with weaker tokenization heuristics (e.g., Chinese, code, DNA sequences) [15].
Note: its primary focus is on LM pre-training, not RAG over an external corpus [16].
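A toy sketch of the two DCS stages as summarised above: adaptive, variable-length chunks from adjacent-sentence similarity, followed by question-aware selection. The learned classifier from the paper is replaced here by a simple similarity ranker, so this illustrates the idea rather than the published method; `embed`, the threshold, and `top_k` are assumptions.

```python
import numpy as np
from typing import Callable, List

def dynamic_chunk_and_select(sentences: List[str], question: str,
                             embed: Callable[[List[str]], np.ndarray],
                             split_threshold: float = 0.55, top_k: int = 3) -> List[str]:
    """Stage 1: cut variable-length chunks where adjacent sentences diverge semantically.
    Stage 2: keep the chunks most relevant to the question (stand-in for DCS's classifier)."""
    vecs = embed(sentences)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    groups, current = [], [0]
    for i in range(1, len(sentences)):
        if float(vecs[i - 1] @ vecs[i]) < split_threshold:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)
    # Question-aware selection: rank chunks by mean similarity to the question embedding.
    q = embed([question])[0]
    q = q / np.linalg.norm(q)
    ranked = sorted(groups, key=lambda idx: -float(np.mean(vecs[idx] @ q)))
    return [" ".join(sentences[i] for i in idx) for idx in ranked[:top_k]]
```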

2.4. Multimodal Chunking


Expanding RAG to include diverse data types like images, audio, and video presents unique chunking challenges [17]:

● Challenges: Data heterogeneity (PDFs with scanned tables, images, audio), the need for custom preprocessing, modality-specific chunking strategies, and different embedding techniques for each modality [17]. Preserving context across modalities (e.g., a table in a PDF within an email thread) is particularly difficult [17].
● Approaches:
○ Text-Only RAG: Converts all multimodal data into text (via OCR, captioning, Speech-to-Text) and uses a standard text-based RAG pipeline [18]. Simple and efficient, but it can lose crucial context during conversion [18].
○ Text Retrieval with Vision-Language Model (VLM): Converts non-text data to text for retrieval, but during generation provides both the retrieved text and the original multimodal content to a VLM for richer responses [18]. Preserves more context but adds computational overhead [18].
○ Modality-Specific Processing: Involves a router to detect file types, modality-specific extractors (e.g., pdfminer for PDFs, pandas for tables, Whisper for audio), custom preprocessors and chunkers (e.g., semantic chunking for text, row-based for tables, layout block grouping for PDFs), and different embedders for each modality [17]. Hybrid indexes and modality-aware routing are used to manage diverse data effectively [17]. A skeletal sketch of such a router appears below.
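A skeletal sketch of the modality-specific pipeline described above: a router maps file types to per-modality chunkers. The chunker bodies are deliberately left as stubs; a real pipeline would plug in extractors such as pdfminer, pandas, or Whisper at those points.

```python
from pathlib import Path
from typing import Callable, Dict, List

def chunk_plain_text(path: Path) -> List[str]:
    # simplest case: paragraph-based chunking of a plain-text file
    return [p for p in path.read_text(encoding="utf-8").split("\n\n") if p.strip()]

def chunk_table_rows(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. load with pandas and emit row-group chunks")

def chunk_pdf_layout(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. extract layout blocks and group them")

def chunk_audio(path: Path) -> List[str]:
    raise NotImplementedError("stub: e.g. transcribe with a speech-to-text model, then chunk the text")

# Router: file extension -> modality-specific chunker.
CHUNKERS: Dict[str, Callable[[Path], List[str]]] = {
    ".txt": chunk_plain_text, ".md": chunk_plain_text,
    ".csv": chunk_table_rows,
    ".pdf": chunk_pdf_layout,
    ".wav": chunk_audio, ".mp3": chunk_audio,
}

def route_and_chunk(path: Path) -> List[str]:
    chunker = CHUNKERS.get(path.suffix.lower())
    if chunker is None:
        raise ValueError(f"no chunker registered for {path.suffix!r}")
    return chunker(path)
```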

3. Pain Points and Challenges in Chunking


Despite advancements, several persistent challenges continue to plague chunking strategies:
● Context Fragmentation / Edge-Bleed: Critical facts and semantically coherent information units are inadvertently split across fixed chunk boundaries, leading to incomplete retrieval and inaccurate responses [6]. This is particularly problematic in structured content like code, where semantic or syntactic blocks are broken [20].
● Granularity Mismatch: Pre-defined chunk sizes cannot adapt to the varying specificity of user queries, leading to retrieved contexts that are either too broad (flooding the LLM with excess, irrelevant information) or too narrow (missing crucial context) [6].
● Cost of Change / Re-embedding Overhead: Document updates often necessitate re-running computationally intensive splitting, embedding, and indexing pipelines, which is expensive and time-consuming [19]. Even minor edits can cause many chunks to shift, leading to unnecessary recomputation [19].
● Citation Fragmentation: In complex documents like medical texts or legal documents, elements like table headers and values or tight clinical phrasing can be split across chunks, leading to LLMs citing incomplete information [19].
● Context Window Limitations: Retrieved documents consume a significant portion of the LLM's finite context window, leading to "token overhead" [6]. Overly large chunks can dilute relevance, while too many retrieved chunks can overwhelm and distract the LLM.

4. Evaluation of Chunking Strategies


Robust evaluation is crucial for assessing the reliability and utility of chunking strategies within RAG systems.

4.1. Key Metrics for Chunking Quality


Evaluation metrics focus on the quality and relevance of the retrieved context, which is directly influenced by chunking (a small computational sketch follows this list):

● Context Relevance: Quantifies the proportion of retrieved text chunks that are pertinent to the input query. This metric helps assess how effectively chunk size and retrieval parameters are configured [24].
● Context Sufficiency: Evaluates whether the retrieved context contains all the information necessary to produce the ideal output for a given input [24].
● Contextual Recall: Measures the extent to which all undisputed facts from the expected output can be directly attributed to the retrieved chunks [24].
● Contextual Precision: Assesses whether relevant chunks are ranked higher than irrelevant ones [24].
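A small sketch of how these retrieval-side metrics can be computed once relevance labels exist. `relevant` is assumed to be a set of ground-truth relevant chunk identifiers, recall is simplified to the chunk level, and Contextual Precision is approximated with an average-precision-style score.

```python
from typing import List, Set

def context_relevance(retrieved: List[str], relevant: Set[str]) -> float:
    """Proportion of retrieved chunks that are pertinent to the query."""
    return sum(c in relevant for c in retrieved) / max(len(retrieved), 1)

def contextual_recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Share of the ground-truth relevant chunks that were actually retrieved."""
    return sum(c in relevant for c in retrieved) / max(len(relevant), 1)

def contextual_precision(retrieved: List[str], relevant: Set[str]) -> float:
    """Rank-aware score: rewards relevant chunks being ranked above irrelevant ones."""
    hits, score = 0, 0.0
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            hits += 1
            score += hits / rank
    return score / max(hits, 1)
```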

4.2. Benchmarks Relevant to Chunking


While no principled benchmark exists solely for dynamic chunking, several benchmarks evaluate RAG performance in ways that highlight chunking's impact:

● DRAGON (Dynamic RAG Benchmark On News): The first dynamic benchmark for Russian-language RAG systems on a changing news corpus [29]. It addresses the static nature of most benchmarks and the challenges posed by continuously evolving information, which directly relates to the "cost of change" problem in chunking [29].
● GraphRAG-Bench: Evaluates GraphRAG models on hierarchical knowledge retrieval and deep contextual reasoning [29]. While focused on graphs, its assessment of hierarchical knowledge retrieval implicitly evaluates how the underlying chunking and structuring affect relational reasoning [16].
● Long-context reading comprehension datasets: DCS was evaluated on 12 diverse long-context reading comprehension datasets, including single-hop and multi-hop QA tasks, and tested on significantly longer inputs (up to 256k tokens) [13].

4.3. Best Practices for Chunking Evaluation


● Optimize Chunk Size: Test different chunk sizes to find the sweet spot that fits within model token limits while preserving context and semantic flow [6]. Chunks between 100–300 words often work well for many tasks (a sketch of a simple size/overlap sweep follows this list).
● Analyze Document Structure: Choose a chunking strategy based on the document type (structured, unstructured) and use case [3].
● Use Hybrid Approaches: Combining chunking strategies (e.g., semantic + recursive, paragraph + sliding window, agentic + embedding-based) can yield better results than using a single method [3].
● Test and Iterate: Continuous testing and iterative adjustments are essential to achieve optimal performance [3].
● Consider Chunk Boundaries, Size, and Overlap: These factors can significantly affect retrieval performance [32].
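A sketch of the "test and iterate" loop implied above: sweep chunk size and overlap, then compare a retrieval metric on held-out queries. `build_index`, `retrieve`, and `context_recall` are stand-ins for whatever indexing stack and metric are actually in use.

```python
from itertools import product

def sweep_chunking(documents, eval_queries, build_index, retrieve, context_recall,
                   sizes=(100, 200, 300), overlaps=(0, 20, 50)):
    """Grid-search chunk size and overlap; return the best setting and all scores."""
    results = {}
    for size, overlap in product(sizes, overlaps):
        index = build_index(documents, chunk_size=size, overlap=overlap)
        scores = [context_recall(retrieve(index, q), q) for q in eval_queries]
        results[(size, overlap)] = sum(scores) / len(scores)
    best = max(results, key=results.get)
    return best, results
```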

5. Future Directions and Open Problems in Chunking


The field is actively exploring advanced methods to optimize chunking and address its remaining challenges:

● Truly End-to-End Differentiable Chunking: A key challenge is making inherently discrete operations like chunk boundary decisions fully differentiable, enabling seamless end-to-end training and optimization [15]. Techniques like Gumbel-Softmax are being explored for this [34] (a toy sketch follows this list).
● Reinforcement Learning for Segmentation: RL is emerging as a paradigm for optimizing segmentation, treating it as a sequential decision-making task in which an agent learns to segment sequences based on reinforcement signals [36]. This has direct implications for developing more adaptive and intelligent chunking mechanisms [37].
● Adaptive Contextualization: Beyond merely retrieving relevant chunks, future work asks how RAG systems can intelligently synthesize and re-contextualize retrieved information to optimally suit the LLM's input requirements and the specific nuances of a query [22].
● Unified Multimodal Representations: Developing more effective and efficient methods to represent and retrieve information across vastly different modalities (text, image, audio, video) without losing crucial cross-modal context remains a complex problem [17]. Modality-specific chunking and hybrid indexing are the current approaches [17].
● LLM-based Chunk-level Metadata: Generating chunk-specific contextual explanations, such as summaries or extracted entities, and prepending them to the chunk can significantly improve retrieval accuracy by providing richer context [6].
● Addressing Citation Fragmentation: Specific strategies are needed for handling structured elements like tables or tight phrasing in specialized domains, to prevent information from being split across chunks [19].
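A toy PyTorch sketch of the Gumbel-Softmax idea for differentiable boundary decisions mentioned above. Only the gating mechanism is shown; the retriever, losses, and training loop are omitted, and the module is an illustration rather than any published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryGate(nn.Module):
    """For each pair of adjacent sentence embeddings, emit a discrete split / no-split
    decision whose gradient flows via the Gumbel-Softmax straight-through trick."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 2)  # logits for [keep-merged, split-here]

    def forward(self, sent_embs: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
        # sent_embs: (n_sentences, dim) -> neighbour pairs: (n_sentences - 1, 2 * dim)
        pairs = torch.cat([sent_embs[:-1], sent_embs[1:]], dim=-1)
        logits = self.scorer(pairs)
        # hard=True: one-hot decisions in the forward pass, soft gradients in the backward pass
        decisions = F.gumbel_softmax(logits, tau=tau, hard=True)
        return decisions[:, 1]  # 1.0 wherever a chunk boundary is placed

# usage sketch: boundaries = BoundaryGate(dim=384)(torch.randn(10, 384))
```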
The following overview summarizes key aspects of chunking and splitting strategies in Retrieval-Augmented Generation (RAG), including their evolution, major papers and tools, their impact, and relevant metrics. It also addresses the problems of "edge bleed," "overuse of tokens for context," and the need for "query-adaptive splitting."

The Crucial Role and Evolution of Chunking in RAG


Chunking, the process of dividing large documents into smaller, manageable pieces, is a foundational and often underestimated step in RAG. Its
primary purpose is to prepare the knowledge base for efficient retrieval, ensuring that the retriever can identify relevant information precisely without
overwhelming the Large Language Model (LLM) with excessive context.

However, the "right" way to chunk is highly context-dependent and represents a significant pain point in current RAG systems. The core challenges
revolve around:
1.​ Edge Bleed: This occurs when critical facts or an answer's components are split across chunk boundaries, leading to incomplete or
fragmented context retrieval. The retriever might bring back only half the evidence, causing the LLM to provide inaccurate or incomplete
answers. This is a prevalent issue, particularly in domains like legal, medical, and codebases where precise information and
cross-referencing are vital.
2.​ Granularity Mismatch: Different queries require different levels of detail. Some questions need a single clause, while others demand a
whole section or even cross-document synthesis. Static chunking struggles to provide this adaptive granularity, leading to either "missing
answers" (too fine-grained chunking for a broad query) or "flooding the generator with excess context" (too coarse-grained chunking for a
specific query).
3.​ Overuse of Tokens/Context Budget: LLMs have strict token limits and processing large context windows incurs significant computational
costs (higher API calls, increased latency) and can even degrade performance (the "needle-in-a-haystack" problem where the LLM struggles
to find relevant information amidst noise). Efficient chunking is critical for minimizing the context sent to the LLM, directly impacting
cost-effectiveness and latency, especially in high-volume enterprise applications.
4.​ Cost of Change/Re-indexing: Documents evolve. Re-running computationally expensive split → embed → index pipelines for every minor
document update is costly and time-consuming. Current incremental update methods often operate at the whole-chunk level, meaning a
small change within a chunk still necessitates re-embedding the entire chunk.

Each entry below lists its Contribution, the Gap Addressed, its Impact/Citations (approximate, as of mid-2025), and the Metrics & Benchmarks typically used.

Traditional/Rule-Based Chunking

● Fixed-Size Chunking (Early/Basic)
○ Contribution: Splits text into uniformly sized segments (characters, words, or tokens), often with overlap to retain context.
○ Gap Addressed: Simplicity; fitting content into the context window.
○ Impact/Citations: Ubiquitous as a baseline due to ease of implementation; high usage, but not tied to a single paper.
○ Metrics & Benchmarks: Context Precision, Context Recall, F1, Exact Match (for downstream QA); evaluated on standard QA datasets.
● Recursive Character Text Splitter (LangChain, LlamaIndex utility)
○ Contribution: Splits text using a list of separators (e.g., "\n\n", "\n", " ") recursively until chunks meet a desired size, preserving structural integrity.
○ Gap Addressed: Breaking context mid-sentence or mid-paragraph; better semantic and structural preservation than fixed-size.
○ Impact/Citations: Widely adopted in RAG frameworks, high practical impact; no specific foundational paper, but embedded in many RAG system descriptions.
○ Metrics & Benchmarks: Same as fixed-size; also chunk size distribution.
● Sentence/Paragraph-Level Chunking (Early/Basic NLP)
○ Contribution: Divides documents into individual sentences or paragraphs.
○ Gap Addressed: Granularity at natural semantic breaks.
○ Impact/Citations: Foundational in NLP, often used in initial RAG pipelines.
○ Metrics & Benchmarks: Context Precision, Recall, F1.
● Semantic Chunking (e.g., LlamaIndex's SemanticSplitterNodeParser)
○ Contribution: Uses embeddings to group semantically similar sentences/sub-sequences, forming chunks based on context rather than arbitrary size; often involves similarity thresholds.
○ Gap Addressed: Semantic coherence; avoiding splitting related content.
○ Impact/Citations: Growing adoption, but still heuristic in boundary determination; ~dozens to hundreds of citations for specific implementations.
○ Metrics & Benchmarks: Correctness, Relevancy (response & source node), Cosine Similarity, Average Chunk Length.
● Layout-Aware Chunking (e.g., Amazon Textract, Unstructured.io)
○ Contribution: Segments documents by considering visual and structural elements such as headings, subheadings, tables, and paragraphs, preserving inherent structure.
○ Gap Addressed: Handling complex, structured documents (PDFs, web pages) where visual layout is critical to meaning.
○ Impact/Citations: High practical relevance for document processing; ~hundreds for tools/papers such as Docugami and Unstructured.io.
○ Metrics & Benchmarks: Accuracy of structured extraction, downstream QA.
● Windowed Summarization Chunking (e.g., LlamaIndex's ContextWindowNodeParser)
○ Contribution: Enriches each chunk with summaries of previous/subsequent chunks or a sliding window of context to improve continuity.
○ Gap Addressed: Maintaining broader context and continuity across chunks.
○ Impact/Citations: Emerging technique; ~dozens.
○ Metrics & Benchmarks: Context Relevance, downstream QA metrics.
● Agentic Chunking / LLM-Based Chunking (Emerging)
○ Contribution: Uses an LLM to dynamically determine optimal split points and chunk content based on semantic understanding and query relevance (e.g., "summarize this section for X purpose").
○ Gap Addressed: Highly adaptive semantic segmentation.
○ Impact/Citations: Early research; high potential but often expensive/slow; ~dozens.
○ Metrics & Benchmarks: Recall, Precision, IoU (Information Overlap/Underlap).

Dynamic & Query-Adaptive Chunking Research

● Dynamic Chunking & Selection (DCS)
○ Contribution: Dynamically selects chunks (based on pre-computed sentence-level semantic similarity splits) at retrieval time using a question-aware classifier.
○ Gap Addressed: Selects relevant segments, but boundaries stay fixed during pre-processing.
○ Impact/Citations: ~100-200.
○ Metrics & Benchmarks: QA F1/EM.
● MacRAG
○ Contribution: Hierarchical RAG framework with top-down offline indexing (pre-indexed slices at multiple granularities) and bottom-up multi-scale adaptive retrieval; merges pre-defined levels.
○ Gap Addressed: Granularity mismatch, enabling query-aware expansion; still relies on pre-made slices.
○ Impact/Citations: ~50-100.
○ Metrics & Benchmarks: QA F1/EM, efficiency.
● H-Net "Dynamic Chunking"
○ Contribution: Fully differentiable boundary routing inside a language model during pre-training, learning content/context-dependent segmentation.
○ Gap Addressed: Replaces tokenization with learned chunking within the LM; not designed for external-corpus RAG or query-time adaptation.
○ Impact/Citations: ~50-100.
○ Metrics & Benchmarks: LM performance metrics.
● Financial Report Chunking (2024)
○ Contribution: Proposes chunking based on structural elements (titles, tables) in financial documents, aiming for optimal chunk size without manual tuning.
○ Gap Addressed: Domain-specific structure preservation, reducing noise in financial RAG.
○ Impact/Citations: Newer; ~dozens.
○ Metrics & Benchmarks: Chunking efficiency (number of chunks), QA accuracy.
● GraphRAG / Knowledge Graph RAG (e.g., Microsoft GraphRAG, 2024)
○ Contribution: Indexes information in structured knowledge graphs (entities, relations) and retrieves relevant subgraphs/nodes; chunks are often created during KG construction.
○ Gap Addressed: Limitations of pure vector search, structured reasoning, multi-hop questions.
○ Impact/Citations: High and rapidly growing; thousands for graph neural networks in NLP generally, ~hundreds for specific GraphRAG frameworks.
○ Metrics & Benchmarks: Graph construction quality, multi-hop QA, fact verification.

Metrics for Chunking


● Context Precision@k: Measures how many of the top-k retrieved chunks are actually relevant, i.e., the ratio of relevant chunks within the top-k results. Gap addressed: relevance of retrieved context. Status: standard RAG evaluation. Formula: P@k = (number of relevant chunks in top-k) / k.
● Context Recall@k: Measures the proportion of relevant chunks from the entire knowledge base that are captured within the top-k retrieved results. Gap addressed: completeness of retrieved context. Status: standard RAG evaluation. Formula: R@k = (number of relevant chunks in top-k) / (total number of relevant chunks in the corpus).
● F1 Score (for retrieval): Harmonic mean of Context Precision and Recall. Gap addressed: balanced measure of retrieval quality. Status: standard RAG evaluation. Formula: F1 = 2·P·R / (P + R).
● Mean Average Precision (MAP): Overall view of ranking quality across multiple queries, considering rank position. Gap addressed: ranking quality. Status: common in information retrieval.
● Normalized Discounted Cumulative Gain (NDCG): Evaluates ranked retrieval results, giving higher weight to relevant chunks that appear higher in the ranking. Gap addressed: ranked-list quality. Status: common in information retrieval.
● Edge-Coverage@k (proposed): Fraction of gold evidence tokens captured within the retrieved context; specifically targets edge-bleed by measuring whether critical answer-straddling information is retrieved. Gap addressed: a direct measure of edge-bleed, specific to query-adaptive chunking. Status: novel. (A sketch of this metric and Context Budget follows this list.)
● Context Budget (proposed): Total tokens shipped to the LLM (post-chunking and retrieval). Gap addressed: efficiency, cost, latency, token overuse. Status: novel as a specific chunking-driven metric. Formula: total token count.
● QA Exact Match / F1: End-to-end evaluation of the RAG system's answer accuracy against ground truth. Gap addressed: overall system performance. Status: standard QA metric.
● IoU (Information Overlap/Underlap): Measures the overlap between retrieved chunks and the ideal ground-truth chunks. Gap addressed: chunk quality, precision, token efficiency. Status: used in some chunking evaluations.
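A small sketch of the two proposed metrics above, Edge-Coverage@k and Context Budget, with whitespace tokenization standing in for a real tokenizer.

```python
from typing import List, Set

def edge_coverage_at_k(retrieved_chunks: List[str], gold_evidence_tokens: Set[str], k: int) -> float:
    """Fraction of gold evidence tokens that appear anywhere in the top-k retrieved chunks.
    High Context Recall with low Edge-Coverage@k is a signature of edge-bleed."""
    if not gold_evidence_tokens:
        return 1.0
    retrieved_tokens = set(" ".join(retrieved_chunks[:k]).split())
    return len(gold_evidence_tokens & retrieved_tokens) / len(gold_evidence_tokens)

def context_budget(retrieved_chunks: List[str], k: int) -> int:
    """Total number of tokens shipped to the LLM after retrieval."""
    return sum(len(chunk.split()) for chunk in retrieved_chunks[:k])
```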

Benchmarks for Chunking


● HotpotQA, Musique, NQ: Standard multi-hop and single-hop QA datasets; can be adapted for chunking evaluation. Gap addressed: general QA performance. Status: widely used. Metrics: QA F1/EM.
● EdgeQA-Long (proposed): A custom split of existing QA datasets in which gold spans purposely straddle random sentence/chunk boundaries. Gap addressed: specifically designed to evaluate edge-bleed and the ability to cover answer-straddling information. Status: novel. Metrics: Edge-Coverage@k, QA F1/EM.
● LegalBench-CrossSec (proposed): Long contracts with cross-clause answers, to evaluate chunking in the legal domain, where context often spans multiple, non-contiguous sections. Status: novel. Metrics: Edge-Coverage@k, QA F1/EM.
● DRAGON (2025): A dynamic benchmark with a regularly updated corpus of Russian news. Gap addressed: robustness to document evolution and dynamic changes. Status: newer, growing impact. Metrics: standard RAG metrics, stability over time.
● XRAG & GraphRAG-Bench: Evaluate structured and knowledge-graph-based RAG. Gap addressed: structured-data handling, multi-hop reasoning. Status: emerging. Metrics: specific metrics for graph structures.
The core task is to decide whether to merge two adjacent micro-chunks, chunk_i and chunk_{i+1}, based on the user's query.

The Equations
First, let's simplify the notation for clarity:
● q: the query vector.
● e_i: the embedding vector for the first micro-chunk.
● e_{i+1}: the embedding vector for the second micro-chunk.
1. Original FuseNet Equation:
$p_{\mathrm{merge}} = \sigma\big(q^{\top} W_1 e_i + e_i^{\top} W_2 e_{i+1}\big)$
This equation has two main parts:
○ Query-Chunk Relevance: $q^{\top} W_1 e_i$ measures how relevant the first chunk is to the query.
○ Chunk-Chunk Cohesion: $e_i^{\top} W_2 e_{i+1}$ measures how semantically similar the two chunks are to each other.
It then simply adds these two scores together.
2. Proposed Bahdanau-style Equation:
$p_{\mathrm{merge}} = \sigma\big(v^{\top} \tanh(W_q q + W_i e_i + W_{i+1} e_{i+1})\big)$
This is different: it first projects all three vectors into a common space using trainable weight matrices ($W_q$, $W_i$, $W_{i+1}$), adds them up, passes the sum through a non-linear function (tanh), and then uses a final vector $v$ to compute a single score. This is also known as "additive attention."

The Example: Step-by-Step


Let's use a simplified 2D vector space for our example.

The Setup:
● Query: "What is the termination clause?"
○ Let's say its vector is q = [2, 1]. It is strong on "legal terms" (x-axis) and weak on "financials" (y-axis).
● Chunk i (e_i): "...the agreement is terminated upon..."
○ Its vector is e_i = [3, 0]. Very strong on legal terms, neutral on financials.
● Chunk i+1 (e_{i+1}): "...a fee of $500 will be applied."
○ Its vector is e_{i+1} = [0, 3]. Neutral on legal terms, very strong on financials.

Which is Better?

Intuitively, we should SPLIT. The query is about the termination clause, and while e_i is relevant, e_{i+1} is about a financial penalty, which is a different topic.

Let's see how the equations handle this. For simplicity, assume the W matrices are identity matrices and v = [1, 1].

Calculation with the Original Equation

1. Query-Chunk Relevance: q · e_i = [2, 1] · [3, 0] = (2 × 3) + (1 × 0) = 6.
○ Interpretation: High score. The first chunk is very relevant to the query.
2. Chunk-Chunk Cohesion: e_i · e_{i+1} = [3, 0] · [0, 3] = (3 × 0) + (0 × 3) = 0.
○ Interpretation: Zero score. The chunks are completely unrelated (orthogonal in this case).
3. Final Score: 6 + 0 = 6.
○ Problem: The final score is high, suggesting a MERGE. The high relevance of the first chunk completely overshadowed the fact that the second chunk is irrelevant. The equation fails to penalize the join effectively.

Calculation with the Bahdanau-style Equation
1. Project & Add: We add the three vectors:
○ q + e_i + e_{i+1} = [2, 1] + [3, 0] + [0, 3] = [5, 4].
2. Non-linearity (tanh): Apply tanh element-wise to the result.
○ tanh([5, 4]) ≈ [0.999, 0.999]. The tanh squashes the values into the range (-1, 1), capturing the direction of the combined vector [1].
3. Final Score: Compute the dot product with v:
○ [1, 1] · [0.999, 0.999] ≈ 1.998.

Now, let's consider a case where a MERGE is correct. Imagine e_{i+1} was "...thirty days written notice." Its vector might be e'_{i+1} = [2.5, 0].
1. Project & Add:
○ q + e_i + e'_{i+1} = [2, 1] + [3, 0] + [2.5, 0] = [7.5, 1].
2. Non-linearity (tanh):
○ tanh([7.5, 1]) ≈ [0.999, 0.761].
3. Final Score:
○ [1, 1] · [0.999, 0.761] ≈ 1.76.

Notice the scores are different. The key is that the trainable weights (the W matrices and v) will learn to produce a high score only when all three vectors (q, e_i, e_{i+1}) align in the way that signifies a correct merge.
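The worked example above, reproduced numerically with identity weight matrices and v = [1, 1].

```python
import numpy as np

q      = np.array([2.0, 1.0])   # "What is the termination clause?"
e_i    = np.array([3.0, 0.0])   # "...the agreement is terminated upon..."
e_bad  = np.array([0.0, 3.0])   # "...a fee of $500 will be applied."  (should NOT merge)
e_good = np.array([2.5, 0.0])   # "...thirty days written notice."     (should merge)
v      = np.array([1.0, 1.0])

# Original score with identity W1, W2: relevance + cohesion.
print(q @ e_i + e_i @ e_bad)           # 6.0 + 0.0 = 6.0 -> high, wrongly favours a merge
# Bahdanau-style scores with identity projections.
print(v @ np.tanh(q + e_i + e_bad))    # ~1.998 (split case)
print(v @ np.tanh(q + e_i + e_good))   # ~1.76  (merge case)
```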

Conclusion: Which is Better?


The Bahdanau-style (additive attention) equation is superior for this task.
●​ Holistic Decision: It forces the model to consider the query, the first chunk, and the second chunk simultaneously. The final
decision is based on the interaction of all three, rather than separate, independent scores that are just added together.
●​ More Expressive Power: The original equation is constrained. If query-chunk relevance is very high, it can force a merge even if
the chunks themselves are dissimilar. The Bahdanau-style approach, with its non-linearity and joint projection, can learn much more
complex decision boundaries. It can learn that a merge is only good if both chunks are individually relevant to the query and they
are semantically cohesive.
●​ Robustness: It's less likely to be fooled by one part of the equation generating an extreme value. The tanh function helps moderate
the scores before the final decision, leading to a more stable learning process.
The one pain-point that still hurts the RAG world
➡ We still don’t know how to choose query-specific chunk boundaries at retrieval-time without paying the brutal cost of re-embedding
or re-indexing the whole corpus.

1. Why the problem is real (and still unsolved)


● Edge-bleed: facts spill across chunk edges; the retriever brings back half the evidence.
○ Today's best attempt: semantic / log-prob splitters such as LGMGC split where the EOS log-prob spikes, giving static coherent spans. (arXiv)
○ What still breaks: if the user later asks a question whose answer straddles that edge, recall drops; no splitter can foresee every future query.
● Granularity mismatch: some questions need a clause, others a whole section.
○ Today's best attempt: Mix-of-Granularity routers choose a pre-computed fine vs. coarse chunk at retrieval time. (arXiv)
○ What still breaks: the router can only pick from the two or three sizes you indexed; it can't carve a brand-new boundary inside an existing chunk.
● Multimodal / layout cues: tables spanning pages, code blocks, figures.
○ Today's best attempt: Vision-Guided Chunking and cAST handle layout and ASTs, but still freeze boundaries at ingest. (arXiv, arXiv)
○ What still breaks: the first query about "cell B12 on page 7" forces retrieval of a massive block around the table; precision tanks.
● Cost of change: documents evolve; re-running heavy split → embed → index pipelines is expensive.
○ Today's best attempt: EraRAG & GraphRAG add incremental updates, but only at the whole-chunk level. (arXiv, Medium)
○ What still breaks: if the right fix is "insert a new boundary in the middle", you're back to full re-embedding.

In short, all mainstream research fixes the segmentation once at ingest time. The moment a user asks an unforeseen, highly-focused
question, the pipeline either (a) misses the answer, or (b) floods the generator with excess context.

2. Why this is publishable material

1.​ No principled benchmark exists. Papers like Rethinking Chunk Size or Chroma’s evaluation analyse only static chunk sets.
(arXiv, research.trychroma.com)​

2. Theoretical gap. The retrieval objective is query-conditioned recall & precision, yet chunking is treated as a query-agnostic pre-processing step. The optimisation is mis-aligned, which makes this ripe ground for new theory.
3. Practical impact. A 10-20 % jump in end-to-end QA accuracy is still on the table in legal, medical, and codebase settings where edge-bleed is rampant; every vendor complains about it in blog posts, but no peer-reviewed solution exists. (Medium, IBM Developer)

3. Concrete research directions nobody has cracked


● Token-level "micro-index" + on-the-fly boundary fusion
○ Sketch: Pre-compute embeddings every n tokens; at query time a small router network decides where to cut/join before passing spans to ranking/generation. (A toy version appears after this list.)
○ Novelty lever: Achieves query-adaptive boundaries without re-embedding; reminiscent of dynamic frame pooling in video but unused in NLP chunking.
● Differentiable boundary gating
○ Sketch: Treat boundary decisions as latent Bernoulli gates; train end-to-end with REINFORCE or a straight-through estimator, using answer F1 as the reward.
○ Novelty lever: Makes chunking learnable jointly with retrieval; current work (MoG, LGMGC) uses heuristics or two-stage pipelines.
● Uncertainty-guided refinement loop
○ Sketch: Use the generator's token-level entropy to ask "did I get enough context?"; if not, slide the window and fetch the missing neighbouring micro-chunks.
○ Novelty lever: Marries log-prob splitting and post-hoc confidence estimation into a closed loop, which hasn't been formalised yet.
● Version-stable hashing of boundaries
○ Sketch: Learn boundary positions that minimise both retrieval loss and a "boundary churn" penalty when docs update.
○ Novelty lever: Addresses costly re-indexing, a pain companies voice but academia hasn't modelled.
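A toy version of the first idea above (token-level micro-index plus query-time boundary fusion); the stride, threshold, and greedy merge rule are assumptions for illustration, not a published algorithm.

```python
import numpy as np
from typing import Callable, List, Tuple

def build_micro_index(tokens: List[str], embed: Callable[[List[str]], np.ndarray],
                      stride: int = 32) -> Tuple[List[List[str]], np.ndarray]:
    """Ingest-time step, done once: embed every `stride`-token micro-chunk."""
    micro = [tokens[i:i + stride] for i in range(0, len(tokens), stride)]
    vecs = embed([" ".join(m) for m in micro])
    return micro, vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def fuse_at_query_time(query_vec: np.ndarray, micro: List[List[str]], vecs: np.ndarray,
                       threshold: float = 0.4) -> List[str]:
    """Query-time step: greedily merge runs of query-relevant micro-chunks into retrieval
    units, so chunk boundaries adapt to the query without re-embedding the corpus."""
    q = query_vec / np.linalg.norm(query_vec)
    relevant = (vecs @ q) >= threshold
    spans, start = [], None
    for i, keep in enumerate(list(relevant) + [False]):  # trailing False flushes the last run
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            spans.append(" ".join(token for m in micro[start:i] for token in m))
            start = None
    return spans
```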

Any of these could yield:

●​ an algorithm → an open-source prototype​

●​ a dynamic-chunk benchmark (e.g., modify LongBench so answers purposely cross arbitrary sentence edges)​

●​ ablation vs. fixed, semantic, MoG, LGMGC baselines.​

4. Framing the paper

1.​ Title idea: “Query-Adaptive Differentiable Chunking for Retrieval-Augmented Generation.”​

2. Problem formalisation: maximise the expected answer accuracy
$\mathbb{E}_{q\sim\mathcal{D}}\big[\mathrm{F1}\big(q,\, G(R(S_{\theta}(D), q))\big)\big]$,
where $S_{\theta}$ is a learnable splitter, $R$ the retriever, and $G$ the generator.

3.​ Dataset & metrics: introduce Edge-Coverage@k (fraction of gold evidence tokens captured) and Context Budget (tokens sent to
the LLM).​

4.​ Baselines: fixed 512-tok + 20 % overlap, LGMGC, MoG, Vision-Guided, Late-Chunking retrieval.​

5. Results to target: +8–12 F1 on MultihopRAG while cutting context tokens by 30–40 %.

6.​ Ablations: router depth, micro-token stride, boundary churn under doc revisions.​

TL;DR
Static chunking is a brittle heuristic; we still lack cost-efficient, query-adaptive splitting.​
Solving it—e.g., via differentiable boundary gating or token-level micro-indexes—has clear room for a publishable
contribution and would plug one of the most complained-about holes in production RAG systems.
