In this blog, we break down why standard chunking fails for structured data and how to design table-preserving chunking strategies using modern RAG best practices. Each approach comes with implementation guidance, use cases, and architecture fit.
Why does chunking fail on tabular data?
Tables aren’t text — they are relational knowledge graphs compressed into rows and columns. But standard chunking doesn’t know that. It simply slices text by length.
| Problem | What goes wrong |
|---|---|
| Fractured Context | Tables are split mid-row, detaching values from headers |
| Lost Schema | Column context disappears, numbers lose meaning |
| Semantic Drift | LLM fills missing data, causing hallucinations |
| Retrieval Failure | Vector search can’t retrieve row value + meaning together |
Example of a failure case:
Chunk A:
| Revenue | Growth |

Chunk B:
| $10M | 15% |
Individually, neither chunk is meaningful.
The RAG system retrieves “$10M | 15%” without knowing what it represents — leading to inaccurate answers, hallucinated summaries, and broken analytics.
The solution:
Treat tables as atomic semantic units — never split without retaining headers.
Everything that follows is built on this principle.
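To make the principle concrete, here is a minimal sketch of a header-retaining splitter for markdown tables. The function name, the `max_rows` budget, and the sample table are illustrative, not from any particular library: the point is simply that every chunk carries the header row, so no value is ever detached from its schema.

```python
def split_table_keep_header(table_md: str, max_rows: int = 2) -> list[str]:
    """Split a markdown table into chunks, repeating the header in each."""
    lines = table_md.strip().splitlines()
    header, divider, rows = lines[0], lines[1], lines[2:]
    chunks = []
    for i in range(0, len(rows), max_rows):
        # Every chunk restates header + divider before its slice of rows
        chunks.append("\n".join([header, divider, *rows[i:i + max_rows]]))
    return chunks

table = """| Metric | Value |
|---|---|
| Revenue | $10M |
| Growth | 15% |
| Margin | 40% |"""

chunks = split_table_keep_header(table, max_rows=2)
```

Both resulting chunks begin with `| Metric | Value |`, so a retrieved row like `| Margin | 40% |` always arrives with its column meaning attached.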
4 proven strategies to improve RAG performance on tabular data:
1. Page-level chunking (Best for PDFs)
Use case: Financial reports, legal contracts, research papers, scanned annual reports
Why it works: Tables usually sit inside a visually bounded block.
Instead of splitting text by tokens, we chunk based on document layout: the page, not raw text length, becomes the unit of retrieval.
📌 In NVIDIA’s 2024 RAG Benchmarks, page-level chunking achieved:
- 0.648 top-1 accuracy
- Lowest variance (0.107) among all chunkers
- Outperformed all token-length-based methods
This matters when building AI systems for compliance, M&A analysis, taxation, or credit risk — where a missing column is the difference between a correct answer and a compliance risk.
Implementation for Production (Unstructured.io)
Using high-resolution layout extraction, identify table elements distinctly:
```python
from unstructured.partition.pdf import partition_pdf

# High-resolution layout parsing keeps tables as distinct elements
elements = partition_pdf(
    "report.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)

tables = [e for e in elements if e.category == "Table"]
```
This keeps:
- Tables intact
- Captions + footnotes preserved
- Adjacent context available for reasoning
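A sketch of how those extracted elements can be turned into retrieval units that keep the table atomic and attach the surrounding context. The element dicts below are hypothetical stand-ins for parser output (in practice they come from `partition_pdf`); only the pattern — table plus adjacent text as one unit — is the point.

```python
# Simulated layout-parser output: each element has a category and text.
# (Hypothetical data standing in for partition_pdf results.)
elements = [
    {"category": "NarrativeText", "text": "Q2 results are summarized below."},
    {"category": "Table", "text": "| Revenue | Growth |\n| $10M | 15% |"},
    {"category": "FigureCaption", "text": "Table 1: Q2 financials."},
]

def to_retrieval_units(elements: list[dict]) -> list[str]:
    """Keep each table atomic and attach the text around it as context."""
    units = []
    for i, el in enumerate(elements):
        if el["category"] == "Table":
            before = elements[i - 1]["text"] if i > 0 else ""
            after = elements[i + 1]["text"] if i + 1 < len(elements) else ""
            units.append(f"{before}\n{el['text']}\n{after}".strip())
    return units

units = to_retrieval_units(elements)
```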
2. Recursive Splitting with table-aware separators (Best for Markdown + Notion)
Markdown tables are fragile — one naive chunk split, and the schema collapses. Recursive splitting solves this only if we teach the splitter that | is not just a symbol, it is a boundary of meaning.
Core idea:
Add the markdown row separator `|` as a high-priority split boundary.
Most recursive chunkers prioritize:
- Paragraphs
- Sentences
- Tokens
We modify that hierarchy:
- `|` → highest priority
- `\n\n` → second
- `.` → fallback only if needed
This ensures:
| Benefit | Value in RAG |
|---|---|
| Rows kept intact | No fragmented semantic units |
| Schema preserved | Less hallucination during synthesis |
| High retrieval granularity | Great for documentation & LLM notes |
Ideal for:
- Product documentation
- SaaS knowledge bases
- Notion exports
- Wiki / Markdown datasets
- LLM-generated structured text
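The modified hierarchy can be sketched as a small recursive splitter. This is a minimal illustration, not the LangChain API; note that the first separator here is `"\n|"` (a newline followed by a pipe, i.e. the start of a table row), so splits land between rows rather than inside a cell.

```python
def recursive_split(text: str, max_len: int = 80,
                    separators: tuple = ("\n|", "\n\n", ". ")) -> list[str]:
    """Split text with table-row boundaries tried before paragraphs/sentences."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            # Re-attach the separator so each table row keeps its leading "|"
            pieces = [parts[0]] + [sep + p for p in parts[1:]]
            # Greedily merge pieces back together up to max_len;
            # a single oversized piece is kept whole rather than broken mid-row
            chunks, buf = [], ""
            for p in pieces:
                if buf and len(buf) + len(p) > max_len:
                    chunks.append(buf)
                    buf = p
                else:
                    buf += p
            if buf:
                chunks.append(buf)
            return chunks
    return [text]  # no separator found; return as-is

table = "| Year | Revenue |\n|---|---|\n| 2022 | $8M |\n| 2023 | $10M |"
chunks = recursive_split(table, max_len=40)
```

Every chunk boundary falls between rows, so `| 2022 | $8M |` survives intact instead of being cut into `| 2022 |` and `$8M |`.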
3. Schema-aware semantic chunking (Best for CSV, SQL, Excel)
Database rows are meaningless alone.
Example:
45, 2023, Q2
The LLM can’t infer that 45 = Revenue (Millions) unless the schema is present.
The paper TableRAG demonstrated that:
Row embeddings with schema enrichment outperform raw table embeddings.
So instead of chunking raw CSV rows, we expand them into semantic sentences.
Pandas enrichment example:
```python
import pandas as pd

df = pd.read_csv("finance.csv")

# Expand each raw row into a schema-aware sentence before embedding
enriched = [
    f"Revenue={r['revenue']}M, Year={r['year']}, Quarter={r['quarter']}"
    for _, r in df.iterrows()
]
```
Now embeddings contain:
- Meaning
- Context
- Queryability
This enables questions like:
“What was Q2 revenue in 2023?”
to retrieve exactly the right row without any hallucination.
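An end-to-end sketch of that query flow, with an in-memory DataFrame standing in for `finance.csv` (the column names mirror the example above) and naive keyword overlap standing in for vector similarity — both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data standing in for finance.csv
df = pd.DataFrame({
    "revenue": [45, 52],
    "year": [2023, 2023],
    "quarter": ["Q1", "Q2"],
})

# Schema-aware enrichment: each row becomes a self-describing sentence
enriched = [
    f"Revenue={r['revenue']}M, Year={r['year']}, Quarter={r['quarter']}"
    for _, r in df.iterrows()
]

def retrieve(query: str, docs: list[str]) -> str:
    """Pick the doc with the largest term overlap (a stand-in for embeddings)."""
    terms = set(query.replace("?", "").split())
    def score(d: str) -> int:
        tokens = set(d.replace(",", " ").replace("=", " ").split())
        return len(terms & tokens)
    return max(docs, key=score)

answer = retrieve("What was Q2 revenue in 2023", enriched)
```

Because `Year=2023` and `Quarter=Q2` are spelled out in the chunk text, the query matches the right row on its schema terms, not just its raw numbers.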
4. Hierarchical chunking (Best for complex multi-format documents)
Some documents have cross-page relationships. Example: A metric definition appears in Section 4, but values appear in Appendix B Table 23. Naive RAG retrieves only one, causing incorrect analysis.
Hierarchical chunking fixes this:
| Layer | Size | Purpose |
|---|---|---|
| Parent Chunk | Large | Full table, headers, surrounding text |
| Child Chunk | Small | Row-level key-value access |
Retrieval Flow
- Query → retrieve child chunk (precise match)
- Child → triggers retrieval of parent chunk
- LLM receives full context for final reasoning
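The flow above can be sketched with a two-layer store: child chunks carry a `parent_id` pointing back to the full table. The store layout, IDs, and keyword matching below are illustrative assumptions; production systems would use a vector index and a document store instead.

```python
# Hypothetical two-layer store: small child chunks point back to a parent
parents = {
    "t23": ("Appendix B, Table 23: full metrics table\n"
            "| Metric | Q1 | Q2 |\n|---|---|---|\n| Revenue | 45 | 52 |"),
}
children = [
    {"text": "Revenue Q1 = 45", "parent_id": "t23"},
    {"text": "Revenue Q2 = 52", "parent_id": "t23"},
]

def retrieve_with_parent(query: str) -> tuple[str, str]:
    """Match a child chunk precisely, then pull its parent for full context."""
    terms = set(query.split())
    # Keyword overlap stands in for vector similarity here
    child = max(children, key=lambda c: len(terms & set(c["text"].split())))
    return child["text"], parents[child["parent_id"]]

hit, context = retrieve_with_parent("Revenue Q2")
```

The child match gives the precise value; the parent lookup gives the LLM the whole table (headers, units, neighbors) for final reasoning.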
This is currently one of the best patterns for:
- Audit reports
- Infrastructure documentation
- Policy + regulatory frameworks
- Research publications
- Enterprise BI knowledge graphs
It maximizes:
- Precision
- Recall
- Reasoning quality
The trifecta for RAG excellence.
If your enterprise RAG system handles:
- Financial or compliance documents
- Operational metrics + dashboards
- SQL / Excel / BI exports
- Regulatory filings
- Research + scientific data
Then your chunking strategy will determine your model accuracy, not your model size.
