Ingestion & chunking
from ragforge.ingest import iter_documents
from ragforge.ingest.chunking import split_documents
docs = iter_documents(["data/sample/"])
chunks = split_documents(docs, size=1024, overlap=128)
Loaders
| Extension | Loader | Granularity |
|---|---|---|
| .pdf | ingest.pdf.load_pdf | one block per page |
| .md / .markdown | ingest.markdown.load_markdown | one block per top-level heading |
| .txt / .rst | inline reader | whole file |
Directories are walked recursively. Unsupported file types are skipped.
To support another file type, add one entry to the loader registry in ingest/__init__.py.
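The behaviour described above (recursive walk, skip unsupported types, one registry entry per extension) can be sketched as follows. This is a minimal illustration, not ragforge's actual code: the `LOADERS` dict, the `Loader` signature, `read_whole_file`, and `iter_blocks` are all hypothetical names, and the real registry in ingest/__init__.py may be shaped differently.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical loader signature: takes a path, returns a list of text blocks.
Loader = Callable[[Path], List[str]]


def read_whole_file(path: Path) -> List[str]:
    """Inline reader: the whole file becomes a single block."""
    return [path.read_text(encoding="utf-8")]


# Hypothetical registry keyed by lowercase file extension.
LOADERS: Dict[str, Loader] = {
    ".txt": read_whole_file,
    ".rst": read_whole_file,
}


def iter_blocks(root: Path) -> Iterator[Tuple[Path, str]]:
    """Walk root recursively and yield (path, block) pairs."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        loader = LOADERS.get(path.suffix.lower())
        if loader is None:
            continue  # unsupported file type: skip silently
        for block in loader(path):
            yield path, block


# Under this assumed design, a new file type is a single registry entry.
LOADERS[".log"] = read_whole_file
```

Keying the registry on the lowercased suffix keeps the lookup to one dictionary access per file and makes `.TXT` and `.txt` behave the same.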
Chunker
The default is a recursive character splitter with overlap. It tries the largest separator first (paragraph), then falls back through line, sentence, word, and character, so chunks land on semantically reasonable boundaries when possible.
from ragforge.ingest.chunking import split
chunks = split(
    text,
    metadata={"path": "policy.md"},
    size=1024,
    overlap=128,
)
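To make the fallback order concrete, here is a minimal sketch of a recursive character splitter with overlap. It is not ragforge's implementation: `SEPARATORS`, `_split_recursive`, and `split_text` are hypothetical names, the sentence separator is assumed to be `". "`, and metadata handling is omitted.

```python
from typing import List

# Separators from coarsest to finest; "" means cut at arbitrary characters.
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def _split_recursive(text: str, limit: int, seps: List[str]) -> List[str]:
    """Break text into pieces of at most `limit` chars, preferring coarse separators."""
    if len(text) <= limit:
        return [text] if text else []
    sep, finer = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard cut every `limit` characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    parts = text.split(sep)
    # Re-attach the separator so that no characters are lost.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces: List[str] = []
    for part in parts:
        pieces.extend(_split_recursive(part, limit, finer))
    return pieces


def split_text(text: str, size: int = 1024, overlap: int = 128) -> List[str]:
    """Greedily merge pieces into chunks of at most `size` chars with `overlap` carried over."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    # Pieces are capped at size - overlap so a carried-over tail plus one piece always fits.
    pieces = _split_recursive(text, size - overlap, SEPARATORS)
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if current and len(current) + len(piece) > size:
            chunks.append(current)
            # Start the next chunk with the tail of the previous one.
            current = current[-overlap:] if overlap else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

In this sketch the overlap is taken as a raw character tail, so it may begin mid-word; a production splitter would more likely carry over whole pieces.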
Tuning
- size=1024 chars (~256 tokens) fits most retrieval models comfortably and keeps GPU memory low during reranking.
- overlap=128 chars (~32 tokens) ensures a query that falls on a chunk boundary still hits a chunk with full context.
- Larger sizes (~2048) often help long-form answers; smaller sizes (~512) help precise extractive QA.
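A quick way to see what these settings cost is to estimate the chunk count for a document of a given length. The sketch below assumes a fixed stride of `size - overlap` characters, which a boundary-aware splitter only approximates, and uses the roughly 4 characters per token ratio implied by the figures above. `estimate_chunks` is a hypothetical helper, and the 64 and 256 overlaps are assumed values that keep the same 1/8 ratio.

```python
import math

CHARS_PER_TOKEN = 4  # rough ratio implied by 1024 chars ~ 256 tokens


def estimate_chunks(doc_chars: int, size: int = 1024, overlap: int = 128) -> int:
    """Approximate chunk count for a fixed-stride splitter."""
    if doc_chars <= 0:
        return 0
    if doc_chars <= size:
        return 1
    stride = size - overlap  # new characters contributed by each later chunk
    return 1 + math.ceil((doc_chars - size) / stride)


# Compare the three sizes discussed above on a 100,000-character document.
for size, overlap in [(512, 64), (1024, 128), (2048, 256)]:
    n = estimate_chunks(100_000, size, overlap)
    tokens = size // CHARS_PER_TOKEN
    print(f"size={size} overlap={overlap}: about {n} chunks of ~{tokens} tokens")
```

With an overlap of one eighth of the chunk size, each chunk after the first contributes 7/8 new text, so the index stores roughly 14% more characters than the source document.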
Ids and idempotency
The Pipeline assigns each chunk a deterministic id built from the path, the chunk index, and a hash of the text. Re-ingesting the same document replaces its chunks instead of duplicating them.
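As an illustration of why deterministic ids make re-ingestion idempotent, here is a minimal sketch. The id format, the choice of SHA-256, the 12-character digest, and the dict-based `store` are assumptions for the example, not ragforge's actual scheme.

```python
import hashlib
from typing import Dict, List


def chunk_id(path: str, index: int, text: str) -> str:
    """Hypothetical id: path, chunk index, and a short SHA-256 digest of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return f"{path}:{index}:{digest}"


# Stand-in for a vector store keyed by chunk id.
store: Dict[str, str] = {}


def upsert(path: str, chunks: List[str]) -> None:
    """Write each chunk under its deterministic id, overwriting any existing entry."""
    for index, text in enumerate(chunks):
        store[chunk_id(path, index, text)] = text


upsert("policy.md", ["first chunk", "second chunk"])
upsert("policy.md", ["first chunk", "second chunk"])  # re-ingest: same ids, no duplicates
```

In this sketch an edited chunk gets a new id, so replacing stale chunks would also require deleting the old entries for that path; the text above does not say how the Pipeline handles that step.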