Ingestion & chunking

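from ragforge.ingest import iter_documents
from ragforge.ingest.chunking import split_documents

# Walk the sample directory and load every supported file.
docs = iter_documents(["data/sample/"])
# Split the loaded documents into overlapping chunks (sizes in characters).
chunks = split_documents(docs, size=1024, overlap=128)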

Loaders

| Extension | Loader | Granularity |
| --- | --- | --- |
| `.pdf` | `ingest.pdf.load_pdf` | one block per page |
| `.md` / `.markdown` | `ingest.markdown.load_markdown` | one block per top-level heading |
| `.txt` / `.rst` | inline reader | whole file |

Directories are walked recursively, and files with unsupported extensions are skipped. To support a new file type, add one entry to the loader registry in `ingest/__init__.py`.
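
As a rough illustration of how an extension-keyed registry and a recursive walk can fit together, here is a minimal sketch. The names `LOADERS`, `Loader`, `read_whole_file`, `walk`, and `load` are hypothetical and are not taken from ragforge; the actual registry in `ingest/__init__.py` may be shaped differently.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, List

# Hypothetical loader signature: takes a path, returns a list of text blocks.
Loader = Callable[[Path], List[str]]


def read_whole_file(path: Path) -> List[str]:
    """Inline reader: the whole file becomes a single block."""
    return [path.read_text(encoding="utf-8")]


# Hypothetical registry mapping a file extension to its loader.
LOADERS: Dict[str, Loader] = {
    ".txt": read_whole_file,
    ".rst": read_whole_file,
}


def walk(root: Path) -> Iterator[Path]:
    """Yield supported files under root, recursing into subdirectories."""
    for path in sorted(root.rglob("*")):
        # Files whose extension has no registered loader are skipped.
        if path.is_file() and path.suffix.lower() in LOADERS:
            yield path


def load(path: Path) -> List[str]:
    """Dispatch to the loader registered for the file's extension."""
    return LOADERS[path.suffix.lower()](path)


# Supporting a new type is a single registry entry (".log" is illustrative).
LOADERS[".log"] = read_whole_file
```

Keeping the dispatch in one dictionary means the walk and the skip rule never need to change when a loader is added.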

Chunker

The default is a recursive character splitter with overlap. It tries the largest separator first (paragraph), then falls back through line, sentence, word, and finally character, so chunks land on semantically reasonable boundaries whenever possible.
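
The fallback order described above can be sketched in a few lines. This is not ragforge's implementation: the separator list, the function names `recursive_split` and `split_with_overlap`, and the way overlap is applied (prepending the tail of the previous piece) are all assumptions made for illustration.

```python
from typing import List

# Assumed separators, coarsest to finest: paragraph, line, sentence, word, character.
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def recursive_split(text: str, size: int, seps: List[str] = SEPARATORS) -> List[str]:
    """Split text into pieces of at most `size` characters."""
    if len(text) <= size:
        return [text] if text else []
    sep, rest = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard cut every `size` characters.
        return [text[i:i + size] for i in range(0, len(text), size)]
    pieces: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= size:
            # Greedily merge neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            pieces.append(current)
        if len(part) <= size:
            current = part
        else:
            # The part alone is too long: retry with the next finer separator.
            pieces.extend(recursive_split(part, size, rest))
            current = ""
    if current:
        pieces.append(current)
    return pieces


def split_with_overlap(text: str, size: int = 1024, overlap: int = 128) -> List[str]:
    """Chunks of at most `size` characters, each starting with the previous tail."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    # Reserve room for the overlap so that tail + piece never exceeds `size`.
    pieces = recursive_split(text, size - overlap)
    chunks: List[str] = []
    for i, piece in enumerate(pieces):
        # The separator between pieces is dropped in this sketch.
        tail = pieces[i - 1][-overlap:] if i > 0 and overlap > 0 else ""
        chunks.append(tail + piece)
    return chunks
```

The greedy merge keeps whole paragraphs together when they fit, and only descends to a finer separator for the parts that do not.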

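from pathlib import Path

from ragforge.ingest.chunking import split

# Read the raw text to split; the path matches the metadata below.
text = Path("policy.md").read_text(encoding="utf-8")

chunks = split(
    text,
    metadata={"path": "policy.md"},  # source metadata for this text
    size=1024,     # chunk size, in characters
    overlap=128,   # overlap between neighbouring chunks, in characters
)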

Tuning

size=1024 chars (~256 tokens) fits most retrieval models comfortably and keeps GPU memory low during reranking.
overlap=128 chars (~32 tokens) means a passage that straddles a chunk boundary is still likely to appear intact in at least one chunk.
Larger sizes (~2048) often help long-form answers; smaller sizes (~512) help precise extractive QA.
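
To get a feel for these settings, the arithmetic can be worked through directly. The sketch below assumes roughly 4 characters per token (the ratio implied by 1024 chars ≈ 256 tokens) and fixed-width windows; the real splitter cuts on separator boundaries, so actual chunk counts will usually be somewhat higher. The function names are hypothetical.

```python
import math

CHARS_PER_TOKEN = 4  # assumption implied by "1024 chars (~256 tokens)"


def approx_tokens(chars: int) -> int:
    """Rough token count for a given number of characters."""
    return chars // CHARS_PER_TOKEN


def estimate_chunk_count(doc_chars: int, size: int = 1024, overlap: int = 128) -> int:
    """Estimate chunks for a document, assuming fixed-width sliding windows."""
    if doc_chars <= size:
        return 1 if doc_chars > 0 else 0
    # Each chunk after the first advances by (size - overlap) characters.
    stride = size - overlap
    return math.ceil((doc_chars - overlap) / stride)


for size in (512, 1024, 2048):
    print(size, approx_tokens(size), estimate_chunk_count(10_000, size=size))
```

For a 10,000-character document this gives about 26, 12, and 6 chunks at sizes 512, 1024, and 2048 respectively, which is the trade-off between index size and per-chunk context.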

Ids and idempotency

The Pipeline assigns each chunk a deterministic id built from the path, the chunk index, and a hash of the text. Re-ingesting the same document replaces its chunks instead of duplicating them.
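
One way such an id could be built, and why it makes re-ingestion idempotent, is sketched below. The hash algorithm (SHA-256, truncated), the `path:index:digest` format, and the dictionary-backed store are assumptions for illustration only; the Pipeline's actual id format is not specified here.

```python
import hashlib
from typing import Dict, List


def chunk_id(path: str, index: int, text: str) -> str:
    """Deterministic id from path, chunk index, and a hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{path}:{index}:{digest}"


def upsert_document(store: Dict[str, dict], path: str, chunks: List[str]) -> None:
    """Replace all chunks previously stored for `path` with the new ones."""
    # Drop stale chunks first so edited or shortened documents leave no leftovers.
    for key in [k for k, v in store.items() if v["path"] == path]:
        del store[key]
    for index, text in enumerate(chunks):
        store[chunk_id(path, index, text)] = {"path": path, "text": text}


store: Dict[str, dict] = {}
upsert_document(store, "policy.md", ["first chunk", "second chunk"])
upsert_document(store, "policy.md", ["first chunk", "second chunk"])
print(len(store))  # still two entries after ingesting the same document twice
```

Because the same inputs always produce the same id, a second ingestion writes to the same keys rather than creating new ones.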