Ingestion & chunking

from ragforge.ingest import iter_documents
from ragforge.ingest.chunking import split_documents

docs = iter_documents(["data/sample/"])
chunks = split_documents(docs, size=1024, overlap=128)

Loaders

Extension          Loader                         Granularity
.pdf               ingest.pdf.load_pdf            one block per page
.md / .markdown    ingest.markdown.load_markdown  one block per top-level heading
.txt / .rst        inline reader                  whole file

Directories are walked recursively, and unsupported file types are skipped. To support a new extension, add one entry to the registry in ingest/__init__.py.
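
As a rough illustration of the registry idea, here is a minimal sketch of an extension-to-loader mapping with a recursive walk. The names LOADERS, load_text, and walk are hypothetical and chosen for this example only; the actual registry in ingest/__init__.py may be shaped differently.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical loader type: takes a path, returns a list of text blocks.
Loader = Callable[[Path], List[str]]


def load_text(path: Path) -> List[str]:
    """Whole-file reader, matching the 'whole file' granularity for .txt / .rst."""
    return [path.read_text(encoding="utf-8")]


# Hypothetical registry: one entry per supported extension.
LOADERS: Dict[str, Loader] = {
    ".txt": load_text,
    ".rst": load_text,
}


def walk(root: Path) -> Iterator[Tuple[Path, List[str]]]:
    """Yield (path, blocks) for every supported file under root, recursively."""
    for path in sorted(root.rglob("*")):
        loader = LOADERS.get(path.suffix.lower())
        # Files with no registered loader are silently skipped.
        if path.is_file() and loader is not None:
            yield path, loader(path)
```

Under these assumptions, supporting another plain-text extension would be a single line such as `LOADERS[".log"] = load_text`.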

Chunker

The default is a recursive character splitter with overlap. It tries the largest separator first (paragraph breaks), then falls back through line, sentence, word, and finally character boundaries, so chunks land on semantically reasonable boundaries when possible.

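from pathlib import Path

from ragforge.ingest.chunking import split

# Read the raw text to split; policy.md stands in for any text file.
text = Path("policy.md").read_text(encoding="utf-8")

chunks = split(
    text,
    metadata={"path": "policy.md"},
    size=1024,     # chunk size, in characters
    overlap=128,   # characters shared between neighbouring chunks
)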
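
For intuition about the separator fallback, here is a simplified sketch of a recursive character splitter with overlap. It is not ragforge's implementation: split_text, _pieces, and the SEPARATORS list are assumptions made for this example, and the overlap here is a plain character tail that may start mid-word.

```python
from typing import List

# Separators from coarsest to finest: paragraph, line, sentence, word, character.
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def _pieces(text: str, limit: int, seps: List[str]) -> List[str]:
    """Break text into pieces of at most `limit` chars, preferring coarse separators."""
    if len(text) <= limit:
        return [text]
    sep, finer = seps[0], seps[1:]
    if sep == "":
        # Last resort: hard cut every `limit` characters.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    parts = text.split(sep)
    # Re-attach the separator so no characters are lost.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    out: List[str] = []
    for part in parts:
        if part:
            # Oversized parts are retried with the next, finer separator.
            out.extend(_pieces(part, limit, finer))
    return out


def split_text(text: str, size: int = 1024, overlap: int = 128) -> List[str]:
    """Greedily pack pieces into chunks of at most `size` chars with `overlap` shared."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks: List[str] = []
    current = ""
    # Pieces are capped at size - overlap so that tail + piece never exceeds size.
    for piece in _pieces(text, size - overlap, SEPARATORS):
        if current and len(current) + len(piece) > size:
            chunks.append(current)
            # Seed the next chunk with the tail of the one just emitted.
            current = current[-overlap:] if overlap else ""
        current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk stays within size, and every chunk after the first begins with the last overlap characters of its predecessor.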

Tuning

  • size=1024 chars (~256 tokens) fits within the input limit of most retrieval models and keeps GPU memory low during reranking.
  • overlap=128 chars (~32 tokens) helps a passage that straddles a chunk boundary still appear with its surrounding context in at least one chunk.
  • Larger sizes (~2048) often help long-form answers; smaller sizes (~512) help precise extractive QA.
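
The arithmetic behind these numbers can be sketched as follows. The 4 characters per token ratio is the one implied by the figures above and varies by tokenizer; approx_tokens and approx_chunk_count are hypothetical helpers, and the chunk count assumes fixed-size windows, so a boundary-aware splitter will usually produce a few more chunks.

```python
import math

# Ratio implied by "1024 chars (~256 tokens)"; real tokenizers vary.
CHARS_PER_TOKEN = 4


def approx_tokens(chars: int) -> int:
    """Rough token estimate for a given number of characters."""
    return chars // CHARS_PER_TOKEN


def approx_chunk_count(doc_chars: int, size: int = 1024, overlap: int = 128) -> int:
    """Estimate chunk count for fixed-size windows that advance by size - overlap."""
    if doc_chars <= size:
        return 1
    stride = size - overlap
    return math.ceil((doc_chars - overlap) / stride)


print(approx_tokens(1024))         # 256
print(approx_tokens(128))          # 32
print(approx_chunk_count(10_000))  # 12
```

With the defaults, each chunk after the first repeats 128 of its 1024 characters, so roughly 12.5% of the indexed text is duplicated.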

Ids and idempotency

The Pipeline assigns each chunk a deterministic id built from the path, the chunk index, and a hash of the text. Re-ingesting the same document replaces its chunks instead of duplicating them.
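
A minimal sketch of how such ids could make re-ingestion idempotent, assuming a plain dict as the store. The id format, chunk_id, and reingest are hypothetical; the Pipeline's actual format and storage backend are not described here.

```python
import hashlib
from typing import Dict, List


def chunk_id(path: str, index: int, text: str) -> str:
    """Hypothetical id format: path, chunk index, and a short hash of the text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{path}#{index}#{digest}"


def reingest(store: Dict[str, str], path: str, chunks: List[str]) -> None:
    """Replace every stored chunk for `path` with the new chunks."""
    # Drop the document's old chunks first so edited text does not leave stale ids.
    for key in [k for k in store if k.startswith(f"{path}#")]:
        del store[key]
    for index, text in enumerate(chunks):
        store[chunk_id(path, index, text)] = text


store: Dict[str, str] = {}
reingest(store, "policy.md", ["first chunk", "second chunk"])
reingest(store, "policy.md", ["first chunk", "second chunk"])
print(len(store))  # 2
```

Because the id depends only on the path, index, and text, ingesting identical input twice yields identical keys, and the store ends up with one entry per chunk.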