Retrieval & reranking

The retrieval path is two stages:

1. Bi-encoder retrieval with a sentence-transformers model. This stage is fast: an approximate nearest-neighbor (ANN) search via Qdrant returns roughly the top 20-50 candidates.
2. Cross-encoder reranking with a BGE reranker. This stage is slower but much more accurate, and narrows the candidates down to the final top 3-5 passed to the LLM.
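
The two-stage flow can be sketched without any model dependencies. In the minimal sketch below, `two_stage_search` and `top_k_final` are illustrative names rather than part of ragforge, and `word_overlap` and `overlap_ratio` are toy stand-ins for the bi-encoder similarity and the cross-encoder score. A real first stage would use ANN search instead of scanning the whole corpus.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    top_k_retrieve: int = 20,
    top_k_final: int = 3,
) -> List[Tuple[str, float]]:
    """Retrieve a wide candidate set cheaply, then rerank it precisely."""
    # Stage 1: rank every document with the cheap scorer and keep the best N.
    # (Full scan for illustration only; a real system would use an ANN index.)
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:top_k_retrieve]

    # Stage 2: rescore only the surviving candidates with the expensive scorer.
    reranked = [(doc, precise_score(query, doc)) for doc in candidates]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:top_k_final]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for bi-encoder similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: shared words relative to document length."""
    words = doc.lower().split()
    return word_overlap(query, doc) / max(len(words), 1)
```

The point of the funnel is that the expensive scorer only ever sees `top_k_retrieve` documents, so its cost does not depend on the corpus size.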

Encoder

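Default: BAAI/bge-small-en-v1.5 (33M parameters, 384-dimensional embeddings). It is among the smallest models that still rank near the top of the MTEB retrieval leaderboard, and it encodes a chunk in roughly 3 ms on CPU. For multilingual corpora, swap to intfloat/multilingual-e5-small and re-index, since embeddings produced by different models are not comparable.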

from ragforge.embed import SentenceTransformerEncoder

enc = SentenceTransformerEncoder("BAAI/bge-small-en-v1.5", device="cpu")
vecs = enc.encode(["hello world"])

Vector store

Two backends ship in the box: Qdrant, which runs either embedded or against a remote server, and an in-memory NumPy store. Both implement the same three methods (upsert, search, count).
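
One way to picture the shared interface is as a structural type. The sketch below is an assumption about its shape: only the method names upsert, search, and count come from this page, while the `VectorStore` name, the argument lists, and the return types are hypothetical.

```python
from typing import List, Protocol, Sequence, Tuple, runtime_checkable


@runtime_checkable
class VectorStore(Protocol):
    """Hypothetical shape of the interface both backends share."""

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        """Insert new vectors, or overwrite those whose id already exists."""
        ...

    def search(self, query: Sequence[float], top_k: int) -> List[Tuple[str, float]]:
        """Return up to top_k (id, score) pairs, best match first."""
        ...

    def count(self) -> int:
        """Return the number of stored vectors."""
        ...


def index_size_after(
    store: VectorStore,
    ids: Sequence[str],
    vectors: Sequence[Sequence[float]],
) -> int:
    """Code written against the protocol works with any conforming backend."""
    store.upsert(ids, vectors)
    return store.count()
```

Writing calling code against such a protocol is what makes it possible to use the in-memory store in tests and Qdrant in production without changing anything else.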

Qdrant (embedded, recommended)

No server, no Docker. The client uses a local on-disk index.

from ragforge.vectorstore import QdrantStore

store = QdrantStore(collection="demo", dim=enc.dim, path="qdrant_storage")

Qdrant (remote)

For multi-process serving or shared indexes, point at a running Qdrant instance:

store = QdrantStore(collection="demo", dim=enc.dim, url="http://qdrant:6333")

NumPy (in-memory)

For unit tests and tiny corpora (<10k vectors).

from ragforge.vectorstore import InMemoryStore

store = InMemoryStore(dim=enc.dim)
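
To see why an in-memory backend only suits small corpora, note that the simplest way to answer a query is to compare it against every stored vector. The sketch below illustrates that brute-force approach in plain Python, to stay dependency-free. `TinyStore` is a hypothetical class, not the actual InMemoryStore implementation, which may differ in both API and internals.

```python
import math
from typing import Dict, List, Sequence, Tuple


class TinyStore:
    """Brute-force cosine-similarity store (illustrative only)."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._vectors: Dict[str, List[float]] = {}

    @staticmethod
    def _normalize(vec: Sequence[float]) -> List[float]:
        # Scale to unit length so that a dot product equals cosine similarity.
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vec]

    def upsert(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None:
        for id_, vec in zip(ids, vectors):
            if len(vec) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(vec)}")
            # Reusing an existing id overwrites the old vector.
            self._vectors[id_] = self._normalize(vec)

    def search(self, query: Sequence[float], top_k: int = 5) -> List[Tuple[str, float]]:
        q = self._normalize(query)
        # Every query touches every stored vector: cost is O(count * dim).
        scored = [
            (id_, sum(a * b for a, b in zip(q, vec)))
            for id_, vec in self._vectors.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

    def count(self) -> int:
        return len(self._vectors)
```

Because each search scans the full collection, query time grows linearly with the number of vectors, which is acceptable for unit tests but not for large indexes.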

Reranker

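The cross-encoder scores each (query, candidate) pair jointly, so it can pick up interactions between query and passage that two independently computed embeddings miss. This typically gives a much higher nDCG than the bi-encoder alone, especially on multi-hop questions.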

from ragforge.rerank import BGEReranker

reranker = BGEReranker("BAAI/bge-reranker-base", device="cpu")
# `hits` are the first-stage candidates returned by the vector store search
top5 = reranker.rerank("How long is the refund window?", hits, top_k=5)

Skipping the reranker is also a valid choice when latency is paramount: pass use_reranker=False to Pipeline.from_defaults.
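
A rough way to reason about this trade-off is a linear cost model: retrieval is paid once per query, while reranking is paid once per candidate. The sketch below assumes candidates are scored one after another with no batching. `estimate_latency_ms` and its parameters are hypothetical, and any numbers passed in are placeholders to be replaced with your own measurements.

```python
def estimate_latency_ms(
    n_candidates: int,
    retrieve_ms: float,
    rerank_ms_per_pair: float,
    use_reranker: bool = True,
) -> float:
    """Estimate per-query latency under a simple linear cost model."""
    if n_candidates < 0:
        raise ValueError("n_candidates must be non-negative")
    # Reranking cost scales with how many candidates reach the cross-encoder.
    rerank_ms = n_candidates * rerank_ms_per_pair if use_reranker else 0.0
    return retrieve_ms + rerank_ms


# Placeholder figures, not benchmarks: compare three configurations.
full = estimate_latency_ms(50, retrieve_ms=5.0, rerank_ms_per_pair=10.0)
fewer = estimate_latency_ms(20, retrieve_ms=5.0, rerank_ms_per_pair=10.0)
none = estimate_latency_ms(
    50, retrieve_ms=5.0, rerank_ms_per_pair=10.0, use_reranker=False
)
```

Under this model, reranking fewer candidates is a middle ground between the full pipeline and dropping the reranker altogether.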

Tuning recipe

| Symptom | First thing to try |
| --- | --- |
| Top hits are mostly irrelevant | Increase `top_k_retrieve` and keep the reranker |
| Top hits are correct but the answer is wrong | Check the chunk size: chunks that are too small lose context, chunks that are too big confuse the LLM |
| Latency is too high | Drop the reranker, or rerank fewer candidates |
| Poor results on multilingual queries | Swap the encoder to `intfloat/multilingual-e5-small` |
| Doc-IDs are duplicated | Make sure you didn't break the deterministic chunk-id (see `pipeline._chunk_id`) |
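
On the last row: a deterministic chunk ID presumably matters because re-ingesting the same document then overwrites existing entries through upsert instead of adding duplicates. The sketch below shows one common way to build such an ID by hashing the chunk's identity and content. It is an assumption for illustration only; the `chunk_id` function here is hypothetical and the actual `pipeline._chunk_id` may work differently.

```python
import hashlib


def chunk_id(doc_id: str, chunk_index: int, text: str) -> str:
    """Derive a stable ID from the chunk's source, position, and content."""
    # The unit-separator character keeps field boundaries unambiguous,
    # so ("ab", 1) and ("a", "b1"...) style collisions cannot occur by concatenation.
    payload = f"{doc_id}\x1f{chunk_index}\x1f{text}".encode("utf-8")
    # Truncate the hex digest to 32 characters for a compact identifier.
    return hashlib.sha256(payload).hexdigest()[:32]
```

Anything non-deterministic in the ID (a random UUID, a timestamp, an unordered iteration) would give the same chunk a new ID on every ingest, which is one way duplicates can appear.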