Retrieval & reranking

The retrieval path is two stages:

  1. Bi-encoder retrieval with a sentence-transformers model. This stage is fast: an ANN search in Qdrant returns the top 20-50 candidates.
  2. Cross-encoder reranking with a BGE reranker. This stage is slower but much more accurate, and it narrows the candidates down to the final top 3-5 passed to the LLM.
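
The two stages can be sketched in plain Python to show how the candidate funnel works. This is only an illustration, not ragforge's implementation: two_stage_search, cosine, and word_overlap are hypothetical names, the exhaustive scan stands in for an ANN index, and a toy word-overlap score stands in for the cross-encoder.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(
    query: str,
    query_vec: Vector,
    corpus: list[tuple[str, Vector]],           # (text, embedding) pairs
    rerank_score: Callable[[str, str], float],  # stand-in for a cross-encoder
    top_k_retrieve: int = 20,
    top_k_final: int = 5,
    use_reranker: bool = True,
) -> list[str]:
    # Stage 1: cheap vector similarity over the whole corpus.
    # (A real store would use an ANN index instead of this exhaustive scan.)
    by_similarity = sorted(
        corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    candidates = [text for text, _ in by_similarity[:top_k_retrieve]]
    if not use_reranker:
        return candidates[:top_k_final]
    # Stage 2: the expensive scorer only ever sees the shortlist.
    reranked = sorted(
        candidates, key=lambda text: rerank_score(query, text), reverse=True
    )
    return reranked[:top_k_final]


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


corpus = [
    ("Refunds are accepted within 30 days of purchase.", [0.9, 0.1]),
    ("Our refund policy page lists every exception.", [1.0, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.0, 1.0]),
]
results = two_stage_search(
    "refunds within how many days", [1.0, 0.0], corpus, word_overlap,
    top_k_retrieve=2, top_k_final=1,
)
```

In this toy corpus the vector stage ranks the policy-page sentence first, while the second stage promotes the sentence that actually answers the question. Lowering top_k_retrieve shrinks the amount of work the slow stage has to do.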

Encoder

Default: BAAI/bge-small-en-v1.5 (33M parameters, 384-dimensional embeddings). It is the smallest model that still sits in the top tier of MTEB retrieval scores, and it encodes a chunk in about 3 ms on CPU. For multilingual corpora, swap to intfloat/multilingual-e5-small.

from ragforge.embed import SentenceTransformerEncoder

enc = SentenceTransformerEncoder("BAAI/bge-small-en-v1.5", device="cpu")
vecs = enc.encode(["hello world"])

Vector store

Two backends ship in the box, Qdrant and NumPy; both implement the same three methods (upsert, search, count). Qdrant can run either locally or against a remote server.

Qdrant (local)

No server, no Docker. The client uses a local on-disk index.

from ragforge.vectorstore import QdrantStore

store = QdrantStore(collection="demo", dim=enc.dim, path="qdrant_storage")

Qdrant (remote)

For multi-process serving or shared indexes, point at a running Qdrant instance:

store = QdrantStore(collection="demo", dim=enc.dim, url="http://qdrant:6333")

NumPy (in-memory)

For unit tests and tiny corpora (<10k vectors).

from ragforge.vectorstore import InMemoryStore

store = InMemoryStore(dim=enc.dim)
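
To make the three-method contract concrete, here is a minimal brute-force store in pure Python. It is a sketch under assumptions: the class name TinyStore, the argument names, and the (id, score, payload) return shape are all hypothetical, and the real InMemoryStore signatures may differ.

```python
import math
from typing import Any

Hit = tuple[str, float, dict[str, Any]]  # (id, cosine score, payload)


class TinyStore:
    """Brute-force vector store with the upsert/search/count shape."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._points: dict[str, tuple[list[float], dict[str, Any]]] = {}

    @staticmethod
    def _unit(vec: list[float]) -> list[float]:
        # Normalizing once up front lets search use a plain dot product.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else list(vec)

    def upsert(self, ids: list[str], vectors: list[list[float]],
               payloads: list[dict[str, Any]]) -> None:
        for point_id, vec, payload in zip(ids, vectors, payloads):
            if len(vec) != self.dim:
                raise ValueError(f"expected {self.dim} dims, got {len(vec)}")
            # Writing under an existing id overwrites instead of duplicating.
            self._points[point_id] = (self._unit(vec), payload)

    def search(self, query: list[float], top_k: int = 5) -> list[Hit]:
        q = self._unit(query)
        scored = [
            (point_id, sum(a * b for a, b in zip(q, vec)), payload)
            for point_id, (vec, payload) in self._points.items()
        ]
        scored.sort(key=lambda hit: hit[1], reverse=True)
        return scored[:top_k]

    def count(self) -> int:
        return len(self._points)


tiny = TinyStore(dim=2)
tiny.upsert(["a", "b"], [[1.0, 0.0], [0.0, 1.0]],
            [{"text": "alpha"}, {"text": "beta"}])
tiny.upsert(["a"], [[0.6, 0.8]], [{"text": "alpha v2"}])  # same id, overwritten
best = tiny.search([0.0, 2.0], top_k=1)
```

Every query scans all stored vectors, so the cost grows linearly with corpus size. That is fine for unit tests and small corpora, and it is why an indexed backend makes sense beyond that.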

Reranker

The cross-encoder scores each (query, candidate) pair jointly, which yields much higher nDCG than the bi-encoder alone, especially on multi-hop questions.

from ragforge.rerank import BGEReranker

reranker = BGEReranker("BAAI/bge-reranker-base", device="cpu")
top5 = reranker.rerank("How long is the refund window?", hits, top_k=5)
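
Conceptually, reranking builds one (query, candidate) pair per hit, scores the pairs in a batch, and keeps the best top_k. The sketch below shows that logic with a pluggable scorer; the names rerank_texts, PairScorer, and toy_scorer are hypothetical and are not BGEReranker internals. The toy scorer only counts shared words so that the example runs without downloading a model.

```python
from typing import Callable, Sequence

# A pair scorer takes a batch of (query, candidate) pairs and returns one
# relevance score per pair, in the same order.
PairScorer = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank_texts(
    query: str,
    candidates: list[str],
    score_pairs: PairScorer,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Score every (query, candidate) pair in one batch and keep the best top_k."""
    if not candidates:
        return []
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(text, float(score)) for text, score in ranked[:top_k]]


def toy_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: counts query words that also appear in the candidate."""
    return [
        float(len(set(q.lower().split()) & set(t.lower().split())))
        for q, t in pairs
    ]


candidates = [
    "Shipping takes 3 to 5 business days.",
    "The refund window is 30 days.",
    "Contact support by email.",
]
top2 = rerank_texts("How long is the refund window?", candidates, toy_scorer, top_k=2)
```

With sentence-transformers installed, the predict method of a CrossEncoder loaded with BAAI/bge-reranker-base should be usable as score_pairs, since it accepts a list of sentence pairs and returns one score per pair.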

Skipping the reranker is also a valid choice when latency is paramount: pass use_reranker=False to Pipeline.from_defaults.

Tuning recipe

Symptom                               | First thing to try
Top hits are mostly irrelevant        | Increase top_k_retrieve, keep the reranker
Top hits are correct but answer wrong | Check chunk size: too small loses context, too big confuses the LLM
Latency too high                      | Drop the reranker, or rerank fewer candidates
Bad on multilingual queries           | Swap the encoder to intfloat/multilingual-e5-small
Doc-IDs duplicated                    | Make sure you didn't break the deterministic chunk-id (see pipeline._chunk_id)
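
On the last row: a deterministic chunk id means that re-ingesting the same document produces the same ids, so an upsert overwrites existing points instead of adding duplicates. The sketch below shows one way to get that property with UUIDv5. It is an assumption for illustration and not the body of pipeline._chunk_id; the function name, the namespace value, and the choice of inputs are hypothetical.

```python
import uuid

# Hypothetical namespace. Any fixed UUID works, as long as it never changes
# between ingestion runs.
CHUNK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/chunks")


def chunk_id(doc_id: str, chunk_index: int) -> str:
    """Derive a stable UUID string from the document id and chunk position."""
    return str(uuid.uuid5(CHUNK_NAMESPACE, f"{doc_id}:{chunk_index}"))


first = chunk_id("faq.md", 0)
again = chunk_id("faq.md", 0)   # same inputs, same id
second = chunk_id("faq.md", 1)  # different position, different id
```

UUID strings are a convenient format here because Qdrant accepts point ids only as unsigned integers or UUIDs. Ids built from uuid4 or a timestamp would change on every run and are a common cause of duplicated documents.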