Evaluation¶
RAGforge's reason for being. Most RAG starter projects ship without a way to measure whether your retrieval or your prompt is actually any good, so "improvements" become vibes. RAGforge ships three embedding-based metrics that run in pure Python on CPU, with no external API required.
+----------------+--------+
| metric         | mean   |
+----------------+--------+
| context_recall | 0.847  |
| answer_rel     | 0.781  |
| faithfulness   | 0.912  |
+----------------+--------+
n=120 · p50=620ms · p95=1.4s
The metrics¶
context_recall¶
What fraction of the ground-truth n-grams appear in at least one retrieved context block?
This isolates the retrieval stage from the generation stage: a low number means your bi-encoder + reranker are not pulling the right passages, and no amount of prompt engineering will fix it.
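As a rough illustration of the n-gram overlap described above, here is a minimal sketch. It is not RAGforge's implementation: the function names, the word-level tokenizer, and the default n-gram size are all assumptions made for the example.

```python
import re


def ngrams(text, n=1):
    """Return the set of word-level n-grams in text (lowercased)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def context_recall(ground_truth, contexts, n=1):
    """Fraction of ground-truth n-grams found in any context block.

    Hypothetical helper: tokenization and n are illustrative choices.
    """
    truth = ngrams(ground_truth, n)
    if not truth:
        return 0.0
    found = set()
    for block in contexts:
        found |= ngrams(block, n)
    return len(truth & found) / len(truth)
```

With the sample ground truth "14 days", a context block containing "within 14 days" scores 1.0, while a block that only mentions "30 days" scores 0.5 at n=1, which points at retrieval rather than the prompt.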
answer_relevance¶
Cosine similarity between the question's embedding and the answer's embedding.
Penalizes empty or "I don't know" answers by hard-capping their score at 0.2. A useful proxy for "did the model actually try to answer?".
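The scoring rule can be sketched as follows. This is an illustration under assumptions, not the library's code: `embed` stands for any callable that maps a string to a list of floats, and the refusal phrase list is a hypothetical heuristic for spotting "I don't know" answers.

```python
import math

# Hypothetical phrase list; the real detection rule is not documented here.
REFUSALS = ("i don't know", "i do not know")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question, answer, embed, cap=0.2):
    """Question/answer cosine similarity, capped for empty or refusing answers."""
    text = answer.strip()
    score = cosine(embed(question), embed(text))
    is_non_answer = not text or any(p in text.lower() for p in REFUSALS)
    # A non-answer can never score above the cap, however similar it looks.
    return min(score, cap) if is_non_answer else score
```

The cap matters because a refusal that repeats the question's wording would otherwise score highly on similarity alone.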
faithfulness¶
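Fraction of answer sentences that are supported by at least one context block.
Specifically: an answer sentence counts as "supported" if its cosine similarity with some retrieved context block crosses τ=0.55. Useful for hallucination detection: a score of 1.0 means every sentence has grounding in the retrieved context. Because support is measured by embedding similarity rather than logical entailment, treat the score as an approximation.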
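A minimal sketch of the thresholding described above, under stated assumptions: `embed` is any callable that maps a string to a list of floats, the regex sentence splitter is a simplification, and "crosses τ" is read here as greater than or equal to τ. None of these names come from RAGforge.

```python
import math
import re


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def faithfulness(answer, contexts, embed, tau=0.55):
    """Fraction of answer sentences whose best context similarity reaches tau."""
    # Naive split on sentence-ending punctuation; assumed, not the library's splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences or not contexts:
        return 0.0
    context_vectors = [embed(c) for c in contexts]
    supported = 0
    for sentence in sentences:
        vector = embed(sentence)
        if max(cosine(vector, c) for c in context_vectors) >= tau:
            supported += 1
    return supported / len(sentences)
```

An answer with one grounded sentence and one ungrounded sentence scores 0.5, which makes the metric easy to read as "share of the answer that has a source".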
Dataset format¶
JSONL, one object per line:
{"question": "How long is the refund window?", "ground_truth": "14 days"}
{"question": "How do I rotate an API key?", "ground_truth": "Settings > API Keys > Rotate"}
ground_truth is optional in general, but context_recall cannot be computed for a row without it.
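A small loader for this format might look like the sketch below. `load_dataset` and `recall_ready` are hypothetical helper names for illustration and are not part of RAGforge.

```python
import json


def load_dataset(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            if "question" not in row:
                raise ValueError(f"line {lineno}: missing 'question'")
            rows.append(row)
    return rows


def recall_ready(rows):
    """Rows that carry a ground_truth and can therefore feed context_recall."""
    return [row for row in rows if row.get("ground_truth")]
```

Checking `recall_ready` before a run shows how many rows can actually contribute to context_recall.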
Programmatic use¶
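import json

from ragforge import Pipeline
from ragforge.eval import evaluate

# Load the evaluation questions (JSONL, see "Dataset format").
# The path is a placeholder; point it at your own file.
with open("eval.jsonl", encoding="utf-8") as fh:
    questions = [json.loads(line) for line in fh if line.strip()]

rag = Pipeline.from_defaults(model_id="Qwen/Qwen2.5-3B-Instruct")
rag.ingest(["data/sample/"])

# Run each question through the pipeline and record what evaluate() needs.
samples = []
for q in questions:
    a = rag.ask(q["question"])
    samples.append({
        "question": q["question"],
        "answer": a.text,
        "contexts": [s.text for s in a.sources],
        "ground_truth": q.get("ground_truth"),  # None when the row has none
    })

res = evaluate(samples, encoder=rag.encoder)
print(res["means"])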
Compatibility with RAGAS¶
The metrics and dataset format match the RAGAS conventions, so you can swap one for the other without touching the orchestration code. If you want stricter evaluation, RAGAS adds NLI-based faithfulness via an LLM judge.