Evaluation

RAGforge's reason for being. Most RAG starter projects ship without a way to measure whether your retrieval or your prompt is actually any good, so "improvements" become vibes. RAGforge ships three embedding-based metrics that run in pure Python on CPU, with no external API required.

rf eval data/sample/qa.jsonl --collection demo

+----------------+--------+
|  metric        |  mean  |
+----------------+--------+
| context_recall |  0.847 |
| answer_rel     |  0.781 |
| faithfulness   |  0.912 |
+----------------+--------+
n=120  ·  p50=620ms  ·  p95=1.4s

The metrics

`context_recall`

What fraction of the ground-truth n-grams appear in at least one retrieved context block?

This isolates the retrieval stage from the generation stage: a low number means your bi-encoder + reranker are not pulling the right passages, and no amount of prompt engineering will fix it.
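The description above can be sketched in a few lines. This is a minimal illustration, not RAGforge's implementation: the function names, the regex tokenizer, and the default of word-level unigrams (`n=1`) are all assumptions made for the example.

```python
import re


def ngrams(text: str, n: int = 1) -> set[tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def context_recall_sketch(ground_truth: str, contexts: list[str], n: int = 1) -> float:
    """Fraction of ground-truth n-grams found in at least one context block."""
    truth = ngrams(ground_truth, n)
    if not truth:
        return 0.0  # assumption: an empty ground truth scores 0
    found: set[tuple[str, ...]] = set()
    for block in contexts:
        found |= ngrams(block, n)
    return len(truth & found) / len(truth)


score = context_recall_sketch(
    "14 days",
    ["Refunds are accepted within 14 days of purchase.", "Shipping takes 3 days."],
)
print(score)  # 1.0: both "14" and "days" appear in the first block
```

If only the shipping passage had been retrieved, the same call would return 0.5, which points at retrieval rather than generation as the stage to fix.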

`answer_relevance`

Cosine similarity between the question's embedding and the answer's embedding.

Penalizes empty or "I don't know" answers by hard-capping their score at 0.2. A useful proxy for "did the model actually try to answer?".
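A rough sketch of that scoring rule, assuming the two embeddings have already been computed. The function names and the list of refusal phrases are hypothetical; how RAGforge actually detects an "I don't know" answer is not specified here.

```python
import math
from typing import Sequence

REFUSAL_CAP = 0.2  # the hard cap described above
# Hypothetical phrase list, used only for this illustration.
REFUSAL_PHRASES = ("i don't know", "i do not know")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance_sketch(
    question_vec: Sequence[float],
    answer_vec: Sequence[float],
    answer_text: str,
) -> float:
    """Question/answer cosine similarity, capped for empty or refusal answers."""
    score = cosine(question_vec, answer_vec)
    text = answer_text.strip().lower()
    if not text or any(phrase in text for phrase in REFUSAL_PHRASES):
        return min(score, REFUSAL_CAP)
    return score


# Toy 3-dimensional vectors standing in for real encoder output.
q_vec = [0.9, 0.1, 0.0]
a_vec = [0.8, 0.2, 0.1]
print(answer_relevance_sketch(q_vec, a_vec, "The refund window is 14 days."))
print(answer_relevance_sketch(q_vec, a_vec, "I don't know."))  # never above 0.2
```

The cap matters because a refusal that echoes the question's wording can still embed close to it; without the cap such answers would look relevant.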

`faithfulness`

Fraction of answer sentences that are supported by at least one context block.

Specifically: an answer sentence is "supported" if its cosine similarity with some retrieved context block crosses τ=0.55. Useful for hallucination detection: a 1.0 score means every sentence has grounding.
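The thresholding rule can be illustrated as follows. This is a sketch under stated assumptions: the regex sentence splitter, the `embed` callable, the toy bag-of-words encoder, and the choice of `>=` for "crosses τ" are all illustrative and not taken from RAGforge.

```python
import math
import re
from typing import Callable, Sequence

TAU = 0.55  # support threshold from the description above


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def faithfulness_sketch(
    answer: str,
    contexts: list[str],
    embed: Callable[[str], Sequence[float]],
    tau: float = TAU,
) -> float:
    """Fraction of answer sentences whose best context similarity reaches tau."""
    sentences = split_sentences(answer)
    if not sentences or not contexts:
        return 0.0  # assumption: nothing to support, or nothing to support it with
    context_vecs = [embed(c) for c in contexts]
    supported = 0
    for sentence in sentences:
        vec = embed(sentence)
        if max(cosine(vec, c) for c in context_vecs) >= tau:
            supported += 1
    return supported / len(sentences)


# Toy encoder: word counts over a tiny fixed vocabulary (a stand-in for a real model).
VOCAB = ["refund", "days", "shipping", "free"]


def toy_embed(text: str) -> list[float]:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(w)) for w in VOCAB]


contexts = ["Refund requests are accepted within 14 days."]
result = faithfulness_sketch(
    "The refund window is 14 days. Shipping is free.", contexts, toy_embed
)
print(result)  # 0.5: the shipping claim has no supporting context
```

Because the check is similarity-based rather than true entailment, a sentence that contradicts a context block while reusing its vocabulary can still count as supported.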

Dataset format

JSONL, one object per line:

{"question": "How long is the refund window?", "ground_truth": "14 days"}
{"question": "How do I rotate an API key?",   "ground_truth": "Settings > API Keys > Rotate"}

ground_truth is optional in general, but context_recall cannot be computed without it.
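Before running an evaluation it can help to check how many records actually carry a ground_truth. The helper below is a hypothetical sketch using only the standard library; `parse_qa_lines` is not part of RAGforge, and treating a missing "question" as an error is an assumption.

```python
import json
from typing import Iterable


def parse_qa_lines(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines, skipping blanks and requiring a 'question' key."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "question" not in record:
            raise ValueError(f"line {lineno}: missing 'question'")
        records.append(record)
    return records


raw = [
    '{"question": "How long is the refund window?", "ground_truth": "14 days"}',
    '{"question": "How do I rotate an API key?"}',
]
records = parse_qa_lines(raw)
with_truth = [r for r in records if r.get("ground_truth")]
print(len(records), len(with_truth))  # 2 1
```

With a real file, pass the open file handle instead of the list, since iterating a text file yields one line at a time.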

Programmatic use

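import json

from ragforge import Pipeline
from ragforge.eval import evaluate

# Load the QA pairs: one JSON object per line, as described under "Dataset format".
# The path is the sample file used in the CLI example; adjust it to your own data.
with open("data/sample/qa.jsonl", encoding="utf-8") as f:
    questions = [json.loads(line) for line in f if line.strip()]

rag = Pipeline.from_defaults(model_id="Qwen/Qwen2.5-3B-Instruct")
rag.ingest(["data/sample/"])

# Run every question through the pipeline and record what evaluate() needs.
samples = []
for q in questions:
    a = rag.ask(q["question"])
    samples.append({
        "question": q["question"],
        "answer": a.text,
        "contexts": [s.text for s in a.sources],
        "ground_truth": q.get("ground_truth"),  # None when the record has no ground truth
    })

res = evaluate(samples, encoder=rag.encoder)
print(res["means"])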

Compatibility with RAGAS

The metrics and dataset format match the RAGAS conventions, so you can swap one for the other without touching the orchestration code. If you want stricter evaluation, RAGAS adds NLI-based faithfulness via an LLM judge.