LLM serving¶
RAGforge ships two LLM backends, both with the same generate(prompt, ...)
interface — anything else implementing the LLM protocol slots in.
HFLLM — plain HuggingFace¶
from ragforge.llm import HFLLM
llm = HFLLM("Qwen/Qwen2.5-3B-Instruct", dtype="auto", device_map="auto")
The wrapper applies the model's chat template when available (instruct models) and falls back to raw completion otherwise.
QuantizedHFLLM — via turboquant-ml¶
The whole point of pairing with TurboQuant. Loads the model, then quantizes
it in place using any TurboQuant method (bnb-nf4, bnb-int8, gptq, awq,
int8-dynamic for CPU, …).
from ragforge.llm import QuantizedHFLLM
llm = QuantizedHFLLM(
"meta-llama/Llama-3.2-3B-Instruct",
method="bnb-nf4",
)
Requires the [quantized] extra:
Rule of thumb for picking a method¶
| Hardware | Model size | Recommended |
|---|---|---|
| RTX 3060+ (8 GB) | 3 B | bnb-nf4 |
| RTX 4070+ (12 GB) | 7-8 B | bnb-nf4 or gptq |
| RTX 4090 / A100 | 13 B | awq or gptq |
| CPU only | <2 B | int8-dynamic (CPU GEMM path) |
| Apple Silicon | up to 13 B | bf16 (Metal works well) |
Bring your own¶
Implement the LLM protocol — just .generate() — and pass any object to
the pipeline. Useful for routing through an in-house inference server or an
SDK.