Skip to content

RAGforge

LLM serving

Ademo93/ragforge

LLM serving¶

RAGforge ships two LLM backends, both with the same generate(prompt, ...) interface — anything else implementing the LLM protocol slots in.

`HFLLM` — plain HuggingFace¶

from ragforge.llm import HFLLM

llm = HFLLM("Qwen/Qwen2.5-3B-Instruct", dtype="auto", device_map="auto")

The wrapper applies the model's chat template when available (instruct models) and falls back to raw completion otherwise.

`QuantizedHFLLM` — via `turboquant-ml`¶

The whole point of pairing with TurboQuant. Loads the model, then quantizes it in place using any TurboQuant method (bnb-nf4, bnb-int8, gptq, awq, int8-dynamic for CPU, …).

from ragforge.llm import QuantizedHFLLM

llm = QuantizedHFLLM(
    "meta-llama/Llama-3.2-3B-Instruct",
    method="bnb-nf4",
)

Requires the [quantized] extra:

pip install "ragforge-ml[quantized]"

Rule of thumb for picking a method¶

Hardware	Model size	Recommended
RTX 3060+ (8 GB)	3 B	`bnb-nf4`
RTX 4070+ (12 GB)	7-8 B	`bnb-nf4` or `gptq`
RTX 4090 / A100	13 B	`awq` or `gptq`
CPU only	<2 B	`int8-dynamic` (CPU GEMM path)
Apple Silicon	up to 13 B	`bf16` (Metal works well)

Bring your own¶

Implement the LLM protocol — just .generate() — and pass any object to the pipeline. Useful for routing through an in-house inference server or an SDK.

class MyLLM:
    def generate(self, prompt, *, max_new_tokens=256, temperature=0.0, stop=None) -> str:
        return call_my_service(prompt, max_new_tokens, temperature, stop)

rag = Pipeline.from_defaults(llm=MyLLM())