LLM serving

RAGforge ships two LLM backends that share the same generate(prompt, ...) interface. Any other object that implements the LLM protocol slots in the same way.

HFLLM — plain HuggingFace

from ragforge.llm import HFLLM

llm = HFLLM("Qwen/Qwen2.5-3B-Instruct", dtype="auto", device_map="auto")

The wrapper applies the model's chat template when available (instruct models) and falls back to raw completion otherwise.

QuantizedHFLLM — via turboquant-ml

This backend is the main reason to pair RAGforge with TurboQuant. It loads the model, then quantizes it in place using any TurboQuant method (bnb-nf4, bnb-int8, gptq, awq, int8-dynamic for CPU, …).

from ragforge.llm import QuantizedHFLLM

llm = QuantizedHFLLM(
    "meta-llama/Llama-3.2-3B-Instruct",
    method="bnb-nf4",
)

Requires the [quantized] extra:

pip install "ragforge-ml[quantized]"

Rule of thumb for picking a method

| Hardware | Model size | Recommended |
| --- | --- | --- |
| RTX 3060+ (8 GB) | 3 B | bnb-nf4 |
| RTX 4070+ (12 GB) | 7-8 B | bnb-nf4 or gptq |
| RTX 4090 / A100 | 13 B | awq or gptq |
| CPU only | <2 B | int8-dynamic (CPU GEMM path) |
| Apple Silicon | up to 13 B | bf16 (Metal works well) |
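A rough way to sanity-check these pairings is to estimate how much memory the weights alone occupy. The sketch below is back-of-the-envelope arithmetic, not part of RAGforge or TurboQuant: `weight_memory_gb` and the `BITS_PER_WEIGHT` table are hypothetical, and the table assumes 4-bit storage for gptq and awq (both can be configured otherwise) and ignores quantization metadata such as scales.

```python
# Approximate bits stored per weight for each method name used above.
# Assumption: gptq and awq run in their common 4-bit configuration.
BITS_PER_WEIGHT = {
    "bf16": 16,
    "bnb-int8": 8,
    "int8-dynamic": 8,
    "bnb-nf4": 4,
    "gptq": 4,
    "awq": 4,
}


def weight_memory_gb(n_params: float, method: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes), excluding activations and KV cache."""
    return n_params * BITS_PER_WEIGHT[method] / 8 / 1e9


if __name__ == "__main__":
    for params, method in [(3e9, "bnb-nf4"), (8e9, "bnb-nf4"), (13e9, "bf16")]:
        print(f"{params / 1e9:.0f} B params, {method}: ~{weight_memory_gb(params, method):.1f} GB")
```

Whatever memory is left after the weights has to cover activations, the KV cache, and framework overhead, so the estimate is a lower bound rather than a guarantee that a model fits.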

Bring your own

Implement the LLM protocol, which is just a .generate() method, and pass the object to the pipeline. This is useful for routing through an in-house inference server or an SDK.

class MyLLM:
    def generate(self, prompt, *, max_new_tokens=256, temperature=0.0, stop=None) -> str:
        # call_my_service stands in for your own client code (HTTP request, SDK call, ...).
        return call_my_service(prompt, max_new_tokens, temperature, stop)

rag = Pipeline.from_defaults(llm=MyLLM())
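For a custom backend it can help to write the expected shape down as a structural type and check it before wiring anything into the pipeline. The sketch below is an assumption about what such a check might look like: `LLMLike` and `EchoLLM` are hypothetical names, and the signature simply mirrors the `generate()` shown above rather than RAGforge's own protocol definition.

```python
from typing import Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class LLMLike(Protocol):
    """Hypothetical structural type mirroring the generate() signature above."""

    def generate(
        self,
        prompt: str,
        *,
        max_new_tokens: int = 256,
        temperature: float = 0.0,
        stop: Optional[Sequence[str]] = None,
    ) -> str: ...


class EchoLLM:
    """Toy backend: echoes the prompt and truncates at the first stop string found."""

    def generate(self, prompt, *, max_new_tokens=256, temperature=0.0, stop=None) -> str:
        text = f"echo: {prompt}"
        for s in stop or ():
            idx = text.find(s)
            if idx != -1:
                text = text[:idx]  # cut everything from the stop string onward
        return text


if __name__ == "__main__":
    llm = EchoLLM()
    # isinstance() on a runtime_checkable Protocol only checks that the method exists,
    # not that its signature matches.
    print(isinstance(llm, LLMLike))
    print(llm.generate("hello STOP ignored", stop=["STOP"]))
```

A deterministic stand-in like this is also handy in tests, since it exercises the pipeline without loading any model weights.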