Benchmarks

Every quantization comparison should report at least four numbers: model size, latency, memory, and a task metric. TurboQuant ships helpers for all four.

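from turboquant.benchmark import compare

# fp16_model, int4_model, and tok are assumed to be loaded already:
# the FP16 baseline, the quantized candidate, and their shared tokenizer.
report = compare(
    baseline=fp16_model,
    candidate=int4_model,
    tokenizer=tok,
    prompts=["Explain quantization."],
    metrics=("latency", "memory", "size", "perplexity"),
    iters=50,  # timed runs per model
)
print(report.as_table())
# Save the raw numbers so the run can be compared against later ones.
report.save("benchmarks/results/llama-1b.json")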

Latency

measure_latency(fn, warmup=10, iters=50) calls fn warmup times without timing, then times iters runs. On CUDA it uses CUDA events with synchronization; on CPU it falls back to time.perf_counter. It reports the mean, median, p95, and p99.

Always discard at least 5 warmup runs: JIT compilation, kernel autotuning, and GPU clock boosting all happen on the first calls and would otherwise inflate the timings.
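The warmup-then-time loop can be sketched in a few lines. This is a minimal CPU-only illustration using the standard library, not TurboQuant's implementation: measure_latency_cpu and percentile are hypothetical names, and the nearest-rank percentile is an assumption about how p95 and p99 might be computed. On CUDA, kernels launch asynchronously, so a wall-clock timer like this would also need explicit synchronization around each run.

```python
import math
import statistics
import time


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def measure_latency_cpu(fn, warmup=10, iters=50):
    """Hypothetical CPU-only sketch: discard warmup runs, then time iters runs."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-off costs such as caching and lazy init

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    samples_ms.sort()
    return {
        "mean": statistics.fmean(samples_ms),
        "median": statistics.median(samples_ms),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


stats = measure_latency_cpu(lambda: sum(range(10_000)))
print({name: round(value, 4) for name, value in stats.items()})
```

With only 50 samples, p99 under nearest-rank is simply the slowest run, which is one reason to raise iters when tail latency matters.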

Memory

measure_memory() is a context manager. On CUDA it resets the peak-memory counter on entry and reads torch.cuda.max_memory_allocated() at exit; on CPU it reports the change in process RSS as measured by psutil. Wrap it around the call that exercises the model, not around model loading, because the number you want is peak inference-time memory.
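The reset-on-entry, read-on-exit pattern looks roughly like the sketch below. To keep it runnable with the standard library alone, it swaps in tracemalloc (Python heap allocations, Python 3.9+ for reset_peak) in place of the CUDA counter or psutil RSS described above, so the numbers are not comparable to what measure_memory() reports. measure_peak_memory is a hypothetical name.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def measure_peak_memory():
    """Hypothetical sketch: peak Python-heap growth inside the with-block.

    Stand-in technique: tracemalloc instead of CUDA peak stats or psutil RSS.
    """
    result = {"peak_bytes": None}
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    tracemalloc.reset_peak()  # forget any peak reached before the block
    baseline, _ = tracemalloc.get_traced_memory()
    try:
        yield result
    finally:
        _, peak = tracemalloc.get_traced_memory()
        result["peak_bytes"] = peak - baseline
        if started_here:
            tracemalloc.stop()


# "Model loading" stays outside the block so it is not counted.
weights = [0.0] * 1000

with measure_peak_memory() as mem:
    scratch = bytearray(4_000_000)  # stands in for inference-time activations
    del scratch                     # freed before exit, but the peak remembers it

print(mem["peak_bytes"])
```

The scratch buffer is released before the block ends, yet it still shows up in the result, which is the point of measuring a peak rather than the difference between start and end.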

Model size

model_size_bytes(model) serializes the state dict to a temporary file and reads the file's size. This counts the bytes actually written, including the packed quantization state that bitsandbytes stores alongside the weights, which a parameter-only estimate such as sum(p.numel() * p.element_size() for p in model.parameters()) would miss.
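A minimal sketch of the serialize-and-measure approach follows. It is not TurboQuant's code: serialized_size_bytes is a hypothetical name, pickle.dump stands in for the real serializer so the example runs without PyTorch, and the toy state dict (including its key names) is made up for illustration. With PyTorch installed, torch.save accepts the same (obj, file) arguments and could be passed as dump.

```python
import os
import pickle
import tempfile


def serialized_size_bytes(state, dump=pickle.dump):
    """Hypothetical sketch: write state to a temp file and return its size."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        with os.fdopen(fd, "wb") as f:
            dump(state, f)  # assumption: dump(obj, file), like torch.save
        return os.path.getsize(path)
    finally:
        os.remove(path)  # never leave the temp file behind


# Toy stand-in for a quantized state dict: packed weights plus the
# per-block metadata that a parameter-only count would not see.
state = {
    "weight_packed": bytes(1024),
    "quant_state": {"absmax": [1.0] * 32, "blocksize": 64},
}

size = serialized_size_bytes(state)
print(size, "bytes on disk vs", len(state["weight_packed"]), "bytes of packed weights")
```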

Perplexity

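perplexity(model, tokenizer, texts, max_length=2048, stride=1024) implements the sliding-window recipe from the Hugging Face Transformers documentation: each window holds up to max_length tokens, the window advances by stride, and only tokens not already scored by a previous window count toward the loss. Lower is better.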
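The windowing arithmetic is the part that is easy to get wrong, so here is a model-free sketch of it. sliding_windows and sliding_perplexity are hypothetical names, and nll_fn is an assumed callback that returns the summed negative log-likelihood of the last n_targets tokens in a window; in a real run it would call the model. A uniform toy model over 50 tokens is used so the expected perplexity is known in advance.

```python
import math


def sliding_windows(seq_len, max_length=2048, stride=1024):
    """Yield (begin, end, n_targets): n_targets counts tokens not yet scored."""
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def sliding_perplexity(token_ids, nll_fn, max_length=2048, stride=1024):
    """exp(total NLL / number of scored tokens) over overlapping windows."""
    if not token_ids:
        raise ValueError("token_ids must not be empty")
    total_nll = 0.0
    total_targets = 0
    for begin, end, n_targets in sliding_windows(len(token_ids), max_length, stride):
        # Earlier tokens in the window serve as context only.
        total_nll += nll_fn(token_ids[begin:end], n_targets)
        total_targets += n_targets
    return math.exp(total_nll / total_targets)


VOCAB = 50  # toy vocabulary size


def uniform_nll(window, n_targets):
    """Toy model: every token has probability 1/VOCAB regardless of context."""
    return n_targets * math.log(VOCAB)


windows = list(sliding_windows(5000))
ppl = sliding_perplexity(list(range(5000)), uniform_nll)
print(windows)
print(ppl)
```

A smaller stride gives each scored token more left context, at the cost of more forward passes. A real causal language model also cannot score the very first token of the text, which this toy callback ignores.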

Reproducibility

  • Pin the seed: from turboquant.utils import seed_everything; seed_everything(0)
  • Pin the GPU clock: nvidia-smi --lock-gpu-clocks=1500,1500 (minimum and maximum in MHz; requires administrator privileges)
  • Report median and p95, not just the mean
  • Always include an FP16/BF16 baseline on the same hardware
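For readers who want to see what seed pinning typically involves, here is a hedged sketch. It is not the source of turboquant.utils.seed_everything: seed_everything_sketch is a hypothetical name, and which generators the real helper seeds is an assumption. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything_sketch(seed=0):
    """Hypothetical sketch: seed every random generator that is available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything_sketch(0)
first = [random.random() for _ in range(3)]
seed_everything_sketch(0)
second = [random.random() for _ in range(3)]
print(first == second)
```

Seeding makes sampled prompts and any stochastic decoding repeatable; it does not remove timing noise, which is why the clock pinning and percentile reporting above still matter.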