Pruning¶

from turboquant import prune
pruned = prune(model, strategy="l1-channel", sparsity=0.3)

Unstructured vs structured¶

Unstructured sparsity (magnitude) zeros individual weights anywhere in the tensor. On dense GPU kernels this gives no speedup — it pays off only with sparse runtimes or as a regularizer during fine-tuning.

Structured sparsity (l1-channel, l2-channel, random-channel) removes whole output channels / filters. The weight tensor genuinely shrinks, FLOPs drop, and ONNX shape inference picks it up.

N:M sparsity (nm-sparsity) keeps N out of every M consecutive weights. The 2:4 pattern is accelerated by NVIDIA Ampere/Hopper sparse Tensor Cores.

Strategies¶

`magnitude`¶

prune(model, strategy="magnitude", sparsity=0.5, scope="global")

scope="global" ranks all eligible weights together (recommended). scope="layerwise" applies the ratio independently per layer.

`l1-channel` / `l2-channel`¶

Score each output channel by the Lₙ norm of its weights, drop the lowest.

prune(model, strategy="l1-channel", sparsity=0.3)

`nm-sparsity`¶

prune(model, strategy="nm-sparsity", n=2, m=4)

Typical recipe¶

Train the dense model.
Prune to target sparsity.
Fine-tune (1–10% of original schedule) to recover accuracy.
Quantize the pruned model — pruning + INT8 stacks well.

The "prune → finetune → quantize" sequence usually outperforms doing quantization first.