Pruning¶
Unstructured vs structured¶
Unstructured sparsity (magnitude) zeros individual weights anywhere in
the tensor. On dense GPU kernels this gives no speedup, because the zeros are
still stored and multiplied; it pays off only with sparse runtimes or as a
regularizer during fine-tuning.
Structured sparsity (l1-channel, l2-channel, random-channel) removes
whole output channels / filters. The weight tensor genuinely shrinks, FLOPs
drop, and ONNX shape inference picks up the smaller shapes.
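To make the FLOPs claim concrete, here is a back-of-the-envelope sketch for two stacked 3×3 convolutions. The layer sizes and the helper `conv_macs` are made up for illustration; the point is only that removing output channels from one layer also removes the matching input channels from the next.

```python
def conv_macs(c_in, c_out, kernel, h_out, w_out):
    """Multiply-accumulates for one dense 2D convolution (bias ignored)."""
    return c_in * c_out * kernel * kernel * h_out * w_out

# Hypothetical two-layer stack on a 32x32 feature map.
dense = conv_macs(64, 128, 3, 32, 32) + conv_macs(128, 128, 3, 32, 32)

# Drop half of the first conv's output channels (128 -> 64).
# The second conv then sees 64 input channels instead of 128.
pruned = conv_macs(64, 64, 3, 32, 32) + conv_macs(64, 128, 3, 32, 32)

print(pruned / dense)  # 0.5
```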
N:M sparsity (nm-sparsity) keeps N out of every M consecutive weights.
The 2:4 pattern is accelerated by NVIDIA Ampere/Hopper sparse Tensor Cores.
Strategies¶
magnitude¶
scope="global" ranks all eligible weights together (recommended).
scope="layerwise" applies the ratio independently per layer.
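The difference between the two scopes can be sketched in a few lines of NumPy. This is an illustration of the idea, not this library's implementation: `magnitude_masks` and `_threshold` are hypothetical names, and only the `scope` values come from the text above.

```python
import numpy as np

def _threshold(mags, ratio):
    """Magnitude of the k-th smallest entry, where k = ratio * number of weights."""
    k = int(ratio * mags.size)
    return np.sort(mags, axis=None)[k - 1] if k > 0 else -np.inf

def magnitude_masks(weights, ratio, scope="global"):
    """Return {layer name: boolean keep-mask}. Ties at the threshold are pruned too."""
    if scope == "global":
        # One threshold shared by every eligible weight in the model.
        pooled = np.concatenate([np.abs(w).ravel() for w in weights.values()])
        t = _threshold(pooled, ratio)
        return {name: np.abs(w) > t for name, w in weights.items()}
    if scope == "layerwise":
        # A separate threshold per layer, so each layer loses the same fraction.
        return {name: np.abs(w) > _threshold(np.abs(w), ratio)
                for name, w in weights.items()}
    raise ValueError(f"unknown scope: {scope!r}")

# Toy model: one layer with small weights, one with large weights.
weights = {
    "small": np.array([0.1, 0.2, 0.3, 0.4]),
    "large": np.array([1.0, 2.0, 3.0, 4.0]),
}
global_masks = magnitude_masks(weights, 0.5, scope="global")
layer_masks = magnitude_masks(weights, 0.5, scope="layerwise")
```

In this toy case the global scope removes every weight in `small` and none in `large`, while the layerwise scope removes half of each. Global ranking puts sparsity where weights matter least, but at high ratios it is worth checking that no layer has been emptied.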
l1-channel / l2-channel¶
Score each output channel by the L1 or L2 norm of its weights, then drop the lowest-scoring channels.
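A minimal NumPy sketch of this scoring step, assuming a conv weight laid out as (C_out, C_in, kH, kW). The function name `prune_output_channels` and its parameters are hypothetical and are not taken from this library.

```python
import numpy as np

def prune_output_channels(weight, ratio, norm="l1"):
    """Drop the lowest-norm output channels; return (smaller weight, kept indices)."""
    flat = np.abs(weight.reshape(weight.shape[0], -1))
    if norm == "l1":
        scores = flat.sum(axis=1)
    elif norm == "l2":
        scores = np.sqrt((flat ** 2).sum(axis=1))
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    n_drop = int(ratio * weight.shape[0])
    # Sort channels by score, discard the n_drop lowest, keep original order.
    keep = np.sort(np.argsort(scores, kind="stable")[n_drop:])
    return weight[keep], keep

# Four output channels filled with constant values 3, 1, 4, 2.
w = np.stack([np.full((2, 3, 3), v) for v in (3.0, 1.0, 4.0, 2.0)])
pruned, keep = prune_output_channels(w, ratio=0.5, norm="l1")
print(pruned.shape, keep)  # (2, 2, 3, 3) [0 2]
```

The returned `keep` indices matter beyond this layer: the next layer's input channels, and any per-channel parameters such as biases or batch-norm statistics, have to be sliced with the same indices.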
nm-sparsity¶
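As a rough illustration of the N:M pattern, the sketch below keeps the N largest-magnitude weights in each group of M consecutive weights. It assumes groups run along the last axis of the tensor and that this axis is divisible by M; `nm_mask` is a hypothetical helper, not this library's API.

```python
import numpy as np

def nm_mask(weight, n=2, m=4):
    """Boolean keep-mask with exactly n True entries per group of m consecutive weights."""
    if weight.shape[-1] % m != 0:
        raise ValueError("last dimension must be divisible by m")
    groups = np.abs(weight).reshape(-1, m)
    # Column indices of the n largest magnitudes in each group.
    top = np.argsort(groups, axis=1)[:, m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask.reshape(weight.shape)

w = np.array([[0.1, -0.9, 0.5, 0.2, 0.7, 0.0, -0.3, 0.4]])
mask = nm_mask(w, n=2, m=4)
sparse_w = w * mask  # every group of 4 now holds exactly 2 non-zeros
```

Unlike global magnitude pruning, the sparsity here is fixed at 1 - N/M (50% for 2:4) and is spread evenly across the tensor, which is what lets hardware exploit it.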
Typical recipe¶
- Train the dense model.
- Prune to target sparsity.
- Fine-tune (1–10% of original schedule) to recover accuracy.
- Quantize the pruned model; pruning and INT8 quantization stack well.
The "prune → fine-tune → quantize" sequence usually outperforms quantizing first.
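One reason the two techniques combine cleanly can be shown in a toy sketch: under symmetric INT8 quantization a pruned weight of exactly 0.0 maps to the integer 0, so the sparsity pattern survives the final step. The code below assumes per-tensor symmetric quantization and skips the fine-tuning step entirely; `quantize_int8_symmetric` and the 0.05 threshold are illustrative choices, not part of this library.

```python
import numpy as np

def quantize_int8_symmetric(w):
    """Per-tensor symmetric INT8: q = round(w / scale) with scale = max|w| / 127."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.array([0.02, -0.8, 0.5, -0.01, 0.3, 0.04])

# Step 2 of the recipe: prune (here, a fixed magnitude threshold).
mask = np.abs(w) > 0.05
w_pruned = w * mask

# Step 4 of the recipe: quantize the pruned weights.
q, scale = quantize_int8_symmetric(w_pruned)

# Pruned positions are still exactly zero after quantization.
assert np.all(q[~mask] == 0)
```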