Description
Hi

As shown in the figure, during the decoding phase the 2:4 sparsity model is about 12% slower than the dense model. My questions are as follows:
- Is the decode phase dominated by GEMV / small-N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?
- Even so, why is the 2:4 sparsity model slower than the dense model?
- If we increase N > 1 (e.g., batch multiple requests or generate multiple tokens at once so the operation becomes a GEMM), can we observe a measurable 2:4 sparsity speed-up? (See the benchmark sketch below.)
- Are there any sparse kernels or recommended practices for GEMV (matrix-vector) that can take advantage of 2:4 sparsity?
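For context, here is roughly how I would measure the third point. This is only a minimal sketch: the weight shape, the hand-made 2:4 mask, and the timing loop are illustrative assumptions, not the actual model's layers; it uses torch.sparse.to_sparse_semi_structured and sweeps N from a decode-like GEMV up to larger batches.

```python
# Minimal benchmark sketch (assumed shapes, not the real model): compare a
# dense fp16 mm against the same mm with a 2:4 semi-structured weight while
# sweeping the token dimension N.
import torch
from torch.sparse import to_sparse_semi_structured

torch.manual_seed(0)
M, K = 4096, 4096  # assumed weight shape (out_features x in_features)
W = torch.randn(M, K, dtype=torch.float16, device="cuda")

# Impose a trivial 2:4 pattern (keep 2 of every 4 elements along K) so the
# weight is accepted by to_sparse_semi_structured.
mask = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda").tile(M, K // 4)
W_24 = W * mask
W_sparse = to_sparse_semi_structured(W_24)

def bench_ms(fn, iters=50):
    # crude CUDA-event timing; good enough to see relative trends
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):  # warm-up
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# N = 1 mimics single-token decode (GEMV); larger N mimics batched decode or
# prefill. The sparse backend may reject some shapes, hence the try/except.
for N in (1, 8, 32, 128, 512, 2048):
    x = torch.randn(K, N, dtype=torch.float16, device="cuda")
    t_dense = bench_ms(lambda: torch.mm(W_24, x))
    try:
        t_sparse = bench_ms(lambda: torch.mm(W_sparse, x))
        print(f"N={N:4d}  dense {t_dense:.3f} ms  2:4 sparse {t_sparse:.3f} ms")
    except RuntimeError as err:
        print(f"N={N:4d}  dense {t_dense:.3f} ms  2:4 sparse failed: {err}")
```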
Environment
NVIDIA GeForce RTX 4090, 8.9, P2
=== Python / OS ===
3.11.13 Linux-6.5.0-18-generic-x86_64-with-glibc2.35
=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)
=== cuBLASLt ===
cuBLASLt version: 0
=== TensorRT ===
TensorRT not installed
Thanks!