
Why is my 2:4 sparse model slower than dense in the decode stage of LLaMA2‑7B? #3505

@wang-qitong

Description


Hi

[Figure: decode-phase timing of the dense vs. 2:4 sparse LLaMA2‑7B model]

As shown in the figure, the 2:4 sparse model is about 12% slower than the dense model during the decode phase. My questions are:

  • Is the decode phase dominated by GEMV / small‑N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?

  • Even so, why is the 2:4 sparse model actually slower than the dense model, rather than merely matching it?

  • If we increase N > 1 (e.g., batch multiple requests or generate multiple tokens at once so the operation becomes a GEMM), can we observe a measurable 2:4 sparsity speed‑up? (A benchmark sketch follows this list.)

  • Are there any sparse kernels or recommended practices for GEMV (matrix‑vector) that can take advantage of 2:4 sparsity?
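
For the N > 1 question, here is a minimal benchmark sketch (not the script in the attached zip; the 4096×4096 layer size, fp16 dtype, and swept N values are illustrative assumptions) that sweeps the token count N over a single projection layer using the prototype `torch.sparse.to_sparse_semi_structured` API available in torch 2.2:

```python
import torch
import torch.nn.functional as F
from torch.sparse import to_sparse_semi_structured

device, dtype = "cuda", torch.float16

# Assumed layer size: a square 4096x4096 projection, matching LLaMA2-7B's hidden size.
M, K = 4096, 4096

# Build a weight that already satisfies the 2:4 pattern (2 of every 4 elements along K kept).
W = torch.randn(M, K, device=device, dtype=dtype)
mask = torch.tensor([1, 1, 0, 0], device=device, dtype=torch.bool).repeat(M, K // 4)
W_dense = W * mask                            # dense tensor with 2:4 structure
W_sparse = to_sparse_semi_structured(W_dense) # compressed 2:4 representation

def bench_ms(fn, iters=100):
    """Average runtime in ms, using CUDA events and a warm-up to exclude setup cost."""
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# N = 1 approximates the decode-phase GEMV; larger N approximates batched / prefill GEMMs.
for N in (1, 8, 32, 128, 512):
    x = torch.randn(N, K, device=device, dtype=dtype)
    t_dense = bench_ms(lambda: F.linear(x, W_dense))
    try:
        t_sparse = bench_ms(lambda: F.linear(x, W_sparse))
        print(f"N={N:4d}  dense {t_dense:.3f} ms   2:4 sparse {t_sparse:.3f} ms   "
              f"speed-up x{t_dense / t_sparse:.2f}")
    except RuntimeError as err:
        # Very small N may be rejected or require padding, depending on the sparse backend.
        print(f"N={N:4d}  dense {t_dense:.3f} ms   2:4 sparse: failed ({err})")

# To see which CUDA kernel is actually dispatched at a given N, wrap one call in
# torch.profiler, e.g.:
#   from torch.profiler import profile, ProfilerActivity
#   with profile(activities=[ProfilerActivity.CUDA]) as prof:
#       F.linear(x, W_sparse)
#   print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

CUDA-event timing with a warm-up pass is used so that one-off costs (weight compression, kernel selection) are not counted against either variant.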

Environment

NVIDIA GeForce RTX 4090, compute capability 8.9, P2

=== Python / OS ===
3.11.13 Linux-6.5.0-18-generic-x86_64-with-glibc2.35

=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)

=== cuBLASLt ===
cuBLASLt version: 0

=== TensorRT ===
TensorRT not installed

2to4_sparsity.zip

Thanks!
