
Why is my 2:4 sparse model slower than dense in the decode stage of LLaMA2‑7B? #3505

@wang-qitong

Description


Hi

[Figure: decode-phase timing of the dense vs. 2:4 sparse LLaMA2‑7B model]

As shown in the figure, the 2:4 sparse model is about 12% slower than the dense model during the decode phase. My questions are:

  • Is the decode phase dominated by GEMV / small‑N GEMM operations, which therefore cannot trigger the 2:4 sparse Tensor Core path?

  • Even so, why is the 2:4 sparse model actually slower than the dense model, rather than merely matching it?

  • If we increase N > 1 (e.g., batch multiple requests or generate multiple tokens at once so the operation becomes a GEMM), can we observe a measurable 2:4 sparsity speed‑up? (A benchmark sketch follows this list.)

  • Are there any sparse kernels or recommended practices for GEMV (matrix‑vector) that can take advantage of 2:4 sparsity?
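
For the N > 1 question, here is a minimal benchmark sketch (not the script in the attached zip; the 4096×4096 layer size, fp16 dtype, and swept N values are illustrative assumptions) that sweeps the token count N over a single projection layer using the prototype `torch.sparse.to_sparse_semi_structured` API available in torch 2.2:

```python
import torch
import torch.nn.functional as F
from torch.sparse import to_sparse_semi_structured

device, dtype = "cuda", torch.float16

# Assumed layer size: a square 4096x4096 projection, matching LLaMA2-7B's hidden size.
M, K = 4096, 4096

# Build a weight that already satisfies the 2:4 pattern (2 of every 4 elements along K kept).
W = torch.randn(M, K, device=device, dtype=dtype)
mask = torch.tensor([1, 1, 0, 0], device=device, dtype=torch.bool).repeat(M, K // 4)
W_dense = W * mask                            # dense tensor with 2:4 structure
W_sparse = to_sparse_semi_structured(W_dense) # compressed 2:4 representation

def bench_ms(fn, iters=100):
    """Average runtime in ms, using CUDA events and a warm-up to exclude setup cost."""
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# N = 1 approximates the decode-phase GEMV; larger N approximates batched / prefill GEMMs.
for N in (1, 8, 32, 128, 512):
    x = torch.randn(N, K, device=device, dtype=dtype)
    t_dense = bench_ms(lambda: F.linear(x, W_dense))
    try:
        t_sparse = bench_ms(lambda: F.linear(x, W_sparse))
        print(f"N={N:4d}  dense {t_dense:.3f} ms   2:4 sparse {t_sparse:.3f} ms   "
              f"speed-up x{t_dense / t_sparse:.2f}")
    except RuntimeError as err:
        # Very small N may be rejected or require padding, depending on the sparse backend.
        print(f"N={N:4d}  dense {t_dense:.3f} ms   2:4 sparse: failed ({err})")

# To see which CUDA kernel is actually dispatched at a given N, wrap one call in
# torch.profiler, e.g.:
#   from torch.profiler import profile, ProfilerActivity
#   with profile(activities=[ProfilerActivity.CUDA]) as prof:
#       F.linear(x, W_sparse)
#   print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

CUDA-event timing with a warm-up pass is used so that one-off costs (weight compression, kernel selection) are not counted against either variant.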

Environment

NVIDIA GeForce RTX 4090, compute capability 8.9, P2

=== Python / OS ===
3.11.13 Linux-6.5.0-18-generic-x86_64-with-glibc2.35

=== PyTorch / CUDA / cuDNN ===
torch: 2.2.2+cu121
cuda: 12.1
cudnn: 8902
device: NVIDIA GeForce RTX 4090
sm capability: (8, 9)

=== cuBLASLt ===
cuBLASLt version: 0

=== TensorRT ===
TensorRT not installed

2to4_sparsity.zip

Thanks!
