
Starred repositories
Fast and memory-efficient exact attention
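For a sense of what this provides: PyTorch's built-in fused attention entry point can dispatch to a FlashAttention-style kernel on supported GPUs, avoiding materializing the full attention matrix. A minimal sketch, with shapes and the causal flag assumed for illustration (none of this is taken from the repo itself):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, heads, seq_len, head_dim); fp16 on CUDA
# is what lets the fused FlashAttention-style kernel be selected.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Computes softmax(QK^T / sqrt(d)) V in one fused call, without ever
# materializing the (seq_len x seq_len) attention matrix in memory.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```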
fastllm is a high-performance LLM inference library with no backend dependencies. It supports both tensor-parallel inference of dense models and mixed-mode inference of MoE models; any GPU with more than 10 GB of memory can run the full DeepSeek model. A dual-socket 9004/9005 server plus a single GPU can serve the original full-size, full-precision DeepSeek model at 20 tps with a single concurrent request; the INT4-quantized model reaches 30 tps at single concurrency and 60+ tps under multiple concurrent requests.
Transformer: PyTorch Implementation of "Attention Is All You Need"
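The formula at the core of that paper is softmax(QK^T / sqrt(d_k)) V; a minimal PyTorch sketch of it, with tensor shapes assumed for illustration:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, heads, seq_len, head_dim)
    d_k = q.size(-1)
    # Similarity of every query to every key, scaled by sqrt(d_k).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # rows sum to 1
    return weights @ v  # weighted sum of values
```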
coderonion / tilelang
Forked from tile-ai/tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
coderonion / ai-infra-hpc
Forked from jinbooooom/ai-infra-hpc
HPC tutorial covering collective communication (MPI, NCCL), CUDA programming, SIMD vectorization, RDMA communication, and more
Best practices & guides on how to write distributed PyTorch training code
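For reference, the skeleton such guides typically build on is DistributedDataParallel with one process per GPU, launched via torchrun; a minimal sketch with a placeholder model and synthetic data (not taken from the repo):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).cuda()  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 512, device="cuda")
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=NUM_GPUS script.py
```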
Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial
coderonion / lite_llama
Forked from harleyszhang/lite_llama
A lightweight llama-like LLM inference framework built on Triton kernels.
Examples from Programming in Parallel with CUDA
coderonion / LeetCUDA
Forked from xlite-dev/LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA etc.🔥
coderonion / Qwen3
Forked from QwenLM/Qwen3
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
Large-scale LLM inference engine
coderonion / SageAttention
Forked from thu-ml/SageAttention
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Quantized attention achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers, without losing end-to-end metrics across language, image, and video models.
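Stripped to its simplest form, the idea behind quantized attention is to quantize Q and K to INT8 with per-tensor scales, run the score matmul in integer arithmetic, and dequantize before the softmax. The sketch below shows only that generic pattern, not SageAttention's actual smoothing or per-block scheme:

```python
import torch

def int8_quantize(x):
    # Symmetric per-tensor INT8: scale so that max|x| maps to 127.
    scale = x.abs().amax().clamp(min=1e-8) / 127.0
    q = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return q, scale

q, k, v = (torch.randn(128, 64) for _ in range(3))  # toy shapes
q_i8, sq = int8_quantize(q)
k_i8, sk = int8_quantize(k)

# Integer score matmul, then dequantize with the product of the scales.
scores = (q_i8.to(torch.int32) @ k_i8.to(torch.int32).T).float() * (sq * sk)
attn = torch.softmax(scores / 64 ** 0.5, dim=-1) @ v
```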
Yan (炎) is a high-performance CUDA operator library designed for learning purposes while emphasizing clean code and maximum performance.
Search-R1: An Efficient, Scalable RL Training Framework for Reasoning & Search Engine Calling interleaved LLM based on veRL
Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities
Implementing DeepSeek R1's GRPO algorithm from scratch
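The core of GRPO is its critic-free advantage: sample a group of completions per prompt and standardize each completion's reward against its group's mean and standard deviation. A minimal sketch of that computation, with the group size and rewards invented for illustration:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: (num_prompts, group_size) — one scalar reward per sampled
    # completion. Standardizing within the group replaces a learned value
    # function (critic) as the baseline.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)  # positive for above-group-average completions
```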
coderonion / Video-R1
Forked from tulerfeng/Video-R1
Video-R1: Reinforcing Video Reasoning in MLLMs [🔥the first paper to explore R1 for video]
coderonion / MAYE
Forked from GAIR-NLP/MAYE
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme