From Zero to Hero: Systematic CUDA High-Performance Kernel Development
A structured CUDA learning repository covering matrix multiplication, reusable kernels, advanced optimization techniques, and a lightweight inference engine. Master GPU programming from SGEMM basics to Tensor Core optimization.
| Feature | Description |
|---|---|
| Progressive Learning Path | 4 interconnected modules from basics to production |
| Performance-Focused | Real benchmarks against cuBLAS, not toy examples |
| Modern C++ | Leverages C++17/20 features for clean, safe GPU code |
| Production Patterns | Header-only library design, memory pools, stream management |
| Multi-Architecture | Supports Volta (sm_70) through Hopper (sm_90) |
| # | Project | Focus | Build |
|---|---|---|---|
| 01 | SGEMM Tutorial | Progressive SGEMM optimization | Standalone Makefile |
| 02 | TensorCraft Core | Header-only kernel library | CMake |
| 03 | HPC Advanced | Advanced CUDA/HPC techniques | CMake |
| 04 | Inference Engine | Lightweight DL inference engine | CMake |
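Project 01 is intentionally standalone and does not go through the root CMake build. A minimal sketch of building it (this assumes its `Makefile` has a usable default target — check the tutorial's own README for the exact targets):

```shell
# Project 01 is standalone; it uses its own Makefile, not the root CMake build.
cd 01-sgemm-tutorial
make    # assumption: the default target builds the tutorial kernels
```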
```
01-SGEMM Tutorial (1-2 weeks)
    ↓ Master shared memory, bank conflicts, WMMA
02-TensorCraft Core (2-3 weeks)
    ↓ Build reusable kernels, API design
03-HPC Advanced (3-4 weeks)
    ↓ CUDA 13 features, FlashAttention
04-Inference Engine (2-3 weeks)
    ↓ Complete inference framework
```
Prerequisites: C/C++ basics, linear algebra fundamentals. CUDA experience helpful but not required.
```bash
git clone https://github.com/LessUp/cuda-kernel-academy.git
cd cuda-kernel-academy
cmake --preset default
cmake --build --preset default
ctest --preset default
```

List available presets with:

```bash
cmake --list-presets
```

- The root CMake build covers `02-tensorcraft-core`, `03-hpc-advanced`, `04-inference-engine`, `common`, and `examples`. `01-sgemm-tutorial` is intentionally standalone and uses its own `Makefile`.
- GitHub Actions currently runs CPU-safe checks (formatting, docs, links, and preset validation). Full CUDA builds and tests should be run on a local machine with a GPU.
| Option | Default | Description |
|---|---|---|
| `BUILD_TENSORCRAFT` | ON | Build TensorCraft Core |
| `BUILD_HPC_ADVANCED` | ON | Build HPC Advanced |
| `BUILD_INFERENCE_ENGINE` | ON | Build Inference Engine |
| `BUILD_EXAMPLES` | ON | Build examples |
| `BUILD_TESTS` | ON | Build tests |
| `BUILD_BENCHMARKS` | ON | Build benchmarks |
| `BUILD_PYTHON_BINDINGS` | OFF | Build optional Python bindings |
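Options are passed as `-D` flags at configure time. For example, to skip the benchmarks and enable the optional Python bindings (a sketch using the `default` preset from the quick start):

```shell
# Reconfigure with benchmarks off and Python bindings on
cmake --preset default \
  -DBUILD_BENCHMARKS=OFF \
  -DBUILD_PYTHON_BINDINGS=ON
cmake --build --preset default
```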
| Component | Minimum | Recommended |
|---|---|---|
| CUDA Toolkit | 11.0 | 12.x |
| CMake | 3.20 | 3.24+ |
| Compiler | GCC 9 / Clang 10 | GCC 11+ |
| GPU | Volta (sm_70) | Ampere/Ada (sm_80+) |
Supported Architectures:
| Arch | sm | GPUs |
|---|---|---|
| Volta | 70 | V100 |
| Turing | 75 | RTX 2080, T4 |
| Ampere | 80, 86 | A100, RTX 3090 |
| Ada | 89 | RTX 4090, L40 |
| Hopper | 90 | H100 |
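To compile only for a subset of these architectures (which shortens build times), the standard CMake variable `CMAKE_CUDA_ARCHITECTURES` can be set at configure time — a sketch, assuming the project does not hard-code its own architecture list:

```shell
# Build device code for Ampere (sm_80) and Hopper (sm_90) only
cmake --preset default -DCMAKE_CUDA_ARCHITECTURES="80;90"
cmake --build --preset default
```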
- CUDA C++ Programming Guide
- CUTLASS - CUDA Templates for Linear Algebra
- Simon Boehm's GEMM Tutorial - Excellent optimization walkthrough
- NVIDIA Developer Blog - Latest techniques and best practices
If you find this project helpful in your research or work:
```bibtex
@misc{cuda-kernel-academy,
  author    = {CUDA Kernel Academy Contributors},
  title     = {CUDA Kernel Academy: A Comprehensive Learning Path for High-Performance CUDA Kernel Development},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/LessUp/cuda-kernel-academy}
}
```

MIT License