CUDA Kernel Academy

From Zero to Hero: Systematic CUDA High-Performance Kernel Development




A structured CUDA learning repository covering matrix multiplication, reusable kernels, advanced optimization techniques, and a lightweight inference engine. Master GPU programming from SGEMM basics to Tensor Core optimization.

Features

| Feature | Description |
|---|---|
| Progressive Learning Path | 4 interconnected modules from basics to production |
| Performance-Focused | Real benchmarks against cuBLAS, not toy examples |
| Modern C++ | Leverages C++17/20 features for clean, safe GPU code |
| Production Patterns | Header-only library design, memory pools, stream management |
| Multi-Architecture | Supports Volta (sm_70) through Hopper (sm_90) |

Documentation

Sub-projects

| # | Project | Focus | Build |
|---|---|---|---|
| 01 | SGEMM Tutorial | Progressive SGEMM optimization | Standalone Makefile |
| 02 | TensorCraft Core | Header-only kernel library | CMake |
| 03 | HPC Advanced | Advanced CUDA/HPC techniques | CMake |
| 04 | Inference Engine | Lightweight DL inference engine | CMake |

Learning path

```
01-SGEMM Tutorial (1-2 weeks)
        ↓  Master shared memory, bank conflicts, WMMA
02-TensorCraft Core (2-3 weeks)
        ↓  Build reusable kernels, API design
03-HPC Advanced (3-4 weeks)
        ↓  CUDA 13 features, FlashAttention
04-Inference Engine (2-3 weeks)
        ↓  Complete inference framework
```

Prerequisites: C/C++ basics, linear algebra fundamentals. CUDA experience helpful but not required.
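The core idea of the first module, staging tiles of the input matrices in shared memory so each loaded value is reused many times, can be sketched as follows. This is an illustrative kernel, not code from the repository, and it assumes row-major matrices with M, N, K all multiples of the tile size:

```cuda
#include <cuda_runtime.h>

#define TILE 32

// Illustrative shared-memory tiled SGEMM: C = A * B, row-major.
// Assumes M, N, K are multiples of TILE; not code from this repo.
__global__ void sgemm_tiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    // Each block computes one TILE x TILE tile of C; each thread one element.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Stage one tile of A and one of B into shared memory.
        // Consecutive threadIdx.x values read consecutive addresses,
        // so both global loads are coalesced.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the staged tiles; each staged value is reused TILE times.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

A launch would use `dim3 grid(N / TILE, M / TILE), block(TILE, TILE)`. Padding the shared arrays to `[TILE][TILE + 1]` is the classic fix for shared-memory bank conflicts, one of the topics the tutorial covers.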

Quick start

```bash
git clone https://github.com/LessUp/cuda-kernel-academy.git
cd cuda-kernel-academy

cmake --preset default
cmake --build --preset default
ctest --preset default
```

List available presets with:

```bash
cmake --list-presets
```

Build notes

- The root CMake build covers 02-tensorcraft-core, 03-hpc-advanced, 04-inference-engine, common, and examples.
- 01-sgemm-tutorial is intentionally standalone and uses its own Makefile.
- GitHub Actions currently runs CPU-safe checks (formatting, docs, links, preset validation). Full CUDA builds/tests should be run on a local machine with a GPU.

Build options

| Option | Default | Description |
|---|---|---|
| BUILD_TENSORCRAFT | ON | Build TensorCraft Core |
| BUILD_HPC_ADVANCED | ON | Build HPC Advanced |
| BUILD_INFERENCE_ENGINE | ON | Build Inference Engine |
| BUILD_EXAMPLES | ON | Build examples |
| BUILD_TESTS | ON | Build tests |
| BUILD_BENCHMARKS | ON | Build benchmarks |
| BUILD_PYTHON_BINDINGS | OFF | Build optional Python bindings |

Requirements

| Component | Minimum | Recommended |
|---|---|---|
| CUDA Toolkit | 11.0 | 12.x |
| CMake | 3.20 | 3.24+ |
| Compiler | GCC 9 / Clang 10 | GCC 11+ |
| GPU | Volta (sm_70) | Ampere/Ada (sm_80+) |
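A quick way to check that a local GPU meets the minimum in the table is to query its compute capability at runtime. A minimal sketch (standard CUDA runtime API, not code from this repository):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print device 0's compute capability and compare it against the
// sm_70 (Volta) minimum listed above.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    int sm = prop.major * 10 + prop.minor;
    std::printf("%s: sm_%d (%s)\n", prop.name, sm,
                sm >= 70 ? "supported" : "below minimum");
    return 0;
}
```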

Supported Architectures:

| Arch | sm | GPUs |
|---|---|---|
| Volta | 70 | V100 |
| Turing | 75 | RTX 2080, T4 |
| Ampere | 80, 86 | A100, RTX 3090 |
| Ada | 89 | RTX 4090, L40 |
| Hopper | 90 | H100 |
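On sm_70 and newer (every architecture in the table above), the WMMA API is the portable entry point to Tensor Cores. A minimal, illustrative fragment, assuming half-precision inputs with float accumulation, M/N/K multiples of 16, row-major A and column-major B; it is not code from this repository:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp per 16x16 output tile of C (float) = A (half) * B (half).
// Launch with grid = dim3(N / 16, M / 16), block = dim3(32).
__global__ void wmma_tile_gemm(const half* A, const half* B, float* C,
                               int M, int N, int K) {
    int tileM = blockIdx.y;   // row index of this warp's 16x16 tile
    int tileN = blockIdx.x;   // column index of this warp's 16x16 tile

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a, A + tileM * 16 * K + k, K);  // A row-major
        wmma::load_matrix_sync(b, B + tileN * 16 * K + k, K);  // B col-major
        wmma::mma_sync(acc, a, b, acc);   // one 16x16x16 Tensor Core MMA
    }
    wmma::store_matrix_sync(C + tileM * 16 * N + tileN * 16, acc, N,
                            wmma::mem_row_major);
}
```

Compiling requires targeting at least `-arch=sm_70`; the production kernels in the later modules layer shared-memory staging and multi-tile warps on top of this basic pattern.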

References

Citation

If you find this project helpful in your research or work:

```bibtex
@misc{cuda-kernel-academy,
  author = {CUDA Kernel Academy Contributors},
  title = {CUDA Kernel Academy: A Comprehensive Learning Path for High-Performance CUDA Kernel Development},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/LessUp/cuda-kernel-academy}
}
```

License

MIT License
