# tensor-core

Here are 17 public repositories matching this topic...

🎓 CUDA HPC Kernel Optimization Lab: Progressive GEMM, FlashAttention, Tensor Core & CUDA 13 Features | A CUDA high-performance kernel optimization lab, from naive implementations to Tensor Cores

  • Updated Apr 22, 2026
  • Cuda

⚡ LLM-Speed: High-performance CUDA kernels for LLM inference: FlashAttention with O(N) memory, Tensor Core GEMM (95% of cuBLAS performance), and seamless PyTorch integration. Supports Volta to Hopper GPUs.

  • Updated Apr 22, 2026
  • Python

A reproducible GPU benchmarking lab that compares FP16 vs FP32 training on MNIST using PyTorch, CuPy, and Nsight profiling tools. This project blends performance engineering with cinematic storytelling—featuring NVTX-tagged training loops, fused CuPy kernels, and a profiler-driven README that narrates the GPU’s inner workings frame by frame.

  • Updated Sep 5, 2025
  • Python

CUDA Kernel Academy: A systematic learning path for high-performance CUDA kernel development. From SGEMM basics to Tensor Core mastery with 4 progressive modules. | High-performance CUDA kernel development: from SGEMM basics to Tensor Core mastery, a 4-module progressive learning path

  • Updated Apr 22, 2026
  • C++
