Tiny-LLM Inference Engine

A lightweight, high-performance CUDA C++ inference engine for Transformer models.

English • 简体中文 • Documentation • API Reference

🎯 Overview

Tiny-LLM is a high-performance CUDA C++ inference engine designed for efficient Transformer model deployment on NVIDIA GPUs. It delivers production-ready inference with:

~50% memory reduction via W8A16 quantization (INT8 weights + FP16 activations)
Efficient KV Cache management for long-context generation
Optimized CUDA kernels with Tensor Core support
Zero dependencies — pure CUDA C++ with no external runtime

Built for developers who need fast, predictable, and portable LLM inference without the overhead of heavy frameworks.

✨ Features

Feature	Description	Status
W8A16 Quantization	INT8 weights + FP16 activations for ~50% memory reduction	✅ Stable
KV Cache Management	Efficient incremental decoding with sequence management	✅ Stable
Optimized CUDA Kernels	Tensor Core INT8, shared memory tiling, warp shuffle reductions	✅ Stable
Multi-Sampling Strategies	Greedy, Temperature, Top-k, Top-p (nucleus) decoding	✅ Stable
Result Error Handling	Monadic error handling — no exceptions for control flow	✅ Stable
Comprehensive Testing	GoogleTest unit tests + RapidCheck property-based tests	✅ Stable
Bilingual Documentation	Full documentation in English and Chinese	✅ Complete

Roadmap

Feature	Status	Target
GGUF Runtime Loading	🚧 In Progress	v2.1
PagedAttention	📋 Planned	v2.2
Speculative Decoding	🔬 Research	v2.3
FP8 Support	🔬 Research	v3.0
Multi-GPU Parallelism	📋 Planned	v3.0

⚡ Performance

Benchmarked on NVIDIA A100-80GB (SM 8.0), CUDA 12.4:

Model	Batch Size	Tokens/sec	Memory Usage	Speedup vs FP16
LLaMA-7B	1	~45 t/s	~3.5 GB	1.8x
LLaMA-7B	4	~120 t/s	~12 GB	1.6x
LLaMA-13B	1	~25 t/s	~6.5 GB	1.7x

Full benchmarks with different configurations: Benchmarks

🚀 Quick Start

Requirements

Component	Minimum	Recommended
NVIDIA GPU	SM 7.0 (Volta)	SM 8.0+ (Ampere+)
CUDA Toolkit	11.0	12.0+
CMake	3.18	3.25+
C++ Compiler	GCC 9+ / Clang 10+	GCC 11+

Build from Source

# Clone and build
git clone https://github.com/LessUp/tiny-llm.git
cd tiny-llm
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

# Run tests
ctest --output-on-failure

Usage Example

#include <tiny_llm/inference_engine.h>

int main() {
    // Configure model
    ModelConfig config;
    config.vocab_size = 32000;
    config.hidden_dim = 4096;
    config.num_layers = 32;

    // Load model with W8A16 weights
    auto engine = InferenceEngine::load("model.bin", config).value();

    // Generate text
    GenerationConfig gen;
    gen.max_new_tokens = 256;
    gen.temperature = 0.7f;
    gen.top_p = 0.9f;

    auto output = engine.generate({1, 15043, 29892}, gen);
    for (int token : output.tokens) {
        std::cout << tokenizer.decode(token);
    }
}

🏗️ Architecture

graph LR
    A[Application] --> B[InferenceEngine]
    B --> C[ModelLoader]
    B --> D[KVCacheManager]
    B --> E[TransformerLayer]
    C --> F[(W8A16 Weights)]
    D --> G[GPU Memory]
    E --> H[W8A16 MatMul]
    E --> I[Attention]
    E --> J[RMSNorm]
    E --> K[Elementwise]
    H --> L[Tensor Core]
    I --> L

Core Components

Component	Responsibility
InferenceEngine	Main entry point — orchestrates loading, generation, and sampling
ModelLoader	Loads quantized weights from binary format to GPU memory
KVCacheManager	Manages key-value cache with sequence allocation and eviction
TransformerLayer	Forward pass through a single transformer layer
W8A16 MatMul	Quantized matrix multiplication using INT8 × FP16

📁 Project Structure

tiny-llm/
├── website/               # GitHub Pages site (Jekyll)
│   ├── docs/              # Documentation (EN/ZH)
│   ├── changelog/         # Version history
│   ├── assets/            # CSS, JS, images
│   └── _config.yml        # Jekyll configuration
├── specs/                 # Spec-driven development documents
│   ├── product/           # Feature requirements
│   ├── rfc/               # Architecture decisions
│   ├── api/               # API definitions
│   └── testing/           # BDD test specifications
├── include/tiny_llm/      # Public headers
├── kernels/               # CUDA kernels (.cu, .cuh)
├── src/                   # Host-side implementation (.cpp)
├── tests/                 # Unit & property-based tests
├── CMakeLists.txt         # Build configuration
└── README.md              # This file

🔌 GPU Support

Architecture	Compute Capability	Status	Notes
Volta	SM 7.0, 7.5	✅ Supported	Tensor Core INT8 available
Turing	SM 7.5	✅ Supported	Improved INT8 performance
Ampere	SM 8.0, 8.6	✅ Optimized	Best performance
Ada Lovelace	SM 8.9	✅ Optimized	Full feature support
Hopper	SM 9.0	✅ Supported	FP8 ready

🤝 Contributing

We welcome contributions! Tiny-LLM follows Spec-Driven Development — specs come before code.

Quick Start

# Fork and clone
git clone https://github.com/your-username/tiny-llm.git
cd tiny-llm

# Build with tests
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=ON
make -j$(nproc)

# Run checks
ctest --output-on-failure

Ways to Contribute

Type	How
🐛 Bug Reports	Open an Issue with reproduction steps
💡 Feature Requests	Start a Discussion
🔧 Code Contributions	Fork → Branch → PR (follow CONTRIBUTING.md)
📝 Documentation	Improve docs in `website/docs/`
🧪 Testing	Add unit tests or property-based tests

📜 License

This project is licensed under the MIT License.

🙏 Acknowledgments

Inspired by llama.cpp and vLLM
Testing with GoogleTest and RapidCheck
Built with ❤️ by the Tiny-LLM contributors

📖 Documentation • 📦 Releases • 🐛 Issues • 💬 Discussions

Made with ❤️ by the Tiny-LLM team

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
.qwen		.qwen
include/tiny_llm		include/tiny_llm
kernels		kernels
specs		specs
src		src
tests		tests
website		website
.clang-format		.clang-format
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
AGENTS.md		AGENTS.md
AUTHORS.md		AUTHORS.md
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
QWEN.md		QWEN.md
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SECURITY.md		SECURITY.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tiny-LLM Inference Engine

🎯 Overview

✨ Features

Roadmap

⚡ Performance

🚀 Quick Start

Requirements

Build from Source

Usage Example

🏗️ Architecture

Core Components

📁 Project Structure

🔌 GPU Support

🤝 Contributing

Quick Start

Ways to Contribute

📜 License

🙏 Acknowledgments

About

Uh oh!

Releases 2

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tiny-LLM Inference Engine

🎯 Overview

✨ Features

Roadmap

⚡ Performance

🚀 Quick Start

Requirements

Build from Source

Usage Example

🏗️ Architecture

Core Components

📁 Project Structure

🔌 GPU Support

🤝 Contributing

Quick Start

Ways to Contribute

📜 License

🙏 Acknowledgments

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages