feat(recipes): add KV cache cast variants (fp8_cast / nvfp4_cast)#1334

Merged
cjluo-nv merged 3 commits into main from chenjiel/fix_fp8_cast
Apr 23, 2026
Conversation

Collaborator

@cjluo-nv cjluo-nv commented Apr 23, 2026

Summary

  • Adds three built-in PTQ recipes that express the KV-cache cast variants directly in YAML, using the existing use_constant_amax: true quantizer field. These are recipe equivalents of --kv_cache_qformat=fp8_cast / nvfp4_cast:
    • general/ptq/fp8_default-fp8_cast_kv
    • general/ptq/nvfp4_default-fp8_cast_kv
    • general/ptq/nvfp4_default-nvfp4_cast_kv
  • Makes --recipe authoritative in examples/llm_ptq/hf_ptq.py: the post-hoc _set_kv_cache_constant_amax override now only runs when --recipe is None, so a recipe YAML fully determines KV-cache config instead of being silently overridden by the default --kv_cache_qformat=fp8_cast. Updated help text on both flags.
  • Extends the recipe loader smoke test to cover the three new recipes.
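The cast behavior comes down to a single field on the KV quantizer entry. A minimal sketch of what such a recipe's KV section might look like (only quantizer_name and use_constant_amax are confirmed by this PR; the surrounding layout is an assumption, not copied from the actual YAML files):

```yaml
# Hypothetical excerpt of a cast-KV recipe; the real files live under
# modelopt_recipes/general/ptq/.
quant_cfg:
  - quantizer_name: '*[kv]_bmm_quantizer'
    enable: true
    cfg:
      num_bits: e4m3            # FP8 E4M3 KV cache
      use_constant_amax: true   # cast variant: skip KV calibration; amax is
                                # pinned to the FP8 E4M3 max (448.0)
```

Without use_constant_amax, the same entry falls back to data-driven amax calibration, which is exactly the non-cast variant.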

Motivation

Before this change, the cast variants lived only in argparse (_KV_CAST_FORMATS = {"fp8_cast", "nvfp4_cast"}) and were layered on top of any recipe-loaded config. That meant --recipe nvfp4_default-fp8_kv would silently become a cast recipe due to the --kv_cache_qformat default. Now the recipe is self-contained: its YAML either sets use_constant_amax: true on the *[kv]_bmm_quantizer entry (cast) or doesn't (data-driven calibration).
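The gating change described above can be sketched in a few lines. The function and argument names below mirror the PR description but are stand-ins, not the actual hf_ptq.py source:

```python
from argparse import Namespace

def _set_kv_cache_constant_amax(quant_cfg: dict) -> None:
    """Stand-in for the post-hoc override in hf_ptq.py (assumed shape)."""
    quant_cfg.setdefault("*[kv]_bmm_quantizer", {})["use_constant_amax"] = True

def apply_kv_override(args: Namespace, quant_cfg: dict) -> dict:
    # After this PR: the override only runs when no recipe was given,
    # so a recipe YAML fully determines the KV-cache config.
    if args.recipe is None and args.kv_cache_qformat in {"fp8_cast", "nvfp4_cast"}:
        _set_kv_cache_constant_amax(quant_cfg)
    return quant_cfg

# Recipe provided: the default --kv_cache_qformat=fp8_cast no longer leaks in.
cfg = apply_kv_override(
    Namespace(recipe="general/ptq/nvfp4_default-fp8_kv", kv_cache_qformat="fp8_cast"),
    {},
)
print(cfg)  # {}
```

Before the change, the override ran unconditionally, which is how a non-cast recipe silently became a cast one.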

Test plan

  • pytest tests/unit/recipe/test_loader.py — all 24 tests pass, including the three new parametrized recipes.
  • Verified each new recipe round-trips through load_recipe() with use_constant_amax: True surviving Pydantic validation on the KV entry.
  • End-to-end run on /models/Qwen/Qwen3-8B (RTX 6000 Ada, 4 samples, seq_len=128) for all three new recipes:
    • After mtq.quantize(model, recipe.quantize.model_dump(), forward_loop=...), all 72 k_bmm_quantizer / v_bmm_quantizer modules have _use_constant_amax=True and _get_amax() returns 448.0 (FP8 E4M3 max).
    • Weight quantizers still calibrate from data normally (sample amax values: q_proj=0.5508, k_proj=0.6250, v_proj=0.1689, o_proj=0.7266).
  • Verified the --recipe authoritative behavior change:
    • Non-cast recipe + default --kv_cache_qformat=fp8_cast → KV entry does NOT get use_constant_amax (no silent override).
    • Cast recipe + contradictory --kv_cache_qformat=fp8 → KV entry keeps use_constant_amax=True (recipe wins).
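The amax behavior those end-to-end checks assert can be illustrated with a small stand-alone stub. StubQuantizer is invented for illustration; the real class is modelopt's TensorQuantizer:

```python
FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

class StubQuantizer:
    """Toy stand-in for how a quantizer resolves its amax."""

    def __init__(self, use_constant_amax, calibrated_amax=None):
        self._use_constant_amax = use_constant_amax
        self._calibrated_amax = calibrated_amax

    def _get_amax(self):
        # Cast variants skip calibration: amax is pinned to the FP8 E4M3 max.
        if self._use_constant_amax:
            return FP8_E4M3_MAX
        # Non-cast path: return whatever data-driven calibration observed.
        return self._calibrated_amax

kv = StubQuantizer(use_constant_amax=True)                       # KV cache quantizer
weight = StubQuantizer(use_constant_amax=False,
                       calibrated_amax=0.5508)                   # e.g. q_proj
print(kv._get_amax(), weight._get_amax())  # 448.0 0.5508
```

This mirrors the observed split: KV quantizers report the constant 448.0 while weight quantizers keep their calibrated values.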

Summary by CodeRabbit

  • Bug Fixes

    • CLI now preserves KV cache quantization settings specified in recipe YAML (no forced constant amax override).
  • New Features

    • Added three PTQ recipes for FP8 and NVFP4 with cast-KV variants and optimized KV cache handling.
  • Documentation

    • Expanded CLI help for --recipe and KV cache options with cast-recipe examples and behavior notes.
  • Tests

    • Added unit test coverage to load and validate the new PTQ recipes.

Introduce three built-in PTQ recipes that express the cast variants of
KV cache quantization directly in YAML by setting
use_constant_amax: true on the *[kv]_bmm_quantizer entry:

  - general/ptq/fp8_default-fp8_cast_kv
  - general/ptq/nvfp4_default-fp8_cast_kv
  - general/ptq/nvfp4_default-nvfp4_cast_kv

Previously, cast semantics could only be activated via
--kv_cache_qformat={fp8_cast,nvfp4_cast} in hf_ptq.py, which was
layered on top of any --recipe-supplied config and silently overrode
it. With these recipes, the YAML is self-contained and authoritative:
hf_ptq.py now skips the post-hoc _set_kv_cache_constant_amax override
when --recipe is provided, and --kv_cache_qformat is documented as
ignored in that case.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners April 23, 2026 16:31
@cjluo-nv cjluo-nv requested review from Edwardf0t1 and h-guo18 April 23, 2026 16:31
Contributor

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5f4ef57-918a-402e-9b4e-516819db8599

📥 Commits

Reviewing files that changed from the base of the PR and between 1c2a68f and 177ec6e.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml

📝 Walkthrough

Walkthrough

Adds three PTQ recipe YAMLs (FP8/NVFP4 cast-KV variants), updates the example quantization script to preserve recipe-provided KV cache settings instead of forcing constant amax, and extends the PTQ recipe loader test to include the new recipes.

Changes

  • New PTQ Recipes: modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml, modelopt_recipes/general/ptq/nvfp4_default-fp8_cast_kv.yaml, modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
    Adds three PTQ recipe files that enable KV-cache quantization (cast variants), set max-based algorithms, configure FP8/KV settings with use_constant_amax: true, enable/disable targeted quantizers, and specify weight/input quantizer block/bit settings.
  • CLI Control Flow & Help Text: examples/llm_ptq/hf_ptq.py
    Changes quantize_main to only enforce the constant-amax KV override when no recipe is provided (args.recipe is None), preserving YAML use_constant_amax when --recipe is used. Expands --recipe and --kv_cache_qformat argparse help to state that recipe-provided KV config takes precedence and --kv_cache_qformat is ignored under --recipe.
  • Test Parametrization: tests/unit/recipe/test_loader.py
    Adds the three new recipe paths to the built-in PTQ recipe list so the loader smoke test loads and validates them as PTQ recipes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main change: adding three new KV cache cast variant recipes (fp8_cast_kv and nvfp4_cast_kv) to the recipe library.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. Security scan for insecure patterns (torch.load, numpy.load with allow_pickle, trust_remote_code, eval/exec, nosec) in changed Python files.


Contributor

@coderabbitai coderabbitai Bot left a comment
🧹 Nitpick comments (2)
modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml (1)

26-33: Consider adding explicit enable: true for consistency.

The *input_quantizer and *weight_quantizer entries omit the explicit enable: true field, relying on the default. However, the NVFP4 cast recipes (e.g., nvfp4_default-nvfp4_cast_kv.yaml) explicitly set enable: true for these quantizers. For consistency across the recipe family, consider adding enable: true here as well.

🔧 Proposed fix for consistency
     - quantizer_name: '*input_quantizer'
+      enable: true
       cfg:
         num_bits: e4m3
         axis:
     - quantizer_name: '*weight_quantizer'
+      enable: true
       cfg:
         num_bits: e4m3
         axis:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml` around lines 26-33:
add an explicit enable: true to the quantizer entries *input_quantizer and
*weight_quantizer in the fp8_default-fp8_cast_kv recipe so they match the NVFP4
cast recipes' explicit enable behavior; locate the YAML blocks that define
quantizer_name: '*input_quantizer' and quantizer_name: '*weight_quantizer' and
insert enable: true on each entry to ensure consistency across the recipe family.
tests/unit/recipe/test_loader.py (1)

107-115: Consider adding a test to verify use_constant_amax is preserved in cast recipes.

The smoke test validates that recipes load successfully, but for the new *_cast_kv recipes, it would be valuable to verify that use_constant_amax: true is correctly preserved after loading. This ensures the core feature of these recipes—skipping KV calibration—works as intended.

🧪 Proposed test for use_constant_amax preservation
_BUILTIN_CAST_KV_RECIPES = [
    "general/ptq/fp8_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-nvfp4_cast_kv",
]


@pytest.mark.parametrize("recipe_path", _BUILTIN_CAST_KV_RECIPES)
def test_cast_kv_recipes_preserve_constant_amax(recipe_path):
    """Cast KV recipes must preserve use_constant_amax: True on KV quantizers."""
    recipe = load_recipe(recipe_path)
    kv_entries = [
        e for e in recipe.quantize.quant_cfg
        if e.get("quantizer_name") == "*[kv]_bmm_quantizer"
    ]
    assert len(kv_entries) == 1, "Expected exactly one KV BMM quantizer entry"
    cfg = kv_entries[0].get("cfg", {})
    assert cfg.get("use_constant_amax") is True, "use_constant_amax must be True for cast recipes"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/unit/recipe/test_loader.py` around lines 107-115: add a parametrized
unit test that loads each cast-KV recipe (use the same recipe names as in
_BUILTIN_CAST_KV_RECIPES), calls load_recipe(recipe_path), filters
recipe.quantize.quant_cfg for entries where get("quantizer_name") ==
"*[kv]_bmm_quantizer", asserts exactly one such entry exists, and then asserts
that that entry's cfg.get("use_constant_amax") is True; reference load_recipe,
recipe.quantize.quant_cfg, and the quantizer_name "*[kv]_bmm_quantizer" to locate
where to inspect the loaded recipe.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 235fa5cc-88a2-41ee-b58d-bab0cdf61dcc

📥 Commits

Reviewing files that changed from the base of the PR and between 8663678 and 78e9c61.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py
  • modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_default-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
  • tests/unit/recipe/test_loader.py


codecov Bot commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.63%. Comparing base (c796611) to head (177ec6e).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   74.60%   75.63%   +1.02%     
==========================================
  Files         467      468       +1     
  Lines       50176    51145     +969     
==========================================
+ Hits        37435    38684    +1249     
+ Misses      12741    12461     -280     
Flag coverage:
  • examples: 41.46% (+5.86% ⬆️)
  • unit: 52.55% (+0.20% ⬆️)


Contributor

@meenchen meenchen left a comment

The PR is well-motivated and the logic is correct. The YAML recipes are consistent with existing ones, and the hf_ptq.py change correctly gates the post-hoc override. One minor issue: copyright years in new files should be 2025, not 2024. Also noting that these three new YAML files share ~90% of their content with the existing non-cast variants — the upcoming $import system (PR #1253) would be a great fit to deduplicate, but that's not blocking.

@@ -0,0 +1,70 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

Minor: Copyright year should be 2025 for new files, not 2024 (same applies to the other two new YAML files).

recipe_type: ptq
description: >-
  NVFP4 W4A4, NVFP4 KV cache with constant amax (skips KV calibration; amax
  hardcoded to FP8 E4M3 max 448.0), max calibration.
Contributor

Nit: The description says "amax hardcoded to FP8 E4M3 max 448.0" but this recipe uses NVFP4 (e2m1) for KV cache. The constant amax value for e2m1 would be different from FP8 E4M3's 448.0. Is the description accurate, or should it say something like "amax hardcoded to the format's max representable value"?

Collaborator Author

Good catch on reading it literally — but 448.0 is intentional here for NVFP4 KV too, not a typo.

use_constant_amax always returns torch.finfo(torch.float8_e4m3fn).max at runtime, regardless of the quantizer's num_bits:

https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/nn/modules/tensor_quantizer.py#L644-L647

This is deliberate — at deployment, the attention kernel upcasts the NVFP4 KV cache values to FP8 before the attention math, so the scale has to land in the FP8 E4M3 range (448.0). The existing docstring on QuantizerAttributeConfig.use_constant_amax calls this out explicitly:

This is used for KV cache quantization where the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization, so the amax is hardcoded to the FP8 range.

So the YAML description is accurate — "amax hardcoded to FP8 E4M3 max 448.0" correctly reflects what happens even for the NVFP4 KV recipe. I'll add a short note in the YAML description pointing at this rationale so the next reader isn't surprised.
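That rationale can be made concrete with a toy scale computation. The 448.0 and 6.0 constants are real format maxima for FP8 E4M3 and FP4 E2M1; the per-tensor scale formula is a generic illustration, not modelopt's kernel code:

```python
FP8_E4M3_MAX = 448.0  # max representable value in FP8 E4M3
E2M1_MAX = 6.0        # max representable value in FP4 E2M1 (the NVFP4 element format)

def dequant_scale(amax, format_max):
    # Generic per-tensor scale: map |x| <= amax onto the target format's range.
    return amax / format_max

# The attention kernel upcasts NVFP4 KV values to FP8 before the attention
# math, so the amax must cover the FP8 range; pinning it to the E2M1 max
# instead would produce a very different (and wrong) scale.
print(dequant_scale(FP8_E4M3_MAX, FP8_E4M3_MAX))  # 1.0
print(dequant_scale(E2M1_MAX, FP8_E4M3_MAX))      # ~0.0134
```

This is why 448.0 is the right constant even for the NVFP4 KV recipe: the scale must land in the FP8 E4M3 range that the deployed attention math actually uses.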

Collaborator

@shengliangxu shengliangxu left a comment

LGTM

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) April 23, 2026 17:56
@cjluo-nv cjluo-nv merged commit fda0899 into main Apr 23, 2026
45 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/fix_fp8_cast branch April 23, 2026 18:45
Contributor

@github-actions

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-23 18:45 UTC
