feat(recipes): add KV cache cast variants (fp8_cast / nvfp4_cast)#1334

Merged
cjluo-nv merged 3 commits into main from chenjiel/fix_fp8_cast
Apr 23, 2026
Conversation

Collaborator

@cjluo-nv cjluo-nv commented Apr 23, 2026

Summary

  • Adds three built-in PTQ recipes that express the KV-cache cast variants directly in YAML, using the existing use_constant_amax: true quantizer field. These are recipe equivalents of --kv_cache_qformat=fp8_cast / nvfp4_cast:
    • general/ptq/fp8_default-fp8_cast_kv
    • general/ptq/nvfp4_default-fp8_cast_kv
    • general/ptq/nvfp4_default-nvfp4_cast_kv
  • Makes --recipe authoritative in examples/llm_ptq/hf_ptq.py: the post-hoc _set_kv_cache_constant_amax override now only runs when --recipe is None, so a recipe YAML fully determines KV-cache config instead of being silently overridden by the default --kv_cache_qformat=fp8_cast. Updated help text on both flags.
  • Extends the recipe loader smoke test to cover the three new recipes.
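The cast behavior comes down to a single field on the KV quantizer entry. A minimal sketch of what such a recipe's KV section might look like (only quantizer_name and use_constant_amax are confirmed by this PR; the surrounding layout is an assumption, not copied from the actual YAML files):

```yaml
# Hypothetical excerpt of a cast-KV recipe; the real files live under
# modelopt_recipes/general/ptq/.
quant_cfg:
  - quantizer_name: '*[kv]_bmm_quantizer'
    enable: true
    cfg:
      num_bits: e4m3            # FP8 E4M3 KV cache
      use_constant_amax: true   # cast variant: skip KV calibration; amax is
                                # pinned to the FP8 E4M3 max (448.0)
```

Without use_constant_amax, the same entry falls back to data-driven amax calibration, which is exactly the non-cast variant.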

Motivation

Before this change, the cast variants lived only in argparse (_KV_CAST_FORMATS = {"fp8_cast", "nvfp4_cast"}) and were layered on top of any recipe-loaded config. That meant --recipe nvfp4_default-fp8_kv would silently become a cast recipe due to the --kv_cache_qformat default. Now the recipe is self-contained: its YAML either sets use_constant_amax: true on the *[kv]_bmm_quantizer entry (cast) or doesn't (data-driven calibration).
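The gating change described above can be sketched in a few lines. The function and argument names below mirror the PR description but are stand-ins, not the actual hf_ptq.py source:

```python
from argparse import Namespace

def _set_kv_cache_constant_amax(quant_cfg: dict) -> None:
    """Stand-in for the post-hoc override in hf_ptq.py (assumed shape)."""
    quant_cfg.setdefault("*[kv]_bmm_quantizer", {})["use_constant_amax"] = True

def apply_kv_override(args: Namespace, quant_cfg: dict) -> dict:
    # After this PR: the override only runs when no recipe was given,
    # so a recipe YAML fully determines the KV-cache config.
    if args.recipe is None and args.kv_cache_qformat in {"fp8_cast", "nvfp4_cast"}:
        _set_kv_cache_constant_amax(quant_cfg)
    return quant_cfg

# Recipe provided: the default --kv_cache_qformat=fp8_cast no longer leaks in.
cfg = apply_kv_override(
    Namespace(recipe="general/ptq/nvfp4_default-fp8_kv", kv_cache_qformat="fp8_cast"),
    {},
)
print(cfg)  # {}
```

Before the change, the override ran unconditionally, which is how a non-cast recipe silently became a cast one.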

Test plan

  • pytest tests/unit/recipe/test_loader.py — all 24 tests pass, including the three new parametrized recipes.
  • Verified each new recipe round-trips through load_recipe() with use_constant_amax: True surviving Pydantic validation on the KV entry.
  • End-to-end run on /models/Qwen/Qwen3-8B (RTX 6000 Ada, 4 samples, seq_len=128) for all three new recipes:
    • After mtq.quantize(model, recipe.quantize.model_dump(), forward_loop=...), all 72 k_bmm_quantizer / v_bmm_quantizer modules have _use_constant_amax=True and _get_amax() returns 448.0 (FP8 E4M3 max).
    • Weight quantizers still calibrate from data normally (sample amax values: q_proj=0.5508, k_proj=0.6250, v_proj=0.1689, o_proj=0.7266).
  • Verified the --recipe authoritative behavior change:
    • Non-cast recipe + default --kv_cache_qformat=fp8_cast → KV entry does NOT get use_constant_amax (no silent override).
    • Cast recipe + contradictory --kv_cache_qformat=fp8 → KV entry keeps use_constant_amax=True (recipe wins).
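The amax behavior those end-to-end checks assert can be illustrated with a small stand-alone stub. StubQuantizer is invented for illustration; the real class is modelopt's TensorQuantizer:

```python
FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

class StubQuantizer:
    """Toy stand-in for how a quantizer resolves its amax."""

    def __init__(self, use_constant_amax, calibrated_amax=None):
        self._use_constant_amax = use_constant_amax
        self._calibrated_amax = calibrated_amax

    def _get_amax(self):
        # Cast variants skip calibration: amax is pinned to the FP8 E4M3 max.
        if self._use_constant_amax:
            return FP8_E4M3_MAX
        # Non-cast path: return whatever data-driven calibration observed.
        return self._calibrated_amax

kv = StubQuantizer(use_constant_amax=True)                       # KV cache quantizer
weight = StubQuantizer(use_constant_amax=False,
                       calibrated_amax=0.5508)                   # e.g. q_proj
print(kv._get_amax(), weight._get_amax())  # 448.0 0.5508
```

This mirrors the observed split: KV quantizers report the constant 448.0 while weight quantizers keep their calibrated values.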

Summary by CodeRabbit

  • Bug Fixes

    • CLI now preserves KV cache quantization settings specified in recipe YAML (no forced constant amax override).
  • New Features

    • Added three PTQ recipes for FP8 and NVFP4 with cast-KV variants and optimized KV cache handling.
  • Documentation

    • Expanded CLI help for --recipe and KV cache options with cast-recipe examples and behavior notes.
  • Tests

    • Added unit test coverage to load and validate the new PTQ recipes.

Introduce three built-in PTQ recipes that express the cast variants of
KV cache quantization directly in YAML by setting
use_constant_amax: true on the *[kv]_bmm_quantizer entry:

  - general/ptq/fp8_default-fp8_cast_kv
  - general/ptq/nvfp4_default-fp8_cast_kv
  - general/ptq/nvfp4_default-nvfp4_cast_kv

Previously, cast semantics could only be activated via
--kv_cache_qformat={fp8_cast,nvfp4_cast} in hf_ptq.py, which was
layered on top of any --recipe-supplied config and silently overrode
it. With these recipes, the YAML is self-contained and authoritative:
hf_ptq.py now skips the post-hoc _set_kv_cache_constant_amax override
when --recipe is provided, and --kv_cache_qformat is documented as
ignored in that case.

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv requested review from a team as code owners April 23, 2026 16:31
@cjluo-nv cjluo-nv requested review from Edwardf0t1 and h-guo18 April 23, 2026 16:31
Contributor

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5f4ef57-918a-402e-9b4e-516819db8599

📥 Commits

Reviewing files that changed from the base of the PR and between 1c2a68f and 177ec6e.

📒 Files selected for processing (1)
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml

📝 Walkthrough

Walkthrough

Adds three PTQ recipe YAMLs (FP8/NVFP4 cast-KV variants), updates the example quantization script to preserve recipe-provided KV cache settings instead of forcing constant amax, and extends the PTQ recipe loader test to include the new recipes.

Changes

  • New PTQ Recipes: modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml, modelopt_recipes/general/ptq/nvfp4_default-fp8_cast_kv.yaml, modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
    Adds three PTQ recipe files that enable KV-cache quantization (cast variants), set max-based algorithms, configure FP8/KV settings with use_constant_amax: true, enable/disable targeted quantizers, and specify weight/input quantizer block/bit settings.
  • CLI Control Flow & Help Text: examples/llm_ptq/hf_ptq.py
    Changes quantize_main to only enforce the constant-amax KV override when no recipe is provided (args.recipe is None), preserving YAML use_constant_amax when --recipe is used. Expands --recipe and --kv_cache_qformat argparse help to state that recipe-provided KV config takes precedence and --kv_cache_qformat is ignored under --recipe.
  • Test Parametrization: tests/unit/recipe/test_loader.py
    Adds the three new recipe paths to the built-in PTQ recipe list so the loader smoke test loads and validates them as PTQ recipes.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately reflects the main change: adding three new KV cache cast variant recipes (fp8_cast_kv and nvfp4_cast_kv) to the recipe library.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. Security scan for insecure patterns (torch.load, numpy.load with allow_pickle, trust_remote_code, eval/exec, nosec) in changed Python files.


Contributor

@coderabbitai coderabbitai Bot left a comment
🧹 Nitpick comments (2)
modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml (1)

26-33: Consider adding explicit enable: true for consistency.

The *input_quantizer and *weight_quantizer entries omit the explicit enable: true field, relying on the default. However, the NVFP4 cast recipes (e.g., nvfp4_default-nvfp4_cast_kv.yaml) explicitly set enable: true for these quantizers. For consistency across the recipe family, consider adding enable: true here as well.

🔧 Proposed fix for consistency
     - quantizer_name: '*input_quantizer'
+      enable: true
       cfg:
         num_bits: e4m3
         axis:
     - quantizer_name: '*weight_quantizer'
+      enable: true
       cfg:
         num_bits: e4m3
         axis:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml` around lines 26-33:
add an explicit enable: true to the quantizer entries *input_quantizer and
*weight_quantizer in the fp8_default-fp8_cast_kv recipe so they match the NVFP4
cast recipes' explicit enable behavior; locate the YAML blocks that define
quantizer_name: '*input_quantizer' and quantizer_name: '*weight_quantizer' and
insert enable: true on each entry to ensure consistency across the recipe family.
tests/unit/recipe/test_loader.py (1)

107-115: Consider adding a test to verify use_constant_amax is preserved in cast recipes.

The smoke test validates that recipes load successfully, but for the new *_cast_kv recipes, it would be valuable to verify that use_constant_amax: true is correctly preserved after loading. This ensures the core feature of these recipes—skipping KV calibration—works as intended.

🧪 Proposed test for use_constant_amax preservation
_BUILTIN_CAST_KV_RECIPES = [
    "general/ptq/fp8_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-nvfp4_cast_kv",
]


@pytest.mark.parametrize("recipe_path", _BUILTIN_CAST_KV_RECIPES)
def test_cast_kv_recipes_preserve_constant_amax(recipe_path):
    """Cast KV recipes must preserve use_constant_amax: True on KV quantizers."""
    recipe = load_recipe(recipe_path)
    kv_entries = [
        e for e in recipe.quantize.quant_cfg
        if e.get("quantizer_name") == "*[kv]_bmm_quantizer"
    ]
    assert len(kv_entries) == 1, "Expected exactly one KV BMM quantizer entry"
    cfg = kv_entries[0].get("cfg", {})
    assert cfg.get("use_constant_amax") is True, "use_constant_amax must be True for cast recipes"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `tests/unit/recipe/test_loader.py` around lines 107-115: add a parametrized
unit test that loads each cast-KV recipe (use the same recipe names as in
_BUILTIN_CAST_KV_RECIPES), calls load_recipe(recipe_path), filters
recipe.quantize.quant_cfg for entries where get("quantizer_name") ==
"*[kv]_bmm_quantizer", asserts exactly one such entry exists, and then asserts
that that entry's cfg.get("use_constant_amax") is True; reference load_recipe,
recipe.quantize.quant_cfg, and the quantizer_name "*[kv]_bmm_quantizer" to locate
where to inspect the loaded recipe.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 235fa5cc-88a2-41ee-b58d-bab0cdf61dcc

📥 Commits

Reviewing files that changed from the base of the PR and between 8663678 and 78e9c61.

📒 Files selected for processing (5)
  • examples/llm_ptq/hf_ptq.py
  • modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_default-fp8_cast_kv.yaml
  • modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
  • tests/unit/recipe/test_loader.py


codecov Bot commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.63%. Comparing base (c796611) to head (177ec6e).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   74.60%   75.63%   +1.02%     
==========================================
  Files         467      468       +1     
  Lines       50176    51145     +969     
==========================================
+ Hits        37435    38684    +1249     
+ Misses      12741    12461     -280     
Flag coverage:
  • examples: 41.46% (+5.86% ⬆️)
  • unit: 52.55% (+0.20% ⬆️)


Contributor

@meenchen meenchen left a comment

The PR is well-motivated and the logic is correct. The YAML recipes are consistent with existing ones, and the hf_ptq.py change correctly gates the post-hoc override. One minor issue: copyright years in new files should be 2025, not 2024. Also noting that these three new YAML files share ~90% of their content with the existing non-cast variants — the upcoming $import system (PR #1253) would be a great fit to deduplicate, but that's not blocking.

@@ -0,0 +1,70 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Contributor

Minor: Copyright year should be 2025 for new files, not 2024 (same applies to the other two new YAML files).

recipe_type: ptq
description: >-
  NVFP4 W4A4, NVFP4 KV cache with constant amax (skips KV calibration; amax
  hardcoded to FP8 E4M3 max 448.0), max calibration.
Contributor

Nit: The description says "amax hardcoded to FP8 E4M3 max 448.0" but this recipe uses NVFP4 (e2m1) for KV cache. The constant amax value for e2m1 would be different from FP8 E4M3's 448.0. Is the description accurate, or should it say something like "amax hardcoded to the format's max representable value"?

Collaborator Author

Good catch on reading it literally — but 448.0 is intentional here for NVFP4 KV too, not a typo.

use_constant_amax always returns torch.finfo(torch.float8_e4m3fn).max at runtime, regardless of the quantizer's num_bits:

https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/nn/modules/tensor_quantizer.py#L644-L647

This is deliberate — at deployment, the attention kernel upcasts the NVFP4 KV cache values to FP8 before the attention math, so the scale has to land in the FP8 E4M3 range (448.0). The existing docstring on QuantizerAttributeConfig.use_constant_amax calls this out explicitly:

This is used for KV cache quantization where the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization, so the amax is hardcoded to the FP8 range.

So the YAML description is accurate — "amax hardcoded to FP8 E4M3 max 448.0" correctly reflects what happens even for the NVFP4 KV recipe. I'll add a short note in the YAML description pointing at this rationale so the next reader isn't surprised.
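That rationale can be made concrete with a toy scale computation. The 448.0 and 6.0 constants are real format maxima for FP8 E4M3 and FP4 E2M1; the per-tensor scale formula is a generic illustration, not modelopt's kernel code:

```python
FP8_E4M3_MAX = 448.0  # max representable value in FP8 E4M3
E2M1_MAX = 6.0        # max representable value in FP4 E2M1 (the NVFP4 element format)

def dequant_scale(amax, format_max):
    # Generic per-tensor scale: map |x| <= amax onto the target format's range.
    return amax / format_max

# The attention kernel upcasts NVFP4 KV values to FP8 before the attention
# math, so the amax must cover the FP8 range; pinning it to the E2M1 max
# instead would produce a very different (and wrong) scale.
print(dequant_scale(FP8_E4M3_MAX, FP8_E4M3_MAX))  # 1.0
print(dequant_scale(E2M1_MAX, FP8_E4M3_MAX))      # ~0.0134
```

This is why 448.0 is the right constant even for the NVFP4 KV recipe: the scale must land in the FP8 E4M3 range that the deployed attention math actually uses.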

Collaborator

@shengliangxu shengliangxu left a comment

LGTM

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@cjluo-nv cjluo-nv enabled auto-merge (squash) April 23, 2026 17:56
@cjluo-nv cjluo-nv merged commit fda0899 into main Apr 23, 2026
45 checks passed
@cjluo-nv cjluo-nv deleted the chenjiel/fix_fp8_cast branch April 23, 2026 18:45
Contributor

@github-actions

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-23 18:45 UTC
