feat(recipes): add KV cache cast variants (fp8_cast / nvfp4_cast) (#1334)
Introduce three built-in PTQ recipes that express the cast variants of
KV cache quantization directly in YAML by setting
use_constant_amax: true on the *[kv]_bmm_quantizer entry:
- general/ptq/fp8_default-fp8_cast_kv
- general/ptq/nvfp4_default-fp8_cast_kv
- general/ptq/nvfp4_default-nvfp4_cast_kv
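As a rough illustration, the KV cache entry in such a recipe might look like the following YAML fragment. This is a sketch only: the entry layout is inferred from the sibling quantizer entries quoted later in the review, and the surrounding keys are assumptions, not copied from the actual recipe files.

```yaml
# Illustrative fragment; only quantizer_name, cfg, and use_constant_amax
# are confirmed by the PR text — the rest is inferred.
- quantizer_name: '*[kv]_bmm_quantizer'
  enable: true
  cfg:
    num_bits: e4m3            # FP8 cast variant of the KV cache
    use_constant_amax: true   # skip KV calibration; amax pinned to the FP8 E4M3 max
```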
Previously, cast semantics could only be activated via
--kv_cache_qformat={fp8_cast,nvfp4_cast} in hf_ptq.py, which was
layered on top of any --recipe-supplied config and silently overrode
it. With these recipes, the YAML is self-contained and authoritative:
hf_ptq.py now skips the post-hoc _set_kv_cache_constant_amax override
when --recipe is provided, and --kv_cache_qformat is documented as
ignored in that case.
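The gating described above can be sketched in plain Python. The helper name `_set_kv_cache_constant_amax` and the format set come from the PR text, but the function bodies and wiring here are hypothetical, not the actual `hf_ptq.py` code:

```python
from typing import Optional

_KV_CAST_FORMATS = {"fp8_cast", "nvfp4_cast"}

def _set_kv_cache_constant_amax(cfg: dict) -> dict:
    # Hypothetical stand-in for the hf_ptq.py helper of the same name:
    # mark the KV BMM quantizer entry as constant-amax, without mutating
    # the caller's config.
    out = dict(cfg)
    kv = dict(out.get("*[kv]_bmm_quantizer", {}))
    kv["use_constant_amax"] = True
    out["*[kv]_bmm_quantizer"] = kv
    return out

def resolve_kv_cache_cfg(quant_cfg: dict, recipe: Optional[str],
                         kv_cache_qformat: str) -> dict:
    if recipe is not None:
        # --recipe is authoritative: skip the post-hoc override entirely.
        return quant_cfg
    if kv_cache_qformat in _KV_CAST_FORMATS:
        # Legacy path: layer cast semantics on top of the default config.
        return _set_kv_cache_constant_amax(quant_cfg)
    return quant_cfg
```

With this shape, passing any `--recipe` leaves the loaded config untouched, while the no-recipe path keeps the old cast behavior.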
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
📝 Walkthrough: Adds three PTQ recipe YAMLs (FP8/NVFP4 cast-KV variants), updates the example quantization script to preserve recipe-provided KV cache settings instead of forcing constant amax, and extends the PTQ recipe loader test to include the new recipes.
🧹 Nitpick comments (2)
modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml (1)
Lines 26-33: Consider adding explicit `enable: true` for consistency. The `*input_quantizer` and `*weight_quantizer` entries omit the explicit `enable: true` field, relying on the default. However, the NVFP4 cast recipes (e.g., `nvfp4_default-nvfp4_cast_kv.yaml`) explicitly set `enable: true` for these quantizers. For consistency across the recipe family, consider adding `enable: true` here as well.

🔧 Proposed fix for consistency:

```diff
 - quantizer_name: '*input_quantizer'
+  enable: true
   cfg:
     num_bits: e4m3
     axis:
 - quantizer_name: '*weight_quantizer'
+  enable: true
   cfg:
     num_bits: e4m3
     axis:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml` around lines 26-33: Add an explicit `enable: true` to the quantizer anchors `*input_quantizer` and `*weight_quantizer` in the fp8_default-fp8_cast_kv recipe so they match the NVFP4 cast recipes' explicit enable behavior; locate the YAML blocks that define `quantizer_name: '*input_quantizer'` and `quantizer_name: '*weight_quantizer'` and insert `enable: true` to ensure consistency across the recipe family.

tests/unit/recipe/test_loader.py (1)
Lines 107-115: Consider adding a test to verify `use_constant_amax` is preserved in cast recipes. The smoke test validates that recipes load successfully, but for the new `*_cast_kv` recipes, it would be valuable to verify that `use_constant_amax: true` is correctly preserved after loading. This ensures the core feature of these recipes (skipping KV calibration) works as intended.

🧪 Proposed test for use_constant_amax preservation:
```python
_BUILTIN_CAST_KV_RECIPES = [
    "general/ptq/fp8_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-fp8_cast_kv",
    "general/ptq/nvfp4_default-nvfp4_cast_kv",
]

@pytest.mark.parametrize("recipe_path", _BUILTIN_CAST_KV_RECIPES)
def test_cast_kv_recipes_preserve_constant_amax(recipe_path):
    """Cast KV recipes must preserve use_constant_amax: True on KV quantizers."""
    recipe = load_recipe(recipe_path)
    kv_entries = [
        e for e in recipe.quantize.quant_cfg
        if e.get("quantizer_name") == "*[kv]_bmm_quantizer"
    ]
    assert len(kv_entries) == 1, "Expected exactly one KV BMM quantizer entry"
    cfg = kv_entries[0].get("cfg", {})
    assert cfg.get("use_constant_amax") is True, "use_constant_amax must be True for cast recipes"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/unit/recipe/test_loader.py` around lines 107-115: Add a parametrized unit test that loads each cast-KV recipe (use the same recipe names as in `_BUILTIN_CAST_KV_RECIPES`), calls `load_recipe(recipe_path)`, filters `recipe.quantize.quant_cfg` for entries where `get("quantizer_name") == "*[kv]_bmm_quantizer"`, asserts exactly one such entry exists, and then asserts that that entry's `cfg.get("use_constant_amax")` is True.
📒 Files selected for processing (5):
- examples/llm_ptq/hf_ptq.py
- modelopt_recipes/general/ptq/fp8_default-fp8_cast_kv.yaml
- modelopt_recipes/general/ptq/nvfp4_default-fp8_cast_kv.yaml
- modelopt_recipes/general/ptq/nvfp4_default-nvfp4_cast_kv.yaml
- tests/unit/recipe/test_loader.py
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1334      +/-   ##
==========================================
+ Coverage   74.60%   75.63%    +1.02%
==========================================
  Files         467      468        +1
  Lines       50176    51145      +969
==========================================
+ Hits        37435    38684     +1249
+ Misses      12741    12461      -280
==========================================
```
meenchen left a comment:
The PR is well-motivated and the logic is correct. The YAML recipes are consistent with existing ones, and the hf_ptq.py change correctly gates the post-hoc override. One minor issue: copyright years in new files should be 2025, not 2024. Also noting that these three new YAML files share ~90% of their content with the existing non-cast variants — the upcoming $import system (PR #1253) would be a great fit to deduplicate, but that's not blocking.
```diff
@@ -0,0 +1,70 @@
+# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
Minor: Copyright year should be 2025 for new files, not 2024 (same applies to the other two new YAML files).
```yaml
recipe_type: ptq
description: >-
  NVFP4 W4A4, NVFP4 KV cache with constant amax (skips KV calibration; amax
  hardcoded to FP8 E4M3 max 448.0), max calibration.
```
Nit: The description says "amax hardcoded to FP8 E4M3 max 448.0" but this recipe uses NVFP4 (e2m1) for KV cache. The constant amax value for e2m1 would be different from FP8 E4M3's 448.0. Is the description accurate, or should it say something like "amax hardcoded to the format's max representable value"?
Good catch on reading it literally — but 448.0 is intentional here for NVFP4 KV too, not a typo.
`use_constant_amax` always returns `torch.finfo(torch.float8_e4m3fn).max` at runtime, regardless of the quantizer's `num_bits`.
This is deliberate — at deployment, the attention kernel upcasts the NVFP4 KV cache values to FP8 before the attention math, so the scale has to land in the FP8 E4M3 range (448.0). The existing docstring on QuantizerAttributeConfig.use_constant_amax calls this out explicitly:
> This is used for KV cache quantization where the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization, so the amax is hardcoded to the FP8 range.
So the YAML description is accurate — "amax hardcoded to FP8 E4M3 max 448.0" correctly reflects what happens even for the NVFP4 KV recipe. I'll add a short note in the YAML description pointing at this rationale so the next reader isn't surprised.
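To make the 448.0 concrete, here is a standalone arithmetic check (plain Python, no torch; the helper names are ours, not ModelOpt's) showing why the FP8 E4M3 max is 448.0 and how a constant-amax lookup that ignores the quantizer format behaves:

```python
def e4m3_max() -> float:
    # FP8 E4M3 (the "fn" variant): 4 exponent bits (bias 7), 3 mantissa bits,
    # no infinities; mantissa 0b111 at the top exponent is reserved for NaN,
    # so the largest finite value is 1.75 * 2**8 = 448.0.
    max_mantissa = 1 + 6 / 8   # 1.75 (mantissa bits 0b110)
    max_exponent = 8           # unbiased: 15 - 7
    return max_mantissa * 2.0 ** max_exponent

def constant_kv_amax(num_bits) -> float:
    # Sketch of the behavior described above: the constant amax is the FP8
    # E4M3 max regardless of whether the KV quantizer is FP8 or NVFP4,
    # because the attention kernel upcasts NVFP4 KV values to FP8.
    return e4m3_max()

print(constant_kv_amax("e4m3"))   # 448.0
print(constant_kv_amax((2, 1)))   # 448.0 for the NVFP4 KV recipe too
```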
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Summary
- Add three built-in PTQ recipes whose KV cache entry sets the `use_constant_amax: true` quantizer field. These are recipe equivalents of `--kv_cache_qformat=fp8_cast/nvfp4_cast`:
  - `general/ptq/fp8_default-fp8_cast_kv`
  - `general/ptq/nvfp4_default-fp8_cast_kv`
  - `general/ptq/nvfp4_default-nvfp4_cast_kv`
- Make `--recipe` authoritative in `examples/llm_ptq/hf_ptq.py`: the post-hoc `_set_kv_cache_constant_amax` override now only runs when `--recipe` is None, so a recipe YAML fully determines the KV-cache config instead of being silently overridden by the default `--kv_cache_qformat=fp8_cast`. Updated help text on both flags.

Motivation
Before this change, the cast variants lived only in argparse (`_KV_CAST_FORMATS = {"fp8_cast", "nvfp4_cast"}`) and were layered on top of any recipe-loaded config. That meant `--recipe nvfp4_default-fp8_kv` would silently become a cast recipe due to the `--kv_cache_qformat` default. Now the recipe is self-contained: its YAML either sets `use_constant_amax: true` on the `*[kv]_bmm_quantizer` entry (cast) or doesn't (data-driven calibration).

Test plan
- `pytest tests/unit/recipe/test_loader.py` — all 24 tests pass, including the three new parametrized recipes. Each new recipe round-trips through `load_recipe()` with `use_constant_amax: True` surviving Pydantic validation on the KV entry.
- End-to-end run on `/models/Qwen/Qwen3-8B` (RTX 6000 Ada, 4 samples, seq_len=128) for all three new recipes: after `mtq.quantize(model, recipe.quantize.model_dump(), forward_loop=...)`, all 72 `k_bmm_quantizer`/`v_bmm_quantizer` modules have `_use_constant_amax=True` and `_get_amax()` returns `448.0` (FP8 E4M3 max).
- `--recipe`-authoritative behavior change: with a recipe and the default `--kv_cache_qformat=fp8_cast`, the KV entry does NOT get `use_constant_amax` (no silent override); with a cast recipe and `--kv_cache_qformat=fp8`, the KV entry keeps `use_constant_amax=True` (recipe wins).

Summary by CodeRabbit
Bug Fixes
New Features
Documentation
Tests