
Fix NVFP4 quantization for Qwen3.x MoE models (4 silent-failure bugs)#1323

Open
erictinkeredapps wants to merge 2 commits into NVIDIA:main from erictinkeredapps:fix-qwen3x-moe-nvfp4-export

Conversation


@erictinkeredapps erictinkeredapps commented Apr 22, 2026

Summary

Four bugs prevent NVFP4 quantization from producing quantized weights for Qwen3.5/3.6 MoE models (and likely other fused MoE architectures using _QuantFusedExperts). All four produce silent failures — no errors, just bfloat16 output identical to the input model.

Test Environment

  • Model: Qwen3.6-35B-A3B (MoE, 256 experts, top-8 routing)
  • Hardware: NVIDIA DGX Spark (GB10, Blackwell)
  • ModelOpt: 0.45.0 dev (editable install)
  • Transformers: 5.5.4
  • Result: 20.5 GB NVFP4 output (down from 66 GB bfloat16), verified uint8 expert weights with float8_e4m3fn local scales + float32 global scales

Bug Details

Bug 1: is_multimodal_model() crashes on None architectures

File: modelopt/torch/export/model_utils.py
Models with config.architectures = None (common for fine-tuned checkpoints) crash when is_multimodal_model() iterates the list. One-line fix: or [] fallback.
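A minimal sketch of the failure and the fix. The function body here is an illustrative stand-in following the PR description, not the exact ModelOpt implementation; `Config` stands in for a Hugging Face config object whose `architectures` field is None.

```python
# Illustrative repro of Bug 1: iterating config.architectures when it is None.

class Config:
    architectures = None  # common for fine-tuned checkpoints

def is_multimodal_buggy(config):
    # Crashes with TypeError: 'NoneType' object is not iterable
    return any("VL" in arch for arch in config.architectures)

def is_multimodal_fixed(config):
    # The one-line fix: fall back to an empty list before iterating
    return any("VL" in arch for arch in (getattr(config, "architectures", None) or []))

print(is_multimodal_fixed(Config()))  # False, no crash
```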

Bug 2: (Usage issue, not a code bug — fixed in caller)

Bug 3: get_quantization_format() does not recognize _QuantFusedExperts

File: modelopt/torch/export/quant_utils.py
The function iterates weight_attr_names(module) which returns singular attribute names. _QuantFusedExperts modules use plural ModuleList quantizers (gate_up_proj_weight_quantizers.N), so the function returns None and the module is treated as unquantized. Added a pre-check for plural ModuleList quantizers before the singular loop.
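A hedged sketch of the pre-check using stand-in classes. The attribute names follow the PR description, but the real `_QuantFusedExperts` layout and the `QUANTIZATION_NVFP4` constant differ; this version scans for the first enabled quantizer rather than trusting index 0.

```python
# Stand-ins for a TensorQuantizer and a fused-experts module (assumed shapes).

class FakeQuantizer:
    def __init__(self, enabled=True):
        self.num_bits = (2, 1)                     # NVFP4 mantissa/exponent bits
        self.block_sizes = {"scale_bits": (4, 3)}  # float8_e4m3fn local scales
        self.is_enabled = enabled

class FusedExperts:
    def __init__(self, n_experts=4):
        # plural ModuleList-style quantizers, one entry per expert
        self.gate_up_proj_weight_quantizers = [FakeQuantizer() for _ in range(n_experts)]

def get_quantization_format(module):
    """Pre-check plural quantizer lists before the singular-attribute loop."""
    for attr in ("gate_up_proj_weight_quantizers", "down_proj_weight_quantizers"):
        quantizer_list = getattr(module, attr, None)
        if quantizer_list:
            # first *enabled* quantizer; expert 0 alone may be disabled
            q = next((x for x in quantizer_list if getattr(x, "is_enabled", False)), None)
            if q is not None and q.num_bits == (2, 1) and q.block_sizes.get("scale_bits") == (4, 3):
                return "NVFP4"
    return None  # fall through to the existing singular-name loop in practice

print(get_quantization_format(FusedExperts()))  # NVFP4
```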

Bug 4: NVFP4 config wildcards do not match plural quantizer names

File: modelopt/torch/quantization/config.py
_nvfp4_selective_quant_cfg() generates patterns like *mlp.experts*weight_quantizer (singular). _QuantFusedExperts creates quantizers named gate_up_proj_weight_quantizers.0 (plural + index). The fnmatch fails, quantizers never receive NVFP4 config, and 100% stay at disabled default. Added wildcard entries for both plural suffix patterns.
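The mismatch can be demonstrated directly with `fnmatch`. The quantizer name below is an assumed example following the naming scheme described in this PR:

```python
from fnmatch import fnmatch

# Singular pattern generated by _nvfp4_selective_quant_cfg() vs. the plural,
# indexed quantizer name that _QuantFusedExperts actually creates.
singular_pattern = "*mlp.experts*weight_quantizer"
quantizer_name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"

print(fnmatch(quantizer_name, singular_pattern))                   # False: never matched
print(fnmatch(quantizer_name, "*mlp.experts*weight_quantizers*"))  # True: added wildcard matches
```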

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong export path

File: modelopt/torch/export/unified_export_hf.py
Two elif branches: one checks type name ("Llama4TextExperts" in type().__name__), the other checks hasattr("gate_up_proj_weight_quantizers"). After _QuantFusedExperts wrapping, QuantQwen3_5MoeExperts matches the type-name branch, which calls _export_quantized_weight() looking for singular attributes → AttributeError. Swapped the elif order so the plural-attribute check runs first.
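An illustrative sketch of why the branch order matters. The class and attribute names come from the PR description; the real dispatch in `_process_quantized_modules` has more branches and calls the actual export helpers.

```python
# Stand-in for a fused-experts module after _QuantFusedExperts wrapping:
# it only has plural quantizer attributes, yet its type name contains "Experts".
class QuantQwen3_5MoeExperts:
    gate_up_proj_weight_quantizers = ["q0"]

class Llama4TextExperts:
    pass  # a per-expert module with singular attributes (not shown)

def dispatch(sub_module):
    # Fixed order: the plural fused-experts attribute check runs first.
    if hasattr(sub_module, "gate_up_proj_weight_quantizers"):
        return "fused_experts_export"
    elif "Experts" in type(sub_module).__name__:
        # The old first branch; it would look up singular attributes and
        # raise AttributeError on a fused-experts module.
        return "per_expert_export"
    return "default"

print(dispatch(QuantQwen3_5MoeExperts()))  # fused_experts_export
print(dispatch(Llama4TextExperts()))       # per_expert_export
```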

Changes

File                                        Change
modelopt/torch/export/model_utils.py        +1 line: or [] fallback
modelopt/torch/export/quant_utils.py        +19 lines: plural ModuleList check
modelopt/torch/export/unified_export_hf.py  elif reorder (17 lines changed)
modelopt/torch/quantization/config.py       +6 lines: plural wildcard patterns

Summary by CodeRabbit

  • Bug Fixes

    • Improved robustness of model architecture detection to handle missing or non-iterable configuration values.
  • New Features

    • Added support for exporting quantized models with fused expert (MoE) modules.
    • Enhanced quantization configuration to recognize and configure plural-style expert weight quantizers.
    • Adjusted export ordering to ensure fused experts are handled before other expert branches.

Four bugs prevent NVFP4 export from producing quantized weights for
Qwen3.5/3.6 MoE models (and potentially other fused MoE architectures).
All produce silent failures — no errors, just bfloat16 output identical
to input.

Bug 1: is_multimodal_model() crashes when config.architectures is None
  - model_utils.py: add 'or []' fallback for NoneType iteration

Bug 3: get_quantization_format() doesn't recognize _QuantFusedExperts
  - quant_utils.py: add check for plural ModuleList quantizers
    (gate_up_proj_weight_quantizers, down_proj_weight_quantizers)
    before the singular weight_quantizer loop

Bug 4: NVFP4 config wildcards don't match plural quantizer names
  - config.py: _nvfp4_selective_quant_cfg() only generates patterns
    for singular 'weight_quantizer', but _QuantFusedExperts creates
    plural ModuleList quantizers. Add wildcard entries for both
    gate_up_proj_weight_quantizers* and down_proj_weight_quantizers*

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong path
  - unified_export_hf.py: swap elif branches so hasattr check for
    gate_up_proj_weight_quantizers comes before type-name checks.
    Without this, QuantQwen3_5MoeExperts hits the singular-attribute
    branch and crashes with AttributeError

Tested on: Qwen3.6-35B-A3B (MoE), NVIDIA DGX Spark (GB10),
modelopt 0.45.0 dev, transformers 5.5.4
Output: 20.5 GB NVFP4 (down from 66 GB bfloat16)

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 85c2b071-7454-481e-b4c4-5e74f3e3843f

📥 Commits

Reviewing files that changed from the base of the PR and between 1b1fced and 5d5c492.

📒 Files selected for processing (1)
  • modelopt/torch/export/quant_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/quant_utils.py

📝 Walkthrough

Updates quantization export pipeline to support fused MoE expert modules: normalizes multimodal model detection, adds early quantizer-format detection for fused experts, reorders export logic to prioritize fused-expert handling, and extends NVFP4 selective quant config generation for plural weight-quantizer attributes.

Changes

Cohort / File(s) Summary
Multimodal Model Detection
modelopt/torch/export/model_utils.py
Normalized config.architectures with getattr(config, "architectures", []) or [] in is_multimodal_model() to ensure safe iteration when value is missing or falsey.
Fused Expert Quantization Support
modelopt/torch/export/quant_utils.py, modelopt/torch/quantization/config.py
Added pre-check in get_quantization_format() to inspect plural ModuleList quantizer attributes (gate_up_proj_weight_quantizers, down_proj_weight_quantizers), read first enabled quantizer num_bits/scale_bits, and return QUANTIZATION_NVFP4 for matching patterns. Extended _nvfp4_selective_quant_cfg() to emit wildcard-matched entries for plural quantizer attribute names.
Export Logic Reordering
modelopt/torch/export/unified_export_hf.py
Reordered _process_quantized_modules() conditional so fused MoE expert modules detected via hasattr(sub_module, "gate_up_proj_weight_quantizers") are handled earlier (calling _export_fused_experts with reshard=False) before specific expert-type branches.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main purpose of the changeset: fixing NVFP4 quantization for Qwen3.x MoE models by addressing four distinct bugs across multiple files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed The PR's changes only adjust export and quantization logic and introduce no new instances of torch.load with unsafe flags, numpy.load with allow_pickle=True, hardcoded trust_remote_code=True, eval/exec on external inputs, "# nosec" comments, or non-permissively licensed dependencies.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 673-685: The code currently only inspects quantizer_list[0] to
detect NVFP4, which misses cases where expert 0 is disabled; update the logic in
the detection block to iterate over quantizer_list and find the first quantizer
q where hasattr(q, "is_enabled") and q.is_enabled (or otherwise any enabled
quantizer), then read num_bits, block_sizes and compute scale_bits from that
enabled q and return QUANTIZATION_NVFP4 when matching (num_bits == (2, 1) and
scale_bits == (4, 3)); ensure the fallback to QUANTIZATION_NONE only happens
after checking all quantizers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f0da2041-19c7-4d5f-b7a8-ecfb8c942950

📥 Commits

Reviewing files that changed from the base of the PR and between e56682e and 1b1fced.

📒 Files selected for processing (4)
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py

Comment thread modelopt/torch/export/quant_utils.py Outdated
Comment on lines +673 to +685
if quantizer_list is not None and len(quantizer_list) > 0:
    # Check the first quantizer in the list — all share the same config
    q = quantizer_list[0]
    if hasattr(q, "is_enabled") and q.is_enabled:
        num_bits = getattr(q, "num_bits", None)
        block_sizes = getattr(q, "block_sizes", None)
        scale_bits = (
            block_sizes.get("scale_bits", (8, 0))
            if isinstance(block_sizes, dict) and "scale_bits" in block_sizes
            else (8, 0)
        )
        if num_bits == (2, 1) and scale_bits == (4, 3):
            return QUANTIZATION_NVFP4

⚠️ Potential issue | 🟡 Minor

Avoid assuming expert quantizer at index 0 is representative.

At Line 675, only quantizer_list[0] is checked. If expert 0 is disabled and another expert is enabled, format detection can fall through to QUANTIZATION_NONE, causing a silent skip of quantized export for that fused module.

🔧 Suggested fix
-        if quantizer_list is not None and len(quantizer_list) > 0:
-            # Check the first quantizer in the list — all share the same config
-            q = quantizer_list[0]
-            if hasattr(q, "is_enabled") and q.is_enabled:
+        if quantizer_list is not None and len(quantizer_list) > 0:
+            # Find the first enabled quantizer in the list
+            q = next((item for item in quantizer_list if getattr(item, "is_enabled", False)), None)
+            if q is not None:
                 num_bits = getattr(q, "num_bits", None)
                 block_sizes = getattr(q, "block_sizes", None)
                 scale_bits = (
                     block_sizes.get("scale_bits", (8, 0))
                     if isinstance(block_sizes, dict) and "scale_bits" in block_sizes
                     else (8, 0)
                 )
                 if num_bits == (2, 1) and scale_bits == (4, 3):
                     return QUANTIZATION_NVFP4

CodeRabbit review: expert 0 may be disabled when uncalibrated, so
checking only quantizer_list[0] can miss the actual NVFP4 config.
Now iterates to find the first enabled quantizer in the list.
