
Fix NVFP4 quantization for Qwen3.x MoE models (4 silent-failure bugs)#1323

Open
erictinkeredapps wants to merge 2 commits into NVIDIA:main from erictinkeredapps:fix-qwen3x-moe-nvfp4-export

Conversation


@erictinkeredapps erictinkeredapps commented Apr 22, 2026

Summary

Four bugs prevent NVFP4 quantization from producing quantized weights for Qwen3.5/3.6 MoE models (and likely other fused MoE architectures using _QuantFusedExperts). All four produce silent failures — no errors, just bfloat16 output identical to the input model.

Test Environment

  • Model: Qwen3.6-35B-A3B (MoE, 256 experts, top-8 routing)
  • Hardware: NVIDIA DGX Spark (GB10, Blackwell)
  • ModelOpt: 0.45.0 dev (editable install)
  • Transformers: 5.5.4
  • Result: 20.5 GB NVFP4 output (down from 66 GB bfloat16), verified uint8 expert weights with float8_e4m3fn local scales + float32 global scales

Bug Details

Bug 1: is_multimodal_model() crashes on None architectures

File: modelopt/torch/export/model_utils.py
Models with config.architectures = None (common for fine-tuned checkpoints) crash when is_multimodal_model() iterates the list. One-line fix: or [] fallback.
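A minimal sketch of the failure and the fix. The function body here is an illustrative stand-in following the PR description, not the exact ModelOpt implementation; `Config` stands in for a Hugging Face config object whose `architectures` field is None.

```python
# Illustrative repro of Bug 1: iterating config.architectures when it is None.

class Config:
    architectures = None  # common for fine-tuned checkpoints

def is_multimodal_buggy(config):
    # Crashes with TypeError: 'NoneType' object is not iterable
    return any("VL" in arch for arch in config.architectures)

def is_multimodal_fixed(config):
    # The one-line fix: fall back to an empty list before iterating
    return any("VL" in arch for arch in (getattr(config, "architectures", None) or []))

print(is_multimodal_fixed(Config()))  # False, no crash
```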

Bug 2: (Usage issue, not a code bug — fixed in caller)

Bug 3: get_quantization_format() does not recognize _QuantFusedExperts

File: modelopt/torch/export/quant_utils.py
The function iterates weight_attr_names(module) which returns singular attribute names. _QuantFusedExperts modules use plural ModuleList quantizers (gate_up_proj_weight_quantizers.N), so the function returns None and the module is treated as unquantized. Added a pre-check for plural ModuleList quantizers before the singular loop.
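A hedged sketch of the pre-check using stand-in classes. The attribute names follow the PR description, but the real `_QuantFusedExperts` layout and the `QUANTIZATION_NVFP4` constant differ; this version scans for the first enabled quantizer rather than trusting index 0.

```python
# Stand-ins for a TensorQuantizer and a fused-experts module (assumed shapes).

class FakeQuantizer:
    def __init__(self, enabled=True):
        self.num_bits = (2, 1)                     # NVFP4 mantissa/exponent bits
        self.block_sizes = {"scale_bits": (4, 3)}  # float8_e4m3fn local scales
        self.is_enabled = enabled

class FusedExperts:
    def __init__(self, n_experts=4):
        # plural ModuleList-style quantizers, one entry per expert
        self.gate_up_proj_weight_quantizers = [FakeQuantizer() for _ in range(n_experts)]

def get_quantization_format(module):
    """Pre-check plural quantizer lists before the singular-attribute loop."""
    for attr in ("gate_up_proj_weight_quantizers", "down_proj_weight_quantizers"):
        quantizer_list = getattr(module, attr, None)
        if quantizer_list:
            # first *enabled* quantizer; expert 0 alone may be disabled
            q = next((x for x in quantizer_list if getattr(x, "is_enabled", False)), None)
            if q is not None and q.num_bits == (2, 1) and q.block_sizes.get("scale_bits") == (4, 3):
                return "NVFP4"
    return None  # fall through to the existing singular-name loop in practice

print(get_quantization_format(FusedExperts()))  # NVFP4
```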

Bug 4: NVFP4 config wildcards do not match plural quantizer names

File: modelopt/torch/quantization/config.py
_nvfp4_selective_quant_cfg() generates patterns like *mlp.experts*weight_quantizer (singular). _QuantFusedExperts creates quantizers named gate_up_proj_weight_quantizers.0 (plural + index). The fnmatch fails, quantizers never receive NVFP4 config, and 100% stay at disabled default. Added wildcard entries for both plural suffix patterns.
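The mismatch can be demonstrated directly with `fnmatch`. The quantizer name below is an assumed example following the naming scheme described in this PR:

```python
from fnmatch import fnmatch

# Singular pattern generated by _nvfp4_selective_quant_cfg() vs. the plural,
# indexed quantizer name that _QuantFusedExperts actually creates.
singular_pattern = "*mlp.experts*weight_quantizer"
quantizer_name = "model.layers.0.mlp.experts.gate_up_proj_weight_quantizers.0"

print(fnmatch(quantizer_name, singular_pattern))                   # False: never matched
print(fnmatch(quantizer_name, "*mlp.experts*weight_quantizers*"))  # True: added wildcard matches
```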

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong export path

File: modelopt/torch/export/unified_export_hf.py
Two elif branches: one checks type name ("Llama4TextExperts" in type().__name__), the other checks hasattr("gate_up_proj_weight_quantizers"). After _QuantFusedExperts wrapping, QuantQwen3_5MoeExperts matches the type-name branch, which calls _export_quantized_weight() looking for singular attributes → AttributeError. Swapped the elif order so the plural-attribute check runs first.
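An illustrative sketch of why the branch order matters. The class and attribute names come from the PR description; the real dispatch in `_process_quantized_modules` has more branches and calls the actual export helpers.

```python
# Stand-in for a fused-experts module after _QuantFusedExperts wrapping:
# it only has plural quantizer attributes, yet its type name contains "Experts".
class QuantQwen3_5MoeExperts:
    gate_up_proj_weight_quantizers = ["q0"]

class Llama4TextExperts:
    pass  # a per-expert module with singular attributes (not shown)

def dispatch(sub_module):
    # Fixed order: the plural fused-experts attribute check runs first.
    if hasattr(sub_module, "gate_up_proj_weight_quantizers"):
        return "fused_experts_export"
    elif "Experts" in type(sub_module).__name__:
        # The old first branch; it would look up singular attributes and
        # raise AttributeError on a fused-experts module.
        return "per_expert_export"
    return "default"

print(dispatch(QuantQwen3_5MoeExperts()))  # fused_experts_export
print(dispatch(Llama4TextExperts()))       # per_expert_export
```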

Changes

File                                        Change
modelopt/torch/export/model_utils.py        +1 line: or [] fallback
modelopt/torch/export/quant_utils.py        +19 lines: plural ModuleList check
modelopt/torch/export/unified_export_hf.py  elif reorder (17 lines changed)
modelopt/torch/quantization/config.py       +6 lines: plural wildcard patterns

Summary by CodeRabbit

  • Bug Fixes

    • Improved robustness of model architecture detection to handle missing or non-iterable configuration values.
  • New Features

    • Added support for exporting quantized models with fused expert (MoE) modules.
    • Enhanced quantization configuration to recognize and configure plural-style expert weight quantizers.
    • Adjusted export ordering to ensure fused experts are handled before other expert branches.

Four bugs prevent NVFP4 export from producing quantized weights for
Qwen3.5/3.6 MoE models (and potentially other fused MoE architectures).
All produce silent failures — no errors, just bfloat16 output identical
to input.

Bug 1: is_multimodal_model() crashes when config.architectures is None
  - model_utils.py: add 'or []' fallback for NoneType iteration

Bug 3: get_quantization_format() doesn't recognize _QuantFusedExperts
  - quant_utils.py: add check for plural ModuleList quantizers
    (gate_up_proj_weight_quantizers, down_proj_weight_quantizers)
    before the singular weight_quantizer loop

Bug 4: NVFP4 config wildcards don't match plural quantizer names
  - config.py: _nvfp4_selective_quant_cfg() only generates patterns
    for singular 'weight_quantizer', but _QuantFusedExperts creates
    plural ModuleList quantizers. Add wildcard entries for both
    gate_up_proj_weight_quantizers* and down_proj_weight_quantizers*

Bug 5: _process_quantized_modules elif order sends fused MoE to wrong path
  - unified_export_hf.py: swap elif branches so hasattr check for
    gate_up_proj_weight_quantizers comes before type-name checks.
    Without this, QuantQwen3_5MoeExperts hits the singular-attribute
    branch and crashes with AttributeError

Tested on: Qwen3.6-35B-A3B (MoE), NVIDIA DGX Spark (GB10),
modelopt 0.45.0 dev, transformers 5.5.4
Output: 20.5 GB NVFP4 (down from 66 GB bfloat16)

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


coderabbitai Bot commented Apr 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 85c2b071-7454-481e-b4c4-5e74f3e3843f

📥 Commits

Reviewing files that changed from the base of the PR and between 1b1fced and 5d5c492.

📒 Files selected for processing (1)
  • modelopt/torch/export/quant_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/quant_utils.py

📝 Walkthrough

Updates quantization export pipeline to support fused MoE expert modules: normalizes multimodal model detection, adds early quantizer-format detection for fused experts, reorders export logic to prioritize fused-expert handling, and extends NVFP4 selective quant config generation for plural weight-quantizer attributes.

Changes

Cohort / File(s) Summary
Multimodal Model Detection
modelopt/torch/export/model_utils.py
Normalized config.architectures with getattr(config, "architectures", []) or [] in is_multimodal_model() to ensure safe iteration when value is missing or falsey.
Fused Expert Quantization Support
modelopt/torch/export/quant_utils.py, modelopt/torch/quantization/config.py
Added pre-check in get_quantization_format() to inspect plural ModuleList quantizer attributes (gate_up_proj_weight_quantizers, down_proj_weight_quantizers), read first enabled quantizer num_bits/scale_bits, and return QUANTIZATION_NVFP4 for matching patterns. Extended _nvfp4_selective_quant_cfg() to emit wildcard-matched entries for plural quantizer attribute names.
Export Logic Reordering
modelopt/torch/export/unified_export_hf.py
Reordered _process_quantized_modules() conditional so fused MoE expert modules detected via hasattr(sub_module, "gate_up_proj_weight_quantizers") are handled earlier (calling _export_fused_experts with reshard=False) before specific expert-type branches.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and specifically describes the main purpose of the changeset: fixing NVFP4 quantization for Qwen3.x MoE models by addressing four distinct bugs across multiple files.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed The PR's changes only adjust export and quantization logic and introduce no new instances of torch.load with unsafe flags, numpy.load with allow_pickle=True, hardcoded trust_remote_code=True, eval/exec on external inputs, "# nosec" comments, or non-permissively licensed dependencies.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/quant_utils.py`:
- Around line 673-685: The code currently only inspects quantizer_list[0] to
detect NVFP4, which misses cases where expert 0 is disabled; update the logic in
the detection block to iterate over quantizer_list and find the first quantizer
q where hasattr(q, "is_enabled") and q.is_enabled (or otherwise any enabled
quantizer), then read num_bits, block_sizes and compute scale_bits from that
enabled q and return QUANTIZATION_NVFP4 when matching (num_bits == (2, 1) and
scale_bits == (4, 3)); ensure the fallback to QUANTIZATION_NONE only happens
after checking all quantizers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f0da2041-19c7-4d5f-b7a8-ecfb8c942950

📥 Commits

Reviewing files that changed from the base of the PR and between e56682e and 1b1fced.

📒 Files selected for processing (4)
  • modelopt/torch/export/model_utils.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py

Comment thread modelopt/torch/export/quant_utils.py Outdated
Comment on lines +673 to +685
if quantizer_list is not None and len(quantizer_list) > 0:
    # Check the first quantizer in the list — all share the same config
    q = quantizer_list[0]
    if hasattr(q, "is_enabled") and q.is_enabled:
        num_bits = getattr(q, "num_bits", None)
        block_sizes = getattr(q, "block_sizes", None)
        scale_bits = (
            block_sizes.get("scale_bits", (8, 0))
            if isinstance(block_sizes, dict) and "scale_bits" in block_sizes
            else (8, 0)
        )
        if num_bits == (2, 1) and scale_bits == (4, 3):
            return QUANTIZATION_NVFP4

⚠️ Potential issue | 🟡 Minor

Avoid assuming expert quantizer at index 0 is representative.

At Line 675, only quantizer_list[0] is checked. If expert 0 is disabled and another expert is enabled, format detection can fall through to QUANTIZATION_NONE, causing a silent skip of quantized export for that fused module.

🔧 Suggested fix
-        if quantizer_list is not None and len(quantizer_list) > 0:
-            # Check the first quantizer in the list — all share the same config
-            q = quantizer_list[0]
-            if hasattr(q, "is_enabled") and q.is_enabled:
+        if quantizer_list is not None and len(quantizer_list) > 0:
+            # Find the first enabled quantizer in the list
+            q = next((item for item in quantizer_list if getattr(item, "is_enabled", False)), None)
+            if q is not None:
                 num_bits = getattr(q, "num_bits", None)
                 block_sizes = getattr(q, "block_sizes", None)
                 scale_bits = (
                     block_sizes.get("scale_bits", (8, 0))
                     if isinstance(block_sizes, dict) and "scale_bits" in block_sizes
                     else (8, 0)
                 )
                 if num_bits == (2, 1) and scale_bits == (4, 3):
                     return QUANTIZATION_NVFP4

CodeRabbit review: expert 0 may be disabled when uncalibrated, so
checking only quantizer_list[0] can miss the actual NVFP4 config.
Now iterates to find the first enabled quantizer in the list.
