# Gkarch/update 1314 (#1316)

Closed — grzegorz-k-karch wants to merge 31 commits into `feature/vllm_deployment_docs`.
### What does this PR do?

**Type of change:** Bug fix

During Megatron→vLLM fakequant export (`export_mcore_gpt_to_hf_vllm_fq`), the `weight_quantizer` is now applied as fake quantization (quantize + dequantize) directly to the exported weight tensor, and its amax is no longer saved to `quantizer_state.pth`. On reload, if `weight_quantizer` keys are absent from the checkpoint (because they were folded at export time), the corresponding quantizer modules are disabled.

This is especially useful when amax values are not synced across experts for `weight_quantizer`: it allows the `weight_quantizer` to keep them different for better accuracy.

### Usage

```python
# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)
```

### Testing

Step 1 — Quantize (run from Megatron-LM `examples/post_training/modelopt`):

```bash
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
  bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```

Step 2 — Export for vLLM fakequant:

```bash
MLM_EXTRA_ARGS=--export-vllm-fq \
  HF_MODEL_CKPT=<path/to/hf/weights> \
  MLM_MODEL_CKPT=<quant-ckpt-name> \
  EXPORT_DIR=<export-dir> \
  bash export.sh <hf-model-id>
```

Step 3 — Serve (run from `examples/vllm_serve`):

```bash
QUANT_CFG=NVFP4_DEFAULT_CFG \
  QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
  python3 vllm_serve_fakequant.py <export-dir> \
  -tp 1 --served-model-name <model-name> \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code --enforce-eager \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

## Summary by CodeRabbit

* **Bug Fixes**
  * Better handling when loading checkpoints: missing weight-quantizer entries are validated and the corresponding modules are disabled to avoid load failures.
* **Improvements**
  * Export now folds enabled weight quantizers into the exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.

---

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
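The export-time fold and the reload-time disable described above can be sketched in plain Python (an illustrative sketch only; `fake_quantize` and `disable_missing_weight_quantizers` are hypothetical names, not the ModelOpt API):

```python
def fake_quantize(weights, amax, num_bits=8):
    # Symmetric fake quantization: snap each weight to an integer grid derived
    # from amax, then immediately dequantize back to floats
    # ("quantize + dequantize"). At export time this result is written directly
    # into the exported weight tensor, so amax need not be saved separately.
    scale = amax / (2 ** (num_bits - 1) - 1)
    return [round(w / scale) * scale for w in weights]


def disable_missing_weight_quantizers(quantizers, checkpoint_state):
    # On reload: weight quantizers whose state was folded at export time have
    # no entry in quantizer_state.pth, so they are disabled instead of failing.
    for name, q in quantizers.items():
        if name.endswith("weight_quantizer") and name not in checkpoint_state:
            q["enabled"] = False
```

Because each expert's weight tensor is folded with its own amax, per-expert amax values survive export even though nothing is written to `quantizer_state.pth`.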
## Summary

Automated weekly update of the `uv.lock` file for nSpect scanning:

- `uv.lock` — upgraded all transitive dependencies to the latest compatible versions

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

- Add `EXTRA_PIP_DEPS` support to the launcher's `ptq.sh` so models requiring extra pip packages (e.g., `mamba-ssm` for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

- Add a new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer, including credential locations, how to add them, and common failure modes.
- Add Step 7.5 with the NEL default image table, a DockerHub-first strategy with NGC fallback, and a build-config CLI note.
- Add an auth check before remote SLURM deployment.

### Usage

Set `EXTRA_PIP_DEPS` in the launcher YAML's `environment` section:

```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```

### Testing

Tested end-to-end: NVFP4 quantization of `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` on a B200 cluster via the launcher. The job succeeded: `mamba-ssm` installed automatically, calibration completed (512 samples, 84 s), and the checkpoint was exported (18 GB, 2 safetensor shards).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Documentation**
  * Added a container registry authentication verification workflow for SLURM deployments, including credential checks, verification commands, common failure symptoms, and remediation guidance.
  * Required credential validation before SLURM job submission and added SLURM-only verification steps with image fallback recommendations.
  * New dependency-checking step for models that use remote/trust_remote_code, plus guidance for resolving extra package requirements and tightened build-config guidance.
  * Updated PTQ launcher documentation to reference the new wrapper script.
* **New Features**
  * Support for specifying extra pip dependencies during model processing via an environment variable.

---

Signed-off-by: Kai Xu <kaix@nvidia.com>
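The hook's behavior can be illustrated with a minimal shell sketch (the function name is hypothetical; the real logic lives inside the launcher's `ptq.sh`, and the `echo` stands in for the actual `pip install`):

```shell
#!/bin/sh
# Sketch of the EXTRA_PIP_DEPS hook: if the variable is non-empty, install the
# space-separated package list before PTQ starts; otherwise do nothing.
install_extra_deps() {
    [ -n "$1" ] || return 0
    # The real script would run: pip install $1
    echo "pip install $1"
}

install_extra_deps "${EXTRA_PIP_DEPS:-}"
```

With `EXTRA_PIP_DEPS="mamba-ssm causal-conv1d"` set in the launcher YAML, the packages are installed once at job startup, before calibration begins.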
### What does this PR do?

**Type of change:** Bug fix

Fixes TRT-LLM DeepEP kernel failures during LLM deployment on unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert parallelism (`ep`) to 1 instead of auto-setting it to the GPU count for MoE models.

Previously, when the model config contained expert-related keys, `ep` was automatically set to `torch.cuda.device_count()`, which triggered DeepEP kernel failures on GPUs that don't support it. Now `ep` defaults to 1 while still enabling attention data parallelism for MoE models. Expert parallelism can be enabled explicitly by the caller when the environment is known to support it.

### Testing

- [x] Verified that the `llm_ptq` test passes with this fix on Blackwell GPUs.
- [x] 2-GPU CI test triggered: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24495054531/job/71588037727

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
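The revised default can be sketched as follows (an illustrative stand-in; the function and key names are assumptions, not the actual deployment API):

```python
def resolve_moe_parallelism(model_config, gpu_count):
    # Sketch of the fixed behavior: detecting expert-related keys no longer
    # auto-sets ep to the GPU count (which invoked DeepEP kernels on GPUs that
    # don't support them). ep defaults to 1, and the caller may raise it
    # explicitly on hardware known to support DeepEP. Attention data
    # parallelism stays enabled for MoE models.
    is_moe = any("expert" in key for key in model_config)
    return {
        "tp": gpu_count,
        "ep": 1,  # previously: gpu_count when is_moe
        "enable_attention_dp": is_moe,
    }
```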
…equirements (#1275)

### What does this PR do?

**Type of change:** Bug fix

Removed version pins for torch and transformers.

### Testing

Tested quantization with a couple of models; working as expected.

## Summary by CodeRabbit

* **Chores**
  * Relaxed dependency specs: removed the strict pin for torch to allow the latest compatible installs, and constrained transformers to <5.0.0 for broader compatibility and easier updates.

---

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Don't allow more than a 1% overall project coverage drop per PR; 2% was too much for such a large codebase.

## Summary by CodeRabbit

* **Chores**
  * Updated code coverage enforcement thresholds for pull requests to maintain stricter quality standards.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
As title.

## Summary by CodeRabbit

* **Chores**
  * Updated the release date for version 0.43 in the changelog.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…to allow user to bypass if needed (#1279)

## Summary

- Remove the `kwargs.setdefault("weights_only", True)` call from `safe_load`, deferring to torch's built-in default (which is `True` for torch>=2.6).
- This allows users to override via the `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1` env var when they trust a checkpoint but hit `pickle.UnpicklingError`.
- Add a test that verifies the default fails on unsafe objects and that the env var bypass works.

## Test plan

- [x] `python -m pytest tests/unit/torch/utils/test_serialization.py -v`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * The serialization utility now respects PyTorch's default behavior and environment-variable configuration instead of forcibly enforcing parameter overrides, providing greater configuration flexibility.
* **Tests**
  * Added test coverage validating environment-variable override functionality and default behavior in the serialization utility.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
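The resulting resolution order can be sketched as a simplified stand-in (this mirrors the documented behavior of torch's `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD` escape hatch, not torch's actual internal code):

```python
import os

def resolve_weights_only(explicit=None):
    # Resolution order after this change: an explicit weights_only argument
    # wins; otherwise the TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 env var lets a
    # user who trusts the checkpoint opt out; otherwise torch's safe default
    # (True for torch>=2.6) applies.
    if explicit is not None:
        return explicit
    if os.environ.get("TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD") == "1":
        return False
    return True
```

Previously, `safe_load` always injected `weights_only=True` before the call reached torch, which made the env-var branch unreachable.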
…iliency-ext dependency (#1285)

- `megatron-core==0.17.0` released yesterday and requires a nightly version of `nvidia-resiliency-ext` for an import; the pre-installed version in the DLFW PyTorch container is `nvidia-resiliency-ext==0.5.0`.
- Temporarily pin `mcore<0.17.0` to unblock PRs from merging.
- Pin `pulp<4.0` as it has some breaking changes and a release is imminent.

The correct fix is to use the `nemo:26.04` container instead of the PyTorch container for megatron-based tests, since it always has the correct combination of all packages needed for the megatron ecosystem — done in #1286.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…batching (#1280)

### What does this PR do?

**Type of change:** New feature + bug fix

Two improvements to Megatron inference utilities:

**1. Pipeline Parallel (PP) correctness fixes**

PP inference was producing garbage output (MMLU ~0.24, random chance). Two root causes:

- `megatron_generate` / `megatron_prefill` used `get_forward_backward_func()` (the training pipeline scheduler), which is not designed for inference. Rewrote both functions to use explicit P2P communication via `recv_from_prev_pipeline_rank_` / `send_to_next_pipeline_rank`, matching the `run_mcore_inference` pattern.
- `import_mcore_gpt_from_hf` loads HF weights into stage 0's embedding but never updates the output_layer on the last PP stage when `share_embeddings_and_output_weights=True`. At model init, `setup_embeddings_and_output_layer()` all-reduces from stage 0 to sync the output layer; after importing HF weights, that all-reduce is stale. Fix: call `model.setup_embeddings_and_output_layer()` again after import.

**2. `megatron_mmlu` speedup (~6x)**

Replaces the `megatron_mmlu` implementation with a significantly faster approach that matches how `lm-evaluation-harness` scores multiple-choice questions.

**Before:** autoregressive generation (`megatron_generate`, `osl=2`) per example, 114 separate `load_dataset` calls, batch_size=1 — 260 s for 5% of the data.

**After:** a single prefill forward pass + argmax over {A,B,C,D} logits, 2 `load_dataset` calls, configurable batch_size — 18 s for 5% of the data (~6x faster).

### Changes

**PP fixes:**

- `megatron_generate` / `megatron_prefill`: replace `get_forward_backward_func` with explicit P2P (`recv_from_prev_pipeline_rank_` / `send_to_next_pipeline_rank`)
- `import_mcore_gpt_from_hf`: call `model.setup_embeddings_and_output_layer()` after HF weight import when PP>1 and `share_embeddings_and_output_weights=True`
- `megatron_prefill`: add a `skip_return_logits` param and VLM support (needed for PP non-last stages)

**MMLU speedup:**

- **Log-likelihood scoring**: replace `megatron_generate` with `megatron_prefill` — one forward pass per batch, no autoregressive decode loop
- **Global batching**: collect all examples across all subjects, sort by descending sequence length, run in `batch_size` chunks
- **2 dataset loads** instead of 114: use `load_dataset("cais/mmlu", "all")` with per-subject grouping; skip the dev load when `few_shots=0`
- **`percentage` → `fraction`** parameter rename for clarity
- **tqdm progress bar** (rank-0 only)

### Testing

- `test_megatron_generate_and_mmlu` parametrized over `tp` and `pp`. Accuracy assertion: `0.36 < score < 0.39`. Manually checked that generated text is coherent.
- Re-ran M-Bridge Minitron MMLU-based pruning for Nano v2 9B → 7B; all top-10 candidates' MMLU numbers are ballpark-similar to before.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ❌ — `percentage` parameter renamed to `fraction`; `enable_kv_cache` removed from `megatron_mmlu`
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ — existing test updated and parametrized for TP+PP
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved pipeline-parallel generation and MMLU evaluation reliability; fixed output-layer synchronization in shared-embedding + pipeline setups.
* **New Features**
  * MMLU scoring now uses batched prefill logit scoring for faster, batched evaluation.
* **Behavior Changes**
  * Default MMLU sampling increased from 5% to 10%; calibration batch sizing adjusted and related CLI/help text updated.
* **Tests**
  * Distributed tests cover tensor- and pipeline-parallel modes and tighten MMLU validation ranges.
* **Documentation**
  * Updated pruning example and benchmark timing to reflect new sampling and speedup.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
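The log-likelihood scoring scheme above can be sketched with a toy stand-in (illustrative only, not the actual `megatron_mmlu` code; logits are modeled as token-id→score dicts for clarity):

```python
def score_multiple_choice(final_logits, choice_token_ids, answer_indices):
    # One prefill forward pass per batch yields the logits at the last prompt
    # position for each example. The prediction is simply the argmax over the
    # {A, B, C, D} candidate token logits -- no autoregressive decode loop.
    correct = 0
    for logits, answer in zip(final_logits, answer_indices):
        scores = [logits[tok] for tok in choice_token_ids]
        correct += scores.index(max(scores)) == answer
    return correct / len(answer_indices)
```

This is why the decode loop (and its KV-cache management) disappears entirely: each example costs exactly one forward pass, and examples can be batched freely after sorting by sequence length.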
### What does this PR do?

Type of change: ?

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added backend-specific GPTQ helper registration to allow backend-tailored GPTQ behavior.
* **Bug Fixes**
  * Prevented KV-cache state from leaking across repeated per-layer forwards during calibration.
* **Tests**
  * Added GPU-focused tests validating GPTQ combined with vector quantization, including accuracy and end-to-end comparisons.

---

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
## Summary
Adds **performant layerwise calibration** for quantizing large models
(e.g. DeepSeek-R1 671B) that don't fit entirely on GPU. ([Example
commands](#example-commands))
1. **Performant calibration for large models** — Each decoder layer is
moved from CPU/disk to GPU (accelerate) or unsharded (FSDP2) **only
once** and kept on GPU for the entire calibration step. Previously,
every calibration batch triggered weight transfer for every layer —
O(num_batches) weight movements per layer. Now it is O(1) per layer.
This also means you can **increase batch size** since only one layer's
weights occupy GPU at a time — e.g. DeepSeek-R1 on a single node
(8×80GB) with `batch_size=16` and `gpu_max_mem_percentage=0.5`.
2. **Checkpoint save/resume** — Saves progress after each layer, so jobs
that exceed cluster time limits (e.g. 4-hour Slurm windows for 100+
layer MoE models) can resume from the last completed layer.
3. **Rename** `sequential_calibrate` → `layerwise_calibrate` for
clarity.
### Design details
The existing layerwise state machine (skip/run/capture) already
processes one layer at a time, but skip-mode layers still kept their
parameters in the ModuleList — so frameworks transferred all weights
every forward pass. This PR adds:
- **`_SkipLayer`**: replaces fully-calibrated layers with a
parameter-free dummy in the ModuleList, so framework hooks have nothing
to transfer
- **`persistent_materialization`**: keeps the active layer on GPU for
the entire calibration step, avoiding repeated offload/reload cycles
Checkpoint save is per-layer; restore is bulk — quantizer state and
weights for layers 0..K-1 are restored once at the end of calibration,
keeping the hot path fast.
### Example commands
**Qwen3-8B** (NVFP4+GPTQ, single GPU):
```bash
python hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3-8B \
--recipe nvfp4_gptq_sequential.yaml \
--calib_size 64 \
--batch_size 16 \
--dataset cnn_dailymail \
--export_path outputs/qwen3_8b_nvfp4_gptq_seq \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--vllm_fakequant_export
```
**DeepSeek-R1** (NVFP4 experts-only + FP8 KV, 8×80GB):
```bash
python hf_ptq.py \
--model unsloth/DeepSeek-R1-0528-BF16 \
--recipe ../../modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml \
--dataset cnn_dailymail \
--batch_size 16 \
--calib_size 64 \
--calib_seq 512 \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--trust_remote_code \
--export_path output/DeepSeek-R1-BF16-nvfp4-experts-only-fp8-kv \
--vllm_fakequant_export
```
### Example: NVFP4+GPTQ layerwise calibration on Qwen3-8B (36 layers,
single GPU — 20 GB peak)
**Initial run** (killed after layer 11):
```
Layerwise calibration: Found 36 transformer layers
Calibrating layer 1/36 | capture: [1]
Computing Hessians for 7 linear layers...
GPTQ time: 51.39s
Calibrating layer 2/36 | run: [1] | capture: [2]
Checkpoint: saved layer 0
GPTQ time: 50.06s
Calibrating layer 3/36 | skip: 1 | run: [2] | capture: [3]
Checkpoint: saved layer 1
...
Calibrating layer 12/36 | skip: 10 | run: [11] | capture: [12]
Checkpoint: saved layer 10
<killed>
```
**Resumed run** (picks up from layer 11, finishes all 36):
```
Layerwise calibration: Found 36 transformer layers
Checkpoint: resuming layerwise calibration from layer 11/36
Calibrating layer 12 (resumed)
GPTQ time: 51.45s
Calibrating layer 13/36 | skip: 11 | run: [12] | capture: [13]
Checkpoint: saved layer 11
...
Calibrating layer 36/36 | skip: 34 | run: [35] | capture: [36]
Checkpoint: saved layer 34
GPTQ time: 50.33s
Checkpoint: saved layer 35 (final)
Checkpoint: restored 11 previously calibrated layers
Layerwise calibration completed
Quantized model exported to: outputs/qwen3_8b_nvfp4_gptq_seq
GPU 0: Peak memory usage = 20.42 GB
```
## TODO
- [ ] Update CHANGELOG
## Test plan
- `tests/unit/torch/quantization/test_layerwise_calibrate.py` — unit
tests for skip/swap/restore
- `tests/unit/torch/quantization/test_sequential_checkpoint.py` —
checkpoint save/resume correctness
- `tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py` —
CPU-offloaded layerwise + GPTQ + checkpoint resume
- `tests/gpu/torch/quantization/test_fsdp2.py` — FSDP2 layerwise
calibration
### Verified
- [x] Qwen3-8B: layerwise calibration + checkpoint save/restore +
fakequantized checkpoint export + vLLM serve
- [x] DeepSeek-R1: checkpoint resume tested
- [x] DeepSeek-R1: fakequantized checkpoint export verified
---------
Signed-off-by: realAsma <akuriparambi@nvidia.com>
## Summary

- **hf_online_dflash.yaml**: Add a 100K-sample training config with regression baselines (B200 loss curve), `MAX_FINAL_LOSS`/`MIN_FINAL_ACC`/`MIN_ACCEPTANCE_LENGTH` thresholds, and the vLLM nightly container for DFlash support
- **vllm_smoke_test.sh**: Parse acceptance length from the vLLM server log for the regression check; `pip install pandas` workaround for the broken nightly container; capture server output to a temp file
- **query.sh**: Detect vLLM server death during startup (PID liveness check) plus a 600 s timeout to prevent infinite polling that wastes GPU hours; `pip install pandas` workaround
- Fix the empty `environment:` key in the DFlash YAML causing a nemo_run `ListParseError`

## Test plan

- [x] E2E pipeline passed on 8x B200 (training + vLLM smoke test + AR eval)
- [x] Training regression: final loss 3.82 < 5.0, acc 0.20 > 0.15
- [x] vLLM acceptance length: 1.79 >= 1.4 threshold
- [x] AR evaluation: 2.02 overall on MT-Bench (8 categories)
- [x] Server liveness check prevents GPU waste on vLLM crash

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Added optional regression validation for vLLM acceptance metrics
  * Introduced a configurable vLLM server startup timeout (default 600 seconds)
* **Improvements**
  * Enhanced logging for vLLM server startup with progress tracking and waited-time reporting
  * Faster detection of vLLM server process failures during initialization
* **Configuration Updates**
  * Increased training dataset size and logging granularity
  * Scaled tensor parallelism from 4 to 8 across multiple pipelines
  * Expanded PTQ quantization to a multi-step pipeline
  * Added configurable training metric thresholds

---

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What does this PR do?
Type of change: new feature, new example <!-- Use one of the following:
Bug fix, new feature, new example, new tests, documentation. -->
<!-- Details about the change. -->
## Summary
- Add skip-softmax sparse attention (BLASST) for diffusion models via
dedicated Triton kernels — an inference kernel with tile skipping and a
calibration kernel with vectorized multi-threshold sparsity measurement
- Add `triton_skip_softmax` method with exponential model calibration
(`scale_factor = a * exp(b * sparsity)`) and log-space fitting for
diffusion models
- Add Triton kernel backends for diffusers and LTX attention dispatch
- Fix calibration to skip RULER dataset generation when user provides
their own `forward_loop` (required for non-LLM models)
## Changes
### Triton kernels (`modelopt/torch/kernels/triton_fa.py`)
- **`_attn_fwd`**: Forward kernel with optional tile skipping — tiles
whose max attention score is far below the running softmax max are
skipped entirely (no V load, no softmax, no accumulation). Runtime
sparsity measurement via atomic counters.
- **`_attn_fwd_calibrate`**: Calibration kernel that computes full
attention while measuring how many tiles would be skipped at each of N
thresholds simultaneously. Uses per-program output buffers (zero atomic
contention) and vectorized multi-threshold comparison.
- **`attention()`** / **`attention_calibrate()`**: Python wrappers for
inference and calibration kernels.
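The tile-skip test inside the forward kernel can be sketched out-of-kernel as a scalar comparison (names are assumptions based on the description above; the real check operates on Triton tiles):

```python
def should_skip_tile(tile_max_qk, running_max, skip_threshold_log2):
    # With a negative log2-domain threshold, a tile is skipped when its best
    # attention score falls more than |threshold| below the running softmax
    # max: its softmax weights would be at most ~2**threshold of the largest
    # weight, i.e. negligible, so the V load, softmax, and accumulation for
    # that tile can all be elided.
    return tile_max_qk < running_max + skip_threshold_log2
```

The calibration kernel evaluates the same comparison against N thresholds at once on full (unskipped) attention, which is how per-threshold sparsity counts are gathered in a single pass.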
### Kernel backends
(`modelopt/torch/sparsity/attention_sparsity/kernels/`)
- **`diffusers_triton_attention.py`**: Registers `modelopt_triton`
backend in diffusers' attention dispatch. Handles [B, S, H, D] → varlen
layout conversion, calibration/inference mode switching, thread-local
configuration, and counter accumulation.
- **`ltx_triton_attention.py`**: Patches `ltx_core.Attention` modules
for Triton dispatch with the same calibration/inference modes.
### Method
(`modelopt/torch/sparsity/attention_sparsity/methods/triton_skip_softmax.py`)
- `TritonSkipSoftmaxMethod`: Context managers for calibration (→
calibration kernel) and inference (→ forward kernel with tile skipping).
Three threshold priority levels: raw threshold > calibrated scale_factor
> static threshold.
### Calibration
(`modelopt/torch/sparsity/attention_sparsity/calibration/`)
- **`calibrator.py`**: `DynamicThresholdCalibrator` with `fit_logspace`
option — fits exponential model in log space (minimizes relative error)
for diffusion models where scale_factors span many orders of magnitude.
Records observed sparsity range for extrapolation warnings.
- **`calibrate.py`**: Skips RULER dataset when `forward_loop` is
provided; passes `fit_logspace` through from config.
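The log-space exponential fit can be sketched with a closed-form least-squares regression (a self-contained illustration of the idea, not the `DynamicThresholdCalibrator` code):

```python
import math

def fit_exponential_logspace(sparsities, scale_factors):
    # Fit scale = a * exp(b * s) by ordinary least squares on log(scale):
    # log(scale) = log(a) + b * s. Minimizing error in log space minimizes
    # *relative* error, which matters when scale_factors span many orders of
    # magnitude, as they do for diffusion models.
    n = len(sparsities)
    ys = [math.log(v) for v in scale_factors]
    sx, sy = sum(sparsities), sum(ys)
    sxx = sum(x * x for x in sparsities)
    sxy = sum(x * y for x, y in zip(sparsities, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = math.exp((sy - b * sx) / n)
    return a, b
```

A linear-space fit of the same data would be dominated by the largest scale_factors and effectively ignore the small ones, which is why `fit_logspace=True` is recommended for diffusion models.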
### Config & conversion
- **`config.py`**: `CalibrationConfig.fit_logspace` field (default
False, recommended True for diffusion models).
`skip_softmax_raw_threshold` field for direct threshold mode.
- **`conversion.py`**: Auto-registers diffusers/LTX Triton backends on
`sparsify()`. Updated summary display.
### Example
- **`wan22_skip_softmax.py`**: End-to-end example for WAN 2.2 5B/14B
with baseline, raw-threshold, and calibrated modes. Supports runtime
sparsity reporting.
## Threshold modes
| Mode | How it works | Use case |
|------|-------------|----------|
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly to the kernel as `skip_threshold_log2` | Quick testing, sweeps |
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then `threshold = scale_factor / seq_k` at runtime | Production use with seqlen adaptation |
| **Static** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback |
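The three rows of the table can be condensed into one dispatcher sketch (an illustration of the priority order described earlier: raw threshold > calibrated scale_factor > static threshold; names are assumptions):

```python
import math

def runtime_threshold(mode, seq_k, *, raw=None, a=None, b=None, target=None,
                      static_lambda=0.1, sm_scale=1.0):
    # Hypothetical dispatcher over the three threshold modes in the table.
    if mode == "raw":
        return raw                               # passed straight to the kernel
    if mode == "calibrated":
        scale_factor = a * math.exp(b * target)  # from the calibration fit
        return scale_factor / seq_k              # adapts to the KV sequence length
    return math.log2(static_lambda) * sm_scale   # static fallback
```

Only the calibrated mode depends on `seq_k`, which is what gives it sequence-length adaptation at inference time.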
## Usage
```bash
# Fixed raw threshold (no calibration)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--raw-threshold -0.7 \
--prompt "A cat playing piano" --output out.mp4
# With calibration (log-space fit for diffusion models)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--calibrate --target-sparsity 0.5 \
--prompt "A cat playing piano" --output out.mp4
# Dense baseline for comparison
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--baseline \
--prompt "A cat playing piano" --output baseline.mp4
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
❌ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added skip-softmax sparse attention support for Diffusers models,
enabling efficient video generation
* Added support for both eager and Triton attention backends for sparse
attention
* Added new example script for Wan 2.2 text-to-video generation with
sparse attention optimization
* **Documentation**
* Updated documentation with sparse attention configuration guide and
usage examples
* **Tests**
* Added comprehensive unit tests for kernel backend registration and
skip-softmax functionality
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?

**Type of change:** Bug fix

Add the newly added quant configs to the example PTQ script.

### Testing

I have locally run auto_quantize with these two quant_configs and obtained successfully exported HF artifacts.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added support for three new quantization formats: nvfp4_mse, nvfp4_local_hessian, and nvfp4_experts_only, expanding available export options when using auto-quantize.
* **Bug Fixes / UX**
  * Updated the invalid-quantization error message to include the newly accepted format identifiers.

Signed-off-by: Bilal Kartal <bkartal@nvidia.com>
Signed-off-by: bkartal-dev <bkartal@nvidia.com>
## Summary - Add end-to-end ResNet50 support in the torch_onnx quantization → ONNX export → TRT engine pipeline - Fix multiple Conv2d-related export issues that blocked Conv2d-heavy models from working with FP8/INT8/MXFP8/NVFP4/auto quantization modes - Fix `configure_linear_module_onnx_quantizers` to handle all modules with block quantization (not just `nn.Linear`), fixing NVFP4/MXFP8 export for models with quantized non-Linear modules - Add `--trt_build` flag to `torch_quant_to_onnx.py` and simplify test infrastructure ### Files Changed - `modelopt/torch/_deploy/utils/torch_onnx.py` — Disable FP8 Conv2d weight quantizers and autocast during ONNX export - `modelopt/torch/quantization/export_onnx.py` — Fix `configure_linear_module_onnx_quantizers` for all module types with block quantization - `examples/torch_onnx/torch_quant_to_onnx.py` — Add `--trt_build` flag, calibration for FP8 override quantizers, Conv2d→FP8 override for auto mode, filter_func updates - `examples/torch_onnx/README.md` — Add ResNet50 to supported models table - `tests/examples/torch_onnx/test_torch_quant_to_onnx.py` — Add ResNet50 test entry, simplify using `--trt_build` - `tests/_test_utils/torch/vision_models.py` — Add ResNet50 to timm model registry ### Quantization modes passing - ✅ FP8, INT8, MXFP8, NVFP4, Auto (all 5 modes pass export + TRT build) - INT4_AWQ excluded (pre-existing limitation for all models) ## Test plan - [x] All 5 resnet50 test modes pass: `pytest tests/examples/torch_onnx/test_torch_quant_to_onnx.py -k resnet50` (5/5 passed) - [x] Full regression: 18 passed, 2 failed (pre-existing swinv2_tiny fp8/int8 failures) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added ResNet50 to supported ONNX export vision models with FP8, INT8, MXFP8, and NVFP4 support. * Optional TensorRT engine build after export via a new CLI flag. 
* **Improvements** * Enhanced quantization calibration and export flows for FP8/INT8 models, including broader block-quantization support across module types and safer export handling. * Tests updated to include ResNet50 in the model matrix. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com> Signed-off-by: ajrasane <arasane@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…er/experimental/DMS) (#879) ## What does this PR do? **Type of change:** ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> **Overview:** ? ## Usage <!-- You can potentially add a usage example below. --> ```python # Add a code snippet demonstrating how to use this ``` ## Testing <!-- Mention how have you tested your change if applicable. --> ## Before your PR is "*Ready for review*" <!-- If you haven't finished some of the above items you can still open `Draft` PR. --> - **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed. - **Is this change backward compatible?**: Yes/No <!--- If No, explain why. --> - **Did you write any new necessary tests?**: Yes/No - **Did you add or update any necessary documentation?**: Yes/No - **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No <!--- Only for new features, API changes, critical bug fixes or bw breaking changes. --> ## Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated DMS installation instructions to reflect the repository structure and correct directory navigation during setup. * Clarified the setup steps so users follow the accurate directory change before running installation commands. * Small wording improvements to reduce confusion during the installation process. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Farid Adilazuarda <42537562+faridlazuarda@users.noreply.github.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Replace mip package with more popular pulp package for puzzle mip solving. Both use the CBC solver under the hood ## Testing - Results very close for Qwen3-8B and Nemotron-Nano-12B-v2 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Simplified GPU test environment setup by removing unnecessary system dependency installation * Updated internal optimization solver dependencies in the puzzletron module <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…plify CI workflows (#1286) ### What does this PR do? Type of change: New feature / infrastructure improvement Follow-up to #1285 for correct CI test environment for megatron based tests Replaces `tox` + `tox-current-env` with `nox` for all test, lint, docs, and wheel build sessions. The primary motivation was that `tox-current-env` is incompatible with uv venvs in NGC containers (e.g. NeMo's `/opt/venv`) — it picks the system Python via `sys._base_executable` instead of the container's venv Python which has megatron packages pre-installed. Key changes: - **`noxfile.py`** replaces `tox.ini` with GPU, CPU unit, partial-install, pre-commit, docs, and wheel sessions - **GPU sessions** use `venv_backend="none"` (run directly in container env) and `python -m pip/pytest` to avoid PATH mismatches - **uv** is set as the default venv backend (if available) for CPU sessions (faster installs) Also includes CI workflow simplifications: - **`_pr_gate.yml`** new reusable workflow centralizing file-change detection + linux-check wait logic (was duplicated across 3 workflow files) - **Collapsed pr/non-pr job pairs** into single jobs with conditional `runs-on` in `gpu_tests.yml`, `example_tests.yml`, `regression_tests.yml` - **Collapsed `multi-py` / `multi-torch` / `multi-transformers`** into a single `multi-version` matrix job in `unit_tests.yml` - **PR path filtering** for unit test secondary jobs (multi-version, launcher, partial-install) — skipped if no relevant files changed - **Fixed schedule/workflow_dispatch skipping** — jobs with `needs: [pr-gate]` were incorrectly skipped when all pr-gate internal jobs were skipped; fixed by making the gate job always run - **multi-version, launcher, partial-install** now also run on `schedule` / `workflow_dispatch` ### Usage ```bash python -m pip install nox uv # install nox and uv (once) nox -l # list all sessions nox -s gpu_megatron # run a GPU session (inside container) nox -s "unit-3.12(torch_211, tf_latest)" # run a specific unit 
test combination nox -s "unit-3.12(torch_211, tf_latest)" -R # force-recreate venv (e.g. after dep changes) COVERAGE_PROCESS_START=pyproject.toml nox -s "unit-3.12(torch_211, tf_latest)" # with coverage ``` ### Testing - Ran `nox -l` to verify all session names - Ran `gpu_megatron` session locally inside NeMo container — confirmed it uses `/opt/venv/bin/python` correctly - Manually triggered nightly-runs: - Unit: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608013657 - GPU: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608018763 - Examples: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608017322 ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: N/A — CI infrastructure only - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ (added `nox` and `uv` to `dev-test`, both Apache-2.0) - Did you write any new necessary tests?: N/A - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A — no user-facing changes ### Additional Information Supersedes the tox-current-env workaround in the parent branch. --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary Automated weekly update of uv.lock file for nSpect Scanning: - `uv.lock` — upgraded all transitive dependencies to latest compatible versions Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What does this PR do? Type of change: ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills. Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time. Solution: A centralized monitor skill with: - Job registry (.claude/active_jobs.json): single source of truth for all active jobs - Durable recurring cron: polls every 15 min, survives session restarts, self-cleans when all jobs complete - User-initiated mode: works in new conversations by reading the registry - Aggregated reporting: "2 of 4 completed" instead of per-job noise ### Usage After any skill submits a job, the monitor skill automatically: 1. Registers the job in .claude/active_jobs.json 2. Sets up a durable cron to poll status every 15 minutes User can also trigger manually: User: "check my eval status" → reads registry, reports current state User: "is the PTQ done?" → finds job, checks status User: "what jobs are running?" → lists all registered jobs ### Testing <!-- Mention how have you tested your change if applicable. --> ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. 
--> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added monitor skill for tracking SLURM jobs, NEL evaluations, and launcher experiments with persistent job registry. * **Documentation** * Updated deployment, evaluation, and PTQ documentation to use the new monitor skill. * Simplified diagnostic and troubleshooting instructions. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Kai Xu <kaix@nvidia.com>
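The aggregated reporting described above ("2 of 4 completed") can be sketched against a hypothetical registry shape; the field names below are illustrative, not the actual `.claude/active_jobs.json` schema:

```python
def summarize_jobs(jobs: list[dict]) -> str:
    """One-line rollup of the job registry instead of per-job noise."""
    done = sum(1 for job in jobs if job.get("status") == "completed")
    return f"{done} of {len(jobs)} completed"


jobs = [
    {"status": "completed"},
    {"status": "running"},
    {"status": "completed"},
    {"status": "pending"},
]
print(summarize_jobs(jobs))  # "2 of 4 completed"
```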
### What does this PR do? Type of change: new feature <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> - Add Conv3D implicit GEMM kernel with BF16 WMMA tensor cores and fused NVFP4 activation quantization for video diffusion VAE layers - Integrate into _QuantConv3d via QuantModuleRegistry — automatically dispatched when NVFP4 quantization is applied to nn.Conv3d - Move kernel from `experimental/conv/ to modelopt/torch/kernels/conv/`; move tests to `tests/gpu/torch/quantization/kernels/` ### Testing <!-- Mention how have you tested your change if applicable. --> - Added test cases to measure the difference between cuDNN and our CUDA implicit GEMM kernel - Added an NVFP4 fake quantization test using CUDA code ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!--- Mandatory --> - Did you write any new necessary tests?: ✅ <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. 
--> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Per-backbone quantization/export in a single run with per-backbone checkpoints and backbone-aware quant filters * Configurable NVFP4 block-size via CLI/config; improved NVFP4 Conv3D inference path and Wan 2.2 quantization support * **Bug Fixes** * Video-model calibration now respects extra params and forces video decoding during calibration * **Documentation** * Added comprehensive Conv3D implicit‑GEMM kernel documentation; removed experimental Conv3D prototype docs/benchmark * **Tests** * New Wan 2.2 quantization/export tests and expanded Conv3D/FP4 kernel test coverage <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?
Type of change: Bug fix
Enables end-to-end AWQ checkpoint export and reload in the vLLM
fake-quant serving path (`MODELOPT_STATE_PATH`). Previously, the
`input_quantizer` used an incorrect `pre_quant_scale`: for grouped
quantizers such as `qkv_proj`, only the first
`input_quantizer.pre_quant_scale` was taken. This PR adds
`_resmooth_experts_for_export`, which non-mutatively averages
`pre_quant_scale` across MoE experts and unifies the input `_amax`,
required because vLLM uses a single input quantizer per expert group.
It also adds `merge_amax_tensors_for_group` (element-wise max for
same-shape tensors, `cat` for GQA, scalar-max fallback), replacing the
scalar-collapsing `torch.stack().max()` that dropped per-channel
`_amax` structure.
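A minimal, dependency-free sketch of the two export-time transforms described above. Plain Python lists stand in for 1-D tensors; the real helpers operate on `torch.Tensor`s inside the exporter, and these function names mirror but do not reproduce the actual implementation:

```python
def resmooth_experts_for_export(pre_quant_scales):
    """Non-mutating average of per-expert pre_quant_scale vectors."""
    num_experts = len(pre_quant_scales)
    return [sum(vals) / num_experts for vals in zip(*pre_quant_scales)]


def merge_amax_for_group(amaxes):
    """Merge per-quantizer _amax values for a fused quantizer group.

    - element-wise max when all entries have the same shape
    - concatenation when shapes differ (GQA-style q/k/v)
    - scalar max fallback for per-tensor amax
    """
    if all(not isinstance(a, list) for a in amaxes):
        return max(amaxes)  # per-tensor scalars
    if len({len(a) for a in amaxes}) == 1:
        return [max(vals) for vals in zip(*amaxes)]  # same shape
    merged = []
    for a in amaxes:  # differing shapes: concatenate (GQA case)
        merged.extend(a)
    return merged
```

Note how the per-channel structure is preserved in the same-shape branch, which is exactly what the old `torch.stack().max()` collapsed to a scalar.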
### Usage
```python
# Export AWQ checkpoint from HF model
from modelopt.torch.export.plugins.vllm_fakequant_hf import export_hf_vllm_fq_checkpoint
export_hf_vllm_fq_checkpoint(model, export_dir="./awq_vllm_checkpoint")
```
### Testing
**Step 1 — Export the quantized checkpoint:**
```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--recipe <AWQ_RECIPE> \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export
```
This produces `<EXPORT_DIR>/vllm_fq_modelopt_state.pth`, which now includes the
averaged per-expert `pre_quant_scale` and the unified `_amax`.
**Step 2 — Serve via the vLLM fakequant worker:**
```bash
MODELOPT_STATE_PATH=<EXPORT_DIR>/vllm_fq_modelopt_state.pth \
python examples/vllm_serve/vllm_serve_fakequant.py \
<EXPORT_DIR> --tensor-parallel-size <TP>
```
Tested with the following quantization configurations:
```
FP8_DEFAULT_CFG
FP8_DEFAULT_CFG (input_q disabled)
INT8_SMOOTHQUANT_CFG
INT8_WEIGHT_ONLY_CFG
NVFP4_DEFAULT_CFG
NVFP4_AWQ_LITE_CFG
INT4_AWQ_CFG
NVFP4_AWQ_CFG
NVFP4_DEFAULT_CFG (input_q disabled)
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added Nemotron-style MoE export support and group-aware AWQ resmoothing with optional requantization during export.
* Improved handling for shared-input / expert groups and tensor-parallel sharding of pre-quantization scales.
* **Bug Fixes**
* Removed AWQ reload limitation from known issues; improved checkpoint validation and safer save/load behavior.
* Better detection and handling of enabled weight-quantizers and clearer warnings for mismatched checkpoint keys.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
### What does this PR do? Type of change: ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> <!-- Details about the change. --> ### Usage ```python # Add a code snippet demonstrating how to use this ``` ### Testing <!-- Mention how have you tested your change if applicable. --> ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a public backend-specific calibrator registration API to support FP8 scale-sweep calibration, allowing backends to supply custom calibrators used during FP8 tuning. * **Tests** * Added unit tests confirming registry insertion/overwrite, that registered calibrators are invoked when FP8 scale-sweep is enabled, are not invoked when disabled, and that calibration falls back to defaults when no backend is registered. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary
- `datasets`' `resolve_pattern` only matches entries with
`type=="file"`, so passing a bare directory path as `data_files` to
`load_dataset` results in `FileNotFoundError` even when the directory
exists on disk
- Detect directory paths in `ShardedDataset._load_dataset()` and pass
them via `data_dir` instead of `data_files`
## Reproduction
```python
from datasets import load_dataset
# This fails with FileNotFoundError:
load_dataset("json", data_files="/path/to/data_directory")
# This works:
load_dataset("json", data_dir="/path/to/data_directory")
```
## Test plan
- [ ] Verify existing EAGLE3/DFlash training pipelines that pass
directory paths work
- [ ] Verify file path and glob patterns still work (falls through to
`data_files`)
- [ ] Verify `data_files=None` (no data_files arg) still works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed an issue with dataset loading that prevented proper handling of
directory-based data sources. Directories are now correctly detected and
processed during dataset initialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
#1293) ## Summary - **megatron_lm_ptq.yaml**: Qwen3-8B PTQ to single GPU for L40 clusters (TP=1, all tasks) - **quantize.sh**: Auto-find largest PP dividing model's `num_hidden_layers` for export step. Qwen3-8B has 36 layers which isn't divisible by 8, causing `AssertionError` on 8-GPU nodes - **compute_hidden_states_trtllm.py**: Use `messages` with `conversations` fallback, matching the HF version. Fixes `KeyError: 'conversations'` when data uses OpenAI `messages` format ## Test plan - [x] Qwen3-8B PTQ runs on single L40 GPU - [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU) - [x] EAGLE3 offline pipeline reads data with `messages` field 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Dataset input handling now supports multiple field formats for enhanced compatibility. * **Bug Fixes** * Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation. * Updated quantization configuration for more efficient resource utilization. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chenhan Yu <chenhany@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
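The PP auto-selection described in the summary above amounts to a divisor search. An illustrative Python sketch (`quantize.sh` does the equivalent in shell):

```python
def largest_valid_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest pipeline-parallel size <= num_gpus that evenly divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # pp=1 always divides, so the loop returns before reaching here
```

For Qwen3-8B's 36 layers this yields PP=6 on 8 GPUs, PP=4 on 4 GPUs, and PP=1 on 1 GPU, matching the test plan.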
## Summary - When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs across 2 nodes), `ParallelismConfig` raises `total_size (4) does not match num_processes (8)` because `dp_replicate_size` defaults to 1 - Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel replication works without manual config - This enables `dp_shard_size` to be set to per-node GPU count (better NVLink utilization) while automatically creating replicas across nodes ## Test plan - [ ] Verify single-node training (dp_shard_size == world_size, dp_replicate_size == 1) unchanged - [ ] Verify multi-node with dp_shard_size < world_size creates correct replica groups - [ ] Verify existing EAGLE3/DFlash configs still work 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Enhanced parallelism configuration initialization in the speculative decoding example to better handle distributed training scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
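The auto-computation above is one integer division plus a divisibility check. A sketch under the assumption that the real change lives in the speculative-decoding example's `ParallelismConfig` setup:

```python
def auto_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    """Replicas across nodes = world_size / (dp_shard_size * cp_size)."""
    shard_total = dp_shard_size * cp_size
    if world_size % shard_total != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_shard_size * cp_size ({shard_total})"
        )
    return world_size // shard_total
```

E.g. `dp_shard_size=4` on 8 GPUs across 2 nodes gives 2 replica groups, so intra-node FSDP2 sharding combines with inter-node data-parallel replication without manual config.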
### What does this PR do? Add gptq fused kernel to improve speed. ### Usage check unittest ### Testing added a unittest ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Fused GPTQ backend for faster blockwise weight updates, toggleable via a new "fused" option. * Shared NVFP4 quantization primitives exposed for reuse. * **Refactor** * Consolidated FP4 scale/quantization logic into reusable utilities and centralized Hessian inversion handling. * **Tests** * Expanded GPU tests comparing fused vs unfused GPTQ, added Triton-availability gating and a local benchmark entrypoint. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
### What does this PR do? Type of change: Bug fix Fixes gh-pages branch bloat that grew from ~26 MB to ~441 MB in four weeks (nvbug 6099503). Three compounding causes were identified and addressed: 1. **Sphinx `.doctrees/` cache published to gh-pages** — `sphinx-build` was writing its build cache inside `build/html/` which was then uploaded verbatim. Accounts for ~3.3 GB uncompressed across history. 2. **`JamesIves/github-pages-deploy-action` appending a commit on every push** — main-site files accumulated forever with `single-commit: false` (default). 3. **PR preview deploying on every `synchronize` event for all PRs** — `rossjrw/pr-preview-action` re-deployed the full site for every push to any PR regardless of whether docs changed (e.g. PR #1128 triggered 64 preview deploys × ~11 MB each). Changes: - Pass `-d /tmp/doctrees` to `sphinx-build` so `.doctrees/` is never written into `build/html/` - Add `paths: [docs/**, modelopt/**]` filter to `pull_request` trigger so the docs workflow only runs on PRs that touch docs or source code - Set `single-commit: true` on the deploy action so main-site pushes squash into one commit - Deduplicate docs build: `deploy-preview` now downloads the artifact from `build-docs` instead of running a second `sphinx-build` - Set `retention-days: 1` on the artifact since it is only needed for the duration of the workflow run The one-time cleanup (force-push squashed orphan to gh-pages) was already applied separately — repo is now ~59 MB for a full clone vs ~441 MB before. ### Usage N/A — CI/workflow change only. ### Testing - Workflow logic reviewed manually. - The one-time cleanup was verified: `git rev-list --objects --disk-usage origin/gh-pages` now reports ~28 MB; full clone is ~59 MB. 
### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A ### Additional Information nvbug 6099503 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Optimized documentation build and deployment workflow in CI/CD pipeline. * Improved pull request documentation preview handling with faster build timeouts and refined artifact management. * Enhanced GitHub Pages deployment configuration for better consistency. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Use latest containers for testing in CICD <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Bumped TensorRT-LLM Docker images to 1.3.0rc12 in example and GPU test workflows. * Updated PyTorch container image from 26.01 to 26.03 for GPU tests. * Captured uv lock upgrade output to a temp file, inlined it into PR bodies, and adjusted workflow heredoc/templating and step behavior. * **Documentation** * Clarified an inline comment and simplified a warning message for an ONNX quantization extension. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough: This pull request introduces major enhancements to quantization and sparsity workflows, migrates CI/CD from tox to nox, and adds comprehensive example documentation. Key additions include layerwise calibration support (replacing sequential calibration), skip-softmax sparse attention for video models, fused GPTQ Triton kernels, and NVFP4 Conv3D implicit GEMM optimizations. The change involves 100+ files across quantization, sparsity, testing, CI, and example code.