# Gkarch/update 1314 (#1316)

Closed — grzegorz-k-karch wants to merge 31 commits into `feature/vllm_deployment_docs`.
### What does this PR do?

**Type of change:** Bug fix

During Megatron→vLLM fakequant export (`export_mcore_gpt_to_hf_vllm_fq`), the `weight_quantizer` is now applied as fake quantization (quantize + dequantize) directly to the exported weight tensor, and its amax is no longer saved to `quantizer_state.pth`. On reload, if `weight_quantizer` keys are absent from the checkpoint (because they were folded at export time), the corresponding quantizer modules are disabled.

This is especially useful when amax values are not synced across experts for `weight_quantizer`: it allows the `weight_quantizer` to keep them different for better accuracy.

### Usage

```python
# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)
```

### Testing

Step 1 — Quantize (run from Megatron-LM `examples/post_training/modelopt`):

```bash
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
  bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```

Step 2 — Export for vLLM fakequant:

```bash
MLM_EXTRA_ARGS=--export-vllm-fq \
  HF_MODEL_CKPT=<path/to/hf/weights> \
  MLM_MODEL_CKPT=<quant-ckpt-name> \
  EXPORT_DIR=<export-dir> \
  bash export.sh <hf-model-id>
```

Step 3 — Serve (run from `examples/vllm_serve`):

```bash
QUANT_CFG=NVFP4_DEFAULT_CFG \
  QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
  python3 vllm_serve_fakequant.py <export-dir> \
  -tp 1 --served-model-name <model-name> \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code --enforce-eager \
  --disable-custom-all-reduce \
  --gpu-memory-utilization 0.8
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

## Summary by CodeRabbit

* **Bug Fixes**
  * Better handling when loading checkpoints: missing weight-quantizer entries are validated and the corresponding modules are disabled to avoid load failures.
* **Improvements**
  * Export now folds enabled weight quantizers into the exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.

---

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
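The export-time fold and the reload-time disable described above can be sketched in plain Python (an illustrative sketch only; `fake_quantize` and `disable_missing_weight_quantizers` are hypothetical names, not the ModelOpt API):

```python
def fake_quantize(weights, amax, num_bits=8):
    # Symmetric fake quantization: snap each weight to an integer grid derived
    # from amax, then immediately dequantize back to floats
    # ("quantize + dequantize"). At export time this result is written directly
    # into the exported weight tensor, so amax need not be saved separately.
    scale = amax / (2 ** (num_bits - 1) - 1)
    return [round(w / scale) * scale for w in weights]


def disable_missing_weight_quantizers(quantizers, checkpoint_state):
    # On reload: weight quantizers whose state was folded at export time have
    # no entry in quantizer_state.pth, so they are disabled instead of failing.
    for name, q in quantizers.items():
        if name.endswith("weight_quantizer") and name not in checkpoint_state:
            q["enabled"] = False
```

Because each expert's weight tensor is folded with its own amax, per-expert amax values survive export even though nothing is written to `quantizer_state.pth`.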
## Summary

Automated weekly update of the `uv.lock` file for nSpect scanning:

- `uv.lock` — upgraded all transitive dependencies to the latest compatible versions

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What does this PR do?

Type of change: ?

PTQ: model-specific dependency support

- Add `EXTRA_PIP_DEPS` support to the launcher's `ptq.sh` so models requiring extra pip packages (e.g., `mamba-ssm` for hybrid Mamba architectures like Nemotron) can install them automatically before running PTQ. Also updates the PTQ skill with a new Step 2.5 for detecting model-specific dependencies.

Container registry auth checks

- Add a new section 6 covering auth detection for enroot/pyxis, Docker, and Singularity/Apptainer, including credential locations, how to add them, and common failure modes.
- Add Step 7.5 with the NEL default image table, a DockerHub-first strategy with NGC fallback, and a build-config CLI note.
- Add an auth check before remote SLURM deployment.

### Usage

Set `EXTRA_PIP_DEPS` in the launcher YAML's `environment` section:

```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```

### Testing

Tested end-to-end: NVFP4 quantization of `NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` on a B200 cluster via the launcher. The job succeeded: `mamba-ssm` installed automatically, calibration completed (512 samples, 84 s), and the checkpoint was exported (18 GB, 2 safetensor shards).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Documentation**
  * Added a container registry authentication verification workflow for SLURM deployments, including credential checks, verification commands, common failure symptoms, and remediation guidance.
  * Required credential validation before SLURM job submission and added SLURM-only verification steps with image fallback recommendations.
  * New dependency-checking step for models that use remote/trust_remote_code, plus guidance for resolving extra package requirements and tightened build-config guidance.
  * Updated PTQ launcher documentation to reference the new wrapper script.
* **New Features**
  * Support for specifying extra pip dependencies during model processing via an environment variable.

---

Signed-off-by: Kai Xu <kaix@nvidia.com>
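The hook's behavior can be illustrated with a minimal shell sketch (the function name is hypothetical; the real logic lives inside the launcher's `ptq.sh`, and the `echo` stands in for the actual `pip install`):

```shell
#!/bin/sh
# Sketch of the EXTRA_PIP_DEPS hook: if the variable is non-empty, install the
# space-separated package list before PTQ starts; otherwise do nothing.
install_extra_deps() {
    [ -n "$1" ] || return 0
    # The real script would run: pip install $1
    echo "pip install $1"
}

install_extra_deps "${EXTRA_PIP_DEPS:-}"
```

With `EXTRA_PIP_DEPS="mamba-ssm causal-conv1d"` set in the launcher YAML, the packages are installed once at job startup, before calibration begins.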
### What does this PR do?

**Type of change:** Bug fix

Fixes TRT-LLM DeepEP kernel failures during LLM deployment on unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert parallelism (`ep`) to 1 instead of auto-setting it to the GPU count for MoE models.

Previously, when the model config contained expert-related keys, `ep` was automatically set to `torch.cuda.device_count()`, which triggered DeepEP kernel failures on GPUs that don't support it. Now `ep` defaults to 1 while still enabling attention data parallelism for MoE models. Expert parallelism can be enabled explicitly by the caller when the environment is known to support it.

### Testing

- [x] Verified that the `llm_ptq` test passes with this fix on Blackwell GPUs.
- [x] 2-GPU CI test triggered: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24495054531/job/71588037727

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
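The revised default can be sketched as follows (an illustrative stand-in; the function and key names are assumptions, not the actual deployment API):

```python
def resolve_moe_parallelism(model_config, gpu_count):
    # Sketch of the fixed behavior: detecting expert-related keys no longer
    # auto-sets ep to the GPU count (which invoked DeepEP kernels on GPUs that
    # don't support them). ep defaults to 1, and the caller may raise it
    # explicitly on hardware known to support DeepEP. Attention data
    # parallelism stays enabled for MoE models.
    is_moe = any("expert" in key for key in model_config)
    return {
        "tp": gpu_count,
        "ep": 1,  # previously: gpu_count when is_moe
        "enable_attention_dp": is_moe,
    }
```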
…equirements (#1275)

### What does this PR do?

**Type of change:** Bug fix

Removed version pins for torch and transformers.

### Testing

Tested quantization with a couple of models; working as expected.

## Summary by CodeRabbit

* **Chores**
  * Relaxed dependency specs: removed the strict pin for torch to allow the latest compatible installs, and constrained transformers to <5.0.0 for broader compatibility and easier updates.

---

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Don't allow more than a 1% overall project coverage drop per PR; 2% was too much for such a large codebase.

## Summary by CodeRabbit

* **Chores**
  * Updated code coverage enforcement thresholds for pull requests to maintain stricter quality standards.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
As title.

## Summary by CodeRabbit

* **Chores**
  * Updated the release date for version 0.43 in the changelog.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…to allow user to bypass if needed (#1279)

## Summary

- Remove the `kwargs.setdefault("weights_only", True)` call from `safe_load`, deferring to torch's built-in default (which is `True` for torch>=2.6).
- This allows users to override via the `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1` env var when they trust a checkpoint but hit `pickle.UnpicklingError`.
- Add a test that verifies the default fails on unsafe objects and that the env var bypass works.

## Test plan

- [x] `python -m pytest tests/unit/torch/utils/test_serialization.py -v`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * The serialization utility now respects PyTorch's default behavior and environment-variable configuration instead of forcibly enforcing parameter overrides, providing greater configuration flexibility.
* **Tests**
  * Added test coverage validating environment-variable override functionality and default behavior in the serialization utility.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
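The resulting resolution order can be sketched as a simplified stand-in (this mirrors the documented behavior of torch's `TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD` escape hatch, not torch's actual internal code):

```python
import os

def resolve_weights_only(explicit=None):
    # Resolution order after this change: an explicit weights_only argument
    # wins; otherwise the TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1 env var lets a
    # user who trusts the checkpoint opt out; otherwise torch's safe default
    # (True for torch>=2.6) applies.
    if explicit is not None:
        return explicit
    if os.environ.get("TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD") == "1":
        return False
    return True
```

Previously, `safe_load` always injected `weights_only=True` before the call reached torch, which made the env-var branch unreachable.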
…iliency-ext dependency (#1285)

- `megatron-core==0.17.0` released yesterday and requires a nightly version of `nvidia-resiliency-ext` for an import; the pre-installed version in the DLFW PyTorch container is `nvidia-resiliency-ext==0.5.0`.
- Temporarily pin `mcore<0.17.0` to unblock PRs from merging.
- Pin `pulp<4.0` as it has some breaking changes and a release is imminent.

The correct fix is to use the `nemo:26.04` container instead of the PyTorch container for megatron-based tests, since it always has the correct combination of all packages needed for the megatron ecosystem — done in #1286.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…batching (#1280)

### What does this PR do?

**Type of change:** New feature + bug fix

Two improvements to Megatron inference utilities:

**1. Pipeline Parallel (PP) correctness fixes**

PP inference was producing garbage output (MMLU ~0.24, random chance). Two root causes:

- `megatron_generate` / `megatron_prefill` used `get_forward_backward_func()` (the training pipeline scheduler), which is not designed for inference. Rewrote both functions to use explicit P2P communication via `recv_from_prev_pipeline_rank_` / `send_to_next_pipeline_rank`, matching the `run_mcore_inference` pattern.
- `import_mcore_gpt_from_hf` loads HF weights into stage 0's embedding but never updates the output_layer on the last PP stage when `share_embeddings_and_output_weights=True`. At model init, `setup_embeddings_and_output_layer()` all-reduces from stage 0 to sync the output layer; after importing HF weights, that all-reduce is stale. Fix: call `model.setup_embeddings_and_output_layer()` again after import.

**2. `megatron_mmlu` speedup (~6x)**

Replaces the `megatron_mmlu` implementation with a significantly faster approach that matches how `lm-evaluation-harness` scores multiple-choice questions.

**Before:** autoregressive generation (`megatron_generate`, `osl=2`) per example, 114 separate `load_dataset` calls, batch_size=1 — 260 s for 5% of the data.

**After:** a single prefill forward pass + argmax over {A,B,C,D} logits, 2 `load_dataset` calls, configurable batch_size — 18 s for 5% of the data (~6x faster).

### Changes

**PP fixes:**

- `megatron_generate` / `megatron_prefill`: replace `get_forward_backward_func` with explicit P2P (`recv_from_prev_pipeline_rank_` / `send_to_next_pipeline_rank`)
- `import_mcore_gpt_from_hf`: call `model.setup_embeddings_and_output_layer()` after HF weight import when PP>1 and `share_embeddings_and_output_weights=True`
- `megatron_prefill`: add a `skip_return_logits` param and VLM support (needed for PP non-last stages)

**MMLU speedup:**

- **Log-likelihood scoring**: replace `megatron_generate` with `megatron_prefill` — one forward pass per batch, no autoregressive decode loop
- **Global batching**: collect all examples across all subjects, sort by descending sequence length, run in `batch_size` chunks
- **2 dataset loads** instead of 114: use `load_dataset("cais/mmlu", "all")` with per-subject grouping; skip the dev load when `few_shots=0`
- **`percentage` → `fraction`** parameter rename for clarity
- **tqdm progress bar** (rank-0 only)

### Testing

- `test_megatron_generate_and_mmlu` parametrized over `tp` and `pp`. Accuracy assertion: `0.36 < score < 0.39`. Manually checked that generated text is coherent.
- Re-ran M-Bridge Minitron MMLU-based pruning for Nano v2 9B → 7B; all top-10 candidates' MMLU numbers are ballpark-similar to before.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ❌ — `percentage` parameter renamed to `fraction`; `enable_kv_cache` removed from `megatron_mmlu`
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ — existing test updated and parametrized for TP+PP
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved pipeline-parallel generation and MMLU evaluation reliability; fixed output-layer synchronization in shared-embedding + pipeline setups.
* **New Features**
  * MMLU scoring now uses batched prefill logit scoring for faster, batched evaluation.
* **Behavior Changes**
  * Default MMLU sampling increased from 5% to 10%; calibration batch sizing adjusted and related CLI/help text updated.
* **Tests**
  * Distributed tests cover tensor- and pipeline-parallel modes and tighten MMLU validation ranges.
* **Documentation**
  * Updated pruning example and benchmark timing to reflect new sampling and speedup.

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
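The log-likelihood scoring scheme above can be sketched with a toy stand-in (illustrative only, not the actual `megatron_mmlu` code; logits are modeled as token-id→score dicts for clarity):

```python
def score_multiple_choice(final_logits, choice_token_ids, answer_indices):
    # One prefill forward pass per batch yields the logits at the last prompt
    # position for each example. The prediction is simply the argmax over the
    # {A, B, C, D} candidate token logits -- no autoregressive decode loop.
    correct = 0
    for logits, answer in zip(final_logits, answer_indices):
        scores = [logits[tok] for tok in choice_token_ids]
        correct += scores.index(max(scores)) == answer
    return correct / len(answer_indices)
```

This is why the decode loop (and its KV-cache management) disappears entirely: each example costs exactly one forward pass, and examples can be batched freely after sorting by sequence length.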
### What does this PR do?

Type of change: ?

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added backend-specific GPTQ helper registration to allow backend-tailored GPTQ behavior.
* **Bug Fixes**
  * Prevented KV-cache state from leaking across repeated per-layer forwards during calibration.
* **Tests**
  * Added GPU-focused tests validating GPTQ combined with vector quantization, including accuracy and end-to-end comparisons.

---

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
## Summary
Adds **performant layerwise calibration** for quantizing large models
(e.g. DeepSeek-R1 671B) that don't fit entirely on GPU. ([Example
commands](#example-commands))
1. **Performant calibration for large models** — Each decoder layer is
moved from CPU/disk to GPU (accelerate) or unsharded (FSDP2) **only
once** and kept on GPU for the entire calibration step. Previously,
every calibration batch triggered weight transfer for every layer —
O(num_batches) weight movements per layer. Now it is O(1) per layer.
This also means you can **increase batch size** since only one layer's
weights occupy GPU at a time — e.g. DeepSeek-R1 on a single node
(8×80GB) with `batch_size=16` and `gpu_max_mem_percentage=0.5`.
2. **Checkpoint save/resume** — Saves progress after each layer, so jobs
that exceed cluster time limits (e.g. 4-hour Slurm windows for 100+
layer MoE models) can resume from the last completed layer.
3. **Rename** `sequential_calibrate` → `layerwise_calibrate` for
clarity.
### Design details
The existing layerwise state machine (skip/run/capture) already
processes one layer at a time, but skip-mode layers still kept their
parameters in the ModuleList — so frameworks transferred all weights
every forward pass. This PR adds:
- **`_SkipLayer`**: replaces fully-calibrated layers with a
parameter-free dummy in the ModuleList, so framework hooks have nothing
to transfer
- **`persistent_materialization`**: keeps the active layer on GPU for
the entire calibration step, avoiding repeated offload/reload cycles
Checkpoint save is per-layer; restore is bulk — quantizer state and
weights for layers 0..K-1 are restored once at the end of calibration,
keeping the hot path fast.
### Example commands
**Qwen3-8B** (NVFP4+GPTQ, single GPU):
```bash
python hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3-8B \
--recipe nvfp4_gptq_sequential.yaml \
--calib_size 64 \
--batch_size 16 \
--dataset cnn_dailymail \
--export_path outputs/qwen3_8b_nvfp4_gptq_seq \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--vllm_fakequant_export
```
**DeepSeek-R1** (NVFP4 experts-only + FP8 KV, 8×80GB):
```bash
python hf_ptq.py \
--model unsloth/DeepSeek-R1-0528-BF16 \
--recipe ../../modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml \
--dataset cnn_dailymail \
--batch_size 16 \
--calib_size 64 \
--calib_seq 512 \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--trust_remote_code \
--export_path output/DeepSeek-R1-BF16-nvfp4-experts-only-fp8-kv \
--vllm_fakequant_export
```
### Example: NVFP4+GPTQ layerwise calibration on Qwen3-8B (36 layers,
single GPU — 20 GB peak)
**Initial run** (killed after layer 11):
```
Layerwise calibration: Found 36 transformer layers
Calibrating layer 1/36 | capture: [1]
Computing Hessians for 7 linear layers...
GPTQ time: 51.39s
Calibrating layer 2/36 | run: [1] | capture: [2]
Checkpoint: saved layer 0
GPTQ time: 50.06s
Calibrating layer 3/36 | skip: 1 | run: [2] | capture: [3]
Checkpoint: saved layer 1
...
Calibrating layer 12/36 | skip: 10 | run: [11] | capture: [12]
Checkpoint: saved layer 10
<killed>
```
**Resumed run** (picks up from layer 11, finishes all 36):
```
Layerwise calibration: Found 36 transformer layers
Checkpoint: resuming layerwise calibration from layer 11/36
Calibrating layer 12 (resumed)
GPTQ time: 51.45s
Calibrating layer 13/36 | skip: 11 | run: [12] | capture: [13]
Checkpoint: saved layer 11
...
Calibrating layer 36/36 | skip: 34 | run: [35] | capture: [36]
Checkpoint: saved layer 34
GPTQ time: 50.33s
Checkpoint: saved layer 35 (final)
Checkpoint: restored 11 previously calibrated layers
Layerwise calibration completed
Quantized model exported to: outputs/qwen3_8b_nvfp4_gptq_seq
GPU 0: Peak memory usage = 20.42 GB
```
## TODO
- [ ] Update CHANGELOG
## Test plan
- `tests/unit/torch/quantization/test_layerwise_calibrate.py` — unit
tests for skip/swap/restore
- `tests/unit/torch/quantization/test_sequential_checkpoint.py` —
checkpoint save/resume correctness
- `tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py` —
CPU-offloaded layerwise + GPTQ + checkpoint resume
- `tests/gpu/torch/quantization/test_fsdp2.py` — FSDP2 layerwise
calibration
### Verified
- [x] Qwen3-8B: layerwise calibration + checkpoint save/restore +
fakequantized checkpoint export + vLLM serve
- [x] DeepSeek-R1: checkpoint resume tested
- [x] DeepSeek-R1: fakequantized checkpoint export verified
---------
Signed-off-by: realAsma <akuriparambi@nvidia.com>
## Summary

- **hf_online_dflash.yaml**: Add a 100K-sample training config with regression baselines (B200 loss curve), `MAX_FINAL_LOSS`/`MIN_FINAL_ACC`/`MIN_ACCEPTANCE_LENGTH` thresholds, and the vLLM nightly container for DFlash support
- **vllm_smoke_test.sh**: Parse acceptance length from the vLLM server log for the regression check; `pip install pandas` workaround for the broken nightly container; capture server output to a temp file
- **query.sh**: Detect vLLM server death during startup (PID liveness check) plus a 600 s timeout to prevent infinite polling that wastes GPU hours; `pip install pandas` workaround
- Fix the empty `environment:` key in the DFlash YAML causing a nemo_run `ListParseError`

## Test plan

- [x] E2E pipeline passed on 8x B200 (training + vLLM smoke test + AR eval)
- [x] Training regression: final loss 3.82 < 5.0, acc 0.20 > 0.15
- [x] vLLM acceptance length: 1.79 >= 1.4 threshold
- [x] AR evaluation: 2.02 overall on MT-Bench (8 categories)
- [x] Server liveness check prevents GPU waste on vLLM crash

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Added optional regression validation for vLLM acceptance metrics
  * Introduced a configurable vLLM server startup timeout (default 600 seconds)
* **Improvements**
  * Enhanced logging for vLLM server startup with progress tracking and waited-time reporting
  * Faster detection of vLLM server process failures during initialization
* **Configuration Updates**
  * Increased training dataset size and logging granularity
  * Scaled tensor parallelism from 4 to 8 across multiple pipelines
  * Expanded PTQ quantization to a multi-step pipeline
  * Added configurable training metric thresholds

---

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What does this PR do?
Type of change: new feature, new example <!-- Use one of the following:
Bug fix, new feature, new example, new tests, documentation. -->
<!-- Details about the change. -->
## Summary
- Add skip-softmax sparse attention (BLASST) for diffusion models via
dedicated Triton kernels — an inference kernel with tile skipping and a
calibration kernel with vectorized multi-threshold sparsity measurement
- Add `triton_skip_softmax` method with exponential model calibration
(`scale_factor = a * exp(b * sparsity)`) and log-space fitting for
diffusion models
- Add Triton kernel backends for diffusers and LTX attention dispatch
- Fix calibration to skip RULER dataset generation when user provides
their own `forward_loop` (required for non-LLM models)
## Changes
### Triton kernels (`modelopt/torch/kernels/triton_fa.py`)
- **`_attn_fwd`**: Forward kernel with optional tile skipping — tiles
whose max attention score is far below the running softmax max are
skipped entirely (no V load, no softmax, no accumulation). Runtime
sparsity measurement via atomic counters.
- **`_attn_fwd_calibrate`**: Calibration kernel that computes full
attention while measuring how many tiles would be skipped at each of N
thresholds simultaneously. Uses per-program output buffers (zero atomic
contention) and vectorized multi-threshold comparison.
- **`attention()`** / **`attention_calibrate()`**: Python wrappers for
inference and calibration kernels.
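The tile-skip test inside the forward kernel can be sketched out-of-kernel as a scalar comparison (names are assumptions based on the description above; the real check operates on Triton tiles):

```python
def should_skip_tile(tile_max_qk, running_max, skip_threshold_log2):
    # With a negative log2-domain threshold, a tile is skipped when its best
    # attention score falls more than |threshold| below the running softmax
    # max: its softmax weights would be at most ~2**threshold of the largest
    # weight, i.e. negligible, so the V load, softmax, and accumulation for
    # that tile can all be elided.
    return tile_max_qk < running_max + skip_threshold_log2
```

The calibration kernel evaluates the same comparison against N thresholds at once on full (unskipped) attention, which is how per-threshold sparsity counts are gathered in a single pass.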
### Kernel backends
(`modelopt/torch/sparsity/attention_sparsity/kernels/`)
- **`diffusers_triton_attention.py`**: Registers `modelopt_triton`
backend in diffusers' attention dispatch. Handles [B, S, H, D] → varlen
layout conversion, calibration/inference mode switching, thread-local
configuration, and counter accumulation.
- **`ltx_triton_attention.py`**: Patches `ltx_core.Attention` modules
for Triton dispatch with the same calibration/inference modes.
### Method
(`modelopt/torch/sparsity/attention_sparsity/methods/triton_skip_softmax.py`)
- `TritonSkipSoftmaxMethod`: Context managers for calibration (→
calibration kernel) and inference (→ forward kernel with tile skipping).
Three threshold priority levels: raw threshold > calibrated scale_factor
> static threshold.
### Calibration
(`modelopt/torch/sparsity/attention_sparsity/calibration/`)
- **`calibrator.py`**: `DynamicThresholdCalibrator` with `fit_logspace`
option — fits exponential model in log space (minimizes relative error)
for diffusion models where scale_factors span many orders of magnitude.
Records observed sparsity range for extrapolation warnings.
- **`calibrate.py`**: Skips RULER dataset when `forward_loop` is
provided; passes `fit_logspace` through from config.
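The log-space exponential fit can be sketched with a closed-form least-squares regression (a self-contained illustration of the idea, not the `DynamicThresholdCalibrator` code):

```python
import math

def fit_exponential_logspace(sparsities, scale_factors):
    # Fit scale = a * exp(b * s) by ordinary least squares on log(scale):
    # log(scale) = log(a) + b * s. Minimizing error in log space minimizes
    # *relative* error, which matters when scale_factors span many orders of
    # magnitude, as they do for diffusion models.
    n = len(sparsities)
    ys = [math.log(v) for v in scale_factors]
    sx, sy = sum(sparsities), sum(ys)
    sxx = sum(x * x for x in sparsities)
    sxy = sum(x * y for x, y in zip(sparsities, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = math.exp((sy - b * sx) / n)
    return a, b
```

A linear-space fit of the same data would be dominated by the largest scale_factors and effectively ignore the small ones, which is why `fit_logspace=True` is recommended for diffusion models.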
### Config & conversion
- **`config.py`**: `CalibrationConfig.fit_logspace` field (default
False, recommended True for diffusion models).
`skip_softmax_raw_threshold` field for direct threshold mode.
- **`conversion.py`**: Auto-registers diffusers/LTX Triton backends on
`sparsify()`. Updated summary display.
### Example
- **`wan22_skip_softmax.py`**: End-to-end example for WAN 2.2 5B/14B
with baseline, raw-threshold, and calibrated modes. Supports runtime
sparsity reporting.
## Threshold modes
| Mode | How it works | Use case |
|------|-------------|----------|
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly to the kernel as `skip_threshold_log2` | Quick testing, sweeps |
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor = a * exp(b * target)`, then `threshold = scale_factor / seq_k` at runtime | Production use with seqlen adaptation |
| **Static** (default `skip_softmax_threshold=0.1`) | `log2(lambda) * sm_scale` | Fallback |
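The three rows of the table can be condensed into one dispatcher sketch (an illustration of the priority order described earlier: raw threshold > calibrated scale_factor > static threshold; names are assumptions):

```python
import math

def runtime_threshold(mode, seq_k, *, raw=None, a=None, b=None, target=None,
                      static_lambda=0.1, sm_scale=1.0):
    # Hypothetical dispatcher over the three threshold modes in the table.
    if mode == "raw":
        return raw                               # passed straight to the kernel
    if mode == "calibrated":
        scale_factor = a * math.exp(b * target)  # from the calibration fit
        return scale_factor / seq_k              # adapts to the KV sequence length
    return math.log2(static_lambda) * sm_scale   # static fallback
```

Only the calibrated mode depends on `seq_k`, which is what gives it sequence-length adaptation at inference time.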
## Usage
```bash
# Fixed raw threshold (no calibration)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--raw-threshold -0.7 \
--prompt "A cat playing piano" --output out.mp4
# With calibration (log-space fit for diffusion models)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--calibrate --target-sparsity 0.5 \
--prompt "A cat playing piano" --output out.mp4
# Dense baseline for comparison
python examples/diffusers/sparsity/wan22_skip_softmax.py \
--model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
--baseline \
--prompt "A cat playing piano" --output baseline.mp4
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅ <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
❌ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
* Added skip-softmax sparse attention support for Diffusers models,
enabling efficient video generation
* Added support for both eager and Triton attention backends for sparse
attention
* Added new example script for Wan 2.2 text-to-video generation with
sparse attention optimization
* **Documentation**
* Updated documentation with sparse attention configuration guide and
usage examples
* **Tests**
* Added comprehensive unit tests for kernel backend registration and
skip-softmax functionality
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?

**Type of change:** Bug fix

Add the newly added quant configs to the example PTQ script.

### Testing

I have locally run auto_quantize with these two quant_configs and obtained successfully exported HF artifacts.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **New Features**
  * Added support for three new quantization formats: nvfp4_mse, nvfp4_local_hessian, and nvfp4_experts_only, expanding available export options when using auto-quantize.
* **Bug Fixes / UX**
  * Updated the invalid-quantization error message to include the newly accepted format identifiers.

Signed-off-by: Bilal Kartal <bkartal@nvidia.com>
Signed-off-by: bkartal-dev <bkartal@nvidia.com>
## Summary - Add end-to-end ResNet50 support in the torch_onnx quantization → ONNX export → TRT engine pipeline - Fix multiple Conv2d-related export issues that blocked Conv2d-heavy models from working with FP8/INT8/MXFP8/NVFP4/auto quantization modes - Fix `configure_linear_module_onnx_quantizers` to handle all modules with block quantization (not just `nn.Linear`), fixing NVFP4/MXFP8 export for models with quantized non-Linear modules - Add `--trt_build` flag to `torch_quant_to_onnx.py` and simplify test infrastructure ### Files Changed - `modelopt/torch/_deploy/utils/torch_onnx.py` — Disable FP8 Conv2d weight quantizers and autocast during ONNX export - `modelopt/torch/quantization/export_onnx.py` — Fix `configure_linear_module_onnx_quantizers` for all module types with block quantization - `examples/torch_onnx/torch_quant_to_onnx.py` — Add `--trt_build` flag, calibration for FP8 override quantizers, Conv2d→FP8 override for auto mode, filter_func updates - `examples/torch_onnx/README.md` — Add ResNet50 to supported models table - `tests/examples/torch_onnx/test_torch_quant_to_onnx.py` — Add ResNet50 test entry, simplify using `--trt_build` - `tests/_test_utils/torch/vision_models.py` — Add ResNet50 to timm model registry ### Quantization modes passing - ✅ FP8, INT8, MXFP8, NVFP4, Auto (all 5 modes pass export + TRT build) - INT4_AWQ excluded (pre-existing limitation for all models) ## Test plan - [x] All 5 resnet50 test modes pass: `pytest tests/examples/torch_onnx/test_torch_quant_to_onnx.py -k resnet50` (5/5 passed) - [x] Full regression: 18 passed, 2 failed (pre-existing swinv2_tiny fp8/int8 failures) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added ResNet50 to supported ONNX export vision models with FP8, INT8, MXFP8, and NVFP4 support. * Optional TensorRT engine build after export via a new CLI flag. 
* **Improvements** * Enhanced quantization calibration and export flows for FP8/INT8 models, including broader block-quantization support across module types and safer export handling. * Tests updated to include ResNet50 in the model matrix. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com> Signed-off-by: ajrasane <arasane@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…er/experimental/DMS) (#879) ## What does this PR do? **Type of change:** ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> **Overview:** ? ## Usage <!-- You can potentially add a usage example below. --> ```python # Add a code snippet demonstrating how to use this ``` ## Testing <!-- Mention how have you tested your change if applicable. --> ## Before your PR is "*Ready for review*" <!-- If you haven't finished some of the above items you can still open `Draft` PR. --> - **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed. - **Is this change backward compatible?**: Yes/No <!--- If No, explain why. --> - **Did you write any new necessary tests?**: Yes/No - **Did you add or update any necessary documentation?**: Yes/No - **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No <!--- Only for new features, API changes, critical bug fixes or bw breaking changes. --> ## Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated DMS installation instructions to reflect the repository structure and correct directory navigation during setup. * Clarified the setup steps so users follow the accurate directory change before running installation commands. * Small wording improvements to reduce confusion during the installation process. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Farid Adilazuarda <42537562+faridlazuarda@users.noreply.github.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Replace mip package with more popular pulp package for puzzle mip solving. Both use the CBC solver under the hood ## Testing - Results very close for Qwen3-8B and Nemotron-Nano-12B-v2 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Simplified GPU test environment setup by removing unnecessary system dependency installation * Updated internal optimization solver dependencies in the puzzletron module <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…plify CI workflows (#1286) ### What does this PR do? Type of change: New feature / infrastructure improvement Follow-up to #1285 for correct CI test environment for megatron based tests Replaces `tox` + `tox-current-env` with `nox` for all test, lint, docs, and wheel build sessions. The primary motivation was that `tox-current-env` is incompatible with uv venvs in NGC containers (e.g. NeMo's `/opt/venv`) — it picks the system Python via `sys._base_executable` instead of the container's venv Python which has megatron packages pre-installed. Key changes: - **`noxfile.py`** replaces `tox.ini` with GPU, CPU unit, partial-install, pre-commit, docs, and wheel sessions - **GPU sessions** use `venv_backend="none"` (run directly in container env) and `python -m pip/pytest` to avoid PATH mismatches - **uv** is set as the default venv backend (if available) for CPU sessions (faster installs) Also includes CI workflow simplifications: - **`_pr_gate.yml`** new reusable workflow centralizing file-change detection + linux-check wait logic (was duplicated across 3 workflow files) - **Collapsed pr/non-pr job pairs** into single jobs with conditional `runs-on` in `gpu_tests.yml`, `example_tests.yml`, `regression_tests.yml` - **Collapsed `multi-py` / `multi-torch` / `multi-transformers`** into a single `multi-version` matrix job in `unit_tests.yml` - **PR path filtering** for unit test secondary jobs (multi-version, launcher, partial-install) — skipped if no relevant files changed - **Fixed schedule/workflow_dispatch skipping** — jobs with `needs: [pr-gate]` were incorrectly skipped when all pr-gate internal jobs were skipped; fixed by making the gate job always run - **multi-version, launcher, partial-install** now also run on `schedule` / `workflow_dispatch` ### Usage ```bash python -m pip install nox uv # install nox and uv (once) nox -l # list all sessions nox -s gpu_megatron # run a GPU session (inside container) nox -s "unit-3.12(torch_211, tf_latest)" # run a specific unit 
test combination nox -s "unit-3.12(torch_211, tf_latest)" -R # force-recreate venv (e.g. after dep changes) COVERAGE_PROCESS_START=pyproject.toml nox -s "unit-3.12(torch_211, tf_latest)" # with coverage ``` ### Testing - Ran `nox -l` to verify all session names - Ran `gpu_megatron` session locally inside NeMo container — confirmed it uses `/opt/venv/bin/python` correctly - Manually triggered nightly-runs: - Unit: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608013657 - GPU: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608018763 - Examples: https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608017322 ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: N/A — CI infrastructure only - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ (added `nox` and `uv` to `dev-test`, both Apache-2.0) - Did you write any new necessary tests?: N/A - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A — no user-facing changes ### Additional Information Supersedes the tox-current-env workaround in the parent branch. --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary Automated weekly update of uv.lock file for nSpect Scanning: - `uv.lock` — upgraded all transitive dependencies to latest compatible versions Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What does this PR do? Type of change: ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> Add a standalone monitor skill for persistent job tracking across sessions, and integrate it with PTQ, evaluation, and deployment skills. Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel status checks) that didn't survive session restarts and couldn't track multiple jobs. Users had to manually ask "check status" every time. Solution: A centralized monitor skill with: - Job registry (.claude/active_jobs.json): single source of truth for all active jobs - Durable recurring cron: polls every 15 min, survives session restarts, self-cleans when all jobs complete - User-initiated mode: works in new conversations by reading the registry - Aggregated reporting: "2 of 4 completed" instead of per-job noise ### Usage After any skill submits a job, the monitor skill automatically: 1. Registers the job in .claude/active_jobs.json 2. Sets up a durable cron to poll status every 15 minutes User can also trigger manually: User: "check my eval status" → reads registry, reports current state User: "is the PTQ done?" → finds job, checks status User: "what jobs are running?" → lists all registered jobs ### Testing <!-- Mention how have you tested your change if applicable. --> ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. 
--> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added monitor skill for tracking SLURM jobs, NEL evaluations, and launcher experiments with persistent job registry. * **Documentation** * Updated deployment, evaluation, and PTQ documentation to use the new monitor skill. * Simplified diagnostic and troubleshooting instructions. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Kai Xu <kaix@nvidia.com>
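The aggregated reporting described above ("2 of 4 completed") can be sketched against a hypothetical registry shape; the field names below are illustrative, not the actual `.claude/active_jobs.json` schema:

```python
def summarize_jobs(jobs: list[dict]) -> str:
    """One-line rollup of the job registry instead of per-job noise."""
    done = sum(1 for job in jobs if job.get("status") == "completed")
    return f"{done} of {len(jobs)} completed"


jobs = [
    {"status": "completed"},
    {"status": "running"},
    {"status": "completed"},
    {"status": "pending"},
]
print(summarize_jobs(jobs))  # "2 of 4 completed"
```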
### What does this PR do? Type of change: new feature <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> - Add Conv3D implicit GEMM kernel with BF16 WMMA tensor cores and fused NVFP4 activation quantization for video diffusion VAE layers - Integrate into _QuantConv3d via QuantModuleRegistry — automatically dispatched when NVFP4 quantization is applied to nn.Conv3d - Move kernel from `experimental/conv/ to modelopt/torch/kernels/conv/`; move tests to `tests/gpu/torch/quantization/kernels/` ### Testing <!-- Mention how have you tested your change if applicable. --> - Added test cases to measure the difference between cuDNN and our CUDA implicit GEMM kernel - Added an NVFP4 fake quantization test using CUDA code ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!--- Mandatory --> - Did you write any new necessary tests?: ✅ <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. 
--> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Per-backbone quantization/export in a single run with per-backbone checkpoints and backbone-aware quant filters * Configurable NVFP4 block-size via CLI/config; improved NVFP4 Conv3D inference path and Wan 2.2 quantization support * **Bug Fixes** * Video-model calibration now respects extra params and forces video decoding during calibration * **Documentation** * Added comprehensive Conv3D implicit‑GEMM kernel documentation; removed experimental Conv3D prototype docs/benchmark * **Tests** * New Wan 2.2 quantization/export tests and expanded Conv3D/FP4 kernel test coverage <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?
Type of change: Bug fix
Enables end-to-end AWQ checkpoint export and reload in the vLLM
fake-quant serving path (`MODELOPT_STATE_PATH`). Previously, the
`input_quantizer` used an incorrect `pre_quant_scale`: for grouped
quantizers such as `qkv_proj`, only the first
`input_quantizer.pre_quant_scale` was taken. This PR adds
`_resmooth_experts_for_export`, which non-mutatively averages
`pre_quant_scale` across MoE experts and unifies the input `_amax`,
required because vLLM uses a single input quantizer per expert group.
It also adds `merge_amax_tensors_for_group` (element-wise max for
same-shape tensors, `cat` for GQA, scalar-max fallback), replacing the
scalar-collapsing `torch.stack().max()` that dropped per-channel
`_amax` structure.
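A minimal, dependency-free sketch of the two export-time transforms described above. Plain Python lists stand in for 1-D tensors; the real helpers operate on `torch.Tensor`s inside the exporter, and these function names mirror but do not reproduce the actual implementation:

```python
def resmooth_experts_for_export(pre_quant_scales):
    """Non-mutating average of per-expert pre_quant_scale vectors."""
    num_experts = len(pre_quant_scales)
    return [sum(vals) / num_experts for vals in zip(*pre_quant_scales)]


def merge_amax_for_group(amaxes):
    """Merge per-quantizer _amax values for a fused quantizer group.

    - element-wise max when all entries have the same shape
    - concatenation when shapes differ (GQA-style q/k/v)
    - scalar max fallback for per-tensor amax
    """
    if all(not isinstance(a, list) for a in amaxes):
        return max(amaxes)  # per-tensor scalars
    if len({len(a) for a in amaxes}) == 1:
        return [max(vals) for vals in zip(*amaxes)]  # same shape
    merged = []
    for a in amaxes:  # differing shapes: concatenate (GQA case)
        merged.extend(a)
    return merged
```

Note how the per-channel structure is preserved in the same-shape branch, which is exactly what the old `torch.stack().max()` collapsed to a scalar.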
### Usage
```python
# Export AWQ checkpoint from HF model
from modelopt.torch.export.plugins.vllm_fakequant_hf import export_hf_vllm_fq_checkpoint
export_hf_vllm_fq_checkpoint(model, export_dir="./awq_vllm_checkpoint")
```
### Testing
**Step 1 — Export the quantized checkpoint:**
```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--recipe <AWQ_RECIPE> \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export
```
This produces `<EXPORT_DIR>/vllm_fq_modelopt_state.pth`, which now includes the
averaged per-expert `pre_quant_scale` and the unified `_amax`.
**Step 2 — Serve via the vLLM fakequant worker:**
```bash
MODELOPT_STATE_PATH=<EXPORT_DIR>/vllm_fq_modelopt_state.pth \
python examples/vllm_serve/vllm_serve_fakequant.py \
<EXPORT_DIR> --tensor-parallel-size <TP>
```
Tested with the following quantization configurations:
```
FP8_DEFAULT_CFG
FP8_DEFAULT_CFG (input_q disabled)
INT8_SMOOTHQUANT_CFG
INT8_WEIGHT_ONLY_CFG
NVFP4_DEFAULT_CFG
NVFP4_AWQ_LITE_CFG
INT4_AWQ_CFG
NVFP4_AWQ_CFG
NVFP4_DEFAULT_CFG (input_q disabled)
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
### Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **New Features**
* Added Nemotron-style MoE export support and group-aware AWQ resmoothing with optional requantization during export.
* Improved handling for shared-input / expert groups and tensor-parallel sharding of pre-quantization scales.
* **Bug Fixes**
* Removed AWQ reload limitation from known issues; improved checkpoint validation and safer save/load behavior.
* Better detection and handling of enabled weight-quantizers and clearer warnings for mismatched checkpoint keys.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
### What does this PR do? Type of change: ? <!-- Use one of the following: Bug fix, new feature, new example, new tests, documentation. --> <!-- Details about the change. --> ### Usage ```python # Add a code snippet demonstrating how to use this ``` ### Testing <!-- Mention how have you tested your change if applicable. --> ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a public backend-specific calibrator registration API to support FP8 scale-sweep calibration, allowing backends to supply custom calibrators used during FP8 tuning. * **Tests** * Added unit tests confirming registry insertion/overwrite, that registered calibrators are invoked when FP8 scale-sweep is enabled, are not invoked when disabled, and that calibration falls back to defaults when no backend is registered. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com> Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary
- `datasets`' `resolve_pattern` only matches entries with
`type=="file"`, so passing a bare directory path as `data_files` to
`load_dataset` results in `FileNotFoundError` even when the directory
exists on disk
- Detect directory paths in `ShardedDataset._load_dataset()` and pass
them via `data_dir` instead of `data_files`
## Reproduction
```python
from datasets import load_dataset
# This fails with FileNotFoundError:
load_dataset("json", data_files="/path/to/data_directory")
# This works:
load_dataset("json", data_dir="/path/to/data_directory")
```
## Test plan
- [ ] Verify existing EAGLE3/DFlash training pipelines that pass
directory paths work
- [ ] Verify file path and glob patterns still work (falls through to
`data_files`)
- [ ] Verify `data_files=None` (no data_files arg) still works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Bug Fixes**
* Fixed an issue with dataset loading that prevented proper handling of
directory-based data sources. Directories are now correctly detected and
processed during dataset initialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
#1293) ## Summary - **megatron_lm_ptq.yaml**: Qwen3-8B PTQ to single GPU for L40 clusters (TP=1, all tasks) - **quantize.sh**: Auto-find largest PP dividing model's `num_hidden_layers` for export step. Qwen3-8B has 36 layers which isn't divisible by 8, causing `AssertionError` on 8-GPU nodes - **compute_hidden_states_trtllm.py**: Use `messages` with `conversations` fallback, matching the HF version. Fixes `KeyError: 'conversations'` when data uses OpenAI `messages` format ## Test plan - [x] Qwen3-8B PTQ runs on single L40 GPU - [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs, PP=4 on 4 GPUs, PP=1 on 1 GPU) - [x] EAGLE3 offline pipeline reads data with `messages` field 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Dataset input handling now supports multiple field formats for enhanced compatibility. * **Bug Fixes** * Optimized GPU resource allocation during model quantization with improved pipeline parallelism computation. * Updated quantization configuration for more efficient resource utilization. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chenhan Yu <chenhany@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
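The PP auto-selection described in the summary above amounts to a divisor search. An illustrative Python sketch (`quantize.sh` does the equivalent in shell):

```python
def largest_valid_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest pipeline-parallel size <= num_gpus that evenly divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1  # pp=1 always divides, so the loop returns before reaching here
```

For Qwen3-8B's 36 layers this yields PP=6 on 8 GPUs, PP=4 on 4 GPUs, and PP=1 on 1 GPU, matching the test plan.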
## Summary - When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs across 2 nodes), `ParallelismConfig` raises `total_size (4) does not match num_processes (8)` because `dp_replicate_size` defaults to 1 - Auto-compute `dp_replicate_size = world_size // (dp_shard_size * cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel replication works without manual config - This enables `dp_shard_size` to be set to per-node GPU count (better NVLink utilization) while automatically creating replicas across nodes ## Test plan - [ ] Verify single-node training (dp_shard_size == world_size, dp_replicate_size == 1) unchanged - [ ] Verify multi-node with dp_shard_size < world_size creates correct replica groups - [ ] Verify existing EAGLE3/DFlash configs still work 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Enhanced parallelism configuration initialization in the speculative decoding example to better handle distributed training scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
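The auto-computation above is one integer division plus a divisibility check. A sketch under the assumption that the real change lives in the speculative-decoding example's `ParallelismConfig` setup:

```python
def auto_dp_replicate_size(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    """Replicas across nodes = world_size / (dp_shard_size * cp_size)."""
    shard_total = dp_shard_size * cp_size
    if world_size % shard_total != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by "
            f"dp_shard_size * cp_size ({shard_total})"
        )
    return world_size // shard_total
```

E.g. `dp_shard_size=4` on 8 GPUs across 2 nodes gives 2 replica groups, so intra-node FSDP2 sharding combines with inter-node data-parallel replication without manual config.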
### What does this PR do? Add gptq fused kernel to improve speed. ### Usage check unittest ### Testing added a unittest ### Before your PR is "*Ready for review*" Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.). - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. --> - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A <!--- Mandatory --> - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. --> - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. --> ### Additional Information <!-- E.g. related issue. --> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Fused GPTQ backend for faster blockwise weight updates, toggleable via a new "fused" option. * Shared NVFP4 quantization primitives exposed for reuse. * **Refactor** * Consolidated FP4 scale/quantization logic into reusable utilities and centralized Hessian inversion handling. * **Tests** * Expanded GPU tests comparing fused vs unfused GPTQ, added Triton-availability gating and a local benchmark entrypoint. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
### What does this PR do? Type of change: Bug fix Fixes gh-pages branch bloat that grew from ~26 MB to ~441 MB in four weeks (nvbug 6099503). Three compounding causes were identified and addressed: 1. **Sphinx `.doctrees/` cache published to gh-pages** — `sphinx-build` was writing its build cache inside `build/html/` which was then uploaded verbatim. Accounts for ~3.3 GB uncompressed across history. 2. **`JamesIves/github-pages-deploy-action` appending a commit on every push** — main-site files accumulated forever with `single-commit: false` (default). 3. **PR preview deploying on every `synchronize` event for all PRs** — `rossjrw/pr-preview-action` re-deployed the full site for every push to any PR regardless of whether docs changed (e.g. PR #1128 triggered 64 preview deploys × ~11 MB each). Changes: - Pass `-d /tmp/doctrees` to `sphinx-build` so `.doctrees/` is never written into `build/html/` - Add `paths: [docs/**, modelopt/**]` filter to `pull_request` trigger so the docs workflow only runs on PRs that touch docs or source code - Set `single-commit: true` on the deploy action so main-site pushes squash into one commit - Deduplicate docs build: `deploy-preview` now downloads the artifact from `build-docs` instead of running a second `sphinx-build` - Set `retention-days: 1` on the artifact since it is only needed for the duration of the workflow run The one-time cleanup (force-push squashed orphan to gh-pages) was already applied separately — repo is now ~59 MB for a full clone vs ~441 MB before. ### Usage N/A — CI/workflow change only. ### Testing - Workflow logic reviewed manually. - The one-time cleanup was verified: `git rev-list --objects --disk-usage origin/gh-pages` now reports ~28 MB; full clone is ~59 MB. 
### Before your PR is "*Ready for review*" - Is this change backward compatible?: ✅ - If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A - Did you write any new necessary tests?: N/A - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A ### Additional Information nvbug 6099503 <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Optimized documentation build and deployment workflow in CI/CD pipeline. * Improved pull request documentation preview handling with faster build timeouts and refined artifact management. * Enhanced GitHub Pages deployment configuration for better consistency. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Use latest containers for testing in CICD <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Bumped TensorRT-LLM Docker images to 1.3.0rc12 in example and GPU test workflows. * Updated PyTorch container image from 26.01 to 26.03 for GPU tests. * Captured uv lock upgrade output to a temp file, inlined it into PR bodies, and adjusted workflow heredoc/templating and step behavior. * **Documentation** * Clarified an inline comment and simplified a warning message for an ONNX quantization extension. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
📝 Walkthrough: This pull request introduces major enhancements to quantization and sparsity workflows, migrates CI/CD from tox to nox, and adds comprehensive example documentation. Key additions include layerwise calibration support (replacing sequential calibration), skip-softmax sparse attention for video models, fused GPTQ Triton kernels, and NVFP4 Conv3D implicit GEMM optimizations. The change involves 100+ files across quantization, sparsity, testing, CI, and example code.