Gkarch/update 1314 #1316

Closed
grzegorz-k-karch wants to merge 31 commits into feature/vllm_deployment_docs from gkarch/update_1314

Conversation


@grzegorz-k-karch grzegorz-k-karch commented Apr 22, 2026

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Layerwise calibration for large GPU-infeasible models with intermediate checkpoint saving
    • NVFP4 implicit GEMM CUDA kernel for Conv3D inference with quantization
    • Skip-softmax sparse attention for diffusion models (WAN 2.2, LTX-2)
    • Fused GPTQ kernel for accelerated weight quantization
    • Job monitoring framework for SLURM-based submissions
    • Container registry authentication validation for remote deployments
  • Improvements

    • ~10x MMLU evaluation speedup via batched prefill
    • Enhanced quantization export for vLLM with resmoothing support
    • Extended tool support (nox/uv migration, improved multi-version testing)

kinjalpatel27 and others added 30 commits April 16, 2026 09:42
### What does this PR do?

Type of change: Bug fix
During Megatron→vLLM fakequant export
(`export_mcore_gpt_to_hf_vllm_fq`), the `weight_quantizer` is now
applied as fake-quantization (quantize + dequantize) directly into the
exported weight tensor, and its amax is no longer saved to
`quantizer_state.pth`. On reload, if `weight_quantizer` keys are absent
from the checkpoint (because they were folded at export time), the
corresponding quantizer modules are disabled.
This change is especially useful when amax values are not synced across experts for `weight_quantizer`; it lets each expert's `weight_quantizer` keep its own amax for better accuracy.
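
The reload behavior described above can be sketched as follows; `restore_quantizer_state` and the dict-based quantizer stand-ins are illustrative, not the exporter's actual API:

```python
# Illustrative sketch (not the exporter's code): weight_quantizer entries that
# were folded into the weights at export time are absent from
# quantizer_state.pth, so on reload the matching quantizer modules are
# disabled instead of raising a load error.
def restore_quantizer_state(quantizers: dict, saved_state: dict) -> dict:
    for name, q in quantizers.items():
        key = f"{name}.weight_quantizer"
        if key in saved_state:
            q.update(saved_state[key])  # normal restore path
        else:
            q["enabled"] = False  # folded at export -> disable
    return quantizers

quantizers = {"decoder.layers.0.linear_fc1": {"enabled": True, "amax": None}}
restore_quantizer_state(quantizers, {})  # exported state carries no wq keys
```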

### Usage
```python
# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)
```
 
### Testing
Step 1 — Quantize (run from Megatron-LM
`examples/post_training/modelopt`):
```bash
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```

Step 2 — Export for vLLM fakequant (trailing spaces after the backslashes would break line continuation, so keep each `\` at end of line):
```bash
MLM_EXTRA_ARGS=--export-vllm-fq \
HF_MODEL_CKPT=<path/to/hf/weights> \
MLM_MODEL_CKPT=<quant-ckpt-name> \
EXPORT_DIR=<export-dir> \
bash export.sh <hf-model-id>
```

Step 3 — Serve (run from `examples/vllm_serve`):
```bash
QUANT_CFG=NVFP4_DEFAULT_CFG \
QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
python3 vllm_serve_fakequant.py <export-dir> \
    -tp 1 --served-model-name <model-name> \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code --enforce-eager \
    --disable-custom-all-reduce \
    --gpu-memory-utilization 0.8
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ 
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A 
- Did you write any new necessary tests?: N/A 
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A 

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Bug Fixes**
  * Better handling when loading checkpoints: missing weight-quantizer entries are validated and corresponding modules are disabled to avoid load failures.

* **Improvements**
  * Export now folds enabled weight quantizers into exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
## Summary
Automated weekly update of uv.lock file for nSpect Scanning:
- `uv.lock` — upgraded all transitive dependencies to latest compatible
versions

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

### What does this PR do?

Type of change: ? <!-- Use one of the following: Bug fix, new feature,
new example, new tests, documentation. -->

<!-- Details about the change. -->
PTQ: model-specific dependency support
- Add EXTRA_PIP_DEPS support to the launcher's `ptq.sh` so models
requiring extra pip packages (e.g., `mamba-ssm` for hybrid Mamba
architectures like Nemotron) can install them automatically before
running PTQ. Also updates the PTQ skill with a new Step 2.5 for
detecting model-specific dependencies.

Container registry auth checks
- Add new section 6 covering auth detection for enroot/pyxis, Docker,
and Singularity/Apptainer. Includes credential locations, how to add
them, and common failure modes.
- Add Step 7.5 with NEL default image table, DockerHub-first strategy
with NGC fallback, and build-config CLI note.
- Add auth check before remote SLURM deployment.

### Usage

Set EXTRA_PIP_DEPS in the launcher YAML's environment section:
```yaml
task_0:
  script: common/hf/ptq.sh
  args:
    - --repo nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --local-dir /hf-local/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
    - --
    - --quant nvfp4
    - --tasks quant
  environment:
    - EXTRA_PIP_DEPS: "mamba-ssm causal-conv1d"
```
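
A minimal sketch of how a wrapper script might consume this variable before launching PTQ; only `EXTRA_PIP_DEPS` comes from this PR, while `PIP_CMD` is an illustrative override added here for dry runs:

```shell
install_extra_deps() {
  # Skip when no extra dependencies are requested.
  [ -z "${EXTRA_PIP_DEPS:-}" ] && return 0
  # Intentional word splitting: EXTRA_PIP_DEPS is a space-separated package list.
  # shellcheck disable=SC2086
  ${PIP_CMD:-pip install} ${EXTRA_PIP_DEPS}
}

# A launcher would call install_extra_deps before the PTQ entrypoint, e.g.:
#   install_extra_deps && python hf_ptq.py "$@"
```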

### Testing
<!-- Mention how have you tested your change if applicable. -->
Tested end-to-end: NVFP4 quantization of
`NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` on a B200 cluster via the
launcher. Job succeeded: mamba-ssm installed automatically, calibration
completed (512 samples, 84s), checkpoint exported (18 GB, 2 safetensor
shards).

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **Documentation**
* Added container registry authentication verification workflow for
SLURM deployments, including credential checks, verification commands,
common failure symptoms, and remediation guidance.
* Required credential validation before SLURM job submission and added
SLURM-only verification steps with image fallback recommendations.
* New dependency-checking step for models that use
remote/trust_remote_code, plus guidance for resolving extra package
requirements and tightened build-config guidance.
* Updated PTQ launcher documentation to reference the new wrapper
script.

* **New Features**
* Support for specifying extra pip dependencies during model processing
via an environment variable.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kai Xu <kaix@nvidia.com>

### What does this PR do?

Type of change: Bug fix

Fixes TRT-LLM DeepEP kernel failures during LLM deployment on
unsupported GPUs (e.g. Blackwell SM 12.0) by defaulting expert
parallelism (`ep`) to 1 instead of auto-setting it to the GPU count for
MoE models.

Previously, when the model config contained expert-related keys, `ep`
was automatically set to `torch.cuda.device_count()`, which triggered
DeepEP kernel failures on GPUs that don't support it. Now `ep` defaults
to 1 while still enabling attention data parallelism for MoE models.
Expert parallelism can be enabled explicitly by the caller when the
environment is known to support it.
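
The new defaulting logic can be sketched roughly as follows; `resolve_parallelism`, `num_gpus`, and the config-key check are illustrative stand-ins for the deployment code (`num_gpus` plays the role of `torch.cuda.device_count()`):

```python
def resolve_parallelism(model_config: dict, num_gpus: int, ep_override=None):
    # MoE detection via expert-related config keys (illustrative check).
    is_moe = any("expert" in key for key in model_config)
    # Previously: ep = num_gpus for MoE models, which could trigger DeepEP
    # kernel failures on GPUs without support (e.g. Blackwell SM 12.0).
    # Now ep defaults to 1 and the caller must opt in explicitly.
    ep = ep_override if ep_override is not None else 1
    # Attention data parallelism is still enabled for MoE models.
    return {"ep": ep, "enable_attention_dp": is_moe}
```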

### Testing

- [x] Verified that the `llm_ptq` test passes with this fix on Blackwell
GPUs.
- [x] 2-gpu CI test triggered:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/24495054531/job/71588037727

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
…equirements (#1275)

### What does this PR do?

Type of change: Bug fix

Removed version fixes for torch and transformers



### Testing
Tested quantization with a couple of models; working as expected.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Relaxed dependency specs: removed strict pin for torch to allow latest
compatible installs, and constrained transformers to <5.0.0 for broader
compatibility and easier updates.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Hrishith Thadicherla <hthadicherla@nvidia.com>
Signed-off-by: Hrishith Thadicherla <99313418+hthadicherla@users.noreply.github.com>
Don't allow more than a 1% overall project coverage drop per PR; 2% was too much for such a large codebase.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated code coverage enforcement thresholds for pull requests to
maintain stricter quality standards.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
As title

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated the release date for version 0.43 in the changelog.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…to allow user to bypass if needed (#1279)

## Summary

- Remove the `kwargs.setdefault("weights_only", True)` call from
`safe_load`, deferring to torch's built-in default (which is `True` for
torch>=2.6)
- This allows users to override via the
`TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD=1` env var when they trust a
checkpoint but hit `pickle.UnpicklingError`
- Add a test that verifies the default fails on unsafe objects and the
env var bypass works
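
The guard-plus-bypass behavior can be illustrated with a toy restricted unpickler; this is a sketch, not the utility's code (torch.load's real `weights_only` machinery is more elaborate), but the env-var interaction is the same idea:

```python
import io
import os
import pickle

class _GuardedUnpickler(pickle.Unpickler):
    """Toy weights_only-style guard: only allow builtins."""
    def find_class(self, module, name):
        if module == "builtins":
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked: {module}.{name}")

def safe_load(data: bytes):
    buf = io.BytesIO(data)
    # User-asserted trust: the same env var torch honors for torch.load.
    if os.environ.get("TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD") == "1":
        return pickle.load(buf)
    return _GuardedUnpickler(buf).load()

class Unsafe:  # stands in for an arbitrary pickled object
    pass

payload = pickle.dumps(Unsafe())
```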

## Test plan

- [x] `python -m pytest tests/unit/torch/utils/test_serialization.py -v`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Serialization utility now respects PyTorch's default behavior and
environment-variable configuration instead of forcibly enforcing
parameter overrides, providing greater configuration flexibility.

* **Tests**
* Added test coverage validating environment-variable override
functionality and default behavior in the serialization utility.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…iliency-ext dependency (#1285)

- `megatron-core==0.17.0` released yesterday which requires nightly
version of `nvidia-resiliency-ext` for an import. Pre-installed version
in DLFW Pytorch container is `nvidia-resiliency-ext==0.5.0`
  - Temporarily pin `mcore<0.17.0` to unblock PR from merging. 
- Pin `pulp<4.0` as it has some breaking changes and release imminent

Correct fix is to just use `nemo:26.04` container instead of PyTorch
container for megatron-based tests since it always has correct
combination of all packages needed for the megatron ecosystem - Done in
#1286

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…batching (#1280)

### What does this PR do?

Type of change: new feature + bug fix

Two improvements to Megatron inference utilities:

**1. Pipeline Parallel (PP) correctness fixes**

PP inference was producing garbage output (MMLU ~0.24, random chance).
Two root causes:

- `megatron_generate` / `megatron_prefill` used
`get_forward_backward_func()` (the training pipeline scheduler), which
is not designed for inference. Rewrote both functions to use explicit
P2P communication via `recv_from_prev_pipeline_rank_` /
`send_to_next_pipeline_rank`, matching the `run_mcore_inference`
pattern.
- `import_mcore_gpt_from_hf` loads HF weights into stage 0's embedding
but never updates the output_layer on the last PP stage when
`share_embeddings_and_output_weights=True`. At model init,
`setup_embeddings_and_output_layer()` all-reduces from stage 0 to sync
the output layer; after importing HF weights that all-reduce is stale.
Fix: call `model.setup_embeddings_and_output_layer()` again after
import.

**2. `megatron_mmlu` speedup (~6x)**

Replaces the `megatron_mmlu` implementation with a significantly faster
approach that matches how `lm-evaluation-harness` scores multiple-choice
questions.

**Before:** autoregressive generation (`megatron_generate`, `osl=2`) per
example, 114 separate `load_dataset` calls, batch_size=1 — 260s for 5%
data.

**After:** single prefill forward pass + argmax over {A,B,C,D} logits, 2
`load_dataset` calls, configurable batch_size — 18s for 5% data (~6x
faster).

### Changes

**PP fixes:**
- `megatron_generate` / `megatron_prefill`: replace
`get_forward_backward_func` with explicit P2P
(`recv_from_prev_pipeline_rank_` / `send_to_next_pipeline_rank`)
- `import_mcore_gpt_from_hf`: call
`model.setup_embeddings_and_output_layer()` after HF weight import when
PP>1 and `share_embeddings_and_output_weights=True`
- `megatron_prefill`: add `skip_return_logits` param and VLM support
(needed for PP non-last stages)

**MMLU speedup:**
- **Log-likelihood scoring**: replace `megatron_generate` with
`megatron_prefill` — one forward pass per batch, no autoregressive
decode loop
- **Global batching**: collect all examples across all subjects, sort by
descending sequence length, run in `batch_size` chunks
- **2 dataset loads** instead of 114: use `load_dataset("cais/mmlu",
"all")` with per-subject grouping; skip dev load when `few_shots=0`
- **`percentage` → `fraction`** parameter rename for clarity
- **tqdm progress bar** (rank-0 only)
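
The batched-prefill scoring above boils down to one argmax over the four choice-letter logits. A minimal sketch, with `prefill_logits` as a stand-in for `megatron_prefill` and toy token ids:

```python
CHOICES = ("A", "B", "C", "D")

def score_example(prefill_logits, prompt: str, choice_token_ids):
    # One forward pass: next-token logits at the end of the prompt.
    # No autoregressive decode loop is needed for multiple choice.
    logits = prefill_logits(prompt)
    best = max(range(len(CHOICES)), key=lambda i: logits[choice_token_ids[i]])
    return CHOICES[best]

# Toy stand-in returning a token_id -> logit mapping.
fake_prefill = lambda prompt: {11: 0.1, 12: 2.3, 13: 0.7, 14: -1.0}
answer = score_example(fake_prefill, "Q: ... Answer:", [11, 12, 13, 14])
```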

### Testing

- `test_megatron_generate_and_mmlu` parametrized over `tp` and `pp`.
Accuracy assertion: `0.36 < score < 0.39`. Manually checked generated
text is coherent.
- Re-ran M-Bridge Minitron MMLU-based pruning for Nano v2 9B -> 7B; all top-10 candidates' MMLU numbers are in the same ballpark as before
### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ❌ — `percentage` parameter
renamed to `fraction`; `enable_kv_cache` removed from `megatron_mmlu`
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ — existing test updated and
parametrized for TP+PP
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅

🤖 Generated with [Claude Code](https://claude.ai/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Improved pipeline-parallel generation and MMLU evaluation reliability;
fixed output-layer synchronization in shared-embedding + pipeline
setups.

* **New Features**
* MMLU scoring now uses batched prefill logit scoring for faster,
batched evaluation.

* **Behavior Changes**
* Default MMLU sampling increased from 5% to 10%; calibration batch
sizing adjusted and related CLI/help text updated.

* **Tests**
* Distributed tests cover tensor- and pipeline-parallel modes and
tighten MMLU validation ranges.

* **Documentation**
* Updated pruning example and benchmark timing to reflect new sampling
and speedup.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### What does this PR do?

Type of change: ? <!-- Use one of the following: Bug fix, new feature,
new example, new tests, documentation. -->

<!-- Details about the change. -->

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added backend-specific GPTQ helper registration to allow
backend-tailored GPTQ behavior.

* **Bug Fixes**
* Prevented KV-cache state from leaking across repeated per-layer
forwards during calibration.

* **Tests**
* Added GPU-focused tests validating GPTQ combined with vector
quantization, including accuracy and end-to-end comparisons.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
## Summary

Adds **performant layerwise calibration** for quantizing large models
(e.g. DeepSeek-R1 671B) that don't fit entirely on GPU. ([Example
commands](#example-commands))

1. **Performant calibration for large models** — Each decoder layer is
moved from CPU/disk to GPU (accelerate) or unsharded (FSDP2) **only
once** and kept on GPU for the entire calibration step. Previously,
every calibration batch triggered weight transfer for every layer —
O(num_batches) weight movements per layer. Now it is O(1) per layer.
This also means you can **increase batch size** since only one layer's
weights occupy GPU at a time — e.g. DeepSeek-R1 on a single node
(8×80GB) with `batch_size=16` and `gpu_max_mem_percentage=0.5`.
2. **Checkpoint save/resume** — Saves progress after each layer, so jobs
that exceed cluster time limits (e.g. 4-hour Slurm windows for 100+
layer MoE models) can resume from the last completed layer.
3. **Rename** `sequential_calibrate` → `layerwise_calibrate` for
clarity.

### Design details

The existing layerwise state machine (skip/run/capture) already
processes one layer at a time, but skip-mode layers still kept their
parameters in the ModuleList — so frameworks transferred all weights
every forward pass. This PR adds:
- **`_SkipLayer`**: replaces fully-calibrated layers with a
parameter-free dummy in the ModuleList, so framework hooks have nothing
to transfer
- **`persistent_materialization`**: keeps the active layer on GPU for
the entire calibration step, avoiding repeated offload/reload cycles

Checkpoint save is per-layer; restore is bulk — quantizer state and
weights for layers 0..K-1 are restored once at the end of calibration,
keeping the hot path fast.
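
The per-layer save / resume flow can be sketched like this; the function name and the JSON checkpoint format are illustrative, not the PR's actual checkpoint layout:

```python
import json
import os

def calibrate_layerwise(layers, ckpt_path, calibrate_one):
    # Resume: skip every layer already recorded as completed.
    last_done = -1
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            last_done = json.load(f)["last_completed_layer"]
    ran = 0
    for i, layer in enumerate(layers):
        if i <= last_done:
            continue  # skip-mode: calibrated in a previous run
        calibrate_one(layer)
        ran += 1
        # Save progress after *each* layer, so a killed job (e.g. a 4-hour
        # Slurm window expiring) loses at most one layer of work.
        with open(ckpt_path, "w") as f:
            json.dump({"last_completed_layer": i}, f)
    return ran
```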

### Example commands

**Qwen3-8B** (NVFP4+GPTQ, single GPU):
```bash
python hf_ptq.py \
    --pyt_ckpt_path Qwen/Qwen3-8B \
    --recipe nvfp4_gptq_sequential.yaml \
    --calib_size 64 \
    --batch_size 16 \
    --dataset cnn_dailymail \
    --export_path outputs/qwen3_8b_nvfp4_gptq_seq \
    --gpu_max_mem_percentage 0.5 \
    --use_seq_device_map \
    --vllm_fakequant_export
```

**DeepSeek-R1** (NVFP4 experts-only + FP8 KV, 8×80GB):
```bash
python hf_ptq.py \
    --model unsloth/DeepSeek-R1-0528-BF16 \
    --recipe ../../modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml \
    --dataset cnn_dailymail \
    --batch_size 16 \
    --calib_size 64 \
    --calib_seq 512 \
    --gpu_max_mem_percentage 0.5 \
    --use_seq_device_map \
    --trust_remote_code \
    --export_path output/DeepSeek-R1-BF16-nvfp4-experts-only-fp8-kv \
    --vllm_fakequant_export
```

### Example: NVFP4+GPTQ layerwise calibration on Qwen3-8B (36 layers,
single GPU — 20 GB peak)

**Initial run** (killed after layer 11):
```
Layerwise calibration: Found 36 transformer layers
Calibrating layer 1/36 | capture: [1]
Computing Hessians for 7 linear layers...
GPTQ time: 51.39s
Calibrating layer 2/36 | run: [1] | capture: [2]
Checkpoint: saved layer 0
GPTQ time: 50.06s
Calibrating layer 3/36 | skip: 1 | run: [2] | capture: [3]
Checkpoint: saved layer 1
...
Calibrating layer 12/36 | skip: 10 | run: [11] | capture: [12]
Checkpoint: saved layer 10
<killed>
```

**Resumed run** (picks up from layer 11, finishes all 36):
```
Layerwise calibration: Found 36 transformer layers
Checkpoint: resuming layerwise calibration from layer 11/36
Calibrating layer 12 (resumed)
GPTQ time: 51.45s
Calibrating layer 13/36 | skip: 11 | run: [12] | capture: [13]
Checkpoint: saved layer 11
...
Calibrating layer 36/36 | skip: 34 | run: [35] | capture: [36]
Checkpoint: saved layer 34
GPTQ time: 50.33s
Checkpoint: saved layer 35 (final)
Checkpoint: restored 11 previously calibrated layers
Layerwise calibration completed
Quantized model exported to: outputs/qwen3_8b_nvfp4_gptq_seq
GPU 0: Peak memory usage = 20.42 GB
```

## TODO
- [ ] Update CHANGELOG

## Test plan
- `tests/unit/torch/quantization/test_layerwise_calibrate.py` — unit
tests for skip/swap/restore
- `tests/unit/torch/quantization/test_sequential_checkpoint.py` —
checkpoint save/resume correctness
- `tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py` —
CPU-offloaded layerwise + GPTQ + checkpoint resume
- `tests/gpu/torch/quantization/test_fsdp2.py` — FSDP2 layerwise
calibration

### Verified
- [x] Qwen3-8B: layerwise calibration + checkpoint save/restore +
fakequantized checkpoint export + vLLM serve
- [x] DeepSeek-R1: checkpoint resume tested
- [x] DeepSeek-R1: fakequantized checkpoint export verified

---------

Signed-off-by: realAsma <akuriparambi@nvidia.com>
## Summary
- **hf_online_dflash.yaml**: Add 100K-sample training config with
regression baselines (B200 loss curve),
`MAX_FINAL_LOSS`/`MIN_FINAL_ACC`/`MIN_ACCEPTANCE_LENGTH` thresholds,
vLLM nightly container for DFlash support
- **vllm_smoke_test.sh**: Parse acceptance length from vLLM server log
for regression check; `pip install pandas` workaround for broken nightly
container; capture server output to temp file
- **query.sh**: Detect vLLM server death during startup (PID liveness
check) + 600s timeout to prevent infinite polling that wastes GPU hours;
`pip install pandas` workaround
- Fix empty `environment:` key in DFlash YAML causing nemo_run
`ListParseError`

## Test plan
- [x] E2E pipeline passed on 8x B200 (training + vLLM smoke test + AR
eval)
- [x] Training regression: final loss 3.82 < 5.0, acc 0.20 > 0.15
- [x] vLLM acceptance length: 1.79 >= 1.4 threshold
- [x] AR evaluation: 2.02 overall on MT-Bench (8 categories)
- [x] Server liveness check prevents GPU waste on vLLM crash

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
  * Added optional regression validation for vLLM acceptance metrics
* Introduced configurable vLLM server startup timeout (default 600
seconds)

* **Improvements**
* Enhanced logging for vLLM server startup with progress tracking and
waited time reporting
* Faster detection of vLLM server process failures during initialization

* **Configuration Updates**
  * Increased training dataset size and logging granularity
  * Scaled tensor parallelism from 4 to 8 across multiple pipelines
  * Expanded PTQ quantization to multi-step pipeline
  * Added configurable training metric thresholds
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
### What does this PR do?

Type of change: new feature, new example <!-- Use one of the following:
Bug fix, new feature, new example, new tests, documentation. -->

<!-- Details about the change. -->

## Summary

- Add skip-softmax sparse attention (BLASST) for diffusion models via
dedicated Triton kernels — an inference kernel with tile skipping and a
calibration kernel with vectorized multi-threshold sparsity measurement
- Add `triton_skip_softmax` method with exponential model calibration
(`scale_factor = a * exp(b * sparsity)`) and log-space fitting for
diffusion models
- Add Triton kernel backends for diffusers and LTX attention dispatch
- Fix calibration to skip RULER dataset generation when user provides
their own `forward_loop` (required for non-LLM models)

## Changes

### Triton kernels (`modelopt/torch/kernels/triton_fa.py`)
- **`_attn_fwd`**: Forward kernel with optional tile skipping — tiles
whose max attention score is far below the running softmax max are
skipped entirely (no V load, no softmax, no accumulation). Runtime
sparsity measurement via atomic counters.
- **`_attn_fwd_calibrate`**: Calibration kernel that computes full
attention while measuring how many tiles would be skipped at each of N
thresholds simultaneously. Uses per-program output buffers (zero atomic
contention) and vectorized multi-threshold comparison.
- **`attention()`** / **`attention_calibrate()`**: Python wrappers for
inference and calibration kernels.

### Kernel backends
(`modelopt/torch/sparsity/attention_sparsity/kernels/`)
- **`diffusers_triton_attention.py`**: Registers `modelopt_triton`
backend in diffusers' attention dispatch. Handles [B, S, H, D] → varlen
layout conversion, calibration/inference mode switching, thread-local
configuration, and counter accumulation.
- **`ltx_triton_attention.py`**: Patches `ltx_core.Attention` modules
for Triton dispatch with the same calibration/inference modes.

### Method
(`modelopt/torch/sparsity/attention_sparsity/methods/triton_skip_softmax.py`)
- `TritonSkipSoftmaxMethod`: Context managers for calibration (→
calibration kernel) and inference (→ forward kernel with tile skipping).
Three threshold priority levels: raw threshold > calibrated scale_factor
> static threshold.

### Calibration
(`modelopt/torch/sparsity/attention_sparsity/calibration/`)
- **`calibrator.py`**: `DynamicThresholdCalibrator` with `fit_logspace`
option — fits exponential model in log space (minimizes relative error)
for diffusion models where scale_factors span many orders of magnitude.
Records observed sparsity range for extrapolation warnings.
- **`calibrate.py`**: Skips RULER dataset when `forward_loop` is
provided; passes `fit_logspace` through from config.

### Config & conversion
- **`config.py`**: `CalibrationConfig.fit_logspace` field (default
False, recommended True for diffusion models).
`skip_softmax_raw_threshold` field for direct threshold mode.
- **`conversion.py`**: Auto-registers diffusers/LTX Triton backends on
`sparsify()`. Updated summary display.

### Example
- **`wan22_skip_softmax.py`**: End-to-end example for WAN 2.2 5B/14B
with baseline, raw-threshold, and calibrated modes. Supports runtime
sparsity reporting.

## Threshold modes

| Mode | How it works | Use case |
|------|-------------|----------|
| **Raw threshold** (`--raw-threshold -0.7`) | Passed directly to kernel
as `skip_threshold_log2` | Quick testing, sweeps |
| **Calibrated** (`--calibrate --target-sparsity 0.5`) | `scale_factor =
a * exp(b * target)`, then `threshold = scale_factor / seq_k` at runtime
| Production use with seqlen adaptation |
| **Static** (default `skip_softmax_threshold=0.1`) | `log2(lambda) *
sm_scale` | Fallback |
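
The three-way priority in the table can be sketched as a small resolver; names are illustrative and mirror the rows above:

```python
import math

def resolve_skip_threshold(seq_k, sm_scale,
                           raw_threshold=None, scale_factor=None,
                           static_threshold=0.1):
    # Priority: raw threshold > calibrated scale_factor > static fallback.
    if raw_threshold is not None:
        return raw_threshold  # passed straight to the kernel
    if scale_factor is not None:
        return scale_factor / seq_k  # seqlen-adaptive calibrated mode
    return math.log2(static_threshold) * sm_scale  # static fallback
```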

## Usage

```bash
# Fixed raw threshold (no calibration)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --raw-threshold -0.7 \
    --prompt "A cat playing piano" --output out.mp4

# With calibration (log-space fit for diffusion models)
python examples/diffusers/sparsity/wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --calibrate --target-sparsity 0.5 \
    --prompt "A cat playing piano" --output out.mp4

# Dense baseline for comparison
python examples/diffusers/sparsity/wan22_skip_softmax.py \
    --model-path /path/to/Wan2.2-T2V-A14B-Diffusers \
    --baseline \
    --prompt "A cat playing piano" --output baseline.mp4
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
❌ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added skip-softmax sparse attention support for Diffusers models,
enabling efficient video generation
* Added support for both eager and Triton attention backends for sparse
attention
* Added new example script for Wan 2.2 text-to-video generation with
sparse attention optimization

* **Documentation**
* Updated documentation with sparse attention configuration guide and
usage examples

* **Tests**
* Added comprehensive unit tests for kernel backend registration and
skip-softmax functionality
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?

Type of change: Bugfix

<!-- Details about the change. -->

Add newly added quant configs to the example PTQ script.

### Testing

I have locally run auto_quantize with these two quant_configs, and
obtained successfully exported HF artifacts.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added support for three new quantization formats: nvfp4_mse,
nvfp4_local_hessian, and nvfp4_experts_only, expanding available export
options when using auto-quantize.

* **Bug Fixes / UX**
* Updated the invalid-quantization error message to include the newly
accepted format identifiers.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Bilal Kartal <bkartal@nvidia.com>
Signed-off-by: bkartal-dev <bkartal@nvidia.com>
## Summary
- Add end-to-end ResNet50 support in the torch_onnx quantization → ONNX
export → TRT engine pipeline
- Fix multiple Conv2d-related export issues that blocked Conv2d-heavy
models from working with FP8/INT8/MXFP8/NVFP4/auto quantization modes
- Fix `configure_linear_module_onnx_quantizers` to handle all modules
with block quantization (not just `nn.Linear`), fixing NVFP4/MXFP8
export for models with quantized non-Linear modules
- Add `--trt_build` flag to `torch_quant_to_onnx.py` and simplify test
infrastructure

### Files Changed
- `modelopt/torch/_deploy/utils/torch_onnx.py` — Disable FP8 Conv2d
weight quantizers and autocast during ONNX export
- `modelopt/torch/quantization/export_onnx.py` — Fix
`configure_linear_module_onnx_quantizers` for all module types with
block quantization
- `examples/torch_onnx/torch_quant_to_onnx.py` — Add `--trt_build` flag,
calibration for FP8 override quantizers, Conv2d→FP8 override for auto
mode, filter_func updates
- `examples/torch_onnx/README.md` — Add ResNet50 to supported models
table
- `tests/examples/torch_onnx/test_torch_quant_to_onnx.py` — Add ResNet50
test entry, simplify using `--trt_build`
- `tests/_test_utils/torch/vision_models.py` — Add ResNet50 to timm
model registry

### Quantization modes passing
- ✅ FP8, INT8, MXFP8, NVFP4, Auto (all 5 modes pass export + TRT build)
- INT4_AWQ excluded (pre-existing limitation for all models)

## Test plan
- [x] All 5 resnet50 test modes pass: `pytest
tests/examples/torch_onnx/test_torch_quant_to_onnx.py -k resnet50` (5/5
passed)
- [x] Full regression: 18 passed, 2 failed (pre-existing swinv2_tiny
fp8/int8 failures)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added ResNet50 to supported ONNX export vision models with FP8, INT8,
MXFP8, and NVFP4 support.
  * Optional TensorRT engine build after export via a new CLI flag.

* **Improvements**
* Enhanced quantization calibration and export flows for FP8/INT8
models, including broader block-quantization support across module types
and safer export handling.
  * Tests updated to include ResNet50 in the model matrix.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: ajrasane <arasane@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…er/experimental/DMS) (#879)

## What does this PR do?

**Type of change:** ? <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

**Overview:** ?

## Usage
<!-- You can potentially add a usage example below. -->

```python
# Add a code snippet demonstrating how to use this
```

## Testing
<!-- Mention how have you tested your change if applicable. -->

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes/No <!--- If No, explain
why. -->
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes/No <!--- Only for new features, API changes, critical bug fixes or
bw breaking changes. -->

## Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Updated DMS installation instructions to reflect the repository
structure and correct directory navigation during setup.
* Clarified the setup steps so users follow the accurate directory
change before running installation commands.
* Small wording improvements to reduce confusion during the installation
process.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Farid Adilazuarda <42537562+faridlazuarda@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Replace the `mip` package with the more popular `pulp` package for puzzle
MIP solving. Both use the CBC solver under the hood.
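For reference, a minimal sketch of solving a small 0/1 selection problem with pulp's default CBC backend (the same solver the replaced `mip` package used). The variable names and data here are illustrative, not from the repo:

```python
import pulp

# Toy knapsack: maximize value subject to a cost budget.
values = [10, 6, 4]
costs = [5, 4, 3]
budget = 8

prob = pulp.LpProblem("knapsack", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(len(values))]
prob += pulp.lpSum(v * xi for v, xi in zip(values, x))           # objective
prob += pulp.lpSum(c * xi for c, xi in zip(costs, x)) <= budget  # constraint
prob.solve(pulp.PULP_CBC_CMD(msg=False))                         # CBC, quiet
selected = [i for i, xi in enumerate(x) if xi.value() == 1]
```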

## Testing

- Results very close for Qwen3-8B and Nemotron-Nano-12B-v2

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Simplified GPU test environment setup by removing unnecessary system
dependency installation
* Updated internal optimization solver dependencies in the puzzletron
module

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…plify CI workflows (#1286)

### What does this PR do?

Type of change: New feature / infrastructure improvement

Follow-up to #1285 for correct CI test environment for megatron based
tests

Replaces `tox` + `tox-current-env` with `nox` for all test, lint, docs,
and wheel build sessions. The primary motivation was that
`tox-current-env` is incompatible with uv venvs in NGC containers (e.g.
NeMo's `/opt/venv`) — it picks the system Python via
`sys._base_executable` instead of the container's venv Python which has
megatron packages pre-installed.

Key changes:
- **`noxfile.py`** replaces `tox.ini` with GPU, CPU unit,
partial-install, pre-commit, docs, and wheel sessions
- **GPU sessions** use `venv_backend="none"` (run directly in container
env) and `python -m pip/pytest` to avoid PATH mismatches
- **uv** is set as the default venv backend (if available) for CPU
sessions (faster installs)

Also includes CI workflow simplifications:
- **`_pr_gate.yml`** new reusable workflow centralizing file-change
detection + linux-check wait logic (was duplicated across 3 workflow
files)
- **Collapsed pr/non-pr job pairs** into single jobs with conditional
`runs-on` in `gpu_tests.yml`, `example_tests.yml`,
`regression_tests.yml`
- **Collapsed `multi-py` / `multi-torch` / `multi-transformers`** into a
single `multi-version` matrix job in `unit_tests.yml`
- **PR path filtering** for unit test secondary jobs (multi-version,
launcher, partial-install) — skipped if no relevant files changed
- **Fixed schedule/workflow_dispatch skipping** — jobs with `needs:
[pr-gate]` were incorrectly skipped when all pr-gate internal jobs were
skipped; fixed by making the gate job always run
- **multi-version, launcher, partial-install** now also run on
`schedule` / `workflow_dispatch`

### Usage

```bash
python -m pip install nox uv                                                    # install nox and uv (once)
nox -l                                                                          # list all sessions
nox -s gpu_megatron                                                             # run a GPU session (inside container)
nox -s "unit-3.12(torch_211, tf_latest)"                                        # run a specific unit test combination
nox -s "unit-3.12(torch_211, tf_latest)" -R                                     # force-recreate venv (e.g. after dep changes)
COVERAGE_PROCESS_START=pyproject.toml nox -s "unit-3.12(torch_211, tf_latest)"  # with coverage
```

### Testing
- Ran `nox -l` to verify all session names
- Ran `gpu_megatron` session locally inside NeMo container — confirmed
it uses `/opt/venv/bin/python` correctly
- Manually triggered nightly-runs:
- Unit:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608013657
- GPU:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608018763
- Examples:
https://github.com/NVIDIA/Model-Optimizer/actions/runs/24608017322

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: N/A — CI infrastructure only
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ (added `nox`
and `uv` to `dev-test`, both Apache-2.0)
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A — no user-facing changes

### Additional Information
Supersedes the tox-current-env workaround in the parent branch.

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
Automated weekly update of uv.lock file for nSpect Scanning:
- `uv.lock` — upgraded all transitive dependencies to latest compatible
versions

Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
### What does this PR do?

Type of change: ? <!-- Use one of the following: Bug fix, new feature,
new example, new tests, documentation. -->
Add a standalone monitor skill for persistent job tracking across
sessions, and integrate it with PTQ, evaluation, and deployment skills.

Problem: Each skill had ad-hoc inline monitoring (squeue polling, nel
status checks) that didn't survive session restarts and couldn't track
multiple jobs. Users had to manually ask "check status" every time.

Solution: A centralized monitor skill with:
- Job registry (.claude/active_jobs.json): single source of truth for
all active jobs
- Durable recurring cron: polls every 15 min, survives session restarts,
self-cleans when all jobs complete
- User-initiated mode: works in new conversations by reading the
registry
- Aggregated reporting: "2 of 4 completed" instead of per-job noise

### Usage
After any skill submits a job, the monitor skill automatically:

1. Registers the job in .claude/active_jobs.json
2. Sets up a durable cron to poll status every 15 minutes

User can also trigger manually:
- "check my eval status" → reads registry, reports current state
- "is the PTQ done?" → finds job, checks status
- "what jobs are running?" → lists all registered jobs
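The user-initiated mode amounts to reading the registry and aggregating. A hedged sketch — the registry path is from this PR, but the schema (a list of jobs with a `status` field) and function name are assumptions:

```python
import json
from collections import Counter
from pathlib import Path

def report(registry_path: str = ".claude/active_jobs.json") -> str:
    """Aggregate job statuses from the registry into one summary line."""
    path = Path(registry_path)
    if not path.exists():
        return "No active jobs registered."
    jobs = json.loads(path.read_text())  # assumed: a list of job dicts
    counts = Counter(job.get("status", "unknown") for job in jobs)
    done = counts.get("completed", 0)
    return f"{done} of {len(jobs)} completed ({dict(counts)})"
```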

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added monitor skill for tracking SLURM jobs, NEL evaluations, and
launcher experiments with persistent job registry.

* **Documentation**
* Updated deployment, evaluation, and PTQ documentation to use the new
monitor skill.
  * Simplified diagnostic and troubleshooting instructions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kai Xu <kaix@nvidia.com>
### What does this PR do?

Type of change: new feature <!-- Use one of the following: Bug fix, new
feature, new example, new tests, documentation. -->

- Add Conv3D implicit GEMM kernel with BF16 WMMA tensor cores and fused
NVFP4 activation quantization for video diffusion VAE layers
- Integrate into _QuantConv3d via QuantModuleRegistry — automatically
dispatched when NVFP4 quantization is applied to nn.Conv3d
- Move kernel from `experimental/conv/` to `modelopt/torch/kernels/conv/`;
move tests to `tests/gpu/torch/quantization/kernels/`

### Testing
<!-- Mention how have you tested your change if applicable. -->

- Added test cases to measure the difference between cuDNN and our CUDA
implicit GEMM kernel
- Added an NVFP4 fake quantization test using CUDA code

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ <!--- If ❌, explain why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ <!---
Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Per-backbone quantization/export in a single run with per-backbone
checkpoints and backbone-aware quant filters
* Configurable NVFP4 block-size via CLI/config; improved NVFP4 Conv3D
inference path and Wan 2.2 quantization support
* **Bug Fixes**
* Video-model calibration now respects extra params and forces video
decoding during calibration
* **Documentation**
* Added comprehensive Conv3D implicit‑GEMM kernel documentation; removed
experimental Conv3D prototype docs/benchmark
* **Tests**
* New Wan 2.2 quantization/export tests and expanded Conv3D/FP4 kernel
test coverage
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
### What does this PR do?

Type of change: Bug

Enables end-to-end AWQ checkpoint export and reload in the vLLM
fake-quant serving path (`MODELOPT_STATE_PATH`). Previously, the exported
`input_quantizer` used an incorrect `pre_quant_scale`, especially with
grouped quantizers like `qkv_proj`, where simply the first
`input_quantizer.pre_quant_scale` was taken. This PR adds
`_resmooth_experts_for_export`, which non-mutatively averages
`pre_quant_scale` across MoE experts and unifies input `_amax`, required
because vLLM uses a single input quantizer per expert group. It also adds
`merge_amax_tensors_for_group` (element-wise max for same-shape tensors,
`cat` for GQA, scalar-max fallback), replacing the scalar-collapsing
`torch.stack().max()` that dropped per-channel `_amax` structure.
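The merge policy can be illustrated with a simplified standalone sketch (this is not the real `merge_amax_tensors_for_group`, just the three cases it distinguishes):

```python
import torch

def merge_amax(tensors: list[torch.Tensor]) -> torch.Tensor:
    """Merge per-quantizer amax tensors for a shared-input group."""
    shapes = {tuple(t.shape) for t in tensors}
    if len(shapes) == 1 and tensors[0].ndim > 0:
        # Same shape: element-wise max preserves per-channel structure.
        out = tensors[0]
        for t in tensors[1:]:
            out = torch.maximum(out, t)
        return out
    if all(t.ndim > 0 for t in tensors):
        # Different shapes (e.g. GQA q/k/v head counts): concatenate.
        return torch.cat(tensors, dim=0)
    # Scalar fallback: collapse to a single max.
    return torch.stack([t.max() for t in tensors]).max()
```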

### Usage

```python
# Export AWQ checkpoint from HF model
from modelopt.torch.export.plugins.vllm_fakequant_hf import export_hf_vllm_fq_checkpoint

export_hf_vllm_fq_checkpoint(model, export_dir="./awq_vllm_checkpoint")
```

### Testing
**Step 1 — Export the quantized checkpoint:**

```bash
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <MODEL_PATH> \
    --recipe <AWQ_RECIPE> \
    --calib_size 512 \
    --export_path <EXPORT_DIR> \
    --vllm_fakequant_export
```

This produces `<EXPORT_DIR>/vllm_fq_modelopt_state.pth` with the averaged
per-expert `pre_quant_scale` and unified `_amax` now included.

**Step 2 — Serve via vLLM fakequant worker:**

```bash
MODELOPT_STATE_PATH=<EXPORT_DIR>/vllm_fq_modelopt_state.pth \
python examples/vllm_serve/vllm_serve_fakequant.py \
    <EXPORT_DIR> --tensor-parallel-size <TP>
```

Tested for quantization configurations:
```
FP8_DEFAULT_CFG
FP8_DEFAULT_CFG (input_q disabled)
INT8_SMOOTHQUANT_CFG
INT8_WEIGHT_ONLY_CFG
NVFP4_DEFAULT_CFG
NVFP4_AWQ_LITE_CFG
INT4_AWQ_CFG
NVFP4_AWQ_CFG
NVFP4_DEFAULT_CFG (input_q disabled)
```

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ 
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A 
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A 

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Added Nemotron-style MoE export support and group-aware AWQ resmoothing with optional requantization during export.
  * Improved handling for shared-input / expert groups and tensor-parallel sharding of pre-quantization scales.

* **Bug Fixes**
  * Removed AWQ reload limitation from known issues; improved checkpoint validation and safer save/load behavior.
  * Better detection and handling of enabled weight-quantizers and clearer warnings for mismatched checkpoint keys.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
### What does this PR do?

Type of change: ? <!-- Use one of the following: Bug fix, new feature,
new example, new tests, documentation. -->

<!-- Details about the change. -->

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing
<!-- Mention how have you tested your change if applicable. -->

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a public backend-specific calibrator registration API to support
FP8 scale-sweep calibration, allowing backends to supply custom
calibrators used during FP8 tuning.

* **Tests**
* Added unit tests confirming registry insertion/overwrite, that
registered calibrators are invoked when FP8 scale-sweep is enabled, are
not invoked when disabled, and that calibration falls back to defaults
when no backend is registered.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary
- `datasets`' `resolve_pattern` only matches entries with
`type=="file"`, so passing a bare directory path as `data_files` to
`load_dataset` results in `FileNotFoundError` even when the directory
exists on disk
- Detect directory paths in `ShardedDataset._load_dataset()` and pass
them via `data_dir` instead of `data_files`

## Reproduction
```python
from datasets import load_dataset
# This fails with FileNotFoundError:
load_dataset("json", data_files="/path/to/data_directory")
# This works:
load_dataset("json", data_dir="/path/to/data_directory")
```
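The fix can be sketched as a small routing helper. The helper name is illustrative; the real change lives inside `ShardedDataset._load_dataset()`:

```python
import os

def dataset_kwargs(path_or_files):
    """Return load_dataset kwargs, routing bare directories via data_dir."""
    if isinstance(path_or_files, str) and os.path.isdir(path_or_files):
        # resolve_pattern only matches type=="file" entries, so a bare
        # directory must go through data_dir instead of data_files.
        return {"data_dir": path_or_files}
    # File paths, lists, and glob patterns fall through to data_files.
    return {"data_files": path_or_files}

# usage: load_dataset("json", **dataset_kwargs(user_path))
```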

## Test plan
- [ ] Verify existing EAGLE3/DFlash training pipelines that pass
directory paths work
- [ ] Verify file path and glob patterns still work (falls through to
`data_files`)
- [ ] Verify `data_files=None` (no data_files arg) still works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Bug Fixes

* Fixed an issue with dataset loading that prevented proper handling of
directory-based data sources. Directories are now correctly detected and
processed during dataset initialization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
#1293)

## Summary
- **megatron_lm_ptq.yaml**: Qwen3-8B PTQ to single GPU for L40 clusters
(TP=1, all tasks)
- **quantize.sh**: Auto-find largest PP dividing model's
`num_hidden_layers` for export step. Qwen3-8B has 36 layers which isn't
divisible by 8, causing `AssertionError` on 8-GPU nodes
- **compute_hidden_states_trtllm.py**: Use `messages` with
`conversations` fallback, matching the HF version. Fixes `KeyError:
'conversations'` when data uses OpenAI `messages` format
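The PP auto-selection amounts to picking the largest pipeline-parallel size that both divides `num_hidden_layers` and fits the available GPUs. A sketch (the function name is illustrative; the real logic is shell code in `quantize.sh`):

```python
def auto_pp(num_hidden_layers: int, num_gpus: int) -> int:
    """Largest PP <= num_gpus that evenly divides the layer count."""
    for pp in range(num_gpus, 0, -1):
        if num_hidden_layers % pp == 0:
            return pp
    return 1

# Qwen3-8B has 36 layers:
# auto_pp(36, 8) == 6, auto_pp(36, 4) == 4, auto_pp(36, 1) == 1
```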

## Test plan
- [x] Qwen3-8B PTQ runs on single L40 GPU
- [x] Export PP auto-selects valid divisor (36 layers → PP=6 on 8 GPUs,
PP=4 on 4 GPUs, PP=1 on 1 GPU)
- [x] EAGLE3 offline pipeline reads data with `messages` field

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
* Dataset input handling now supports multiple field formats for
enhanced compatibility.

* **Bug Fixes**
* Optimized GPU resource allocation during model quantization with
improved pipeline parallelism computation.
* Updated quantization configuration for more efficient resource
utilization.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- When `dp_shard_size < world_size` (e.g., `dp_shard_size=4` on 8 GPUs
across 2 nodes), `ParallelismConfig` raises `total_size (4) does not
match num_processes (8)` because `dp_replicate_size` defaults to 1
- Auto-compute `dp_replicate_size = world_size // (dp_shard_size *
cp_size)` so intra-node FSDP2 sharding + inter-node data-parallel
replication works without manual config
- This enables `dp_shard_size` to be set to per-node GPU count (better
NVLink utilization) while automatically creating replicas across nodes
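The auto-computation above can be sketched as follows. Names follow the PR description; the real change is in the speculative decoding example's `ParallelismConfig` setup:

```python
def auto_dp_replicate(world_size: int, dp_shard_size: int, cp_size: int = 1) -> int:
    """Replicas needed so shard groups tile the full world size."""
    assert world_size % (dp_shard_size * cp_size) == 0, \
        "dp_shard_size * cp_size must divide world_size"
    return world_size // (dp_shard_size * cp_size)

# 8 GPUs across 2 nodes with dp_shard_size=4 (one shard group per node):
# auto_dp_replicate(8, 4) == 2 replicas across nodes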

## Test plan
- [ ] Verify single-node training (dp_shard_size == world_size,
dp_replicate_size == 1) unchanged
- [ ] Verify multi-node with dp_shard_size < world_size creates correct
replica groups
- [ ] Verify existing EAGLE3/DFlash configs still work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Enhanced parallelism configuration initialization in the speculative
decoding example to better handle distributed training scenarios.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### What does this PR do?

Add a fused GPTQ kernel to improve quantization speed.

### Usage

See the added unit test for usage.

### Testing
Added a unit test comparing fused vs. unfused GPTQ.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain
why. -->
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
<!--- Mandatory -->
- Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory
for new features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes
or backward incompatible changes. -->

### Additional Information
<!-- E.g. related issue. -->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Fused GPTQ backend for faster blockwise weight updates, toggleable via
a new "fused" option.
  * Shared NVFP4 quantization primitives exposed for reuse.

* **Refactor**
* Consolidated FP4 scale/quantization logic into reusable utilities and
centralized Hessian inversion handling.

* **Tests**
* Expanded GPU tests comparing fused vs unfused GPTQ, added
Triton-availability gating and a local benchmark entrypoint.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Shiyang Chen <shiychen@nvidia.com>
### What does this PR do?

Type of change: Bug fix

Fixes gh-pages branch bloat that grew from ~26 MB to ~441 MB in four
weeks (nvbug 6099503). Three compounding causes were identified and
addressed:

1. **Sphinx `.doctrees/` cache published to gh-pages** — `sphinx-build`
was writing its build cache inside `build/html/` which was then uploaded
verbatim. Accounts for ~3.3 GB uncompressed across history.
2. **`JamesIves/github-pages-deploy-action` appending a commit on every
push** — main-site files accumulated forever with `single-commit: false`
(default).
3. **PR preview deploying on every `synchronize` event for all PRs** —
`rossjrw/pr-preview-action` re-deployed the full site for every push to
any PR regardless of whether docs changed (e.g. PR #1128 triggered 64
preview deploys × ~11 MB each).

Changes:
- Pass `-d /tmp/doctrees` to `sphinx-build` so `.doctrees/` is never
written into `build/html/`
- Add `paths: [docs/**, modelopt/**]` filter to `pull_request` trigger
so the docs workflow only runs on PRs that touch docs or source code
- Set `single-commit: true` on the deploy action so main-site pushes
squash into one commit
- Deduplicate docs build: `deploy-preview` now downloads the artifact
from `build-docs` instead of running a second `sphinx-build`
- Set `retention-days: 1` on the artifact since it is only needed for
the duration of the workflow run
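The changes above can be sketched as a workflow fragment (job and step names are illustrative, not necessarily those in the actual `.github/workflows` file):

```yaml
on:
  pull_request:
    paths: ["docs/**", "modelopt/**"]   # only run when docs or source change

jobs:
  build-docs:
    runs-on: ubuntu-latest
    steps:
      # Keep the Sphinx cache out of the published tree
      - run: sphinx-build -b html -d /tmp/doctrees docs/source build/html
      - uses: actions/upload-artifact@v4
        with:
          name: docs-html
          path: build/html
          retention-days: 1             # only needed within this run

  deploy:
    needs: build-docs
    steps:
      # Reuse the artifact instead of running a second sphinx-build
      - uses: actions/download-artifact@v4
        with:
          name: docs-html
      - uses: JamesIves/github-pages-deploy-action@v4
        with:
          folder: docs-html
          single-commit: true           # squash main-site history into one commit
```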

The one-time cleanup (force-push squashed orphan to gh-pages) was
already applied separately — repo is now ~59 MB for a full clone vs ~441
MB before.

### Usage

N/A — CI/workflow change only.

### Testing

- Workflow logic reviewed manually.
- The one-time cleanup was verified: `git rev-list --objects
--disk-usage origin/gh-pages` now reports ~28 MB; full clone is ~59 MB.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

### Additional Information

nvbug 6099503

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Optimized documentation build and deployment workflow in CI/CD
pipeline.
* Improved pull request documentation preview handling with faster build
timeouts and refined artifact management.
* Enhanced GitHub Pages deployment configuration for better consistency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
- Use latest containers for testing in CI/CD

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Bumped TensorRT-LLM Docker images to 1.3.0rc12 in example and GPU test
workflows.
  * Updated PyTorch container image from 26.01 to 26.03 for GPU tests.
* Captured uv lock upgrade output to a temp file, inlined it into PR
bodies, and adjusted workflow heredoc/templating and step behavior.

* **Documentation**
* Clarified an inline comment and simplified a warning message for an
ONNX quantization extension.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@grzegorz-k-karch grzegorz-k-karch requested review from a team as code owners April 22, 2026 08:08
@grzegorz-k-karch grzegorz-k-karch requested review from kevalmorabia97 and removed request for a team April 22, 2026 08:08
@copy-pr-bot

copy-pr-bot Bot commented Apr 22, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@grzegorz-k-karch grzegorz-k-karch requested review from ajrasane, gcunhase, meenchen and realAsma and removed request for a team April 22, 2026 08:08
@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

> **Caution**
>
> Review failed: the pull request was closed or merged during review.

## 📝 Walkthrough

This pull request introduces major enhancements to quantization and sparsity workflows, migrates CI/CD from tox to nox, and adds comprehensive example documentation. Key additions include layerwise calibration support (replacing sequential calibration), skip-softmax sparse attention for video models, fused GPTQ Triton kernels, and NVFP4 Conv3D implicit GEMM optimizations. The change involves 100+ files across quantization, sparsity, testing, CI, and example code.

## Changes

| Cohort / File(s) | Summary |
|---|---|
| **Layerwise Calibration System**<br>`modelopt/torch/quantization/utils/layerwise_calib.py`, `modelopt/torch/quantization/utils/activation_collector.py`, `modelopt/torch/quantization/model_calib.py`, `modelopt/torch/quantization/config.py`, `modelopt/torch/quantization/mode.py` | Replaced sequential calibration with stateful layerwise calibration. Introduced `LayerActivationCollector`, per-layer checkpointing via `_CheckpointState` with manifest-based resume, and a `layerwise_calibrate` function. Added `layerwise_checkpoint_dir` config and capability gates for algorithms supporting layerwise mode. |
| **Fused GPTQ Kernels**<br>`modelopt/torch/quantization/triton/gptq_fused_kernel.py`, `modelopt/torch/quantization/triton/nvfp4_quant.py`, `modelopt/torch/quantization/utils/calib_utils.py` | Implemented Triton-based fused GPTQ for scalar blockwise weight updates. Added composable NVFP4 primitives (`fp4_round_magnitude`, `nvfp4_scalar_quant`, `fp8_quantize_scale`). Refactored `GPTQHelper` to support fused mode with a backend registry. |
| **Skip-Softmax Sparse Attention**<br>`modelopt/torch/kernels/triton_fa.py`, `modelopt/torch/sparsity/attention_sparsity/kernels/diffusers_triton_attention.py`, `modelopt/torch/sparsity/attention_sparsity/kernels/ltx_triton_attention.py`, `modelopt/torch/sparsity/attention_sparsity/methods/triton_skip_softmax.py` | Added skip-softmax KV tile skipping in Triton flash attention with optional runtime sparsity measurement and a calibration mode. Implemented Diffusers and LTX-2 backend wrappers with thread-local context management. Added a calibration kernel `attention_calibrate` for multi-threshold sparsity statistics collection. |
| **NVFP4 Conv3D Kernel**<br>`modelopt/torch/quantization/nn/modules/quant_conv.py`, `modelopt/torch/quantization/src/conv/implicit_gemm_kernel.cu`, `modelopt/torch/quantization/src/conv/implicit_gemm_kernel.py`, `modelopt/torch/quantization/src/conv/bench_implicit_gemm.py` | Added an inference-only NVFP4 implicit GEMM CUDA kernel for Conv3D. Routes quantized Conv3D to the kernel when NVFP4 quantizers are enabled (`groups == 1`), falling back to cuDNN for grouped convolutions and training. Included an SM80+ conditional compilation guard. |
| **Sparsity Calibration & Config**<br>`modelopt/torch/sparsity/attention_sparsity/calibration/calibrate.py`, `modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py`, `modelopt/torch/sparsity/attention_sparsity/config.py` | Added log-space exponential fitting for threshold calibration (`fit_logspace` option). Introduced `skip_softmax_raw_threshold` to pass kernel thresholds directly. Updated `DynamicThresholdCalibrator` to support dual fitting modes and a new target sparsity range ([0.3, 0.8]). Made the tokenizer import lazy. |
| **vLLM Export Enhancements**<br>`modelopt/torch/export/plugins/vllm_fakequant_hf.py`, `modelopt/torch/export/plugins/vllm_fakequant_megatron.py` | Expanded `export_hf_vllm_fq_checkpoint` with resmoothing for AWQ experts, weight-quantizer folding, and an in-place memory-efficient mode. Added quantizer prefix remapping, weight-quantizer state filtering via regex, and GPTQ sequential layerwise support with a new `inplace_mem_efficient` parameter. |
| **Diffusers Quantization**<br>`examples/diffusers/quantization/quantize_config.py`, `examples/diffusers/quantization/quantize.py`, `examples/diffusers/quantization/models_utils.py`, `examples/diffusers/quantization/pipeline_manager.py`, `examples/diffusers/quantization/utils.py`, `examples/diffusers/quantization/calibration.py` | Replaced single-backbone with multi-backbone quantization support. Added backbone-specific VAE filter functions. Changed `PipelineManager` to use a cached LTX-2 video decoder. Updated `quantize.py` to iterate backbones, apply per-backbone configs, and optionally skip VAE-related checks. Added a `block_size` parameter for NVFP4. |
| **Diffusers Sparsity Examples**<br>`examples/diffusers/sparsity/wan22_skip_softmax.py`, `examples/diffusers/sparsity/README.md`, `examples/diffusers/README.md` | Added a Wan 2.2 skip-softmax sparse attention example script with calibration, runtime measurement, and baseline modes. Documented BLASST-based tile skipping, dual runtime threshold modes, and known issues. Updated the main README with sparse-attention and NVFP4 VAE PTQ subsections. |
| **LLM PTQ Enhancements**<br>`examples/llm_ptq/hf_ptq.py`, `examples/llm_ptq/example_utils.py`, `examples/llm_ptq/scripts/huggingface_example.sh` | Added `nvfp4_local_hessian` quantization format support. Implemented layerwise checkpoint-dir resolution via a unique model-hash suffix. Added automatic checkpoint-directory updates when layerwise calibration is detected. Included helper functions `needs_checkpoint_path_update` and `resolve_checkpoint_dir`. |
| **Megatron Utilities**<br>`modelopt/torch/utils/plugins/megatron_mmlu.py`, `modelopt/torch/utils/plugins/megatron_generate.py`, `modelopt/torch/utils/plugins/transformers_dataset.py` | Refactored MMLU evaluation to logit-based scoring with batching and dynamic fraction sampling. Updated `megatron_prefill`/`megatron_generate` to use direct `model(...)` calls instead of `get_forward_backward_func`, with explicit PP rank communication. Fixed dataset directory loading in `ShardedDataset`. |
| **Torch ONNX Enhancements**<br>`modelopt/torch/_deploy/utils/torch_onnx.py`, `modelopt/onnx/export/fp8_exporter.py`, `modelopt/onnx/utils.py` | Added FP8-specific Conv weight quantizer disabling during export. Implemented FP8 weight `DequantizeLinear` insertion in ONNX graphs. Added cast-folding utilities (`fold_dq_fp32_to_fp16_casts`, `fold_qdq_scale_fp16_to_fp32_casts`) for optimized scale handling. |
| **Torch Quantization Plugins**<br>`modelopt/torch/quantization/plugins/accelerate.py`, `modelopt/torch/quantization/plugins/huggingface.py`, `modelopt/torch/quantization/plugins/diffusion/diffusers.py` | Generalized accelerate offload-hook handling for multi-hook chains. Added a `_QuantDiffusersWanCausalConv3d` quantized module wrapper with implicit GEMM routing. Refactored FSDP2 weight access to iterate all DTensor parameters with redistribution. Added a `persistent_materialization` context manager. |
| **Torch Export Utilities**<br>`modelopt/torch/export/layer_utils.py`, `modelopt/torch/export/plugins/megatron_importer.py`, `modelopt/torch/export/unified_export_hf.py`, `modelopt/torch/export/unified_export_megatron.py` | Extended MoE detection for Nemotron-HF models. Renamed `_collect_shared_input_modules` to a public API. Added embeddings re-synchronization for shared-weight Megatron models with pipeline parallelism. Updated rank gating to require both PP rank 0 and TP rank 0. |
| **Core Utilities**<br>`modelopt/torch/quantization/utils/core_utils.py`, `modelopt/torch/utils/network.py`, `modelopt/torch/utils/dataset_utils.py`, `modelopt/torch/utils/logging.py`, `modelopt/torch/utils/serialization.py` | Enhanced parameter setting via dotted names. Improved accelerate hook detection for execution-device queries. Added KV-cache disabling in calibration loops. Made `print_rank_0` flush configurable. Updated `safe_load` to document `weights_only` behavior and the env-var override mechanism. |
| **Build & CI Tooling**<br>`noxfile.py`, `.github/workflows/...`, `pyproject.toml`, `.github/CODEOWNERS`, `.github/codecov.yml`, `.vscode/settings.json` | Added `noxfile.py` with sessions for unit tests, GPU tests, code quality, docs, and wheel builds. Migrated 11 GitHub workflows from tox to nox (`example_tests`, `gpu_tests`, `unit_tests`, `release`, `code_quality`, `pages`, etc.). Created a reusable `_pr_gate.yml` workflow for file-change gating. Updated CODEOWNERS and codecov thresholds. Removed tox references. |
| **Quantization Recipes**<br>`modelopt_recipes/general/ptq/...yaml` | Updated recipe descriptions and configs: `nvfp4_default-none_kv_gptq.yaml` now uses `layerwise: true` with a checkpoint dir. Changed `nvfp4_experts_only-fp8_kv.yaml` to use a structured `{ method: max, layerwise: true }` config. |
| **Launcher Tools & Scripts**<br>`tools/launcher/common/hf/ptq.sh`, `tools/launcher/common/megatron_lm/quantize/...`, `tools/launcher/common/vllm/...`, `tools/launcher/common/tensorrt_llm/...`, `tools/launcher/examples/Qwen/...` | Added `EXTRA_PIP_DEPS` support for model-specific dependencies in PTQ. Extended quantize/mmlu/export pipeline scripts with distributed parallelism parameters. Introduced a TensorRT-LLM eval script and config. Updated vLLM startup with pandas install, timeout enforcement, and regression health checks. Scaled up Qwen examples to 8-way parallelism. |
| **Example & Documentation Updates**<br>`examples/torch_onnx/torch_quant_to_onnx.py`, `examples/torch_onnx/README.md`, `examples/vllm_serve/...`, `examples/windows/onnx_ptq/...`, `examples/speculative_decoding/...`, `examples/megatron_bridge/...`, `experimental/conv/README.md`, `examples/pruning/README.md`, `CHANGELOG.rst`, `CLAUDE.md`, `CONTRIBUTING.md` | Added TensorRT build capability to the ONNX exporter. Removed the AWQ reload limitation note. Updated Conv3D/Wan example documentation. Removed the experimental Conv3D README (moved to modelopt core). Updated CHANGELOG with layerwise calibration and Conv3D kernel features. Updated developer docs to reference `noxfile.py`. Updated example scripts with calibration/inference parameter changes. |
| **Comprehensive Test Coverage**<br>`tests/...` (100+ new/updated test files) | Added layerwise calibration tests, sparsity calibration tests, fused GPTQ tests, skip-softmax attention tests (Diffusers and LTX-2), Wan 2.2 quantization/export tests, Conv3D implicit GEMM tests, vLLM export tests with offloading and checkpoint resumption, FSDP2 tests, and unit tests for new utilities. Updated test fixtures and conftest files. Removed sequential calibration test references. |
| **Test Utilities & Fixtures**<br>`tests/_test_utils/torch/diffusers_models.py`, `tests/_test_utils/torch/vision_models.py`, `tests/examples/diffusers/conftest.py` | Added `get_tiny_wan22_transformer`, `get_tiny_wan22_vae`, and `create_tiny_wan22_pipeline_dir` fixtures for Wan 2.2 testing. Added `resnet50` to vision model benchmarks. Extended conftest with a session-scoped Wan 2.2 pipeline fixture. |
| **Documentation & Skills**<br>`.claude/skills/common/slurm-setup.md`, `.claude/skills/deployment/SKILL.md`, `.claude/skills/evaluation/SKILL.md`, `.claude/skills/monitor/SKILL.md`, `.claude/skills/ptq/SKILL.md`, `.claude/skills/ptq/references/launcher-guide.md` | Added a comprehensive SLURM container-registry authentication checklist. Updated deployment/evaluation skills with a Step 0 auth check and monitoring integration. Introduced a new monitor skill for job registry and status tracking across PTQ/NEL/deployment. Updated PTQ guidance with model-specific dependency detection (Step 2.5) and monitor skill usage. Updated launcher script path references. |
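The manifest-based, per-layer checkpointing described for the layerwise calibration system can be sketched roughly as follows. All names here are hypothetical, not ModelOpt's actual API; the real implementation also persists quantizer state per layer, while this sketch only shows the resume bookkeeping.

```python
import json
import os
import tempfile

def layerwise_calibrate(layers, checkpoint_dir, calibrate_layer):
    """Calibrate layers one at a time, checkpointing progress after each.

    On restart with the same checkpoint_dir, layers already recorded in the
    manifest are skipped, so an interrupted run resumes where it left off.
    """
    manifest_path = os.path.join(checkpoint_dir, "manifest.json")
    done = []
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            done = json.load(f)["completed_layers"]

    calibrated = []
    for name in layers:
        if name in done:
            continue  # resume: this layer was checkpointed in a previous run
        calibrate_layer(name)  # e.g. collect activations, compute amax
        done.append(name)
        calibrated.append(name)
        # Write the manifest after every layer so a crash loses at most
        # the layer currently in flight.
        with open(manifest_path, "w") as f:
            json.dump({"completed_layers": done}, f)
    return calibrated
```

This is why the actual feature pairs a `layerwise_checkpoint_dir` config with a model-hash suffix: the directory must be stable across restarts of the same model but must not be shared between different models.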

## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

## Possibly related PRs

- **Add the Skip softmax for diffusion #1166**: Implements the same skip-softmax sparse attention feature, including Wan 2.2 examples, Triton/eager backends, calibration integration, and kernel modifications.
- **Add layerwise calibration for large models #1251**: Implements layerwise calibration (formerly "sequential"): renaming APIs, adding checkpoint persistence, moving `LayerActivationCollector` to the `layerwise_calib` module, and updating all dependent tests.

## Suggested reviewers

- sugunav14
- Fridah-nv
- kaix-nv
- Edwardf0t1

@github-actions
Contributor

github-actions Bot commented Apr 22, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-22 08:22 UTC

@grzegorz-k-karch grzegorz-k-karch deleted the gkarch/update_1314 branch April 22, 2026 08:23