
[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9#183

Open
runwangdl wants to merge 14 commits into pulp-platform:devel from runwangdl:gap9-ne16

Conversation


@runwangdl runwangdl commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform.py (NE16Platform extends GAP9Platform, engines=[NE16Engine, GAP9ClusterEngine]), Engine.py, Bindings.py, Parsers.py, Tiler.py, Deployer.py (extends GAP9Deployer to reuse ClDma transformers), Templates/{Allocate,Conv}Template.py, TileConstraints/{NE16Pointwise,NE16Depthwise,NE16Dense}Constraint.py, TopologyOptimizationPasses/Passes.py (incl. _weightEncode ported from pulp-nnx/test/Ne16Weight.py).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py and DeeployTest/test_gap9_ne16_tiled_config.py — new runner + minimal test config covering PW 1x1, DW 3x3, Dense 3x3 RQ Conv.
  • TargetLibraries/GAP9/CMakeLists.txt — for GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, and mapDeployer.
  • DeeployTest/testMVP.py — include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it, NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks).
  • DeeployTest/testUtils/core/execution.py — build the GAP9 SDK image target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — swap Neureka-inherited subtile constants for NE16 spec per ne16_task_defs.h: CIN_SUBTILE 32/28 → 16 (single mode), output 63, per-(cout,cinMajor) weight bytes PW 16, DW/Dense 144.

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
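The local-alias workaround described above can be sketched as follows. PointerClass and float32_t are the Deeploy names mentioned in the bullet; the two definitions below are minimal stand-ins added here so the sketch runs on its own, not Deeploy's actual implementations.

```python
# Hedged sketch of the workaround: instead of importing float32_tPtr
# (which raises ImportError), derive it locally from float32_t via
# PointerClass. The stand-ins below only mimic the real API shape.

class float32_t:  # stand-in for Deeploy's float32_t immediate type
    typeName = "float32_t"

def PointerClass(baseType):
    # Stand-in for Deeploy's PointerClass factory: build a pointer type
    # that remembers the type it points to.
    return type(baseType.typeName + "Ptr", (), {
        "referencedType": baseType,
        "typeName": baseType.typeName + " *",
    })

# The template-local alias replacing the broken import:
float32_tPtr = PointerClass(float32_t)
```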

Test plan

Ran on gvsoc gap9.evk (inside ghcr.io/pulp-platform/deeploy-gap9:devel), exercising the full pipeline: gen -> parse -> lower -> codegen -> CMake -> build -> gapy image -> gvsoc flash run:

| Test | Errors | Runtime (cycles) |
| --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 0 / 1152 | 901 917 |
| Kernels/Integer/Conv/DW_2D_RQ (--enable-3x3) | 0 / 1280 | 27 339 |
| Kernels/Integer/Conv/Regular_2D_RQ (--enable-3x3) | 0 / 6372 | 244 595 |

Follow-up (out of scope for this PR): PW Unsigned_RQ variant currently produces non-zero errors — likely unsigned weight-offset or out_type conf0 handling; tracked separately.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.


coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds NE16 accelerator support and a new platform variant GAP9_w_NE16: new NE16 engine/platform/deployer, parsers, templates, bindings, tiler constraints, topology/optimization passes, CMake/SDK integration, tests, and CI workflows enabling tiled NE16-enabled execution.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Top-level build & CI | CMakeLists.txt, DeeployTest/CMakeLists.txt, TargetLibraries/GAP9/CMakeLists.txt, .github/workflows/.../*.yml | Add GAP9_w_NE16 platform option and build branch, enable USE_NE16 and pulp-nnx linkage, adjust GAP9 targets and C flags, and add NE16 CI workflows and runner. |
| NE16 package & exports | Deeploy/Targets/NE16/__init__.py, Deeploy/Targets/NE16/Templates/__init__.py, Deeploy/Targets/NE16/TileConstraints/__init__.py, Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py, Deeploy/Targets/NE16/OptimizationPasses/__init__.py | New NE16 package and subpackage initializers that wildcard re-export submodule symbols. |
| Engine / Platform / Deployer | Deeploy/Targets/NE16/Engine.py, Deeploy/Targets/NE16/Platform.py, Deeploy/Targets/NE16/Deployer.py | Add NE16Engine, NE16Platform/memory wrappers, NE16Deployer, NE16 mapping/optimizer, engine feature flags, and lowering-pass adjustments. |
| Parsers | Deeploy/Targets/NE16/Parsers.py, Deeploy/Targets/GAP9/Parsers.py | Add NE16-specific Conv2D parsers (base + PW/DW/Dense and RQ variants) and NE16GEMMParser for RequantizedGemm. |
| Bindings / Templates / Tiler-ready bindings | Deeploy/Targets/NE16/Bindings.py, Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py, Deeploy/Targets/NE16/Templates/..., Deeploy/Targets/NE16/Tiler.py, Deeploy/Targets/GAP9/Tiler.py | Add NE16 NodeBindings, GAP9 NE16 GEMM templates, NE16 convolution templates (init+exec), and expose tiling-ready bindings for PW/DW/Dense (RQ/non-RQ) and GEMM. |
| Tile constraints & requant helpers | Deeploy/Targets/NE16/TileConstraints/*, Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py | Add pointwise/depthwise/dense tiling constraints (and RQ variants), requant helper utilities, and NE16 GEMM tile constraint with serialization. |
| Topology & lowering optimization passes | Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py | Add weight encoding/layout passes, NE16 reshape/merge passes, GEMM weight adjustment pass, Dequant/Quant merge pass, and a memory-level weight annotation pass. |
| Type checkers and generic bindings | Deeploy/Targets/Generic/TypeCheckers.py, Deeploy/Targets/Generic/Bindings.py, Deeploy/Targets/PULPOpen/Bindings.py, Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py | Add QuantChecker output-type signedness check, extend quant/dequant bindings (add uint8 variants), and add GAP9 SDK quant/dequant templates. |
| Optimization/SDK helpers & libs | TargetLibraries/GAP9/inc/ne16_utils.h, TargetLibraries/GAP9/src/ne16_utils.c, Deeploy/Targets/GAP9/DMA/L3Dma.py | Add NE16 utility int8→uint8 kernel and adjust L3 DMA template casts; remove old gap9L3DmaHack in favor of GAP9L3Dma. |
| Tests & test tooling | DeeployTest/*, DeeployTest/test_gap9_ne16_tiled_config.py, DeeployTest/test_platforms.py, DeeployTest/conftest.py, DeeployTest/testUtils/* | Add NE16 test configs, runner script, pytest marker gap9_w_ne16_tiled, platform mapping to NE16 platform/deployer, test cases, and CMake/test build adjustments. |
| Misc small changes | Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py, Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py, Deeploy/Targets/GAP9/DMA/L3Dma.py | Minor pointer-type alias change and various GAP9 binding/tiler updates to integrate NE16 GEMM, SDK quant/dequant templates, and L3 DMA template refinements. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Graph as gs.Graph
    participant Parser as NE16 Parser
    participant Optimizer as TopologyOptimizer / Passes
    participant Tiler as Tiler / TileConstraints
    participant Templates as Conv/GEMM Templates
    participant Deployer as NE16Deployer / Engine
    participant Build as CMake / pulp-nnx

    Graph->>Parser: parse nodes (Conv/Gemm) with NE16 constraints & RQ fields
    Parser->>Optimizer: emit matches for NE16 topology passes (weight/layout, reshape)
    Optimizer->>Templates: request weight encoding and template alignment
    Templates->>Tiler: provide tiling metadata (counters, strides, conf0)
    Tiler->>Deployer: produce VariableReplacementScheme & TilingSchedule
    Deployer->>Build: integrate generated sources, link pulp-nnx, produce image

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • Xeratec
  • Victor-Jung
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR is linked to issue #39 regarding dynamic I/O tensor allocation, but the changeset focuses entirely on adding NE16 accelerator support without addressing any I/O tensor allocation requirements from that issue. | Either link this PR to a different issue aligned with NE16 platform support, or address I/O tensor allocation objectives if that was the intended scope. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.61%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9' clearly and specifically describes the main change: adding NE16 as an accelerator on the GAP9 platform. |
| Description check | ✅ Passed | The PR description provides detailed, relevant context about the NE16 integration: architecture overview, new components added, changes made, and a test plan with actual cycle counts. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to adding the NE16 platform and its integration with GAP9; no unrelated refactoring or extraneous modifications were introduced. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (4)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Replace self-referential wildcard import with explicit exports.

from . import * on Line 5 makes exported symbols unclear. Replace with explicit re-exports plus __all__ to clarify the public API.

Proposed fix
-from . import *
+from .NE16DenseConstraint import NE16DenseConv2DTileConstraint, NE16RQSDenseConv2DTileConstraint
+from .NE16DepthwiseConstraint import NE16DWConv2DTileConstraint, NE16RQSDWConv2DTileConstraint
+from .NE16PointwiseConstraint import NE16PWConv2DTileConstraint, NE16RQSPWConv2DTileConstraint
+
+__all__ = [
+    "NE16DenseConv2DTileConstraint",
+    "NE16RQSDenseConv2DTileConstraint",
+    "NE16DWConv2DTileConstraint",
+    "NE16RQSDWConv2DTileConstraint",
+    "NE16PWConv2DTileConstraint",
+    "NE16RQSPWConv2DTileConstraint",
+]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
self-referential wildcard import in __init__.py with explicit re-exports: remove
"from . import *" and import the specific public symbols (e.g., classes,
functions, constants) from the package submodules using explicit imports (for
example: from .submodule import ClassX, func_y), then declare __all__ =
["ClassX", "func_y", ...] listing those exported names; ensure you reference the
actual symbol names defined in the package (replace placeholders with the real
identifiers) and avoid introducing circular imports when reorganizing the
imports.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

The BatchOffset, HOffset, WOffset, and BatchSize variables are unpacked for structural clarity but never used. Prefix them with _ to indicate intentional disuse.

🔧 Proposed fix
         for cube in outputCubes:
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims

Similarly at line 173:

-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint.py currently bind unused
names BatchOffset, HOffset, WOffset, COffset and BatchSize (and similarly at the
later unpacking around line 173); update those unpackings to prefix unused
variables with an underscore (e.g., _BatchOffset, _HOffset, _WOffset,
_BatchSize) so the intent of intentional non-use is clear; locate the unpacking
expressions that assign to cube.offset and cube.dims and rename only the unused
identifiers accordingly while keeping used names (COffset, HSize, WSize, CSize)
unchanged.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

125-136: Use integer division to avoid floating-point precision issues.

Line 131 uses float division (/), which can cause precision loss for large integers. Use integer division (//) instead.

🔧 Proposed fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
     for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX = tmpX // i

     if x // math.prod(multiplicands) > 1:
         multiplicands.append(x // math.prod(multiplicands))

     return multiplicands
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 125 -
136, In _findAllMultiplicands, replace the floating-point division on tmpX (tmpX
= tmpX / i) with integer division (tmpX = tmpX // i) and ensure tmpX remains an
int so subsequent modulus and math.prod operations are integer-safe; update any
related expressions that rely on tmpX's type (e.g., the while loop condition
tmpX % i == 0 and the final math.prod usage) to work with the integer result.
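A runnable version of the helper with the integer-division fix applied (the function name is snake_cased here for a standalone sketch; the pass's actual helper is _findAllMultiplicands):

```python
import math

def find_all_multiplicands(x: int) -> list:
    # Mirrors _findAllMultiplicands, with tmpX // i so the running value
    # stays an int and no floating-point precision is lost for large x.
    multiplicands = []
    tmp = x
    for i in range(2, math.ceil(math.sqrt(x))):
        while tmp % i == 0:
            multiplicands.append(i)
            tmp = tmp // i  # integer division: the fix
    if x // math.prod(multiplicands) > 1:
        multiplicands.append(x // math.prod(multiplicands))
    return multiplicands
```

For example, find_all_multiplicands(12) yields [2, 2, 3] and a prime such as 7 yields [7].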
Deeploy/Targets/NE16/Parsers.py (1)

55-62: Use not channels_first instead of == False.

Per PEP 8 and Ruff E712, avoid direct comparisons to False.

🔧 Proposed fix
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
                 # LMACAN: weight shape should be equal to 3 because we have to do the ne16's
                 #         special weight encoding
                 len(weight.shape) == 3,
         ]):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 55 - 62, In the conditional
inside Parsers.py (the if checking channels_first, data_in.shape, weight.shape)
replace the direct boolean comparison channels_first == False with the idiomatic
not channels_first (i.e., use not channels_first in the all([...]) list) so the
condition follows PEP8/Ruff E712; adjust the list element in the same if that
currently reads channels_first == False to read not channels_first.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: The __init__.py currently uses explicit imports but omitted the Tiler
module, causing Engine.py's internal imports to fail; update
Deeploy.Targets.NE16.__init__.py to explicitly import or expose the Tiler module
(the Tiler.py symbol) alongside the other exported modules so it is available
when the package is imported, ensuring Engine.py's references to Tiler resolve
correctly (alternatively, use a package import loop such as using
pkgutil.iter_modules to dynamically import all submodules if you prefer
automatic inclusion).

In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor currently uses a mutable default
inputOffsets = {} which causes shared state; change the signature for the
relevant classes (SignPropDeployer.__init__ and GAP9Deployer.__init__) to use
inputOffsets=None and inside each __init__ set inputOffsets = {} if inputOffsets
is None before passing it to super().__init__ (or before any mutations), so each
instance gets its own dict and avoids state leakage when SignPropDeployer
populates the keys.
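The None-default idiom suggested above, as a minimal self-contained sketch (the class and attribute names are illustrative, not Deeploy's actual signatures):

```python
class DeployerSketch:
    # Illustrative stand-in for the SignPropDeployer/GAP9Deployer chain:
    # a None default plus an in-body check gives each instance its own dict,
    # unlike a mutable `inputOffsets = {}` default shared across calls.
    def __init__(self, inputOffsets = None):
        if inputOffsets is None:
            inputOffsets = {}  # fresh dict per instantiation
        self.inputOffsets = inputOffsets

a = DeployerSketch()
b = DeployerSketch()
a.inputOffsets["input_0"] = -128  # populating a's offsets...
# ...does not leak into b, because b got its own dict
```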

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-81: The conv-classification is too loose: tighten isPWConv to
reject grouped 1x1 convolutions by adding a check that node.attrs['group'] == 1,
and tighten isDWConv to only accept true depthwise convolutions by inspecting
the conv weight tensor (node.inputs[1].values) and requiring the per-group
in_channels == 1 and that the number of groups equals the number of output
channels (e.g., weights.shape[1] == 1 and weights.shape[0] ==
node.attrs['group'] or group == out_channels); update isPWConv, isDWConv (and
thereby canExecute) accordingly so only bona fide pointwise (group==1) and true
depthwise convs are classified for NE16 coloring.
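The tightened classification can be sketched as follows; the OIHW-style weight shape and the way the group attribute is passed in are assumptions for illustration, not Deeploy's exact node API:

```python
import numpy as np

def is_pw_conv(kernel_shape, group: int) -> bool:
    # Pointwise only if the kernel is 1x1 AND the conv is ungrouped.
    return tuple(kernel_shape) == (1, 1) and group == 1

def is_dw_conv(weights: np.ndarray, group: int) -> bool:
    # True depthwise: one input channel per group, and as many groups
    # as output channels (weights assumed in (out_ch, in_per_group, kH, kW)).
    out_ch, in_per_group = weights.shape[0], weights.shape[1]
    return in_per_group == 1 and group == out_ch
```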

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Around line 27-44: The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 118-121: The parsers map only 4 inputs into
self.operatorRepresentation without bounds checking and always set shift to
'NULL', causing IndexError and mismatched scale_shift_addr; update parseNode()
in the affected parsers (the current loop that assigns inputs via inputs =
['data_in','weight','mul','add'] and self.operatorRepresentation[inputs[idx]] =
ctxt.lookup(inputNode.name).name) to perform explicit validation like
len(node.inputs) == 4 (following RQMatMulParser) OR handle a fifth input by
mapping node.inputs[4] into e.g. self.operatorRepresentation['shift'] (or read
shift from node.attrs['shift'] like PULPOpen) and stop hardcoding 'NULL' so the
template variable ${shift} used by ConvTemplate.py:310 gets a proper value;
apply the same change to NE16RQSPWConv2DParser and NE16RQSDenseConv2DParser.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 141-147: The conf0 computation treats bias presence
incorrectly—change the bias check to treat the "NULL" sentinel as absent: when
calling getNormQuantConf0 in the block guarded by operatorRepresentation["mul"]
!= "NULL", replace the argument "add" in operatorRepresentation with an explicit
boolean like ('add' in operatorRepresentation and operatorRepresentation["add"]
!= "NULL") so bit 25 isn't set when add == "NULL"; keep the existing mul !=
"NULL" guard as-is and pass the adjusted boolean into getNormQuantConf0
(symbols: operatorRepresentation, getNormQuantConf0, conf0, "mul", "add").
- Around line 323-327: The compound literal for ne16_cfg_t incorrectly assigns
weights_stride via an equals expression instead of a designated initializer;
change the block that sets task.data.cfg.weights_stride to use a designated
field (e.g., add a .weights_stride = { .d0 = weightStrideD0, .d1 =
weightStrideD1, .d2 = weightStrideD2 } entry) inside the ne16_cfg_t initializer
so it matches the surrounding fields like .input_stride, .output_stride and
.subtile and produces valid C code.
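For the bias-bit issue in the first item above, the sentinel handling can be sketched like this; the bit position and helper shape follow the review comment and should be treated as assumptions about the NE16 register layout:

```python
NE16_NORM_BIAS_BIT = 1 << 25  # assumed position of the norm-bias flag

def norm_quant_conf0(op_rep: dict) -> int:
    # Only set the bias bit when `add` names a real tensor, not the
    # "NULL" sentinel used for absent operands.
    conf0 = 0
    if op_rep.get("mul", "NULL") != "NULL":
        has_bias = op_rep.get("add", "NULL") != "NULL"
        if has_bias:
            conf0 |= NE16_NORM_BIAS_BIT
    return conf0
```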

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
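The stride-aware relation for a padding-free 1x1 conv reduces to a one-liner; a numeric sketch:

```python
def out_dim_1x1(in_dim: int, stride: int) -> int:
    # For kernel size 1 and padding 0, taps land at 0, stride, 2*stride, ...
    # so the output dimension is (in_dim - 1) // stride + 1.
    return (in_dim - 1) // stride + 1
```

For example, a 7-wide input at stride 2 samples positions 0, 2, 4, 6, giving 4 outputs.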

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 81-101: The subtraction of weight_offset from values in Passes.py
can overflow if values is int8; convert values to a wider integer dtype (e.g.,
np.int16 or np.int32) before computing weight_offset and performing values =
values - weight_offset, then ensure the resulting values are clipped to the
0–255 range and finally cast to np.uint8 just before calling _weightEncode
(refer to weightTensor.values, weight_offset, values, and _weightEncode) so the
transpose and depthwise logic remain unchanged.

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py`:
- Around line 15-16: The two parsed flags --ne16-wmem and --enable-3x3 are never
consumed; update main() to translate args.ne16_wmem and args.enable_3x3 into
CMake defines (e.g., add "-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and
"-DENABLE_3X3_CONV=ON" when args.enable_3x3 is true) and append them to the same
cmake argument list used for building (the variable that assembles cmake args /
options or the call that runs cmake/configure in main()); if the project expects
OFF values when flags are absent, explicitly add =OFF otherwise. Ensure you
reference the existing parser flags (--ne16-wmem / --enable-3x3) and modify the
code path that constructs cmake arguments in main() so these options are passed
to CMake.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a121db6-6486-4738-9ded-bbe6729e8727

📥 Commits

Reviewing files that changed from the base of the PR and between 3b011bb and e41800e.

📒 Files selected for processing (28)
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • TargetLibraries/GAP9/CMakeLists.txt

#
# SPDX-License-Identifier: Apache-2.0

from . import *


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify self-referential wildcard imports in NE16 package initializers.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

find Deeploy/Targets/NE16 -maxdepth 1 -type d | sort

Repository: pulp-platform/Deeploy

Length of output: 241


🏁 Script executed:

ls -la Deeploy/Targets/NE16/ | head -30

Repository: pulp-platform/Deeploy

Length of output: 878


🏁 Script executed:

cat -n Deeploy/Targets/NE16/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

rg "from.*NE16.*import.*Tiler|from.*NE16.*Tiler" --type py

Repository: pulp-platform/Deeploy

Length of output: 205


🏁 Script executed:

rg "\.Tiler" Deeploy/Targets/NE16 --type py

Repository: pulp-platform/Deeploy

Length of output: 836


🏁 Script executed:

find Deeploy/Targets -maxdepth 2 -name "__init__.py" -type f | head -10

Repository: pulp-platform/Deeploy

Length of output: 374


🏁 Script executed:

for dir in Deeploy/Targets/*/; do echo "=== $(basename $dir) ==="; cat "$dir/__init__.py" 2>/dev/null | head -20; done

Repository: pulp-platform/Deeploy

Length of output: 1218


🏁 Script executed:

rg "__all__" Deeploy/Targets/NE16 --type py -A 5

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

head -20 Deeploy/Targets/NE16/Tiler.py

Repository: pulp-platform/Deeploy

Length of output: 1270


🏁 Script executed:

rg "import.*Tiler" Deeploy --type py | grep -v "TilerExtension\|TilerModel" | head -20

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg "from.*NE16" Deeploy --type py | grep -v "^Binary" | head -30

Repository: pulp-platform/Deeploy

Length of output: 2287


The proposed fix is incomplete: Tiler.py exists in the NE16 directory but is missing from the proposed imports

While the original concern is valid—from . import * does not expose submodules/packages as intended—the proposed fix lists only 9 modules and omits Tiler.py, which exists as a module in the directory and is imported internally by Engine.py.

Corrected fix
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from . import *
from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
__all__ = [
"Bindings",
"Deployer",
"Engine",
"OptimizationPasses",
"Parsers",
"Platform",
"Tiler",
"Templates",
"TileConstraints",
"TopologyOptimizationPasses",
]
🧰 Tools
🪛 Ruff (0.15.9)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/__init__.py` at line 5, The __init__.py currently uses
explicit imports but omitted the Tiler module, causing Engine.py's internal
imports to fail; update Deeploy.Targets.NE16.__init__.py to explicitly import or
expose the Tiler module (the Tiler.py symbol) alongside the other exported
modules so it is available when the package is imported, ensuring Engine.py's
references to Tiler resolve correctly (alternatively, use a package import loop
such as using pkgutil.iter_modules to dynamically import all submodules if you
prefer automatic inclusion).

Comment on lines +28 to +30
inputOffsets = {}):
super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
default_channels_first, deeployStateDir, inputOffsets)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether inputOffsets is mutated downstream (which would make shared default state dangerous).
rg -nP --type=py -C3 'def __init__\(|inputOffsets\s*=' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py
rg -nP --type=py -C2 'inputOffsets\[[^]]+\]\s*=|inputOffsets\.update\(|setdefault\(' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py

Repository: pulp-platform/Deeploy

Length of output: 12773


🏁 Script executed:

# Find the parent class of GAP9Deployer
rg -n "class GAP9Deployer" Deeploy/Targets/GAP9/Deployer.py -A 5

Repository: pulp-platform/Deeploy

Length of output: 312


🏁 Script executed:

# Get the full NE16Deployer.__init__ context and check parent
head -n 35 Deeploy/Targets/NE16/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 1693


🏁 Script executed:

# Get the full GAP9Deployer.__init__ context to see parent class definition
head -n 70 Deeploy/Targets/GAP9/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 2726


🏁 Script executed:

# Search for any mutation of inputOffsets using broader patterns
rg -n "inputOffsets" Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py -A 2 -B 2

Repository: pulp-platform/Deeploy

Length of output: 1535


🏁 Script executed:

# Find PULPDeployer class and its __init__
fd "Deployer.py" Deeploy/Targets/PULPOpen/

Repository: pulp-platform/Deeploy

Length of output: 101


🏁 Script executed:

# Search for PULPDeployer definition
rg -n "class PULPDeployer" Deeploy/ -A 10

Repository: pulp-platform/Deeploy

Length of output: 974


🏁 Script executed:

# Search for how inputOffsets is used in the parent class hierarchy
rg -n "self.inputOffsets" Deeploy/ -B 2 -A 2

Repository: pulp-platform/Deeploy

Length of output: 3019


🏁 Script executed:

# Read SignPropDeployer to see the full __init__ and how inputOffsets is mutated
head -n 55 Deeploy/CommonExtensions/NetworkDeployers/SignPropDeployer.py

Repository: pulp-platform/Deeploy

Length of output: 2075


🏁 Script executed:

# Check if inputOffsets is mutated elsewhere besides the initial population
rg -n "inputOffsets\[" Deeploy/ -B 1 -A 1

Repository: pulp-platform/Deeploy

Length of output: 1113


Use None as default for inputOffsets parameter.

The {} default on line 28 creates a shared mutable object that gets mutated in SignPropDeployer.__init__() (which populates it with keys). Multiple instances created without providing inputOffsets will share the same dict, causing unintended state leakage.

Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Note: GAP9Deployer has the same issue on line 54 and should be fixed identically.

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 28-28: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
currently uses a mutable default inputOffsets = {} which causes shared state;
change the signature for the relevant classes (SignPropDeployer.__init__ and
GAP9Deployer.__init__) to use inputOffsets=None and inside each __init__ set
inputOffsets = {} if inputOffsets is None before passing it to super().__init__
(or before any mutations), so each instance gets its own dict and avoids state
leakage when SignPropDeployer populates the keys.
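The shared-default pitfall described above is easy to demonstrate in isolation; the function names here are illustrative, not Deeploy's API:

```python
def register_inputs(offsets = {}):          # bug: the default dict is created once, at def time
    offsets.setdefault("input_0", 0)
    return offsets


def register_inputs_fixed(offsets = None):  # fix: a fresh dict per call
    if offsets is None:
        offsets = {}
    offsets.setdefault("input_0", 0)
    return offsets


a = register_inputs()
b = register_inputs()
assert a is b                # both calls mutated the same shared dict

c = register_inputs_fixed()
d = register_inputs_fixed()
assert c is not d            # each call gets its own dict
```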

Comment on lines +62 to +81
def isPWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [1, 1] and \
node.attrs['dilations'] == [1, 1] and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def isDWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [3, 3] and \
node.attrs['dilations'] == [1, 1] and \
node.attrs['group'] != 1 and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def canExecute(self, node: gs.Node) -> bool:
if self.enable3x3:
return self.isPWConv(node) or self.isDWConv(node) or self.isDenseConv(node)
else:
return self.isPWConv(node)

⚠️ Potential issue | 🟠 Major

Tighten grouped-convolution classification before coloring.

isPWConv() currently accepts grouped 1x1 convs, and isDWConv() accepts any 3x3 conv with group != 1. The downstream DW path assumes true depthwise semantics, so generic grouped convolutions can be colored for NE16 and then fail later in parsing/tiling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 81, The conv-classification
is too loose: tighten isPWConv to reject grouped 1x1 convolutions by adding a
check that node.attrs['group'] == 1, and tighten isDWConv to only accept true
depthwise convolutions by inspecting the conv weight tensor
(node.inputs[1].values) and requiring the per-group in_channels == 1 and that
the number of groups equals the number of output channels (e.g.,
weights.shape[1] == 1 and weights.shape[0] == node.attrs['group'] or group ==
out_channels); update isPWConv, isDWConv (and thereby canExecute) accordingly so
only bona fide pointwise (group==1) and true depthwise convs are classified for
NE16 coloring.
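A hedged sketch of the tightened checks, assuming an OIHW weight layout (cout, cin_per_group, kh, kw); the helper names are illustrative:

```python
import numpy as np


def is_pointwise(kernel_shape, group: int) -> bool:
    # True pointwise: 1x1 kernel and no grouping
    return list(kernel_shape) == [1, 1] and group == 1


def is_true_depthwise(weights: np.ndarray, group: int) -> bool:
    # True depthwise: one input channel per group, one group per output channel
    cout, cin_per_group = weights.shape[0], weights.shape[1]
    return cin_per_group == 1 and group == cout
```

For example, a weight of shape (64, 1, 3, 3) with group = 64 passes, while a generic grouped conv with shape (64, 4, 3, 3) and group = 16 is rejected instead of being mis-colored for the DW path.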

Comment on lines +27 to +44
weightMemoryOccupation = 0

# Current weight memory occupation
for buffer in {**ctxt.globalObjects, **ctxt.localObjects}.values():
if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
for node in ne16Nodes:
if node.op in ["Conv", "RequantizedConv"]:

if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
continue

buffer = ctxt.lookup(node.inputs[1].name)
if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
buffer._memoryLevel = self._weightMemoryLevel.name
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

⚠️ Potential issue | 🟠 Major

Avoid double-counting reused weights and allow exact-fit placement.

If two NE16 nodes share the same weight buffer, this loop charges that buffer once per node, even after it has already been annotated earlier in this pass. The strict < check also skips buffers that exactly fill the remaining SRAM. Both cases leave feasible placements unassigned.

Suggested fix
         ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
         for node in ne16Nodes:
             if node.op in ["Conv", "RequantizedConv"]:

                 if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
                     continue

                 buffer = ctxt.lookup(node.inputs[1].name)
-                if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
-                    buffer._memoryLevel = self._weightMemoryLevel.name
-                    weightMemoryOccupation += _ne16WeightBufferSize(buffer)
+                if getattr(buffer, "_memoryLevel", None) == self._weightMemoryLevel.name:
+                    continue
+
+                bufferSize = _ne16WeightBufferSize(buffer)
+                if weightMemoryOccupation + bufferSize <= self._weightMemoryLevel.size:
+                    buffer._memoryLevel = self._weightMemoryLevel.name
+                    weightMemoryOccupation += bufferSize
         return ctxt, graph
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`
around lines 27 - 44, The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
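The corrected greedy placement can be sketched independently of Deeploy's context objects; every name below is illustrative:

```python
def place_weights(nodes_weights, level_size):
    """Greedy placement: charge each shared buffer once and allow exact fits."""
    used = 0
    placed = set()
    for name, size in nodes_weights:
        if name in placed:             # shared buffer already annotated: skip
            continue
        if used + size <= level_size:  # <= permits an exact fit
            placed.add(name)
            used += size
    return placed, used


# Two nodes share "w0"; "w1" exactly fills the remaining space.
placed, used = place_weights([("w0", 16), ("w0", 16), ("w1", 48)], 64)
```

With the original logic (`<` and no dedup), "w0" would be charged twice and "w1" rejected; here both buffers are placed and the level is used to capacity.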

Comment thread Deeploy/Targets/NE16/Parsers.py
Comment thread Deeploy/Targets/NE16/Templates/ConvTemplate.py
Comment on lines +41 to +44
weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify weight tensor shape after encoding and check if dimIdx 3 causes issues
echo "=== Check TilerModel.getTensorDimVar implementation ==="
ast-grep --pattern 'def getTensorDimVar($$$):
  $$$
'

echo -e "\n=== Check if weight shape assertions exist in tiler ==="
rg -n "getTensorDimVar.*weight" --type py -C 2

Repository: pulp-platform/Deeploy

Length of output: 17734


🏁 Script executed:

# Check if the variables declared at lines 41-44 are used later in the function
sed -n '1,100p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 6259


🏁 Script executed:

# Check if similar variables are used in the comparable Neureka file
sed -n '40,80p' Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 2692


🏁 Script executed:

# Find where NE16AdjustWeightMemoryLayoutPass is defined and check weight shape
find . -name "*NE16AdjustWeightMemoryLayoutPass*" -o -name "*Parsers.py" | head -5

Repository: pulp-platform/Deeploy

Length of output: 246


🏁 Script executed:

# Search for all uses of the suspect variables in NE16DenseConstraint.py
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check NE16 Parsers.py for weight shape information
cat -n Deeploy/Targets/NE16/Parsers.py | head -100

Repository: pulp-platform/Deeploy

Length of output: 4765


🏁 Script executed:

# Compare with Neureka - check if those variables are used there
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check if getTensorDimVar creates variables on-demand or retrieves them
cat -n Deeploy/TilingExtension/TilerModel.py | sed -n '125,145p'

Repository: pulp-platform/Deeploy

Length of output: 1096


🏁 Script executed:

# Check addTensorDimToModel to see what dimensions it adds to the model
rg -n "def addTensorDimToModel" Deeploy/TilingExtension/TilerModel.py -A 15

Repository: pulp-platform/Deeploy

Length of output: 762


🏁 Script executed:

# Verify the runtime error scenario: check if weight is 3D and dimIdx=3 would fail
python3 << 'EOF'
# Simulate the getTensorDimVar call
weight_shape = (64, 32*64, 128)  # Example 3D weight: (cout, cinMajor, bits*spatial)
num_dims = len(weight_shape)

print(f"Weight shape: {weight_shape}")
print(f"Number of dimensions: {num_dims}")
print(f"Valid dimIdx range: 0 to {num_dims - 1}")

# Try to access dimIdx=3
for dimIdx in [0, 1, 2, 3]:
    varName = f"weight_dim_{dimIdx}"
    if dimIdx < num_dims:
        print(f"✓ dimIdx={dimIdx}: {varName} would exist (size: {weight_shape[dimIdx]})")
    else:
        print(f"✗ dimIdx={dimIdx}: {varName} would NOT exist (KeyError)")
EOF

Repository: pulp-platform/Deeploy

Length of output: 344


🏁 Script executed:

# Double-check the complete addGeometricalConstraint function to confirm these variables aren't used
wc -l Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
sed -n '24,90p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | grep -E "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar"

Repository: pulp-platform/Deeploy

Length of output: 425


Remove unused weight dimension variables that cause runtime KeyError.

Lines 42-44 declare variables that are never used and access a non-existent dimension. Per Parsers.py line 60, NE16 weight tensors are 3D (cout, cinMajor, bits*spatial) after encoding, with valid dimensions 0–2. Attempting to access dimIdx=3 via getTensorDimVar will fail at runtime with a KeyError.

Remove these three lines:

        weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
        weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
        weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

Keep only line 41: weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0), which is actually used in the function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).

Comment on lines +43 to +46
# Map output dims to inputs dims
tilerModel.addConstraint(outputBatchVar == inputBatchVar)
tilerModel.addConstraint(outputHeightVar == inputHeightVar)
tilerModel.addConstraint(outputWidthVar == inputWidthVar)

⚠️ Potential issue | 🟠 Major

Model pointwise output geometry with stride, not identity.

This forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar, but NE16Engine.canExecute() explicitly allows strided 1x1 convs when enableStrides is set. Those tiles need the usual (effective_input - 1) // stride + 1 relation, otherwise the model is wrong as soon as stride > 1.

Suggested fix
-        # Map output dims to inputs dims
+        # Map output dims to input dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
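The stride-aware relation can be checked numerically: for a 1x1 kernel the general convolution output formula reduces to (in + pads - 1) // stride + 1, so the identity constraint only holds at stride 1. The helper name is illustrative:

```python
def conv_out_dim(in_dim: int, pad_lo: int, pad_hi: int, kernel: int, stride: int) -> int:
    # Standard (non-dilated) convolution output size
    return (in_dim + pad_lo + pad_hi - kernel) // stride + 1


# 1x1 kernel, no padding: output equals input only when stride == 1
assert conv_out_dim(16, 0, 0, 1, 1) == 16
assert conv_out_dim(16, 0, 0, 1, 2) == 8   # identity constraint would be wrong here
```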

Comment on lines +81 to +101
values = weightTensor.values

# Extract weight offset and translate weights by the offset
weight_offset = values.min()
values = values - weight_offset
node.attrs["weight_offset"] = weight_offset

if "channels_first" in node.attrs:
channels_first = node.attrs["channels_first"]
else:
channels_first = default_channels_first

# Weight encode expects channels first
if not channels_first:
values = values.transpose(0, 3, 1, 2)

bits = 8 # Support only 8 bit weights for now
if node.attrs['group'] == 1:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = False)
else:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = True)

⚠️ Potential issue | 🟠 Major

Potential integer overflow when subtracting weight offset.

If values has dtype int8 (range -128 to 127), the subtraction at line 85 can overflow before the astype(np.uint8) at line 99. For example, with weight_offset = -128 and a value of 127:

127 - (-128) = 255  # overflows int8, wraps to -1

Cast to a wider type before subtraction to avoid wraparound:

🐛 Proposed fix
     # Adjust N-EUREKA's weights
     values = weightTensor.values

     # Extract weight offset and translate weights by the offset
     weight_offset = values.min()
-    values = values - weight_offset
+    values = values.astype(np.int32) - int(weight_offset)
     node.attrs["weight_offset"] = weight_offset

     if "channels_first" in node.attrs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 81 -
101, The subtraction of weight_offset from values in Passes.py can overflow if
values is int8; convert values to a wider integer dtype (e.g., np.int16 or
np.int32) before computing weight_offset and performing values = values -
weight_offset, then ensure the resulting values are clipped to the 0–255 range
and finally cast to np.uint8 just before calling _weightEncode (refer to
weightTensor.values, weight_offset, values, and _weightEncode) so the transpose
and depthwise logic remain unchanged.
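The wraparound is easy to reproduce with NumPy; this is a standalone illustration of the failure mode and the widening fix, not Deeploy code:

```python
import numpy as np

w = np.array([-128, 0, 127], dtype=np.int8)
offset = w.min()                            # np.int8(-128)

wrapped = w - offset                        # int8 arithmetic: 127 - (-128) wraps to -1
widened = w.astype(np.int32) - int(offset)  # safe: yields 0, 128, 255

assert wrapped.dtype == np.int8 and wrapped[2] == -1
assert widened[2] == 255
assert (widened >= 0).all() and (widened <= 255).all()
```

After widening, every value lies in 0–255 and the final `astype(np.uint8)` before `_weightEncode` is lossless.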

Comment on lines +15 to +16
parser.add_argument('--ne16-wmem', action = 'store_true', help = 'Enable NE16 weight memory\n')
parser.add_argument('--enable-3x3', action = 'store_true', help = 'Enable 3x3 convolutions\n')

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify end-to-end wiring for NE16-specific runner flags.
rg -nP --type=py -C3 'ne16[_-]wmem|enable[_-]3x3|hasattr\(args,\s*.\s*ne16|hasattr\(args,\s*.\s*enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py DeeployTest/testUtils/core/execution.py
rg -nP --type=cmake -C3 'NE16|3x3|WMEM|NUM_CORES' CMakeLists.txt DeeployTest/CMakeLists.txt TargetLibraries/GAP9/CMakeLists.txt

Repository: pulp-platform/Deeploy

Length of output: 6815


🏁 Script executed:

# Search for main() function and how it handles arguments
rg -n 'def main' DeeployTest/testUtils/deeployRunner.py -A 20

# Look for args usage with underscore conversion (--ne16-wmem → args.ne16_wmem)
rg -n 'args\.ne16|args\.enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

# Check the entire flow in the runner file
cat -n DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

Repository: pulp-platform/Deeploy

Length of output: 2136


🏁 Script executed:

# Get more of the main() function to see how it uses parser_setup_callback
sed -n '311,450p' DeeployTest/testUtils/deeployRunner.py | cat -n

# Also search for any reference to ne16_wmem or enable_3x3 after variable names are converted
rg -n 'ne16_wmem|enable_3x3' DeeployTest/testUtils/deeployRunner.py DeeployTest/testUtils/core/execution.py

Repository: pulp-platform/Deeploy

Length of output: 5539


Add handling for --ne16-wmem and --enable-3x3 flags in main().

The flags are defined on Lines 15-16, but unlike --cores (which is properly handled at deeployRunner.py:90-91), there is no downstream consumption of --ne16-wmem or --enable-3x3. These arguments are parsed but never converted to CMake options or used anywhere in execution. Either wire them into CMake configuration (e.g., -DENABLE_NE16_WMEM, -DENABLE_3X3_CONV), or remove them if not yet implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` around lines 15 - 16, The two
parsed flags --ne16-wmem and --enable-3x3 are never consumed; update main() to
translate args.ne16_wmem and args.enable_3x3 into CMake defines (e.g., add
"-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and "-DENABLE_3X3_CONV=ON"
when args.enable_3x3 is true) and append them to the same cmake argument list
used for building (the variable that assembles cmake args / options or the call
that runs cmake/configure in main()); if the project expects OFF values when
flags are absent, explicitly add =OFF otherwise. Ensure you reference the
existing parser flags (--ne16-wmem / --enable-3x3) and modify the code path that
constructs cmake arguments in main() so these options are passed to CMake.
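The wiring itself is a few lines; note that -DENABLE_NE16_WMEM and -DENABLE_3X3_CONV are hypothetical define names from the suggestion above (the project would need matching option() entries in CMake):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--ne16-wmem', action='store_true', help='Enable NE16 weight memory')
parser.add_argument('--enable-3x3', action='store_true', help='Enable 3x3 convolutions')
args = parser.parse_args(['--ne16-wmem'])

# argparse converts dashes to underscores: --ne16-wmem -> args.ne16_wmem
cmake_args = [
    f"-DENABLE_NE16_WMEM={'ON' if args.ne16_wmem else 'OFF'}",
    f"-DENABLE_3X3_CONV={'ON' if args.enable_3x3 else 'OFF'}",
]
```

Passing explicit ON/OFF values (rather than appending the define only when set) keeps stale CMake cache entries from silently keeping a feature enabled across configure runs.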

@runwangdl force-pushed the gap9-ne16 branch 9 times, most recently from 1153077 to 4edb011 on April 13, 2026 at 23:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Fix invalid designated initializer syntax for weights_stride.

The line task.data.cfg.weights_stride = (ne16_stride_t) {...} is inside the .cfg = (ne16_cfg_t) { ... } designated initializer block. Within designated initializers, you must use .fieldname = value syntax. The current syntax is invalid C and will cause compilation errors.

🐛 Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Inside the designated initializer for task.data.cfg (the .cfg = (ne16_cfg_t) {
... } block) the assignment to weights_stride uses an invalid nested C
cast-style initializer; replace it with a proper designated-field initializer by
setting .weights_stride = { .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1},
.d2 = ${weightStrideD2} } (or include an explicit .weights_stride =
(ne16_stride_t){ .d0 = ..., .d1 = ..., .d2 = ... } if you prefer the cast) so
the field name .weights_stride is used with a brace-enclosed initializer
compatible with ne16_stride_t within the ne16_cfg_t designated initializer.
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)

7-11: Minor style nit: Consider grouping all imports before module-level assignments.

The assignment float32_tPtr = PointerClass(float32_t) on line 10 appears between import statements. Per PEP 8 conventions, imports should typically be grouped together at the top of the file before other statements.

♻️ Suggested reordering
 from Deeploy.AbstractDataTypes import PointerClass
 from Deeploy.CommonExtensions.DataTypes import float32_t
-
-float32_tPtr = PointerClass(float32_t)
 from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation
+
+float32_tPtr = PointerClass(float32_t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` around lines 7 - 11,
Move the module-level assignment float32_tPtr = PointerClass(float32_t) so all
import statements are grouped first: keep the imports for PointerClass,
float32_t, and DeeployTypes together at the top, then place the float32_tPtr
assignment immediately after those imports; ensure you reference PointerClass
and float32_t exactly as they are imported and leave NetworkContext,
NodeTemplate, OperatorRepresentation imports unchanged.
DeeployTest/test_platforms.py (1)

1004-1006: Unused config_name variable (consistent with existing tests).

The config_name variable is unpacked but never used. This matches the pattern in all other tiled test functions in this file (e.g., lines 317, 346, 808, etc.). For consistency, you could prefix it with an underscore to signal intent:

-    test_name, l1, config_name = test_params
+    test_name, l1, _config_name = test_params

This is a low-priority style nit since the existing codebase follows the same pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/test_platforms.py` around lines 1004 - 1006, In
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the unpacked variable
config_name is unused; rename it to _config_name (or prefix with an underscore)
in the function signature to indicate it's intentionally unused and match the
existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
.github/workflows/_runner-gap9-w-ne16-tiled.yml (1)

36-36: Consider logging pip install failures instead of silently ignoring them.

The || true suppresses all pip install failures silently. If the install genuinely fails, subsequent test runs will fail with confusing errors. Consider logging the failure while still allowing the workflow to proceed:

-          pip install -e . || true
+          pip install -e . || echo "WARNING: pip install failed, tests may not work correctly"

Alternatively, if the install is expected to always succeed in CI, remove || true to fail fast on broken dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/_runner-gap9-w-ne16-tiled.yml at line 36, The workflow
step currently runs "pip install -e . || true", which silently swallows
installation failures; change this to either remove the "|| true" so the job
fails fast on pip install errors, or replace it with logic that captures the pip
exit status and logs the failure before continuing (e.g., run pip install -e .
and on failure emit a clear error log message referencing the pip install step);
locate the exact step containing "pip install -e . || true" in the workflow and
update it accordingly so failures are not silently ignored.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

Per static analysis hints, several unpacked variables are never used. Prefix them with underscore to indicate intentional non-use.

♻️ Suggested fix
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims
-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims

Also applies to: 173-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint where you extract cube.offset
into (BatchOffset, HOffset, WOffset, COffset) and cube.dims into (BatchSize,
HSize, WSize, CSize) (and the similar unpacking around line 173) include
variables that are never used; update these unpackings to prefix the unused
names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so static analysis
recognizes intentional non-use while keeping any used variables unchanged and
preserving the tuple structure in the methods within the NE16DenseConstraint
class.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

79-80: Remove unused padding variable.

The padding variable is read from parseDict but never used in addPolicyConstraint.

♻️ Suggested fix
         strides = parseDict["strides"]
-        padding = parseDict["pads"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
79 - 80, In addPolicyConstraint (class NE16PointwiseConstraint) the local
variable padding is assigned from parseDict["pads"] but never used; remove the
unused assignment (the line `padding = parseDict["pads"]`) so there is no dead
variable left, or if padding is intended to be used, instead reference it where
needed—locate the parseDict usage in addPolicyConstraint and either delete the
padding assignment or integrate padding into the constraint logic.
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (2)

29-31: Remove unused dilation variable.

The dilation variable is extracted but never used in the function. Since NE16 doesn't support dilation (enforced by parser's dilations == [1, 1] check), this variable is unnecessary.

♻️ Suggested fix
         strides = parseDict["strides"]
         padding = parseDict["pads"]
-        dilation = parseDict["dilations"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
29 - 31, Remove the unused local variable `dilation` that is assigned from
`parseDict["dilations"]`; since the parser enforces `dilations == [1, 1]` and
NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.

160-162: Use already-looked-up outputBuffer instead of redundant lookup.

outputBuffer was already retrieved at line 153. Line 162 performs an unnecessary second lookup.

♻️ Suggested fix
             InCube, padding_tuple = Conv2DTileConstraint.computeInputCube((weightH, weightW), pads, strides, weightC,
                                                                           cube,
-                                                                          ctxt.lookup(varOut).shape)
+                                                                          outputBuffer.shape)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
160 - 162, Replace the redundant ctxt.lookup(varOut).shape call with the
already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.
Deeploy/Targets/NE16/Parsers.py (1)

56-56: Use idiomatic boolean check instead of equality comparison to False.

Per Ruff E712, avoid explicit equality comparisons to False.

♻️ Suggested fix
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` at line 56, Replace the explicit equality
check "channels_first == False" with the idiomatic boolean negation "not
channels_first" wherever it appears (e.g., in the call/config constructing the
parser using the channels_first variable) to comply with Ruff E712 and avoid
direct comparison to False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Line 34: The list comprehension that builds ne16Nodes uses
node.attrs["engine"] directly and can raise KeyError for nodes without an
"engine"; change it to safely check for the key (e.g., use
node.attrs.get("engine") == self.ne16EngineName or "engine" in node.attrs and
node.attrs["engine"] == self.ne16EngineName) so only nodes with an engine
attribute are compared against self.ne16EngineName; update the expression that
defines ne16Nodes accordingly in MemoryLevelAnnotationPasses.py.

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Inside the designated initializer for task.data.cfg (the
.cfg = (ne16_cfg_t) { ... } block) the assignment to weights_stride uses an
invalid nested C cast-style initializer; replace it with a proper
designated-field initializer by setting .weights_stride = { .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} } (or
include an explicit .weights_stride = (ne16_stride_t){ .d0 = ..., .d1 = ..., .d2
= ... } if you prefer the cast) so the field name .weights_stride is used with a
brace-enclosed initializer compatible with ne16_stride_t within the ne16_cfg_t
designated initializer.

---

Nitpick comments:
In @.github/workflows/_runner-gap9-w-ne16-tiled.yml:
- Line 36: The workflow step currently runs "pip install -e . || true", which
silently swallows installation failures; change this to either remove the "||
true" so the job fails fast on pip install errors, or replace it with logic that
captures the pip exit status and logs the failure before continuing (e.g., run
pip install -e . and on failure emit a clear error log message referencing the
pip install step); locate the exact step containing "pip install -e . || true"
in the workflow and update it accordingly so failures are not silently ignored.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Line 56: Replace the explicit equality check "channels_first == False" with
the idiomatic boolean negation "not channels_first" wherever it appears (e.g.,
in the call/config constructing the parser using the channels_first variable) to
comply with Ruff E712 and avoid direct comparison to False.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint where you
extract cube.offset into (BatchOffset, HOffset, WOffset, COffset) and cube.dims
into (BatchSize, HSize, WSize, CSize) (and the similar unpacking around line
173) include variables that are never used; update these unpackings to prefix
the unused names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so
static analysis recognizes intentional non-use while keeping any used variables
unchanged and preserving the tuple structure in the methods within the
NE16DenseConstraint class.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 29-31: Remove the unused local variable `dilation` that is
assigned from `parseDict["dilations"]`; since the parser enforces `dilations ==
[1, 1]` and NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.
- Around line 160-162: Replace the redundant ctxt.lookup(varOut).shape call with
the already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 79-80: In addPolicyConstraint (class NE16PointwiseConstraint) the
local variable padding is assigned from parseDict["pads"] but never used; remove
the unused assignment (the line `padding = parseDict["pads"]`) so there is no
dead variable left, or if padding is intended to be used, instead reference it
where needed—locate the parseDict usage in addPolicyConstraint and either delete
the padding assignment or integrate padding into the constraint logic.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py`:
- Around line 7-11: Move the module-level assignment float32_tPtr =
PointerClass(float32_t) so all import statements are grouped first: keep the
imports for PointerClass, float32_t, and DeeployTypes together at the top, then
place the float32_tPtr assignment immediately after those imports; ensure you
reference PointerClass and float32_t exactly as they are imported and leave
NetworkContext, NodeTemplate, OperatorRepresentation imports unchanged.

In `@DeeployTest/test_platforms.py`:
- Around line 1004-1006: In test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the
unpacked variable config_name is unused; rename it to _config_name (or prefix
with an underscore) in the function signature to indicate it's intentionally
unused and match the existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e7e2d4c-444f-4b78-b2b5-22aef5e9a86f

📥 Commits

Reviewing files that changed from the base of the PR and between e41800e and 38f10f2.

📒 Files selected for processing (32)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (5)
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/CMakeLists.txt
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • TargetLibraries/GAP9/CMakeLists.txt
  • CMakeLists.txt
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/Platform.py

        if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
            weightMemoryOccupation += _ne16WeightBufferSize(buffer)

        ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]

⚠️ Potential issue | 🟡 Minor

Potential KeyError if node lacks engine attribute.

The list comprehension accesses node.attrs["engine"] without checking if the key exists. Nodes not assigned to an engine may lack this attribute.

🛡️ Proposed defensive fix
-        ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
+        ne16Nodes = [node for node in graph.nodes if node.attrs.get("engine") == self.ne16EngineName]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py` at
line 34, The list comprehension that builds ne16Nodes uses node.attrs["engine"]
directly and can raise KeyError for nodes without an "engine"; change it to
safely check for the key (e.g., use node.attrs.get("engine") ==
self.ne16EngineName or "engine" in node.attrs and node.attrs["engine"] ==
self.ne16EngineName) so only nodes with an engine attribute are compared against
self.ne16EngineName; update the expression that defines ne16Nodes accordingly in
MemoryLevelAnnotationPasses.py.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (5)
Deeploy/Targets/NE16/Deployer.py (1)

19-30: ⚠️ Potential issue | 🟠 Major

Avoid shared inputOffsets state across deployer instances.

inputOffsets = {} still creates one dict shared by every NE16Deployer built without an explicit argument, so per-instance state can leak between deployments.

Suggested fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 19 - 30, The constructor
__init__ currently uses a mutable default inputOffsets = {} which shares state
across instances; change the parameter default to inputOffsets: Optional[Dict] =
None and inside __init__ (before calling super().__init__) set inputOffsets = {}
if it's None, then pass that local dict into super().__init__; update the
function signature and ensure any references to inputOffsets in
NE16Deployer.__init__ use the per-instance dict created when None is supplied.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

90-106: ⚠️ Potential issue | 🟠 Major

Widen before subtracting weight_offset.

If weightTensor.values is int8, values - weight_offset happens in the narrow dtype and can wrap before the later cast to uint8, corrupting the encoded weights.

Suggested fix
-    weight_offset = values.min()
-    values = values - weight_offset
+    weight_offset = int(values.min())
+    values = values.astype(np.int32) - weight_offset
     node.attrs["weight_offset"] = weight_offset
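
A standalone NumPy demonstration of the wraparound (not the actual encoder): subtracting in int8 wraps, while widening first gives the intended 0–255 range.

```python
import numpy as np

w = np.array([-128, 0, 127], dtype=np.int8)
offset = w.min()  # -128, still an int8 scalar

narrow = w - offset                      # subtraction stays int8 and wraps
wide = w.astype(np.int32) - int(offset)  # widened first: exact result

print(narrow.tolist())  # [0, -128, -1] after int8 wraparound
print(wide.tolist())    # [0, 128, 255]
```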
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 90 -
106, The subtraction values = values - weight_offset can underflow when
weightTensor.values is int8; before computing the offset and subtracting, widen
values to a safe integer type (e.g., int16/int32) so the subtraction is
performed without wraparound, then clip/normalize and finally cast to np.uint8
before calling _weightEncode; update the operations around weight_offset,
values, and the subsequent transpose/astype steps (referencing weight_offset,
values, weightTensor.values, and _weightEncode) to perform widening,
subtraction, any needed clamping, and only then cast to uint8 for encoding.
Deeploy/Targets/NE16/Engine.py (1)

62-75: ⚠️ Potential issue | 🟠 Major

Grouped convolutions are still being classified as NE16 PW/DW.

isPWConv() still accepts grouped 1x1 convs, and isDWConv() still accepts any grouped 3x3 conv. That will color generic grouped convolutions for NE16 even though the downstream paths only handle regular pointwise (group == 1) or true depthwise weights.
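
A minimal sketch of the stricter predicates; the attribute names follow the ONNX Conv attributes the review mentions, but the exact node interface on NE16Engine is assumed:

```python
# Illustrative predicates only; `attrs` mirrors ONNX Conv attributes and
# `in_channels` must come from the conv's input tensor shape.
def is_pointwise(attrs: dict) -> bool:
    # True pointwise conv: 1x1 kernel AND no grouping.
    return attrs.get("kernel_shape") == [1, 1] and attrs.get("group", 1) == 1

def is_depthwise(attrs: dict, in_channels: int) -> bool:
    # True depthwise conv: one group per input channel.
    return (attrs.get("kernel_shape") == [3, 3]
            and attrs.get("group", 1) == in_channels)
```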

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, isPWConv currently
allows grouped 1x1 convs and isDWConv accepts any grouped 3x3 conv; change the
predicates to only mark true pointwise (group == 1) and true depthwise (group ==
number of input channels) respectively: update isPWConv to include
node.attrs['group'] == 1 in its condition, and update isDWConv to require
node.attrs['group'] equal the convolution's input channel count (i.e., only
treat it as depthwise when group == input_channels) while keeping the existing
kernel_shape, dilation and stride checks.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

43-46: ⚠️ Potential issue | 🟠 Major

Model pointwise output geometry from stride, not identity.

These constraints only describe stride-1 pointwise conv. If NE16Engine.enableStrides is ever enabled, H_out/W_out become wrong immediately because the tiler still forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar.

Suggested fix
         # Map output dims to inputs dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint map outputHeightVar
and outputWidthVar directly to inputHeightVar/inputWidthVar which only works for
stride=1; update the constraints to compute output dims from the pointwise
stride when NE16Engine.enableStrides is used (use the convolution output formula
for 1x1: H_out = ceil(inputHeight / strideH), W_out = ceil(inputWidth /
strideW)). Replace the equalities on outputHeightVar and outputWidthVar with
expressions that use the pointwise stride variables (e.g., strideH/strideW or
the engine's stride config) and implement the integer ceil division ((input +
stride - 1) / stride) so tilerModel.addConstraint uses outputHeightVar ==
(inputHeightVar + strideHVar - 1) / strideHVar and similarly for width; keep the
batch mapping (outputBatchVar == inputBatchVar) as-is.
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Use a designated field initializer for weights_stride.

task.data.cfg.weights_stride = ... is not valid inside a C compound literal. This needs to be .weights_stride = ... like the neighboring fields, otherwise the generated kernel code will not compile.

Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Replace the invalid in-line assignment inside the C compound literal with a
designated initializer: change the occurrence of "task.data.cfg.weights_stride =
(ne16_stride_t) { ... }" so that within the compound literal for task.data.cfg
you use ".weights_stride = (ne16_stride_t) { .d0 = ..., .d1 = ..., .d2 = ... }"
(i.e., reference the struct member as .weights_stride) so the generated kernel
compiles; update the initializer near the other designated fields in the same
compound literal where task.data.cfg is constructed.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Consider using explicit submodule imports for clarity.

The `from . import *` statement in a package `__init__.py` does not automatically import submodules; it only binds the names listed in `__all__` (if defined) into the package's namespace. For a package that simply exposes its submodules, explicit imports are clearer:

from . import NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers

This same pattern appears in several other new NE16 __init__.py files. If the current behavior is intentional (relying on consumers to import submodules directly), this can remain as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
wildcard import in the package __init__ (from . import *) with explicit
submodule imports to make exported symbols clear: import the submodules
NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint and
RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 32-40: The constructor is mutating the shared lowering optimizer
in place (self.loweringOptimizer.passes) which causes cross-deployer pollution;
instead clone the optimizer or at least its passes list before modifying it
(e.g., deep-copy the optimizer or assign a new passes list based on
list(self.loweringOptimizer.passes)) and then perform the PULPNCHWtoNHWCPass ->
NCHWtoNHWCPass replacement and append ConvEngineDiscolorationPass() and
NE16OptimizationPass(self.default_channels_first, "NE16") to the cloned
optimizer so the module-scoped NE16Optimizer from Platform.py is not mutated
across instances.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-40: The NE16 parser's parseNodeCtxt currently defaults
channels_first=True but the implementation only accepts False; change the
default parameter in parseNodeCtxt to channels_first: bool = False and update
the three RQS override method signatures to also default channels_first to False
so callers use NHWC by default and the method signatures match the only
supported layout (refer to parseNodeCtxt and the three RQS override methods to
locate and update their signatures).

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.
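
A toy sketch of pinning every axis; `Var` and `TilerModel` here are stand-ins that merely record constraints, while the real API builds `dimVar == dimVar.Max()` expressions:

```python
class Var:
    """Stand-in for a tiler dimension variable."""

    def __init__(self, name, maximum):
        self.name, self.maximum = name, maximum

    def Max(self):
        return self.maximum

class TilerModel:
    """Records constraints instead of solving them."""

    def __init__(self):
        self.constraints = []

    def addConstraint(self, expr):
        self.constraints.append(expr)

def pin_packed_weight(model, dim_vars):
    # Pin every axis of the packed DW weight block to its maximum so the
    # tiler can never shrink it; codegen always DMA-loads the full block.
    for v in dim_vars:
        model.addConstraint((v.name, v.Max()))

model = TilerModel()
pin_packed_weight(model, [Var("cout", 16), Var("kH", 3), Var("kW", 3)])
```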

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 139-150: The _findAllMultiplicands function currently uses float
division and an exclusive upper bound which breaks integer factorization; change
the loop to iterate i from 2 through int(math.sqrt(x)) inclusive (or use
math.isqrt(x)+1) and use integer division (tmpX //= i) when extracting factors
so tmpX stays an int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 19-30: The constructor __init__ currently uses a mutable default
inputOffsets = {} which shares state across instances; change the parameter
default to inputOffsets: Optional[Dict] = None and inside __init__ (before
calling super().__init__) set inputOffsets = {} if it's None, then pass that
local dict into super().__init__; update the function signature and ensure any
references to inputOffsets in NE16Deployer.__init__ use the per-instance dict
created when None is supplied.

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: isPWConv currently allows grouped 1x1 convs and isDWConv
accepts any grouped 3x3 conv; change the predicates to only mark true pointwise
(group == 1) and true depthwise (group == number of input channels)
respectively: update isPWConv to include node.attrs['group'] == 1 in its
condition, and update isDWConv to require node.attrs['group'] equal the
convolution's input channel count (i.e., only treat it as depthwise when group
== input_channels) while keeping the existing kernel_shape, dilation and stride
checks.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Replace the invalid in-line assignment inside the C
compound literal with a designated initializer: change the occurrence of
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" so that within the
compound literal for task.data.cfg you use ".weights_stride = (ne16_stride_t) {
.d0 = ..., .d1 = ..., .d2 = ... }" (i.e., reference the struct member as
.weights_stride) so the generated kernel compiles; update the initializer near
the other designated fields in the same compound literal where task.data.cfg is
constructed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint map
outputHeightVar and outputWidthVar directly to inputHeightVar/inputWidthVar
which only works for stride=1; update the constraints to compute output dims
from the pointwise stride when NE16Engine.enableStrides is used (use the
convolution output formula for 1x1: H_out = ceil(inputHeight / strideH), W_out =
ceil(inputWidth / strideW)). Replace the equalities on outputHeightVar and
outputWidthVar with expressions that use the pointwise stride variables (e.g.,
strideH/strideW or the engine's stride config) and implement the integer ceil
division ((input + stride - 1) / stride) so tilerModel.addConstraint uses
outputHeightVar == (inputHeightVar + strideHVar - 1) / strideHVar and similarly
for width; keep the batch mapping (outputBatchVar == inputBatchVar) as-is.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 90-106: The subtraction values = values - weight_offset can
underflow when weightTensor.values is int8; before computing the offset and
subtracting, widen values to a safe integer type (e.g., int16/int32) so the
subtraction is performed without wraparound, then clip/normalize and finally
cast to np.uint8 before calling _weightEncode; update the operations around
weight_offset, values, and the subsequent transpose/astype steps (referencing
weight_offset, values, weightTensor.values, and _weightEncode) to perform
widening, subtraction, any needed clamping, and only then cast to uint8 for
encoding.

---

Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the wildcard import in the package __init__ (from . import *)
with explicit submodule imports to make exported symbols clear: import the
submodules NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint
and RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaa00d35-32b5-49a6-8eeb-a015c49962b3

📥 Commits

Reviewing files that changed from the base of the PR and between 38f10f2 and 748707a.

📒 Files selected for processing (35)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (4)
  • DeeployTest/conftest.py
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/testUtils/core/execution.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • DeeployTest/testUtils/platformMapping.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/Bindings.py
  • DeeployTest/test_gap9_ne16_tiled_config.py

Comment thread Deeploy/Targets/NE16/Parsers.py
Comment on lines +37 to +40
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:

⚠️ Potential issue | 🟠 Major

Make the NE16 parser default to NHWC.

parseNodeCtxt() defaults channels_first to True, but Line 56 rejects anything except False. Any caller that uses the default path will fail to parse NE16 convolutions. Flip this default here and mirror it in the three RQS overrides so the method signature matches the only supported layout.

Proposed fix
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
@@
-                channels_first == False,
+                not channels_first,
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 40, The NE16 parser's
parseNodeCtxt currently defaults channels_first=True but the implementation only
accepts False; change the default parameter in parseNodeCtxt to channels_first:
bool = False and update the three RQS override method signatures to also default
channels_first to False so callers use NHWC by default and the method signatures
match the only supported layout (refer to parseNodeCtxt and the three RQS
override methods to locate and update their signatures).

Comment on lines +52 to +58
weightBuffer = ctxt.lookup(weightBufferName)
# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
# Keep the outermost dim fixed at its full (=1) value regardless of
# the output channel tiling.
tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())


⚠️ Potential issue | 🟠 Major

Pin the whole packed DW weight tensor, not just dim 0.

Line 57 fixes only weightShape[0], but Lines 227-228 always emit a load for the full encoded weight block. That lets the tiler under-account weight memory by shrinking the packed-bytes axis, then codegen still DMA-loads the entire tensor. Constrain every dimension of the encoded DW weight buffer to Max() for this path.

Proposed fix
         weightBuffer = ctxt.lookup(weightBufferName)
         # NE16 DW weight is packed as a single (1, 1, packed_bytes) block
         # containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-        # Keep the outermost dim fixed at its full (=1) value regardless of
-        # the output channel tiling.
-        tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+        # Keep the full encoded block resident regardless of output-channel tiling.
+        for dimIdx in range(len(weightBuffer.shape)):
+            weightDimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+            tilerModel.addConstraint(weightDimVar == weightDimVar.Max())

Also applies to: 225-228

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.

Comment on lines +139 to +150
def _findAllMultiplicands(x: int) -> List[int]:
multiplicands = []
tmpX = x
for i in range(2, math.ceil(math.sqrt(x))): # Ceil cause range doesn't include the last number
while tmpX % i == 0:
multiplicands.append(i)
tmpX = tmpX / i

if x // math.prod(multiplicands) > 1:
multiplicands.append(x // math.prod(multiplicands))

return multiplicands

⚠️ Potential issue | 🟠 Major

Use integer factorization here.

tmpX = tmpX / i turns the accumulator into a float, and the current loop bound skips exact square roots. For dimensions like 4 or 49 this misses balanced factor pairs, so _bestReshapeOption() falls back to (dim, 1) and defeats the pointwise-reshape heuristic.

Suggested fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX //= i

-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+    if tmpX > 1:
+        multiplicands.append(tmpX)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The _findAllMultiplicands function currently uses float division and an
exclusive upper bound which breaks integer factorization; change the loop to
iterate i from 2 through int(math.sqrt(x)) inclusive (or use math.isqrt(x)+1)
and use integer division (tmpX //= i) when extracting factors so tmpX stays an
int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
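The corrected factorization can be checked in isolation. The sketch below is a standalone re-implementation following the suggested fix, not the repository's `_findAllMultiplicands` itself:

```python
import math
from typing import List


def find_all_multiplicands(x: int) -> List[int]:
    """Prime factorization using only integer arithmetic.

    The inclusive math.isqrt(x) bound catches exact square roots
    (4 -> [2, 2], 49 -> [7, 7]) that an exclusive ceil(sqrt(x))
    bound would miss.
    """
    multiplicands = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):
        while tmp % i == 0:
            multiplicands.append(i)
            tmp //= i  # integer division keeps tmp an int
    if tmp > 1:
        multiplicands.append(tmp)  # leftover prime factor above sqrt(x)
    return multiplicands


print(find_all_multiplicands(4))    # [2, 2]
print(find_all_multiplicands(49))   # [7, 7]
print(find_all_multiplicands(360))  # [2, 2, 2, 3, 3, 5]
```

With balanced factor pairs recovered for perfect squares, `_bestReshapeOption()` has real candidates to choose from instead of falling back to `(dim, 1)`.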

runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
@runwangdl runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 on April 14, 2026 10:50
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Deeploy/Targets/GAP9/Platform.py (1)

8-54: ⚠️ Potential issue | 🟡 Minor

Run the formatter/import hooks for this file.

CI reports both yapf and isort modified this file, so the branch is not passing pre-commit yet.

#!/bin/bash
pre-commit run yapf isort --files Deeploy/Targets/GAP9/Platform.py

Also applies to: 281-283

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Platform.py` around lines 8 - 54, The file containing
the long import block (look for symbols like DeQuantTilingReadyBindings,
BasicAllocateTemplate, RemoveEmptyConvBiasPass and PULPDMASliceBindings) is not
formatted/organized per CI; run the import and formatter hooks (isort and yapf)
to reorder and format the imports/line breaks, then stage and commit the
resulting changes so pre-commit passes (this also applies to the nearby import
group that includes PULP* bindings and parsers).
♻️ Duplicate comments (12)
Deeploy/Targets/NE16/__init__.py (1)

5-5: ⚠️ Potential issue | 🟠 Major

Replace the self-import with explicit NE16 module exports.

This remains unresolved: from . import * does not expose the new NE16 sibling modules. Use explicit imports and include Tiler as well.

Proposed fix
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/__init__.py` at line 5, Replace the wildcard self-import
in Deeploy/Targets/NE16/__init__.py with explicit exports: import the sibling
module objects you want to expose (e.g., the NE16 device class/name and the
Tiler class) by name and populate __all__ accordingly so consumers can access
NE16 and Tiler directly; specifically remove "from . import *", add explicit
imports of the NE16 module symbols and Tiler, and set __all__ = ["NE16",
"Tiler", ...] to list the public API.
Deeploy/Targets/NE16/Deployer.py (2)

28-30: ⚠️ Potential issue | 🟠 Major

Use None for the inputOffsets default.

{} is shared across NE16Deployer instances, so parent initialization can leak offsets between deployments.

Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Verification:

#!/bin/bash
python -m ruff check Deeploy/Targets/NE16/Deployer.py --select B006
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
NE16Deployer.__init__ currently uses a mutable default for the inputOffsets
parameter (inputOffsets = {}), which can be shared across instances; change the
signature to use inputOffsets=None and then when calling super().__init__ pass
an empty dict when inputOffsets is None (e.g., use inputOffsets or {}), ensuring
NE16Deployer.__init__ does not leak offsets between deployments.
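The Python pitfall behind this finding reproduces with toy classes (hypothetical names, unrelated to the actual deployer):

```python
class Leaky:
    # BUG: the dict literal is evaluated once at def time and shared
    # by every instance that relies on the default.
    def __init__(self, offsets = {}):
        self.offsets = offsets


class Safe:
    # FIX: None sentinel, fresh dict per instance.
    def __init__(self, offsets = None):
        self.offsets = {} if offsets is None else offsets


a, b = Leaky(), Leaky()
a.offsets["x"] = 128
print(b.offsets)  # {'x': 128} -- leaked across instances

c, d = Safe(), Safe()
c.offsets["x"] = 128
print(d.offsets)  # {}
```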

29-40: ⚠️ Potential issue | 🟠 Major

Avoid mutating the shared optimizer pass list.

self.loweringOptimizer.passes is modified in place. Since the NE16 optimizer is constructed at module scope, repeated deployer creation can append duplicate NE16 passes and alter other deployments’ lowering pipelines.

Proposed fix
+import copy
+
 ...
-        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
+        loweringOptimizer = copy.copy(loweringOptimizer)
+        loweringOptimizer.passes = list(loweringOptimizer.passes)
+
+        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 29 - 40, The code mutates the
shared loweringOptimizer.passes list causing duplicate/side-effect passes; fix
by creating a new list copy and assigning it back instead of modifying in place:
make a local copy (e.g., new_passes = list(self.loweringOptimizer.passes)),
perform the replacement of PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes
when self.Platform.engines[0].enable3x3 is true, then set
self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(),
NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these
symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass,
ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer
to avoid mutating shared state.
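The same class of bug applies to the module-scope optimizer: appending to a shared pass list mutates every deployment that reuses it. A minimal sketch with placeholder names (not the real optimizer API):

```python
class Optimizer:
    def __init__(self, passes):
        self.passes = passes


def build_deployer_buggy(opt):
    # BUG: in-place append on a shared object; every later build sees it.
    opt.passes.append("NE16Pass")
    return opt


def build_deployer_fixed(opt):
    # FIX: shallow-copy the pass list before extending it.
    local = Optimizer(list(opt.passes))
    local.passes.append("NE16Pass")
    return local


shared = Optimizer(["NCHWtoNHWC"])
build_deployer_buggy(shared)
build_deployer_buggy(shared)
print(shared.passes)  # ['NCHWtoNHWC', 'NE16Pass', 'NE16Pass'] -- duplicated

fresh = Optimizer(["NCHWtoNHWC"])
d1 = build_deployer_fixed(fresh)
d2 = build_deployer_fixed(fresh)
print(fresh.passes)  # ['NCHWtoNHWC'] -- untouched
```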
Deeploy/Targets/NE16/Parsers.py (2)

37-56: ⚠️ Potential issue | 🟠 Major

channels_first default is still True while the check rejects everything but False.

parseNodeCtxt defaults channels_first=True on line 40 (and the three RQS overrides at lines 113, 153, 193 also default to True), but line 56 rejects anything except False. Any caller that uses the default path returns (ctxt, False) and the NE16 parser fails to bind. Flip all four defaults to False, and replace the == False check with not channels_first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 56, The parseNodeCtxt
method in NE16's Parsers.py currently defaults channels_first=True but
immediately rejects any True value (using "channels_first == False"); change the
default of parseNodeCtxt to channels_first=False and also update the three RQS
override methods that default channels_first=True to default to False, then
simplify the conditional check from "channels_first == False" to "not
channels_first" (ensure you update all occurrences in parseNodeCtxt and the
three RQS override methods so callers using defaults succeed).

118-123: ⚠️ Potential issue | 🔴 Critical

Input mapping still has no bounds check and shift is never set from inputs/attrs.

The three RQS parseNodeCtxt overrides (lines 119-121 / 159-161 / 199-201) unconditionally map node.inputs[idx] into ['data_in','weight','mul','add']. If a 5th tensor is present (some RQS pipelines emit shift as an input) this throws IndexError, and if it's absent, the template variable ${shift} (referenced by ConvTemplate.py line 339 scale_shift_addr = (uint32_t)${shift}) stays "NULL" with no validation. Either validate len(node.inputs) == 4 explicitly or map a 5th input/node.attrs['shift'] into operatorRepresentation['shift'].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 118 - 123, The parseNodeCtxt
overrides unconditionally map node.inputs into inputs =
['data_in','weight','mul','add'] causing IndexError for a 5th tensor and leaving
operatorRepresentation['shift'] unset; update parseNodeCtxt (the methods that
populate self.operatorRepresentation) to first check len(node.inputs) and either
assert/raise if it is not 4 or 5, and if a 5th input exists map it to
self.operatorRepresentation['shift'] = ctxt.lookup(node.inputs[4].name).name,
and additionally if 'shift' can come from node.attrs check
node.attrs.get('shift') and set operatorRepresentation['shift'] accordingly so
ConvTemplate.py's ${shift} is always defined or an explicit error is raised.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (2)

88-102: ⚠️ Potential issue | 🟠 Major

int8 subtraction still overflows before cast to uint8.

values - weight_offset executes in values.dtype. For int8 weights, weight_offset = values.min() can be -128, so any 127 element wraps to -1 and becomes 255 after the later astype(uint8) — wrong absolute offset, wrong encoded weights. Please widen before subtracting.

🐛 Proposed fix
-    weight_offset = values.min()
-    values = values - weight_offset
+    weight_offset = int(values.min())
+    values = values.astype(np.int32) - weight_offset
     node.attrs["weight_offset"] = weight_offset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 88 -
102, The subtraction "values = values - weight_offset" risks int8 overflow; cast
the tensor to a wider signed dtype (e.g., int16 or int32) before computing
"weight_offset" subtraction so arithmetic is done in a safe width, perform
"values = values.astype(wider_dtype) - int(weight_offset)" then (after any
necessary range checks/clamping) cast to uint8 for encoding; update the code
paths around weightTensor.values, weight_offset and the subsequent astype(uint8)
to use the widened intermediate to avoid wraparound for int8 inputs.
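A standalone NumPy snippet shows the intermediate wraparound (illustrative values only; the exact downstream breakage depends on how the encoder consumes the shifted tensor):

```python
import numpy as np

values = np.array([-128, 0, 127], dtype = np.int8)
offset = values.min()  # np.int8(-128)

# Narrow subtraction wraps: the offset-shifted tensor is meant to be
# non-negative, but int8 arithmetic leaves negatives in the intermediate.
narrow = values - offset  # [0, -128, -1] in int8

# Widening first keeps the arithmetic exact and in [0, 255].
wide = values.astype(np.int32) - int(offset)  # [0, 128, 255]

print(narrow.dtype, narrow.tolist())
print(wide.dtype, wide.tolist())
```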

139-150: ⚠️ Potential issue | 🟠 Major

_findAllMultiplicands still uses float division and an exclusive sqrt bound.

range(2, math.ceil(math.sqrt(x))) never tests the exact square root (for x = 4 the range is empty; for x = 49 it stops at 6), so multiplicands stays [] while math.prod([]) == 1 and the function returns [x]. Also tmpX = tmpX / i converts the accumulator to float, which then fails the tmpX % i == 0 check for large factors. Result: _bestReshapeOption collapses to (dim, 1) for many common spatial sizes and the pointwise-reshape heuristic never fires.

♻️ Proposed fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
-
-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+            tmpX //= i
+    if tmpX > 1:
+        multiplicands.append(tmpX)
     return multiplicands
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The function _findAllMultiplicands uses float division and an exclusive
sqrt bound; change the loop to iterate i from 2 up to and including the integer
square root (use math.isqrt(x) or int(math.sqrt(x)) + 1) and perform integer
division on tmpX (use //= instead of /=) so tmpX remains an int for the modulo
checks; also simplify the tail handling by appending tmpX if tmpX > 1 (rather
than recomputing x // math.prod(...)) so _findAllMultiplicands reliably returns
prime factors for inputs like 4 or 49 and avoids float-related modulo failures.
Deeploy/Targets/NE16/Engine.py (1)

62-75: ⚠️ Potential issue | 🟠 Major

isPWConv/isDWConv still over-accept grouped convs.

The classifier helpers still don't require group == 1 for PW or assert the DW shape condition (group == ch_im_out with weights.shape[1] == 1). With the parsers enforcing stricter rules downstream, this means canExecute can return True for a node that then fails parsing/tiling, instead of cleanly falling back to the GAP9 cluster engine.

🔒️ Proposed fix
     def isPWConv(self, node) -> bool:
         return node.op in ["Conv", "RequantizedConv"] and \
             isinstance(node.inputs[1], gs.Constant) and \
             node.attrs['kernel_shape'] == [1, 1] and \
             node.attrs['dilations'] == [1, 1] and \
+            node.attrs['group'] == 1 and \
             (node.attrs['strides'] == [1, 1] or self.enableStrides)

     def isDWConv(self, node) -> bool:
+        weights = node.inputs[1].values if isinstance(node.inputs[1], gs.Constant) else None
         return node.op in ["Conv", "RequantizedConv"] and \
             isinstance(node.inputs[1], gs.Constant) and \
             node.attrs['kernel_shape'] == [3, 3] and \
             node.attrs['dilations'] == [1, 1] and \
-            node.attrs['group'] != 1 and \
+            weights is not None and weights.shape[1] == 1 and \
+            node.attrs['group'] == weights.shape[0] and \
             (node.attrs['strides'] == [1, 1] or self.enableStrides)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, the PW/DW classifiers
are too permissive: update isPWConv to also require node.attrs['group'] == 1 to
reject grouped convs, and update isDWConv to enforce the depthwise conditions by
checking that the weight tensor on node.inputs[1] has weights.shape[1] == 1 and
that node.attrs['group'] == weights.shape[0] (i.e., group equals the output
channel count); locate these checks in the isPWConv and isDWConv functions and
add the group and weight-shape validations so canExecute no longer returns True
for convs that will fail downstream parsing/tiling.
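The tightened predicates can be sketched on plain shapes and attributes (hypothetical helpers mirroring the proposed fix, not the engine's methods):

```python
def is_pointwise(kernel, dilations, group, strides, enable_strides = False):
    # 1x1 conv is only pointwise for the NE16 if it is ungrouped.
    return (kernel == [1, 1] and dilations == [1, 1] and group == 1
            and (strides == [1, 1] or enable_strides))


def is_depthwise(kernel, dilations, group, w_shape, strides, enable_strides = False):
    # Depthwise: one filter per channel -> weights are (C_out, 1, kH, kW)
    # and group equals the output channel count, not merely group != 1.
    return (kernel == [3, 3] and dilations == [1, 1]
            and w_shape[1] == 1 and group == w_shape[0]
            and (strides == [1, 1] or enable_strides))


# group=4 conv with 8 output channels: neither PW nor true DW, so it
# should fall back to the cluster engine instead of binding to NE16.
print(is_pointwise([1, 1], [1, 1], 4, [1, 1]))                 # False
print(is_depthwise([3, 3], [1, 1], 4, (8, 2, 3, 3), [1, 1]))   # False
print(is_depthwise([3, 3], [1, 1], 8, (8, 1, 3, 3), [1, 1]))   # True
```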
Deeploy/Targets/NE16/Templates/ConvTemplate.py (2)

352-356: ⚠️ Potential issue | 🔴 Critical

.weights_stride is still the invalid task.data.cfg.weights_stride = … form.

Despite the previous "addressed in 773777f" note, the rendered template at these lines still emits task.data.cfg.weights_stride = (ne16_stride_t) { … } in the middle of the ne16_cfg_t compound literal. That is not a designated initializer and will not compile. The fix needs to land in this file.

🐛 Proposed fix
             .output_stride = (ne16_stride_t) {
                 .d0 = NE16_OUTPUT_BANDWIDTH_BYTES,
                 .d1 = ${dim_im_out_y_stride},
                 .d2 = ${dim_im_out_x_stride}
             },
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356, The
template emits an invalid assignment "task.data.cfg.weights_stride =
(ne16_stride_t) { ... }" inside the ne16_cfg_t compound literal; update the
ConvTemplate.py rendering so it produces a designated initializer by replacing
that emitted assignment with ".weights_stride = (ne16_stride_t){ .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e.,
remove the "task.data.cfg." prefix and emit the field as a proper member
initializer inside the ne16_cfg_t literal), ensuring punctuation and comma
placement matches other fields in the ne16_cfg_t literal.

145-151: ⚠️ Potential issue | 🟠 Major

"add" in operatorRepresentation still ignores the "NULL" sentinel.

Same story as before: mul is gated on != "NULL", but use_bias falls back to a presence check. parseNodeCtxt in Deeploy/Targets/NE16/Parsers.py lines 74–76 writes "NULL" into operatorRepresentation["add"] unconditionally, so for bias-less requantized convs bit 25 of conf0 ends up set while scale_bias_addr in the emitted task struct is NULL. Please treat "NULL" as absent here too.

🛡️ Proposed fix
         if operatorRepresentation["mul"] != "NULL":
             mulBuff = ctxt.lookup(operatorRepresentation["mul"])
             mulBits = mulBuff._type.referencedType.typeWidth
-            operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
-                                                                 operatorRepresentation["log2D"], mulBits, "add"
-                                                                 in operatorRepresentation, False)
+            use_bias = operatorRepresentation.get("add", "NULL") != "NULL"
+            operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
+                                                                 operatorRepresentation["log2D"], mulBits,
+                                                                 use_bias, False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 145 - 151, The
conf0 bit for bias presence is being set when operatorRepresentation contains
the key "add" even if its value is the sentinel "NULL"; update the requantized
branch in ConvTemplate.py so the third argument to getNormQuantConf0 uses an
explicit non-NULL check (e.g. operatorRepresentation.get("add") != "NULL")
instead of '"add" in operatorRepresentation', mirroring the mul check
(operatorRepresentation["mul"] != "NULL"); locate the operatorRepresentation
usage around the mulBuff lookup and change the boolean expression passed to
getNormQuantConf0 accordingly so bias-less convs do not set bit 25.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

43-56: ⚠️ Potential issue | 🟠 Major

Pointwise output geometry still modeled as identity — wrong for strided 1×1 / padded cases.

With NE16Engine.enableStrides=True, isPWConv accepts strides != [1,1], but the constraints outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar force stride == 1 and no padding. Either add a strides == [1,1] assertion to the 1×1 path, or use the standard (effective_input - 1) // stride + 1 relation like the dense/DW paths do.

🔒️ Proposed fix
-        # Map output dims to inputs dims
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+
+        # Map output dims to input dims (stride- and padding-aware for 1x1)
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth  = inputWidthVar  + ((padding[1] + padding[3]) * (inputWidthVar  == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar  == (effectiveWidth  - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 56, The current pointwise tiler constraints incorrectly force
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids
strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where
tilerModel.addConstraint is called (the lines referencing
outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A)
add an assertion that strides == [1,1] when isPWConv can accept strides (check
NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred)
replace the identity constraints with the standard convolution output relation
used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1,
computing effective_input = input + pad_top + pad_bottom (use whatever padding
variables exist in scope) and apply that formula for both height and width via
tilerModel.addConstraint.
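The output-size relation the fix relies on can be sanity-checked in isolation; this is a plain-Python sketch (the function name and parameters are illustrative, not the tiler's API):

```python
def conv_output_size(in_size: int, kernel: int, stride: int, pad_lo: int, pad_hi: int) -> int:
    # Standard convolution output relation, as used by the dense/DW constraints.
    effective = in_size + pad_lo + pad_hi
    return (effective - kernel) // stride + 1

# For a 1x1 kernel this reduces to (effective_input - 1) // stride + 1.
```

With stride 2 on an 8-wide input, a 1×1 conv yields 4 output columns, which the identity constraint would wrongly force back to 8.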
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (1)

52-58: ⚠️ Potential issue | 🟠 Major

Still only pinning weight dim 0; the rest of the packed DW block can be shrunk by the tiler.

The depthwise weight is emitted as a single packed block (HyperRectangle((0,0,0), (weightShape[0], weightShape[1], weightShape[2])) on line 228), but only dim 0 is constrained to Max() here. The tiler is free to pick smaller values for dims 1 and 2, under-accounting weight memory while codegen still DMAs the full block.

🔒️ Proposed fix
         weightBuffer = ctxt.lookup(weightBufferName)
-        # NE16 DW weight is packed as a single (1, 1, packed_bytes) block
-        # containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-        # Keep the outermost dim fixed at its full (=1) value regardless of
-        # the output channel tiling.
-        tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+        # NE16 DW weight is one packed block loaded entirely every iteration;
+        # pin every axis to its Max so the tiler accounts for the full block.
+        for dimIdx in range(len(weightBuffer.shape)):
+            dimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+            tilerModel.addConstraint(dimVar == dimVar.Max())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current code only pins weightOutChannelVar to its Max(), allowing
the other two packed weight dimensions to be shrunk; fix by retrieving the
remaining dimension variables for the weight buffer (the vars corresponding to
dims 1 and 2 of weightBuffer) and add equivalent constraints using
tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed block (all
dims of weightBuffer) is fixed to full size; use weightBuffer/weightBufferName
to locate the buffer and weightOutChannelVar as a reference to the existing
constraint when adding the other two constraints.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

41-44: Unused weight-dim variables.

weightInChannelMajorVar, weightBitsVar, and weightBandwidthVar are created but never referenced in addGeometricalConstraint or anywhere else. Only weightOutChannelVar is actually used (line 56). Either wire them into real constraints (e.g., pin the bits/bandwidth axes like the pointwise policy does) or drop them.

♻️ Proposed cleanup
         weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
-        weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
-        weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
-        weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, Remove or use the three unused weight-dim variables
(weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in
NE16DenseConstraint; they are allocated but never referenced in
addGeometricalConstraint while only weightOutChannelVar is used. Either delete
those getTensorDimVar calls to clean up the code, or wire them into the
constraint logic inside addGeometricalConstraint (or the class constructor) —
e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the
pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling
decisions and weightInChannelMajorVar is used where input-channel-major indexing
matters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Around line 62-80: The bias handling must normalize C.values to a 1-D array of
length Ko before any signed-input compensation; instead of only truncating when
bias_flat.size > Ko, handle bias_flat.size < Ko by expanding/broadcasting (e.g.,
repeat or np.resize) to length Ko and assert the resulting shape equals Ko,
ensure dtype is appropriate (int32/int64 as needed), then perform the
signed-input compensation using w_sum, scale_values (from
operatorRepresentation['mul']) or the int32 path, and finally assign the
corrected Ko-length array back to C.values; update references around C.values,
bias_flat, Ko, w_sum and operatorRepresentation so no silent broadcasting or
out-of-bounds write occurs.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py`:
- Around line 116-123: The code always sets NSize = buffA.shape[-1] which
ignores transposition; update the calculation to choose the reduction dimension
based on operatorRepresentation['transA'] (and consider transB if used
elsewhere): compute NSize = buffA.shape[-2] when transA == 1 else
buffA.shape[-1], and adjust any subsequent uses that compute A/B tile offsets or
shapes so they use this transpose-aware NSize (refer to variables transA,
transB, buffA, buffB, and NSize in NE16GEMMTileConstraint.py).
- Around line 131-167: The loop over outputCubes is using cube.offset (the local
rectangle) which ignores AbsoluteHyperRectangle.absoluteOffset, causing input
tile offsets to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).

In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py`:
- Around line 80-106: Ensure RequantizedGemm handling validates all requant
constants before mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.

In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py`:
- Around line 1200-1221: The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
- Around line 1234-1238: Add a normalized "n_levels" entry alongside
"n_levels_out" when building rqs_attrs so operatorRepresentation['n_levels'] is
present for RequantShiftChecker and callers that expect the normalized
attribute; update the rqs_attrs dict in the block that creates the constants
(the rqs_attrs variable) to include a gs.Constant key 'n_levels' (using the same
normalized n_levels value used for 'n_levels_out') so both 'n_levels' and
'n_levels_out' are emitted consistently with other passes.

In `@Deeploy/Targets/NE16/OptimizationPasses/__init__.py`:
- Line 5: Replace the wildcard import in __init__.py with explicit imports from
MemoryLevelAnnotationPasses.py so the symbol AnnotateNE16WeightMemoryLevel is
exported and discoverable; specifically, import the class/function name (e.g.,
AnnotateNE16WeightMemoryLevel) from MemoryLevelAnnotationPasses and either add
it to the module namespace or to __all__ in __init__.py (or remove the
initializer entirely) so callers can successfully do from
Deeploy.Targets.NE16.OptimizationPasses import AnnotateNE16WeightMemoryLevel.

In `@Deeploy/Targets/NE16/Platform.py`:
- Around line 7-21: The file's imports are not formatted/ordered per project
hooks; run the formatter and import-sorting hooks on this module (e.g., run
pre-commit or directly run yapf and isort) to reorder and reflow the import
block so CI passes; locate the top import section (references:
RequantizedGemmToPwPass, NE16Engine, GAP9Platform, GAP9_FloatGEMMMapper,
NE16AdjustGEMMWeightLayoutPass, PULPRQSGEMMLayer) and apply yapf/isort to fix
spacing, line breaks and import grouping consistently across that block (also
apply the same fixes around lines 34-43 where similar import formatting issues
exist).
- Around line 39-43: NE16Optimizer currently reuses the same pass instances from
PULPOptimizer by using [*PULPOptimizer.passes], causing shared mutable state;
change the construction to deep-copy the pass objects instead (e.g., use
copy.deepcopy on each element of PULPOptimizer.passes) so NE16Optimizer =
TopologyOptimizer([*deep-copied-passes, NE16AdjustGEMMWeightLayoutPass(),
RequantizedGemmToPwPass()], name="NE16Optimizer"); reference
PULPOptimizer.passes and NE16Optimizer and import/use copy.deepcopy (or use a
per-pass clone method if available) to ensure independent pass instances.
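The aliasing hazard behind this comment can be shown with plain objects (this Pass class is a stand-in, not Deeploy's real pass type):

```python
import copy

class Pass:
    # Stand-in for an optimizer pass that accumulates per-run mutable state.
    def __init__(self):
        self.seen = []

base_passes = [Pass(), Pass()]

shared = [*base_passes]                  # copies the list, but aliases the pass objects
shared[0].seen.append("run-A")           # mutation leaks back into base_passes

independent = [copy.deepcopy(p) for p in base_passes]
independent[0].seen.append("run-B")      # isolated: base_passes is untouched
```

After this runs, base_passes[0].seen contains "run-A" (leaked via the alias) but not "run-B".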

In `@Deeploy/Targets/NE16/Templates/__init__.py`:
- Line 5: The package __init__.py currently uses a self-referential wildcard
import ("from . import *") which doesn't actually import sibling modules (e.g.,
the NE16 allocation and conv template modules) and triggers Ruff F403; replace
the wildcard with explicit imports of the template modules (or explicit from
.module import <symbols>) and/or populate __all__ with the actual module or
symbol names (for example add explicit references to the NE16 allocation and
conv modules) so package-level imports expose the intended templates. Ensure you
reference and import the specific module names rather than relying on "from .
import *".

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the package wildcard import in __init__.py (currently "from .
import *") with explicit relative imports of the concrete constraint modules
and/or classes so sibling modules are actually loaded and their classes are
exposed; e.g., use "from .constraint_module import ConstraintClass" for each
constraint module, and populate an __all__ list with the exported class names
(or import the modules with "from . import constraint_module" if you prefer
module exports) so the concrete constraint classes are available to consumers.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py`:
- Line 5: The package __init__.py currently uses a wildcard import which doesn't
re-export classes from Passes.py; update
Deeploy.Targets.NE16.TopologyOptimizationPasses.__init__ to explicitly import
and re-export the four pass classes by adding an explicit import from .Passes of
NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass,
ConvEngineDiscolorationPass, and NE16OptimizationPass and ensure they are
exposed (e.g., in __all__) so consumers can import them directly from the
package.

In `@DeeployTest/test_platforms.py`:
- Around line 1008-1010: Unpack test_params using a throwaway name for the
unused config_name (e.g., test_name, l1, _ = test_params) and build the cmake
args using iterable unpacking instead of list concatenation (e.g.,
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]);
apply the same two edits for the other two occurrences that create
ne16_cmake_args in the GAP9_w_NE16 tiled tests so Ruff warnings for unused
variables and list concatenation are resolved.
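The two suggested edits look like this in isolation (all values below are illustrative, not the actual test config):

```python
GAP9_NE16_TILED_DEFAULT_CORES = 8                        # illustrative value
cmake_args = ["TOOLCHAIN=LLVM", "platform=GAP9_w_NE16"]  # illustrative base args

test_params = ("testRQConv", 64000, "pw_1x1")            # hypothetical (name, l1, config_name)
test_name, l1, _ = test_params                           # throwaway name for the unused entry
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
```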

In `@TargetLibraries/GAP9/inc/ne16_utils.h`:
- Around line 11-12: The public header ne16_utils.h currently pulls in
implementation-only headers; replace the includes of "CNN_BasicKernels_fp32.h"
and "pmsis.h" with the standard <stdint.h> so the public struct uses only
fixed-width integer types, and remove the unavailable dependency. Then add
`#include "pmsis.h"` to ne16_utils.c (where gap_coreid(), gap_ncore(), and the
v4s vector type are used) so the implementation functions there can access
those symbols.

---

Outside diff comments:
In `@Deeploy/Targets/GAP9/Platform.py`:
- Around line 8-54: The file containing the long import block (look for symbols
like DeQuantTilingReadyBindings, BasicAllocateTemplate, RemoveEmptyConvBiasPass
and PULPDMASliceBindings) is not formatted/organized per CI; run the import and
formatter hooks (isort and yapf) to reorder and format the imports/line breaks,
then stage and commit the resulting changes so pre-commit passes (this also
applies to the nearby import group that includes PULP* bindings and parsers).

---

Duplicate comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: Replace the wildcard self-import in Deeploy/Targets/NE16/__init__.py
with explicit exports: import the sibling module objects you want to expose
(e.g., the NE16 device class/name and the Tiler class) by name and populate
__all__ accordingly so consumers can access NE16 and Tiler directly;
specifically remove "from . import *", add explicit imports of the NE16 module
symbols and Tiler, and set __all__ = ["NE16", "Tiler", ...] to list the public
API.

In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor NE16Deployer.__init__ currently uses a
mutable default for the inputOffsets parameter (inputOffsets = {}), which can be
shared across instances; change the signature to use inputOffsets=None and then
when calling super().__init__ pass an empty dict when inputOffsets is None
(e.g., use inputOffsets or {}), ensuring NE16Deployer.__init__ does not leak
offsets between deployments.
- Around line 29-40: The code mutates the shared loweringOptimizer.passes list
causing duplicate/side-effect passes; fix by creating a new list copy and
assigning it back instead of modifying in place: make a local copy (e.g.,
new_passes = list(self.loweringOptimizer.passes)), perform the replacement of
PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes when
self.Platform.engines[0].enable3x3 is true, then set
self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(),
NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these
symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass,
ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer
to avoid mutating shared state.
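The mutable-default hazard flagged in the first Deployer.py comment above is a general Python pitfall, shown here with a toy function (not the real NE16Deployer signature):

```python
def leaky(offsets = {}):        # one dict object, shared across every call
    offsets["calls"] = offsets.get("calls", 0) + 1
    return offsets

def safe(offsets = None):       # fresh dict per call unless one is passed in
    offsets = offsets if offsets is not None else {}
    offsets["calls"] = offsets.get("calls", 0) + 1
    return offsets
```

leaky() returns {'calls': 1} on the first call and {'calls': 2} on the second; safe() returns {'calls': 1} every time.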

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: The PW/DW classifiers are too permissive: update isPWConv
to also require node.attrs['group'] == 1 to reject grouped convs, and update
isDWConv to enforce depthwise conditions by checking the weight tensor on
node.inputs[1] has weights.shape[1] == 1 and that node.attrs['group'] equals
the output channel count (whichever of 'channels_out' / 'ch_im_out' the node
carries); locate these checks in the isPWConv and isDWConv functions and
add the group and weight-shape validations so canExecute no longer returns True
for convs that will fail downstream parsing/tiling.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-56: The parseNodeCtxt method in NE16's Parsers.py currently
defaults channels_first=True but immediately rejects any True value (using
"channels_first == False"); change the default of parseNodeCtxt to
channels_first=False and also update the three RQS override methods that default
channels_first=True to default to False, then simplify the conditional check
from "channels_first == False" to "not channels_first" (ensure you update all
occurrences in parseNodeCtxt and the three RQS override methods so callers using
defaults succeed).
- Around line 118-123: The parseNodeCtxt overrides unconditionally map
node.inputs into inputs = ['data_in','weight','mul','add'] causing IndexError
for a 5th tensor and leaving operatorRepresentation['shift'] unset; update
parseNodeCtxt (the methods that populate self.operatorRepresentation) to first
check len(node.inputs) and either assert/raise if it is not 4 or 5, and if a 5th
input exists map it to self.operatorRepresentation['shift'] =
ctxt.lookup(node.inputs[4].name).name, and additionally if 'shift' can come from
node.attrs check node.attrs.get('shift') and set operatorRepresentation['shift']
accordingly so ConvTemplate.py's ${shift} is always defined or an explicit error
is raised.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: The template emits an invalid assignment
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" inside the ne16_cfg_t
compound literal; update the ConvTemplate.py rendering so it produces a
designated initializer by replacing that emitted assignment with
".weights_stride = (ne16_stride_t){ .d0 = ${weightStrideD0}, .d1 =
${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e., remove the
"task.data.cfg." prefix and emit the field as a proper member initializer inside
the ne16_cfg_t literal), ensuring punctuation and comma placement matches other
fields in the ne16_cfg_t literal.
- Around line 145-151: The conf0 bit for bias presence is being set when
operatorRepresentation contains the key "add" even if its value is the sentinel
"NULL"; update the requantized branch in ConvTemplate.py so the third argument
to getNormQuantConf0 uses an explicit non-NULL check (e.g.
operatorRepresentation.get("add") != "NULL") instead of '"add" in
operatorRepresentation', mirroring the mul check (operatorRepresentation["mul"]
!= "NULL"); locate the operatorRepresentation usage around the mulBuff lookup
and change the boolean expression passed to getNormQuantConf0 accordingly so
bias-less convs do not set bit 25.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current code only pins weightOutChannelVar to its
Max(), allowing the other two packed weight dimensions to be shrunk; fix by
retrieving the remaining dimension variables for the weight buffer (the vars
corresponding to dims 1 and 2 of weightBuffer) and add equivalent constraints
using tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed
block (all dims of weightBuffer) is fixed to full size; use
weightBuffer/weightBufferName to locate the buffer and weightOutChannelVar as a
reference to the existing constraint when adding the other two constraints.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-56: The current pointwise tiler constraints incorrectly force
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids
strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where
tilerModel.addConstraint is called (the lines referencing
outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A)
add an assertion that strides == [1,1] when isPWConv can accept strides (check
NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred)
replace the identity constraints with the standard convolution output relation
used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1,
computing effective_input = input + pad_top + pad_bottom (use whatever padding
variables exist in scope) and apply that formula for both height and width via
tilerModel.addConstraint.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 88-102: The subtraction "values = values - weight_offset" risks
int8 overflow; cast the tensor to a wider signed dtype (e.g., int16 or int32)
before computing "weight_offset" subtraction so arithmetic is done in a safe
width, perform "values = values.astype(wider_dtype) - int(weight_offset)" then
(after any necessary range checks/clamping) cast to uint8 for encoding; update
the code paths around weightTensor.values, weight_offset and the subsequent
astype(uint8) to use the widened intermediate to avoid wraparound for int8
inputs.
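The wraparound can be reproduced with NumPy; the offset value here is illustrative:

```python
import numpy as np

values = np.array([-128, -1, 127], dtype = np.int8)
weight_offset = -128                                # illustrative NE16-style offset

wrapped = values - np.int8(weight_offset)           # int8 arithmetic wraps: 127 - (-128) -> -1
widened = values.astype(np.int16) - weight_offset   # safe: subtraction done in int16
encoded = widened.astype(np.uint8)                  # cast only after the range is known to fit
```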
- Around line 139-150: The function _findAllMultiplicands uses float division
and an exclusive sqrt bound; change the loop to iterate i from 2 up to and
including the integer square root (use math.isqrt(x) or int(math.sqrt(x)) + 1)
and perform integer division on tmpX (use //= instead of /=) so tmpX remains an
int for the modulo checks; also simplify the tail handling by appending tmpX if
tmpX > 1 (rather than recomputing x // math.prod(...)) so _findAllMultiplicands
reliably returns prime factors for inputs like 4 or 49 and avoids float-related
modulo failures.
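A corrected, integer-only factorization matching the comment above might look like this (the function name and signature are illustrative, not the private helper's):

```python
import math

def find_all_multiplicands(x: int) -> list:
    # Trial division with an inclusive isqrt bound and integer division only.
    factors = []
    tmp = x
    i = 2
    while i <= math.isqrt(tmp):
        while tmp % i == 0:
            factors.append(i)
            tmp //= i
        i += 1
    if tmp > 1:
        factors.append(tmp)     # leftover prime larger than sqrt of the original x
    return factors
```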

---

Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: Remove or use the three unused weight-dim variables
(weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in
NE16DenseConstraint; they are allocated but never referenced in
addGeometricalConstraint while only weightOutChannelVar is used. Either delete
those getTensorDimVar calls to clean up the code, or wire them into the
constraint logic inside addGeometricalConstraint (or the class constructor) —
e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the
pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling
decisions and weightInChannelMajorVar is used where input-channel-major indexing
matters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cc86fa11-0cab-48bf-89f1-7688eb67747c

📥 Commits

Reviewing files that changed from the base of the PR and between 748707a and 11a641b.

📒 Files selected for processing (55)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/_runner-snitch-tiled-sequential.yml
  • .github/workflows/ci-platform-gap9-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9.yml
  • CMakeLists.txt
  • Deeploy/Targets/GAP9/Bindings.py
  • Deeploy/Targets/GAP9/DMA/L3Dma.py
  • Deeploy/Targets/GAP9/Parsers.py
  • Deeploy/Targets/GAP9/Platform.py
  • Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
  • Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
  • Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py
  • Deeploy/Targets/GAP9/Tiler.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Bindings.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Platforms/GAP9/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • TargetLibraries/GAP9/inc/ne16_utils.h
  • TargetLibraries/GAP9/src/ne16_utils.c
✅ Files skipped from review due to trivial changes (8)
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
  • DeeployTest/testMVP.py
  • DeeployTest/Platforms/GAP9/CMakeLists.txt
  • Deeploy/Targets/NE16/Tiler.py
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • Deeploy/Targets/NE16/Bindings.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/conftest.py
  • DeeployTest/CMakeLists.txt
  • CMakeLists.txt
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testUtils/platformMapping.py
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py

Comment on lines +62 to +80
# Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
bias_flat = C.values.flatten()
if bias_flat.size > Ko:
C.values = bias_flat[:Ko].copy()

# Signed input bias compensation
if input_signed:
# Compute w_sum BEFORE packing (needed for signed bias compensation)
w_sum = w_int8.astype(np.int64).sum(axis = 1) # [Ko]
bias_values = C.values.flatten().astype(np.int64)
if 'mul' in operatorRepresentation:
# RequantizedGemm: bias -= 128 * w_sum * scale
scale_buf = ctxt.lookup(operatorRepresentation['mul'])
scale_values = scale_buf.values.flatten().astype(np.int64)
bias_values -= 128 * w_sum * scale_values
else:
# Gemm int32: bias -= 128 * w_sum (no scale)
bias_values -= 128 * w_sum
C.values = bias_values.astype(np.int32)

⚠️ Potential issue | 🟠 Major

Bias handling only covers size > Ko, not broadcast/scalar cases.

The if bias_flat.size > Ko branch handles a broadcast-expanded bias of shape [M, O], but when bias_flat.size < Ko (scalar bias, or per-batch bias of size 1 that wasn't pre-broadcast) C.values is left untouched. The signed-input branch below then calls C.values.flatten().astype(np.int64) and computes bias_values -= 128 * w_sum * scale_values with mismatched shapes, silently broadcasting the short bias over w_sum[:Ko] and writing back a non-Ko array that the NE16 kernel will later read as [Ko] (OOB/garbage).

Consider explicitly normalizing bias to [Ko] up front (broadcast-fill or np.resize) and asserting the resulting shape, e.g.:

🛡️ Proposed fix
-        # Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
-        bias_flat = C.values.flatten()
-        if bias_flat.size > Ko:
-            C.values = bias_flat[:Ko].copy()
+        # Normalize bias to per-channel [Ko] for NE16
+        bias_flat = C.values.flatten()
+        if bias_flat.size == Ko:
+            C.values = bias_flat.copy()
+        elif bias_flat.size > Ko and bias_flat.size % Ko == 0:
+            C.values = bias_flat[:Ko].copy()
+        elif bias_flat.size == 1:
+            C.values = np.full((Ko,), bias_flat[0], dtype = bias_flat.dtype)
+        else:
+            raise ValueError(f"NE16 GEMM bias size {bias_flat.size} incompatible with Ko={Ko}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` around lines 62 - 80, The
bias handling must normalize C.values to a 1-D array of length Ko before any
signed-input compensation; instead of only truncating when bias_flat.size > Ko,
handle bias_flat.size < Ko by expanding/broadcasting (e.g., repeat or np.resize)
to length Ko and assert the resulting shape equals Ko, ensure dtype is
appropriate (int32/int64 as needed), then perform the signed-input compensation
using w_sum, scale_values (from operatorRepresentation['mul']) or the int32
path, and finally assign the corrected Ko-length array back to C.values; update
references around C.values, bias_flat, Ko, w_sum and operatorRepresentation so
no silent broadcasting or out-of-bounds write occurs.
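The normalization the fix proposes can be factored into a small helper; this is a sketch of the policy, not the template's actual code:

```python
import numpy as np

def normalize_bias(bias: np.ndarray, Ko: int) -> np.ndarray:
    # Normalize a bias constant to a 1-D per-channel array of length Ko.
    flat = bias.flatten()
    if flat.size == Ko:
        out = flat.copy()
    elif flat.size > Ko and flat.size % Ko == 0:
        out = flat[:Ko].copy()                             # broadcast-expanded [M, O]
    elif flat.size == 1:
        out = np.full((Ko,), flat[0], dtype = flat.dtype)  # scalar bias
    else:
        raise ValueError(f"bias size {flat.size} incompatible with Ko={Ko}")
    assert out.shape == (Ko,)
    return out
```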

Comment on lines +116 to +123
transA = operatorRepresentation['transA']
transB = operatorRepresentation['transB']

buffA = ctxt.lookup(operatorRepresentation['A'])
buffB = ctxt.lookup(operatorRepresentation['B'])

NSize = buffA.shape[-1]


⚠️ Potential issue | 🟠 Major

Make NSize transpose-aware.

Line 122 always uses buffA.shape[-1], but for transA == 1 the reduction dimension is buffA.shape[-2]. This will serialize wrong A/B tile shapes for transposed GEMM.

Proposed fix
-        NSize = buffA.shape[-1]
+        NSize = buffA.shape[-1 - transA]

Based on learnings, GEMM tiling serialization must use transpose flags to adjust NSize and matrix offset/shape calculations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
116 - 123, The code always sets NSize = buffA.shape[-1] which ignores
transposition; update the calculation to choose the reduction dimension based on
operatorRepresentation['transA'] (and consider transB if used elsewhere):
compute NSize = buffA.shape[-2] when transA == 1 else buffA.shape[-1], and
adjust any subsequent uses that compute A/B tile offsets or shapes so they use
this transpose-aware NSize (refer to variables transA, transB, buffA, buffB, and
NSize in NE16GEMMTileConstraint.py).
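The indexing trick behind the proposed one-liner: A is (M, N) for transA == 0 and (N, M) for transA == 1, so the reduction axis is shape[-1 - transA]. A minimal check:

```python
def reduction_dim(shape_a, transA):
    # A is (M, N) when transA == 0, (N, M) when transA == 1.
    return shape_a[-1 - transA]
```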

Comment on lines +131 to +167
        for cube in outputCubes:
            MOffset, OOffset = cube.offset[-2:]
            MSize, OSize = cube.dims[-2:]

            if len(cube.offset) > 2:
                BatchSize = math.prod(cube.dims[:-2])
            else:
                BatchSize = 1

            replacements["M"].append(MSize)
            replacements["O"].append(OSize)
            replacements["batch"].append(BatchSize)

            if transA == 0:
                AMatrixOffsets = (MOffset, 0)
                AMatrixShape = (MSize, NSize)
            else:
                AMatrixOffsets = (0, MOffset)
                AMatrixShape = (NSize, MSize)

            if len(buffA.shape) > 2:
                batchDimCount = len(buffA.shape) - 2
                AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
                AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape

            inputACubes.append(HyperRectangle(AMatrixOffsets, AMatrixShape))

            if transB == 0:
                BMatrixOffsets = (0, OOffset)
                BMatrixShape = (NSize, OSize)
            else:
                BMatrixOffsets = (OOffset, 0)
                BMatrixShape = (OSize, NSize)

            inputBCubes.append(HyperRectangle(BMatrixOffsets, BMatrixShape))

            RequantCube = HyperRectangle((OOffset,), (OSize,))

⚠️ Potential issue | 🟠 Major

Use absolute output offsets when deriving input tiles.

The loop uses cube.offset, but cube is only absoluteCube.rectangle; this drops AbsoluteHyperRectangle.absoluteOffset. Non-zero output tiles can then reload A/B/C/mul/scale slices from the wrong global offsets.

Proposed fix
-        for cube in outputCubes:
-            MOffset, OOffset = cube.offset[-2:]
+        for absoluteCube in absoluteOutputCubes:
+            cube = absoluteCube.rectangle
+            MOffset, OOffset = absoluteCube.absoluteOffset[-2:]
             MSize, OSize = cube.dims[-2:]
@@
             if len(buffA.shape) > 2:
                 batchDimCount = len(buffA.shape) - 2
-                AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
+                AMatrixOffsets = tuple(absoluteCube.absoluteOffset[:-2][-batchDimCount:]) + AMatrixOffsets
                 AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
131 - 167, The loop over outputCubes is using cube.offset (the local rectangle)
which ignores AbsoluteHyperRectangle.absoluteOffset, causing input tile offsets
to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).
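The local-vs-absolute offset mix-up is easy to reproduce with a toy example. `Rect` and `AbsRect` below are minimal stand-ins for Deeploy's `HyperRectangle` and `AbsoluteHyperRectangle`, not the real classes:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Rect:                          # stand-in for HyperRectangle
    offset: Tuple[int, ...]          # offset local to the current tiling level
    dims: Tuple[int, ...]

@dataclass
class AbsRect:                       # stand-in for AbsoluteHyperRectangle
    rectangle: Rect
    absoluteOffset: Tuple[int, ...]  # offset in the full output tensor

# Second M-tile of an (8, 16) output split into two (4, 16) tiles: locally
# it starts at row 0 of its parent slice, globally at row 4.
tile = AbsRect(Rect(offset = (0, 0), dims = (4, 16)), absoluteOffset = (4, 0))

MOffset_local = tile.rectangle.offset[-2]
MOffset_global = tile.absoluteOffset[-2]
assert MOffset_local == 0    # deriving A from this reloads rows 0..3 twice
assert MOffset_global == 4   # correct origin for the second tile's A slice
```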

Comment on lines +80 to +106
    if node.op == 'RequantizedGemm' and len(node.inputs) >= 4:
        mulTensor = node.inputs[3]
        biasTensor = node.inputs[2]

        if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
            mul_values = mulTensor.values.flatten().astype(np.int32)
            log2D = int(np.log2(node.attrs['div'].values))

            # Broadcast scalar mul to per-channel if needed
            if len(mul_values) == 1:
                mul_values = np.full(Ko, mul_values[0], dtype = np.int32)

            ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)

            # Rescale bias from mul/log2D domain to scale/scale_n domain
            # bias_merged is already *= mul from PULPGEMMRequantMergePass
            # NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
            bias_values = biasTensor.values.flatten().astype(np.int64)
            ne16_bias = np.zeros(Ko, dtype = np.int64)
            for ko in range(Ko):
                shift_diff = int(ne16_scale_n[ko]) - log2D
                if shift_diff >= 0:
                    ne16_bias[ko] = bias_values[ko] << shift_diff
                else:
                    ne16_bias[ko] = bias_values[ko] >> (-shift_diff)

            ne16_bias = ne16_bias.astype(np.int32)

⚠️ Potential issue | 🟠 Major

Validate requant constants before mutating the GEMM node.

This block can crash on a missing or non-scalar div attribute, a mul length other than 1 or Ko, or a bias shorter than Ko. Since the weight may already have been transposed above, a failure here leaves the graph partially rewritten.

Proposed guard
         if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
             mul_values = mulTensor.values.flatten().astype(np.int32)
-            log2D = int(np.log2(node.attrs['div'].values))
+            div_attr = node.attrs.get('div')
+            div_values = getattr(div_attr, "values", div_attr)
+            if div_values is None or np.size(div_values) != 1:
+                return graph
+            log2D = int(np.log2(np.asarray(div_values).item()))
 
             # Broadcast scalar mul to per-channel if needed
             if len(mul_values) == 1:
                 mul_values = np.full(Ko, mul_values[0], dtype = np.int32)
+            elif len(mul_values) != Ko:
+                return graph
 
             ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)
 
             # Rescale bias from mul/log2D domain to scale/scale_n domain
             # bias_merged is already *= mul from PULPGEMMRequantMergePass
             # NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
             bias_values = biasTensor.values.flatten().astype(np.int64)
+            if len(bias_values) == 1:
+                bias_values = np.full(Ko, bias_values[0], dtype = np.int64)
+            elif len(bias_values) < Ko:
+                return graph
+            else:
+                bias_values = bias_values[:Ko]
             ne16_bias = np.zeros(Ko, dtype = np.int64)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py` around lines 80 -
106, Ensure RequantizedGemm handling validates all requant constants before
mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.
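The bias rescale in the block above can be checked in isolation. The sketch below re-implements the per-channel shift (it is not Deeploy's actual helper) with made-up values:

```python
import numpy as np

# bias_ne16 = bias_merged * 2**(scale_n - log2D), realized as per-channel
# arithmetic shifts exactly as in the pass above (values are illustrative).
def rescale_bias(bias_values: np.ndarray, scale_n: np.ndarray, log2D: int) -> np.ndarray:
    bias_values = bias_values.astype(np.int64)
    out = np.zeros_like(bias_values)
    for ko in range(len(bias_values)):
        shift_diff = int(scale_n[ko]) - log2D
        if shift_diff >= 0:
            out[ko] = bias_values[ko] << shift_diff
        else:
            out[ko] = bias_values[ko] >> (-shift_diff)   # arithmetic shift
    return out.astype(np.int32)

bias = np.array([100, -64], dtype = np.int64)
scale_n = np.array([5, 1])
# Channel 0: 100 << (5 - 3) = 400; channel 1: -64 >> (3 - 1) = -16.
assert list(rescale_bias(bias, scale_n, log2D = 3)) == [400, -16]
```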

Comment on lines +1200 to +1221
    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()

⚠️ Potential issue | 🟠 Major

Guard the identity bypass against output and dtype changes.

The fast path only checks the ratio and zero points before removing both nodes. If the Quant output is a graph output, this leaves the declared output tensor without a producer. It also skips narrowing/clamping when the Dequant and Quant bit widths differ, e.g. an int16 tensor dequantized and then re-quantized to int8 with identity scale.

🐛 Proposed guard
     signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
     signed = bool(signed_val)
     bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
+    dequant_signed_val = int(np.array(dequant_node.attrs.get('signed', signed_val)).item())
+    dequant_signed = bool(dequant_signed_val)
+    dequant_bit_width = int(np.array(dequant_node.attrs.get('bit_width', bit_width)).item())
     n_levels = 2**bit_width
@@
-    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
+    if (abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and
+            signed == dequant_signed and bit_width == dequant_bit_width):
         input_tensor = dequant_node.inputs[0]
         output_tensor = quant_node.outputs[0]
+        if output_tensor in graph.outputs:
+            return graph
         for downstream_node in list(output_tensor.outputs):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()

    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    dequant_signed_val = int(np.array(dequant_node.attrs.get('signed', signed_val)).item())
    dequant_signed = bool(dequant_signed_val)
    dequant_bit_width = int(np.array(dequant_node.attrs.get('bit_width', bit_width)).item())
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if (abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and
            signed == dequant_signed and bit_width == dequant_bit_width):
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        if output_tensor in graph.outputs:
            return graph
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py` around lines
1200 - 1221, The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
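The guard conditions reduce to a single predicate that can be tested without the graph machinery. `can_fold` below is a hedged sketch with illustrative names, not Deeploy's API:

```python
# Only fold Dequant -> Quant when the round trip is exact AND
# type-preserving AND the Quant output is not a declared graph output.
def can_fold(s_d, s_q, zp_d, zp_q,
             dq_bits, q_bits, dq_signed, q_signed,
             quant_out_is_graph_output, eps = 1e-6):
    return (abs(s_d * s_q - 1.0) < eps
            and abs(zp_d) < eps and abs(zp_q) < eps
            and dq_bits == q_bits and dq_signed == q_signed
            and not quant_out_is_graph_output)

# int8 -> float -> int8 round trip with s_d * s_q == 1: safe to fold.
assert can_fold(0.5, 2.0, 0, 0, 8, 8, True, True, False)
# int16 -> int8 narrowing with identity scale: the Quant clamp must stay.
assert not can_fold(0.5, 2.0, 0, 0, 16, 8, True, True, False)
# Quant output is a declared graph output: folding would orphan it.
assert not can_fold(1.0, 1.0, 0, 0, 8, 8, True, True, True)
```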

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

fd -t f . Deeploy/Targets/NE16/Templates/ --max-depth 2

Repository: pulp-platform/Deeploy

Length of output: 205


🏁 Script executed:

ls -la Deeploy/Targets/NE16/Templates/

Repository: pulp-platform/Deeploy

Length of output: 376


🏁 Script executed:

cat Deeploy/Targets/NE16/Templates/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 189


🏁 Script executed:

cat -n Deeploy/Targets/NE16/Templates/AllocateTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 924


🏁 Script executed:

head -50 Deeploy/Targets/NE16/Templates/ConvTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 1540


🏁 Script executed:

rg '__all__' Deeploy/Targets/NE16/Templates/

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from Deeploy\.Targets\.NE16\.Templates import' --type py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from \.\.Templates import' Deeploy/Targets/NE16 --type py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from.*AllocateTemplate|from.*ConvTemplate|import.*AllocateTemplate|import.*ConvTemplate' --type py

Repository: pulp-platform/Deeploy

Length of output: 3880


🏁 Script executed:

rg 'ne16GenericGlobalInitTemplate|NE16ConvTemplate' --type py

Repository: pulp-platform/Deeploy

Length of output: 531


Export the actual template modules.

This self-referential wildcard import does not load sibling files like the NE16 allocation/conv templates, so package-level imports will not see those symbols and Ruff reports F403.

Proposed direction
-from . import *
+from .AllocateTemplate import *
+from .ConvTemplate import *
📝 Committable suggestion


Suggested change
from . import *
from .AllocateTemplate import *
from .ConvTemplate import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/__init__.py` at line 5, The package
__init__.py currently uses a self-referential wildcard import ("from . import
*") which doesn't actually import sibling modules (e.g., the NE16 allocation and
conv template modules) and triggers Ruff F403; replace the wildcard with
explicit imports of the template modules (or explicit from .module import
<symbols>) and/or populate __all__ with the actual module or symbol names (for
example add explicit references to the NE16 allocation and conv modules) so
package-level imports expose the intended templates. Ensure you reference and
import the specific module names rather than relying on "from . import *".
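The failure mode is reproducible without the repository: a self-referential `from . import *` copies only the package module's own public attributes and never loads siblings. The package and module names below are fakes registered in `sys.modules` purely for illustration:

```python
import sys
import types

# Register a fake package and a sibling module (names are illustrative).
pkg = types.ModuleType("fakepkg")
pkg.__path__ = []                                  # mark it as a package
sys.modules["fakepkg"] = pkg

conv = types.ModuleType("fakepkg.ConvTemplate")
conv.NE16ConvTemplate = type("NE16ConvTemplate", (), {})
sys.modules["fakepkg.ConvTemplate"] = conv

ns = {}
exec("from fakepkg import *", ns)                  # self-referential wildcard
assert "NE16ConvTemplate" not in ns                # sibling was never loaded

exec("from fakepkg.ConvTemplate import *", ns)     # explicit sibling import
assert "NE16ConvTemplate" in ns                    # now the class is visible
```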

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

#!/bin/bash
# List files in the TileConstraints directory
find Deeploy/Targets/NE16/TileConstraints -type f -name "*.py" | head -20

Repository: pulp-platform/Deeploy

Length of output: 356


🏁 Script executed:

#!/bin/bash
# Check the content of the __init__.py file and surrounding files
ls -la Deeploy/Targets/NE16/TileConstraints/

Repository: pulp-platform/Deeploy

Length of output: 535


🏁 Script executed:

#!/bin/bash
# Check what's exported from the constraint modules
cat -n Deeploy/Targets/NE16/TileConstraints/*.py | head -100

Repository: pulp-platform/Deeploy

Length of output: 6259


🏁 Script executed:

#!/bin/bash
# View the actual __init__.py file content
cat -n Deeploy/Targets/NE16/TileConstraints/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

#!/bin/bash
# Check if constraint modules define __all__
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '__all__' "$file" || echo "No __all__ found"
done

Repository: pulp-platform/Deeploy

Length of output: 327


🏁 Script executed:

#!/bin/bash
# Check what's actually exported from constraint modules using grep for class definitions
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '^class ' "$file"
done

Repository: pulp-platform/Deeploy

Length of output: 651


Replace the self-import with explicit constraint exports.

from . import * imports from the current package namespace, not from sibling modules. Since the constraint modules are never imported, their classes are not exposed. Import the concrete constraint modules instead.

Proposed fix
-from . import *
+from .NE16DenseConstraint import *
+from .NE16DepthwiseConstraint import *
+from .NE16PointwiseConstraint import *
📝 Committable suggestion


Suggested change
from . import *
from .NE16DenseConstraint import *
from .NE16DepthwiseConstraint import *
from .NE16PointwiseConstraint import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
package wildcard import in __init__.py (currently "from . import *") with
explicit relative imports of the concrete constraint modules and/or classes so
sibling modules are actually loaded and their classes are exposed; e.g., use
"from .constraint_module import ConstraintClass" for each constraint module, and
populate an __all__ list with the exported class names (or import the modules
with "from . import constraint_module" if you prefer module exports) so the
concrete constraint classes are available to consumers.

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

#!/bin/bash
# List files in TopologyOptimizationPasses directory
ls -la Deeploy/Targets/NE16/TopologyOptimizationPasses/

Repository: pulp-platform/Deeploy

Length of output: 300


🏁 Script executed:

#!/bin/bash
# Check if Passes.py exists and show its content
if [ -f "Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py" ]; then
  wc -l Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  head -50 Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
fi

Repository: pulp-platform/Deeploy

Length of output: 1989


🏁 Script executed:

#!/bin/bash
# Check __init__.py for __all__ definition
cat -n Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

#!/bin/bash
# Check for __all__ definition in Passes.py
rg -n '__all__' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Find class definitions in Passes.py to identify the pass classes
rg -n '^class ' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py

Repository: pulp-platform/Deeploy

Length of output: 327


Import Passes explicitly.

from . import * does not re-export the pass classes from Passes.py; consumers importing from Deeploy.Targets.NE16.TopologyOptimizationPasses cannot access the four pass classes (NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass, ConvEngineDiscolorationPass, NE16OptimizationPass) without an explicit import.

Proposed fix
-from . import *
+from .Passes import *
📝 Committable suggestion


Suggested change
from . import *
from .Passes import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py` at line 5, The
package __init__.py currently uses a wildcard import which doesn't re-export
classes from Passes.py; update
Deeploy.Targets.NE16.TopologyOptimizationPasses.__init__ to explicitly import
and re-export the four pass classes by adding an explicit import from .Passes of
NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass,
ConvEngineDiscolorationPass, and NE16OptimizationPass and ensure they are
exposed (e.g., in __all__) so consumers can import them directly from the
package.

Comment on lines +1008 to +1010
test_name, l1, config_name = test_params

ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python -m ruff check DeeployTest/test_platforms.py --select RUF059,RUF005

Repository: pulp-platform/Deeploy

Length of output: 25561


Clean up the Ruff warnings in the three new GAP9_w_NE16 tiled tests.

The unpacked config_name variable is never used, and Ruff prefers iterable unpacking over list concatenation. Apply these changes at lines 1008-1010, 1043-1045, and 1076-1078:

Proposed fix
-    test_name, l1, config_name = test_params
+    test_name, l1, _config_name = test_params
 
-    ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
+    ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
📝 Committable suggestion


Suggested change
test_name, l1, config_name = test_params
ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
test_name, l1, _config_name = test_params
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 1008-1008: Unpacked variable config_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 1010-1010: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/test_platforms.py` around lines 1008 - 1010, Unpack test_params
using a throwaway name for the unused config_name (e.g., test_name, l1, _ =
test_params) and build the cmake args using iterable unpacking instead of list
concatenation (e.g., ne16_cmake_args = [*cmake_args,
f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]); apply the same two edits for the
other two occurrences that create ne16_cmake_args in the GAP9_w_NE16 tiled tests
so Ruff warnings for unused variables and list concatenation are resolved.

Comment on lines +11 to +12
#include "CNN_BasicKernels_fp32.h"
#include "pmsis.h"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether the missing header exists and where ne16_utils.h is included.
fd -a 'CNN_BasicKernels_fp32\.h$' .
rg -n '#include\s+"CNN_BasicKernels_fp32\.h"|#include\s+"ne16_utils\.h"'

Repository: pulp-platform/Deeploy

Length of output: 204


🏁 Script executed:

#!/bin/bash
# Read the header file to understand what's actually needed
echo "=== ne16_utils.h content ===" 
cat -n TargetLibraries/GAP9/inc/ne16_utils.h

echo ""
echo "=== ne16_utils.c content ===" 
cat -n TargetLibraries/GAP9/src/ne16_utils.c

Repository: pulp-platform/Deeploy

Length of output: 1945


Move implementation dependencies out of the public header.

ne16_utils.h only needs fixed-width integer types for its public struct, but it includes CNN_BasicKernels_fp32.h (which does not exist in the repository) and pmsis.h (which is only used in the implementation). Unavailable and implementation-only includes in a public header break every file that includes it.

Replace the includes in the header with <stdint.h>, and add #include "pmsis.h" to ne16_utils.c where it is actually needed (for gap_coreid(), gap_ncore(), and v4s vector types).

Proposed fix

In TargetLibraries/GAP9/inc/ne16_utils.h:

-#include "CNN_BasicKernels_fp32.h"
-#include "pmsis.h"
+#include <stdint.h>

In TargetLibraries/GAP9/src/ne16_utils.c:

 #include "ne16_utils.h"
+#include "pmsis.h"
📝 Committable suggestion


Suggested change
#include "CNN_BasicKernels_fp32.h"
#include "pmsis.h"
#include <stdint.h>
🧰 Tools
🪛 Clang (14.0.6)

[error] 11-11: 'CNN_BasicKernels_fp32.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TargetLibraries/GAP9/inc/ne16_utils.h` around lines 11 - 12, The public
header ne16_utils.h currently pulls in implementation-only headers; replace the
includes of "CNN_BasicKernels_fp32.h" and "pmsis.h" with the standard <stdint.h>
so the public struct uses only fixed-width integer types, and remove the
unavailable dependency. Then add #include "pmsis.h" into ne16_utils.c (where
gap_coreid(), gap_ncore(), and the v4s vector type are used) so implementation
functions in ne16_utils.c can access those symbols.

The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9
SDK autotiler CNN_Libraries_HWPE directory), but this path was missing
from the cmake include directories, causing build failures on plain GAP9.

KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK but
the template was using (v4s){0,0,0,0} (signed), causing a compilation
error on GCC with -Werror.
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py (1)

41-41: Match the NodeTemplate.alignToContext return contract.

The override returns operatorRepresentation, so the annotation should use OperatorRepresentation instead of Dict.

♻️ Proposed annotation cleanup
-from typing import Dict, List, Tuple
+from typing import List, Tuple
@@
     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation
+                       ) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` at line 41, The return
type annotation for NodeTemplate.alignToContext is incorrect: the method
currently declares Tuple[NetworkContext, Dict, List[str]] but the override
actually returns operatorRepresentation, so update the second generic from Dict
to OperatorRepresentation; modify the signature of alignToContext to return
Tuple[NetworkContext, OperatorRepresentation, List[str]] and ensure any related
type imports/usages reference OperatorRepresentation (and update any docstrings
or comments mentioning the old Dict type).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Line 60: The unpacking assigns Ko, Ki = w_int8.shape but Ki is never used,
causing a Ruff unused-variable warning; change the unpack to use a dummy name
(e.g., Ko, _ = w_int8.shape or Ko, _Ki = w_int8.shape) so the intent is clear
and the linter is satisfied, leaving any other code that references Ko
unchanged.
- Around line 93-102: The in-place int8->uint8 offsetting mutates the input
buffer `${A}` via ne16_int8_to_uint8 and corrupts aliased or reused inputs;
instead allocate a hoisted transient uint8 buffer sized `${batch} * ${M} *
${N}`, populate a ne16_int8_to_uint8_T that sets .In = (int8_t *)${A} and .Out =
(uint8_t *)transient_buf, call pi_cl_team_fork(NUM_CORES, (void
*)ne16_int8_to_uint8, &that_arg), and use the transient buffer for downstream
NE16 calls; alternatively, if you must keep the in-place path, add a clear
comment and enforce single-consumer/non-aliased binding for `${A}` so callers
know mutation will occur.
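The aliasing hazard behind the second comment is easy to demonstrate in NumPy. This is a sketch only; the real kernel performs the offset in C on an L1 buffer:

```python
import numpy as np

# Why the in-place int8 -> uint8 offset is unsafe for aliased inputs: the
# +128 offset (a modulo-256 wrap, i.e. x ^ 0x80) mutates the shared buffer,
# so any second consumer of A sees corrupted values.
A = np.array([-128, -1, 0, 127], dtype = np.int8)
aliased_view = A                       # e.g. another node reading the same tensor

u = A.view(np.uint8)
u += 128                               # in-place conversion, wraps modulo 256
assert list(u) == [0, 127, 128, 255]   # correct unsigned operands for NE16

# ...but the aliased int8 input has been destroyed:
assert list(aliased_view) == [0, 127, -128, -1]
```

Writing the converted values into a separate transient buffer, as the comment suggests, avoids the mutation entirely at the cost of one extra L1 allocation.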

---

Nitpick comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Line 41: The return type annotation for NodeTemplate.alignToContext is
incorrect: the method currently declares Tuple[NetworkContext, Dict, List[str]]
but the override actually returns operatorRepresentation, so update the second
generic from Dict to OperatorRepresentation; modify the signature of
alignToContext to return Tuple[NetworkContext, OperatorRepresentation,
List[str]] and ensure any related type imports/usages reference
OperatorRepresentation (and update any docstrings or comments mentioning the old
Dict type).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba38bca8-0085-4560-bbc7-d05070820224

📥 Commits

Reviewing files that changed from the base of the PR and between 11a641b and 6a3793a.

📒 Files selected for processing (2)
  • Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
  • TargetLibraries/GAP9/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • TargetLibraries/GAP9/CMakeLists.txt


        # Weight packing and signed bias compensation
        w_int8 = B.values.astype(np.int8)  # [Ko, Ki], still int8 at this point
        Ko, Ki = w_int8.shape

⚠️ Potential issue | 🟡 Minor

Silence the Ruff unused-variable warning.

Ki is unpacked but never used; use a dummy name.

🧹 Proposed lint fix
-        Ko, Ki = w_int8.shape
+        Ko, _ = w_int8.shape
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 60-60: Unpacked variable Ki is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` at line 60, the
unpacking assigns Ko, Ki = w_int8.shape but Ki is never used, causing a Ruff
unused-variable warning; change the unpack to use a dummy name (e.g., Ko, _ =
w_int8.shape or Ko, _Ki = w_int8.shape) so the intent is clear and the linter is
satisfied, leaving any other code that references Ko unchanged.

Comment on lines +93 to +102
% if input_signed:
// Signed input: add 128 offset to convert int8 -> uint8 (multi-core SIMD)
{
ne16_int8_to_uint8_T _offset_arg = {
.In = (int8_t *)${A},
.Out = (uint8_t *)${A},
.size = ${batch} * ${M} * ${N}
};
pi_cl_team_fork(NUM_CORES, (void *)ne16_int8_to_uint8, &_offset_arg);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against pulp-platform/Deeploy:

  cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | head -200
  cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | tail -50

  # Search for references to NE16GEMMTemplate or similar buffer management patterns
  rg "NE16GEMM|buffer.*signed|input_signed" --max-count=20

  cat -n Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py

  # Check PULPOpen ConvTemplate for signed input handling
  rg -A 15 "if input_signed:" Deeploy/Targets/PULPOpen/Templates/ConvTemplate.py | head -100

  # Search for ne16_int8_to_uint8 references to understand the conversion pattern
  rg "ne16_int8_to_uint8" --max-count=20

  cat -n TargetLibraries/GAP9/src/ne16_utils.c


Avoid mutating ${A} during signed-input offsetting.

The in-place conversion at lines 93–102 and 168–177 corrupts ${A} permanently for any downstream consumer, skip connection, graph input reuse, or aliased buffer. The ne16_int8_to_uint8 utility mutates the buffer in-place, and no tile constraint or binding enforces single-consumer or non-aliased input guarantees. Use a hoisted transient uint8 input buffer for the offset path, or explicitly document and enforce that this binding only accepts single-consumer, non-aliased activations.

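The aliasing hazard the reviewer describes can be shown in a few lines of numpy standing in for the C buffers; `x`, `alias`, and `transient` are illustrative names, not Deeploy symbols:

```python
import numpy as np

x = np.array([-3, 0, 5], dtype=np.int8)   # shared activation buffer
alias = x                                 # a second consumer of the same memory

# In-place offsetting (what the template does today): the alias sees the mutation
inplace = x.view(np.uint8)
inplace += 128                            # int8 -> uint8, wrapping modulo 256
assert alias[0] != -3                     # original int8 values are gone for everyone

# Transient-buffer variant: the source stays intact for other consumers
y = np.array([-3, 0, 5], dtype=np.int8)
transient = (y.astype(np.int16) + 128).astype(np.uint8)
assert y[0] == -3 and transient[0] == 125
```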

Runs each NE16 conv kernel with --profileTiling after the normal test
suite to collect cycle counts from gvsoc.

profileTiling generates code calling getCycles() in Network.c, but the
header declaring it was not included. Add CycleCounter.h to both GAP9
and NE16 platform include lists, and expose the GAP9 inc/ directory to
the network target so the header is found at compile time.

GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The
profiling step needs a clean rebuild with LTO disabled to avoid
conflicts with the cached LTO-enabled build from the test step.

The --enable-3x3 flag was parsed by the runner script but never passed
to generateNetwork.py, so NE16Engine.enable3x3 was always False. DW 3x3
and Dense 3x3 convolutions silently fell back to the PULP cluster
instead of dispatching to NE16. Add the flag and set it on the engine.

The deeployRunner parsed --enable-3x3 but never forwarded it to
generateNetwork.py's gen_args, so NE16Engine.enable3x3 stayed False
and DW/Dense 3x3 convs silently fell back to the cluster.

Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly
benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8)
is too small — NE16 dispatch overhead dominates at only 12.8%
utilization. Also wire --enable-3x3 through deeployRunner gen_args.
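The 75M-ops figure checks out if 64×64×32×32 is read as Cin×Cout×H×W with a 3×3 kernel and 2 ops per MAC (the dimension order is an assumption here):

```python
# Sanity check of the "75M ops" figure for the Dense 3x3 RQ Conv test case.
cin, cout, h, w, k = 64, 64, 32, 32, 3
ops = 2 * cin * cout * h * w * k * k   # 2 ops (multiply + add) per MAC
print(ops)                             # 75_497_472, i.e. ~75M
```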
