
[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9#183

Open
runwangdl wants to merge 14 commits into pulp-platform:devel from runwangdl:gap9-ne16

Conversation


@runwangdl runwangdl commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform.py (NE16Platform extends GAP9Platform, engines=[NE16Engine, GAP9ClusterEngine]), Engine.py, Bindings.py, Parsers.py, Tiler.py, Deployer.py (extends GAP9Deployer to reuse ClDma transformers), Templates/{Allocate,Conv}Template.py, TileConstraints/{NE16Pointwise,NE16Depthwise,NE16Dense}Constraint.py, TopologyOptimizationPasses/Passes.py (incl. _weightEncode ported from pulp-nnx/test/Ne16Weight.py).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py and DeeployTest/test_gap9_ne16_tiled_config.py — new runner + minimal test config covering PW 1x1, DW 3x3, Dense 3x3 RQ Conv.
  • TargetLibraries/GAP9/CMakeLists.txt — for GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, and mapDeployer.
  • DeeployTest/testMVP.py — include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it, NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks).
  • DeeployTest/testUtils/core/execution.py — build the GAP9 SDK image target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — swap Neureka-inherited subtile constants for NE16 spec per ne16_task_defs.h: CIN_SUBTILE 32/28 → 16 (single mode), output 63, per-(cout,cinMajor) weight bytes PW 16, DW/Dense 144.

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
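The local-alias workaround described above can be sketched as follows. PointerClass and float32_t are the Deeploy names mentioned in the bullet; the two definitions below are minimal stand-ins added here so the sketch runs on its own, not Deeploy's actual implementations.

```python
# Hedged sketch of the workaround: instead of importing float32_tPtr
# (which raises ImportError), derive it locally from float32_t via
# PointerClass. The stand-ins below only mimic the real API shape.

class float32_t:  # stand-in for Deeploy's float32_t immediate type
    typeName = "float32_t"

def PointerClass(baseType):
    # Stand-in for Deeploy's PointerClass factory: build a pointer type
    # that remembers the type it points to.
    return type(baseType.typeName + "Ptr", (), {
        "referencedType": baseType,
        "typeName": baseType.typeName + " *",
    })

# The template-local alias replacing the broken import:
float32_tPtr = PointerClass(float32_t)
```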

Test plan

Ran on gvsoc gap9.evk (inside ghcr.io/pulp-platform/deeploy-gap9:devel), exercising the full pipeline: gen -> parse -> lower -> codegen -> CMake -> build -> gapy image -> gvsoc flash run:

| Test | Errors | Runtime (cycles) |
| --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 0 / 1152 | 901 917 |
| Kernels/Integer/Conv/DW_2D_RQ (--enable-3x3) | 0 / 1280 | 27 339 |
| Kernels/Integer/Conv/Regular_2D_RQ (--enable-3x3) | 0 / 6372 | 244 595 |

Follow-up (out of scope for this PR): PW Unsigned_RQ variant currently produces non-zero errors — likely unsigned weight-offset or out_type conf0 handling; tracked separately.

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.


coderabbitai Bot commented Apr 13, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds NE16 accelerator support and a new platform variant GAP9_w_NE16: new NE16 engine/platform/deployer, parsers, templates, bindings, tiler constraints, topology/optimization passes, CMake/SDK integration, tests, and CI workflows enabling tiled NE16-enabled execution.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Top-level build & CI | CMakeLists.txt, DeeployTest/CMakeLists.txt, TargetLibraries/GAP9/CMakeLists.txt, .github/workflows/.../*.yml | Add GAP9_w_NE16 platform option and build branch, enable USE_NE16 and pulp-nnx linkage, adjust GAP9 targets and C flags, and add NE16 CI workflows and runner. |
| NE16 package & exports | Deeploy/Targets/NE16/__init__.py, Deeploy/Targets/NE16/Templates/__init__.py, Deeploy/Targets/NE16/TileConstraints/__init__.py, Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py, Deeploy/Targets/NE16/OptimizationPasses/__init__.py | New NE16 package and subpackage initializers that wildcard re-export submodule symbols. |
| Engine / Platform / Deployer | Deeploy/Targets/NE16/Engine.py, Deeploy/Targets/NE16/Platform.py, Deeploy/Targets/NE16/Deployer.py | Add NE16Engine, NE16Platform/memory wrappers, NE16Deployer, NE16 mapping/optimizer, engine feature flags, and lowering-pass adjustments. |
| Parsers | Deeploy/Targets/NE16/Parsers.py, Deeploy/Targets/GAP9/Parsers.py | Add NE16-specific Conv2D parsers (base + PW/DW/Dense and RQ variants) and NE16GEMMParser for RequantizedGemm. |
| Bindings / Templates / Tiler-ready bindings | Deeploy/Targets/NE16/Bindings.py, Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py, Deeploy/Targets/NE16/Templates/..., Deeploy/Targets/NE16/Tiler.py, Deeploy/Targets/GAP9/Tiler.py | Add NE16 NodeBindings, GAP9 NE16 GEMM templates, NE16 convolution templates (init+exec), and expose tiling-ready bindings for PW/DW/Dense (RQ/non-RQ) and GEMM. |
| Tile constraints & requant helpers | Deeploy/Targets/NE16/TileConstraints/*, Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py | Add pointwise/depthwise/dense tiling constraints (and RQ variants), requant helper utilities, and NE16 GEMM tile constraint with serialization. |
| Topology & lowering optimization passes | Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py, Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py | Add weight encoding/layout passes, NE16 reshape/merge passes, GEMM weight adjustment pass, Dequant/Quant merge pass, and a memory-level weight annotation pass. |
| Type checkers and generic bindings | Deeploy/Targets/Generic/TypeCheckers.py, Deeploy/Targets/Generic/Bindings.py, Deeploy/Targets/PULPOpen/Bindings.py, Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py | Add QuantChecker output-type signedness check, extend quant/dequant bindings (add uint8 variants), and add GAP9 SDK quant/dequant templates. |
| Optimization/SDK helpers & libs | TargetLibraries/GAP9/inc/ne16_utils.h, TargetLibraries/GAP9/src/ne16_utils.c, Deeploy/Targets/GAP9/DMA/L3Dma.py | Add NE16 utility int8→uint8 kernel and adjust L3 DMA template casts; remove old gap9L3DmaHack in favor of GAP9L3Dma. |
| Tests & test tooling | DeeployTest/*, DeeployTest/test_gap9_ne16_tiled_config.py, DeeployTest/test_platforms.py, DeeployTest/conftest.py, DeeployTest/testUtils/* | Add NE16 test configs, runner script, pytest marker gap9_w_ne16_tiled, platform mapping to NE16 platform/deployer, test cases, and CMake/test build adjustments. |
| Misc small changes | Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py, Deeploy/Targets/GAP9/Bindings.py, Deeploy/Targets/GAP9/Tiler.py, Deeploy/Targets/GAP9/DMA/L3Dma.py | Minor pointer-type alias change and various GAP9 binding/tiler updates to integrate NE16 GEMM, SDK quant/dequant templates, and L3 DMA template refinements. |

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Graph as gs.Graph
    participant Parser as NE16 Parser
    participant Optimizer as TopologyOptimizer / Passes
    participant Tiler as Tiler / TileConstraints
    participant Templates as Conv/GEMM Templates
    participant Deployer as NE16Deployer / Engine
    participant Build as CMake / pulp-nnx

    Graph->>Parser: parse nodes (Conv/Gemm) with NE16 constraints & RQ fields
    Parser->>Optimizer: emit matches for NE16 topology passes (weight/layout, reshape)
    Optimizer->>Templates: request weight encoding and template alignment
    Templates->>Tiler: provide tiling metadata (counters, strides, conf0)
    Tiler->>Deployer: produce VariableReplacementScheme & TilingSchedule
    Deployer->>Build: integrate generated sources, link pulp-nnx, produce image

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • Xeratec
  • Victor-Jung
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR is linked to issue #39 regarding dynamic I/O tensor allocation, but the changeset focuses entirely on adding NE16 accelerator support without addressing any I/O tensor allocation requirements from that issue. | Either link this PR to a different issue aligned with NE16 platform support, or address I/O tensor allocation objectives if that was the intended scope. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.61%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9' clearly and specifically describes the main change: adding NE16 as an accelerator on the GAP9 platform. |
| Description check | ✅ Passed | The PR description provides detailed, relevant context about the NE16 integration: architecture overview, new components added, changes made, and a test plan with actual cycle counts. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to adding the NE16 platform and its integration with GAP9; no unrelated refactoring or extraneous modifications were introduced. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (4)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Replace self-referential wildcard import with explicit exports.

from . import * on Line 5 makes exported symbols unclear. Replace with explicit re-exports plus __all__ to clarify the public API.

Proposed fix
-from . import *
+from .NE16DenseConstraint import NE16DenseConv2DTileConstraint, NE16RQSDenseConv2DTileConstraint
+from .NE16DepthwiseConstraint import NE16DWConv2DTileConstraint, NE16RQSDWConv2DTileConstraint
+from .NE16PointwiseConstraint import NE16PWConv2DTileConstraint, NE16RQSPWConv2DTileConstraint
+
+__all__ = [
+    "NE16DenseConv2DTileConstraint",
+    "NE16RQSDenseConv2DTileConstraint",
+    "NE16DWConv2DTileConstraint",
+    "NE16RQSDWConv2DTileConstraint",
+    "NE16PWConv2DTileConstraint",
+    "NE16RQSPWConv2DTileConstraint",
+]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
self-referential wildcard import in __init__.py with explicit re-exports: remove
"from . import *" and import the specific public symbols (e.g., classes,
functions, constants) from the package submodules using explicit imports (for
example: from .submodule import ClassX, func_y), then declare __all__ =
["ClassX", "func_y", ...] listing those exported names; ensure you reference the
actual symbol names defined in the package (replace placeholders with the real
identifiers) and avoid introducing circular imports when reorganizing the
imports.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

The BatchOffset, HOffset, WOffset, and BatchSize variables are unpacked for structural clarity but never used. Prefix them with _ to indicate intentional disuse.

🔧 Proposed fix
         for cube in outputCubes:
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims

Similarly at line 173:

-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint.py currently bind unused
names BatchOffset, HOffset, WOffset, COffset and BatchSize (and similarly at the
later unpacking around line 173); update those unpackings to prefix unused
variables with an underscore (e.g., _BatchOffset, _HOffset, _WOffset,
_BatchSize) so the intent of intentional non-use is clear; locate the unpacking
expressions that assign to cube.offset and cube.dims and rename only the unused
identifiers accordingly while keeping used names (COffset, HSize, WSize, CSize)
unchanged.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

125-136: Use integer division to avoid floating-point precision issues.

Line 131 uses float division (/), which can cause precision loss for large integers. Use integer division (//) instead.

🔧 Proposed fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
     for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX = tmpX // i

     if x // math.prod(multiplicands) > 1:
         multiplicands.append(x // math.prod(multiplicands))

     return multiplicands
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 125 -
136, In _findAllMultiplicands, replace the floating-point division on tmpX (tmpX
= tmpX / i) with integer division (tmpX = tmpX // i) and ensure tmpX remains an
int so subsequent modulus and math.prod operations are integer-safe; update any
related expressions that rely on tmpX's type (e.g., the while loop condition
tmpX % i == 0 and the final math.prod usage) to work with the integer result.
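A runnable version of the helper with the integer-division fix applied (the function name is snake_cased here for a standalone sketch; the pass's actual helper is _findAllMultiplicands):

```python
import math

def find_all_multiplicands(x: int) -> list:
    # Mirrors _findAllMultiplicands, with tmpX // i so the running value
    # stays an int and no floating-point precision is lost for large x.
    multiplicands = []
    tmp = x
    for i in range(2, math.ceil(math.sqrt(x))):
        while tmp % i == 0:
            multiplicands.append(i)
            tmp = tmp // i  # integer division: the fix
    if x // math.prod(multiplicands) > 1:
        multiplicands.append(x // math.prod(multiplicands))
    return multiplicands
```

For example, find_all_multiplicands(12) yields [2, 2, 3] and a prime such as 7 yields [7].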
Deeploy/Targets/NE16/Parsers.py (1)

55-62: Use not channels_first instead of == False.

Per PEP 8 and Ruff E712, avoid direct comparisons to False.

🔧 Proposed fix
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
                 # LMACAN: weight shape should be equal to 3 because we have to do the ne16's
                 #         special weight encoding
                 len(weight.shape) == 3,
         ]):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 55 - 62, In the conditional
inside Parsers.py (the if checking channels_first, data_in.shape, weight.shape)
replace the direct boolean comparison channels_first == False with the idiomatic
not channels_first (i.e., use not channels_first in the all([...]) list) so the
condition follows PEP8/Ruff E712; adjust the list element in the same if that
currently reads channels_first == False to read not channels_first.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: The __init__.py currently uses explicit imports but omitted the Tiler
module, causing Engine.py's internal imports to fail; update
Deeploy.Targets.NE16.__init__.py to explicitly import or expose the Tiler module
(the Tiler.py symbol) alongside the other exported modules so it is available
when the package is imported, ensuring Engine.py's references to Tiler resolve
correctly (alternatively, use a package import loop such as using
pkgutil.iter_modules to dynamically import all submodules if you prefer
automatic inclusion).

In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor currently uses a mutable default
inputOffsets = {} which causes shared state; change the signature for the
relevant classes (SignPropDeployer.__init__ and GAP9Deployer.__init__) to use
inputOffsets=None and inside each __init__ set inputOffsets = {} if inputOffsets
is None before passing it to super().__init__ (or before any mutations), so each
instance gets its own dict and avoids state leakage when SignPropDeployer
populates the keys.
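The None-default idiom suggested above, as a minimal self-contained sketch (the class and attribute names are illustrative, not Deeploy's actual signatures):

```python
class DeployerSketch:
    # Illustrative stand-in for the SignPropDeployer/GAP9Deployer chain:
    # a None default plus an in-body check gives each instance its own dict,
    # unlike a mutable `inputOffsets = {}` default shared across calls.
    def __init__(self, inputOffsets = None):
        if inputOffsets is None:
            inputOffsets = {}  # fresh dict per instantiation
        self.inputOffsets = inputOffsets

a = DeployerSketch()
b = DeployerSketch()
a.inputOffsets["input_0"] = -128  # populating a's offsets...
# ...does not leak into b, because b got its own dict
```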

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-81: The conv-classification is too loose: tighten isPWConv to
reject grouped 1x1 convolutions by adding a check that node.attrs['group'] == 1,
and tighten isDWConv to only accept true depthwise convolutions by inspecting
the conv weight tensor (node.inputs[1].values) and requiring the per-group
in_channels == 1 and that the number of groups equals the number of output
channels (e.g., weights.shape[1] == 1 and weights.shape[0] ==
node.attrs['group'] or group == out_channels); update isPWConv, isDWConv (and
thereby canExecute) accordingly so only bona fide pointwise (group==1) and true
depthwise convs are classified for NE16 coloring.
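The tightened classification can be sketched as follows; the OIHW-style weight shape and the way the group attribute is passed in are assumptions for illustration, not Deeploy's exact node API:

```python
import numpy as np

def is_pw_conv(kernel_shape, group: int) -> bool:
    # Pointwise only if the kernel is 1x1 AND the conv is ungrouped.
    return tuple(kernel_shape) == (1, 1) and group == 1

def is_dw_conv(weights: np.ndarray, group: int) -> bool:
    # True depthwise: one input channel per group, and as many groups
    # as output channels (weights assumed in (out_ch, in_per_group, kH, kW)).
    out_ch, in_per_group = weights.shape[0], weights.shape[1]
    return in_per_group == 1 and group == out_ch
```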

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Around line 27-44: The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 118-121: The parsers map only 4 inputs into
self.operatorRepresentation without bounds checking and always set shift to
'NULL', causing IndexError and mismatched scale_shift_addr; update parseNode()
in the affected parsers (the current loop that assigns inputs via inputs =
['data_in','weight','mul','add'] and self.operatorRepresentation[inputs[idx]] =
ctxt.lookup(inputNode.name).name) to perform explicit validation like
len(node.inputs) == 4 (following RQMatMulParser) OR handle a fifth input by
mapping node.inputs[4] into e.g. self.operatorRepresentation['shift'] (or read
shift from node.attrs['shift'] like PULPOpen) and stop hardcoding 'NULL' so the
template variable ${shift} used by ConvTemplate.py:310 gets a proper value;
apply the same change to NE16RQSPWConv2DParser and NE16RQSDenseConv2DParser.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 141-147: The conf0 computation treats bias presence
incorrectly—change the bias check to treat the "NULL" sentinel as absent: when
calling getNormQuantConf0 in the block guarded by operatorRepresentation["mul"]
!= "NULL", replace the argument "add" in operatorRepresentation with an explicit
boolean like ('add' in operatorRepresentation and operatorRepresentation["add"]
!= "NULL") so bit 25 isn't set when add == "NULL"; keep the existing mul !=
"NULL" guard as-is and pass the adjusted boolean into getNormQuantConf0
(symbols: operatorRepresentation, getNormQuantConf0, conf0, "mul", "add").
- Around line 323-327: The compound literal for ne16_cfg_t incorrectly assigns
weights_stride via an equals expression instead of a designated initializer;
change the block that sets task.data.cfg.weights_stride to use a designated
field (e.g., add a .weights_stride = { .d0 = weightStrideD0, .d1 =
weightStrideD1, .d2 = weightStrideD2 } entry) inside the ne16_cfg_t initializer
so it matches the surrounding fields like .input_stride, .output_stride and
.subtile and produces valid C code.
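For the bias-bit issue in the first item above, the sentinel handling can be sketched like this; the bit position and helper shape follow the review comment and should be treated as assumptions about the NE16 register layout:

```python
NE16_NORM_BIAS_BIT = 1 << 25  # assumed position of the norm-bias flag

def norm_quant_conf0(op_rep: dict) -> int:
    # Only set the bias bit when `add` names a real tensor, not the
    # "NULL" sentinel used for absent operands.
    conf0 = 0
    if op_rep.get("mul", "NULL") != "NULL":
        has_bias = op_rep.get("add", "NULL") != "NULL"
        if has_bias:
            conf0 |= NE16_NORM_BIAS_BIT
    return conf0
```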

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
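The stride-aware relation for a padding-free 1x1 conv reduces to a one-liner; a numeric sketch:

```python
def out_dim_1x1(in_dim: int, stride: int) -> int:
    # For kernel size 1 and padding 0, taps land at 0, stride, 2*stride, ...
    # so the output dimension is (in_dim - 1) // stride + 1.
    return (in_dim - 1) // stride + 1
```

For example, a 7-wide input at stride 2 samples positions 0, 2, 4, 6, giving 4 outputs.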

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 81-101: The subtraction of weight_offset from values in Passes.py
can overflow if values is int8; convert values to a wider integer dtype (e.g.,
np.int16 or np.int32) before computing weight_offset and performing values =
values - weight_offset, then ensure the resulting values are clipped to the
0–255 range and finally cast to np.uint8 just before calling _weightEncode
(refer to weightTensor.values, weight_offset, values, and _weightEncode) so the
transpose and depthwise logic remain unchanged.

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py`:
- Around line 15-16: The two parsed flags --ne16-wmem and --enable-3x3 are never
consumed; update main() to translate args.ne16_wmem and args.enable_3x3 into
CMake defines (e.g., add "-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and
"-DENABLE_3X3_CONV=ON" when args.enable_3x3 is true) and append them to the same
cmake argument list used for building (the variable that assembles cmake args /
options or the call that runs cmake/configure in main()); if the project expects
OFF values when flags are absent, explicitly add =OFF otherwise. Ensure you
reference the existing parser flags (--ne16-wmem / --enable-3x3) and modify the
code path that constructs cmake arguments in main() so these options are passed
to CMake.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3a121db6-6486-4738-9ded-bbe6729e8727

📥 Commits

Reviewing files that changed from the base of the PR and between 3b011bb and e41800e.

📒 Files selected for processing (28)
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • TargetLibraries/GAP9/CMakeLists.txt

#
# SPDX-License-Identifier: Apache-2.0

from . import *


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify self-referential wildcard imports in NE16 package initializers.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

find Deeploy/Targets/NE16 -maxdepth 1 -type d | sort

Repository: pulp-platform/Deeploy

Length of output: 241


🏁 Script executed:

ls -la Deeploy/Targets/NE16/ | head -30

Repository: pulp-platform/Deeploy

Length of output: 878


🏁 Script executed:

cat -n Deeploy/Targets/NE16/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

rg "from.*NE16.*import.*Tiler|from.*NE16.*Tiler" --type py

Repository: pulp-platform/Deeploy

Length of output: 205


🏁 Script executed:

rg "\.Tiler" Deeploy/Targets/NE16 --type py

Repository: pulp-platform/Deeploy

Length of output: 836


🏁 Script executed:

find Deeploy/Targets -maxdepth 2 -name "__init__.py" -type f | head -10

Repository: pulp-platform/Deeploy

Length of output: 374


🏁 Script executed:

for dir in Deeploy/Targets/*/; do echo "=== $(basename $dir) ==="; cat "$dir/__init__.py" 2>/dev/null | head -20; done

Repository: pulp-platform/Deeploy

Length of output: 1218


🏁 Script executed:

rg "__all__" Deeploy/Targets/NE16 --type py -A 5

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

head -20 Deeploy/Targets/NE16/Tiler.py

Repository: pulp-platform/Deeploy

Length of output: 1270


🏁 Script executed:

rg "import.*Tiler" Deeploy --type py | grep -v "TilerExtension\|TilerModel" | head -20

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg "from.*NE16" Deeploy --type py | grep -v "^Binary" | head -30

Repository: pulp-platform/Deeploy

Length of output: 2287


The proposed fix is incomplete: Tiler.py exists in the NE16 directory but is missing from the proposed imports

While the original concern is valid—from . import * does not expose submodules/packages as intended—the proposed fix lists only 9 modules and omits Tiler.py, which exists as a module in the directory and is imported internally by Engine.py.

Corrected fix
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from . import *
from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
__all__ = [
"Bindings",
"Deployer",
"Engine",
"OptimizationPasses",
"Parsers",
"Platform",
"Tiler",
"Templates",
"TileConstraints",
"TopologyOptimizationPasses",
]
🧰 Tools
🪛 Ruff (0.15.9)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/__init__.py` at line 5, The __init__.py currently uses
explicit imports but omitted the Tiler module, causing Engine.py's internal
imports to fail; update Deeploy.Targets.NE16.__init__.py to explicitly import or
expose the Tiler module (the Tiler.py symbol) alongside the other exported
modules so it is available when the package is imported, ensuring Engine.py's
references to Tiler resolve correctly (alternatively, use a package import loop
such as using pkgutil.iter_modules to dynamically import all submodules if you
prefer automatic inclusion).

Comment on lines +28 to +30
inputOffsets = {}):
super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
default_channels_first, deeployStateDir, inputOffsets)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether inputOffsets is mutated downstream (which would make shared default state dangerous).
rg -nP --type=py -C3 'def __init__\(|inputOffsets\s*=' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py
rg -nP --type=py -C2 'inputOffsets\[[^]]+\]\s*=|inputOffsets\.update\(|setdefault\(' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py

Repository: pulp-platform/Deeploy

Length of output: 12773


🏁 Script executed:

# Find the parent class of GAP9Deployer
rg -n "class GAP9Deployer" Deeploy/Targets/GAP9/Deployer.py -A 5

Repository: pulp-platform/Deeploy

Length of output: 312


🏁 Script executed:

# Get the full NE16Deployer.__init__ context and check parent
head -n 35 Deeploy/Targets/NE16/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 1693


🏁 Script executed:

# Get the full GAP9Deployer.__init__ context to see parent class definition
head -n 70 Deeploy/Targets/GAP9/Deployer.py

Repository: pulp-platform/Deeploy

Length of output: 2726


🏁 Script executed:

# Search for any mutation of inputOffsets using broader patterns
rg -n "inputOffsets" Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py -A 2 -B 2

Repository: pulp-platform/Deeploy

Length of output: 1535


🏁 Script executed:

# Find PULPDeployer class and its __init__
fd "Deployer.py" Deeploy/Targets/PULPOpen/

Repository: pulp-platform/Deeploy

Length of output: 101


🏁 Script executed:

# Search for PULPDeployer definition
rg -n "class PULPDeployer" Deeploy/ -A 10

Repository: pulp-platform/Deeploy

Length of output: 974


🏁 Script executed:

# Search for how inputOffsets is used in the parent class hierarchy
rg -n "self.inputOffsets" Deeploy/ -B 2 -A 2

Repository: pulp-platform/Deeploy

Length of output: 3019


🏁 Script executed:

# Read SignPropDeployer to see the full __init__ and how inputOffsets is mutated
head -n 55 Deeploy/CommonExtensions/NetworkDeployers/SignPropDeployer.py

Repository: pulp-platform/Deeploy

Length of output: 2075


🏁 Script executed:

# Check if inputOffsets is mutated elsewhere besides the initial population
rg -n "inputOffsets\[" Deeploy/ -B 1 -A 1

Repository: pulp-platform/Deeploy

Length of output: 1113


Use None as default for inputOffsets parameter.

The {} default on line 28 creates a shared mutable object that gets mutated in SignPropDeployer.__init__() (which populates it with keys). Multiple instances created without providing inputOffsets will share the same dict, causing unintended state leakage.

Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Note: GAP9Deployer has the same issue on line 54 and should be fixed identically.

🧰 Tools
🪛 Ruff (0.15.9)

[warning] 28-28: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
currently uses a mutable default inputOffsets = {} which causes shared state;
change the signature for the relevant classes (SignPropDeployer.__init__ and
GAP9Deployer.__init__) to use inputOffsets=None and inside each __init__ set
inputOffsets = {} if inputOffsets is None before passing it to super().__init__
(or before any mutations), so each instance gets its own dict and avoids state
leakage when SignPropDeployer populates the keys.
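The shared-default pitfall described above is easy to demonstrate in isolation; the function names here are illustrative, not Deeploy's API:

```python
def register_inputs(offsets = {}):          # bug: the default dict is created once, at def time
    offsets.setdefault("input_0", 0)
    return offsets


def register_inputs_fixed(offsets = None):  # fix: a fresh dict per call
    if offsets is None:
        offsets = {}
    offsets.setdefault("input_0", 0)
    return offsets


a = register_inputs()
b = register_inputs()
assert a is b                # both calls mutated the same shared dict

c = register_inputs_fixed()
d = register_inputs_fixed()
assert c is not d            # each call gets its own dict
```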

Comment on lines +62 to +81
def isPWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [1, 1] and \
node.attrs['dilations'] == [1, 1] and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def isDWConv(self, node) -> bool:
return node.op in ["Conv", "RequantizedConv"] and \
isinstance(node.inputs[1], gs.Constant) and \
node.attrs['kernel_shape'] == [3, 3] and \
node.attrs['dilations'] == [1, 1] and \
node.attrs['group'] != 1 and \
(node.attrs['strides'] == [1, 1] or self.enableStrides)

def canExecute(self, node: gs.Node) -> bool:
if self.enable3x3:
return self.isPWConv(node) or self.isDWConv(node) or self.isDenseConv(node)
else:
return self.isPWConv(node)

⚠️ Potential issue | 🟠 Major

Tighten grouped-convolution classification before coloring.

isPWConv() currently accepts grouped 1x1 convs, and isDWConv() accepts any 3x3 conv with group != 1. The downstream DW path assumes true depthwise semantics, so generic grouped convolutions can be colored for NE16 and then fail later in parsing/tiling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 81, The conv-classification
is too loose: tighten isPWConv to reject grouped 1x1 convolutions by adding a
check that node.attrs['group'] == 1, and tighten isDWConv to only accept true
depthwise convolutions by inspecting the conv weight tensor
(node.inputs[1].values) and requiring the per-group in_channels == 1 and that
the number of groups equals the number of output channels (e.g.,
weights.shape[1] == 1 and weights.shape[0] == node.attrs['group'] or group ==
out_channels); update isPWConv, isDWConv (and thereby canExecute) accordingly so
only bona fide pointwise (group==1) and true depthwise convs are classified for
NE16 coloring.
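A hedged sketch of the tightened checks, assuming an OIHW weight layout (cout, cin_per_group, kh, kw); the helper names are illustrative:

```python
import numpy as np


def is_pointwise(kernel_shape, group: int) -> bool:
    # True pointwise: 1x1 kernel and no grouping
    return list(kernel_shape) == [1, 1] and group == 1


def is_true_depthwise(weights: np.ndarray, group: int) -> bool:
    # True depthwise: one input channel per group, one group per output channel
    cout, cin_per_group = weights.shape[0], weights.shape[1]
    return cin_per_group == 1 and group == cout
```

For example, a weight of shape (64, 1, 3, 3) with group = 64 passes, while a generic grouped conv with shape (64, 4, 3, 3) and group = 16 is rejected instead of being mis-colored for the DW path.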

Comment on lines +27 to +44
weightMemoryOccupation = 0

# Current weight memory occupation
for buffer in {**ctxt.globalObjects, **ctxt.localObjects}.values():
if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
for node in ne16Nodes:
if node.op in ["Conv", "RequantizedConv"]:

if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
continue

buffer = ctxt.lookup(node.inputs[1].name)
if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
buffer._memoryLevel = self._weightMemoryLevel.name
weightMemoryOccupation += _ne16WeightBufferSize(buffer)

⚠️ Potential issue | 🟠 Major

Avoid double-counting reused weights and allow exact-fit placement.

If two NE16 nodes share the same weight buffer, this loop charges that buffer once per node, even after it has already been annotated earlier in this pass. The strict < check also skips buffers that exactly fill the remaining SRAM. Both cases leave feasible placements unassigned.

Suggested fix
         ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
         for node in ne16Nodes:
             if node.op in ["Conv", "RequantizedConv"]:

                 if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
                     continue

                 buffer = ctxt.lookup(node.inputs[1].name)
-                if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
-                    buffer._memoryLevel = self._weightMemoryLevel.name
-                    weightMemoryOccupation += _ne16WeightBufferSize(buffer)
+                if getattr(buffer, "_memoryLevel", None) == self._weightMemoryLevel.name:
+                    continue
+
+                bufferSize = _ne16WeightBufferSize(buffer)
+                if weightMemoryOccupation + bufferSize <= self._weightMemoryLevel.size:
+                    buffer._memoryLevel = self._weightMemoryLevel.name
+                    weightMemoryOccupation += bufferSize
         return ctxt, graph
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`
around lines 27 - 44, The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
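The corrected greedy placement can be sketched independently of Deeploy's context objects; every name below is illustrative:

```python
def place_weights(nodes_weights, level_size):
    """Greedy placement: charge each shared buffer once and allow exact fits."""
    used = 0
    placed = set()
    for name, size in nodes_weights:
        if name in placed:             # shared buffer already annotated: skip
            continue
        if used + size <= level_size:  # <= permits an exact fit
            placed.add(name)
            used += size
    return placed, used


# Two nodes share "w0"; "w1" exactly fills the remaining space.
placed, used = place_weights([("w0", 16), ("w0", 16), ("w1", 48)], 64)
```

With the original logic (`<` and no dedup), "w0" would be charged twice and "w1" rejected; here both buffers are placed and the level is used to capacity.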

Comment thread Deeploy/Targets/NE16/Parsers.py
Comment thread Deeploy/Targets/NE16/Templates/ConvTemplate.py
Comment on lines +41 to +44
weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify weight tensor shape after encoding and check if dimIdx 3 causes issues
echo "=== Check TilerModel.getTensorDimVar implementation ==="
ast-grep --pattern 'def getTensorDimVar($$$):
  $$$
'

echo -e "\n=== Check if weight shape assertions exist in tiler ==="
rg -n "getTensorDimVar.*weight" --type py -C 2

Repository: pulp-platform/Deeploy

Length of output: 17734


🏁 Script executed:

# Check if the variables declared at lines 41-44 are used later in the function
sed -n '1,100p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 6259


🏁 Script executed:

# Check if similar variables are used in the comparable Neureka file
sed -n '40,80p' Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py | cat -n

Repository: pulp-platform/Deeploy

Length of output: 2692


🏁 Script executed:

# Find where NE16AdjustWeightMemoryLayoutPass is defined and check weight shape
find . -name "*NE16AdjustWeightMemoryLayoutPass*" -o -name "*Parsers.py" | head -5

Repository: pulp-platform/Deeploy

Length of output: 246


🏁 Script executed:

# Search for all uses of the suspect variables in NE16DenseConstraint.py
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check NE16 Parsers.py for weight shape information
cat -n Deeploy/Targets/NE16/Parsers.py | head -100

Repository: pulp-platform/Deeploy

Length of output: 4765


🏁 Script executed:

# Compare with Neureka - check if those variables are used there
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py

Repository: pulp-platform/Deeploy

Length of output: 370


🏁 Script executed:

# Check if getTensorDimVar creates variables on-demand or retrieves them
cat -n Deeploy/TilingExtension/TilerModel.py | sed -n '125,145p'

Repository: pulp-platform/Deeploy

Length of output: 1096


🏁 Script executed:

# Check addTensorDimToModel to see what dimensions it adds to the model
rg -n "def addTensorDimToModel" Deeploy/TilingExtension/TilerModel.py -A 15

Repository: pulp-platform/Deeploy

Length of output: 762


🏁 Script executed:

# Verify the runtime error scenario: check if weight is 3D and dimIdx=3 would fail
python3 << 'EOF'
# Simulate the getTensorDimVar call
weight_shape = (64, 32*64, 128)  # Example 3D weight: (cout, cinMajor, bits*spatial)
num_dims = len(weight_shape)

print(f"Weight shape: {weight_shape}")
print(f"Number of dimensions: {num_dims}")
print(f"Valid dimIdx range: 0 to {num_dims - 1}")

# Try to access dimIdx=3
for dimIdx in [0, 1, 2, 3]:
    varName = f"weight_dim_{dimIdx}"
    if dimIdx < num_dims:
        print(f"✓ dimIdx={dimIdx}: {varName} would exist (size: {weight_shape[dimIdx]})")
    else:
        print(f"✗ dimIdx={dimIdx}: {varName} would NOT exist (KeyError)")
EOF

Repository: pulp-platform/Deeploy

Length of output: 344


🏁 Script executed:

# Double-check the complete addGeometricalConstraint function to confirm these variables aren't used
wc -l Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
sed -n '24,90p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | grep -E "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar"

Repository: pulp-platform/Deeploy

Length of output: 425


Remove unused weight dimension variables that cause runtime KeyError.

Lines 42-44 declare variables that are never used and access a non-existent dimension. Per Parsers.py line 60, NE16 weight tensors are 3D (cout, cinMajor, bits*spatial) after encoding, with valid dimensions 0–2. Attempting to access dimIdx=3 via getTensorDimVar will fail at runtime with a KeyError.

Remove these three lines:

        weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
        weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
        weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

Keep only line 41: weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0), which is actually used in the function.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).

Comment on lines +43 to +46
# Map output dims to inputs dims
tilerModel.addConstraint(outputBatchVar == inputBatchVar)
tilerModel.addConstraint(outputHeightVar == inputHeightVar)
tilerModel.addConstraint(outputWidthVar == inputWidthVar)

⚠️ Potential issue | 🟠 Major

Model pointwise output geometry with stride, not identity.

This forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar, but NE16Engine.canExecute() explicitly allows strided 1x1 convs when enableStrides is set. Those tiles need the usual (effective_input - 1) // stride + 1 relation, otherwise the model is wrong as soon as stride > 1.

Suggested fix
-        # Map output dims to inputs dims
+        # Map output dims to input dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
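The stride-aware relation can be checked numerically: for a 1x1 kernel the general convolution output formula reduces to (in + pads - 1) // stride + 1, so the identity constraint only holds at stride 1. The helper name is illustrative:

```python
def conv_out_dim(in_dim: int, pad_lo: int, pad_hi: int, kernel: int, stride: int) -> int:
    # Standard (non-dilated) convolution output size
    return (in_dim + pad_lo + pad_hi - kernel) // stride + 1


# 1x1 kernel, no padding: output equals input only when stride == 1
assert conv_out_dim(16, 0, 0, 1, 1) == 16
assert conv_out_dim(16, 0, 0, 1, 2) == 8   # identity constraint would be wrong here
```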

Comment on lines +81 to +101
values = weightTensor.values

# Extract weight offset and translate weights by the offset
weight_offset = values.min()
values = values - weight_offset
node.attrs["weight_offset"] = weight_offset

if "channels_first" in node.attrs:
channels_first = node.attrs["channels_first"]
else:
channels_first = default_channels_first

# Weight encode expects channels first
if not channels_first:
values = values.transpose(0, 3, 1, 2)

bits = 8 # Support only 8 bit weights for now
if node.attrs['group'] == 1:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = False)
else:
weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = True)

⚠️ Potential issue | 🟠 Major

Potential integer overflow when subtracting weight offset.

If values has dtype int8 (range -128 to 127), the subtraction at line 85 can overflow before the astype(np.uint8) at line 99. For example, with weight_offset = -128 and a value of 127:

127 - (-128) = 255  # overflows int8, wraps to -1

Cast to a wider type before subtraction to avoid wraparound:

🐛 Proposed fix
     # Adjust N-EUREKA's weights
     values = weightTensor.values

     # Extract weight offset and translate weights by the offset
     weight_offset = values.min()
-    values = values - weight_offset
+    values = values.astype(np.int32) - int(weight_offset)
     node.attrs["weight_offset"] = weight_offset

     if "channels_first" in node.attrs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 81 -
101, The subtraction of weight_offset from values in Passes.py can overflow if
values is int8; convert values to a wider integer dtype (e.g., np.int16 or
np.int32) before computing weight_offset and performing values = values -
weight_offset, then ensure the resulting values are clipped to the 0–255 range
and finally cast to np.uint8 just before calling _weightEncode (refer to
weightTensor.values, weight_offset, values, and _weightEncode) so the transpose
and depthwise logic remain unchanged.
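The wraparound is easy to reproduce with NumPy; this is a standalone illustration of the failure mode and the widening fix, not Deeploy code:

```python
import numpy as np

w = np.array([-128, 0, 127], dtype=np.int8)
offset = w.min()                            # np.int8(-128)

wrapped = w - offset                        # int8 arithmetic: 127 - (-128) wraps to -1
widened = w.astype(np.int32) - int(offset)  # safe: yields 0, 128, 255

assert wrapped.dtype == np.int8 and wrapped[2] == -1
assert widened[2] == 255
assert (widened >= 0).all() and (widened <= 255).all()
```

After widening, every value lies in 0–255 and the final `astype(np.uint8)` before `_weightEncode` is lossless.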

Comment on lines +15 to +16
parser.add_argument('--ne16-wmem', action = 'store_true', help = 'Enable NE16 weight memory\n')
parser.add_argument('--enable-3x3', action = 'store_true', help = 'Enable 3x3 convolutions\n')

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify end-to-end wiring for NE16-specific runner flags.
rg -nP --type=py -C3 'ne16[_-]wmem|enable[_-]3x3|hasattr\(args,\s*.\s*ne16|hasattr\(args,\s*.\s*enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py DeeployTest/testUtils/core/execution.py
rg -nP --type=cmake -C3 'NE16|3x3|WMEM|NUM_CORES' CMakeLists.txt DeeployTest/CMakeLists.txt TargetLibraries/GAP9/CMakeLists.txt

Repository: pulp-platform/Deeploy

Length of output: 6815


🏁 Script executed:

# Search for main() function and how it handles arguments
rg -n 'def main' DeeployTest/testUtils/deeployRunner.py -A 20

# Look for args usage with underscore conversion (--ne16-wmem → args.ne16_wmem)
rg -n 'args\.ne16|args\.enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

# Check the entire flow in the runner file
cat -n DeeployTest/deeployRunner_tiled_gap9_w_ne16.py

Repository: pulp-platform/Deeploy

Length of output: 2136


🏁 Script executed:

# Get more of the main() function to see how it uses parser_setup_callback
sed -n '311,450p' DeeployTest/testUtils/deeployRunner.py | cat -n

# Also search for any reference to ne16_wmem or enable_3x3 after variable names are converted
rg -n 'ne16_wmem|enable_3x3' DeeployTest/testUtils/deeployRunner.py DeeployTest/testUtils/core/execution.py

Repository: pulp-platform/Deeploy

Length of output: 5539


Add handling for --ne16-wmem and --enable-3x3 flags in main().

The flags are defined on Lines 15-16, but unlike --cores (which is properly handled at deeployRunner.py:90-91), there is no downstream consumption of --ne16-wmem or --enable-3x3. These arguments are parsed but never converted to CMake options or used anywhere in execution. Either wire them into CMake configuration (e.g., -DENABLE_NE16_WMEM, -DENABLE_3X3_CONV), or remove them if not yet implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` around lines 15 - 16, The two
parsed flags --ne16-wmem and --enable-3x3 are never consumed; update main() to
translate args.ne16_wmem and args.enable_3x3 into CMake defines (e.g., add
"-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and "-DENABLE_3X3_CONV=ON"
when args.enable_3x3 is true) and append them to the same cmake argument list
used for building (the variable that assembles cmake args / options or the call
that runs cmake/configure in main()); if the project expects OFF values when
flags are absent, explicitly add =OFF otherwise. Ensure you reference the
existing parser flags (--ne16-wmem / --enable-3x3) and modify the code path that
constructs cmake arguments in main() so these options are passed to CMake.
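The wiring itself is a few lines; note that -DENABLE_NE16_WMEM and -DENABLE_3X3_CONV are hypothetical define names from the suggestion above (the project would need matching option() entries in CMake):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--ne16-wmem', action='store_true', help='Enable NE16 weight memory')
parser.add_argument('--enable-3x3', action='store_true', help='Enable 3x3 convolutions')
args = parser.parse_args(['--ne16-wmem'])

# argparse converts dashes to underscores: --ne16-wmem -> args.ne16_wmem
cmake_args = [
    f"-DENABLE_NE16_WMEM={'ON' if args.ne16_wmem else 'OFF'}",
    f"-DENABLE_3X3_CONV={'ON' if args.enable_3x3 else 'OFF'}",
]
```

Passing explicit ON/OFF values (rather than appending the define only when set) keeps stale CMake cache entries from silently keeping a feature enabled across configure runs.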

@runwangdl force-pushed the gap9-ne16 branch 9 times, most recently from 1153077 to 4edb011 on April 13, 2026 at 23:43

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Fix invalid designated initializer syntax for weights_stride.

The line task.data.cfg.weights_stride = (ne16_stride_t) {...} is inside the .cfg = (ne16_cfg_t) { ... } designated initializer block. Within designated initializers, you must use .fieldname = value syntax. The current syntax is invalid C and will cause compilation errors.

🐛 Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Inside the designated initializer for task.data.cfg (the .cfg = (ne16_cfg_t) {
... } block) the assignment to weights_stride uses an invalid nested C
cast-style initializer; replace it with a proper designated-field initializer by
setting .weights_stride = { .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1},
.d2 = ${weightStrideD2} } (or include an explicit .weights_stride =
(ne16_stride_t){ .d0 = ..., .d1 = ..., .d2 = ... } if you prefer the cast) so
the field name .weights_stride is used with a brace-enclosed initializer
compatible with ne16_stride_t within the ne16_cfg_t designated initializer.
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)

7-11: Minor style nit: Consider grouping all imports before module-level assignments.

The assignment float32_tPtr = PointerClass(float32_t) on line 10 appears between import statements. Per PEP 8 conventions, imports should typically be grouped together at the top of the file before other statements.

♻️ Suggested reordering
 from Deeploy.AbstractDataTypes import PointerClass
 from Deeploy.CommonExtensions.DataTypes import float32_t
-
-float32_tPtr = PointerClass(float32_t)
 from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation
+
+float32_tPtr = PointerClass(float32_t)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` around lines 7 - 11,
Move the module-level assignment float32_tPtr = PointerClass(float32_t) so all
import statements are grouped first: keep the imports for PointerClass,
float32_t, and DeeployTypes together at the top, then place the float32_tPtr
assignment immediately after those imports; ensure you reference PointerClass
and float32_t exactly as they are imported and leave NetworkContext,
NodeTemplate, OperatorRepresentation imports unchanged.
DeeployTest/test_platforms.py (1)

1004-1006: Unused config_name variable (consistent with existing tests).

The config_name variable is unpacked but never used. This matches the pattern in all other tiled test functions in this file (e.g., lines 317, 346, 808, etc.). For consistency, you could prefix it with an underscore to signal intent:

-    test_name, l1, config_name = test_params
+    test_name, l1, _config_name = test_params

This is a low-priority style nit since the existing codebase follows the same pattern.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/test_platforms.py` around lines 1004 - 1006, In
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the unpacked variable
config_name is unused; rename it to _config_name (or prefix with an underscore)
in the function signature to indicate it's intentionally unused and match the
existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
.github/workflows/_runner-gap9-w-ne16-tiled.yml (1)

36-36: Consider logging pip install failures instead of silently ignoring them.

The || true suppresses all pip install failures silently. If the install genuinely fails, subsequent test runs will fail with confusing errors. Consider logging the failure while still allowing the workflow to proceed:

-          pip install -e . || true
+          pip install -e . || echo "WARNING: pip install failed, tests may not work correctly"

Alternatively, if the install is expected to always succeed in CI, remove || true to fail fast on broken dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/_runner-gap9-w-ne16-tiled.yml at line 36, The workflow
step currently runs "pip install -e . || true", which silently swallows
installation failures; change this to either remove the "|| true" so the job
fails fast on pip install errors, or replace it with logic that captures the pip
exit status and logs the failure before continuing (e.g., run pip install -e .
and on failure emit a clear error log message referencing the pip install step);
locate the exact step containing "pip install -e . || true" in the workflow and
update it accordingly so failures are not silently ignored.
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

161-162: Prefix unused unpacked variables with underscore.

Per static analysis hints, several unpacked variables are never used. Prefix them with underscore to indicate intentional non-use.

♻️ Suggested fix
-            (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-            (BatchSize, HSize, WSize, CSize) = cube.dims
+            (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+            (_BatchSize, HSize, WSize, CSize) = cube.dims
-            inBSize, inHSize, inWSize, inCSize = InCube.dims
+            _inBSize, _inHSize, inWSize, inCSize = InCube.dims

Also applies to: 173-173

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161
- 162, The tuple unpackings in NE16DenseConstraint where you extract cube.offset
into (BatchOffset, HOffset, WOffset, COffset) and cube.dims into (BatchSize,
HSize, WSize, CSize) (and the similar unpacking around line 173) include
variables that are never used; update these unpackings to prefix the unused
names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so static analysis
recognizes intentional non-use while keeping any used variables unchanged and
preserving the tuple structure in the methods within the NE16DenseConstraint
class.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

79-80: Remove unused padding variable.

The padding variable is read from parseDict but never used in addPolicyConstraint.

♻️ Suggested fix
         strides = parseDict["strides"]
-        padding = parseDict["pads"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
79 - 80, In addPolicyConstraint (class NE16PointwiseConstraint) the local
variable padding is assigned from parseDict["pads"] but never used; remove the
unused assignment (the line `padding = parseDict["pads"]`) so there is no dead
variable left, or if padding is intended to be used, instead reference it where
needed—locate the parseDict usage in addPolicyConstraint and either delete the
padding assignment or integrate padding into the constraint logic.
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (2)

29-31: Remove unused dilation variable.

The dilation variable is extracted but never used in the function. Since NE16 doesn't support dilation (enforced by parser's dilations == [1, 1] check), this variable is unnecessary.

♻️ Suggested fix
         strides = parseDict["strides"]
         padding = parseDict["pads"]
-        dilation = parseDict["dilations"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
29 - 31, Remove the unused local variable `dilation` that is assigned from
`parseDict["dilations"]`; since the parser enforces `dilations == [1, 1]` and
NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.

160-162: Use already-looked-up outputBuffer instead of redundant lookup.

outputBuffer was already retrieved at line 153. Line 162 performs an unnecessary second lookup.

♻️ Suggested fix
             InCube, padding_tuple = Conv2DTileConstraint.computeInputCube((weightH, weightW), pads, strides, weightC,
                                                                           cube,
-                                                                          ctxt.lookup(varOut).shape)
+                                                                          outputBuffer.shape)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
160 - 162, Replace the redundant ctxt.lookup(varOut).shape call with the
already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.
Deeploy/Targets/NE16/Parsers.py (1)

56-56: Use idiomatic boolean check instead of equality comparison to False.

Per Ruff E712, avoid explicit equality comparisons to False.

♻️ Suggested fix
         if not all([
-                channels_first == False,
+                not channels_first,
                 len(data_in.shape) == 4,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` at line 56, Replace the explicit equality
check "channels_first == False" with the idiomatic boolean negation "not
channels_first" wherever it appears (e.g., in the call/config constructing the
parser using the channels_first variable) to comply with Ruff E712 and avoid
direct comparison to False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Line 34: The list comprehension that builds ne16Nodes uses
node.attrs["engine"] directly and can raise KeyError for nodes without an
"engine"; change it to safely check for the key (e.g., use
node.attrs.get("engine") == self.ne16EngineName or "engine" in node.attrs and
node.attrs["engine"] == self.ne16EngineName) so only nodes with an engine
attribute are compared against self.ne16EngineName; update the expression that
defines ne16Nodes accordingly in MemoryLevelAnnotationPasses.py.

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Inside the designated initializer for task.data.cfg (the
.cfg = (ne16_cfg_t) { ... } block) the assignment to weights_stride uses an
invalid nested C cast-style initializer; replace it with a proper
designated-field initializer by setting .weights_stride = { .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} } (or
include an explicit .weights_stride = (ne16_stride_t){ .d0 = ..., .d1 = ..., .d2
= ... } if you prefer the cast) so the field name .weights_stride is used with a
brace-enclosed initializer compatible with ne16_stride_t within the ne16_cfg_t
designated initializer.

---

Nitpick comments:
In @.github/workflows/_runner-gap9-w-ne16-tiled.yml:
- Line 36: The workflow step currently runs "pip install -e . || true", which
silently swallows installation failures; change this to either remove the "||
true" so the job fails fast on pip install errors, or replace it with logic that
captures the pip exit status and logs the failure before continuing (e.g., run
pip install -e . and on failure emit a clear error log message referencing the
pip install step); locate the exact step containing "pip install -e . || true"
in the workflow and update it accordingly so failures are not silently ignored.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Line 56: Replace the explicit equality check "channels_first == False" with
the idiomatic boolean negation "not channels_first" wherever it appears (e.g.,
in the call/config constructing the parser using the channels_first variable) to
comply with Ruff E712 and avoid direct comparison to False.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint where you
extract cube.offset into (BatchOffset, HOffset, WOffset, COffset) and cube.dims
into (BatchSize, HSize, WSize, CSize) (and the similar unpacking around line
173) include variables that are never used; update these unpackings to prefix
the unused names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so
static analysis recognizes intentional non-use while keeping any used variables
unchanged and preserving the tuple structure in the methods within the
NE16DenseConstraint class.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 29-31: Remove the unused local variable `dilation` that is
assigned from `parseDict["dilations"]`; since the parser enforces `dilations ==
[1, 1]` and NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.
- Around line 160-162: Replace the redundant ctxt.lookup(varOut).shape call with
the already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 79-80: In addPolicyConstraint (class NE16PointwiseConstraint) the
local variable padding is assigned from parseDict["pads"] but never used; remove
the unused assignment (the line `padding = parseDict["pads"]`) so there is no
dead variable left, or if padding is intended to be used, instead reference it
where needed—locate the parseDict usage in addPolicyConstraint and either delete
the padding assignment or integrate padding into the constraint logic.

In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py`:
- Around line 7-11: Move the module-level assignment float32_tPtr =
PointerClass(float32_t) so all import statements are grouped first: keep the
imports for PointerClass, float32_t, and DeeployTypes together at the top, then
place the float32_tPtr assignment immediately after those imports; ensure you
reference PointerClass and float32_t exactly as they are imported and leave
NetworkContext, NodeTemplate, OperatorRepresentation imports unchanged.

In `@DeeployTest/test_platforms.py`:
- Around line 1004-1006: In test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the
unpacked variable config_name is unused; rename it to _config_name (or prefix
with an underscore) in the function signature to indicate it's intentionally
unused and match the existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e7e2d4c-444f-4b78-b2b5-22aef5e9a86f

📥 Commits

Reviewing files that changed from the base of the PR and between e41800e and 38f10f2.

📒 Files selected for processing (32)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (5)
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/CMakeLists.txt
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • TargetLibraries/GAP9/CMakeLists.txt
  • CMakeLists.txt
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/Platform.py

        if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
            weightMemoryOccupation += _ne16WeightBufferSize(buffer)

        ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]

⚠️ Potential issue | 🟡 Minor

Potential KeyError if node lacks engine attribute.

The list comprehension accesses node.attrs["engine"] without checking if the key exists. Nodes not assigned to an engine may lack this attribute.

🛡️ Proposed defensive fix
-        ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
+        ne16Nodes = [node for node in graph.nodes if node.attrs.get("engine") == self.ne16EngineName]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py` at
line 34, The list comprehension that builds ne16Nodes uses node.attrs["engine"]
directly and can raise KeyError for nodes without an "engine"; change it to
safely check for the key (e.g., use node.attrs.get("engine") ==
self.ne16EngineName or "engine" in node.attrs and node.attrs["engine"] ==
self.ne16EngineName) so only nodes with an engine attribute are compared against
self.ne16EngineName; update the expression that defines ne16Nodes accordingly in
MemoryLevelAnnotationPasses.py.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (5)
Deeploy/Targets/NE16/Deployer.py (1)

19-30: ⚠️ Potential issue | 🟠 Major

Avoid shared inputOffsets state across deployer instances.

inputOffsets = {} still creates one dict shared by every NE16Deployer built without an explicit argument, so per-instance state can leak between deployments.

Suggested fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 19 - 30, The constructor
__init__ currently uses a mutable default inputOffsets = {} which shares state
across instances; change the parameter default to inputOffsets: Optional[Dict] =
None and inside __init__ (before calling super().__init__) set inputOffsets = {}
if it's None, then pass that local dict into super().__init__; update the
function signature and ensure any references to inputOffsets in
NE16Deployer.__init__ use the per-instance dict created when None is supplied.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)

90-106: ⚠️ Potential issue | 🟠 Major

Widen before subtracting weight_offset.

If weightTensor.values is int8, values - weight_offset happens in the narrow dtype and can wrap before the later cast to uint8, corrupting the encoded weights.

Suggested fix
-    weight_offset = values.min()
-    values = values - weight_offset
+    weight_offset = int(values.min())
+    values = values.astype(np.int32) - weight_offset
     node.attrs["weight_offset"] = weight_offset
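
A standalone NumPy demonstration of the wraparound (not the actual encoder): subtracting in int8 wraps, while widening first gives the intended 0–255 range.

```python
import numpy as np

w = np.array([-128, 0, 127], dtype=np.int8)
offset = w.min()  # -128, still an int8 scalar

narrow = w - offset                      # subtraction stays int8 and wraps
wide = w.astype(np.int32) - int(offset)  # widened first: exact result

print(narrow.tolist())  # [0, -128, -1] after int8 wraparound
print(wide.tolist())    # [0, 128, 255]
```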
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 90 -
106, The subtraction values = values - weight_offset can underflow when
weightTensor.values is int8; before computing the offset and subtracting, widen
values to a safe integer type (e.g., int16/int32) so the subtraction is
performed without wraparound, then clip/normalize and finally cast to np.uint8
before calling _weightEncode; update the operations around weight_offset,
values, and the subsequent transpose/astype steps (referencing weight_offset,
values, weightTensor.values, and _weightEncode) to perform widening,
subtraction, any needed clamping, and only then cast to uint8 for encoding.
Deeploy/Targets/NE16/Engine.py (1)

62-75: ⚠️ Potential issue | 🟠 Major

Grouped convolutions are still being classified as NE16 PW/DW.

isPWConv() still accepts grouped 1x1 convs, and isDWConv() still accepts any grouped 3x3 conv. That will color generic grouped convolutions for NE16 even though the downstream paths only handle regular pointwise (group == 1) or true depthwise weights.
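
A minimal sketch of the stricter predicates; the attribute names follow the ONNX Conv attributes the review mentions, but the exact node interface on NE16Engine is assumed:

```python
# Illustrative predicates only; `attrs` mirrors ONNX Conv attributes and
# `in_channels` must come from the conv's input tensor shape.
def is_pointwise(attrs: dict) -> bool:
    # True pointwise conv: 1x1 kernel AND no grouping.
    return attrs.get("kernel_shape") == [1, 1] and attrs.get("group", 1) == 1

def is_depthwise(attrs: dict, in_channels: int) -> bool:
    # True depthwise conv: one group per input channel.
    return (attrs.get("kernel_shape") == [3, 3]
            and attrs.get("group", 1) == in_channels)
```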

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, isPWConv currently
allows grouped 1x1 convs and isDWConv accepts any grouped 3x3 conv; change the
predicates to only mark true pointwise (group == 1) and true depthwise (group ==
number of input channels) respectively: update isPWConv to include
node.attrs['group'] == 1 in its condition, and update isDWConv to require
node.attrs['group'] equal the convolution's input channel count (i.e., only
treat it as depthwise when group == input_channels) while keeping the existing
kernel_shape, dilation and stride checks.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

43-46: ⚠️ Potential issue | 🟠 Major

Model pointwise output geometry from stride, not identity.

These constraints only describe stride-1 pointwise conv. If NE16Engine.enableStrides is ever enabled, H_out/W_out become wrong immediately because the tiler still forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar.

Suggested fix
         # Map output dims to inputs dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint map outputHeightVar
and outputWidthVar directly to inputHeightVar/inputWidthVar which only works for
stride=1; update the constraints to compute output dims from the pointwise
stride when NE16Engine.enableStrides is used (use the convolution output formula
for 1x1: H_out = ceil(inputHeight / strideH), W_out = ceil(inputWidth /
strideW)). Replace the equalities on outputHeightVar and outputWidthVar with
expressions that use the pointwise stride variables (e.g., strideH/strideW or
the engine's stride config) and implement the integer ceil division ((input +
stride - 1) / stride) so tilerModel.addConstraint uses outputHeightVar ==
(inputHeightVar + strideHVar - 1) / strideHVar and similarly for width; keep the
batch mapping (outputBatchVar == inputBatchVar) as-is.
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)

352-356: ⚠️ Potential issue | 🔴 Critical

Use a designated field initializer for weights_stride.

task.data.cfg.weights_stride = ... is not valid inside a C compound literal. This needs to be .weights_stride = ... like the neighboring fields, otherwise the generated kernel code will not compile.

Proposed fix
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356,
Replace the invalid in-line assignment inside the C compound literal with a
designated initializer: change the occurrence of "task.data.cfg.weights_stride =
(ne16_stride_t) { ... }" so that within the compound literal for task.data.cfg
you use ".weights_stride = (ne16_stride_t) { .d0 = ..., .d1 = ..., .d2 = ... }"
(i.e., reference the struct member as .weights_stride) so the generated kernel
compiles; update the initializer near the other designated fields in the same
compound literal where task.data.cfg is constructed.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)

5-5: Consider using explicit submodule imports for clarity.

The `from . import *` statement in a package `__init__.py` does not automatically import submodules; it only binds the names listed in `__all__` (if defined) into the package's namespace. For a package that simply exposes its submodules, explicit imports are clearer:

from . import NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers

This same pattern appears in several other new NE16 __init__.py files. If the current behavior is intentional (relying on consumers to import submodules directly), this can remain as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
wildcard import in the package __init__ (from . import *) with explicit
submodule imports to make exported symbols clear: import the submodules
NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint and
RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 32-40: The constructor is mutating the shared lowering optimizer
in place (self.loweringOptimizer.passes) which causes cross-deployer pollution;
instead clone the optimizer or at least its passes list before modifying it
(e.g., deep-copy the optimizer or assign a new passes list based on
list(self.loweringOptimizer.passes)) and then perform the PULPNCHWtoNHWCPass ->
NCHWtoNHWCPass replacement and append ConvEngineDiscolorationPass() and
NE16OptimizationPass(self.default_channels_first, "NE16") to the cloned
optimizer so the module-scoped NE16Optimizer from Platform.py is not mutated
across instances.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-40: The NE16 parser's parseNodeCtxt currently defaults
channels_first=True but the implementation only accepts False; change the
default parameter in parseNodeCtxt to channels_first: bool = False and update
the three RQS override method signatures to also default channels_first to False
so callers use NHWC by default and the method signatures match the only
supported layout (refer to parseNodeCtxt and the three RQS override methods to
locate and update their signatures).

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.
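
A toy sketch of pinning every axis; `Var` and `TilerModel` here are stand-ins that merely record constraints, while the real API builds `dimVar == dimVar.Max()` expressions:

```python
class Var:
    """Stand-in for a tiler dimension variable."""

    def __init__(self, name, maximum):
        self.name, self.maximum = name, maximum

    def Max(self):
        return self.maximum

class TilerModel:
    """Records constraints instead of solving them."""

    def __init__(self):
        self.constraints = []

    def addConstraint(self, expr):
        self.constraints.append(expr)

def pin_packed_weight(model, dim_vars):
    # Pin every axis of the packed DW weight block to its maximum so the
    # tiler can never shrink it; codegen always DMA-loads the full block.
    for v in dim_vars:
        model.addConstraint((v.name, v.Max()))

model = TilerModel()
pin_packed_weight(model, [Var("cout", 16), Var("kH", 3), Var("kW", 3)])
```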

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 139-150: The _findAllMultiplicands function currently uses float
division and an exclusive upper bound which breaks integer factorization; change
the loop to iterate i from 2 through int(math.sqrt(x)) inclusive (or use
math.isqrt(x)+1) and use integer division (tmpX //= i) when extracting factors
so tmpX stays an int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.

---

Duplicate comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 19-30: The constructor __init__ currently uses a mutable default
inputOffsets = {} which shares state across instances; change the parameter
default to inputOffsets: Optional[Dict] = None and inside __init__ (before
calling super().__init__) set inputOffsets = {} if it's None, then pass that
local dict into super().__init__; update the function signature and ensure any
references to inputOffsets in NE16Deployer.__init__ use the per-instance dict
created when None is supplied.

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: isPWConv currently allows grouped 1x1 convs and isDWConv
accepts any grouped 3x3 conv; change the predicates to only mark true pointwise
(group == 1) and true depthwise (group == number of input channels)
respectively: update isPWConv to include node.attrs['group'] == 1 in its
condition, and update isDWConv to require node.attrs['group'] equal the
convolution's input channel count (i.e., only treat it as depthwise when group
== input_channels) while keeping the existing kernel_shape, dilation and stride
checks.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Replace the invalid in-line assignment inside the C
compound literal with a designated initializer: change the occurrence of
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" so that within the
compound literal for task.data.cfg you use ".weights_stride = (ne16_stride_t) {
.d0 = ..., .d1 = ..., .d2 = ... }" (i.e., reference the struct member as
.weights_stride) so the generated kernel compiles; update the initializer near
the other designated fields in the same compound literal where task.data.cfg is
constructed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint map
outputHeightVar and outputWidthVar directly to inputHeightVar/inputWidthVar
which only works for stride=1; update the constraints to compute output dims
from the pointwise stride when NE16Engine.enableStrides is used (use the
convolution output formula for 1x1: H_out = ceil(inputHeight / strideH), W_out =
ceil(inputWidth / strideW)). Replace the equalities on outputHeightVar and
outputWidthVar with expressions that use the pointwise stride variables (e.g.,
strideH/strideW or the engine's stride config) and implement the integer ceil
division ((input + stride - 1) / stride) so tilerModel.addConstraint uses
outputHeightVar == (inputHeightVar + strideHVar - 1) / strideHVar and similarly
for width; keep the batch mapping (outputBatchVar == inputBatchVar) as-is.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 90-106: The subtraction values = values - weight_offset can
underflow when weightTensor.values is int8; before computing the offset and
subtracting, widen values to a safe integer type (e.g., int16/int32) so the
subtraction is performed without wraparound, then clip/normalize and finally
cast to np.uint8 before calling _weightEncode; update the operations around
weight_offset, values, and the subsequent transpose/astype steps (referencing
weight_offset, values, weightTensor.values, and _weightEncode) to perform
widening, subtraction, any needed clamping, and only then cast to uint8 for
encoding.

---

Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the wildcard import in the package __init__ (from . import *)
with explicit submodule imports to make exported symbols clear: import the
submodules NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint
and RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: eaa00d35-32b5-49a6-8eeb-a015c49962b3

📥 Commits

Reviewing files that changed from the base of the PR and between 38f10f2 and 748707a.

📒 Files selected for processing (35)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • CMakeLists.txt
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (4)
  • DeeployTest/conftest.py
  • Deeploy/Targets/NE16/Tiler.py
  • DeeployTest/testUtils/core/execution.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
🚧 Files skipped from review as they are similar to previous changes (9)
  • DeeployTest/testUtils/platformMapping.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • Deeploy/Targets/NE16/Bindings.py
  • DeeployTest/test_gap9_ne16_tiled_config.py

Comment thread Deeploy/Targets/NE16/Parsers.py
Comment on lines +37 to +40
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:

⚠️ Potential issue | 🟠 Major

Make the NE16 parser default to NHWC.

parseNodeCtxt() defaults channels_first to True, but Line 56 rejects anything except False. Any caller that uses the default path will fail to parse NE16 convolutions. Flip this default here and mirror it in the three RQS overrides so the method signature matches the only supported layout.

Proposed fix
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
@@
-                channels_first == False,
+                not channels_first,
-    def parseNodeCtxt(self,
-                      ctxt: NetworkContext,
-                      node: gs.Node,
-                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+    def parseNodeCtxt(self,
+                      ctxt: NetworkContext,
+                      node: gs.Node,
+                      channels_first: bool = False) -> Tuple[NetworkContext, bool]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 40, The NE16 parser's
parseNodeCtxt currently defaults channels_first=True but the implementation only
accepts False; change the default parameter in parseNodeCtxt to channels_first:
bool = False and update the three RQS override method signatures to also default
channels_first to False so callers use NHWC by default and the method signatures
match the only supported layout (refer to parseNodeCtxt and the three RQS
override methods to locate and update their signatures).

Comment on lines +52 to +58
weightBuffer = ctxt.lookup(weightBufferName)
# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
# Keep the outermost dim fixed at its full (=1) value regardless of
# the output channel tiling.
tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())


⚠️ Potential issue | 🟠 Major

Pin the whole packed DW weight tensor, not just dim 0.

Line 57 fixes only weightShape[0], but Lines 227-228 always emit a load for the full encoded weight block. That lets the tiler under-account weight memory by shrinking the packed-bytes axis, then codegen still DMA-loads the entire tensor. Constrain every dimension of the encoded DW weight buffer to Max() for this path.

Proposed fix
         weightBuffer = ctxt.lookup(weightBufferName)
         # NE16 DW weight is packed as a single (1, 1, packed_bytes) block
         # containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-        # Keep the outermost dim fixed at its full (=1) value regardless of
-        # the output channel tiling.
-        tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+        # Keep the full encoded block resident regardless of output-channel tiling.
+        for dimIdx in range(len(weightBuffer.shape)):
+            weightDimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+            tilerModel.addConstraint(weightDimVar == weightDimVar.Max())

Also applies to: 225-228

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.

Comment on lines +139 to +150
def _findAllMultiplicands(x: int) -> List[int]:
multiplicands = []
tmpX = x
for i in range(2, math.ceil(math.sqrt(x))): # Ceil cause range doesn't include the last number
while tmpX % i == 0:
multiplicands.append(i)
tmpX = tmpX / i

if x // math.prod(multiplicands) > 1:
multiplicands.append(x // math.prod(multiplicands))

return multiplicands

⚠️ Potential issue | 🟠 Major

Use integer factorization here.

tmpX = tmpX / i turns the accumulator into a float, and the current loop bound skips exact square roots. For dimensions like 4 or 49 this misses balanced factor pairs, so _bestReshapeOption() falls back to (dim, 1) and defeats the pointwise-reshape heuristic.

Suggested fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX //= i

-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+    if tmpX > 1:
+        multiplicands.append(tmpX)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The _findAllMultiplicands function currently uses float division and an
exclusive upper bound which breaks integer factorization; change the loop to
iterate i from 2 through int(math.sqrt(x)) inclusive (or use math.isqrt(x)+1)
and use integer division (tmpX //= i) when extracting factors so tmpX stays an
int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
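The corrected factorization can be checked in isolation. The sketch below is a standalone re-implementation following the suggested fix, not the repository's `_findAllMultiplicands` itself:

```python
import math
from typing import List


def find_all_multiplicands(x: int) -> List[int]:
    """Prime factorization using only integer arithmetic.

    The inclusive math.isqrt(x) bound catches exact square roots
    (4 -> [2, 2], 49 -> [7, 7]) that an exclusive ceil(sqrt(x))
    bound would miss.
    """
    multiplicands = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):
        while tmp % i == 0:
            multiplicands.append(i)
            tmp //= i  # integer division keeps tmp an int
    if tmp > 1:
        multiplicands.append(tmp)  # leftover prime factor above sqrt(x)
    return multiplicands


print(find_all_multiplicands(4))    # [2, 2]
print(find_all_multiplicands(49))   # [7, 7]
print(find_all_multiplicands(360))  # [2, 2, 2, 3, 3, 5]
```

With balanced factor pairs recovered for perfect squares, `_bestReshapeOption()` has real candidates to choose from instead of falling back to `(dim, 1)`.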

runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
@runwangdl runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 on April 14, 2026 10:50
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 14

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Deeploy/Targets/GAP9/Platform.py (1)

8-54: ⚠️ Potential issue | 🟡 Minor

Run the formatter/import hooks for this file.

CI reports both yapf and isort modified this file, so the branch is not passing pre-commit yet.

#!/bin/bash
pre-commit run yapf isort --files Deeploy/Targets/GAP9/Platform.py

Also applies to: 281-283

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Platform.py` around lines 8 - 54, The file containing
the long import block (look for symbols like DeQuantTilingReadyBindings,
BasicAllocateTemplate, RemoveEmptyConvBiasPass and PULPDMASliceBindings) is not
formatted/organized per CI; run the import and formatter hooks (isort and yapf)
to reorder and format the imports/line breaks, then stage and commit the
resulting changes so pre-commit passes (this also applies to the nearby import
group that includes PULP* bindings and parsers).
♻️ Duplicate comments (12)
Deeploy/Targets/NE16/__init__.py (1)

5-5: ⚠️ Potential issue | 🟠 Major

Replace the self-import with explicit NE16 module exports.

This remains unresolved: from . import * does not expose the new NE16 sibling modules. Use explicit imports and include Tiler as well.

Proposed fix
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/__init__.py` at line 5, Replace the wildcard self-import
in Deeploy/Targets/NE16/__init__.py with explicit exports: import the sibling
module objects you want to expose (e.g., the NE16 device class/name and the
Tiler class) by name and populate __all__ accordingly so consumers can access
NE16 and Tiler directly; specifically remove "from . import *", add explicit
imports of the NE16 module symbols and Tiler, and set __all__ = ["NE16",
"Tiler", ...] to list the public API.
Deeploy/Targets/NE16/Deployer.py (2)

28-30: ⚠️ Potential issue | 🟠 Major

Use None for the inputOffsets default.

{} is shared across NE16Deployer instances, so parent initialization can leak offsets between deployments.

Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Verification:

#!/bin/bash
python -m ruff check Deeploy/Targets/NE16/Deployer.py --select B006
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
NE16Deployer.__init__ currently uses a mutable default for the inputOffsets
parameter (inputOffsets = {}), which can be shared across instances; change the
signature to use inputOffsets=None and then when calling super().__init__ pass
an empty dict when inputOffsets is None (e.g., use inputOffsets or {}), ensuring
NE16Deployer.__init__ does not leak offsets between deployments.
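The Python pitfall behind this finding reproduces with toy classes (hypothetical names, unrelated to the actual deployer):

```python
class Leaky:
    # BUG: the dict literal is evaluated once at def time and shared
    # by every instance that relies on the default.
    def __init__(self, offsets = {}):
        self.offsets = offsets


class Safe:
    # FIX: None sentinel, fresh dict per instance.
    def __init__(self, offsets = None):
        self.offsets = {} if offsets is None else offsets


a, b = Leaky(), Leaky()
a.offsets["x"] = 128
print(b.offsets)  # {'x': 128} -- leaked across instances

c, d = Safe(), Safe()
c.offsets["x"] = 128
print(d.offsets)  # {}
```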

29-40: ⚠️ Potential issue | 🟠 Major

Avoid mutating the shared optimizer pass list.

self.loweringOptimizer.passes is modified in place. Since the NE16 optimizer is constructed at module scope, repeated deployer creation can append duplicate NE16 passes and alter other deployments’ lowering pipelines.

Proposed fix
+import copy
+
 ...
-        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
+        loweringOptimizer = copy.copy(loweringOptimizer)
+        loweringOptimizer.passes = list(loweringOptimizer.passes)
+
+        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Deployer.py` around lines 29 - 40, The code mutates the
shared loweringOptimizer.passes list causing duplicate/side-effect passes; fix
by creating a new list copy and assigning it back instead of modifying in place:
make a local copy (e.g., new_passes = list(self.loweringOptimizer.passes)),
perform the replacement of PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes
when self.Platform.engines[0].enable3x3 is true, then set
self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(),
NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these
symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass,
ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer
to avoid mutating shared state.
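The same class of bug applies to the module-scope optimizer: appending to a shared pass list mutates every deployment that reuses it. A minimal sketch with placeholder names (not the real optimizer API):

```python
class Optimizer:
    def __init__(self, passes):
        self.passes = passes


def build_deployer_buggy(opt):
    # BUG: in-place append on a shared object; every later build sees it.
    opt.passes.append("NE16Pass")
    return opt


def build_deployer_fixed(opt):
    # FIX: shallow-copy the pass list before extending it.
    local = Optimizer(list(opt.passes))
    local.passes.append("NE16Pass")
    return local


shared = Optimizer(["NCHWtoNHWC"])
build_deployer_buggy(shared)
build_deployer_buggy(shared)
print(shared.passes)  # ['NCHWtoNHWC', 'NE16Pass', 'NE16Pass'] -- duplicated

fresh = Optimizer(["NCHWtoNHWC"])
d1 = build_deployer_fixed(fresh)
d2 = build_deployer_fixed(fresh)
print(fresh.passes)  # ['NCHWtoNHWC'] -- untouched
```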
Deeploy/Targets/NE16/Parsers.py (2)

37-56: ⚠️ Potential issue | 🟠 Major

channels_first default is still True while the check rejects everything but False.

parseNodeCtxt defaults channels_first=True on line 40 (and the three RQS overrides at lines 113, 153, 193 also default to True), but line 56 rejects anything except False. Any caller that uses the default path returns (ctxt, False) and the NE16 parser fails to bind. Flip all four defaults to False, and replace the == False check with not channels_first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 56, The parseNodeCtxt
method in NE16's Parsers.py currently defaults channels_first=True but
immediately rejects any True value (using "channels_first == False"); change the
default of parseNodeCtxt to channels_first=False and also update the three RQS
override methods that default channels_first=True to default to False, then
simplify the conditional check from "channels_first == False" to "not
channels_first" (ensure you update all occurrences in parseNodeCtxt and the
three RQS override methods so callers using defaults succeed).

118-123: ⚠️ Potential issue | 🔴 Critical

Input mapping still has no bounds check and shift is never set from inputs/attrs.

The three RQS parseNodeCtxt overrides (lines 119-121 / 159-161 / 199-201) unconditionally map node.inputs[idx] into ['data_in','weight','mul','add']. If a 5th tensor is present (some RQS pipelines emit shift as an input) this throws IndexError, and if it's absent, the template variable ${shift} (referenced by ConvTemplate.py line 339 scale_shift_addr = (uint32_t)${shift}) stays "NULL" with no validation. Either validate len(node.inputs) == 4 explicitly or map a 5th input/node.attrs['shift'] into operatorRepresentation['shift'].

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Parsers.py` around lines 118 - 123, The parseNodeCtxt
overrides unconditionally map node.inputs into inputs =
['data_in','weight','mul','add'] causing IndexError for a 5th tensor and leaving
operatorRepresentation['shift'] unset; update parseNodeCtxt (the methods that
populate self.operatorRepresentation) to first check len(node.inputs) and either
assert/raise if it is not 4 or 5, and if a 5th input exists map it to
self.operatorRepresentation['shift'] = ctxt.lookup(node.inputs[4].name).name,
and additionally if 'shift' can come from node.attrs check
node.attrs.get('shift') and set operatorRepresentation['shift'] accordingly so
ConvTemplate.py's ${shift} is always defined or an explicit error is raised.
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (2)

88-102: ⚠️ Potential issue | 🟠 Major

int8 subtraction still overflows before cast to uint8.

values - weight_offset executes in values.dtype. For int8 weights, weight_offset = values.min() can be -128, so any 127 element wraps to -1 and becomes 255 after the later astype(uint8) — wrong absolute offset, wrong encoded weights. Please widen before subtracting.

🐛 Proposed fix
-    weight_offset = values.min()
-    values = values - weight_offset
+    weight_offset = int(values.min())
+    values = values.astype(np.int32) - weight_offset
     node.attrs["weight_offset"] = weight_offset
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 88 -
102, The subtraction "values = values - weight_offset" risks int8 overflow; cast
the tensor to a wider signed dtype (e.g., int16 or int32) before computing
"weight_offset" subtraction so arithmetic is done in a safe width, perform
"values = values.astype(wider_dtype) - int(weight_offset)" then (after any
necessary range checks/clamping) cast to uint8 for encoding; update the code
paths around weightTensor.values, weight_offset and the subsequent astype(uint8)
to use the widened intermediate to avoid wraparound for int8 inputs.
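A standalone NumPy snippet shows the intermediate wraparound (illustrative values only; the exact downstream breakage depends on how the encoder consumes the shifted tensor):

```python
import numpy as np

values = np.array([-128, 0, 127], dtype = np.int8)
offset = values.min()  # np.int8(-128)

# Narrow subtraction wraps: the offset-shifted tensor is meant to be
# non-negative, but int8 arithmetic leaves negatives in the intermediate.
narrow = values - offset  # [0, -128, -1] in int8

# Widening first keeps the arithmetic exact and in [0, 255].
wide = values.astype(np.int32) - int(offset)  # [0, 128, 255]

print(narrow.dtype, narrow.tolist())
print(wide.dtype, wide.tolist())
```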

139-150: ⚠️ Potential issue | 🟠 Major

_findAllMultiplicands still uses float division and an exclusive sqrt bound.

range(2, math.ceil(math.sqrt(x))) never tests the exact square root (for x = 4 the range is empty; for x = 49 it stops at 6), so multiplicands stays [] while math.prod([]) == 1 and the function returns [x]. Also tmpX = tmpX / i converts the accumulator to float, which then fails the tmpX % i == 0 check for large factors. Result: _bestReshapeOption collapses to (dim, 1) for many common spatial sizes and the pointwise-reshape heuristic never fires.

♻️ Proposed fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
-
-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+            tmpX //= i
+    if tmpX > 1:
+        multiplicands.append(tmpX)
     return multiplicands
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The function _findAllMultiplicands uses float division and an exclusive
sqrt bound; change the loop to iterate i from 2 up to and including the integer
square root (use math.isqrt(x) or int(math.sqrt(x)) + 1) and perform integer
division on tmpX (use //= instead of /=) so tmpX remains an int for the modulo
checks; also simplify the tail handling by appending tmpX if tmpX > 1 (rather
than recomputing x // math.prod(...)) so _findAllMultiplicands reliably returns
prime factors for inputs like 4 or 49 and avoids float-related modulo failures.
Deeploy/Targets/NE16/Engine.py (1)

62-75: ⚠️ Potential issue | 🟠 Major

isPWConv/isDWConv still over-accept grouped convs.

The classifier helpers still don't require group == 1 for PW or assert the DW shape condition (group == ch_im_out with weights.shape[1] == 1). With the parsers enforcing stricter rules downstream, this means canExecute can return True for a node that then fails parsing/tiling, instead of cleanly falling back to the GAP9 cluster engine.

🔒️ Proposed fix
     def isPWConv(self, node) -> bool:
         return node.op in ["Conv", "RequantizedConv"] and \
             isinstance(node.inputs[1], gs.Constant) and \
             node.attrs['kernel_shape'] == [1, 1] and \
             node.attrs['dilations'] == [1, 1] and \
+            node.attrs['group'] == 1 and \
             (node.attrs['strides'] == [1, 1] or self.enableStrides)

     def isDWConv(self, node) -> bool:
+        weights = node.inputs[1].values if isinstance(node.inputs[1], gs.Constant) else None
         return node.op in ["Conv", "RequantizedConv"] and \
             isinstance(node.inputs[1], gs.Constant) and \
             node.attrs['kernel_shape'] == [3, 3] and \
             node.attrs['dilations'] == [1, 1] and \
-            node.attrs['group'] != 1 and \
+            weights is not None and weights.shape[1] == 1 and \
+            node.attrs['group'] == weights.shape[0] and \
             (node.attrs['strides'] == [1, 1] or self.enableStrides)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, the PW/DW classifiers
are too permissive: update isPWConv to also require node.attrs['group'] == 1 to
reject grouped convs, and update isDWConv to enforce the depthwise conditions by
checking that the weight tensor on node.inputs[1] has weights.shape[1] == 1 and
that node.attrs['group'] == weights.shape[0] (i.e., group equals the output
channel count); locate these checks in the isPWConv and isDWConv functions and
add the group and weight-shape validations so canExecute no longer returns True
for convs that will fail downstream parsing/tiling.
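The tightened predicates can be sketched on plain shapes and attributes (hypothetical helpers mirroring the proposed fix, not the engine's methods):

```python
def is_pointwise(kernel, dilations, group, strides, enable_strides = False):
    # 1x1 conv is only pointwise for the NE16 if it is ungrouped.
    return (kernel == [1, 1] and dilations == [1, 1] and group == 1
            and (strides == [1, 1] or enable_strides))


def is_depthwise(kernel, dilations, group, w_shape, strides, enable_strides = False):
    # Depthwise: one filter per channel -> weights are (C_out, 1, kH, kW)
    # and group equals the output channel count, not merely group != 1.
    return (kernel == [3, 3] and dilations == [1, 1]
            and w_shape[1] == 1 and group == w_shape[0]
            and (strides == [1, 1] or enable_strides))


# group=4 conv with 8 output channels: neither PW nor true DW, so it
# should fall back to the cluster engine instead of binding to NE16.
print(is_pointwise([1, 1], [1, 1], 4, [1, 1]))                 # False
print(is_depthwise([3, 3], [1, 1], 4, (8, 2, 3, 3), [1, 1]))   # False
print(is_depthwise([3, 3], [1, 1], 8, (8, 1, 3, 3), [1, 1]))   # True
```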
Deeploy/Targets/NE16/Templates/ConvTemplate.py (2)

352-356: ⚠️ Potential issue | 🔴 Critical

.weights_stride is still the invalid task.data.cfg.weights_stride = … form.

Despite the previous "addressed in 773777f" note, the rendered template at these lines still emits task.data.cfg.weights_stride = (ne16_stride_t) { … } in the middle of the ne16_cfg_t compound literal. That is not a designated initializer and will not compile. The fix needs to land in this file.

🐛 Proposed fix
             .output_stride = (ne16_stride_t) {
                 .d0 = NE16_OUTPUT_BANDWIDTH_BYTES,
                 .d1 = ${dim_im_out_y_stride},
                 .d2 = ${dim_im_out_x_stride}
             },
-            task.data.cfg.weights_stride = (ne16_stride_t) {
+            .weights_stride = (ne16_stride_t) {
                 .d0 = ${weightStrideD0},
                 .d1 = ${weightStrideD1},
                 .d2 = ${weightStrideD2}
             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356, The
template emits an invalid assignment "task.data.cfg.weights_stride =
(ne16_stride_t) { ... }" inside the ne16_cfg_t compound literal; update the
ConvTemplate.py rendering so it produces a designated initializer by replacing
that emitted assignment with ".weights_stride = (ne16_stride_t){ .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e.,
remove the "task.data.cfg." prefix and emit the field as a proper member
initializer inside the ne16_cfg_t literal), ensuring punctuation and comma
placement matches other fields in the ne16_cfg_t literal.

145-151: ⚠️ Potential issue | 🟠 Major

"add" in operatorRepresentation still ignores the "NULL" sentinel.

Same story as before: mul is gated on != "NULL", but use_bias falls back to a presence check. parseNodeCtxt in Deeploy/Targets/NE16/Parsers.py lines 74–76 writes "NULL" into operatorRepresentation["add"] unconditionally, so for bias-less requantized convs bit 25 of conf0 ends up set while scale_bias_addr in the emitted task struct is NULL. Please treat "NULL" as absent here too.

🛡️ Proposed fix
         if operatorRepresentation["mul"] != "NULL":
             mulBuff = ctxt.lookup(operatorRepresentation["mul"])
             mulBits = mulBuff._type.referencedType.typeWidth
-            operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
-                                                                 operatorRepresentation["log2D"], mulBits, "add"
-                                                                 in operatorRepresentation, False)
+            use_bias = operatorRepresentation.get("add", "NULL") != "NULL"
+            operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
+                                                                 operatorRepresentation["log2D"], mulBits,
+                                                                 use_bias, False)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 145 - 151, The
conf0 bit for bias presence is being set when operatorRepresentation contains
the key "add" even if its value is the sentinel "NULL"; update the requantized
branch in ConvTemplate.py so the third argument to getNormQuantConf0 uses an
explicit non-NULL check (e.g. operatorRepresentation.get("add") != "NULL")
instead of '"add" in operatorRepresentation', mirroring the mul check
(operatorRepresentation["mul"] != "NULL"); locate the operatorRepresentation
usage around the mulBuff lookup and change the boolean expression passed to
getNormQuantConf0 accordingly so bias-less convs do not set bit 25.
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)

43-56: ⚠️ Potential issue | 🟠 Major

Pointwise output geometry still modeled as identity — wrong for strided 1×1 / padded cases.

With NE16Engine.enableStrides=True, isPWConv accepts strides != [1,1], but the constraints outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar force stride == 1 and no padding. Either add a strides == [1,1] assertion to the 1×1 path, or use the standard (effective_input - 1) // stride + 1 relation like the dense/DW paths do.

🔒️ Proposed fix
-        # Map output dims to inputs dims
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+
+        # Map output dims to input dims (stride- and padding-aware for 1x1)
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth  = inputWidthVar  + ((padding[1] + padding[3]) * (inputWidthVar  == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar  == (effectiveWidth  - 1) // strides[1] + 1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 56, The current pointwise tiler constraints incorrectly force
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids
strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where
tilerModel.addConstraint is called (the lines referencing
outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A)
add an assertion that strides == [1,1] when isPWConv can accept strides (check
NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred)
replace the identity constraints with the standard convolution output relation
used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1,
computing effective_input = input + pad_top + pad_bottom (use whatever padding
variables exist in scope) and apply that formula for both height and width via
tilerModel.addConstraint.
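The output-size relation the fix relies on can be sanity-checked in isolation; this is a plain-Python sketch (the function name and parameters are illustrative, not the tiler's API):

```python
def conv_output_size(in_size: int, kernel: int, stride: int, pad_lo: int, pad_hi: int) -> int:
    # Standard convolution output relation, as used by the dense/DW constraints.
    effective = in_size + pad_lo + pad_hi
    return (effective - kernel) // stride + 1

# For a 1x1 kernel this reduces to (effective_input - 1) // stride + 1.
```

With stride 2 on an 8-wide input, a 1×1 conv yields 4 output columns, which the identity constraint would wrongly force back to 8.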
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (1)

52-58: ⚠️ Potential issue | 🟠 Major

Still only pinning weight dim 0; the rest of the packed DW block can be shrunk by the tiler.

The depthwise weight is emitted as a single packed block (HyperRectangle((0,0,0), (weightShape[0], weightShape[1], weightShape[2])) on line 228), but only dim 0 is constrained to Max() here. The tiler is free to pick smaller values for dims 1 and 2, under-accounting weight memory while codegen still DMAs the full block.

🔒️ Proposed fix
         weightBuffer = ctxt.lookup(weightBufferName)
-        # NE16 DW weight is packed as a single (1, 1, packed_bytes) block
-        # containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-        # Keep the outermost dim fixed at its full (=1) value regardless of
-        # the output channel tiling.
-        tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+        # NE16 DW weight is one packed block loaded entirely every iteration;
+        # pin every axis to its Max so the tiler accounts for the full block.
+        for dimIdx in range(len(weightBuffer.shape)):
+            dimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+            tilerModel.addConstraint(dimVar == dimVar.Max())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current code only pins weightOutChannelVar to its Max(), allowing
the other two packed weight dimensions to be shrunk; fix by retrieving the
remaining dimension variables for the weight buffer (the vars corresponding to
dims 1 and 2 of weightBuffer) and add equivalent constraints using
tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed block (all
dims of weightBuffer) is fixed to full size; use weightBuffer/weightBufferName
to locate the buffer and weightOutChannelVar as a reference to the existing
constraint when adding the other two constraints.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)

41-44: Unused weight-dim variables.

weightInChannelMajorVar, weightBitsVar, and weightBandwidthVar are created but never referenced in addGeometricalConstraint or anywhere else. Only weightOutChannelVar is actually used (line 56). Either wire them into real constraints (e.g., pin the bits/bandwidth axes like the pointwise policy does) or drop them.

♻️ Proposed cleanup
         weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
-        weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
-        weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
-        weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, Remove or use the three unused weight-dim variables
(weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in
NE16DenseConstraint; they are allocated but never referenced in
addGeometricalConstraint while only weightOutChannelVar is used. Either delete
those getTensorDimVar calls to clean up the code, or wire them into the
constraint logic inside addGeometricalConstraint (or the class constructor) —
e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the
pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling
decisions and weightInChannelMajorVar is used where input-channel-major indexing
matters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Around line 62-80: The bias handling must normalize C.values to a 1-D array of
length Ko before any signed-input compensation; instead of only truncating when
bias_flat.size > Ko, handle bias_flat.size < Ko by expanding/broadcasting (e.g.,
repeat or np.resize) to length Ko and assert the resulting shape equals Ko,
ensure dtype is appropriate (int32/int64 as needed), then perform the
signed-input compensation using w_sum, scale_values (from
operatorRepresentation['mul']) or the int32 path, and finally assign the
corrected Ko-length array back to C.values; update references around C.values,
bias_flat, Ko, w_sum and operatorRepresentation so no silent broadcasting or
out-of-bounds write occurs.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py`:
- Around line 116-123: The code always sets NSize = buffA.shape[-1] which
ignores transposition; update the calculation to choose the reduction dimension
based on operatorRepresentation['transA'] (and consider transB if used
elsewhere): compute NSize = buffA.shape[-2] when transA == 1 else
buffA.shape[-1], and adjust any subsequent uses that compute A/B tile offsets or
shapes so they use this transpose-aware NSize (refer to variables transA,
transB, buffA, buffB, and NSize in NE16GEMMTileConstraint.py).
- Around line 131-167: The loop over outputCubes is using cube.offset (the local
rectangle) which ignores AbsoluteHyperRectangle.absoluteOffset, causing input
tile offsets to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).

In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py`:
- Around line 80-106: Ensure RequantizedGemm handling validates all requant
constants before mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.

In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py`:
- Around line 1200-1221: The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
- Around line 1234-1238: Add a normalized "n_levels" entry alongside
"n_levels_out" when building rqs_attrs so operatorRepresentation['n_levels'] is
present for RequantShiftChecker and callers that expect the normalized
attribute; update the rqs_attrs dict in the block that creates the constants
(the rqs_attrs variable) to include a gs.Constant key 'n_levels' (using the same
normalized n_levels value used for 'n_levels_out') so both 'n_levels' and
'n_levels_out' are emitted consistently with other passes.

In `@Deeploy/Targets/NE16/OptimizationPasses/__init__.py`:
- Line 5: Replace the wildcard import in __init__.py with explicit imports from
MemoryLevelAnnotationPasses.py so the symbol AnnotateNE16WeightMemoryLevel is
exported and discoverable; specifically, import the class/function name (e.g.,
AnnotateNE16WeightMemoryLevel) from MemoryLevelAnnotationPasses and either add
it to the module namespace or to __all__ in __init__.py (or remove the
initializer entirely) so callers can successfully do from
Deeploy.Targets.NE16.OptimizationPasses import AnnotateNE16WeightMemoryLevel.

In `@Deeploy/Targets/NE16/Platform.py`:
- Around line 7-21: The file's imports are not formatted/ordered per project
hooks; run the formatter and import-sorting hooks on this module (e.g., run
pre-commit or directly run yapf and isort) to reorder and reflow the import
block so CI passes; locate the top import section (references:
RequantizedGemmToPwPass, NE16Engine, GAP9Platform, GAP9_FloatGEMMMapper,
NE16AdjustGEMMWeightLayoutPass, PULPRQSGEMMLayer) and apply yapf/isort to fix
spacing, line breaks and import grouping consistently across that block (also
apply the same fixes around lines 34-43 where similar import formatting issues
exist).
- Around line 39-43: NE16Optimizer currently reuses the same pass instances from
PULPOptimizer by using [*PULPOptimizer.passes], causing shared mutable state;
change the construction to deep-copy the pass objects instead (e.g., use
copy.deepcopy on each element of PULPOptimizer.passes) so NE16Optimizer =
TopologyOptimizer([*deep-copied-passes, NE16AdjustGEMMWeightLayoutPass(),
RequantizedGemmToPwPass()], name="NE16Optimizer"); reference
PULPOptimizer.passes and NE16Optimizer and import/use copy.deepcopy (or use a
per-pass clone method if available) to ensure independent pass instances.
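The aliasing hazard behind this comment can be shown with plain objects (this Pass class is a stand-in, not Deeploy's real pass type):

```python
import copy

class Pass:
    # Stand-in for an optimizer pass that accumulates per-run mutable state.
    def __init__(self):
        self.seen = []

base_passes = [Pass(), Pass()]

shared = [*base_passes]                  # copies the list, but aliases the pass objects
shared[0].seen.append("run-A")           # mutation leaks back into base_passes

independent = [copy.deepcopy(p) for p in base_passes]
independent[0].seen.append("run-B")      # isolated: base_passes is untouched
```

After this runs, base_passes[0].seen contains "run-A" (leaked via the alias) but not "run-B".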

In `@Deeploy/Targets/NE16/Templates/__init__.py`:
- Line 5: The package __init__.py currently uses a self-referential wildcard
import ("from . import *") which doesn't actually import sibling modules (e.g.,
the NE16 allocation and conv template modules) and triggers Ruff F403; replace
the wildcard with explicit imports of the template modules (or explicit from
.module import <symbols>) and/or populate __all__ with the actual module or
symbol names (for example add explicit references to the NE16 allocation and
conv modules) so package-level imports expose the intended templates. Ensure you
reference and import the specific module names rather than relying on "from .
import *".

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the package wildcard import in __init__.py (currently "from .
import *") with explicit relative imports of the concrete constraint modules
and/or classes so sibling modules are actually loaded and their classes are
exposed; e.g., use "from .constraint_module import ConstraintClass" for each
constraint module, and populate an __all__ list with the exported class names
(or import the modules with "from . import constraint_module" if you prefer
module exports) so the concrete constraint classes are available to consumers.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py`:
- Line 5: The package __init__.py currently uses a wildcard import which doesn't
re-export classes from Passes.py; update
Deeploy.Targets.NE16.TopologyOptimizationPasses.__init__ to explicitly import
and re-export the four pass classes by adding an explicit import from .Passes of
NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass,
ConvEngineDiscolorationPass, and NE16OptimizationPass and ensure they are
exposed (e.g., in __all__) so consumers can import them directly from the
package.

In `@DeeployTest/test_platforms.py`:
- Around line 1008-1010: Unpack test_params using a throwaway name for the
unused config_name (e.g., test_name, l1, _ = test_params) and build the cmake
args using iterable unpacking instead of list concatenation (e.g.,
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]);
apply the same two edits for the other two occurrences that create
ne16_cmake_args in the GAP9_w_NE16 tiled tests so Ruff warnings for unused
variables and list concatenation are resolved.
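The two suggested edits look like this in isolation (all values below are illustrative, not the actual test config):

```python
GAP9_NE16_TILED_DEFAULT_CORES = 8                        # illustrative value
cmake_args = ["TOOLCHAIN=LLVM", "platform=GAP9_w_NE16"]  # illustrative base args

test_params = ("testRQConv", 64000, "pw_1x1")            # hypothetical (name, l1, config_name)
test_name, l1, _ = test_params                           # throwaway name for the unused entry
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
```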

In `@TargetLibraries/GAP9/inc/ne16_utils.h`:
- Around line 11-12: The public header ne16_utils.h currently pulls in
implementation-only headers; replace the includes of "CNN_BasicKernels_fp32.h"
and "pmsis.h" with the standard <stdint.h> so the public struct uses only
fixed-width integer types, and remove the unavailable dependency. Then add
`#include "pmsis.h"` to ne16_utils.c (where gap_coreid(), gap_ncore(), and the
v4s vector type are used) so the implementation functions there can access
those symbols.

---

Outside diff comments:
In `@Deeploy/Targets/GAP9/Platform.py`:
- Around line 8-54: The file containing the long import block (look for symbols
like DeQuantTilingReadyBindings, BasicAllocateTemplate, RemoveEmptyConvBiasPass
and PULPDMASliceBindings) is not formatted/organized per CI; run the import and
formatter hooks (isort and yapf) to reorder and format the imports/line breaks,
then stage and commit the resulting changes so pre-commit passes (this also
applies to the nearby import group that includes PULP* bindings and parsers).

---

Duplicate comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: Replace the wildcard self-import in Deeploy/Targets/NE16/__init__.py
with explicit exports: import the sibling module objects you want to expose
(e.g., the NE16 device class/name and the Tiler class) by name and populate
__all__ accordingly so consumers can access NE16 and Tiler directly;
specifically remove "from . import *", add explicit imports of the NE16 module
symbols and Tiler, and set __all__ = ["NE16", "Tiler", ...] to list the public
API.

In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor NE16Deployer.__init__ currently uses a
mutable default for the inputOffsets parameter (inputOffsets = {}), which can be
shared across instances; change the signature to use inputOffsets=None and then
when calling super().__init__ pass an empty dict when inputOffsets is None
(e.g., use inputOffsets or {}), ensuring NE16Deployer.__init__ does not leak
offsets between deployments.
- Around line 29-40: The code mutates the shared loweringOptimizer.passes list
causing duplicate/side-effect passes; fix by creating a new list copy and
assigning it back instead of modifying in place: make a local copy (e.g.,
new_passes = list(self.loweringOptimizer.passes)), perform the replacement of
PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes when
self.Platform.engines[0].enable3x3 is true, then set
self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(),
NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these
symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass,
ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer
to avoid mutating shared state.
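The mutable-default hazard flagged in the first Deployer.py comment above is a general Python pitfall, shown here with a toy function (not the real NE16Deployer signature):

```python
def leaky(offsets = {}):        # one dict object, shared across every call
    offsets["calls"] = offsets.get("calls", 0) + 1
    return offsets

def safe(offsets = None):       # fresh dict per call unless one is passed in
    offsets = offsets if offsets is not None else {}
    offsets["calls"] = offsets.get("calls", 0) + 1
    return offsets
```

leaky() returns {'calls': 1} on the first call and {'calls': 2} on the second; safe() returns {'calls': 1} every time.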

In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: The PW/DW classifiers are too permissive: update isPWConv
to also require node.attrs['group'] == 1 to reject grouped convs, and update
isDWConv to enforce depthwise conditions by checking the weight tensor on
node.inputs[1] has weights.shape[1] == 1 and that node.attrs['group'] equals
the output channel count (whichever of 'channels_out' / 'ch_im_out' the node
carries); locate these checks in the isPWConv and isDWConv functions and
add the group and weight-shape validations so canExecute no longer returns True
for convs that will fail downstream parsing/tiling.

In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-56: The parseNodeCtxt method in NE16's Parsers.py currently
defaults channels_first=True but immediately rejects any True value (using
"channels_first == False"); change the default of parseNodeCtxt to
channels_first=False and also update the three RQS override methods that default
channels_first=True to default to False, then simplify the conditional check
from "channels_first == False" to "not channels_first" (ensure you update all
occurrences in parseNodeCtxt and the three RQS override methods so callers using
defaults succeed).
- Around line 118-123: The parseNodeCtxt overrides unconditionally map
node.inputs into inputs = ['data_in','weight','mul','add'] causing IndexError
for a 5th tensor and leaving operatorRepresentation['shift'] unset; update
parseNodeCtxt (the methods that populate self.operatorRepresentation) to first
check len(node.inputs) and either assert/raise if it is not 4 or 5, and if a 5th
input exists map it to self.operatorRepresentation['shift'] =
ctxt.lookup(node.inputs[4].name).name, and additionally if 'shift' can come from
node.attrs check node.attrs.get('shift') and set operatorRepresentation['shift']
accordingly so ConvTemplate.py's ${shift} is always defined or an explicit error
is raised.

In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: The template emits an invalid assignment
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" inside the ne16_cfg_t
compound literal; update the ConvTemplate.py rendering so it produces a
designated initializer by replacing that emitted assignment with
".weights_stride = (ne16_stride_t){ .d0 = ${weightStrideD0}, .d1 =
${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e., remove the
"task.data.cfg." prefix and emit the field as a proper member initializer inside
the ne16_cfg_t literal), ensuring punctuation and comma placement matches other
fields in the ne16_cfg_t literal.
- Around line 145-151: The conf0 bit for bias presence is being set when
operatorRepresentation contains the key "add" even if its value is the sentinel
"NULL"; update the requantized branch in ConvTemplate.py so the third argument
to getNormQuantConf0 uses an explicit non-NULL check (e.g.
operatorRepresentation.get("add") != "NULL") instead of '"add" in
operatorRepresentation', mirroring the mul check (operatorRepresentation["mul"]
!= "NULL"); locate the operatorRepresentation usage around the mulBuff lookup
and change the boolean expression passed to getNormQuantConf0 accordingly so
bias-less convs do not set bit 25.

In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current code only pins weightOutChannelVar to its
Max(), allowing the other two packed weight dimensions to be shrunk; fix by
retrieving the remaining dimension variables for the weight buffer (the vars
corresponding to dims 1 and 2 of weightBuffer) and add equivalent constraints
using tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed
block (all dims of weightBuffer) is fixed to full size; use
weightBuffer/weightBufferName to locate the buffer and weightOutChannelVar as a
reference to the existing constraint when adding the other two constraints.

In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-56: The current pointwise tiler constraints incorrectly force
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids
strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where
tilerModel.addConstraint is called (the lines referencing
outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A)
add an assertion that strides == [1,1] when isPWConv can accept strides (check
NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred)
replace the identity constraints with the standard convolution output relation
used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1,
computing effective_input = input + pad_top + pad_bottom (use whatever padding
variables exist in scope) and apply that formula for both height and width via
tilerModel.addConstraint.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 88-102: The subtraction "values = values - weight_offset" risks
int8 overflow; cast the tensor to a wider signed dtype (e.g., int16 or int32)
before computing "weight_offset" subtraction so arithmetic is done in a safe
width, perform "values = values.astype(wider_dtype) - int(weight_offset)" then
(after any necessary range checks/clamping) cast to uint8 for encoding; update
the code paths around weightTensor.values, weight_offset and the subsequent
astype(uint8) to use the widened intermediate to avoid wraparound for int8
inputs.
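The wraparound can be reproduced with NumPy; the offset value here is illustrative:

```python
import numpy as np

values = np.array([-128, -1, 127], dtype = np.int8)
weight_offset = -128                                # illustrative NE16-style offset

wrapped = values - np.int8(weight_offset)           # int8 arithmetic wraps: 127 - (-128) -> -1
widened = values.astype(np.int16) - weight_offset   # safe: subtraction done in int16
encoded = widened.astype(np.uint8)                  # cast only after the range is known to fit
```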
- Around line 139-150: The function _findAllMultiplicands uses float division
and an exclusive sqrt bound; change the loop to iterate i from 2 up to and
including the integer square root (use math.isqrt(x) or int(math.sqrt(x)) + 1)
and perform integer division on tmpX (use //= instead of /=) so tmpX remains an
int for the modulo checks; also simplify the tail handling by appending tmpX if
tmpX > 1 (rather than recomputing x // math.prod(...)) so _findAllMultiplicands
reliably returns prime factors for inputs like 4 or 49 and avoids float-related
modulo failures.
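A corrected, integer-only factorization matching the comment above might look like this (the function name and signature are illustrative, not the private helper's):

```python
import math

def find_all_multiplicands(x: int) -> list:
    # Trial division with an inclusive isqrt bound and integer division only.
    factors = []
    tmp = x
    i = 2
    while i <= math.isqrt(tmp):
        while tmp % i == 0:
            factors.append(i)
            tmp //= i
        i += 1
    if tmp > 1:
        factors.append(tmp)     # leftover prime larger than sqrt of the original x
    return factors
```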

---

Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: Remove or use the three unused weight-dim variables
(weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in
NE16DenseConstraint; they are allocated but never referenced in
addGeometricalConstraint while only weightOutChannelVar is used. Either delete
those getTensorDimVar calls to clean up the code, or wire them into the
constraint logic inside addGeometricalConstraint (or the class constructor) —
e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the
pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling
decisions and weightInChannelMajorVar is used where input-channel-major indexing
matters.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cc86fa11-0cab-48bf-89f1-7688eb67747c

📥 Commits

Reviewing files that changed from the base of the PR and between 748707a and 11a641b.

📒 Files selected for processing (55)
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • .github/workflows/_runner-snitch-tiled-sequential.yml
  • .github/workflows/ci-platform-gap9-tiled.yml
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • .github/workflows/ci-platform-gap9.yml
  • CMakeLists.txt
  • Deeploy/Targets/GAP9/Bindings.py
  • Deeploy/Targets/GAP9/DMA/L3Dma.py
  • Deeploy/Targets/GAP9/Parsers.py
  • Deeploy/Targets/GAP9/Platform.py
  • Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
  • Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
  • Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py
  • Deeploy/Targets/GAP9/Tiler.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/Generic/Bindings.py
  • Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/Generic/TypeCheckers.py
  • Deeploy/Targets/NE16/Bindings.py
  • Deeploy/Targets/NE16/Deployer.py
  • Deeploy/Targets/NE16/Engine.py
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
  • Deeploy/Targets/NE16/OptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/Parsers.py
  • Deeploy/Targets/NE16/Platform.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py
  • Deeploy/Targets/NE16/Templates/__init__.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
  • Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
  • Deeploy/Targets/NE16/TileConstraints/__init__.py
  • Deeploy/Targets/NE16/Tiler.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
  • Deeploy/Targets/NE16/__init__.py
  • Deeploy/Targets/PULPOpen/Bindings.py
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • DeeployTest/CMakeLists.txt
  • DeeployTest/Platforms/GAP9/CMakeLists.txt
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
  • DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
  • DeeployTest/conftest.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testMVP.py
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/testUtils/platformMapping.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • DeeployTest/test_platforms.py
  • TargetLibraries/GAP9/CMakeLists.txt
  • TargetLibraries/GAP9/inc/ne16_utils.h
  • TargetLibraries/GAP9/src/ne16_utils.c
✅ Files skipped from review due to trivial changes (8)
  • Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
  • DeeployTest/testMVP.py
  • DeeployTest/Platforms/GAP9/CMakeLists.txt
  • Deeploy/Targets/NE16/Tiler.py
  • .github/workflows/_runner-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
  • DeeployTest/test_gap9_ne16_tiled_config.py
  • Deeploy/Targets/NE16/Bindings.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • DeeployTest/testUtils/core/execution.py
  • DeeployTest/conftest.py
  • DeeployTest/CMakeLists.txt
  • CMakeLists.txt
  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
  • Deeploy/Targets/NE16/Templates/AllocateTemplate.py
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
  • DeeployTest/testUtils/platformMapping.py
  • .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
  • Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py

Comment on lines +62 to +80
# Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
bias_flat = C.values.flatten()
if bias_flat.size > Ko:
C.values = bias_flat[:Ko].copy()

# Signed input bias compensation
if input_signed:
# Compute w_sum BEFORE packing (needed for signed bias compensation)
w_sum = w_int8.astype(np.int64).sum(axis = 1) # [Ko]
bias_values = C.values.flatten().astype(np.int64)
if 'mul' in operatorRepresentation:
# RequantizedGemm: bias -= 128 * w_sum * scale
scale_buf = ctxt.lookup(operatorRepresentation['mul'])
scale_values = scale_buf.values.flatten().astype(np.int64)
bias_values -= 128 * w_sum * scale_values
else:
# Gemm int32: bias -= 128 * w_sum (no scale)
bias_values -= 128 * w_sum
C.values = bias_values.astype(np.int32)

⚠️ Potential issue | 🟠 Major

Bias handling only covers size > Ko, not broadcast/scalar cases.

The if bias_flat.size > Ko branch handles a broadcast-expanded bias of shape [M, O], but when bias_flat.size < Ko (scalar bias, or per-batch bias of size 1 that wasn't pre-broadcast) C.values is left untouched. The signed-input branch below then calls C.values.flatten().astype(np.int64) and computes bias_values -= 128 * w_sum * scale_values with mismatched shapes, silently broadcasting the short bias over w_sum[:Ko] and writing back a non-Ko array that the NE16 kernel will later read as [Ko] (OOB/garbage).

Consider explicitly normalizing bias to [Ko] up front (broadcast-fill or np.resize) and asserting the resulting shape, e.g.:

🛡️ Proposed fix
-        # Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
-        bias_flat = C.values.flatten()
-        if bias_flat.size > Ko:
-            C.values = bias_flat[:Ko].copy()
+        # Normalize bias to per-channel [Ko] for NE16
+        bias_flat = C.values.flatten()
+        if bias_flat.size == Ko:
+            C.values = bias_flat.copy()
+        elif bias_flat.size > Ko and bias_flat.size % Ko == 0:
+            C.values = bias_flat[:Ko].copy()
+        elif bias_flat.size == 1:
+            C.values = np.full((Ko,), bias_flat[0], dtype = bias_flat.dtype)
+        else:
+            raise ValueError(f"NE16 GEMM bias size {bias_flat.size} incompatible with Ko={Ko}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` around lines 62 - 80, The
bias handling must normalize C.values to a 1-D array of length Ko before any
signed-input compensation; instead of only truncating when bias_flat.size > Ko,
handle bias_flat.size < Ko by expanding/broadcasting (e.g., repeat or np.resize)
to length Ko and assert the resulting shape equals Ko, ensure dtype is
appropriate (int32/int64 as needed), then perform the signed-input compensation
using w_sum, scale_values (from operatorRepresentation['mul']) or the int32
path, and finally assign the corrected Ko-length array back to C.values; update
references around C.values, bias_flat, Ko, w_sum and operatorRepresentation so
no silent broadcasting or out-of-bounds write occurs.
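The normalization the fix proposes can be factored into a small helper; this is a sketch of the policy, not the template's actual code:

```python
import numpy as np

def normalize_bias(bias: np.ndarray, Ko: int) -> np.ndarray:
    # Normalize a bias constant to a 1-D per-channel array of length Ko.
    flat = bias.flatten()
    if flat.size == Ko:
        out = flat.copy()
    elif flat.size > Ko and flat.size % Ko == 0:
        out = flat[:Ko].copy()                             # broadcast-expanded [M, O]
    elif flat.size == 1:
        out = np.full((Ko,), flat[0], dtype = flat.dtype)  # scalar bias
    else:
        raise ValueError(f"bias size {flat.size} incompatible with Ko={Ko}")
    assert out.shape == (Ko,)
    return out
```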

Comment on lines +116 to +123
transA = operatorRepresentation['transA']
transB = operatorRepresentation['transB']

buffA = ctxt.lookup(operatorRepresentation['A'])
buffB = ctxt.lookup(operatorRepresentation['B'])

NSize = buffA.shape[-1]


⚠️ Potential issue | 🟠 Major

Make NSize transpose-aware.

Line 122 always uses buffA.shape[-1], but for transA == 1 the reduction dimension is buffA.shape[-2]. This will serialize wrong A/B tile shapes for transposed GEMM.

Proposed fix
-        NSize = buffA.shape[-1]
+        NSize = buffA.shape[-1 - transA]

Based on learnings, GEMM tiling serialization must use transpose flags to adjust NSize and matrix offset/shape calculations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
116 - 123, The code always sets NSize = buffA.shape[-1] which ignores
transposition; update the calculation to choose the reduction dimension based on
operatorRepresentation['transA'] (and consider transB if used elsewhere):
compute NSize = buffA.shape[-2] when transA == 1 else buffA.shape[-1], and
adjust any subsequent uses that compute A/B tile offsets or shapes so they use
this transpose-aware NSize (refer to variables transA, transB, buffA, buffB, and
NSize in NE16GEMMTileConstraint.py).
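The indexing trick behind the proposed one-liner: A is (M, N) for transA == 0 and (N, M) for transA == 1, so the reduction axis is shape[-1 - transA]. A minimal check:

```python
def reduction_dim(shape_a, transA):
    # A is (M, N) when transA == 0, (N, M) when transA == 1.
    return shape_a[-1 - transA]
```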

Comment on lines +131 to +167
        for cube in outputCubes:
            MOffset, OOffset = cube.offset[-2:]
            MSize, OSize = cube.dims[-2:]

            if len(cube.offset) > 2:
                BatchSize = math.prod(cube.dims[:-2])
            else:
                BatchSize = 1

            replacements["M"].append(MSize)
            replacements["O"].append(OSize)
            replacements["batch"].append(BatchSize)

            if transA == 0:
                AMatrixOffsets = (MOffset, 0)
                AMatrixShape = (MSize, NSize)
            else:
                AMatrixOffsets = (0, MOffset)
                AMatrixShape = (NSize, MSize)

            if len(buffA.shape) > 2:
                batchDimCount = len(buffA.shape) - 2
                AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
                AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape

            inputACubes.append(HyperRectangle(AMatrixOffsets, AMatrixShape))

            if transB == 0:
                BMatrixOffsets = (0, OOffset)
                BMatrixShape = (NSize, OSize)
            else:
                BMatrixOffsets = (OOffset, 0)
                BMatrixShape = (OSize, NSize)

            inputBCubes.append(HyperRectangle(BMatrixOffsets, BMatrixShape))

            RequantCube = HyperRectangle((OOffset,), (OSize,))

⚠️ Potential issue | 🟠 Major

Use absolute output offsets when deriving input tiles.

The loop uses cube.offset, but cube is only absoluteCube.rectangle; this drops AbsoluteHyperRectangle.absoluteOffset. Non-zero output tiles can then reload A/B/C/mul/scale slices from the wrong global offsets.

Proposed fix
-        for cube in outputCubes:
-            MOffset, OOffset = cube.offset[-2:]
+        for absoluteCube in absoluteOutputCubes:
+            cube = absoluteCube.rectangle
+            MOffset, OOffset = absoluteCube.absoluteOffset[-2:]
             MSize, OSize = cube.dims[-2:]
@@
             if len(buffA.shape) > 2:
                 batchDimCount = len(buffA.shape) - 2
-                AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
+                AMatrixOffsets = tuple(absoluteCube.absoluteOffset[:-2][-batchDimCount:]) + AMatrixOffsets
                 AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
131 - 167, The loop over outputCubes is using cube.offset (the local rectangle)
which ignores AbsoluteHyperRectangle.absoluteOffset, causing input tile offsets
to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).
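The local-vs-absolute offset mix-up is easy to reproduce with a toy example. `Rect` and `AbsRect` below are minimal stand-ins for Deeploy's `HyperRectangle` and `AbsoluteHyperRectangle`, not the real classes:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Rect:                          # stand-in for HyperRectangle
    offset: Tuple[int, ...]          # offset local to the current tiling level
    dims: Tuple[int, ...]

@dataclass
class AbsRect:                       # stand-in for AbsoluteHyperRectangle
    rectangle: Rect
    absoluteOffset: Tuple[int, ...]  # offset in the full output tensor

# Second M-tile of an (8, 16) output split into two (4, 16) tiles: locally
# it starts at row 0 of its parent slice, globally at row 4.
tile = AbsRect(Rect(offset = (0, 0), dims = (4, 16)), absoluteOffset = (4, 0))

MOffset_local = tile.rectangle.offset[-2]
MOffset_global = tile.absoluteOffset[-2]
assert MOffset_local == 0    # deriving A from this reloads rows 0..3 twice
assert MOffset_global == 4   # correct origin for the second tile's A slice
```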

Comment on lines +80 to +106
    if node.op == 'RequantizedGemm' and len(node.inputs) >= 4:
        mulTensor = node.inputs[3]
        biasTensor = node.inputs[2]

        if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
            mul_values = mulTensor.values.flatten().astype(np.int32)
            log2D = int(np.log2(node.attrs['div'].values))

            # Broadcast scalar mul to per-channel if needed
            if len(mul_values) == 1:
                mul_values = np.full(Ko, mul_values[0], dtype = np.int32)

            ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)

            # Rescale bias from mul/log2D domain to scale/scale_n domain
            # bias_merged is already *= mul from PULPGEMMRequantMergePass
            # NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
            bias_values = biasTensor.values.flatten().astype(np.int64)
            ne16_bias = np.zeros(Ko, dtype = np.int64)
            for ko in range(Ko):
                shift_diff = int(ne16_scale_n[ko]) - log2D
                if shift_diff >= 0:
                    ne16_bias[ko] = bias_values[ko] << shift_diff
                else:
                    ne16_bias[ko] = bias_values[ko] >> (-shift_diff)

            ne16_bias = ne16_bias.astype(np.int32)

⚠️ Potential issue | 🟠 Major

Validate requant constants before mutating the GEMM node.

This block can crash on a missing or non-scalar div attribute, a mul length other than 1 or Ko, or a bias shorter than Ko. Since the weight may already have been transposed above, a failure here leaves the graph partially rewritten.

Proposed guard
         if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
             mul_values = mulTensor.values.flatten().astype(np.int32)
-            log2D = int(np.log2(node.attrs['div'].values))
+            div_attr = node.attrs.get('div')
+            div_values = getattr(div_attr, "values", div_attr)
+            if div_values is None or np.size(div_values) != 1:
+                return graph
+            log2D = int(np.log2(np.asarray(div_values).item()))
 
             # Broadcast scalar mul to per-channel if needed
             if len(mul_values) == 1:
                 mul_values = np.full(Ko, mul_values[0], dtype = np.int32)
+            elif len(mul_values) != Ko:
+                return graph
 
             ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)
 
             # Rescale bias from mul/log2D domain to scale/scale_n domain
             # bias_merged is already *= mul from PULPGEMMRequantMergePass
             # NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
             bias_values = biasTensor.values.flatten().astype(np.int64)
+            if len(bias_values) == 1:
+                bias_values = np.full(Ko, bias_values[0], dtype = np.int64)
+            elif len(bias_values) < Ko:
+                return graph
+            else:
+                bias_values = bias_values[:Ko]
             ne16_bias = np.zeros(Ko, dtype = np.int64)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py` around lines 80 -
106, Ensure RequantizedGemm handling validates all requant constants before
mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.
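The bias rescale in the block above can be checked in isolation. The sketch below re-implements the per-channel shift (it is not Deeploy's actual helper) with made-up values:

```python
import numpy as np

# bias_ne16 = bias_merged * 2**(scale_n - log2D), realized as per-channel
# arithmetic shifts exactly as in the pass above (values are illustrative).
def rescale_bias(bias_values: np.ndarray, scale_n: np.ndarray, log2D: int) -> np.ndarray:
    bias_values = bias_values.astype(np.int64)
    out = np.zeros_like(bias_values)
    for ko in range(len(bias_values)):
        shift_diff = int(scale_n[ko]) - log2D
        if shift_diff >= 0:
            out[ko] = bias_values[ko] << shift_diff
        else:
            out[ko] = bias_values[ko] >> (-shift_diff)   # arithmetic shift
    return out.astype(np.int32)

bias = np.array([100, -64], dtype = np.int64)
scale_n = np.array([5, 1])
# Channel 0: 100 << (5 - 3) = 400; channel 1: -64 >> (3 - 1) = -16.
assert list(rescale_bias(bias, scale_n, log2D = 3)) == [400, -16]
```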

Comment on lines +1200 to +1221
    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()

⚠️ Potential issue | 🟠 Major

Guard the identity bypass against output and dtype changes.

The fast path only checks the ratio and zero points before removing both nodes. If the Quant output is a graph output, this leaves the declared output tensor without a producer. It also skips narrowing/clamping when the Dequant and Quant bit widths differ, e.g. an int16 tensor dequantized and then re-quantized to int8 with identity scale.

🐛 Proposed guard
     signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
     signed = bool(signed_val)
     bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
+    dequant_signed_val = int(np.array(dequant_node.attrs.get('signed', signed_val)).item())
+    dequant_signed = bool(dequant_signed_val)
+    dequant_bit_width = int(np.array(dequant_node.attrs.get('bit_width', bit_width)).item())
     n_levels = 2**bit_width
@@
-    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
+    if (abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and
+            signed == dequant_signed and bit_width == dequant_bit_width):
         input_tensor = dequant_node.inputs[0]
         output_tensor = quant_node.outputs[0]
+        if output_tensor in graph.outputs:
+            return graph
         for downstream_node in list(output_tensor.outputs):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()

    signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
    signed = bool(signed_val)
    bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
    dequant_signed_val = int(np.array(dequant_node.attrs.get('signed', signed_val)).item())
    dequant_signed = bool(dequant_signed_val)
    dequant_bit_width = int(np.array(dequant_node.attrs.get('bit_width', bit_width)).item())
    n_levels = 2**bit_width

    # Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
    ratio = s_d * s_q

    # Identity case: ratio ~= 1.0, both zero points == 0
    EPSILON = 1e-6
    if (abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and
            signed == dequant_signed and bit_width == dequant_bit_width):
        input_tensor = dequant_node.inputs[0]
        output_tensor = quant_node.outputs[0]
        if output_tensor in graph.outputs:
            return graph
        for downstream_node in list(output_tensor.outputs):
            for i, inp in enumerate(downstream_node.inputs):
                if inp == output_tensor:
                    downstream_node.inputs[i] = input_tensor
        dequant_node.inputs.clear()
        dequant_node.outputs.clear()
        quant_node.inputs.clear()
        quant_node.outputs.clear()
        graph.cleanup().toposort()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py` around lines
1200 - 1221, The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
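The guard conditions reduce to a single predicate that can be tested without the graph machinery. `can_fold` below is a hedged sketch with illustrative names, not Deeploy's API:

```python
# Only fold Dequant -> Quant when the round trip is exact AND
# type-preserving AND the Quant output is not a declared graph output.
def can_fold(s_d, s_q, zp_d, zp_q,
             dq_bits, q_bits, dq_signed, q_signed,
             quant_out_is_graph_output, eps = 1e-6):
    return (abs(s_d * s_q - 1.0) < eps
            and abs(zp_d) < eps and abs(zp_q) < eps
            and dq_bits == q_bits and dq_signed == q_signed
            and not quant_out_is_graph_output)

# int8 -> float -> int8 round trip with s_d * s_q == 1: safe to fold.
assert can_fold(0.5, 2.0, 0, 0, 8, 8, True, True, False)
# int16 -> int8 narrowing with identity scale: the Quant clamp must stay.
assert not can_fold(0.5, 2.0, 0, 0, 16, 8, True, True, False)
# Quant output is a declared graph output: folding would orphan it.
assert not can_fold(1.0, 1.0, 0, 0, 8, 8, True, True, True)
```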

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

fd -t f . Deeploy/Targets/NE16/Templates/ --max-depth 2

Repository: pulp-platform/Deeploy

Length of output: 205


🏁 Script executed:

ls -la Deeploy/Targets/NE16/Templates/

Repository: pulp-platform/Deeploy

Length of output: 376


🏁 Script executed:

cat Deeploy/Targets/NE16/Templates/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 189


🏁 Script executed:

cat -n Deeploy/Targets/NE16/Templates/AllocateTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 924


🏁 Script executed:

head -50 Deeploy/Targets/NE16/Templates/ConvTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 1540


🏁 Script executed:

rg '__all__' Deeploy/Targets/NE16/Templates/

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from Deeploy\.Targets\.NE16\.Templates import' --type py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from \.\.Templates import' Deeploy/Targets/NE16 --type py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

rg 'from.*AllocateTemplate|from.*ConvTemplate|import.*AllocateTemplate|import.*ConvTemplate' --type py

Repository: pulp-platform/Deeploy

Length of output: 3880


🏁 Script executed:

rg 'ne16GenericGlobalInitTemplate|NE16ConvTemplate' --type py

Repository: pulp-platform/Deeploy

Length of output: 531


Export the actual template modules.

This self-referential wildcard import does not load sibling files like the NE16 allocation/conv templates, so package-level imports will not see those symbols and Ruff reports F403.

Proposed direction
-from . import *
+from .AllocateTemplate import *
+from .ConvTemplate import *
📝 Committable suggestion


Suggested change
from . import *
from .AllocateTemplate import *
from .ConvTemplate import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/Templates/__init__.py` at line 5, The package
__init__.py currently uses a self-referential wildcard import ("from . import
*") which doesn't actually import sibling modules (e.g., the NE16 allocation and
conv template modules) and triggers Ruff F403; replace the wildcard with
explicit imports of the template modules (or explicit from .module import
<symbols>) and/or populate __all__ with the actual module or symbol names (for
example add explicit references to the NE16 allocation and conv modules) so
package-level imports expose the intended templates. Ensure you reference and
import the specific module names rather than relying on "from . import *".
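The failure mode is reproducible without the repository: a self-referential `from . import *` copies only the package module's own public attributes and never loads siblings. The package and module names below are fakes registered in `sys.modules` purely for illustration:

```python
import sys
import types

# Register a fake package and a sibling module (names are illustrative).
pkg = types.ModuleType("fakepkg")
pkg.__path__ = []                                  # mark it as a package
sys.modules["fakepkg"] = pkg

conv = types.ModuleType("fakepkg.ConvTemplate")
conv.NE16ConvTemplate = type("NE16ConvTemplate", (), {})
sys.modules["fakepkg.ConvTemplate"] = conv

ns = {}
exec("from fakepkg import *", ns)                  # self-referential wildcard
assert "NE16ConvTemplate" not in ns                # sibling was never loaded

exec("from fakepkg.ConvTemplate import *", ns)     # explicit sibling import
assert "NE16ConvTemplate" in ns                    # now the class is visible
```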

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

#!/bin/bash
# List files in the TileConstraints directory
find Deeploy/Targets/NE16/TileConstraints -type f -name "*.py" | head -20

Repository: pulp-platform/Deeploy

Length of output: 356


🏁 Script executed:

#!/bin/bash
# Check the content of the __init__.py file and surrounding files
ls -la Deeploy/Targets/NE16/TileConstraints/

Repository: pulp-platform/Deeploy

Length of output: 535


🏁 Script executed:

#!/bin/bash
# Check what's exported from the constraint modules
cat -n Deeploy/Targets/NE16/TileConstraints/*.py | head -100

Repository: pulp-platform/Deeploy

Length of output: 6259


🏁 Script executed:

#!/bin/bash
# View the actual __init__.py file content
cat -n Deeploy/Targets/NE16/TileConstraints/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

#!/bin/bash
# Check if constraint modules define __all__
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '__all__' "$file" || echo "No __all__ found"
done

Repository: pulp-platform/Deeploy

Length of output: 327


🏁 Script executed:

#!/bin/bash
# Check what's actually exported from constraint modules using grep for class definitions
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '^class ' "$file"
done

Repository: pulp-platform/Deeploy

Length of output: 651


Replace the self-import with explicit constraint exports.

from . import * imports from the current package namespace, not from sibling modules. Since the constraint modules are never imported, their classes are not exposed. Import the concrete constraint modules instead.

Proposed fix
-from . import *
+from .NE16DenseConstraint import *
+from .NE16DepthwiseConstraint import *
+from .NE16PointwiseConstraint import *
📝 Committable suggestion


Suggested change
from . import *
from .NE16DenseConstraint import *
from .NE16DepthwiseConstraint import *
from .NE16PointwiseConstraint import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the
package wildcard import in __init__.py (currently "from . import *") with
explicit relative imports of the concrete constraint modules and/or classes so
sibling modules are actually loaded and their classes are exposed; e.g., use
"from .constraint_module import ConstraintClass" for each constraint module, and
populate an __all__ list with the exported class names (or import the modules
with "from . import constraint_module" if you prefer module exports) so the
concrete constraint classes are available to consumers.

#
# SPDX-License-Identifier: Apache-2.0

from . import *

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

Repository: pulp-platform/Deeploy

Length of output: 391


🏁 Script executed:

#!/bin/bash
# List files in TopologyOptimizationPasses directory
ls -la Deeploy/Targets/NE16/TopologyOptimizationPasses/

Repository: pulp-platform/Deeploy

Length of output: 300


🏁 Script executed:

#!/bin/bash
# Check if Passes.py exists and show its content
if [ -f "Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py" ]; then
  wc -l Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
  head -50 Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
fi

Repository: pulp-platform/Deeploy

Length of output: 1989


🏁 Script executed:

#!/bin/bash
# Check __init__.py for __all__ definition
cat -n Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py

Repository: pulp-platform/Deeploy

Length of output: 224


🏁 Script executed:

#!/bin/bash
# Check for __all__ definition in Passes.py
rg -n '__all__' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Find class definitions in Passes.py to identify the pass classes
rg -n '^class ' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py

Repository: pulp-platform/Deeploy

Length of output: 327


Import Passes explicitly.

from . import * does not re-export the pass classes from Passes.py; consumers importing from Deeploy.Targets.NE16.TopologyOptimizationPasses cannot access the four pass classes (NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass, ConvEngineDiscolorationPass, NE16OptimizationPass) without an explicit import.

Proposed fix
-from . import *
+from .Passes import *
📝 Committable suggestion


Suggested change
from . import *
from .Passes import *
🧰 Tools
🪛 Ruff (0.15.10)

[error] 5-5: from . import * used; unable to detect undefined names

(F403)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py` at line 5, The
package __init__.py currently uses a wildcard import which doesn't re-export
classes from Passes.py; update
Deeploy.Targets.NE16.TopologyOptimizationPasses.__init__ to explicitly import
and re-export the four pass classes by adding an explicit import from .Passes of
NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass,
ConvEngineDiscolorationPass, and NE16OptimizationPass and ensure they are
exposed (e.g., in __all__) so consumers can import them directly from the
package.

Comment on lines +1008 to +1010
test_name, l1, config_name = test_params

ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python -m ruff check DeeployTest/test_platforms.py --select RUF059,RUF005

Repository: pulp-platform/Deeploy

Length of output: 25561


Clean up the Ruff warnings in the three new GAP9_w_NE16 tiled tests.

The unpacked config_name variable is never used, and Ruff prefers iterable unpacking over list concatenation. Apply these changes at lines 1008-1010, 1043-1045, and 1076-1078:

Proposed fix
-    test_name, l1, config_name = test_params
+    test_name, l1, _config_name = test_params
 
-    ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
+    ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
📝 Committable suggestion


Suggested change
test_name, l1, config_name = test_params
ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
test_name, l1, _config_name = test_params
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 1008-1008: Unpacked variable config_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


[warning] 1010-1010: Consider iterable unpacking instead of concatenation

Replace with iterable unpacking

(RUF005)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DeeployTest/test_platforms.py` around lines 1008 - 1010, Unpack test_params
using a throwaway name for the unused config_name (e.g., test_name, l1, _ =
test_params) and build the cmake args using iterable unpacking instead of list
concatenation (e.g., ne16_cmake_args = [*cmake_args,
f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]); apply the same two edits for the
other two occurrences that create ne16_cmake_args in the GAP9_w_NE16 tiled tests
so Ruff warnings for unused variables and list concatenation are resolved.

Comment on lines +11 to +12
#include "CNN_BasicKernels_fp32.h"
#include "pmsis.h"

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether the missing header exists and where ne16_utils.h is included.
fd -a 'CNN_BasicKernels_fp32\.h$' .
rg -n '#include\s+"CNN_BasicKernels_fp32\.h"|#include\s+"ne16_utils\.h"'

Repository: pulp-platform/Deeploy

Length of output: 204


🏁 Script executed:

#!/bin/bash
# Read the header file to understand what's actually needed
echo "=== ne16_utils.h content ===" 
cat -n TargetLibraries/GAP9/inc/ne16_utils.h

echo ""
echo "=== ne16_utils.c content ===" 
cat -n TargetLibraries/GAP9/src/ne16_utils.c

Repository: pulp-platform/Deeploy

Length of output: 1945


Move implementation dependencies out of the public header.

ne16_utils.h only needs fixed-width integer types for its public struct, but it includes CNN_BasicKernels_fp32.h (which does not exist in the repository) and pmsis.h (which is only used in the implementation). Unavailable and implementation-only includes in a public header break every file that includes it.

Replace the includes in the header with <stdint.h>, and add #include "pmsis.h" to ne16_utils.c where it is actually needed (for gap_coreid(), gap_ncore(), and v4s vector types).

Proposed fix

In TargetLibraries/GAP9/inc/ne16_utils.h:

-#include "CNN_BasicKernels_fp32.h"
-#include "pmsis.h"
+#include <stdint.h>

In TargetLibraries/GAP9/src/ne16_utils.c:

 #include "ne16_utils.h"
+#include "pmsis.h"
📝 Committable suggestion


Suggested change
#include "CNN_BasicKernels_fp32.h"
#include "pmsis.h"
#include <stdint.h>
🧰 Tools
🪛 Clang (14.0.6)

[error] 11-11: 'CNN_BasicKernels_fp32.h' file not found

(clang-diagnostic-error)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@TargetLibraries/GAP9/inc/ne16_utils.h` around lines 11 - 12, The public
header ne16_utils.h currently pulls in implementation-only headers; replace the
includes of "CNN_BasicKernels_fp32.h" and "pmsis.h" with the standard <stdint.h>
so the public struct uses only fixed-width integer types, and remove the
unavailable dependency. Then add #include "pmsis.h" into ne16_utils.c (where
gap_coreid(), gap_ncore(), and the v4s vector type are used) so implementation
functions in ne16_utils.c can access those symbols.

The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9
SDK autotiler CNN_Libraries_HWPE directory), but this path was missing
from the cmake include directories, causing build failures on plain GAP9.

KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK but
the template was using (v4s){0,0,0,0} (signed), causing a compilation
error on GCC with -Werror.
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py (1)

41-41: Match the NodeTemplate.alignToContext return contract.

The override returns operatorRepresentation, so the annotation should use OperatorRepresentation instead of Dict.

♻️ Proposed annotation cleanup
-from typing import Dict, List, Tuple
+from typing import List, Tuple
@@
     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation
+                       ) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` at line 41, The return
type annotation for NodeTemplate.alignToContext is incorrect: the method
currently declares Tuple[NetworkContext, Dict, List[str]] but the override
actually returns operatorRepresentation, so update the second generic from Dict
to OperatorRepresentation; modify the signature of alignToContext to return
Tuple[NetworkContext, OperatorRepresentation, List[str]] and ensure any related
type imports/usages reference OperatorRepresentation (and update any docstrings
or comments mentioning the old Dict type).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Line 60: The unpacking assigns Ko, Ki = w_int8.shape but Ki is never used,
causing a Ruff unused-variable warning; change the unpack to use a dummy name
(e.g., Ko, _ = w_int8.shape or Ko, _Ki = w_int8.shape) so the intent is clear
and the linter is satisfied, leaving any other code that references Ko
unchanged.
- Around line 93-102: The in-place int8->uint8 offsetting mutates the input
buffer `${A}` via ne16_int8_to_uint8 and corrupts aliased or reused inputs;
instead allocate a hoisted transient uint8 buffer sized `${batch} * ${M} *
${N}`, populate a ne16_int8_to_uint8_T that sets .In = (int8_t *)${A} and .Out =
(uint8_t *)transient_buf, call pi_cl_team_fork(NUM_CORES, (void
*)ne16_int8_to_uint8, &that_arg), and use the transient buffer for downstream
NE16 calls; alternatively, if you must keep the in-place path, add a clear
comment and enforce single-consumer/non-aliased binding for `${A}` so callers
know mutation will occur.
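The aliasing hazard behind the second comment is easy to demonstrate in NumPy. This is a sketch only; the real kernel performs the offset in C on an L1 buffer:

```python
import numpy as np

# Why the in-place int8 -> uint8 offset is unsafe for aliased inputs: the
# +128 offset (a modulo-256 wrap, i.e. x ^ 0x80) mutates the shared buffer,
# so any second consumer of A sees corrupted values.
A = np.array([-128, -1, 0, 127], dtype = np.int8)
aliased_view = A                       # e.g. another node reading the same tensor

u = A.view(np.uint8)
u += 128                               # in-place conversion, wraps modulo 256
assert list(u) == [0, 127, 128, 255]   # correct unsigned operands for NE16

# ...but the aliased int8 input has been destroyed:
assert list(aliased_view) == [0, 127, -128, -1]
```

Writing the converted values into a separate transient buffer, as the comment suggests, avoids the mutation entirely at the cost of one extra L1 allocation.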

---

Nitpick comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Line 41: The return type annotation for NodeTemplate.alignToContext is
incorrect: the method currently declares Tuple[NetworkContext, Dict, List[str]]
but the override actually returns operatorRepresentation, so update the second
generic from Dict to OperatorRepresentation; modify the signature of
alignToContext to return Tuple[NetworkContext, OperatorRepresentation,
List[str]] and ensure any related type imports/usages reference
OperatorRepresentation (and update any docstrings or comments mentioning the old
Dict type).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ba38bca8-0085-4560-bbc7-d05070820224

📥 Commits

Reviewing files that changed from the base of the PR and between 11a641b and 6a3793a.

📒 Files selected for processing (2)
  • Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
  • TargetLibraries/GAP9/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • TargetLibraries/GAP9/CMakeLists.txt


        # Weight packing and signed bias compensation
        w_int8 = B.values.astype(np.int8)  # [Ko, Ki], still int8 at this point
        Ko, Ki = w_int8.shape

⚠️ Potential issue | 🟡 Minor

Silence the Ruff unused-variable warning.

Ki is unpacked but never used; use a dummy name.

🧹 Proposed lint fix
-        Ko, Ki = w_int8.shape
+        Ko, _ = w_int8.shape
🧰 Tools
🪛 Ruff (0.15.10)

[warning] 60-60: Unpacked variable Ki is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` at line 60, the
unpacking assigns Ko, Ki = w_int8.shape but Ki is never used, causing a Ruff
unused-variable warning; change the unpack to use a dummy name (e.g., Ko, _ =
w_int8.shape or Ko, _Ki = w_int8.shape) so the intent is clear and the linter is
satisfied, leaving any other code that references Ko unchanged.

Comment on lines +93 to +102
% if input_signed:
// Signed input: add 128 offset to convert int8 -> uint8 (multi-core SIMD)
{
ne16_int8_to_uint8_T _offset_arg = {
.In = (int8_t *)${A},
.Out = (uint8_t *)${A},
.size = ${batch} * ${M} * ${N}
};
pi_cl_team_fork(NUM_CORES, (void *)ne16_int8_to_uint8, &_offset_arg);
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against pulp-platform/Deeploy:

  cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | head -200
  cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | tail -50

  # Search for references to NE16GEMMTemplate or similar buffer management patterns
  rg "NE16GEMM|buffer.*signed|input_signed" --max-count=20

  cat -n Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py

  # Check PULPOpen ConvTemplate for signed input handling
  rg -A 15 "if input_signed:" Deeploy/Targets/PULPOpen/Templates/ConvTemplate.py | head -100

  # Search for ne16_int8_to_uint8 references to understand the conversion pattern
  rg "ne16_int8_to_uint8" --max-count=20

  cat -n TargetLibraries/GAP9/src/ne16_utils.c


Avoid mutating ${A} during signed-input offsetting.

The in-place conversion at lines 93–102 and 168–177 corrupts ${A} permanently for any downstream consumer, skip connection, graph input reuse, or aliased buffer. The ne16_int8_to_uint8 utility mutates the buffer in-place, and no tile constraint or binding enforces single-consumer or non-aliased input guarantees. Use a hoisted transient uint8 input buffer for the offset path, or explicitly document and enforce that this binding only accepts single-consumer, non-aliased activations.

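The aliasing hazard the reviewer describes can be shown in a few lines of numpy standing in for the C buffers; `x`, `alias`, and `transient` are illustrative names, not Deeploy symbols:

```python
import numpy as np

x = np.array([-3, 0, 5], dtype=np.int8)   # shared activation buffer
alias = x                                 # a second consumer of the same memory

# In-place offsetting (what the template does today): the alias sees the mutation
inplace = x.view(np.uint8)
inplace += 128                            # int8 -> uint8, wrapping modulo 256
assert alias[0] != -3                     # original int8 values are gone for everyone

# Transient-buffer variant: the source stays intact for other consumers
y = np.array([-3, 0, 5], dtype=np.int8)
transient = (y.astype(np.int16) + 128).astype(np.uint8)
assert y[0] == -3 and transient[0] == 125
```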

Runs each NE16 conv kernel with --profileTiling after the normal test
suite to collect cycle counts from gvsoc.

profileTiling generates code calling getCycles() in Network.c, but the
header declaring it was not included. Add CycleCounter.h to both GAP9
and NE16 platform include lists, and expose the GAP9 inc/ directory to
the network target so the header is found at compile time.

GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The
profiling step needs a clean rebuild with LTO disabled to avoid
conflicts with the cached LTO-enabled build from the test step.

The --enable-3x3 flag was parsed by the runner script but never passed
to generateNetwork.py, so NE16Engine.enable3x3 was always False. DW 3x3
and Dense 3x3 convolutions silently fell back to the PULP cluster
instead of dispatching to NE16. Add the flag and set it on the engine.

The deeployRunner parsed --enable-3x3 but never forwarded it to
generateNetwork.py's gen_args, so NE16Engine.enable3x3 stayed False
and DW/Dense 3x3 convs silently fell back to the cluster.

Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly
benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8)
is too small — NE16 dispatch overhead dominates at only 12.8%
utilization. Also wire --enable-3x3 through deeployRunner gen_args.
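The 75M-ops figure checks out if 64×64×32×32 is read as Cin×Cout×H×W with a 3×3 kernel and 2 ops per MAC (the dimension order is an assumption here):

```python
# Sanity check of the "75M ops" figure for the Dense 3x3 RQ Conv test case.
cin, cout, h, w, k = 64, 64, 32, 32, 3
ops = 2 * cin * cout * h * w * k * k   # 2 ops (multiply + add) per MAC
print(ops)                             # 75_497_472, i.e. ~75M
```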
