[NE16] Add GAP9_w_NE16 platform: NE16 accelerator Engine on GAP9 #183
runwangdl wants to merge 14 commits into pulp-platform:devel
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings. Use the following commands to manage reviews:
📝 Walkthrough

Adds NE16 accelerator support and a new platform variant.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Graph as gs.Graph
    participant Parser as NE16 Parser
    participant Optimizer as TopologyOptimizer / Passes
    participant Tiler as Tiler / TileConstraints
    participant Templates as Conv/GEMM Templates
    participant Deployer as NE16Deployer / Engine
    participant Build as CMake / pulp-nnx
    Graph->>Parser: parse nodes (Conv/Gemm) with NE16 constraints & RQ fields
    Parser->>Optimizer: emit matches for NE16 topology passes (weight/layout, reshape)
    Optimizer->>Templates: request weight encoding and template alignment
    Templates->>Tiler: provide tiling metadata (counters, strides, conf0)
    Tiler->>Deployer: produce VariableReplacementScheme & TilingSchedule
    Deployer->>Build: integrate generated sources, link pulp-nnx, produce image
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 11
🧹 Nitpick comments (4)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)
5-5: Replace self-referential wildcard import with explicit exports.
`from . import *` on line 5 makes exported symbols unclear. Replace with explicit re-exports plus `__all__` to clarify the public API.

Proposed fix
```diff
-from . import *
+from .NE16DenseConstraint import NE16DenseConv2DTileConstraint, NE16RQSDenseConv2DTileConstraint
+from .NE16DepthwiseConstraint import NE16DWConv2DTileConstraint, NE16RQSDWConv2DTileConstraint
+from .NE16PointwiseConstraint import NE16PWConv2DTileConstraint, NE16RQSPWConv2DTileConstraint
+
+__all__ = [
+    "NE16DenseConv2DTileConstraint",
+    "NE16RQSDenseConv2DTileConstraint",
+    "NE16DWConv2DTileConstraint",
+    "NE16RQSDWConv2DTileConstraint",
+    "NE16PWConv2DTileConstraint",
+    "NE16RQSPWConv2DTileConstraint",
+]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the self-referential wildcard import in __init__.py with explicit re-exports: remove "from . import *" and import the specific public symbols (e.g., classes, functions, constants) from the package submodules using explicit imports (for example: from .submodule import ClassX, func_y), then declare __all__ = ["ClassX", "func_y", ...] listing those exported names; ensure you reference the actual symbol names defined in the package (replace placeholders with the real identifiers) and avoid introducing circular imports when reorganizing the imports.

Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)
161-162: Prefix unused unpacked variables with underscore.

The `BatchOffset`, `HOffset`, `WOffset`, and `BatchSize` variables are unpacked for structural clarity but never used. Prefix them with `_` to indicate intentional disuse.

🔧 Proposed fix
```diff
 for cube in outputCubes:
-    (BatchOffset, HOffset, WOffset, COffset) = cube.offset
-    (BatchSize, HSize, WSize, CSize) = cube.dims
+    (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+    (_BatchSize, HSize, WSize, CSize) = cube.dims
```

Similarly at line 173:
```diff
-inBSize, inHSize, inWSize, inCSize = InCube.dims
+_inBSize, _inHSize, inWSize, inCSize = InCube.dims
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161 - 162, The tuple unpackings in NE16DenseConstraint.py currently bind unused names BatchOffset, HOffset, WOffset, COffset and BatchSize (and similarly at the later unpacking around line 173); update those unpackings to prefix unused variables with an underscore (e.g., _BatchOffset, _HOffset, _WOffset, _BatchSize) so the intent of intentional non-use is clear; locate the unpacking expressions that assign to cube.offset and cube.dims and rename only the unused identifiers accordingly while keeping used names (COffset, HSize, WSize, CSize) unchanged.

Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)
125-136: Use integer division to avoid floating-point precision issues.

Line 131 uses float division (`/`), which can cause precision loss for large integers. Use integer division (`//`) instead.

🔧 Proposed fix
```diff
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
     for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
+            tmpX = tmpX // i
     if x // math.prod(multiplicands) > 1:
         multiplicands.append(x // math.prod(multiplicands))
     return multiplicands
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 125 - 136, In _findAllMultiplicands, replace the floating-point division on tmpX (tmpX = tmpX / i) with integer division (tmpX = tmpX // i) and ensure tmpX remains an int so subsequent modulus and math.prod operations are integer-safe; update any related expressions that rely on tmpX's type (e.g., the while loop condition tmpX % i == 0 and the final math.prod usage) to work with the integer result.

Deeploy/Targets/NE16/Parsers.py (1)
55-62: Use `not channels_first` instead of `== False`.

Per PEP 8 and Ruff E712, avoid direct comparisons to `False`.

🔧 Proposed fix
```diff
 if not all([
-        channels_first == False,
+        not channels_first,
         len(data_in.shape) == 4,
         # LMACAN: weight shape should be equal to 3 because we have to do the ne16's
         # special weight encoding
         len(weight.shape) == 3,
 ]):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Parsers.py` around lines 55 - 62, In the conditional inside Parsers.py (the if checking channels_first, data_in.shape, weight.shape) replace the direct boolean comparison channels_first == False with the idiomatic not channels_first (i.e., use not channels_first in the all([...]) list) so the condition follows PEP8/Ruff E712; adjust the list element in the same if that currently reads channels_first == False to read not channels_first.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: The __init__.py currently uses explicit imports but omitted the Tiler
module, causing Engine.py's internal imports to fail; update
Deeploy.Targets.NE16.__init__.py to explicitly import or expose the Tiler module
(the Tiler.py symbol) alongside the other exported modules so it is available
when the package is imported, ensuring Engine.py's references to Tiler resolve
correctly (alternatively, use a package import loop such as using
pkgutil.iter_modules to dynamically import all submodules if you prefer
automatic inclusion).
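The pkgutil alternative mentioned above can be sketched with only the standard library. Here `json` stands in for a package like `Deeploy.Targets.NE16`, purely for illustration:

```python
import importlib
import pkgutil

import json  # stand-in for the package whose submodules we want to expose


# Discover and import every top-level submodule of the package.
submodules = {}
for modinfo in pkgutil.iter_modules(json.__path__):
    submodules[modinfo.name] = importlib.import_module(f"json.{modinfo.name}")

print(sorted(submodules))  # e.g. includes 'decoder', 'encoder', 'scanner'
```

With this pattern, newly added submodules (such as a forgotten `Tiler.py`) are picked up automatically, at the cost of less explicit imports.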
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor currently uses a mutable default
inputOffsets = {} which causes shared state; change the signature for the
relevant classes (SignPropDeployer.__init__ and GAP9Deployer.__init__) to use
inputOffsets=None and inside each __init__ set inputOffsets = {} if inputOffsets
is None before passing it to super().__init__ (or before any mutations), so each
instance gets its own dict and avoids state leakage when SignPropDeployer
populates the keys.
In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-81: The conv-classification is too loose: tighten isPWConv to
reject grouped 1x1 convolutions by adding a check that node.attrs['group'] == 1,
and tighten isDWConv to only accept true depthwise convolutions by inspecting
the conv weight tensor (node.inputs[1].values) and requiring the per-group
in_channels == 1 and that the number of groups equals the number of output
channels (e.g., weights.shape[1] == 1 and weights.shape[0] ==
node.attrs['group'] or group == out_channels); update isPWConv, isDWConv (and
thereby canExecute) accordingly so only bona fide pointwise (group==1) and true
depthwise convs are classified for NE16 coloring.
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Around line 27-44: The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 118-121: The parsers map only 4 inputs into
self.operatorRepresentation without bounds checking and always set shift to
'NULL', causing IndexError and mismatched scale_shift_addr; update parseNode()
in the affected parsers (the current loop that assigns inputs via inputs =
['data_in','weight','mul','add'] and self.operatorRepresentation[inputs[idx]] =
ctxt.lookup(inputNode.name).name) to perform explicit validation like
len(node.inputs) == 4 (following RQMatMulParser) OR handle a fifth input by
mapping node.inputs[4] into e.g. self.operatorRepresentation['shift'] (or read
shift from node.attrs['shift'] like PULPOpen) and stop hardcoding 'NULL' so the
template variable ${shift} used by ConvTemplate.py:310 gets a proper value;
apply the same change to NE16RQSPWConv2DParser and NE16RQSDenseConv2DParser.
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 141-147: The conf0 computation treats bias presence
incorrectly—change the bias check to treat the "NULL" sentinel as absent: when
calling getNormQuantConf0 in the block guarded by operatorRepresentation["mul"]
!= "NULL", replace the argument "add" in operatorRepresentation with an explicit
boolean like ('add' in operatorRepresentation and operatorRepresentation["add"]
!= "NULL") so bit 25 isn't set when add == "NULL"; keep the existing mul !=
"NULL" guard as-is and pass the adjusted boolean into getNormQuantConf0
(symbols: operatorRepresentation, getNormQuantConf0, conf0, "mul", "add").
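A minimal sketch of the guarded call; the only thing taken from the review is that bit 25 of conf0 flags bias presence. The function name is reused for readability, but the real getNormQuantConf0 takes more arguments and packs more fields:

```python
def getNormQuantConf0(has_bias: bool) -> int:
    # Minimal sketch: only the bias flag is modeled here; the real
    # getNormQuantConf0 packs several more fields into conf0.
    NE16_NORM_BIAS_BIT = 1 << 25  # bit 25 = bias present, per the review comment
    return NE16_NORM_BIAS_BIT if has_bias else 0


operatorRepresentation = {"mul": "requant_mul", "add": "NULL"}

# Treat the "NULL" sentinel as "no bias" before deriving conf0:
has_bias = ("add" in operatorRepresentation
            and operatorRepresentation["add"] != "NULL")
conf0 = getNormQuantConf0(has_bias) if operatorRepresentation["mul"] != "NULL" else 0
print(hex(conf0))  # bit 25 stays clear when add == "NULL"
```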
- Around line 323-327: The compound literal for ne16_cfg_t incorrectly assigns
weights_stride via an equals expression instead of a designated initializer;
change the block that sets task.data.cfg.weights_stride to use a designated
field (e.g., add a .weights_stride = { .d0 = weightStrideD0, .d1 =
weightStrideD1, .d2 = weightStrideD2 } entry) inside the ne16_cfg_t initializer
so it matches the surrounding fields like .input_stride, .output_stride and
.subtile and produces valid C code.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).
In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
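The stride-aware relation can be sanity-checked numerically (a sketch; padding is assumed folded into the effective input dimension):

```python
def conv_out_dim(effective_input_dim: int, stride: int) -> int:
    # For a 1x1 kernel, the output size is the number of stride steps
    # that fit in the (padded) input, i.e. ceil(input / stride).
    return (effective_input_dim - 1) // stride + 1


# stride 1 leaves the spatial size unchanged; stride 2 halves it, rounding up
print(conv_out_dim(32, 1), conv_out_dim(32, 2), conv_out_dim(33, 2))
```

With stride fixed to 1 the formula degenerates to the current identity constraint, which is why the bug only surfaces once enableStrides is true.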
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 81-101: The subtraction of weight_offset from values in Passes.py
can overflow if values is int8; convert values to a wider integer dtype (e.g.,
np.int16 or np.int32) before computing weight_offset and performing values =
values - weight_offset, then ensure the resulting values are clipped to the
0–255 range and finally cast to np.uint8 just before calling _weightEncode
(refer to weightTensor.values, weight_offset, values, and _weightEncode) so the
transpose and depthwise logic remain unchanged.
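The wraparound this guards against is easy to reproduce with NumPy; the weight_offset of -128 below is an assumed zero-point shift, chosen only because it maps int8 onto 0–255:

```python
import numpy as np

values = np.array([-128, 0, 127], dtype=np.int8)
weight_offset = np.int8(-128)

# Subtracting in int8 wraps around instead of producing 0..255:
wrapped = values - weight_offset                              # stays int8, overflows
widened = values.astype(np.int16) - np.int16(weight_offset)   # safe in int16

print(wrapped.tolist())   # wraps into negative values
print(widened.tolist())   # correct 0..255 range
safe = np.clip(widened, 0, 255).astype(np.uint8)              # final cast for encoding
```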
In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py`:
- Around line 15-16: The two parsed flags --ne16-wmem and --enable-3x3 are never
consumed; update main() to translate args.ne16_wmem and args.enable_3x3 into
CMake defines (e.g., add "-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and
"-DENABLE_3X3_CONV=ON" when args.enable_3x3 is true) and append them to the same
cmake argument list used for building (the variable that assembles cmake args /
options or the call that runs cmake/configure in main()); if the project expects
OFF values when flags are absent, explicitly add =OFF otherwise. Ensure you
reference the existing parser flags (--ne16-wmem / --enable-3x3) and modify the
code path that constructs cmake arguments in main() so these options are passed
to CMake.
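One possible shape of that wiring, sketched standalone; the define names ENABLE_NE16_WMEM and ENABLE_3X3_CONV are taken from the review comment and are assumptions about what the CMake build actually expects:

```python
import argparse


def build_cmake_args(args: argparse.Namespace) -> list:
    # Translate parsed runner flags into CMake cache defines,
    # emitting explicit OFF values when a flag is absent.
    cmake_args = []
    cmake_args.append(f"-DENABLE_NE16_WMEM={'ON' if args.ne16_wmem else 'OFF'}")
    cmake_args.append(f"-DENABLE_3X3_CONV={'ON' if args.enable_3x3 else 'OFF'}")
    return cmake_args


parser = argparse.ArgumentParser()
parser.add_argument("--ne16-wmem", action="store_true")
parser.add_argument("--enable-3x3", action="store_true")
args = parser.parse_args(["--ne16-wmem"])
print(build_cmake_args(args))
```

The returned list would then be appended to whatever argument list main() already passes to the cmake configure step.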
---
Nitpick comments:
In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 55-62: In the conditional inside Parsers.py (the if checking
channels_first, data_in.shape, weight.shape) replace the direct boolean
comparison channels_first == False with the idiomatic not channels_first (i.e.,
use not channels_first in the all([...]) list) so the condition follows
PEP8/Ruff E712; adjust the list element in the same if that currently reads
channels_first == False to read not channels_first.
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the self-referential wildcard import in __init__.py with
explicit re-exports: remove "from . import *" and import the specific public
symbols (e.g., classes, functions, constants) from the package submodules using
explicit imports (for example: from .submodule import ClassX, func_y), then
declare __all__ = ["ClassX", "func_y", ...] listing those exported names; ensure
you reference the actual symbol names defined in the package (replace
placeholders with the real identifiers) and avoid introducing circular imports
when reorganizing the imports.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint.py currently
bind unused names BatchOffset, HOffset, WOffset, COffset and BatchSize (and
similarly at the later unpacking around line 173); update those unpackings to
prefix unused variables with an underscore (e.g., _BatchOffset, _HOffset,
_WOffset, _BatchSize) so the intent of intentional non-use is clear; locate the
unpacking expressions that assign to cube.offset and cube.dims and rename only
the unused identifiers accordingly while keeping used names (COffset, HSize,
WSize, CSize) unchanged.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 125-136: In _findAllMultiplicands, replace the floating-point
division on tmpX (tmpX = tmpX / i) with integer division (tmpX = tmpX // i) and
ensure tmpX remains an int so subsequent modulus and math.prod operations are
integer-safe; update any related expressions that rely on tmpX's type (e.g., the
while loop condition tmpX % i == 0 and the final math.prod usage) to work with
the integer result.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3a121db6-6486-4738-9ded-bbe6729e8727
📒 Files selected for processing (28)
CMakeLists.txt
Deeploy/Targets/NE16/Bindings.py
Deeploy/Targets/NE16/Deployer.py
Deeploy/Targets/NE16/Engine.py
Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
Deeploy/Targets/NE16/OptimizationPasses/__init__.py
Deeploy/Targets/NE16/Parsers.py
Deeploy/Targets/NE16/Platform.py
Deeploy/Targets/NE16/Templates/AllocateTemplate.py
Deeploy/Targets/NE16/Templates/ConvTemplate.py
Deeploy/Targets/NE16/Templates/__init__.py
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
Deeploy/Targets/NE16/TileConstraints/__init__.py
Deeploy/Targets/NE16/Tiler.py
Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
Deeploy/Targets/NE16/__init__.py
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
DeeployTest/CMakeLists.txt
DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
DeeployTest/testMVP.py
DeeployTest/testUtils/core/execution.py
DeeployTest/testUtils/platformMapping.py
DeeployTest/test_gap9_ne16_tiled_config.py
TargetLibraries/GAP9/CMakeLists.txt
```python
#
# SPDX-License-Identifier: Apache-2.0


from . import *
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify self-referential wildcard imports in NE16 package initializers.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
```

Repository: pulp-platform/Deeploy
Length of output: 391
🏁 Script executed:
```bash
find Deeploy/Targets/NE16 -maxdepth 1 -type d | sort
```

Repository: pulp-platform/Deeploy
Length of output: 241
🏁 Script executed:
```bash
ls -la Deeploy/Targets/NE16/ | head -30
```

Repository: pulp-platform/Deeploy
Length of output: 878
🏁 Script executed:
```bash
cat -n Deeploy/Targets/NE16/__init__.py
```

Repository: pulp-platform/Deeploy
Length of output: 224
🏁 Script executed:
```bash
rg "from.*NE16.*import.*Tiler|from.*NE16.*Tiler" --type py
```

Repository: pulp-platform/Deeploy
Length of output: 205
🏁 Script executed:
```bash
rg "\.Tiler" Deeploy/Targets/NE16 --type py
```

Repository: pulp-platform/Deeploy
Length of output: 836
🏁 Script executed:
```bash
find Deeploy/Targets -maxdepth 2 -name "__init__.py" -type f | head -10
```

Repository: pulp-platform/Deeploy
Length of output: 374
🏁 Script executed:
```bash
for dir in Deeploy/Targets/*/; do echo "=== $(basename $dir) ==="; cat "$dir/__init__.py" 2>/dev/null | head -20; done
```

Repository: pulp-platform/Deeploy
Length of output: 1218
🏁 Script executed:
```bash
rg "__all__" Deeploy/Targets/NE16 --type py -A 5
```

Repository: pulp-platform/Deeploy
Length of output: 47
🏁 Script executed:
```bash
head -20 Deeploy/Targets/NE16/Tiler.py
```

Repository: pulp-platform/Deeploy
Length of output: 1270
🏁 Script executed:
```bash
rg "import.*Tiler" Deeploy --type py | grep -v "TilerExtension\|TilerModel" | head -20
```

Repository: pulp-platform/Deeploy
Length of output: 47
🏁 Script executed:
```bash
rg "from.*NE16" Deeploy --type py | grep -v "^Binary" | head -30
```

Repository: pulp-platform/Deeploy
Length of output: 2287
The proposed fix is incomplete: Tiler.py exists in the NE16 directory but is missing from the proposed imports
While the original concern is valid (`from . import *` does not expose submodules/packages as intended), the proposed fix lists only 9 modules and omits Tiler.py, which exists as a module in the directory and is imported internally by Engine.py.
Corrected fix
```diff
-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses

__all__ = [
    "Bindings",
    "Deployer",
    "Engine",
    "OptimizationPasses",
    "Parsers",
    "Platform",
    "Tiler",
    "Templates",
    "TileConstraints",
    "TopologyOptimizationPasses",
]
```
🧰 Tools
🪛 Ruff (0.15.9)
[error] 5-5: from . import * used; unable to detect undefined names
(F403)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/__init__.py` at line 5, The __init__.py currently uses
explicit imports but omitted the Tiler module, causing Engine.py's internal
imports to fail; update Deeploy.Targets.NE16.__init__.py to explicitly import or
expose the Tiler module (the Tiler.py symbol) alongside the other exported
modules so it is available when the package is imported, ensuring Engine.py's
references to Tiler resolve correctly (alternatively, use a package import loop
such as using pkgutil.iter_modules to dynamically import all submodules if you
prefer automatic inclusion).
```python
                 inputOffsets = {}):
        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                         default_channels_first, deeployStateDir, inputOffsets)
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify whether inputOffsets is mutated downstream (which would make shared default state dangerous).
rg -nP --type=py -C3 'def __init__\(|inputOffsets\s*=' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py
rg -nP --type=py -C2 'inputOffsets\[[^]]+\]\s*=|inputOffsets\.update\(|setdefault\(' Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py
```

Repository: pulp-platform/Deeploy
Length of output: 12773
🏁 Script executed:
```bash
# Find the parent class of GAP9Deployer
rg -n "class GAP9Deployer" Deeploy/Targets/GAP9/Deployer.py -A 5
```

Repository: pulp-platform/Deeploy
Length of output: 312
🏁 Script executed:
```bash
# Get the full NE16Deployer.__init__ context and check parent
head -n 35 Deeploy/Targets/NE16/Deployer.py
```

Repository: pulp-platform/Deeploy
Length of output: 1693
🏁 Script executed:
```bash
# Get the full GAP9Deployer.__init__ context to see parent class definition
head -n 70 Deeploy/Targets/GAP9/Deployer.py
```

Repository: pulp-platform/Deeploy
Length of output: 2726
🏁 Script executed:
```bash
# Search for any mutation of inputOffsets using broader patterns
rg -n "inputOffsets" Deeploy/Targets/NE16/Deployer.py Deeploy/Targets/GAP9/Deployer.py Deeploy/DeeployTypes.py -A 2 -B 2
```

Repository: pulp-platform/Deeploy
Length of output: 1535
🏁 Script executed:
```bash
# Find PULPDeployer class and its __init__
fd "Deployer.py" Deeploy/Targets/PULPOpen/
```

Repository: pulp-platform/Deeploy
Length of output: 101
🏁 Script executed:
```bash
# Search for PULPDeployer definition
rg -n "class PULPDeployer" Deeploy/ -A 10
```

Repository: pulp-platform/Deeploy
Length of output: 974
🏁 Script executed:
```bash
# Search for how inputOffsets is used in the parent class hierarchy
rg -n "self.inputOffsets" Deeploy/ -B 2 -A 2
```

Repository: pulp-platform/Deeploy
Length of output: 3019
🏁 Script executed:
```bash
# Read SignPropDeployer to see the full __init__ and how inputOffsets is mutated
head -n 55 Deeploy/CommonExtensions/NetworkDeployers/SignPropDeployer.py
```

Repository: pulp-platform/Deeploy
Length of output: 2075
🏁 Script executed:
```bash
# Check if inputOffsets is mutated elsewhere besides the initial population
rg -n "inputOffsets\[" Deeploy/ -B 1 -A 1
```

Repository: pulp-platform/Deeploy
Length of output: 1113
Use `None` as the default for the `inputOffsets` parameter.
The `{}` default on line 28 creates a shared mutable object that gets mutated in `SignPropDeployer.__init__()` (which populates it with keys). Multiple instances created without providing `inputOffsets` will share the same dict, causing unintended state leakage.
Proposed fix
```diff
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)
```

Note: GAP9Deployer has the same issue on line 54 and should be fixed identically.
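The hazard behind this suggestion can be demonstrated in isolation; the classes below are generic stand-ins, not the Deeploy hierarchy:

```python
class BadDeployer:
    def __init__(self, inputOffsets = {}):   # one dict object shared by every call
        self.inputOffsets = inputOffsets
        self.inputOffsets["input_0"] = 0     # mimics SignPropDeployer populating keys


class GoodDeployer:
    def __init__(self, inputOffsets = None):
        if inputOffsets is None:
            inputOffsets = {}                # fresh dict per instance
        self.inputOffsets = inputOffsets
        self.inputOffsets["input_0"] = 0


a, b = BadDeployer(), BadDeployer()
print(a.inputOffsets is b.inputOffsets)      # True: state leaks between instances

c, d = GoodDeployer(), GoodDeployer()
print(c.inputOffsets is d.inputOffsets)      # False: each instance owns its dict
```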
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
                 inputOffsets = None):
        if inputOffsets is None:
            inputOffsets = {}
        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                         default_channels_first, deeployStateDir, inputOffsets)
```
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 28-28: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor
currently uses a mutable default inputOffsets = {} which causes shared state;
change the signature for the relevant classes (SignPropDeployer.__init__ and
GAP9Deployer.__init__) to use inputOffsets=None and inside each __init__ set
inputOffsets = {} if inputOffsets is None before passing it to super().__init__
(or before any mutations), so each instance gets its own dict and avoids state
leakage when SignPropDeployer populates the keys.
```python
def isPWConv(self, node) -> bool:
    return node.op in ["Conv", "RequantizedConv"] and \
           isinstance(node.inputs[1], gs.Constant) and \
           node.attrs['kernel_shape'] == [1, 1] and \
           node.attrs['dilations'] == [1, 1] and \
           (node.attrs['strides'] == [1, 1] or self.enableStrides)


def isDWConv(self, node) -> bool:
    return node.op in ["Conv", "RequantizedConv"] and \
           isinstance(node.inputs[1], gs.Constant) and \
           node.attrs['kernel_shape'] == [3, 3] and \
           node.attrs['dilations'] == [1, 1] and \
           node.attrs['group'] != 1 and \
           (node.attrs['strides'] == [1, 1] or self.enableStrides)


def canExecute(self, node: gs.Node) -> bool:
    if self.enable3x3:
        return self.isPWConv(node) or self.isDWConv(node) or self.isDenseConv(node)
    else:
        return self.isPWConv(node)
```
Tighten grouped-convolution classification before coloring.
isPWConv() currently accepts grouped 1x1 convs, and isDWConv() accepts any 3x3 conv with group != 1. The downstream DW path assumes true depthwise semantics, so generic grouped convolutions can be colored for NE16 and then fail later in parsing/tiling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 81, The conv-classification
is too loose: tighten isPWConv to reject grouped 1x1 convolutions by adding a
check that node.attrs['group'] == 1, and tighten isDWConv to only accept true
depthwise convolutions by inspecting the conv weight tensor
(node.inputs[1].values) and requiring the per-group in_channels == 1 and that
the number of groups equals the number of output channels (e.g.,
weights.shape[1] == 1 and weights.shape[0] == node.attrs['group'] or group ==
out_channels); update isPWConv, isDWConv (and thereby canExecute) accordingly so
only bona fide pointwise (group==1) and true depthwise convs are classified for
NE16 coloring.
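The tightened classification can be expressed on weight shapes alone; this is a simplified sketch assuming OIHW weight layout, not the onnx_graphsurgeon node API:

```python
def classify_conv(weight_shape, group, kernel_shape, strides, dilations,
                  enable_strides = False):
    """Classify a conv as 'pointwise', 'depthwise', or None.

    weight_shape is (out_ch, in_ch_per_group, kH, kW) in OIHW layout.
    """
    out_ch = weight_shape[0]
    stride_ok = strides == [1, 1] or enable_strides
    if dilations != [1, 1] or not stride_ok:
        return None
    # Pointwise: 1x1 kernel AND ungrouped (group == 1).
    if kernel_shape == [1, 1] and group == 1:
        return "pointwise"
    # Depthwise: 3x3 kernel, one input channel per group,
    # and exactly one group per output channel.
    if kernel_shape == [3, 3] and weight_shape[1] == 1 and group == out_ch:
        return "depthwise"
    return None


print(classify_conv((64, 3, 1, 1), 1, [1, 1], [1, 1], [1, 1]))    # pointwise
print(classify_conv((64, 1, 3, 3), 64, [3, 3], [1, 1], [1, 1]))   # depthwise
print(classify_conv((64, 16, 1, 1), 4, [1, 1], [1, 1], [1, 1]))   # None: grouped 1x1
```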
```python
weightMemoryOccupation = 0

# Current weight memory occupation
for buffer in {**ctxt.globalObjects, **ctxt.localObjects}.values():
    if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
        weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
for node in ne16Nodes:
    if node.op in ["Conv", "RequantizedConv"]:

        if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
            continue

        buffer = ctxt.lookup(node.inputs[1].name)
        if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
            buffer._memoryLevel = self._weightMemoryLevel.name
            weightMemoryOccupation += _ne16WeightBufferSize(buffer)
```
Avoid double-counting reused weights and allow exact-fit placement.
If two NE16 nodes share the same weight buffer, this loop charges the same buffer twice once it has already been annotated in this pass. The strict < check also skips buffers that exactly fill the remaining SRAM. Both cases leave feasible placements unassigned.
Suggested fix
```diff
 ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
 for node in ne16Nodes:
     if node.op in ["Conv", "RequantizedConv"]:
         if not (ctxt.is_local(node.inputs[1].name) or ctxt.is_global(node.inputs[1].name)):
             continue
         buffer = ctxt.lookup(node.inputs[1].name)
-        if weightMemoryOccupation + _ne16WeightBufferSize(buffer) < self._weightMemoryLevel.size:
-            buffer._memoryLevel = self._weightMemoryLevel.name
-            weightMemoryOccupation += _ne16WeightBufferSize(buffer)
+        if getattr(buffer, "_memoryLevel", None) == self._weightMemoryLevel.name:
+            continue
+
+        bufferSize = _ne16WeightBufferSize(buffer)
+        if weightMemoryOccupation + bufferSize <= self._weightMemoryLevel.size:
+            buffer._memoryLevel = self._weightMemoryLevel.name
+            weightMemoryOccupation += bufferSize
 return ctxt, graph
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`
around lines 27 - 44, The loop that sums and assigns weight memory uses
weightMemoryOccupation and then re-adds the same buffer when multiple NE16 nodes
reference it and uses a strict "<" so exact-fit buffers are skipped; change the
logic so before calling _ne16WeightBufferSize(buffer) or adding to
weightMemoryOccupation you check whether buffer._memoryLevel is already equal to
self._weightMemoryLevel.name (skip counting/annotating if already placed) and
use a non-strict comparison (<=) against self._weightMemoryLevel.size when
deciding to assign buffer._memoryLevel; update references in this block
(ne16Nodes iteration, ctxt.lookup(node.inputs[1].name), buffer._memoryLevel,
_ne16WeightBufferSize, and self._weightMemoryLevel.size) accordingly so buffers
aren’t double-counted and exact-fit placements are allowed.
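The corrected accounting reduces to a greedy fit with deduplication; below is a standalone sketch with plain dicts standing in for Deeploy's context buffers:

```python
def place_weights(buffers, capacity, level_name = "WMEM"):
    # Greedy placement: skip buffers already placed (no double-counting),
    # and allow an exact fit via <=.
    occupied = sum(b["size"] for b in buffers if b.get("level") == level_name)
    for b in buffers:
        if b.get("level") == level_name:
            continue                             # already placed: don't count twice
        if occupied + b["size"] <= capacity:     # <= permits filling memory exactly
            b["level"] = level_name
            occupied += b["size"]
    return occupied


shared = {"name": "w0", "size": 60}
bufs = [shared, shared, {"name": "w1", "size": 40}]   # w0 referenced by two nodes
print(place_weights(bufs, capacity = 100))            # w0 counted once, w1 fits exactly
```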
weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify weight tensor shape after encoding and check if dimIdx 3 causes issues
echo "=== Check TilerModel.getTensorDimVar implementation ==="
ast-grep --pattern 'def getTensorDimVar($$$):
$$$
'
echo -e "\n=== Check if weight shape assertions exist in tiler ==="
rg -n "getTensorDimVar.*weight" --type py -C 2
Repository: pulp-platform/Deeploy
🏁 Script executed:
# Check if the variables declared at lines 41-44 are used later in the function
sed -n '1,100p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | cat -n
🏁 Script executed:
# Check if similar variables are used in the comparable Neureka file
sed -n '40,80p' Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py | cat -n
🏁 Script executed:
# Find where NE16AdjustWeightMemoryLayoutPass is defined and check weight shape
find . -name "*NE16AdjustWeightMemoryLayoutPass*" -o -name "*Parsers.py" | head -5
🏁 Script executed:
# Search for all uses of the suspect variables in NE16DenseConstraint.py
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
🏁 Script executed:
# Check NE16 Parsers.py for weight shape information
cat -n Deeploy/Targets/NE16/Parsers.py | head -100
🏁 Script executed:
# Compare with Neureka - check if those variables are used there
rg -n "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar" Deeploy/Targets/Neureka/TileConstraints/NeurekaDenseConstraint.py
🏁 Script executed:
# Check if getTensorDimVar creates variables on-demand or retrieves them
cat -n Deeploy/TilingExtension/TilerModel.py | sed -n '125,145p'
🏁 Script executed:
# Check addTensorDimToModel to see what dimensions it adds to the model
rg -n "def addTensorDimToModel" Deeploy/TilingExtension/TilerModel.py -A 15
🏁 Script executed:
# Verify the runtime error scenario: check if weight is 3D and dimIdx=3 would fail
python3 << 'EOF'
# Simulate the getTensorDimVar call
weight_shape = (64, 32*64, 128) # Example 3D weight: (cout, cinMajor, bits*spatial)
num_dims = len(weight_shape)
print(f"Weight shape: {weight_shape}")
print(f"Number of dimensions: {num_dims}")
print(f"Valid dimIdx range: 0 to {num_dims - 1}")
# Try to access dimIdx=3
for dimIdx in [0, 1, 2, 3]:
varName = f"weight_dim_{dimIdx}"
if dimIdx < num_dims:
print(f"✓ dimIdx={dimIdx}: {varName} would exist (size: {weight_shape[dimIdx]})")
else:
print(f"✗ dimIdx={dimIdx}: {varName} would NOT exist (KeyError)")
EOF
🏁 Script executed:
# Double-check the complete addGeometricalConstraint function to confirm these variables aren't used
wc -l Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
sed -n '24,90p' Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py | grep -E "weightInChannelMajorVar|weightBitsVar|weightBandwidthVar"
Remove unused weight dimension variables that cause runtime KeyError.
Lines 42-44 declare variables that are never used and access a non-existent dimension. Per Parsers.py line 60, NE16 weight tensors are 3D (cout, cinMajor, bits*spatial) after encoding, with valid dimensions 0–2. Attempting to access dimIdx=3 via getTensorDimVar will fail at runtime with a KeyError.
Remove these three lines:
weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)

Keep only line 41: weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0), which is actually used in the function.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41
- 44, The three unused weight-dimension variable declarations in
NE16DenseConstraint.py (weightInChannelMajorVar, weightBitsVar,
weightBandwidthVar) attempt to access non-existent dims and cause a runtime
KeyError; remove the lines that create weightInChannelMajorVar, weightBitsVar
and weightBandwidthVar and keep only weightOutChannelVar =
tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0) which is
the only weight dim used in this function (adjust any references to those
removed symbols if present elsewhere in the function).
# Map output dims to inputs dims
tilerModel.addConstraint(outputBatchVar == inputBatchVar)
tilerModel.addConstraint(outputHeightVar == inputHeightVar)
tilerModel.addConstraint(outputWidthVar == inputWidthVar)
Model pointwise output geometry with stride, not identity.
This forces outputHeightVar == inputHeightVar and outputWidthVar == inputWidthVar, but NE16Engine.canExecute() explicitly allows strided 1x1 convs when enableStrides is set. Those tiles need the usual (effective_input - 1) // stride + 1 relation, otherwise the model is wrong as soon as stride > 1.
Suggested fix
- # Map output dims to inputs dims
+ # Map output dims to input dims
tilerModel.addConstraint(outputBatchVar == inputBatchVar)
- tilerModel.addConstraint(outputHeightVar == inputHeightVar)
- tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+ strides = parseDict["strides"]
+ padding = parseDict["pads"]
+ inputBuffer = ctxt.lookup(inputBufferName)
+ effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+ effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+ tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+ tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines
43 - 46, The current constraints in NE16PointwiseConstraint.py bind
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar which breaks
strided 1x1 convs allowed by NE16Engine.canExecute when enableStrides is true;
replace those identity constraints (the tilerModel.addConstraint calls for
outputHeightVar/outputWidthVar) with the proper stride-aware relation using the
stride variables (e.g., strideH/strideW or the stride parameter used by
NE16Engine) so outputDim = (effective_input_dim - 1) // stride + 1 for height
and width respectively, preserving the existing batch equality (outputBatchVar
== inputBatchVar).
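The (effective_input - 1) // stride + 1 relation above is the standard convolution output-size formula specialized to a 1x1 kernel; a small helper (hypothetical, not part of Deeploy) shows why the identity constraint breaks as soon as stride > 1:

```python
def conv_out_size(in_size, kernel, stride, pad_begin=0, pad_end=0):
    """Output extent of a convolution along one spatial axis."""
    effective = in_size + pad_begin + pad_end
    # Number of kernel placements along the axis at the given stride.
    return (effective - kernel) // stride + 1

# For a 1x1 kernel this reduces to (effective - 1) // stride + 1,
# the relation used in the suggested fix above.
```

With in_size = 8 and a 1x1 kernel, stride 1 keeps the output at 8, while stride 2 shrinks it to 4 — so equating output and input dims is only valid for stride 1.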
values = weightTensor.values

# Extract weight offset and translate weights by the offset
weight_offset = values.min()
values = values - weight_offset
node.attrs["weight_offset"] = weight_offset

if "channels_first" in node.attrs:
    channels_first = node.attrs["channels_first"]
else:
    channels_first = default_channels_first

# Weight encode expects channels first
if not channels_first:
    values = values.transpose(0, 3, 1, 2)

bits = 8  # Support only 8 bit weights for now
if node.attrs['group'] == 1:
    weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = False)
else:
    weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = True)
Potential integer overflow when subtracting weight offset.
If values has dtype int8 (range -128 to 127), the subtraction at line 85 can overflow before the astype(np.uint8) at line 99. For example, with weight_offset = -128 and a value of 127:
127 - (-128) = 255 # overflows int8, wraps to -1
Cast to a wider type before subtraction to avoid wraparound:
🐛 Proposed fix
# Adjust N-EUREKA's weights
values = weightTensor.values
# Extract weight offset and translate weights by the offset
weight_offset = values.min()
- values = values - weight_offset
+ values = values.astype(np.int32) - int(weight_offset)
node.attrs["weight_offset"] = weight_offset
if "channels_first" in node.attrs:

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
values = weightTensor.values
# Extract weight offset and translate weights by the offset
weight_offset = values.min()
values = values.astype(np.int32) - int(weight_offset)
node.attrs["weight_offset"] = weight_offset
if "channels_first" in node.attrs:
    channels_first = node.attrs["channels_first"]
else:
    channels_first = default_channels_first
# Weight encode expects channels first
if not channels_first:
    values = values.transpose(0, 3, 1, 2)
bits = 8  # Support only 8 bit weights for now
if node.attrs['group'] == 1:
    weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = False)
else:
    weightTensor.values = _weightEncode(values.astype(np.uint8), bits, depthwise = True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 81 -
101, The subtraction of weight_offset from values in Passes.py can overflow if
values is int8; convert values to a wider integer dtype (e.g., np.int16 or
np.int32) before computing weight_offset and performing values = values -
weight_offset, then ensure the resulting values are clipped to the 0–255 range
and finally cast to np.uint8 just before calling _weightEncode (refer to
weightTensor.values, weight_offset, values, and _weightEncode) so the transpose
and depthwise logic remain unchanged.
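The wraparound is easy to reproduce without numpy; wrap_int8 below is a hypothetical helper that mimics numpy's modular int8 arithmetic:

```python
def wrap_int8(x):
    # Reduce x into the int8 range [-128, 127] with two's-complement
    # wraparound, mimicking what numpy does on int8 overflow.
    return (x + 128) % 256 - 128

vals = [127, -128]          # int8 weights at both extremes
offset = min(vals)          # -128
wrapped = [wrap_int8(v - offset) for v in vals]  # int8 math: 127 - (-128) wraps to -1
widened = [v - offset for v in vals]             # widen first: correct 255 and 0
```

The widened values stay in the 0–255 range expected by the subsequent astype(np.uint8), while the wrapped variant silently corrupts the maximum weight.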
parser.add_argument('--ne16-wmem', action = 'store_true', help = 'Enable NE16 weight memory\n')
parser.add_argument('--enable-3x3', action = 'store_true', help = 'Enable 3x3 convolutions\n')
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify end-to-end wiring for NE16-specific runner flags.
rg -nP --type=py -C3 'ne16[_-]wmem|enable[_-]3x3|hasattr\(args,\s*.\s*ne16|hasattr\(args,\s*.\s*enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py DeeployTest/testUtils/core/execution.py
rg -nP --type=cmake -C3 'NE16|3x3|WMEM|NUM_CORES' CMakeLists.txt DeeployTest/CMakeLists.txt TargetLibraries/GAP9/CMakeLists.txt
🏁 Script executed:
# Search for main() function and how it handles arguments
rg -n 'def main' DeeployTest/testUtils/deeployRunner.py -A 20
# Look for args usage with underscore conversion (--ne16-wmem → args.ne16_wmem)
rg -n 'args\.ne16|args\.enable' DeeployTest/testUtils/deeployRunner.py DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
# Check the entire flow in the runner file
cat -n DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
🏁 Script executed:
# Get more of the main() function to see how it uses parser_setup_callback
sed -n '311,450p' DeeployTest/testUtils/deeployRunner.py | cat -n
# Also search for any reference to ne16_wmem or enable_3x3 after variable names are converted
rg -n 'ne16_wmem|enable_3x3' DeeployTest/testUtils/deeployRunner.py DeeployTest/testUtils/core/execution.py
Add handling for --ne16-wmem and --enable-3x3 flags in main().
The flags are defined on Lines 15-16, but unlike --cores (which is properly handled at deeployRunner.py:90-91), there is no downstream consumption of --ne16-wmem or --enable-3x3. These arguments are parsed but never converted to CMake options or used anywhere in execution. Either wire them into CMake configuration (e.g., -DENABLE_NE16_WMEM, -DENABLE_3X3_CONV), or remove them if not yet implemented.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` around lines 15 - 16, The two
parsed flags --ne16-wmem and --enable-3x3 are never consumed; update main() to
translate args.ne16_wmem and args.enable_3x3 into CMake defines (e.g., add
"-DENABLE_NE16_WMEM=ON" when args.ne16_wmem is true and "-DENABLE_3X3_CONV=ON"
when args.enable_3x3 is true) and append them to the same cmake argument list
used for building (the variable that assembles cmake args / options or the call
that runs cmake/configure in main()); if the project expects OFF values when
flags are absent, explicitly add =OFF otherwise. Ensure you reference the
existing parser flags (--ne16-wmem / --enable-3x3) and modify the code path that
constructs cmake arguments in main() so these options are passed to CMake.
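One way to do the wiring described above is to map each boolean flag to an explicit ON/OFF CMake define. The define names below (ENABLE_NE16_WMEM, ENABLE_3X3_CONV) are the hypothetical ones suggested in the comment, not options that exist in the build today:

```python
import argparse

def cmake_defines_from_flags(argv):
    """Parse the NE16 runner flags and translate them into CMake defines."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--ne16-wmem', action='store_true', help='Enable NE16 weight memory')
    parser.add_argument('--enable-3x3', action='store_true', help='Enable 3x3 convolutions')
    args = parser.parse_args(argv)
    # Emit explicit OFF values so an absent flag still overrides a cached setting.
    return [
        f"-DENABLE_NE16_WMEM={'ON' if args.ne16_wmem else 'OFF'}",
        f"-DENABLE_3X3_CONV={'ON' if args.enable_3x3 else 'OFF'}",
    ]
```

The returned list would be appended to whatever argument list main() already passes to the cmake configure step.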
Actionable comments posted: 1
♻️ Duplicate comments (1)
Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)
352-356: ⚠️ Potential issue | 🔴 Critical

Fix invalid designated initializer syntax for weights_stride.

The line task.data.cfg.weights_stride = (ne16_stride_t) {...} is inside the .cfg = (ne16_cfg_t) { ... } designated initializer block. Within designated initializers, you must use .fieldname = value syntax. The current syntax is invalid C and will cause compilation errors.

🐛 Proposed fix

- task.data.cfg.weights_stride = (ne16_stride_t) {
+ .weights_stride = (ne16_stride_t) {
      .d0 = ${weightStrideD0},
      .d1 = ${weightStrideD1},
      .d2 = ${weightStrideD2}
  },

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356, Inside the designated initializer for task.data.cfg (the .cfg = (ne16_cfg_t) { ... } block) the assignment to weights_stride uses an invalid nested C cast-style initializer; replace it with a proper designated-field initializer by setting .weights_stride = { .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} } (or include an explicit .weights_stride = (ne16_stride_t){ .d0 = ..., .d1 = ..., .d2 = ... } if you prefer the cast) so the field name .weights_stride is used with a brace-enclosed initializer compatible with ne16_stride_t within the ne16_cfg_t designated initializer.
🧹 Nitpick comments (8)
Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py (1)
7-11: Minor style nit: Consider grouping all imports before module-level assignments.

The assignment float32_tPtr = PointerClass(float32_t) on line 10 appears between import statements. Per PEP 8 conventions, imports should typically be grouped together at the top of the file before other statements.

♻️ Suggested reordering

  from Deeploy.AbstractDataTypes import PointerClass
  from Deeploy.CommonExtensions.DataTypes import float32_t
-
- float32_tPtr = PointerClass(float32_t)
  from Deeploy.DeeployTypes import NetworkContext, NodeTemplate, OperatorRepresentation
+
+ float32_tPtr = PointerClass(float32_t)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` around lines 7 - 11, Move the module-level assignment float32_tPtr = PointerClass(float32_t) so all import statements are grouped first: keep the imports for PointerClass, float32_t, and DeeployTypes together at the top, then place the float32_tPtr assignment immediately after those imports; ensure you reference PointerClass and float32_t exactly as they are imported and leave NetworkContext, NodeTemplate, OperatorRepresentation imports unchanged.DeeployTest/test_platforms.py (1)
1004-1006: Unused config_name variable (consistent with existing tests).

The config_name variable is unpacked but never used. This matches the pattern in all other tiled test functions in this file (e.g., lines 317, 346, 808, etc.). For consistency, you could prefix it with an underscore to signal intent:

- test_name, l1, config_name = test_params
+ test_name, l1, _config_name = test_params

This is a low-priority style nit since the existing codebase follows the same pattern.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DeeployTest/test_platforms.py` around lines 1004 - 1006, In test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the unpacked variable config_name is unused; rename it to _config_name (or prefix with an underscore) in the function signature to indicate it's intentionally unused and match the existing tiled test patterns (refer to function test_gap9_w_ne16_tiled_kernels_l2_singlebuffer)..github/workflows/_runner-gap9-w-ne16-tiled.yml (1)
36-36: Consider logging pip install failures instead of silently ignoring them.

The || true suppresses all pip install failures silently. If the install genuinely fails, subsequent test runs will fail with confusing errors. Consider logging the failure while still allowing the workflow to proceed:

- pip install -e . || true
+ pip install -e . || echo "WARNING: pip install failed, tests may not work correctly"

Alternatively, if the install is expected to always succeed in CI, remove || true to fail fast on broken dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/_runner-gap9-w-ne16-tiled.yml at line 36, The workflow step currently runs "pip install -e . || true", which silently swallows installation failures; change this to either remove the "|| true" so the job fails fast on pip install errors, or replace it with logic that captures the pip exit status and logs the failure before continuing (e.g., run pip install -e . and on failure emit a clear error log message referencing the pip install step); locate the exact step containing "pip install -e . || true" in the workflow and update it accordingly so failures are not silently ignored.Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)
161-162: Prefix unused unpacked variables with underscore.

Per static analysis hints, several unpacked variables are never used. Prefix them with underscore to indicate intentional non-use.

♻️ Suggested fix

- (BatchOffset, HOffset, WOffset, COffset) = cube.offset
- (BatchSize, HSize, WSize, CSize) = cube.dims
+ (_BatchOffset, _HOffset, _WOffset, COffset) = cube.offset
+ (_BatchSize, HSize, WSize, CSize) = cube.dims

- inBSize, inHSize, inWSize, inCSize = InCube.dims
+ _inBSize, _inHSize, inWSize, inCSize = InCube.dims

Also applies to: 173-173
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 161 - 162, The tuple unpackings in NE16DenseConstraint where you extract cube.offset into (BatchOffset, HOffset, WOffset, COffset) and cube.dims into (BatchSize, HSize, WSize, CSize) (and the similar unpacking around line 173) include variables that are never used; update these unpackings to prefix the unused names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so static analysis recognizes intentional non-use while keeping any used variables unchanged and preserving the tuple structure in the methods within the NE16DenseConstraint class.Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)
79-80: Remove unused padding variable.

The padding variable is read from parseDict but never used in addPolicyConstraint.

♻️ Suggested fix

  strides = parseDict["strides"]
- padding = parseDict["pads"]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines 79 - 80, In addPolicyConstraint (class NE16PointwiseConstraint) the local variable padding is assigned from parseDict["pads"] but never used; remove the unused assignment (the line `padding = parseDict["pads"]`) so there is no dead variable left, or if padding is intended to be used, instead reference it where needed—locate the parseDict usage in addPolicyConstraint and either delete the padding assignment or integrate padding into the constraint logic.Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (2)
29-31: Remove unused dilation variable.

The dilation variable is extracted but never used in the function. Since NE16 doesn't support dilation (enforced by the parser's dilations == [1, 1] check), this variable is unnecessary.

♻️ Suggested fix

  strides = parseDict["strides"]
  padding = parseDict["pads"]
- dilation = parseDict["dilations"]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines 29 - 31, Remove the unused local variable `dilation` that is assigned from `parseDict["dilations"]`; since the parser enforces `dilations == [1, 1]` and NE16 does not support dilation, delete the line `dilation = parseDict["dilations"]` (and any dead code referencing `dilation`) from the function so only `strides = parseDict["strides"]` and `padding = parseDict["pads"]` remain, ensuring no other code expects `dilation`.
160-162: Use the already-looked-up outputBuffer instead of a redundant lookup.

outputBuffer was already retrieved at line 153. Line 162 performs an unnecessary second lookup.

♻️ Suggested fix

  InCube, padding_tuple = Conv2DTileConstraint.computeInputCube((weightH, weightW), pads, strides, weightC, cube,
-                                                               ctxt.lookup(varOut).shape)
+                                                               outputBuffer.shape)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines 160 - 162, Replace the redundant ctxt.lookup(varOut).shape call with the already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when calling Conv2DTileConstraint.computeInputCube (which returns InCube and padding_tuple), pass outputBuffer.shape instead of performing a second ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously retrieved outputBuffer variable.Deeploy/Targets/NE16/Parsers.py (1)
56-56: Use an idiomatic boolean check instead of an equality comparison to False.

Per Ruff E712, avoid explicit equality comparisons to False.

♻️ Suggested fix

  if not all([
-     channels_first == False,
+     not channels_first,
      len(data_in.shape) == 4,

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Parsers.py` at line 56, Replace the explicit equality check "channels_first == False" with the idiomatic boolean negation "not channels_first" wherever it appears (e.g., in the call/config constructing the parser using the channels_first variable) to comply with Ruff E712 and avoid direct comparison to False.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py`:
- Line 34: The list comprehension that builds ne16Nodes uses
node.attrs["engine"] directly and can raise KeyError for nodes without an
"engine"; change it to safely check for the key (e.g., use
node.attrs.get("engine") == self.ne16EngineName or "engine" in node.attrs and
node.attrs["engine"] == self.ne16EngineName) so only nodes with an engine
attribute are compared against self.ne16EngineName; update the expression that
defines ne16Nodes accordingly in MemoryLevelAnnotationPasses.py.
---
Duplicate comments:
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Inside the designated initializer for task.data.cfg (the
.cfg = (ne16_cfg_t) { ... } block) the assignment to weights_stride uses an
invalid nested C cast-style initializer; replace it with a proper
designated-field initializer by setting .weights_stride = { .d0 =
${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} } (or
include an explicit .weights_stride = (ne16_stride_t){ .d0 = ..., .d1 = ..., .d2
= ... } if you prefer the cast) so the field name .weights_stride is used with a
brace-enclosed initializer compatible with ne16_stride_t within the ne16_cfg_t
designated initializer.
---
Nitpick comments:
In @.github/workflows/_runner-gap9-w-ne16-tiled.yml:
- Line 36: The workflow step currently runs "pip install -e . || true", which
silently swallows installation failures; change this to either remove the "||
true" so the job fails fast on pip install errors, or replace it with logic that
captures the pip exit status and logs the failure before continuing (e.g., run
pip install -e . and on failure emit a clear error log message referencing the
pip install step); locate the exact step containing "pip install -e . || true"
in the workflow and update it accordingly so failures are not silently ignored.
In `@Deeploy/Targets/NE16/Parsers.py`:
- Line 56: Replace the explicit equality check "channels_first == False" with
the idiomatic boolean negation "not channels_first" wherever it appears (e.g.,
in the call/config constructing the parser using the channels_first variable) to
comply with Ruff E712 and avoid direct comparison to False.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 161-162: The tuple unpackings in NE16DenseConstraint where you
extract cube.offset into (BatchOffset, HOffset, WOffset, COffset) and cube.dims
into (BatchSize, HSize, WSize, CSize) (and the similar unpacking around line
173) include variables that are never used; update these unpackings to prefix
the unused names with an underscore (e.g., _BatchOffset, _HOffset, etc.) so
static analysis recognizes intentional non-use while keeping any used variables
unchanged and preserving the tuple structure in the methods within the
NE16DenseConstraint class.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 29-31: Remove the unused local variable `dilation` that is
assigned from `parseDict["dilations"]`; since the parser enforces `dilations ==
[1, 1]` and NE16 does not support dilation, delete the line `dilation =
parseDict["dilations"]` (and any dead code referencing `dilation`) from the
function so only `strides = parseDict["strides"]` and `padding =
parseDict["pads"]` remain, ensuring no other code expects `dilation`.
- Around line 160-162: Replace the redundant ctxt.lookup(varOut).shape call with
the already-fetched outputBuffer.shape in the NE16DepthwiseConstraint code: when
calling Conv2DTileConstraint.computeInputCube (which returns InCube and
padding_tuple), pass outputBuffer.shape instead of performing a second
ctxt.lookup(varOut) to avoid the duplicate lookup and use the previously
retrieved outputBuffer variable.
In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 79-80: In addPolicyConstraint (class NE16PointwiseConstraint) the
local variable padding is assigned from parseDict["pads"] but never used; remove
the unused assignment (the line `padding = parseDict["pads"]`) so there is no
dead variable left, or if padding is intended to be used, instead reference it
where needed—locate the parseDict usage in addPolicyConstraint and either delete
the padding assignment or integrate padding into the constraint logic.
In `@Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py`:
- Around line 7-11: Move the module-level assignment float32_tPtr =
PointerClass(float32_t) so all import statements are grouped first: keep the
imports for PointerClass, float32_t, and DeeployTypes together at the top, then
place the float32_tPtr assignment immediately after those imports; ensure you
reference PointerClass and float32_t exactly as they are imported and leave
NetworkContext, NodeTemplate, OperatorRepresentation imports unchanged.
In `@DeeployTest/test_platforms.py`:
- Around line 1004-1006: In test_gap9_w_ne16_tiled_kernels_l2_singlebuffer, the
unpacked variable config_name is unused; rename it to _config_name (or prefix
with an underscore) in the function signature to indicate it's intentionally
unused and match the existing tiled test patterns (refer to function
test_gap9_w_ne16_tiled_kernels_l2_singlebuffer).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0e7e2d4c-444f-4b78-b2b5-22aef5e9a86f
📒 Files selected for processing (32)
- .github/workflows/_runner-gap9-w-ne16-tiled.yml
- .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
- CMakeLists.txt
- Deeploy/Targets/NE16/Bindings.py
- Deeploy/Targets/NE16/Deployer.py
- Deeploy/Targets/NE16/Engine.py
- Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
- Deeploy/Targets/NE16/OptimizationPasses/__init__.py
- Deeploy/Targets/NE16/Parsers.py
- Deeploy/Targets/NE16/Platform.py
- Deeploy/Targets/NE16/Templates/AllocateTemplate.py
- Deeploy/Targets/NE16/Templates/ConvTemplate.py
- Deeploy/Targets/NE16/Templates/__init__.py
- Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
- Deeploy/Targets/NE16/TileConstraints/__init__.py
- Deeploy/Targets/NE16/Tiler.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
- Deeploy/Targets/NE16/__init__.py
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- DeeployTest/CMakeLists.txt
- DeeployTest/conftest.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/testMVP.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/platformMapping.py
- DeeployTest/test_gap9_ne16_tiled_config.py
- DeeployTest/test_platforms.py
- TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (5)
- DeeployTest/testMVP.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/CMakeLists.txt
- Deeploy/Targets/NE16/Tiler.py
- DeeployTest/test_gap9_ne16_tiled_config.py
🚧 Files skipped from review as they are similar to previous changes (6)
- TargetLibraries/GAP9/CMakeLists.txt
- CMakeLists.txt
- DeeployTest/testUtils/platformMapping.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
- Deeploy/Targets/NE16/Platform.py
if hasattr(buffer, "_memoryLevel") and buffer._memoryLevel == self._weightMemoryLevel.name:
    weightMemoryOccupation += _ne16WeightBufferSize(buffer)

ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
Potential KeyError if node lacks engine attribute.
The list comprehension accesses node.attrs["engine"] without checking if the key exists. Nodes not assigned to an engine may lack this attribute.
🛡️ Proposed defensive fix
- ne16Nodes = [node for node in graph.nodes if node.attrs["engine"] == self.ne16EngineName]
+ ne16Nodes = [node for node in graph.nodes if node.attrs.get("engine") == self.ne16EngineName]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py` at
line 34, The list comprehension that builds ne16Nodes uses node.attrs["engine"]
directly and can raise KeyError for nodes without an "engine"; change it to
safely check for the key (e.g., use node.attrs.get("engine") ==
self.ne16EngineName or "engine" in node.attrs and node.attrs["engine"] ==
self.ne16EngineName) so only nodes with an engine attribute are compared against
self.ne16EngineName; update the expression that defines ne16Nodes accordingly in
MemoryLevelAnnotationPasses.py.
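The difference between the two access styles can be reproduced with a minimal stand-in node class (the `FakeNode` class below is hypothetical, not the real `gs.Node`):

```python
class FakeNode:
    # Minimal stand-in for gs.Node: only the attrs dict matters here.
    def __init__(self, attrs):
        self.attrs = attrs

# The second node was never assigned to an engine, so "engine" is absent.
nodes = [FakeNode({"engine": "NE16"}), FakeNode({})]

# attrs["engine"] would raise KeyError on the second node;
# attrs.get("engine") returns None instead, and the comparison is simply False.
ne16_nodes = [n for n in nodes if n.attrs.get("engine") == "NE16"]
print(len(ne16_nodes))  # → 1
```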
Actionable comments posted: 4
♻️ Duplicate comments (5)
Deeploy/Targets/NE16/Deployer.py (1)
19-30:⚠️ Potential issue | 🟠 MajorAvoid shared
inputOffsetsstate across deployer instances.
inputOffsets = {}still creates one dict shared by everyNE16Deployerbuilt without an explicit argument, so per-instance state can leak between deployments.Suggested fix
- inputOffsets = {}): + inputOffsets = None): + if inputOffsets is None: + inputOffsets = {} super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name, default_channels_first, deeployStateDir, inputOffsets)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Deployer.py` around lines 19 - 30, The constructor __init__ currently uses a mutable default inputOffsets = {} which shares state across instances; change the parameter default to inputOffsets: Optional[Dict] = None and inside __init__ (before calling super().__init__) set inputOffsets = {} if it's None, then pass that local dict into super().__init__; update the function signature and ensure any references to inputOffsets in NE16Deployer.__init__ use the per-instance dict created when None is supplied.Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (1)
90-106:⚠️ Potential issue | 🟠 MajorWiden before subtracting
weight_offset.If
weightTensor.valuesisint8,values - weight_offsethappens in the narrow dtype and can wrap before the later cast touint8, corrupting the encoded weights.Suggested fix
- weight_offset = values.min() - values = values - weight_offset + weight_offset = int(values.min()) + values = values.astype(np.int32) - weight_offset node.attrs["weight_offset"] = weight_offset🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 90 - 106, The subtraction values = values - weight_offset can underflow when weightTensor.values is int8; before computing the offset and subtracting, widen values to a safe integer type (e.g., int16/int32) so the subtraction is performed without wraparound, then clip/normalize and finally cast to np.uint8 before calling _weightEncode; update the operations around weight_offset, values, and the subsequent transpose/astype steps (referencing weight_offset, values, weightTensor.values, and _weightEncode) to perform widening, subtraction, any needed clamping, and only then cast to uint8 for encoding.Deeploy/Targets/NE16/Engine.py (1)
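The int8 wraparound behind this fix is easy to reproduce with NumPy:

```python
import numpy as np

w = np.array([-128, -1, 0, 127], dtype = np.int8)

# Subtraction in the narrow dtype wraps silently: 127 - (-128) overflows int8.
wrapped = w - w.min()

# Widen first, then subtract; the result spans 0..255 and a final
# uint8 cast is now safe.
widened = w.astype(np.int32) - int(w.min())
encoded = widened.astype(np.uint8)

print(int(wrapped[-1]), int(encoded[-1]))  # → -1 255
```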
62-75: ⚠️ Potential issue | 🟠 Major

Grouped convolutions are still being classified as NE16 PW/DW.

`isPWConv()` still accepts grouped 1x1 convs, and `isDWConv()` still accepts any grouped 3x3 conv. That will color generic grouped convolutions for NE16 even though the downstream paths only handle regular pointwise (group == 1) or true depthwise weights.

🤖 Prompt for AI Agents
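Stricter predicates of the kind the comment asks for can be sketched over plain attribute dicts (hypothetical helpers, not the actual Engine.py code; attribute names follow the ONNX Conv convention):

```python
def is_pointwise(attrs: dict) -> bool:
    # True pointwise: 1x1 kernel AND ungrouped (group == 1).
    return (attrs["kernel_shape"] == [1, 1] and attrs["group"] == 1
            and attrs["dilations"] == [1, 1] and attrs["strides"] == [1, 1])

def is_depthwise(attrs: dict, in_channels: int) -> bool:
    # True depthwise: one group per input channel, not just "any grouped conv".
    return attrs["kernel_shape"] == [3, 3] and attrs["group"] == in_channels

grouped_1x1 = {"kernel_shape": [1, 1], "group": 4, "dilations": [1, 1], "strides": [1, 1]}
plain_1x1 = dict(grouped_1x1, group = 1)
grouped_3x3 = {"kernel_shape": [3, 3], "group": 4}

print(is_pointwise(grouped_1x1))      # → False: grouped 1x1 is not plain pointwise
print(is_pointwise(plain_1x1))        # → True
print(is_depthwise(grouped_3x3, 16))  # → False: group != input channels
print(is_depthwise(grouped_3x3, 4))   # → True
```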
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, isPWConv currently allows grouped 1x1 convs and isDWConv accepts any grouped 3x3 conv; change the predicates to only mark true pointwise (group == 1) and true depthwise (group == number of input channels) respectively: update isPWConv to include node.attrs['group'] == 1 in its condition, and update isDWConv to require node.attrs['group'] equal the convolution's input channel count (i.e., only treat it as depthwise when group == input_channels) while keeping the existing kernel_shape, dilation and stride checks.Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)
43-46: ⚠️ Potential issue | 🟠 Major

Model pointwise output geometry from stride, not identity.

These constraints only describe stride-1 pointwise conv. If `NE16Engine.enableStrides` is ever enabled, `H_out`/`W_out` become wrong immediately because the tiler still forces `outputHeightVar == inputHeightVar` and `outputWidthVar == inputWidthVar`.

Suggested fix

         # Map output dims to inputs dims
         tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-        tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-        tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+        strides = parseDict["strides"]
+        padding = parseDict["pads"]
+        inputBuffer = ctxt.lookup(inputBufferName)
+        effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+        effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+        tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+        tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines 43 - 46, The current constraints in NE16PointwiseConstraint map outputHeightVar and outputWidthVar directly to inputHeightVar/inputWidthVar which only works for stride=1; update the constraints to compute output dims from the pointwise stride when NE16Engine.enableStrides is used (use the convolution output formula for 1x1: H_out = ceil(inputHeight / strideH), W_out = ceil(inputWidth / strideW)). Replace the equalities on outputHeightVar and outputWidthVar with expressions that use the pointwise stride variables (e.g., strideH/strideW or the engine's stride config) and implement the integer ceil division ((input + stride - 1) / stride) so tilerModel.addConstraint uses outputHeightVar == (inputHeightVar + strideHVar - 1) / strideHVar and similarly for width; keep the batch mapping (outputBatchVar == inputBatchVar) as-is.Deeploy/Targets/NE16/Templates/ConvTemplate.py (1)
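The ceil-division output formula the suggested fix relies on can be checked standalone (hypothetical helper, not Deeploy code):

```python
def conv_out(size: int, kernel: int, stride: int,
             pad_lo: int, pad_hi: int, dilation: int = 1) -> int:
    # Standard ONNX convolution output-size formula.
    eff_kernel = dilation * (kernel - 1) + 1
    return (size + pad_lo + pad_hi - eff_kernel) // stride + 1

# For a 1x1 kernel with no padding this reduces to ceil(size / stride),
# i.e. (size + stride - 1) // stride.
for size in range(1, 20):
    for stride in range(1, 5):
        assert conv_out(size, 1, stride, 0, 0) == (size + stride - 1) // stride

print(conv_out(7, 1, 1, 0, 0), conv_out(7, 1, 2, 0, 0))  # → 7 4
```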
352-356: ⚠️ Potential issue | 🔴 Critical

Use a designated field initializer for `weights_stride`.

`task.data.cfg.weights_stride = ...` is not valid inside a C compound literal. This needs to be `.weights_stride = ...` like the neighboring fields, otherwise the generated kernel code will not compile.

Proposed fix

-        task.data.cfg.weights_stride = (ne16_stride_t) {
+        .weights_stride = (ne16_stride_t) {
             .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2}
         },

Is `task.data.cfg.weights_stride = ...` valid inside a C designated initializer, or must it be written as `.weights_stride = ...`?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356, Replace the invalid in-line assignment inside the C compound literal with a designated initializer: change the occurrence of "task.data.cfg.weights_stride = (ne16_stride_t) { ... }" so that within the compound literal for task.data.cfg you use ".weights_stride = (ne16_stride_t) { .d0 = ..., .d1 = ..., .d2 = ... }" (i.e., reference the struct member as .weights_stride) so the generated kernel compiles; update the initializer near the other designated fields in the same compound literal where task.data.cfg is constructed.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/__init__.py (1)
5-5: Consider using explicit submodule imports for clarity.

The `from . import *` statement in a package `__init__.py` doesn't automatically import submodules; it only imports names from `__all__` (if defined) in the current package's namespace. For a package that simply exposes its submodules, explicit imports are clearer:

from . import NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers

This same pattern appears in several other new NE16 `__init__.py` files. If the current behavior is intentional (relying on consumers to import submodules directly), this can remain as-is.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/__init__.py` at line 5, Replace the wildcard import in the package __init__ (from . import *) with explicit submodule imports to make exported symbols clear: import the submodules NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint and RequantHelpers (e.g., from . import NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers get those modules via the package namespace; alternatively, if you intended to expose names rather than modules, define an explicit __all__ listing the public symbols instead of using a star import.
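That `from . import *` in a package `__init__.py` does not bind submodules can be verified with a throwaway package (all names below are hypothetical):

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package on disk: mypkg/__init__.py uses the pattern
# under discussion, and mypkg/sub.py is a sibling submodule.
pkg_root = tempfile.mkdtemp()
pkg = os.path.join(pkg_root, "mypkg")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("from . import *\n")
with open(os.path.join(pkg, "sub.py"), "w") as f:
    f.write("VALUE = 42\n")

sys.path.insert(0, pkg_root)
import mypkg

had_sub = hasattr(mypkg, "sub")
print(had_sub)  # the wildcard import never bound the submodule

sub = importlib.import_module("mypkg.sub")  # an explicit import works
print(sub.VALUE)  # → 42
```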
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 32-40: The constructor is mutating the shared lowering optimizer
in place (self.loweringOptimizer.passes) which causes cross-deployer pollution;
instead clone the optimizer or at least its passes list before modifying it
(e.g., deep-copy the optimizer or assign a new passes list based on
list(self.loweringOptimizer.passes)) and then perform the PULPNCHWtoNHWCPass ->
NCHWtoNHWCPass replacement and append ConvEngineDiscolorationPass() and
NE16OptimizationPass(self.default_channels_first, "NE16") to the cloned
optimizer so the module-scoped NE16Optimizer from Platform.py is not mutated
across instances.
In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-40: The NE16 parser's parseNodeCtxt currently defaults
channels_first=True but the implementation only accepts False; change the
default parameter in parseNodeCtxt to channels_first: bool = False and update
the three RQS override method signatures to also default channels_first to False
so callers use NHWC by default and the method signatures match the only
supported layout (refer to parseNodeCtxt and the three RQS override methods to
locate and update their signatures).
In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 139-150: The _findAllMultiplicands function currently uses float
division and an exclusive upper bound which breaks integer factorization; change
the loop to iterate i from 2 through int(math.sqrt(x)) inclusive (or use
math.isqrt(x)+1) and use integer division (tmpX //= i) when extracting factors
so tmpX stays an int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
---
Duplicate comments:
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 19-30: The constructor __init__ currently uses a mutable default
inputOffsets = {} which shares state across instances; change the parameter
default to inputOffsets: Optional[Dict] = None and inside __init__ (before
calling super().__init__) set inputOffsets = {} if it's None, then pass that
local dict into super().__init__; update the function signature and ensure any
references to inputOffsets in NE16Deployer.__init__ use the per-instance dict
created when None is supplied.
In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: isPWConv currently allows grouped 1x1 convs and isDWConv
accepts any grouped 3x3 conv; change the predicates to only mark true pointwise
(group == 1) and true depthwise (group == number of input channels)
respectively: update isPWConv to include node.attrs['group'] == 1 in its
condition, and update isDWConv to require node.attrs['group'] equal the
convolution's input channel count (i.e., only treat it as depthwise when group
== input_channels) while keeping the existing kernel_shape, dilation and stride
checks.
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: Replace the invalid in-line assignment inside the C
compound literal with a designated initializer: change the occurrence of
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" so that within the
compound literal for task.data.cfg you use ".weights_stride = (ne16_stride_t) {
.d0 = ..., .d1 = ..., .d2 = ... }" (i.e., reference the struct member as
.weights_stride) so the generated kernel compiles; update the initializer near
the other designated fields in the same compound literal where task.data.cfg is
constructed.
In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-46: The current constraints in NE16PointwiseConstraint map
outputHeightVar and outputWidthVar directly to inputHeightVar/inputWidthVar
which only works for stride=1; update the constraints to compute output dims
from the pointwise stride when NE16Engine.enableStrides is used (use the
convolution output formula for 1x1: H_out = ceil(inputHeight / strideH), W_out =
ceil(inputWidth / strideW)). Replace the equalities on outputHeightVar and
outputWidthVar with expressions that use the pointwise stride variables (e.g.,
strideH/strideW or the engine's stride config) and implement the integer ceil
division ((input + stride - 1) / stride) so tilerModel.addConstraint uses
outputHeightVar == (inputHeightVar + strideHVar - 1) / strideHVar and similarly
for width; keep the batch mapping (outputBatchVar == inputBatchVar) as-is.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 90-106: The subtraction values = values - weight_offset can
underflow when weightTensor.values is int8; before computing the offset and
subtracting, widen values to a safe integer type (e.g., int16/int32) so the
subtraction is performed without wraparound, then clip/normalize and finally
cast to np.uint8 before calling _weightEncode; update the operations around
weight_offset, values, and the subsequent transpose/astype steps (referencing
weight_offset, values, weightTensor.values, and _weightEncode) to perform
widening, subtraction, any needed clamping, and only then cast to uint8 for
encoding.
---
Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the wildcard import in the package __init__ (from . import *)
with explicit submodule imports to make exported symbols clear: import the
submodules NE16DenseConstraint, NE16DepthwiseConstraint, NE16PointwiseConstraint
and RequantHelpers (e.g., from . import NE16DenseConstraint,
NE16DepthwiseConstraint, NE16PointwiseConstraint, RequantHelpers) so consumers
get those modules via the package namespace; alternatively, if you intended to
expose names rather than modules, define an explicit __all__ listing the public
symbols instead of using a star import.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: eaa00d35-32b5-49a6-8eeb-a015c49962b3
📒 Files selected for processing (35)
- .github/workflows/_runner-gap9-w-ne16-tiled.yml
- .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
- CMakeLists.txt
- Deeploy/Targets/NE16/Bindings.py
- Deeploy/Targets/NE16/Deployer.py
- Deeploy/Targets/NE16/Engine.py
- Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
- Deeploy/Targets/NE16/OptimizationPasses/__init__.py
- Deeploy/Targets/NE16/Parsers.py
- Deeploy/Targets/NE16/Platform.py
- Deeploy/Targets/NE16/Templates/AllocateTemplate.py
- Deeploy/Targets/NE16/Templates/ConvTemplate.py
- Deeploy/Targets/NE16/Templates/__init__.py
- Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
- Deeploy/Targets/NE16/TileConstraints/__init__.py
- Deeploy/Targets/NE16/Tiler.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
- Deeploy/Targets/NE16/__init__.py
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- DeeployTest/CMakeLists.txt
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
- DeeployTest/conftest.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/testMVP.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/platformMapping.py
- DeeployTest/test_gap9_ne16_tiled_config.py
- DeeployTest/test_platforms.py
- TargetLibraries/GAP9/CMakeLists.txt
✅ Files skipped from review due to trivial changes (4)
- DeeployTest/conftest.py
- Deeploy/Targets/NE16/Tiler.py
- DeeployTest/testUtils/core/execution.py
- Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
🚧 Files skipped from review as they are similar to previous changes (9)
- DeeployTest/testUtils/platformMapping.py
- TargetLibraries/GAP9/CMakeLists.txt
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- .github/workflows/_runner-gap9-w-ne16-tiled.yml
- .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
- Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- Deeploy/Targets/NE16/Bindings.py
- DeeployTest/test_gap9_ne16_tiled_config.py
def parseNodeCtxt(self,
                  ctxt: NetworkContext,
                  node: gs.Node,
                  channels_first: bool = True) -> Tuple[NetworkContext, bool]:
Make the NE16 parser default to NHWC.
parseNodeCtxt() defaults channels_first to True, but Line 56 rejects anything except False. Any caller that uses the default path will fail to parse NE16 convolutions. Flip this default here and mirror it in the three RQS overrides so the method signature matches the only supported layout.
Proposed fix
- def parseNodeCtxt(self,
- ctxt: NetworkContext,
- node: gs.Node,
- channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+ def parseNodeCtxt(self,
+ ctxt: NetworkContext,
+ node: gs.Node,
+ channels_first: bool = False) -> Tuple[NetworkContext, bool]:
@@
- channels_first == False,
+ not channels_first,

- def parseNodeCtxt(self,
- ctxt: NetworkContext,
- node: gs.Node,
- channels_first: bool = True) -> Tuple[NetworkContext, bool]:
+ def parseNodeCtxt(self,
+ ctxt: NetworkContext,
+ node: gs.Node,
+ channels_first: bool = False) -> Tuple[NetworkContext, bool]:

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 40, The NE16 parser's
parseNodeCtxt currently defaults channels_first=True but the implementation only
accepts False; change the default parameter in parseNodeCtxt to channels_first:
bool = False and update the three RQS override method signatures to also default
channels_first to False so callers use NHWC by default and the method signatures
match the only supported layout (refer to parseNodeCtxt and the three RQS
override methods to locate and update their signatures).
weightBuffer = ctxt.lookup(weightBufferName)
# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
# Keep the outermost dim fixed at its full (=1) value regardless of
# the output channel tiling.
tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
Pin the whole packed DW weight tensor, not just dim 0.
Line 57 fixes only weightShape[0], but Lines 227-228 always emit a load for the full encoded weight block. That lets the tiler under-account weight memory by shrinking the packed-bytes axis, then codegen still DMA-loads the entire tensor. Constrain every dimension of the encoded DW weight buffer to Max() for this path.
Proposed fix
weightBuffer = ctxt.lookup(weightBufferName)
# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
- # Keep the outermost dim fixed at its full (=1) value regardless of
- # the output channel tiling.
- tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+ # Keep the full encoded block resident regardless of output-channel tiling.
+ for dimIdx in range(len(weightBuffer.shape)):
+ weightDimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+ tilerModel.addConstraint(weightDimVar == weightDimVar.Max())

Also applies to: 225-228
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines
52 - 58, The current constraint only pins weightShape[0] via
weightOutChannelVar.Max(), but the DW packed weight tensor is emitted/loaded as
a full encoded block (so all dims must be pinned); update the tilerModel
constraints for the packed DW weight (located where weightBuffer =
ctxt.lookup(weightBufferName) and tilerModel.addConstraint(...) is called) to
set every shape dimension of that buffer to its Max() instead of only
weightOutChannelVar—e.g., iterate over the buffer's shape variables (weightShape
or equivalent per-dimension vars obtained from weightBuffer) and call
tilerModel.addConstraint(dimVar == dimVar.Max()) for each dimension so the tiler
cannot shrink any axis while codegen still DMA-loads the full packed block.
def _findAllMultiplicands(x: int) -> List[int]:
    multiplicands = []
    tmpX = x
    for i in range(2, math.ceil(math.sqrt(x))):  # Ceil cause range doesn't include the last number
        while tmpX % i == 0:
            multiplicands.append(i)
            tmpX = tmpX / i

    if x // math.prod(multiplicands) > 1:
        multiplicands.append(x // math.prod(multiplicands))

    return multiplicands
Use integer factorization here.
tmpX = tmpX / i turns the accumulator into a float, and the current loop bound skips exact square roots. For dimensions like 4 or 49 this misses balanced factor pairs, so _bestReshapeOption() falls back to (dim, 1) and defeats the pointwise-reshape heuristic.
Suggested fix
def _findAllMultiplicands(x: int) -> List[int]:
multiplicands = []
tmpX = x
- for i in range(2, math.ceil(math.sqrt(x))): # Ceil cause range doesn't include the last number
+ for i in range(2, math.isqrt(x) + 1):
while tmpX % i == 0:
multiplicands.append(i)
- tmpX = tmpX / i
+ tmpX //= i
- if x // math.prod(multiplicands) > 1:
- multiplicands.append(x // math.prod(multiplicands))
+ if tmpX > 1:
+ multiplicands.append(tmpX)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 -
150, The _findAllMultiplicands function currently uses float division and an
exclusive upper bound which breaks integer factorization; change the loop to
iterate i from 2 through int(math.sqrt(x)) inclusive (or use math.isqrt(x)+1)
and use integer division (tmpX //= i) when extracting factors so tmpX stays an
int; ensure the final remaining factor appended for x //
math.prod(multiplicands) is computed with integer arithmetic and that
multiplicands contains ints so callers like _bestReshapeOption receive correct
balanced factor pairs.
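A corrected standalone version of the factorization helper, incorporating both fixes from the diff above (inclusive `math.isqrt` bound and floor division):

```python
import math
from typing import List

def find_all_multiplicands(x: int) -> List[int]:
    # Prime factorization using integer arithmetic only.
    factors: List[int] = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):  # inclusive of exact square roots
        while tmp % i == 0:
            factors.append(i)
            tmp //= i                      # floor division keeps tmp an int
    if tmp > 1:                            # remaining prime factor, if any
        factors.append(tmp)
    return factors

# Perfect squares like 4 and 49 now yield balanced factor pairs
# instead of falling back to (dim, 1).
print(find_all_multiplicands(4))   # → [2, 2]
print(find_all_multiplicands(49))  # → [7, 7]
print(find_all_multiplicands(60))  # → [2, 2, 3, 5]
```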
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers, Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses). The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile constants set per ne16_task_defs.h (output 3x3, weight stride bytes PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16 alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define it locally via PointerClass(float32_t) to unblock the import chain reached by NE16Platform.

Verified on gvsoc gap9.evk:
- PW 1x1 RQ (Regular_RQ): 0/1152 errors, 901917 cycles
- DW 3x3 RQ (DW_2D_RQ): 0/1280 errors, 27339 cycles (--enable-3x3)
- Dense 3x3 (Regular_2D_RQ): 0/6372 errors, 244595 cycles (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
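The algebra behind folding a Dequant→Quant pair into a single requant step can be sketched numerically (scale values here are hypothetical; the real pass operates on graph nodes, not arrays):

```python
import numpy as np

x = np.arange(-4, 4, dtype = np.int32)

dequant_scale = 0.5   # Dequant: x_f = x * dequant_scale
quant_scale = 0.25    # Quant:   y   = round(x_f / quant_scale)

# Two-step reference: dequantize, then requantize.
y_ref = np.round((x * dequant_scale) / quant_scale).astype(np.int32)

# Folded form: one multiplication by the combined ratio.
combined = dequant_scale / quant_scale  # == 2.0 here
y_folded = np.round(x * combined).astype(np.int32)

print((y_ref == y_folded).all())  # the fold is exact
# When the combined ratio is 1, the pair folds to identity.
```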
Force-pushed b8087fc to b3f40e5 (Compare)
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 → CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine with a trimmed includeList (no CNN_BasicKernels_NE16.h / ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h; the NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's private GitHub Container Registry. Only upstream's self-hosted runners have credentials to pull it; on fork CI runs (ubuntu-latest) the docker pull fails with 'Error response from daemon: denied' and the whole job is reported as failure. Guard the select-env entry of all three gap9 workflows (ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings only registered the signed-int8 output variant, so any Quant pattern with signed=0 (e.g. 4-bit unsigned quantization in Models/Transformer_DeepQuant) had no candidate and parsing exhausted backtracking. Add uint8 output to BasicQuantBindings and uint8 input to BasicDequantBindings in both Targets/Generic/Bindings.py and Targets/PULPOpen/Bindings.py. Verified: Models/Transformer_DeepQuant network gen now succeeds for both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.
(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)
Actionable comments posted: 14
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
Deeploy/Targets/GAP9/Platform.py (1)
8-54: ⚠️ Potential issue | 🟡 Minor

Run the formatter/import hooks for this file.

CI reports both yapf and isort modified this file, so the branch is not passing pre-commit yet.

#!/bin/bash
pre-commit run yapf isort --files Deeploy/Targets/GAP9/Platform.py

Also applies to: 281-283
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/GAP9/Platform.py` around lines 8 - 54, The file containing the long import block (look for symbols like DeQuantTilingReadyBindings, BasicAllocateTemplate, RemoveEmptyConvBiasPass and PULPDMASliceBindings) is not formatted/organized per CI; run the import and formatter hooks (isort and yapf) to reorder and format the imports/line breaks, then stage and commit the resulting changes so pre-commit passes (this also applies to the nearby import group that includes PULP* bindings and parsers).
♻️ Duplicate comments (12)
Deeploy/Targets/NE16/__init__.py (1)
5-5: ⚠️ Potential issue | 🟠 Major

Replace the self-import with explicit NE16 module exports.

This remains unresolved: `from . import *` does not expose the new NE16 sibling modules. Use explicit imports and include `Tiler` as well.

Proposed fix

-from . import *
+from . import Bindings, Deployer, Engine, OptimizationPasses, Parsers, Platform, Tiler, Templates, TileConstraints, TopologyOptimizationPasses
+
+__all__ = [
+    "Bindings",
+    "Deployer",
+    "Engine",
+    "OptimizationPasses",
+    "Parsers",
+    "Platform",
+    "Tiler",
+    "Templates",
+    "TileConstraints",
+    "TopologyOptimizationPasses",
+]

#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/__init__.py` at line 5, Replace the wildcard self-import in Deeploy/Targets/NE16/__init__.py with explicit exports: import the sibling module objects you want to expose (e.g., the NE16 device class/name and the Tiler class) by name and populate __all__ accordingly so consumers can access NE16 and Tiler directly; specifically remove "from . import *", add explicit imports of the NE16 module symbols and Tiler, and set __all__ = ["NE16", "Tiler", ...] to list the public API.Deeploy/Targets/NE16/Deployer.py (2)
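The wildcard-import behavior called out above can be reproduced in isolation. A minimal sketch with a throwaway package (the package and module names here are invented, not the Deeploy layout): a self-referential `from . import *` in `__init__.py` never loads sibling modules, while an explicit import does.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
pkg = root / "ne16demo"
pkg.mkdir()
(pkg / "Tiler.py").write_text("VALUE = 42\n")

# Variant 1: the flagged pattern, a self-referential wildcard import
(pkg / "__init__.py").write_text("from . import *\n")
sys.path.insert(0, str(root))
mod = importlib.import_module("ne16demo")
wildcard_sees_tiler = hasattr(mod, "Tiler")  # submodule was never loaded

# Variant 2: explicit import plus __all__
(pkg / "__init__.py").write_text("from . import Tiler\n__all__ = ['Tiler']\n")
for name in list(sys.modules):
    if name.startswith("ne16demo"):
        del sys.modules[name]
importlib.invalidate_caches()
mod = importlib.import_module("ne16demo")
explicit_sees_tiler = hasattr(mod, "Tiler")
```

Without `__all__`, `from package import *` only copies names already bound in the package namespace; it does not import submodules, which is exactly why consumers of the NE16 package cannot reach its siblings.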
28-30: ⚠️ Potential issue | 🟠 Major
Use
`None` for the `inputOffsets` default.
`{}` is shared across `NE16Deployer` instances, so parent initialization can leak offsets between deployments.
Proposed fix
-                 inputOffsets = {}):
+                 inputOffsets = None):
+        if inputOffsets is None:
+            inputOffsets = {}
         super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

Verification:
#!/bin/bash
python -m ruff check Deeploy/Targets/NE16/Deployer.py --select B006

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Deployer.py` around lines 28 - 30, The constructor NE16Deployer.__init__ currently uses a mutable default for the inputOffsets parameter (inputOffsets = {}), which can be shared across instances; change the signature to use inputOffsets=None and then when calling super().__init__ pass an empty dict when inputOffsets is None (e.g., use inputOffsets or {}), ensuring NE16Deployer.__init__ does not leak offsets between deployments.
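The mutable-default pitfall behind this comment is easy to demonstrate outside Deeploy; a standalone sketch (class names are illustrative, not the real `NE16Deployer` API). The `{}` default is evaluated once at function-definition time and shared by every call that omits the argument.

```python
class LeakyDeployer:
    def __init__(self, inputOffsets = {}):  # BUG: one dict shared by all instances
        self.inputOffsets = inputOffsets

class SafeDeployer:
    def __init__(self, inputOffsets = None):  # fix: sentinel, fresh dict per call
        self.inputOffsets = {} if inputOffsets is None else inputOffsets

a, b = LeakyDeployer(), LeakyDeployer()
a.inputOffsets["x"] = 1  # mutates the shared default, so b sees it too

c, d = SafeDeployer(), SafeDeployer()
c.inputOffsets["x"] = 1  # d keeps its own empty dict
```

This is the exact pattern Ruff's B006 check flags in the verification command above.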
29-40: ⚠️ Potential issue | 🟠 Major
Avoid mutating the shared optimizer pass list.
`self.loweringOptimizer.passes` is modified in place. Since the NE16 optimizer is constructed at module scope, repeated deployer creation can append duplicate NE16 passes and alter other deployments’ lowering pipelines.
Proposed fix
+import copy
 ...
-        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
+        loweringOptimizer = copy.copy(loweringOptimizer)
+        loweringOptimizer.passes = list(loweringOptimizer.passes)
+
+        super().__init__(graph, deploymentPlatform, inputTypes, loweringOptimizer, scheduler, name,
                          default_channels_first, deeployStateDir, inputOffsets)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Deployer.py` around lines 29 - 40, The code mutates the shared loweringOptimizer.passes list causing duplicate/side-effect passes; fix by creating a new list copy and assigning it back instead of modifying in place: make a local copy (e.g., new_passes = list(self.loweringOptimizer.passes)), perform the replacement of PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes when self.Platform.engines[0].enable3x3 is true, then set self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(), NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass, ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer to avoid mutating shared state.Deeploy/Targets/NE16/Parsers.py (2)
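The copy-before-extend pattern recommended above can be shown in miniature; `Optimizer` below is a stand-in, not the Deeploy class. Shallow-copying the wrapper and rebinding a fresh list keeps the module-level instance untouched while each deployer gets its own extended pipeline.

```python
import copy

class Optimizer:
    def __init__(self, passes):
        self.passes = passes

# Constructed once at module scope, shared by every deployer
SHARED_OPTIMIZER = Optimizer(["PULPNCHWtoNHWCPass"])

def make_deployer_optimizer(optimizer, extra_passes):
    optimizer = copy.copy(optimizer)           # new wrapper object
    optimizer.passes = list(optimizer.passes)  # new list, same pass objects
    optimizer.passes += extra_passes           # safe: only the copy grows
    return optimizer

first = make_deployer_optimizer(SHARED_OPTIMIZER, ["NE16OptimizationPass"])
second = make_deployer_optimizer(SHARED_OPTIMIZER, ["NE16OptimizationPass"])
```

Appending to `SHARED_OPTIMIZER.passes` directly would instead duplicate `NE16OptimizationPass` on every deployer construction.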
37-56:⚠️ Potential issue | 🟠 Major
channels_firstdefault is stillTruewhile the check rejects everything butFalse.
parseNodeCtxtdefaultschannels_first=Trueon line 40 (and the three RQS overrides at lines 113, 153, 193 also default toTrue), but line 56 rejects anything exceptFalse. Any caller that uses the default path returns(ctxt, False)and the NE16 parser fails to bind. Flip all four defaults toFalse, and replace the== Falsecheck withnot channels_first.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Parsers.py` around lines 37 - 56, The parseNodeCtxt method in NE16's Parsers.py currently defaults channels_first=True but immediately rejects any True value (using "channels_first == False"); change the default of parseNodeCtxt to channels_first=False and also update the three RQS override methods that default channels_first=True to default to False, then simplify the conditional check from "channels_first == False" to "not channels_first" (ensure you update all occurrences in parseNodeCtxt and the three RQS override methods so callers using defaults succeed).
118-123: ⚠️ Potential issue | 🔴 Critical
Input mapping still has no bounds check and
`shift` is never set from inputs/attrs.
parseNodeCtxtoverrides (lines 119-121 / 159-161 / 199-201) unconditionally mapnode.inputs[idx]into['data_in','weight','mul','add']. If a 5th tensor is present (some RQS pipelines emitshiftas an input) this throwsIndexError, and if it's absent, the template variable${shift}(referenced byConvTemplate.pyline 339scale_shift_addr = (uint32_t)${shift}) stays"NULL"with no validation. Either validatelen(node.inputs) == 4explicitly or map a 5th input/node.attrs['shift']intooperatorRepresentation['shift'].🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Parsers.py` around lines 118 - 123, The parseNodeCtxt overrides unconditionally map node.inputs into inputs = ['data_in','weight','mul','add'] causing IndexError for a 5th tensor and leaving operatorRepresentation['shift'] unset; update parseNodeCtxt (the methods that populate self.operatorRepresentation) to first check len(node.inputs) and either assert/raise if it is not 4 or 5, and if a 5th input exists map it to self.operatorRepresentation['shift'] = ctxt.lookup(node.inputs[4].name).name, and additionally if 'shift' can come from node.attrs check node.attrs.get('shift') and set operatorRepresentation['shift'] accordingly so ConvTemplate.py's ${shift} is always defined or an explicit error is raised.Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py (2)
88-102:⚠️ Potential issue | 🟠 Majorint8 subtraction still overflows before cast to
uint8.
values - weight_offsetexecutes invalues.dtype. For int8 weights,weight_offset = values.min()can be-128, so any127element wraps to-1and becomes255after the laterastype(uint8)— wrong absolute offset, wrong encoded weights. Please widen before subtracting.🐛 Proposed fix
- weight_offset = values.min() - values = values - weight_offset + weight_offset = int(values.min()) + values = values.astype(np.int32) - weight_offset node.attrs["weight_offset"] = weight_offset🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 88 - 102, The subtraction "values = values - weight_offset" risks int8 overflow; cast the tensor to a wider signed dtype (e.g., int16 or int32) before computing "weight_offset" subtraction so arithmetic is done in a safe width, perform "values = values.astype(wider_dtype) - int(weight_offset)" then (after any necessary range checks/clamping) cast to uint8 for encoding; update the code paths around weightTensor.values, weight_offset and the subsequent astype(uint8) to use the widened intermediate to avoid wraparound for int8 inputs.
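The in-dtype wraparound described above reproduces with a three-element tensor (the values are made up for illustration). Subtracting the minimum inside int8 wraps the intermediates, while widening to int32 first keeps them exact.

```python
import numpy as np

values = np.array([-128, 0, 127], dtype = np.int8)
weight_offset = int(values.min())  # -128

# In-dtype subtraction: true offsets are [0, 128, 255], but int8
# arithmetic folds them back into [-128, 127].
in_dtype = values - np.int8(weight_offset)

# Widened subtraction keeps the intermediate offsets exact.
widened = values.astype(np.int32) - weight_offset
```

Any range check, clamp, or comparison applied to the signed intermediates sees garbage in the first variant, which is why the review asks for the widen-first order.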
139-150: ⚠️ Potential issue | 🟠 Major
`_findAllMultiplicands` still uses float division and an exclusive sqrt bound.
`range(2, math.ceil(math.sqrt(x)))` misses the sqrt itself (e.g., for `x = 4` / `49` the loop is empty so `multiplicands == []` and `math.prod([]) == 1`, returning `[x]`). Also `tmpX = tmpX / i` converts the accumulator to `float`, which then fails `tmpX % i == 0` for large factors. Result: `_bestReshapeOption` collapses to `(dim, 1)` for many common spatial sizes and the pointwise-reshape heuristic never fires.
♻️ Proposed fix
 def _findAllMultiplicands(x: int) -> List[int]:
     multiplicands = []
     tmpX = x
-    for i in range(2, math.ceil(math.sqrt(x))): # Ceil cause range doesn't include the last number
+    for i in range(2, math.isqrt(x) + 1):
         while tmpX % i == 0:
             multiplicands.append(i)
-            tmpX = tmpX / i
-
-    if x // math.prod(multiplicands) > 1:
-        multiplicands.append(x // math.prod(multiplicands))
+            tmpX //= i
+    if tmpX > 1:
+        multiplicands.append(tmpX)
     return multiplicands
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py` around lines 139 - 150, The function _findAllMultiplicands uses float division and an exclusive sqrt bound; change the loop to iterate i from 2 up to and including the integer square root (use math.isqrt(x) or int(math.sqrt(x)) + 1) and perform integer division on tmpX (use //= instead of /=) so tmpX remains an int for the modulo checks; also simplify the tail handling by appending tmpX if tmpX > 1 (rather than recomputing x // math.prod(...)) so _findAllMultiplicands reliably returns prime factors for inputs like 4 or 49 and avoids float-related modulo failures.Deeploy/Targets/NE16/Engine.py (1)
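The repaired factorization helper can be exercised standalone; a sketch of the suggested inclusive-isqrt, integer-division version. Perfect squares and prime inputs, exactly the cases the review says the original missed, factor correctly.

```python
import math
from typing import List

def find_all_multiplicands(x: int) -> List[int]:
    multiplicands: List[int] = []
    tmp = x
    for i in range(2, math.isqrt(x) + 1):  # inclusive of the integer sqrt
        while tmp % i == 0:
            multiplicands.append(i)
            tmp //= i                      # stays an int, unlike `/`
    if tmp > 1:                            # leftover prime factor > sqrt(x)
        multiplicands.append(tmp)
    return multiplicands
```

With the float accumulator and exclusive bound, `find_all_multiplicands(49)` would return `[49]` instead of `[7, 7]`, collapsing the reshape search space.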
62-75: ⚠️ Potential issue | 🟠 Major
`isPWConv` / `isDWConv` still over-accept grouped convs.
group == 1for PW or assert the DW shape condition (group == ch_im_outwithweights.shape[1] == 1). With the parsers enforcing stricter rules downstream, this meanscanExecutecan return True for a node that then fails parsing/tiling, instead of cleanly falling back to the GAP9 cluster engine.🔒️ Proposed fix
def isPWConv(self, node) -> bool: return node.op in ["Conv", "RequantizedConv"] and \ isinstance(node.inputs[1], gs.Constant) and \ node.attrs['kernel_shape'] == [1, 1] and \ node.attrs['dilations'] == [1, 1] and \ + node.attrs['group'] == 1 and \ (node.attrs['strides'] == [1, 1] or self.enableStrides) def isDWConv(self, node) -> bool: + weights = node.inputs[1].values if isinstance(node.inputs[1], gs.Constant) else None return node.op in ["Conv", "RequantizedConv"] and \ isinstance(node.inputs[1], gs.Constant) and \ node.attrs['kernel_shape'] == [3, 3] and \ node.attrs['dilations'] == [1, 1] and \ - node.attrs['group'] != 1 and \ + weights is not None and weights.shape[1] == 1 and \ + node.attrs['group'] == weights.shape[0] and \ (node.attrs['strides'] == [1, 1] or self.enableStrides)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Engine.py` around lines 62 - 75, The PW/DW classifiers are too permissive: update isPWConv to also require node.attrs['group'] == 1 to reject grouped convs, and update isDWConv to enforce depthwise conditions by checking the weight tensor on node.inputs[1] has shape where weights.shape[1] == 1 and node.attrs['group'] == node.attrs.get('channels_out' or 'ch_im_out' equivalent) (i.e., group == output channel count); locate these checks in the isPWConv and isDWConv functions and add the group and weight-shape validations so canExecute no longer returns True for convs that will fail downstream parsing/tiling.Deeploy/Targets/NE16/Templates/ConvTemplate.py (2)
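The tightened classifier predicates can be sketched on plain values instead of `gs.Node` attributes (function and parameter names below are illustrative, not the Deeploy API): pointwise additionally requires `group == 1`, depthwise requires one input channel per group and `group` equal to the output channel count.

```python
def is_pointwise(kernel_shape, dilations, group, strides, enable_strides = False):
    # 1x1 conv on the full channel dim; grouped 1x1 convs are rejected
    return (kernel_shape == [1, 1] and dilations == [1, 1]
            and group == 1
            and (strides == [1, 1] or enable_strides))

def is_depthwise(kernel_shape, dilations, group, weight_shape, strides,
                 enable_strides = False):
    # weight layout assumed (out_ch, in_ch_per_group, kH, kW)
    out_ch, in_ch_per_group = weight_shape[0], weight_shape[1]
    return (kernel_shape == [3, 3] and dilations == [1, 1]
            and in_ch_per_group == 1 and group == out_ch
            and (strides == [1, 1] or enable_strides))
```

A grouped-but-not-depthwise conv (e.g., `group = 2` with 4 input channels per group) now fails both predicates and falls back to the cluster engine instead of failing later in tiling.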
352-356: ⚠️ Potential issue | 🔴 Critical
`.weights_stride` is still the invalid `task.data.cfg.weights_stride = …` form.
task.data.cfg.weights_stride = (ne16_stride_t) { … }in the middle of thene16_cfg_tcompound literal. That is not a designated initializer and will not compile. The fix needs to land in this file.🐛 Proposed fix
     .output_stride = (ne16_stride_t) {
         .d0 = NE16_OUTPUT_BANDWIDTH_BYTES,
         .d1 = ${dim_im_out_y_stride},
         .d2 = ${dim_im_out_x_stride}
     },
-    task.data.cfg.weights_stride = (ne16_stride_t) {
+    .weights_stride = (ne16_stride_t) {
         .d0 = ${weightStrideD0},
         .d1 = ${weightStrideD1},
         .d2 = ${weightStrideD2}
     },
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 352 - 356, The template emits an invalid assignment "task.data.cfg.weights_stride = (ne16_stride_t) { ... }" inside the ne16_cfg_t compound literal; update the ConvTemplate.py rendering so it produces a designated initializer by replacing that emitted assignment with ".weights_stride = (ne16_stride_t){ .d0 = ${weightStrideD0}, .d1 = ${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e., remove the "task.data.cfg." prefix and emit the field as a proper member initializer inside the ne16_cfg_t literal), ensuring punctuation and comma placement matches other fields in the ne16_cfg_t literal.
145-151: ⚠️ Potential issue | 🟠 Major
`"add" in operatorRepresentation` still ignores the `"NULL"` sentinel.
mulis gated on!= "NULL", butuse_biasfalls back to a presence check.parseNodeCtxtinDeeploy/Targets/NE16/Parsers.pylines 74–76 writes"NULL"intooperatorRepresentation["add"]unconditionally, so for bias-less requantized convs bit 25 ofconf0ends up set whilescale_bias_addrin the emitted task struct isNULL. Please treat"NULL"as absent here too.🛡️ Proposed fix
 if operatorRepresentation["mul"] != "NULL":
     mulBuff = ctxt.lookup(operatorRepresentation["mul"])
     mulBits = mulBuff._type.referencedType.typeWidth
-    operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
-                                                         operatorRepresentation["log2D"], mulBits, "add"
-                                                         in operatorRepresentation, False)
+    use_bias = operatorRepresentation.get("add", "NULL") != "NULL"
+    operatorRepresentation["conf0"] |= getNormQuantConf0(operatorRepresentation["use_relu"],
+                                                         operatorRepresentation["log2D"], mulBits,
+                                                         use_bias, False)
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py` around lines 145 - 151, The conf0 bit for bias presence is being set when operatorRepresentation contains the key "add" even if its value is the sentinel "NULL"; update the requantized branch in ConvTemplate.py so the third argument to getNormQuantConf0 uses an explicit non-NULL check (e.g. operatorRepresentation.get("add") != "NULL") instead of '"add" in operatorRepresentation', mirroring the mul check (operatorRepresentation["mul"] != "NULL"); locate the operatorRepresentation usage around the mulBuff lookup and change the boolean expression passed to getNormQuantConf0 accordingly so bias-less convs do not set bit 25.Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py (1)
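The presence-check-versus-sentinel mismatch is a one-liner to demonstrate (dictionary contents are illustrative): when a parser always writes the string `"NULL"` for a missing bias, a key-presence test misfires, while a sentinel-aware test matches the `mul` gate.

```python
NULL = "NULL"

# A bias-less requantized conv still carries the "add" key, holding the sentinel
operator_representation = {"mul": "mul_buffer", "add": NULL}

has_bias_presence_check = "add" in operator_representation            # misfires
has_bias_sentinel_check = operator_representation.get("add", NULL) != NULL
```

The first check would set the bias bit in `conf0` even though the emitted bias address is `NULL`; the second stays consistent with the address actually written.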
43-56: ⚠️ Potential issue | 🟠 Major
Pointwise output geometry still modeled as identity: wrong for strided 1×1 / padded cases.
With
`NE16Engine.enableStrides = True`, `isPWConv` accepts `strides != [1,1]`, but `outputHeightVar == inputHeightVar` / `outputWidthVar == inputWidthVar` forces `stride == 1` and no padding. Either add `strides == [1,1]` assertions to the 1×1 path, or use the standard `(effective_input - 1) // stride + 1` relation like dense/DW do.
🔒️ Proposed fix
-    # Map output dims to inputs dims
+    strides = parseDict["strides"]
+    padding = parseDict["pads"]
+    inputBuffer = ctxt.lookup(inputBufferName)
+
+    # Map output dims to input dims (stride- and padding-aware for 1x1)
     tilerModel.addConstraint(outputBatchVar == inputBatchVar)
-    tilerModel.addConstraint(outputHeightVar == inputHeightVar)
-    tilerModel.addConstraint(outputWidthVar == inputWidthVar)
+    effectiveHeight = inputHeightVar + ((padding[0] + padding[2]) * (inputHeightVar == inputBuffer.shape[1]))
+    effectiveWidth = inputWidthVar + ((padding[1] + padding[3]) * (inputWidthVar == inputBuffer.shape[2]))
+    tilerModel.addConstraint(outputHeightVar == (effectiveHeight - 1) // strides[0] + 1)
+    tilerModel.addConstraint(outputWidthVar == (effectiveWidth - 1) // strides[1] + 1)
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py` around lines 43 - 56, The current pointwise tiler constraints incorrectly force outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where tilerModel.addConstraint is called (the lines referencing outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A) add an assertion that strides == [1,1] when isPWConv can accept strides (check NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred) replace the identity constraints with the standard convolution output relation used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1, computing effective_input = input + pad_top + pad_bottom (use whatever padding variables exist in scope) and apply that formula for both height and width via tilerModel.addConstraint.Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py (1)
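The standard output-size relation mentioned above can be checked as a standalone helper; with `kernel = 1` it reduces to the `(effective_input - 1) // stride + 1` form used for the 1×1 path (parameter names here are generic, not the tiler's variables).

```python
def conv_out_dim(in_dim, pad_lo, pad_hi, stride, kernel = 1):
    # out = floor((in + pads - kernel) / stride) + 1
    effective = in_dim + pad_lo + pad_hi
    return (effective - kernel) // stride + 1
```

For stride 1 and no padding this collapses to the identity the original constraint encoded; for any other stride the identity constraint over- or under-sizes the output tile.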
52-58: ⚠️ Potential issue | 🟠 Major
Still only pinning weight dim 0; rest of packed DW block can be shrunk by the tiler.
The depthwise weight is emitted as a single packed block (
`HyperRectangle((0,0,0), (weightShape[0], weightShape[1], weightShape[2]))` on line 228), but only dim 0 is constrained to `Max()` here. The tiler is free to pick smaller values for dims 1 and 2, under-accounting weight memory while codegen still DMAs the full block.
🔒️ Proposed fix
 weightBuffer = ctxt.lookup(weightBufferName)

-# NE16 DW weight is packed as a single (1, 1, packed_bytes) block
-# containing all output channels (up to NE16_SUBTILE_INPUT_CHANNEL=16).
-# Keep the outermost dim fixed at its full (=1) value regardless of
-# the output channel tiling.
-tilerModel.addConstraint(weightOutChannelVar == weightOutChannelVar.Max())
+# NE16 DW weight is one packed block loaded entirely every iteration;
+# pin every axis to its Max so the tiler accounts for the full block.
+for dimIdx in range(len(weightBuffer.shape)):
+    dimVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = dimIdx)
+    tilerModel.addConstraint(dimVar == dimVar.Max())
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py` around lines 52 - 58, The current code only pins weightOutChannelVar to its Max(), allowing the other two packed weight dimensions to be shrunk; fix by retrieving the remaining dimension variables for the weight buffer (the vars corresponding to dims 1 and 2 of weightBuffer) and add equivalent constraints using tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed block (all dims of weightBuffer) is fixed to full size; use weightBuffer/weightBufferName to locate the buffer and weightOutChannelVar as a reference to the existing constraint when adding the other two constraints.
🧹 Nitpick comments (1)
Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py (1)
41-44: Unused weight-dim variables.
`weightInChannelMajorVar`, `weightBitsVar`, and `weightBandwidthVar` are created but never referenced in `addGeometricalConstraint` or anywhere else. Only `weightOutChannelVar` is actually used (line 56). Either wire them into real constraints (e.g., pin the bits/bandwidth axes like the pointwise policy does) or drop them.
♻️ Proposed cleanup
 weightOutChannelVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 0)
-weightInChannelMajorVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 1)
-weightBitsVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 2)
-weightBandwidthVar = tilerModel.getTensorDimVar(tensorName = weightBufferName, dimIdx = 3)
Verify each finding against the current code and only fix it if needed. In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py` around lines 41 - 44, Remove or use the three unused weight-dim variables (weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in NE16DenseConstraint; they are allocated but never referenced in addGeometricalConstraint while only weightOutChannelVar is used. Either delete those getTensorDimVar calls to clean up the code, or wire them into the constraint logic inside addGeometricalConstraint (or the class constructor) — e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling decisions and weightInChannelMajorVar is used where input-channel-major indexing matters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py`:
- Around line 62-80: The bias handling must normalize C.values to a 1-D array of
length Ko before any signed-input compensation; instead of only truncating when
bias_flat.size > Ko, handle bias_flat.size < Ko by expanding/broadcasting (e.g.,
repeat or np.resize) to length Ko and assert the resulting shape equals Ko,
ensure dtype is appropriate (int32/int64 as needed), then perform the
signed-input compensation using w_sum, scale_values (from
operatorRepresentation['mul']) or the int32 path, and finally assign the
corrected Ko-length array back to C.values; update references around C.values,
bias_flat, Ko, w_sum and operatorRepresentation so no silent broadcasting or
out-of-bounds write occurs.
In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py`:
- Around line 116-123: The code always sets NSize = buffA.shape[-1] which
ignores transposition; update the calculation to choose the reduction dimension
based on operatorRepresentation['transA'] (and consider transB if used
elsewhere): compute NSize = buffA.shape[-2] when transA == 1 else
buffA.shape[-1], and adjust any subsequent uses that compute A/B tile offsets or
shapes so they use this transpose-aware NSize (refer to variables transA,
transB, buffA, buffB, and NSize in NE16GEMMTileConstraint.py).
- Around line 131-167: The loop over outputCubes is using cube.offset (the local
rectangle) which ignores AbsoluteHyperRectangle.absoluteOffset, causing input
tile offsets to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).
In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py`:
- Around line 80-106: Ensure RequantizedGemm handling validates all requant
constants before mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.
In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py`:
- Around line 1200-1221: The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
- Around line 1234-1238: Add a normalized "n_levels" entry alongside
"n_levels_out" when building rqs_attrs so operatorRepresentation['n_levels'] is
present for RequantShiftChecker and callers that expect the normalized
attribute; update the rqs_attrs dict in the block that creates the constants
(the rqs_attrs variable) to include a gs.Constant key 'n_levels' (using the same
normalized n_levels value used for 'n_levels_out') so both 'n_levels' and
'n_levels_out' are emitted consistently with other passes.
In `@Deeploy/Targets/NE16/OptimizationPasses/__init__.py`:
- Line 5: Replace the wildcard import in __init__.py with explicit imports from
MemoryLevelAnnotationPasses.py so the symbol AnnotateNE16WeightMemoryLevel is
exported and discoverable; specifically, import the class/function name (e.g.,
AnnotateNE16WeightMemoryLevel) from MemoryLevelAnnotationPasses and either add
it to the module namespace or to __all__ in __init__.py (or remove the
initializer entirely) so callers can successfully do from
Deeploy.Targets.NE16.OptimizationPasses import AnnotateNE16WeightMemoryLevel.
In `@Deeploy/Targets/NE16/Platform.py`:
- Around line 7-21: The file's imports are not formatted/ordered per project
hooks; run the formatter and import-sorting hooks on this module (e.g., run
pre-commit or directly run yapf and isort) to reorder and reflow the import
block so CI passes; locate the top import section (references:
RequantizedGemmToPwPass, NE16Engine, GAP9Platform, GAP9_FloatGEMMMapper,
NE16AdjustGEMMWeightLayoutPass, PULPRQSGEMMLayer) and apply yapf/isort to fix
spacing, line breaks and import grouping consistently across that block (also
apply the same fixes around lines 34-43 where similar import formatting issues
exist).
- Around line 39-43: NE16Optimizer currently reuses the same pass instances from
PULPOptimizer by using [*PULPOptimizer.passes], causing shared mutable state;
change the construction to deep-copy the pass objects instead (e.g., use
copy.deepcopy on each element of PULPOptimizer.passes) so NE16Optimizer =
TopologyOptimizer([*deep-copied-passes, NE16AdjustGEMMWeightLayoutPass(),
RequantizedGemmToPwPass()], name="NE16Optimizer"); reference
PULPOptimizer.passes and NE16Optimizer and import/use copy.deepcopy (or use a
per-pass clone method if available) to ensure independent pass instances.
In `@Deeploy/Targets/NE16/Templates/__init__.py`:
- Line 5: The package __init__.py currently uses a self-referential wildcard
import ("from . import *") which doesn't actually import sibling modules (e.g.,
the NE16 allocation and conv template modules) and triggers Ruff F403; replace
the wildcard with explicit imports of the template modules (or explicit from
.module import <symbols>) and/or populate __all__ with the actual module or
symbol names (for example add explicit references to the NE16 allocation and
conv modules) so package-level imports expose the intended templates. Ensure you
reference and import the specific module names rather than relying on "from .
import *".
In `@Deeploy/Targets/NE16/TileConstraints/__init__.py`:
- Line 5: Replace the package wildcard import in __init__.py (currently "from .
import *") with explicit relative imports of the concrete constraint modules
and/or classes so sibling modules are actually loaded and their classes are
exposed; e.g., use "from .constraint_module import ConstraintClass" for each
constraint module, and populate an __all__ list with the exported class names
(or import the modules with "from . import constraint_module" if you prefer
module exports) so the concrete constraint classes are available to consumers.
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py`:
- Line 5: The package __init__.py currently uses a wildcard import which doesn't
re-export classes from Passes.py; update
Deeploy.Targets.NE16.TopologyOptimizationPasses.__init__ to explicitly import
and re-export the four pass classes by adding an explicit import from .Passes of
NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass,
ConvEngineDiscolorationPass, and NE16OptimizationPass and ensure they are
exposed (e.g., in __all__) so consumers can import them directly from the
package.
In `@DeeployTest/test_platforms.py`:
- Around line 1008-1010: Unpack test_params using a throwaway name for the
unused config_name (e.g., test_name, l1, _ = test_params) and build the cmake
args using iterable unpacking instead of list concatenation (e.g.,
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]);
apply the same two edits for the other two occurrences that create
ne16_cmake_args in the GAP9_w_NE16 tiled tests so Ruff warnings for unused
variables and list concatenation are resolved.
In `@TargetLibraries/GAP9/inc/ne16_utils.h`:
- Around line 11-12: The public header ne16_utils.h currently pulls in
implementation-only headers; replace the includes of "CNN_BasicKernels_fp32.h"
and "pmsis.h" with the standard <stdint.h> so the public struct uses only
fixed-width integer types, and remove the unavailable dependency. Then add
`#include` "pmsis.h" into ne16_utils.c (where gap_coreid(), gap_ncore(), and the
v4s vector type are used) so implementation functions in ne16_utils.c can access
those symbols.
---
Outside diff comments:
In `@Deeploy/Targets/GAP9/Platform.py`:
- Around line 8-54: The file containing the long import block (look for symbols
like DeQuantTilingReadyBindings, BasicAllocateTemplate, RemoveEmptyConvBiasPass
and PULPDMASliceBindings) is not formatted/organized per CI; run the import and
formatter hooks (isort and yapf) to reorder and format the imports/line breaks,
then stage and commit the resulting changes so pre-commit passes (this also
applies to the nearby import group that includes PULP* bindings and parsers).
---
Duplicate comments:
In `@Deeploy/Targets/NE16/__init__.py`:
- Line 5: Replace the wildcard self-import in Deeploy/Targets/NE16/__init__.py
with explicit exports: import the sibling module objects you want to expose
(e.g., the NE16 device class/name and the Tiler class) by name and populate
__all__ accordingly so consumers can access NE16 and Tiler directly;
specifically remove "from . import *", add explicit imports of the NE16 module
symbols and Tiler, and set __all__ = ["NE16", "Tiler", ...] to list the public
API.
In `@Deeploy/Targets/NE16/Deployer.py`:
- Around line 28-30: The constructor NE16Deployer.__init__ currently uses a
mutable default for the inputOffsets parameter (inputOffsets = {}), which can be
shared across instances; change the signature to use inputOffsets=None and then
when calling super().__init__ pass an empty dict when inputOffsets is None
(e.g., use inputOffsets or {}), ensuring NE16Deployer.__init__ does not leak
offsets between deployments.
- Around line 29-40: The code mutates the shared loweringOptimizer.passes list
causing duplicate/side-effect passes; fix by creating a new list copy and
assigning it back instead of modifying in place: make a local copy (e.g.,
new_passes = list(self.loweringOptimizer.passes)), perform the replacement of
PULPNCHWtoNHWCPass with NCHWtoNHWCPass on new_passes when
self.Platform.engines[0].enable3x3 is true, then set
self.loweringOptimizer.passes = new_passes + [ConvEngineDiscolorationPass(),
NE16OptimizationPass(self.default_channels_first, "NE16")]; operate on these
symbols (self.loweringOptimizer.passes, PULPNCHWtoNHWCPass, NCHWtoNHWCPass,
ConvEngineDiscolorationPass, NE16OptimizationPass) in the Deployer initializer
to avoid mutating shared state.
In `@Deeploy/Targets/NE16/Engine.py`:
- Around line 62-75: The PW/DW classifiers are too permissive: update isPWConv
to also require node.attrs['group'] == 1 to reject grouped convs, and update
isDWConv to enforce depthwise conditions by checking the weight tensor on
node.inputs[1] has shape where weights.shape[1] == 1 and node.attrs['group'] ==
node.attrs.get('channels_out' or 'ch_im_out' equivalent) (i.e., group == output
channel count); locate these checks in the isPWConv and isDWConv functions and
add the group and weight-shape validations so canExecute no longer returns True
for convs that will fail downstream parsing/tiling.
In `@Deeploy/Targets/NE16/Parsers.py`:
- Around line 37-56: The parseNodeCtxt method in NE16's Parsers.py currently
defaults channels_first=True but immediately rejects any True value (using
"channels_first == False"); change the default of parseNodeCtxt to
channels_first=False and also update the three RQS override methods that default
channels_first=True to default to False, then simplify the conditional check
from "channels_first == False" to "not channels_first" (ensure you update all
occurrences in parseNodeCtxt and the three RQS override methods so callers using
defaults succeed).
- Around line 118-123: The parseNodeCtxt overrides unconditionally map
node.inputs into inputs = ['data_in','weight','mul','add'] causing IndexError
for a 5th tensor and leaving operatorRepresentation['shift'] unset; update
parseNodeCtxt (the methods that populate self.operatorRepresentation) to first
check len(node.inputs) and either assert/raise if it is not 4 or 5, and if a 5th
input exists map it to self.operatorRepresentation['shift'] =
ctxt.lookup(node.inputs[4].name).name, and additionally if 'shift' can come from
node.attrs check node.attrs.get('shift') and set operatorRepresentation['shift']
accordingly so ConvTemplate.py's ${shift} is always defined or an explicit error
is raised.
In `@Deeploy/Targets/NE16/Templates/ConvTemplate.py`:
- Around line 352-356: The template emits an invalid assignment
"task.data.cfg.weights_stride = (ne16_stride_t) { ... }" inside the ne16_cfg_t
compound literal; update the ConvTemplate.py rendering so it produces a
designated initializer by replacing that emitted assignment with
".weights_stride = (ne16_stride_t){ .d0 = ${weightStrideD0}, .d1 =
${weightStrideD1}, .d2 = ${weightStrideD2} }," (i.e., remove the
"task.data.cfg." prefix and emit the field as a proper member initializer inside
the ne16_cfg_t literal), ensuring punctuation and comma placement matches other
fields in the ne16_cfg_t literal.
- Around line 145-151: The conf0 bit for bias presence is being set when
operatorRepresentation contains the key "add" even if its value is the sentinel
"NULL"; update the requantized branch in ConvTemplate.py so the third argument
to getNormQuantConf0 uses an explicit non-NULL check (e.g.
operatorRepresentation.get("add") != "NULL") instead of '"add" in
operatorRepresentation', mirroring the mul check (operatorRepresentation["mul"]
!= "NULL"); locate the operatorRepresentation usage around the mulBuff lookup
and change the boolean expression passed to getNormQuantConf0 accordingly so
bias-less convs do not set bit 25.
In `@Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py`:
- Around line 52-58: The current code only pins weightOutChannelVar to its
Max(), allowing the other two packed weight dimensions to be shrunk; fix by
retrieving the remaining dimension variables for the weight buffer (the vars
corresponding to dims 1 and 2 of weightBuffer) and add equivalent constraints
using tilerModel.addConstraint(dimVar == dimVar.Max()) so the entire packed
block (all dims of weightBuffer) is fixed to full size; use
weightBuffer/weightBufferName to locate the buffer and weightOutChannelVar as a
reference to the existing constraint when adding the other two constraints.
In `@Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py`:
- Around line 43-56: The current pointwise tiler constraints incorrectly force
outputHeightVar==inputHeightVar and outputWidthVar==inputWidthVar, which forbids
strided/padded 1x1 convs; update the mapping in NE16PointwiseConstraint.py where
tilerModel.addConstraint is called (the lines referencing
outputHeightVar/inputHeightVar and outputWidthVar/inputWidthVar) to either (A)
add an assertion that strides == [1,1] when isPWConv can accept strides (check
NE16Engine.enableStrides / isPWConv and the strides variable) or (preferred)
replace the identity constraints with the standard convolution output relation
used by dense/DW paths: output = floor((effective_input - 1) / stride) + 1,
computing effective_input = input + pad_top + pad_bottom (use whatever padding
variables exist in scope) and apply that formula for both height and width via
tilerModel.addConstraint.
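The replacement relation suggested above is the standard convolution output-size formula; a generic integer sketch (plain arithmetic, not the tiler's constraint API), where a 1x1 pointwise kernel reduces it to output = floor((input + pads - 1) / stride) + 1:

```python
def conv_out_size(in_size: int, pad_before: int, pad_after: int,
                  stride: int, kernel: int = 1, dilation: int = 1) -> int:
    """Standard conv output size: floor((eff_input - eff_kernel) / stride) + 1."""
    eff_kernel = dilation * (kernel - 1) + 1
    eff_input = in_size + pad_before + pad_after
    return (eff_input - eff_kernel) // stride + 1
```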
In `@Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py`:
- Around line 88-102: The subtraction "values = values - weight_offset" risks
int8 overflow; cast the tensor to a wider signed dtype (e.g., int16 or int32)
before computing "weight_offset" subtraction so arithmetic is done in a safe
width, perform "values = values.astype(wider_dtype) - int(weight_offset)" then
(after any necessary range checks/clamping) cast to uint8 for encoding; update
the code paths around weightTensor.values, weight_offset and the subsequent
astype(uint8) to use the widened intermediate to avoid wraparound for int8
inputs.
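A minimal NumPy sketch of the widening described above; `w` and `weight_offset` are illustrative stand-ins for the weight tensor values and offset handled in Passes.py:

```python
import numpy as np

w = np.array([-128, -1, 127], dtype=np.int8)
weight_offset = -128  # typical int8 → uint8 re-centering offset

# Subtracting directly in int8 wraps: 127 - (-128) = 255 overflows to -1.
wrapped = w - np.int8(weight_offset)

# Widening first keeps the true values, which then fit in uint8.
widened = w.astype(np.int16) - int(weight_offset)
encoded = widened.astype(np.uint8)
```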
- Around line 139-150: The function _findAllMultiplicands uses float division
and an exclusive sqrt bound; change the loop to iterate i from 2 up to and
including the integer square root (use math.isqrt(x) or int(math.sqrt(x)) + 1)
and perform integer division on tmpX (use //= instead of /=) so tmpX remains an
int for the modulo checks; also simplify the tail handling by appending tmpX if
tmpX > 1 (rather than recomputing x // math.prod(...)) so _findAllMultiplicands
reliably returns prime factors for inputs like 4 or 49 and avoids float-related
modulo failures.
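The suggested rewrite amounts to classic integer trial division; a self-contained sketch of what a corrected `_findAllMultiplicands` could look like (name kept for illustration only, not the actual Deeploy implementation):

```python
import math

def find_all_multiplicands(x: int) -> list:
    """Prime factorization with multiplicity, using integer-only arithmetic."""
    factors = []
    tmp = x
    i = 2
    while i <= math.isqrt(tmp):  # inclusive integer square-root bound
        while tmp % i == 0:
            factors.append(i)
            tmp //= i            # integer division keeps tmp an int
        i += 1
    if tmp > 1:                  # remaining cofactor is itself prime
        factors.append(tmp)
    return factors
```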
---
Nitpick comments:
In `@Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py`:
- Around line 41-44: Remove or use the three unused weight-dim variables
(weightInChannelMajorVar, weightBitsVar, weightBandwidthVar) created in
NE16DenseConstraint; they are allocated but never referenced in
addGeometricalConstraint while only weightOutChannelVar is used. Either delete
those getTensorDimVar calls to clean up the code, or wire them into the
constraint logic inside addGeometricalConstraint (or the class constructor) —
e.g., pin the bits/bandwidth axes or add explicit constraints that mirror the
pointwise policy — so that weightBitsVar and weightBandwidthVar influence tiling
decisions and weightInChannelMajorVar is used where input-channel-major indexing
matters.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cc86fa11-0cab-48bf-89f1-7688eb67747c
📒 Files selected for processing (55)
- .github/workflows/_runner-gap9-w-ne16-tiled.yml
- .github/workflows/_runner-snitch-tiled-sequential.yml
- .github/workflows/ci-platform-gap9-tiled.yml
- .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
- .github/workflows/ci-platform-gap9.yml
- CMakeLists.txt
- Deeploy/Targets/GAP9/Bindings.py
- Deeploy/Targets/GAP9/DMA/L3Dma.py
- Deeploy/Targets/GAP9/Parsers.py
- Deeploy/Targets/GAP9/Platform.py
- Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
- Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
- Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py
- Deeploy/Targets/GAP9/Tiler.py
- Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py
- Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
- Deeploy/Targets/Generic/Bindings.py
- Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py
- Deeploy/Targets/Generic/TypeCheckers.py
- Deeploy/Targets/NE16/Bindings.py
- Deeploy/Targets/NE16/Deployer.py
- Deeploy/Targets/NE16/Engine.py
- Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
- Deeploy/Targets/NE16/OptimizationPasses/__init__.py
- Deeploy/Targets/NE16/Parsers.py
- Deeploy/Targets/NE16/Platform.py
- Deeploy/Targets/NE16/Templates/AllocateTemplate.py
- Deeploy/Targets/NE16/Templates/ConvTemplate.py
- Deeploy/Targets/NE16/Templates/__init__.py
- Deeploy/Targets/NE16/TileConstraints/NE16DenseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/NE16PointwiseConstraint.py
- Deeploy/Targets/NE16/TileConstraints/RequantHelpers.py
- Deeploy/Targets/NE16/TileConstraints/__init__.py
- Deeploy/Targets/NE16/Tiler.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
- Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
- Deeploy/Targets/NE16/__init__.py
- Deeploy/Targets/PULPOpen/Bindings.py
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- DeeployTest/CMakeLists.txt
- DeeployTest/Platforms/GAP9/CMakeLists.txt
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/inputs.npz
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/network.onnx
- DeeployTest/Tests/Kernels/Integer/Conv/Dense_2D_RQ/outputs.npz
- DeeployTest/conftest.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/testMVP.py
- DeeployTest/testUtils/core/execution.py
- DeeployTest/testUtils/platformMapping.py
- DeeployTest/test_gap9_ne16_tiled_config.py
- DeeployTest/test_platforms.py
- TargetLibraries/GAP9/CMakeLists.txt
- TargetLibraries/GAP9/inc/ne16_utils.h
- TargetLibraries/GAP9/src/ne16_utils.c
✅ Files skipped from review due to trivial changes (8)
- Deeploy/Targets/GAP9/TopologyOptimizationPasses/__init__.py
- DeeployTest/testMVP.py
- DeeployTest/Platforms/GAP9/CMakeLists.txt
- Deeploy/Targets/NE16/Tiler.py
- .github/workflows/_runner-gap9-w-ne16-tiled.yml
- Deeploy/Targets/GAP9/Templates/GAP9SDKDequantQuantTemplate.py
- DeeployTest/test_gap9_ne16_tiled_config.py
- Deeploy/Targets/NE16/Bindings.py
🚧 Files skipped from review as they are similar to previous changes (10)
- DeeployTest/testUtils/core/execution.py
- DeeployTest/conftest.py
- DeeployTest/CMakeLists.txt
- CMakeLists.txt
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py
- Deeploy/Targets/NE16/Templates/AllocateTemplate.py
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/testUtils/platformMapping.py
- .github/workflows/ci-platform-gap9-w-ne16-tiled.yml
- Deeploy/Targets/NE16/OptimizationPasses/MemoryLevelAnnotationPasses.py
```python
# Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
bias_flat = C.values.flatten()
if bias_flat.size > Ko:
    C.values = bias_flat[:Ko].copy()

# Signed input bias compensation
if input_signed:
    # Compute w_sum BEFORE packing (needed for signed bias compensation)
    w_sum = w_int8.astype(np.int64).sum(axis = 1)  # [Ko]
    bias_values = C.values.flatten().astype(np.int64)
    if 'mul' in operatorRepresentation:
        # RequantizedGemm: bias -= 128 * w_sum * scale
        scale_buf = ctxt.lookup(operatorRepresentation['mul'])
        scale_values = scale_buf.values.flatten().astype(np.int64)
        bias_values -= 128 * w_sum * scale_values
    else:
        # Gemm int32: bias -= 128 * w_sum (no scale)
        bias_values -= 128 * w_sum
    C.values = bias_values.astype(np.int32)
```
Bias handling only covers size > Ko, not broadcast/scalar cases.
The if bias_flat.size > Ko branch handles a broadcast-expanded bias of shape [M, O], but when bias_flat.size < Ko (scalar bias, or per-batch bias of size 1 that wasn't pre-broadcast) C.values is left untouched. The signed-input branch below then calls C.values.flatten().astype(np.int64) and computes bias_values -= 128 * w_sum * scale_values with mismatched shapes, silently broadcasting the short bias over w_sum[:Ko] and writing back a non-Ko array that the NE16 kernel will later read as [Ko] (OOB/garbage).
Consider explicitly normalizing bias to [Ko] up front (broadcast-fill or np.resize) and asserting the resulting shape, e.g.:
🛡️ Proposed fix
- # Truncate broadcast bias [M, O] → per-channel [Ko] for NE16
- bias_flat = C.values.flatten()
- if bias_flat.size > Ko:
- C.values = bias_flat[:Ko].copy()
+ # Normalize bias to per-channel [Ko] for NE16
+ bias_flat = C.values.flatten()
+ if bias_flat.size == Ko:
+ C.values = bias_flat.copy()
+ elif bias_flat.size > Ko and bias_flat.size % Ko == 0:
+ C.values = bias_flat[:Ko].copy()
+ elif bias_flat.size == 1:
+ C.values = np.full((Ko,), bias_flat[0], dtype = bias_flat.dtype)
+ else:
+ raise ValueError(f"NE16 GEMM bias size {bias_flat.size} incompatible with Ko={Ko}")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py` around lines 62 - 80, The
bias handling must normalize C.values to a 1-D array of length Ko before any
signed-input compensation; instead of only truncating when bias_flat.size > Ko,
handle bias_flat.size < Ko by expanding/broadcasting (e.g., repeat or np.resize)
to length Ko and assert the resulting shape equals Ko, ensure dtype is
appropriate (int32/int64 as needed), then perform the signed-input compensation
using w_sum, scale_values (from operatorRepresentation['mul']) or the int32
path, and finally assign the corrected Ko-length array back to C.values; update
references around C.values, bias_flat, Ko, w_sum and operatorRepresentation so
no silent broadcasting or out-of-bounds write occurs.
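A standalone sketch of the normalization this prompt describes (hypothetical helper, not the template's actual code): coerce scalar, exact, and broadcast-expanded biases into one `[Ko]` array and reject anything else.

```python
import numpy as np

def normalize_bias(bias, Ko: int) -> np.ndarray:
    """Coerce a bias tensor to a 1-D per-channel array of length Ko."""
    flat = np.asarray(bias).flatten()
    if flat.size == Ko:
        return flat.copy()
    if flat.size == 1:                          # scalar bias: broadcast-fill
        return np.full((Ko,), flat[0], dtype=flat.dtype)
    if flat.size > Ko and flat.size % Ko == 0:  # broadcast-expanded [M, O]
        return flat[:Ko].copy()
    raise ValueError(f"bias size {flat.size} incompatible with Ko={Ko}")
```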
```python
transA = operatorRepresentation['transA']
transB = operatorRepresentation['transB']

buffA = ctxt.lookup(operatorRepresentation['A'])
buffB = ctxt.lookup(operatorRepresentation['B'])

NSize = buffA.shape[-1]
```
Make NSize transpose-aware.
Line 122 always uses buffA.shape[-1], but for transA == 1 the reduction dimension is buffA.shape[-2], so the constraint will serialize the wrong A/B tile shapes for transposed GEMMs.
Proposed fix
- NSize = buffA.shape[-1]
+ NSize = buffA.shape[-1 - transA]

Based on learnings, GEMM tiling serialization must use transpose flags to adjust NSize and matrix offset/shape calculations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
116 - 123, The code always sets NSize = buffA.shape[-1] which ignores
transposition; update the calculation to choose the reduction dimension based on
operatorRepresentation['transA'] (and consider transB if used elsewhere):
compute NSize = buffA.shape[-2] when transA == 1 else buffA.shape[-1], and
adjust any subsequent uses that compute A/B tile offsets or shapes so they use
this transpose-aware NSize (refer to variables transA, transB, buffA, buffB, and
NSize in NE16GEMMTileConstraint.py).
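The transpose-aware selection boils down to one index expression; a minimal sketch (generic helper, not the constraint class itself):

```python
def reduction_dim(shape_a, trans_a: int) -> int:
    """N (reduction) size of A in C = op(A) @ op(B).

    Untransposed A is [..., M, N]; transposed A is [..., N, M], so the
    reduction axis moves from the last to the second-to-last dimension.
    """
    return shape_a[-1 - trans_a]
```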
```python
for cube in outputCubes:
    MOffset, OOffset = cube.offset[-2:]
    MSize, OSize = cube.dims[-2:]

    if len(cube.offset) > 2:
        BatchSize = math.prod(cube.dims[:-2])
    else:
        BatchSize = 1

    replacements["M"].append(MSize)
    replacements["O"].append(OSize)
    replacements["batch"].append(BatchSize)

    if transA == 0:
        AMatrixOffsets = (MOffset, 0)
        AMatrixShape = (MSize, NSize)
    else:
        AMatrixOffsets = (0, MOffset)
        AMatrixShape = (NSize, MSize)

    if len(buffA.shape) > 2:
        batchDimCount = len(buffA.shape) - 2
        AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
        AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape

    inputACubes.append(HyperRectangle(AMatrixOffsets, AMatrixShape))

    if transB == 0:
        BMatrixOffsets = (0, OOffset)
        BMatrixShape = (NSize, OSize)
    else:
        BMatrixOffsets = (OOffset, 0)
        BMatrixShape = (OSize, NSize)

    inputBCubes.append(HyperRectangle(BMatrixOffsets, BMatrixShape))

    RequantCube = HyperRectangle((OOffset,), (OSize,))
```
Use absolute output offsets when deriving input tiles.
The loop uses cube.offset, but cube is only absoluteCube.rectangle; this drops AbsoluteHyperRectangle.absoluteOffset. Non-zero output tiles can then reload A/B/C/mul/scale slices from the wrong global offsets.
Proposed fix
- for cube in outputCubes:
- MOffset, OOffset = cube.offset[-2:]
+ for absoluteCube in absoluteOutputCubes:
+ cube = absoluteCube.rectangle
+ MOffset, OOffset = absoluteCube.absoluteOffset[-2:]
MSize, OSize = cube.dims[-2:]
@@
if len(buffA.shape) > 2:
batchDimCount = len(buffA.shape) - 2
- AMatrixOffsets = tuple(cube.offset[:-2][-batchDimCount:]) + AMatrixOffsets
+ AMatrixOffsets = tuple(absoluteCube.absoluteOffset[:-2][-batchDimCount:]) + AMatrixOffsets
AMatrixShape = tuple(cube.dims[:-2][-batchDimCount:]) + AMatrixShape

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py` around lines
131 - 167, The loop over outputCubes is using cube.offset (the local rectangle)
which ignores AbsoluteHyperRectangle.absoluteOffset, causing input tile offsets
to be computed from the wrong origin; update the code that builds
AMatrixOffsets, BMatrixOffsets, and RequantCube to use the absolute output
offsets (e.g. add AbsoluteHyperRectangle.absoluteOffset to cube.offset or derive
offsets from absoluteCube.rectangle instead of cube) so the
inputACubes/inputBCubes and RequantCube use global offsets; ensure this change
applies where batch-aware prepending of batch offsets occurs (the block that
updates AMatrixOffsets/AMatrixShape using buffA.shape) and when appending to
inputACubes, inputBCubes and creating RequantCube (HyperRectangle).
```python
if node.op == 'RequantizedGemm' and len(node.inputs) >= 4:
    mulTensor = node.inputs[3]
    biasTensor = node.inputs[2]

    if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
        mul_values = mulTensor.values.flatten().astype(np.int32)
        log2D = int(np.log2(node.attrs['div'].values))

        # Broadcast scalar mul to per-channel if needed
        if len(mul_values) == 1:
            mul_values = np.full(Ko, mul_values[0], dtype = np.int32)

        ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)

        # Rescale bias from mul/log2D domain to scale/scale_n domain
        # bias_merged is already *= mul from PULPGEMMRequantMergePass
        # NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
        bias_values = biasTensor.values.flatten().astype(np.int64)
        ne16_bias = np.zeros(Ko, dtype = np.int64)
        for ko in range(Ko):
            shift_diff = int(ne16_scale_n[ko]) - log2D
            if shift_diff >= 0:
                ne16_bias[ko] = bias_values[ko] << shift_diff
            else:
                ne16_bias[ko] = bias_values[ko] >> (-shift_diff)

        ne16_bias = ne16_bias.astype(np.int32)
```
Validate requant constants before mutating the GEMM node.
This block can crash on missing/non-scalar div, mul lengths other than 1 or Ko, or bias shorter than Ko. Since the weight may already be transposed above, a failure here can leave a partially rewritten graph.
Proposed guard
if isinstance(mulTensor, gs.Constant) and isinstance(biasTensor, gs.Constant):
mul_values = mulTensor.values.flatten().astype(np.int32)
- log2D = int(np.log2(node.attrs['div'].values))
+ div_attr = node.attrs.get('div')
+ div_values = getattr(div_attr, "values", div_attr)
+ if div_values is None or np.size(div_values) != 1:
+ return graph
+ log2D = int(np.log2(np.asarray(div_values).item()))
# Broadcast scalar mul to per-channel if needed
if len(mul_values) == 1:
mul_values = np.full(Ko, mul_values[0], dtype = np.int32)
+ elif len(mul_values) != Ko:
+ return graph
ne16_scale, ne16_scale_n = _compute_ne16_scale_shift(mul_values, log2D)
# Rescale bias from mul/log2D domain to scale/scale_n domain
# bias_merged is already *= mul from PULPGEMMRequantMergePass
# NE16 needs: bias_ne16 = bias_merged * 2^(scale_n - log2D)
bias_values = biasTensor.values.flatten().astype(np.int64)
+ if len(bias_values) == 1:
+ bias_values = np.full(Ko, bias_values[0], dtype = np.int64)
+ elif len(bias_values) < Ko:
+ return graph
+ else:
+ bias_values = bias_values[:Ko]
ne16_bias = np.zeros(Ko, dtype = np.int64)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/GAP9/TopologyOptimizationPasses/Passes.py` around lines 80 -
106, Ensure RequantizedGemm handling validates all requant constants before
mutating the node: check node.op == 'RequantizedGemm' still has
node.attrs['div'] and that node.attrs['div'].values is a scalar; verify
mulTensor (node.inputs[3]) and biasTensor (node.inputs[2]) are gs.Constant and
that mul_values length is either 1 or equal to Ko and bias_values length is at
least Ko; only call _compute_ne16_scale_shift and compute ne16_scale_n/ne16_bias
after these guards and return/skip modification if any check fails to avoid
partially rewriting the graph.
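The shift loop in this block multiplies each bias by a power of two, 2^(scale_n - log2D); a minimal scalar sketch of that domain change (illustrative helper, not Deeploy code):

```python
def rescale_pow2(value: int, scale_n: int, log2d: int) -> int:
    """Move value from a 2**log2d divisor domain to a 2**scale_n domain."""
    shift = scale_n - log2d
    # Left-shift when the new domain uses a larger shift, right-shift otherwise.
    return value << shift if shift >= 0 else value >> (-shift)
```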
```python
signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
signed = bool(signed_val)
bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
n_levels = 2**bit_width

# Compute effective ratio: y_float = (x_int - zp_d) * s_d * s_q + zp_q
ratio = s_d * s_q

# Identity case: ratio ~= 1.0, both zero points == 0
EPSILON = 1e-6
if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
    input_tensor = dequant_node.inputs[0]
    output_tensor = quant_node.outputs[0]
    for downstream_node in list(output_tensor.outputs):
        for i, inp in enumerate(downstream_node.inputs):
            if inp == output_tensor:
                downstream_node.inputs[i] = input_tensor
    dequant_node.inputs.clear()
    dequant_node.outputs.clear()
    quant_node.inputs.clear()
    quant_node.outputs.clear()
    graph.cleanup().toposort()
```
Guard the identity bypass against output and dtype changes.
The fast path only checks ratio/zero-points and then removes both nodes. If the Quant output is a graph output, this leaves the declared output tensor without a producer. It also skips narrowing/clamping when Dequant and Quant bit widths differ, e.g. int16 dequantized then quantized to int8 with identity scale.
🐛 Proposed guard
signed_val = int(np.array(quant_node.attrs['signed']).item()) if 'signed' in quant_node.attrs else 1
signed = bool(signed_val)
bit_width = int(np.array(quant_node.attrs['bit_width']).item()) if 'bit_width' in quant_node.attrs else 8
+ dequant_signed_val = int(np.array(dequant_node.attrs.get('signed', signed_val)).item())
+ dequant_signed = bool(dequant_signed_val)
+ dequant_bit_width = int(np.array(dequant_node.attrs.get('bit_width', bit_width)).item())
n_levels = 2**bit_width
@@
- if abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and signed:
+ if (abs(ratio - 1.0) < EPSILON and abs(zp_d) < EPSILON and abs(zp_q) < EPSILON and
+ signed == dequant_signed and bit_width == dequant_bit_width):
input_tensor = dequant_node.inputs[0]
output_tensor = quant_node.outputs[0]
+ if output_tensor in graph.outputs:
+ return graph
for downstream_node in list(output_tensor.outputs):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/Generic/TopologyOptimizationPasses/Passes.py` around lines
1200 - 1221, The identity bypass must be guarded so it only removes
dequant_node/quant_node when the operation truly preserves type and graph
outputs: before rewiring, verify the quant_node.outputs[0] is not a graph output
and that quant_node and dequant_node have matching signed and bit_width (compare
quant_node.attrs['signed'] and ['bit_width'] to dequant_node.attrs equivalents,
and ensure n_levels match and signed flags are equal); if bit widths or signed
differ (e.g., int16->int8) or the output is a declared graph output, do not
remove the nodes and instead let normal narrowing/clamping pass run—implement
these checks around the existing identity branch (using ratio, zp_d, zp_q,
signed, bit_width, n_levels, input_tensor, output_tensor) and only perform the
input substitution, clear inputs/outputs and graph.cleanup().toposort() when all
guards pass.
| # | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| from . import * |
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
```

Repository: pulp-platform/Deeploy

🏁 Script executed:

```bash
fd -t f . Deeploy/Targets/NE16/Templates/ --max-depth 2
```

🏁 Script executed:

```bash
ls -la Deeploy/Targets/NE16/Templates/
```

🏁 Script executed:

```bash
cat Deeploy/Targets/NE16/Templates/__init__.py
```

🏁 Script executed:

```bash
cat -n Deeploy/Targets/NE16/Templates/AllocateTemplate.py
```

🏁 Script executed:

```bash
head -50 Deeploy/Targets/NE16/Templates/ConvTemplate.py
```

🏁 Script executed:

```bash
rg '__all__' Deeploy/Targets/NE16/Templates/
```

🏁 Script executed:

```bash
rg 'from Deeploy\.Targets\.NE16\.Templates import' --type py
```

🏁 Script executed:

```bash
rg 'from \.\.Templates import' Deeploy/Targets/NE16 --type py
```

🏁 Script executed:

```bash
rg 'from.*AllocateTemplate|from.*ConvTemplate|import.*AllocateTemplate|import.*ConvTemplate' --type py
```

🏁 Script executed:

```bash
rg 'ne16GenericGlobalInitTemplate|NE16ConvTemplate' --type py
```
Export the actual template modules.
This self-referential wildcard import does not load sibling files like the NE16 allocation/conv templates, so package-level imports will not see those symbols and Ruff reports F403.
Proposed direction
-from . import *
+from .AllocateTemplate import *
+from .ConvTemplate import *
🧰 Tools
🪛 Ruff (0.15.10)
[error] 5-5: from . import * used; unable to detect undefined names
(F403)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Deeploy/Targets/NE16/Templates/__init__.py` at line 5, The package
__init__.py currently uses a self-referential wildcard import ("from . import
*") which doesn't actually import sibling modules (e.g., the NE16 allocation and
conv template modules) and triggers Ruff F403; replace the wildcard with
explicit imports of the template modules (or explicit from .module import
<symbols>) and/or populate __all__ with the actual module or symbol names (for
example add explicit references to the NE16 allocation and conv modules) so
package-level imports expose the intended templates. Ensure you reference and
import the specific module names rather than relying on "from . import *".
| # | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| from . import * |
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
```

Repository: pulp-platform/Deeploy

🏁 Script executed:

```bash
#!/bin/bash
# List files in the TileConstraints directory
find Deeploy/Targets/NE16/TileConstraints -type f -name "*.py" | head -20
```

🏁 Script executed:

```bash
#!/bin/bash
# Check the content of the __init__.py file and surrounding files
ls -la Deeploy/Targets/NE16/TileConstraints/
```

🏁 Script executed:

```bash
#!/bin/bash
# Check what's exported from the constraint modules
cat -n Deeploy/Targets/NE16/TileConstraints/*.py | head -100
```

🏁 Script executed:

```bash
#!/bin/bash
# View the actual __init__.py file content
cat -n Deeploy/Targets/NE16/TileConstraints/__init__.py
```

🏁 Script executed:

```bash
#!/bin/bash
# Check if constraint modules define __all__
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '__all__' "$file" || echo "No __all__ found"
done
```

🏁 Script executed:

```bash
#!/bin/bash
# Check what's actually exported from constraint modules using grep for class definitions
for file in Deeploy/Targets/NE16/TileConstraints/NE16*.py; do
  echo "=== $file ==="
  rg -n '^class ' "$file"
done
```
Replace the self-import with explicit constraint exports.
from . import * imports from the current package namespace, not from sibling modules. Since the constraint modules are never imported, their classes are not exposed. Import the concrete constraint modules instead.
Proposed fix
-from . import *
+from .NE16DenseConstraint import *
+from .NE16DepthwiseConstraint import *
+from .NE16PointwiseConstraint import *

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
from .NE16DenseConstraint import *
from .NE16DepthwiseConstraint import *
from .NE16PointwiseConstraint import *
🧰 Tools
🪛 Ruff (0.15.10)
[error] 5-5: from . import * used; unable to detect undefined names
(F403)
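The failure mode is easy to reproduce outside the repository. The sketch below builds a throwaway package on disk (the names `demo_pkg`, `constraint`, and `DemoConstraint` are hypothetical, not Deeploy code) and shows that a self-referential `from . import *` exposes nothing from sibling modules, while an explicit sibling import does:

```python
import importlib
import os
import sys
import tempfile

# Hypothetical throwaway package (names are illustrative, not Deeploy's).
tmp = tempfile.mkdtemp()
pkg_dir = os.path.join(tmp, "demo_pkg")
os.makedirs(pkg_dir)
with open(os.path.join(pkg_dir, "constraint.py"), "w") as f:
    f.write("class DemoConstraint:\n    pass\n")
# Self-referential wildcard: star-imports the package itself while it is
# still initializing, so the sibling module is never loaded.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from . import *\n")

sys.path.insert(0, tmp)
import demo_pkg

missing_before = hasattr(demo_pkg, "DemoConstraint")  # False

# Explicit sibling import actually loads and re-exports the class.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
    f.write("from .constraint import *\n")
demo_pkg = importlib.reload(demo_pkg)

present_after = hasattr(demo_pkg, "DemoConstraint")  # True
print(missing_before, present_after)
```

The same reasoning applies to every `__init__.py` flagged by the F403 search above.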
#
# SPDX-License-Identifier: Apache-2.0

from . import *
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify package initializers that still use self-referential wildcard imports.
rg -n '^\s*from \. import \*$' Deeploy/Targets/NE16
🏁 Script executed:
#!/bin/bash
# List files in TopologyOptimizationPasses directory
ls -la Deeploy/Targets/NE16/TopologyOptimizationPasses/
🏁 Script executed:
#!/bin/bash
# Check if Passes.py exists and show its content
if [ -f "Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py" ]; then
wc -l Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
head -50 Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
fi
🏁 Script executed:
#!/bin/bash
# Check __init__.py for __all__ definition
cat -n Deeploy/Targets/NE16/TopologyOptimizationPasses/__init__.py
🏁 Script executed:
#!/bin/bash
# Check for __all__ definition in Passes.py
rg -n '__all__' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
🏁 Script executed:
#!/bin/bash
# Find class definitions in Passes.py to identify the pass classes
rg -n '^class ' Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py
Import Passes explicitly.
from . import * does not re-export the pass classes from Passes.py; consumers importing from Deeploy.Targets.NE16.TopologyOptimizationPasses cannot access the four pass classes (NE16AdjustWeightMemoryLayoutPass, NE16ReshapePointwiseConvolutionPass, ConvEngineDiscolorationPass, NE16OptimizationPass) without an explicit import.
Proposed fix
-from . import *
+from .Passes import *

📝 Committable suggestion
from .Passes import *
🧰 Tools
🪛 Ruff (0.15.10)
[error] 5-5: from . import * used; unable to detect undefined names
(F403)
test_name, l1, config_name = test_params

ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
python -m ruff check DeeployTest/test_platforms.py --select RUF059,RUF005
Clean up the Ruff warnings in the three new GAP9_w_NE16 tiled tests.
The unpacked config_name variable is never used, and Ruff prefers iterable unpacking over list concatenation. Apply these changes at lines 1008-1010, 1043-1045, and 1076-1078:
Proposed fix
- test_name, l1, config_name = test_params
+ test_name, l1, _config_name = test_params
- ne16_cmake_args = cmake_args + [f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
+ ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]

📝 Committable suggestion
test_name, l1, _config_name = test_params
ne16_cmake_args = [*cmake_args, f"NUM_CORES={GAP9_NE16_TILED_DEFAULT_CORES}"]
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 1008-1008: Unpacked variable config_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
[warning] 1010-1010: Consider iterable unpacking instead of concatenation
Replace with iterable unpacking
(RUF005)
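Both Ruff idioms can be checked in isolation; this snippet uses illustrative values (`testRQConv`, `NUM_CORES=8`) rather than the real test parameters:

```python
# Illustrative stand-ins for the values unpacked in test_platforms.py.
test_params = ("testRQConv", 64 * 1024, "tiledDefault")
cmake_args = ["TOOLCHAIN=GCC", "USE_DMA=ON"]

# RUF059: prefix the unused element with an underscore.
test_name, l1, _config_name = test_params

# RUF005: iterable unpacking instead of list concatenation; the
# source list is left untouched either way.
ne16_cmake_args = [*cmake_args, "NUM_CORES=8"]

print(test_name, l1, ne16_cmake_args)
```

The unpacking form also makes it trivial to splice in several extra arguments later without chaining `+` operators.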
#include "CNN_BasicKernels_fp32.h"
#include "pmsis.h"
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify whether the missing header exists and where ne16_utils.h is included.
fd -a 'CNN_BasicKernels_fp32\.h$' .
rg -n '#include\s+"CNN_BasicKernels_fp32\.h"|#include\s+"ne16_utils\.h"'
🏁 Script executed:
#!/bin/bash
# Read the header file to understand what's actually needed
echo "=== ne16_utils.h content ==="
cat -n TargetLibraries/GAP9/inc/ne16_utils.h
echo ""
echo "=== ne16_utils.c content ==="
cat -n TargetLibraries/GAP9/src/ne16_utils.c
Move implementation dependencies out of the public header.
ne16_utils.h only needs fixed-width integer types for its public struct, but it includes CNN_BasicKernels_fp32.h (which does not exist in the repository) and pmsis.h (which is only used in the implementation). These unavailable/unnecessary headers in a public header break any code that includes it.
Replace the includes in the header with <stdint.h>, and add #include "pmsis.h" to ne16_utils.c where it is actually needed (for gap_coreid(), gap_ncore(), and v4s vector types).
Proposed fix
In TargetLibraries/GAP9/inc/ne16_utils.h:
-#include "CNN_BasicKernels_fp32.h"
-#include "pmsis.h"
+#include <stdint.h>

In TargetLibraries/GAP9/src/ne16_utils.c:

 #include "ne16_utils.h"
+#include "pmsis.h"

📝 Committable suggestion
#include <stdint.h>
🧰 Tools
🪛 Clang (14.0.6)
[error] 11-11: 'CNN_BasicKernels_fp32.h' file not found
(clang-diagnostic-error)
The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9 SDK autotiler CNN_Libraries_HWPE directory), but this path was missing from the cmake include directories, causing build failures on plain GAP9.
KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK but
the template was using (v4s){0,0,0,0} (signed), causing a compilation
error on GCC with -Werror.
Actionable comments posted: 2
🧹 Nitpick comments (1)
Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py (1)
41-41: Match the `NodeTemplate.alignToContext` return contract.

The override returns `operatorRepresentation`, so the annotation should use `OperatorRepresentation` instead of `Dict`.

♻️ Proposed annotation cleanup

-from typing import Dict, List, Tuple
+from typing import List, Tuple
@@
     def alignToContext(self, ctxt: NetworkContext,
-                       operatorRepresentation: OperatorRepresentation) -> Tuple[NetworkContext, Dict, List[str]]:
+                       operatorRepresentation: OperatorRepresentation
+                      ) -> Tuple[NetworkContext, OperatorRepresentation, List[str]]:
📒 Files selected for processing (2)
- Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py
- TargetLibraries/GAP9/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
- TargetLibraries/GAP9/CMakeLists.txt
# Weight packing and signed bias compensation
w_int8 = B.values.astype(np.int8)  # [Ko, Ki], still int8 at this point
Ko, Ki = w_int8.shape
Silence the Ruff unused-variable warning.
Ki is unpacked but never used; use a dummy name.
🧹 Proposed lint fix
- Ko, Ki = w_int8.shape
+ Ko, _ = w_int8.shape

📝 Committable suggestion
Ko, _ = w_int8.shape
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 60-60: Unpacked variable Ki is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
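For context on the "signed bias compensation" named in the snippet above, here is a plain-Python sketch of the underlying algebra (illustrative values, not the template's actual packing): shifting signed activations into uint8 range adds `128 * sum(w)` to every accumulator, which a per-channel bias correction cancels exactly.

```python
# For unsigned activations a_u = a + 128, the accumulator satisfies
#   sum(a_u * w) = sum(a * w) + 128 * sum(w),
# so folding -128 * sum(w) into the bias restores the signed result.
a = [-128, -1, 0, 7, 127]          # signed int8 activations
w = [3, -2, 5, 1, -4]              # signed weights, one output channel
bias = 10

signed_acc = sum(ai * wi for ai, wi in zip(a, w)) + bias

a_u = [ai + 128 for ai in a]       # uint8 view fed to the accelerator
bias_comp = bias - 128 * sum(w)    # compensated bias
unsigned_acc = sum(au * wi for au, wi in zip(a_u, w)) + bias_comp

print(signed_acc, unsigned_acc)    # identical accumulators
```

Since `sum(w)` is known at deploy time, the correction costs nothing at runtime.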
% if input_signed:
// Signed input: add 128 offset to convert int8 -> uint8 (multi-core SIMD)
{
    ne16_int8_to_uint8_T _offset_arg = {
        .In = (int8_t *)${A},
        .Out = (uint8_t *)${A},
        .size = ${batch} * ${M} * ${N}
    };
    pi_cl_team_fork(NUM_CORES, (void *)ne16_int8_to_uint8, &_offset_arg);
}
🧩 Analysis chain
🏁 Script executed:
cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | head -200
🏁 Script executed:
cat -n Deeploy/Targets/GAP9/Templates/NE16GEMMTemplate.py | tail -50
🏁 Script executed:
# Search for references to NE16GEMMTemplate or similar buffer management patterns
rg "NE16GEMM|buffer.*signed|input_signed" --max-count=20
🏁 Script executed:
cat -n Deeploy/Targets/GAP9/TileConstraints/NE16GEMMTileConstraint.py
🏁 Script executed:
# Check PULPOpen ConvTemplate for signed input handling
rg -A 15 "if input_signed:" Deeploy/Targets/PULPOpen/Templates/ConvTemplate.py | head -100
🏁 Script executed:
# Search for ne16_int8_to_uint8 references to understand the conversion pattern
rg "ne16_int8_to_uint8" --max-count=20
🏁 Script executed:
cat -n TargetLibraries/GAP9/src/ne16_utils.c
Avoid mutating ${A} during signed-input offsetting.
The in-place conversion at lines 93–102 and 168–177 corrupts ${A} permanently for any downstream consumer, skip connection, graph input reuse, or aliased buffer. The ne16_int8_to_uint8 utility mutates the buffer in-place, and no tile constraint or binding enforces single-consumer or non-aliased input guarantees. Use a hoisted transient uint8 input buffer for the offset path, or explicitly document and enforce that this binding only accepts single-consumer, non-aliased activations.
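One way to sidestep the aliasing hazard is an out-of-place conversion. This plain-Python sketch mirrors only the arithmetic (the real kernel operates on raw L1 buffers forked across cluster cores):

```python
def int8_to_uint8(src):
    """Out-of-place int8 -> uint8 offset (y = x + 128); input untouched."""
    return [(x + 128) & 0xFF for x in src]

acts = [-128, -1, 0, 1, 127]   # signed activations, possibly aliased
offset = int8_to_uint8(acts)

print(offset)  # [0, 127, 128, 129, 255]
print(acts)    # [-128, -1, 0, 1, 127] -- unchanged for other consumers
```

The price is one transient uint8 buffer of `batch * M * N` bytes, which the memory planner can reuse once the NE16 job has consumed it.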
Runs each NE16 conv kernel with --profileTiling after the normal test suite to collect cycle counts from gvsoc.
profileTiling generates code calling getCycles() in Network.c but the header declaring it was not included. Add CycleCounter.h to both GAP9 and NE16 platform include lists, and expose the GAP9 inc/ directory to the network target so the header is found at compile time.
GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The profiling step needs a clean rebuild with LTO disabled to avoid conflicts with the cached LTO-enabled build from the test step.
The --enable-3x3 flag was parsed by the runner script but never passed to generateNetwork.py, so NE16Engine.enable3x3 was always False. DW 3x3 and Dense 3x3 convolutions silently fell back to the PULP cluster instead of dispatching to NE16. Add the flag and set it on the engine.
The deeployRunner parsed --enable-3x3 but never forwarded it to generateNetwork.py's gen_args, so NE16Engine.enable3x3 stayed False and DW/Dense 3x3 convs silently fell back to the cluster.
Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8) is too small — NE16 dispatch overhead dominates at only 12.8% utilization. Also wire --enable-3x3 through deeployRunner gen_args.
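The 75M figure follows from the usual dense-convolution op count (2 ops per MAC, assuming 'same' padding so the output keeps the input's spatial size), which can be sanity-checked in a few lines:

```python
def conv2d_dense_ops(cin, cout, h, w, k):
    # 2 ops per MAC; assumes 'same' padding so the output stays h x w.
    return 2 * k * k * cin * cout * h * w

ops = conv2d_dense_ops(64, 64, 32, 32, 3)
print(ops)  # 75497472 -> the "75M ops" quoted above
```

At a few MACs per cycle of NE16 peak throughput, a workload this size runs long enough for dispatch overhead to amortize, unlike the small 16×16×8×8 case.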
Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform `GAP9_w_NE16` that mirrors the `Siracusa_w_neureka` pattern.

**Added**

- `Deeploy/Targets/NE16/` — full Target: `Platform.py` (NE16Platform extends GAP9Platform, engines = [NE16Engine, GAP9ClusterEngine]), `Engine.py`, `Bindings.py`, `Parsers.py`, `Tiler.py`, `Deployer.py` (extends GAP9Deployer to reuse ClDma transformers), `Templates/{Allocate,Conv}Template.py`, `TileConstraints/{NE16Pointwise,NE16Depthwise,NE16Dense}Constraint.py`, `TopologyOptimizationPasses/Passes.py` (incl. `_weightEncode` ported from `pulp-nnx/test/Ne16Weight.py`).
- `DeeployTest/deeployRunner_tiled_gap9_w_ne16.py` and `DeeployTest/test_gap9_ne16_tiled_config.py` — new runner + minimal test config covering PW 1x1, DW 3x3, Dense 3x3 RQ Conv.
- `TargetLibraries/GAP9/CMakeLists.txt` — for the `GAP9_w_NE16` platform, `add_subdirectory` on `pulp-nnx` with `USE_NE16=ON` and link it into `deeploygap9`.
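For readers unfamiliar with NE16's bit-serial weight format, here is a minimal sketch of the bitplane idea behind `_weightEncode` (simplified to one 16-channel block of unsigned `qw`-bit weights; the real pulp-nnx encoder additionally handles full tensor layout, offsets, and modes):

```python
def encode_bitplanes(weights, qw=8):
    """Pack one block of 16 unsigned qw-bit weights into qw bitplanes;
    plane b holds bit b of every channel, one bit per input channel."""
    assert len(weights) == 16 and all(0 <= w < (1 << qw) for w in weights)
    planes = []
    for bit in range(qw):
        plane = 0
        for ch, w in enumerate(weights):
            plane |= ((w >> bit) & 1) << ch
        planes.append(plane)  # 16 bits -> stored as 2 bytes on the engine
    return planes

def decode_bitplanes(planes):
    """Inverse transform: reassemble each channel's weight from its bits."""
    weights = [0] * 16
    for bit, plane in enumerate(planes):
        for ch in range(16):
            weights[ch] |= ((plane >> ch) & 1) << bit
    return weights

ws = [(17 * i + 5) % 256 for i in range(16)]
roundtrip = decode_bitplanes(encode_bitplanes(ws))
print(roundtrip == ws)  # True
```

The bit-serial layout is what lets NE16 trade precision for throughput: fewer weight bits means fewer planes to stream.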
**Changed**

- `DeeployTest/testUtils/platformMapping.py` — register `GAP9_w_NE16` in the platforms list, `mapPlatform`, `setupMemoryPlatform`, and `mapDeployer`.
- `DeeployTest/testMVP.py` — include `GAP9_w_NE16` in the `EngineColoringDeployerWrapper` branch (without it, `NE16AdjustWeightMemoryLayoutPass` never fires and parsing backtracks).
- `DeeployTest/testUtils/core/execution.py` — build the GAP9 SDK `image` target for `GAP9_w_NE16` too (so `chip.soc.mram.bin` is produced before `gvsoc run`).
- `CMakeLists.txt`, `DeeployTest/CMakeLists.txt` — accept `GAP9_w_NE16` alongside `GAP9` in the platform branches.
- `Deeploy/Targets/NE16/Templates/ConvTemplate.py` — swap Neureka-inherited subtile constants for the NE16 spec per `ne16_task_defs.h`: `CIN_SUBTILE` 32/28 → 16 (single mode), output 6 → 3, per-(cout, cinMajor) weight bytes PW 16, DW/Dense 144.
**Fixed**

- `Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py` — work around a pre-existing `ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes'` by defining it locally via `PointerClass(float32_t)`.
**Test plan**

Ran on gvsoc `gap9.evk` (inside `ghcr.io/pulp-platform/deeploy-gap9:devel`), full pipeline gen -> parse -> lower -> codegen -> CMake -> build -> gapy image -> gvsoc flash run:

- `Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ`
- `Kernels/Integer/Conv/DW_2D_RQ` (`--enable-3x3`)
- `Kernels/Integer/Conv/Regular_2D_RQ` (`--enable-3x3`)

Follow-up (out of scope for this PR): the PW `Unsigned_RQ` variant currently produces non-zero errors — likely unsigned weight-offset or `out_type` conf0 handling; tracked separately.
**PR Merge Checklist**

- `devel` commit and pointing to `devel`.
- `CHANGELOG.md` file has been updated.