Conversation

@lee2716 lee2716 commented Jan 31, 2026

Add FP32 operator support to enable MicroLlama model deployment on the Snitch
cluster, with both untiled and tiled execution modes.

Added

  • FP32 templates: FloatAddTemplate, FloatDivTemplate, FloatMulTemplate,
    FloatHardSwishTemplate, FloatRMSNormTemplate, FloatMatMulTemplate
  • FP32 parsers: SnitchDivParser, SnitchMulParser, SnitchRMSNormParser
  • FP32 bindings: BasicDivBindings, BasicMulBindings, BasicHardSwishBindings,
    BasicRMSNormBindings (untiled)
  • FP32 bindings: SnitchDivBindings, SnitchMulBindings,
    SnitchHardSwishBindings, SnitchRMSNormBindings (tiled)
  • C kernels: Add_fp32.c, Div_fp32.c, Mul_fp32.c, HardSwish.c, RMSNorm_fp32.c
    with multi-core parallelization
  • TileConstraints: FloatDivTileConstraint, FloatMulTileConstraint,
    ReshapeTileConstraint
  • SnitchTiledPlatform and SnitchTiledMapping for tiled execution
  • testRunner_tiled_snitch.py for tiled model testing
  • FP32 kernel tests with ONNX models, inputs, and expected outputs
  • CI workflows for both Snitch and Snitch_tiled platforms

Changed

  • Platform.py: Added separate SnitchPlatform (untiled) and SnitchTiledPlatform
    (tiled) with distinct mappings
  • platformMapping.py: Added Snitch_tiled platform support
  • Tiler.py: Registered new FP32 tile constraints
  • SnitchClusterTiling.py: Added tiled code transformation passes

Fixed

  • N/A

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

Add support for FP32 operators required by MicroLlama model:
- RMSNorm: Fused RMS normalization
- HardSwish: Activation function (reference sketches for RMSNorm and HardSwish follow this list)
- Div: Element-wise division
- Mul: Element-wise multiplication
- MatMul: Matrix multiplication
- Add: Element-wise addition (FP32 support)
- Reshape, Transpose, Concat, Gather: Shape operations
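For orientation, a minimal Python sketch of the reference semantics of the two
fused operators (standard HardSwish and LLaMA-style RMSNorm definitions; the
function names here are illustrative only, the actual kernels are the C
implementations listed below):

import math

def hardswish(x: float) -> float:
    # HardSwish(x) = x * min(max(x + 3, 0), 6) / 6
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0

def rmsnorm(x: list, weight: list, eps: float = 1e-6) -> list:
    # Normalize the last dimension by its root mean square, then scale by weight.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * (v / rms) for v, w in zip(x, weight)]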

Components added:
- Generic: Parsers, TypeCheckers, Layers, Bindings
- Snitch Templates: FloatAdd, FloatDiv, FloatHardSwish, FloatMul,
  FloatRMSNorm, FloatMatMul, Reshape, Transpose, Gather
- Snitch Kernels: C implementations for all FP32 operators
- Test data: Hardswish, RMSNorm_fused kernels, microLlama_fp32_1 model

This enables running MicroLlama FP32 model on Snitch in untiled mode:
  python testRunner_snitch.py -t Tests/Models/microLlama/microLlama_fp32_1
Add SnitchTiledPlatform with TileConstraints for FP32 operators:
- FloatDivTileConstraint: Division tiling with scalar broadcast
- FloatMulTileConstraint: Multiplication tiling with scalar broadcast
- ReshapeTileConstraint: Pass-through tiling for reshape

Updates:
- SnitchClusterTiling with tiled code transformation passes
- Tiler.py with new tile constraints registration
- platformMapping.py adds Snitch_tiled platform
- testRunner_tiled_snitch.py for tiled model testing (example invocation below)
- CI workflows for both untiled and tiled Snitch
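The tiled counterpart should be invoked analogously (a hedged guess mirroring
the untiled command above; the exact flags of testRunner_tiled_snitch.py may differ):
  python testRunner_tiled_snitch.py -t Tests/Models/microLlama/microLlama_fp32_1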

coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added Snitch_tiled platform variant for tiled execution mode.
    • Introduced support for RMSNorm, HardSwish, Div, Mul, MatMul, Concat, Transpose, Reshape, and Gather operations.
    • Added broadcasting support for arithmetic operations.
    • Enhanced performance profiling with cycle contribution reporting.
  • Tests

    • Expanded FP32 kernel test coverage with new test configurations.

Walkthrough

Extends Snitch architecture support with a new "Snitch_tiled" platform variant, adding FP32 kernels for RMSNorm, HardSwish, Div, Mul, MatMul, Concat, Transpose, Reshape, and Gather operations with tiling-ready bindings and implementations for both tiled and non-tiled modes.

Changes

Cohort / File(s) Summary
CI Workflow Test Configurations
.github/workflows/ci-platform-snitch.yml, .github/workflows/ci-platform-snitch-tiled.yml
Added six new FP32 kernel test configurations for RMSNorm_fused, MatMul, Add/Regular, Hardswish, and Div to test suite coverage.
CMake Platform Configuration
CMakeLists.txt
Extended platform enumeration to include Snitch_tiled variant; updated conditional branches and toolchain selection to recognize and handle the new platform option.
Generic Deeploy Layers
Deeploy/Targets/Generic/Layers.py
Added RMSNormLayer and HardSwishLayer classes with operation count computations for normalization and activation kernels.
Generic Deeploy Type System
Deeploy/Targets/Generic/TypeCheckers.py
Introduced FloatAddChecker, RMSNormChecker, and HardSwishChecker for sign and numeric level propagation in floating-point operations.
Generic Deeploy Operator Parsing & Bindings
Deeploy/Targets/Generic/Parsers.py, Deeploy/Targets/Generic/Bindings.py
Extended AddParser with broadcasting support for multi-dimensional tensor shapes and strides; added float32_t concatenation binding.
Snitch Parsers & Platform Integration
Deeploy/Targets/Snitch/Parsers.py, Deeploy/Targets/Snitch/Platform.py
Added specialized Snitch parsers for RMSNorm, HardSwish, Div, Mul with broadcasting awareness; introduced SnitchTiled platform and mapping infrastructure with support for both tiled and untiled operator modes.
Snitch Bindings & Template Support
Deeploy/Targets/Snitch/Bindings.py, Deeploy/Targets/Snitch/Templates/Float*Template.py
Extended bindings for new operators (RMSNorm, HardSwish, Div, Mul, MatMul, Concat, Transpose, Reshape, Gather) with tiled and basic variants; introduced corresponding FP32 templates with broadcast-aware and scalar-aware variants.
Snitch Tiling Infrastructure
Deeploy/Targets/Snitch/Tiler.py, Deeploy/Targets/Snitch/TileConstraints/*
Added tiling-ready bindings for new operators; introduced tile constraint classes (FloatDivTileConstraint, FloatMulTileConstraint, ReshapeTileConstraint) for geometric constraint and tiling solution serialization.
Snitch Profiling & Code Generation
Deeploy/Targets/Snitch/CodeTransformationPasses/SnitchClusterTiling.py
Updated cycle profiling templates to include fixed-width cycle difference printing and consolidated cycle contribution reporting (total cycles, DMA portion, and percentage breakdowns).
Kernel Headers
TargetLibraries/Snitch/inc/kernel/*.h, TargetLibraries/Snitch/inc/macros.h
Added kernel declarations for Add, Div, HardSwish, Mul, and RMSNorm FP32 operations; wrapped MAX/MIN/CLAMP macro definitions with include guards to prevent redefinition; renamed softmax_fp32 to Softmax_fp32 for naming consistency.
Kernel Implementations
TargetLibraries/Snitch/src/*.c
Implemented FP32 kernels (Add, Div, HardSwish, Mul, RMSNorm) with multi-core parallelization and broadcast support; refactored Gemm_fp32 to remove SSR logic and use direct loops; added optional instruction counter gating in CycleCounter.
Generic Macro Guards
TargetLibraries/Generic/inc/macros.h
Added include guards around MAX, MIN, and CLAMP macro definitions to prevent redefinition errors.
Test Infrastructure
DeeployTest/testRunner_tiled_snitch.py, DeeployTest/testUtils/platformMapping.py, DeeployTest/testUtils/typeMapping.py
Updated platform mapping to recognize SnitchTiledPlatform; modified type inference to prioritize floating-point dtype detection; added HardSwish test case.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Feature

Suggested reviewers

  • Victor-Jung
  • Xeratec
  • diaconuccalin
🚥 Pre-merge checks (✅ 2 passed, ❌ 1 warning)
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 12.82%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title clearly and concisely describes the main addition (FP32 operators and tiling support for MicroLlama on Snitch), matching the core objective of the changeset.
  • Description check — ✅ Passed: the description is directly related to the changeset, detailing the specific FP32 operators, templates, parsers, bindings, kernels, tile constraints, and platform changes added to support MicroLlama on Snitch.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Deeploy/Targets/Snitch/Platform.py (1)

197-277: ⚠️ Potential issue | 🟠 Major

Replace mutable default arguments in SnitchConstantBuffer, SnitchPlatform, and SnitchTiledPlatform.

Using list defaults and constructing engines in function signatures risks shared state across instances. Additionally, includeList is accepted but unused in both platform classes; wiring it into the default engine construction fixes that.

Proposed fix
-    def __init__(self, name: str = '', shape = [1], values = [0]):
-        super().__init__(name, shape, values)
+    def __init__(self, name: str = '', shape = None, values = None):
+        if shape is None:
+            shape = [1]
+        if values is None:
+            values = [0]
+        super().__init__(name, shape, values)
@@
 class SnitchPlatform(DeploymentPlatform):
 
     def __init__(self,
-                 engines = [SnitchClusterEngine("SnitchCluster")],
+                 engines = None,
                  variableBuffer = SnitchVariableBuffer,
                  constantBuffer = SnitchConstantBuffer,
                  structBuffer = SnitchStructBuffer,
                  transientBuffer = SnitchTransientBuffer,
                  includeList: List[str] = _includeList):
+        if engines is None:
+            engines = [SnitchClusterEngine("SnitchCluster", includeList = includeList)]
         super().__init__(engines, variableBuffer, constantBuffer, structBuffer, transientBuffer)
@@
 class SnitchTiledPlatform(DeploymentPlatform):
 
     def __init__(self,
-                 engines = [SnitchTiledClusterEngine("SnitchCluster")],
+                 engines = None,
                  variableBuffer = SnitchVariableBuffer,
                  constantBuffer = SnitchConstantBuffer,
                  structBuffer = SnitchStructBuffer,
                  transientBuffer = SnitchTransientBuffer,
                  includeList: List[str] = _includeList):
+        if engines is None:
+            engines = [SnitchTiledClusterEngine("SnitchCluster", includeList = includeList)]
         super().__init__(engines, variableBuffer, constantBuffer, structBuffer, transientBuffer)
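For readers unfamiliar with the pitfall, a minimal stand-alone illustration
(generic Python with a hypothetical Buffer class, not Deeploy code) of why
mutable defaults leak state between instances:

class Buffer:

    def __init__(self, shape = [1]):   # the default list is created once, at def time
        self.shape = shape

a = Buffer()
b = Buffer()
a.shape.append(7)
print(b.shape)   # [1, 7] -- b observes a's mutation because both share the same default list

class SafeBuffer:

    def __init__(self, shape = None):
        if shape is None:
            shape = [1]                # a fresh list on every call
        self.shape = shape

c = SafeBuffer()
d = SafeBuffer()
c.shape.append(7)
print(d.shape)   # [1]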
🤖 Fix all issues with AI agents
In `@Deeploy/Targets/Generic/Parsers.py`:
- Around line 494-525: AddParser.parseNodeCtxt computes broadcasting strides
incorrectly when input ranks differ because it checks i < len(shape1/shape2)
instead of left-padding shapes with 1s per ONNX rules; update the code to
left-pad shape1 and shape2 to length ndim (length of out_shape) with leading 1s
and then compute strides1 and strides2 exactly as in
SnitchAddParser._compute_broadcast_strides (treat dimensions equal to out_shape
as non-broadcast and set stride 0 for broadcast dims, compute cumulative strides
from the right otherwise), and apply the same padding+stride logic to both
strides1 and strides2 so examples like shape1=[4], out_shape=[2,3,4] produce
strides1=[0,0,1].

In `@Deeploy/Targets/Snitch/Parsers.py`:
- Around line 239-285: In parseNodeCtxt of class SnitchDivParser, after
computing shape1, shape2, out_shape and their sizes
(operatorRepresentation['size1'], ['size2'], ['size']), add a guard that rejects
non-scalar broadcasting by returning (ctxt, False) whenever input2 is not a
scalar and its total size does not equal the output size (i.e.,
operatorRepresentation['size2'] != 1 and operatorRepresentation['size2'] !=
operatorRepresentation['size']). This ensures parseNodeCtxt fails for non-scalar
broadcasts (preventing the elementwise path from reading past input2) until a
general broadcast kernel is implemented; use the existing operatorRepresentation
keys and node.inputs/node.outputs to locate where to place the check.
- Around line 288-334: In SnitchMulParser.parseNodeCtxt, detect non-scalar
broadcasting and reject the node: after computing shape1, shape2, and out_shape
and setting size1/size2/size, add a guard that if shape2 (or shape1) does not
exactly match out_shape and operatorRepresentation['size2'] != 1 (i.e., input2
is not a scalar) then treat this as unsupported broadcasting — set an
error/unsupported marker and return ctxt, False so the generic elementwise
kernel is not used; keep the existing operatorRepresentation fields (shape1,
shape2, out_shape, size*, is_scalar) and reference parseNodeCtxt,
operatorRepresentation, size2, and is_scalar when implementing the check.

In `@Deeploy/Targets/Snitch/Platform.py`:
- Around line 25-55: The file imports RQAddMapper from
Deeploy.Targets.PULPOpen.Platform but then defines a local RQAddMapper =
NodeMapper(...) which shadows the import and triggers F811; fix it by removing
RQAddMapper from the import list on the first line (or alternatively rename the
local NodeMapper variable, e.g., LocalRQAddMapper) so there is no name collision
between the imported symbol and the local definition; ensure any other code
referencing the external RQAddMapper is updated if you choose to rename the
local variable.

In `@Deeploy/Targets/Snitch/Templates/MatMulTemplate.py`:
- Around line 42-56: The loop always advances input pointers
ref_${data_out}_${A} and ref_${data_out}_${B} for each batch, which breaks
broadcasting; update the batched MatMul loop to only increment those pointers
when the corresponding A_batched or B_batched flag is true (leave
ref_${data_out}_${data_out} always incremented), i.e. after calling
MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(...)
wrap the pointer advances in conditional checks using the parser-provided
A_batched and B_batched flags so broadcasted inputs (batch dim == 1) are not
advanced.

In `@Deeploy/Targets/Snitch/Templates/ReshapeTemplate.py`:
- Around line 18-44: The code incorrectly assumes
operatorRepresentation['shape'] and operatorRepresentation['indices'] live in
ctxt.globalObjects; replace these unsafe direct lookups by checking
ctxt.is_global(name) before accessing ctxt.globalObjects and avoid the
unnecessary 'indices' handling (remove the 'indices' block since Reshape never
sets it). Concretely, in the Reshape template handler (look for
operatorRepresentation, ctxt.lookup, bufferIn/bufferOut logic), guard any access
to ctxt.globalObjects[operatorRepresentation["shape"]] with
ctxt.is_global(operatorRepresentation["shape"]) and only set _deploy/_live when
that returns True; drop the indices branch entirely and rely on ctxt.lookup for
non-global buffers.

In `@Deeploy/Targets/Snitch/TileConstraints/FloatDivTileConstraint.py`:
- Around line 59-68: Add an explicit upfront shape validation before calling
tilerModel.getTensorDimVar: if the inputs are non-scalar (len(input1Shape) > 0)
compare input1Shape and input2Shape and raise a clear exception (or return an
error) when they differ, referencing the variables inputBuffer1Name,
inputBuffer2Name, input1Shape and input2Shape; keep the existing call to
tilerModel.addTensorDimToModel and the per-dimension constraints
(getTensorDimVar / addConstraint) only after this explicit equality check so
getTensorDimVar is never invoked on mismatched shapes.

In `@Deeploy/Targets/Snitch/TileConstraints/FloatMulTileConstraint.py`:
- Around line 59-68: Before adding per-dimension constraints, validate that
input2’s shape is compatible with input1: if len(input2Shape) !=
len(input1Shape) and input2 is not scalar (i.e., len(input2Shape) != 0), raise a
clear error (e.g., ValueError) explaining the mismatched shapes; this prevents
obscure failures from tilerModel.getTensorDimVar when you later call
addTensorDimToModel and iterate over input1Shape. Reference the symbols
input1Shape, input2Shape, inputBuffer2Name, tilerModel.addTensorDimToModel, and
tilerModel.getTensorDimVar to locate where to insert the check.

In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py`:
- Around line 23-47: In addGeometricalConstraint remove the two unused local
variables by deleting the assignments "inputBuffer =
ctxt.lookup(inputBufferName)" and "outputBuffer = ctxt.lookup(outputBufferName)"
(or alternatively use them if needed); the current ctxt.lookup calls create
unused values causing Ruff F841—ensure any required buffer lookups are either
consumed (e.g., used in subsequent logic) or removed from the function body.
- Around line 83-134: The mapping from output tiles to input HyperRectangle(s)
in the loop over outputCubes (using outputShape, inputShape, outOffset, outSize,
inStrides, inCubeOffset, inCubeDims and creating HyperRectangle) is incorrect
for tiles that cross input dimension boundaries; fix by detecting when the
linear range [outOffset, outOffset+outSize) spans multiple input strides and
emit one or more input HyperRectangle pieces that together cover exactly outSize
elements instead of assuming a single axis-aligned block. Concretely, replace
the simplistic inCubeDims/inCubeOffset construction with logic that: 1) computes
the remainingCount = outSize and a cursor = outOffset, 2) in a loop maps cursor
to multi-dimensional coords using inStrides to get the start coord, determines
the maximal contiguous run length along the last (fastest) dimension without
crossing that dimension limit (using inputShape and start coord), 3) create a
HyperRectangle for that run, subtract its size from remainingCount, advance
cursor by run length, and repeat until remainingCount == 0; ensure the created
HyperRectangle tuples (offset, dims) sum exactly to outSize and add all pieces
instead of a single inputCube. (A plain-Python sketch of this splitting appears after this prompt list.)

In `@TargetLibraries/Snitch/inc/kernel/HardSwish.h`:
- Around line 1-5: Update the SPDX header in HardSwish.h to correct the
copyright year from 2026 to the proper year (2024 or 2025); locate the
top-of-file comment block in TargetLibraries/Snitch/inc/kernel/HardSwish.h and
change the year in the SPDX-FileCopyrightText line so it matches the source
file.

In `@TargetLibraries/Snitch/src/Div_fp32.c`:
- Around line 10-49: The function Div_fp32 currently performs element-wise
division using input2[i] and does not handle scalar broadcasting; update the
top-of-file comment for Div_fp32 to remove the "If input2 is scalar (size=1):
divides all elements..." claim and clearly state that Div_fp32 assumes input1
and input2 have the same size (element-wise) and that scalar denominators must
call Div_fp32_scalar (or another scalar-specific routine); additionally, add a
short runtime assert/check in Div_fp32 (referencing the variables size, input2,
and function name Div_fp32) or document that input2_size must equal size to
prevent out-of-bounds access if scalar inputs are routed incorrectly.

In `@TargetLibraries/Snitch/src/Mul_fp32.c`:
- Around line 1-5: The SPDX header in the file Mul_fp32.c has an incorrect
future year (2026); update the copyright year(s) in the top comment block (the
SPDX header lines and the SPDX-FileCopyrightText entry) to the correct current
year or an appropriate range (e.g., 2024 or "2023-2024") so the
SPDX-FileCopyrightText and SPDX-License-Identifier comment reflects accurate
dates.
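
As a companion to the ReshapeTileConstraint prompt above, a plain-Python sketch
(stand-alone helper with the hypothetical name flat_range_to_runs, not Deeploy
code) of splitting a flat output range into contiguous input runs that never
cross the fastest-varying dimension:

def flat_range_to_runs(out_offset, out_size, input_shape):
    # Map the flat element range [out_offset, out_offset + out_size) onto a list
    # of (start_coords, run_length) pieces; each run stays inside the last
    # (fastest-varying) input dimension, so its elements are contiguous.
    last = input_shape[-1]
    runs = []
    cursor, remaining = out_offset, out_size
    while remaining > 0:
        coords, rest = [], cursor
        for dim in reversed(input_shape):          # flat index -> coordinates
            coords.append(rest % dim)
            rest //= dim
        coords.reverse()
        run = min(remaining, last - coords[-1])    # stop at the row boundary
        runs.append((tuple(coords), run))
        cursor += run
        remaining -= run
    return runs

# A 6-element tile starting at flat offset 5 of a (3, 4) input crosses a row
# boundary, so it maps to two pieces rather than one axis-aligned rectangle:
print(flat_range_to_runs(5, 6, [3, 4]))   # [((1, 1), 3), ((2, 0), 3)]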
🧹 Nitpick comments (7)
Deeploy/Targets/Snitch/Templates/FloatSoftmaxTemplate.py (1)

33-36: Pre-existing issue: Implementation marked as broken with memory leak.

The comment on line 33 explicitly states this implementation is broken and has a memory leak. Additionally, multi-core parallelization is disabled (compute_num hardcoded to 1).

Would you like help addressing this implementation issue, or should a tracking issue be opened to ensure it's resolved before production use?

TargetLibraries/Snitch/src/CycleCounter.c (1)

9-11: Confusing comment formatting for the disabled macro.

The #define ENABLE_INSTR_COUNTER is embedded at the end of the comment on line 10, which makes it unclear that this is a commented-out preprocessor directive meant to be enabled. Consider placing it on a separate line for clarity.

✏️ Suggested formatting improvement
-// Define ENABLE_INSTR_COUNTER to enable instruction counting (causes warnings
-// in gvsoc) `#define` ENABLE_INSTR_COUNTER
+// Define ENABLE_INSTR_COUNTER to enable instruction counting
+// (causes warnings in gvsoc)
+// `#define` ENABLE_INSTR_COUNTER
TargetLibraries/Snitch/src/Gemm_fp32.c (1)

36-60: Non-transposed GEMM is mathematically correct; note significant performance trade-off.

The indexing B[k * ldB + n] correctly accesses B[k,n] for the non-transposed case. The shift from SSR-based vectorized loops to simple triple-nested scalar loops is a significant performance regression on Snitch hardware. If this is intentional for correctness validation or as a fallback path, consider adding a brief comment clarifying the purpose.

Minor nit: Line 41 has an extra blank line after (void)setup_SSR; that isn't present in gemm_fp32_transB_opt (line 14).

Optional: remove extra blank line for consistency
 void gemm_fp32_opt(uint32_t M, uint32_t N, uint32_t K, float32_t *A,
                    uint32_t ldA, float32_t *B, uint32_t ldB, float32_t *C,
                    uint32_t ldC, float32_t *Y, uint32_t BETA,
                    uint32_t setup_SSR) {
   (void)setup_SSR;
-
   uint32_t compute_id = snrt_global_compute_core_idx();
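
A quick NumPy check (illustrative only) of why the B[k * ldB + n] indexing noted
above addresses element B[k, n] of a row-major matrix whose leading dimension
ldB equals its number of columns:

import numpy as np

K, N = 3, 5
B = np.arange(K * N, dtype=np.float32).reshape(K, N)   # row-major storage
ldB = N                                                # leading dimension = row stride
flat = B.ravel()
assert all(flat[k * ldB + n] == B[k, n] for k in range(K) for n in range(N))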
Deeploy/Targets/Snitch/Templates/GatherTemplate.py (1)

8-17: Use integer division // instead of / for width calculation.

On line 11, int(data_in_type.referencedType.typeWidth/8) performs floating-point division then truncates. While this works correctly for typical type widths (8, 16, 32, 64), using // is more idiomatic and explicit for integer division in Python.

♻️ Suggested fix
 referenceTemplate = NodeTemplate("""
 // Gather (Name: ${nodeName}, Op: ${nodeOp})
 <%
-width = int(data_in_type.referencedType.typeWidth/8)
+width = data_in_type.referencedType.typeWidth // 8
 %>
 if (snrt_cluster_core_idx() == 0) {
Deeploy/Targets/Generic/TypeCheckers.py (1)

649-663: Consider renaming for clarity to distinguish quantized vs. floating-point variants.

Both HardswishChecker (line 419) and HardSwishChecker (line 649) exist with intentionally different implementations:

  • HardswishChecker (quantized): 2^(4 * typeWidth), used with int8_t input in PULPOpen bindings
  • HardSwishChecker (floating-point): 2^(typeWidth), used with float32_t input in Snitch bindings

The subtle casing difference makes these easy to confuse. To improve clarity, consider renaming one explicitly: either QuantizedHardswishChecker for the quantized variant or FloatHardSwishChecker for the floating-point variant.

TargetLibraries/Snitch/src/Mul_fp32.c (1)

9-26: Misleading documentation about broadcasting support.

The comment states "If input2 is scalar (size=1): multiplies all elements of input1 by input2[0]", but Mul_fp32 performs strict element-wise multiplication without any broadcasting logic. The scalar case is handled separately by Mul_fp32_scalar. Consider updating the comment to reflect the actual behavior.

Suggested documentation fix
 /*
  * Element-wise Multiplication (FP32)
  *
  * Computes: output[i] = input1[i] * input2[i]
  *
- * Supports ONNX broadcasting rules:
- * - If input2 is scalar (size=1): multiplies all elements of input1 by
- * input2[0]
- * - If both have same size: element-wise multiplication
+ * Performs element-wise multiplication. Both inputs must have the same size.
+ * For scalar multiplication, use Mul_fp32_scalar instead.
  *
Deeploy/Targets/Snitch/Templates/FloatRMSNormTemplate.py (1)

27-29: Consider using type-templated kernel name for consistency.

The kernel name RMSNorm_fp32 is hardcoded, while FloatMatMulTemplate uses type-templated names like MatMul_fp${A_type.referencedType.typeWidth}_.... If this template is intended exclusively for FP32, this is acceptable. Otherwise, consider templating the type width for future extensibility.

Based on learnings, Deeploy templates should use explicit bitwidth types for type consistency with templated kernel calls.

Optional: Type-templated kernel name
 FloatRMSNormTemplateStr = r"""
-RMSNorm_fp32(${data_in}, ${weight}, ${data_out}, ${size}, ${lastDimLength}, ${eps});
+RMSNorm_fp${data_in_type.referencedType.typeWidth}(${data_in}, ${weight}, ${data_out}, ${size}, ${lastDimLength}, ${eps});
 """

Comment on lines +494 to +525
        if need_broadcast:
            # Calculate strides for broadcasting
            ndim = len(out_shape)

            # Compute strides for input 1
            strides1 = [1] * ndim
            for i in range(ndim - 1, -1, -1):
                if i < len(shape1) and shape1[i] == out_shape[i]:
                    if i == ndim - 1:
                        strides1[i] = 1
                    else:
                        strides1[i] = strides1[i + 1] * shape1[i + 1] if (
                            i + 1 < len(shape1) and shape1[i + 1] == out_shape[i + 1]) else strides1[i + 1]
                else:
                    strides1[i] = 0  # Broadcast dimension

            # Compute strides for input 2
            strides2 = [1] * ndim
            for i in range(ndim - 1, -1, -1):
                if i < len(shape2) and shape2[i] == out_shape[i]:
                    if i == ndim - 1:
                        strides2[i] = 1
                    else:
                        strides2[i] = strides2[i + 1] * shape2[i + 1] if (
                            i + 1 < len(shape2) and shape2[i + 1] == out_shape[i + 1]) else strides2[i + 1]
                else:
                    strides2[i] = 0  # Broadcast dimension

            self.operatorRepresentation['ndim'] = ndim
            self.operatorRepresentation['strides1'] = strides1
            self.operatorRepresentation['strides2'] = strides2
            self.operatorRepresentation['out_shape'] = out_shape

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, let's read the actual code in the file
wc -l Deeploy/Targets/Generic/Parsers.py

Repository: pulp-platform/Deeploy

Length of output: 104


🏁 Script executed:

#!/bin/bash
# Read the relevant section of the code
sed -n '480,530p' Deeploy/Targets/Generic/Parsers.py

Repository: pulp-platform/Deeploy

Length of output: 2170


🏁 Script executed:

#!/bin/bash
# Search for test files that test AddParser with broadcasting
find . -type f -name "*.py" -path "*test*" | head -20

Repository: pulp-platform/Deeploy

Length of output: 832


🏁 Script executed:

#!/bin/bash
# Look for AddParser tests specifically
rg -l "AddParser" --type=py | grep -i test

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Search for test files that might test AddParser
rg "AddParser" --type=py DeeployTest/

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Search for Add operator tests in test files
rg -i "Add\|broadcast" --type=py DeeployTest/ -A3 -B3 | head -100

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Check the AddParser class definition more carefully
sed -n '450,530p' Deeploy/Targets/Generic/Parsers.py

Repository: pulp-platform/Deeploy

Length of output: 3564


🏁 Script executed:

#!/bin/bash
# Look for test ONNX models that might use Add with broadcasting
find . -name "*.onnx" -o -name "*.pb" | head -20

Repository: pulp-platform/Deeploy

Length of output: 1241


🏁 Script executed:

#!/bin/bash
# Let's examine one of the Add test cases to understand the shapes used
python3 << 'EOF'
import sys
try:
    import onnx
    model = onnx.load('./DeeployTest/Tests/Kernels/FP32/Add/Large/network.onnx')
    print("Model graph inputs and outputs:")
    for input_tensor in model.graph.input:
        print(f"Input: {input_tensor.name}")
        if input_tensor.type.tensor_type.shape.dim:
            dims = [d.dim_value for d in input_tensor.type.tensor_type.shape.dim]
            print(f"  Shape: {dims}")
    
    for output_tensor in model.graph.output:
        print(f"Output: {output_tensor.name}")
        if output_tensor.type.tensor_type.shape.dim:
            dims = [d.dim_value for d in output_tensor.type.tensor_type.shape.dim]
            print(f"  Shape: {dims}")
    
    for node in model.graph.node:
        print(f"Node: {node.op_type} - {node.name}")
        print(f"  Inputs: {list(node.input)}")
        print(f"  Outputs: {list(node.output)}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
    sys.exit(1)
EOF

Repository: pulp-platform/Deeploy

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Check the other Add test case
python3 << 'EOF'
import sys
try:
    import onnx
    model = onnx.load('./DeeployTest/Tests/Kernels/FP32/Add/Regular/network.onnx')
    print("Model graph inputs and outputs:")
    for input_tensor in model.graph.input:
        print(f"Input: {input_tensor.name}")
        if input_tensor.type.tensor_type.shape.dim:
            dims = [d.dim_value for d in input_tensor.type.tensor_type.shape.dim]
            print(f"  Shape: {dims}")
    
    for output_tensor in model.graph.output:
        print(f"Output: {output_tensor.name}")
        if output_tensor.type.tensor_type.shape.dim:
            dims = [d.dim_value for d in output_tensor.type.tensor_type.shape.dim]
            print(f"  Shape: {dims}")
    
    for node in model.graph.node:
        print(f"Node: {node.op_type} - {node.name}")
        print(f"  Inputs: {list(node.input)}")
        print(f"  Outputs: {list(node.output)}")
except Exception as e:
    print(f"Error: {e}", file=sys.stderr)
    sys.exit(1)
EOF

Repository: pulp-platform/Deeploy

Length of output: 94


🏁 Script executed:

#!/bin/bash
# Look for other test files that might test broadcasting scenarios
find DeeployTest/Tests -name "*.onnx" | xargs -I {} sh -c 'echo "=== {} ===" && python3 << EOF 2>&1 | head -20
import onnx
try:
    model = onnx.load("{}")
    for node in model.graph.node:
        if node.op_type == "Add":
            print(f"Found Add node: {node.name}")
except:
    pass
EOF' | grep -A10 "Found Add"

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

#!/bin/bash
# Let's manually trace through the stride calculation logic
# to verify the reviewer's concern about rank differences
python3 << 'EOF'
# Simulate the stride calculation as it appears in the code

def compute_strides_original(shape, out_shape):
    """Original code logic"""
    ndim = len(out_shape)
    strides = [1] * ndim
    for i in range(ndim - 1, -1, -1):
        if i < len(shape) and shape[i] == out_shape[i]:
            if i == ndim - 1:
                strides[i] = 1
            else:
                strides[i] = strides[i + 1] * shape[i + 1] if (
                    i + 1 < len(shape) and shape[i + 1] == out_shape[i + 1]) else strides[i + 1]
        else:
            strides[i] = 0
    return strides

# Test case 1: Same rank (should work)
shape1 = [1, 3, 4]
out_shape = [2, 3, 4]
result = compute_strides_original(shape1, out_shape)
print(f"Test 1 - shape1={shape1}, out_shape={out_shape}")
print(f"  Result: {result}")
print(f"  Expected: [0, 4, 1]")
print()

# Test case 2: Different rank (reviewer's concern)
shape1 = [4]
out_shape = [2, 3, 4]
result = compute_strides_original(shape1, out_shape)
print(f"Test 2 - shape1={shape1}, out_shape={out_shape}")
print(f"  Result: {result}")
print(f"  Expected: [0, 0, 1] (shape1 implicitly [1, 1, 4] per ONNX broadcasting)")
print()

# Test case 3: Another different rank scenario
shape1 = [3, 4]
out_shape = [2, 3, 4]
result = compute_strides_original(shape1, out_shape)
print(f"Test 3 - shape1={shape1}, out_shape={out_shape}")
print(f"  Result: {result}")
print(f"  Expected: [0, 4, 1] (shape1 implicitly [1, 3, 4])")
print()

# Now test the suggested fix
def compute_strides_fixed(shape, out_shape):
    """Fixed version with padding"""
    ndim = len(out_shape)
    # Pad shape from the left to match ndim
    padded_shape = [1] * (ndim - len(shape)) + shape
    
    strides = [1] * ndim
    for i in range(ndim - 1, -1, -1):
        if padded_shape[i] == out_shape[i]:
            if i == ndim - 1:
                strides[i] = 1
            else:
                strides[i] = strides[i + 1] * padded_shape[i + 1] if (
                    padded_shape[i + 1] == out_shape[i + 1]) else strides[i + 1]
        else:
            strides[i] = 0
    return strides

print("=" * 50)
print("WITH FIX (padding shapes):")
print("=" * 50)

# Test case 2 with fix
shape1 = [4]
out_shape = [2, 3, 4]
result = compute_strides_fixed(shape1, out_shape)
print(f"Test 2 - shape1={shape1}, out_shape={out_shape}")
print(f"  Result: {result}")
print(f"  Expected: [0, 0, 1]")
print()

# Test case 3 with fix
shape1 = [3, 4]
out_shape = [2, 3, 4]
result = compute_strides_fixed(shape1, out_shape)
print(f"Test 3 - shape1={shape1}, out_shape={out_shape}")
print(f"  Result: {result}")
print(f"  Expected: [0, 4, 1]")
EOF

Repository: pulp-platform/Deeploy

Length of output: 706


🏁 Script executed:

#!/bin/bash
# Look for documentation or comments about ONNX broadcasting in the codebase
rg -i "onnx.*broadcast|broadcast.*onnx" --type=py -B2 -A2 | head -50

Repository: pulp-platform/Deeploy

Length of output: 412


🏁 Script executed:

#!/bin/bash
# Check if there are any other stride calculations in the codebase for reference
rg "stride" Deeploy/ --type=py -B2 -A2 | head -100

Repository: pulp-platform/Deeploy

Length of output: 8804


🏁 Script executed:

#!/bin/bash
# Let's look at the correct implementation in Snitch/Parsers.py
rg "_compute_broadcast_strides" Deeploy/Targets/Snitch/Parsers.py -A50 | head -80

Repository: pulp-platform/Deeploy

Length of output: 2043


🏁 Script executed:

#!/bin/bash
# Check if AddParser in Generic/Parsers.py should use a similar approach
sed -n '1,100p' Deeploy/Targets/Generic/Parsers.py | grep -i "import\|class"

Repository: pulp-platform/Deeploy

Length of output: 368


🏁 Script executed:

#!/bin/bash
# Let's examine the Snitch AddParser implementation more closely
rg "class.*AddParser" Deeploy/Targets/Snitch/Parsers.py -A100 | head -150

Repository: pulp-platform/Deeploy

Length of output: 3652


🏁 Script executed:

#!/bin/bash
# Check if Generic/Parsers.py imports or reuses any functions from elsewhere
grep -n "class AddParser" Deeploy/Targets/Generic/Parsers.py -A150 | head -200

Repository: pulp-platform/Deeploy

Length of output: 6781


Stride calculation is broken for inputs with different ranks — fix required.

The stride computation in AddParser.parseNodeCtxt (lines 494-525) fails when input shapes have fewer dimensions than the output shape. The code checks i < len(shape1) without padding shapes first, violating ONNX broadcasting semantics which pad shorter shapes with 1s on the left.

For example, shape1=[4] and out_shape=[2,3,4] produces strides1=[0,0,0] instead of [0,0,1].

The correct approach is already implemented in Deeploy/Targets/Snitch/Parsers.py (SnitchAddParser._compute_broadcast_strides). Pad both shapes from the left before computing strides:

         if need_broadcast:
             # Calculate strides for broadcasting
             ndim = len(out_shape)
 
+            # Pad shapes from the left to match ndim (ONNX broadcasts from right)
+            padded_shape1 = [1] * (ndim - len(shape1)) + shape1
+            padded_shape2 = [1] * (ndim - len(shape2)) + shape2
+
             # Compute strides for input 1
             strides1 = [1] * ndim
             for i in range(ndim - 1, -1, -1):
-                if i < len(shape1) and shape1[i] == out_shape[i]:
+                if padded_shape1[i] == out_shape[i]:
                     if i == ndim - 1:
                         strides1[i] = 1
                     else:
-                        strides1[i] = strides1[i + 1] * shape1[i + 1] if (
-                            i + 1 < len(shape1) and shape1[i + 1] == out_shape[i + 1]) else strides1[i + 1]
+                        strides1[i] = strides1[i + 1] * padded_shape1[i + 1] if (
+                            padded_shape1[i + 1] == out_shape[i + 1]) else strides1[i + 1]
                 else:
                     strides1[i] = 0  # Broadcast dimension

Apply the same fix to the strides2 computation below.

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Generic/Parsers.py` around lines 494 - 525,
AddParser.parseNodeCtxt computes broadcasting strides incorrectly when input
ranks differ because it checks i < len(shape1/shape2) instead of left-padding
shapes with 1s per ONNX rules; update the code to left-pad shape1 and shape2 to
length ndim (length of out_shape) with leading 1s and then compute strides1 and
strides2 exactly as in SnitchAddParser._compute_broadcast_strides (treat
dimensions equal to out_shape as non-broadcast and set stride 0 for broadcast
dims, compute cumulative strides from the right otherwise), and apply the same
padding+stride logic to both strides1 and strides2 so examples like shape1=[4],
out_shape=[2,3,4] produce strides1=[0,0,1].

Comment on lines +239 to +285
class SnitchDivParser(DivParser):
    """
    Snitch-specific Div Parser.
    Inherits from Generic DivParser and adds shape/broadcasting information.
    """

    def __init__(self):
        super().__init__()

    def parseNodeCtxt(self,
                      ctxt: NetworkContext,
                      node: gs.Node,
                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
        """
        Extend Generic parser to add shape and broadcasting information.
        """
        # Call parent method first
        ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)

        if not ret:
            return ctxt, False

        # Get shape information
        data_in_1 = ctxt.lookup(node.inputs[0].name)
        data_in_2 = ctxt.lookup(node.inputs[1].name)
        data_out = ctxt.lookup(node.outputs[0].name)

        shape1 = list(data_in_1.shape)
        shape2 = list(data_in_2.shape)
        out_shape = list(data_out.shape)

        # Store shape information
        self.operatorRepresentation['shape1'] = shape1
        self.operatorRepresentation['shape2'] = shape2
        self.operatorRepresentation['out_shape'] = out_shape

        # Calculate sizes
        self.operatorRepresentation['size1'] = int(np.prod(shape1))
        self.operatorRepresentation['size2'] = int(np.prod(shape2))

        # Update output size (may differ due to broadcasting)
        self.operatorRepresentation['size'] = int(np.prod(out_shape))

        # Check if scalar broadcasting (input2 is scalar)
        self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)

        return ctxt, True

⚠️ Potential issue | 🟠 Major

Reject non-scalar broadcasting for Div until a general broadcast kernel exists.

Right now, any non-scalar broadcast falls through to the elementwise path, which will read past input2.

✅ Suggested guard
         # Check if scalar broadcasting (input2 is scalar)
         self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
+
+        # Non-scalar broadcasting isn't supported by Div_fp32
+        if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
+            return ctxt, False
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class SnitchDivParser(DivParser):
"""
Snitch-specific Div Parser.
Inherits from Generic DivParser and adds shape/broadcasting information.
"""
def __init__(self):
super().__init__()
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)
if not ret:
return ctxt, False
# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)
shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)
# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape
# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))
# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))
# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
return ctxt, True
class SnitchDivParser(DivParser):
"""
Snitch-specific Div Parser.
Inherits from Generic DivParser and adds shape/broadcasting information.
"""
def __init__(self):
super().__init__()
def parseNodeCtxt(self,
ctxt: NetworkContext,
node: gs.Node,
channels_first: bool = True) -> Tuple[NetworkContext, bool]:
"""
Extend Generic parser to add shape and broadcasting information.
"""
# Call parent method first
ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)
if not ret:
return ctxt, False
# Get shape information
data_in_1 = ctxt.lookup(node.inputs[0].name)
data_in_2 = ctxt.lookup(node.inputs[1].name)
data_out = ctxt.lookup(node.outputs[0].name)
shape1 = list(data_in_1.shape)
shape2 = list(data_in_2.shape)
out_shape = list(data_out.shape)
# Store shape information
self.operatorRepresentation['shape1'] = shape1
self.operatorRepresentation['shape2'] = shape2
self.operatorRepresentation['out_shape'] = out_shape
# Calculate sizes
self.operatorRepresentation['size1'] = int(np.prod(shape1))
self.operatorRepresentation['size2'] = int(np.prod(shape2))
# Update output size (may differ due to broadcasting)
self.operatorRepresentation['size'] = int(np.prod(out_shape))
# Check if scalar broadcasting (input2 is scalar)
self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
# Non-scalar broadcasting isn't supported by Div_fp32
if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
return ctxt, False
return ctxt, True
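
To make the guard's effect concrete, a small stand-alone check (hedged; it
mirrors the guard logic above rather than calling Deeploy):

import numpy as np

def div_parse_accepts(shape1, shape2):
    # Mirrors the suggested guard: accept plain elementwise shapes or a scalar
    # second input; reject any other (non-scalar) broadcast.
    is_scalar = int(np.prod(shape2)) == 1
    return list(shape1) == list(shape2) or is_scalar

print(div_parse_accepts([2, 3, 4], [2, 3, 4]))   # True  -- elementwise Div_fp32
print(div_parse_accepts([2, 3, 4], [1]))         # True  -- scalar path (Div_fp32_scalar)
print(div_parse_accepts([2, 3, 4], [1, 3, 4]))   # False -- non-scalar broadcast, parser bails out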
🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Parsers.py` around lines 239 - 285, In parseNodeCtxt
of class SnitchDivParser, after computing shape1, shape2, out_shape and their
sizes (operatorRepresentation['size1'], ['size2'], ['size']), add a guard that
rejects non-scalar broadcasting by returning (ctxt, False) whenever input2 is
not a scalar and its total size does not equal the output size (i.e.,
operatorRepresentation['size2'] != 1 and operatorRepresentation['size2'] !=
operatorRepresentation['size']). This ensures parseNodeCtxt fails for non-scalar
broadcasts (preventing the elementwise path from reading past input2) until a
general broadcast kernel is implemented; use the existing operatorRepresentation
keys and node.inputs/node.outputs to locate where to place the check.

Comment on lines +288 to +334
class SnitchMulParser(MulParser):
    """
    Snitch-specific Mul Parser.
    Inherits from Generic MulParser and adds shape/broadcasting information.
    """

    def __init__(self):
        super().__init__()

    def parseNodeCtxt(self,
                      ctxt: NetworkContext,
                      node: gs.Node,
                      channels_first: bool = True) -> Tuple[NetworkContext, bool]:
        """
        Extend Generic parser to add shape and broadcasting information.
        """
        # Call parent method first
        ctxt, ret = super().parseNodeCtxt(ctxt, node, channels_first)

        if not ret:
            return ctxt, False

        # Get shape information
        data_in_1 = ctxt.lookup(node.inputs[0].name)
        data_in_2 = ctxt.lookup(node.inputs[1].name)
        data_out = ctxt.lookup(node.outputs[0].name)

        shape1 = list(data_in_1.shape)
        shape2 = list(data_in_2.shape)
        out_shape = list(data_out.shape)

        # Store shape information
        self.operatorRepresentation['shape1'] = shape1
        self.operatorRepresentation['shape2'] = shape2
        self.operatorRepresentation['out_shape'] = out_shape

        # Calculate sizes
        self.operatorRepresentation['size1'] = int(np.prod(shape1))
        self.operatorRepresentation['size2'] = int(np.prod(shape2))

        # Update output size (may differ due to broadcasting)
        self.operatorRepresentation['size'] = int(np.prod(out_shape))

        # Check if scalar broadcasting (input2 is scalar)
        self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)

        return ctxt, True

⚠️ Potential issue | 🟠 Major

Reject non-scalar broadcasting for Mul until a general broadcast kernel exists.

Otherwise the elementwise kernel will index beyond input2 for broadcasted shapes.

✅ Suggested guard
         # Check if scalar broadcasting (input2 is scalar)
         self.operatorRepresentation['is_scalar'] = (self.operatorRepresentation['size2'] == 1)
+
+        # Non-scalar broadcasting isn't supported by Mul_fp32
+        if shape1 != shape2 and not self.operatorRepresentation['is_scalar']:
+            return ctxt, False
🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Parsers.py` around lines 288 - 334, In
SnitchMulParser.parseNodeCtxt, detect non-scalar broadcasting and reject the
node: after computing shape1, shape2, and out_shape and setting
size1/size2/size, add a guard that if shape2 (or shape1) does not exactly match
out_shape and operatorRepresentation['size2'] != 1 (i.e., input2 is not a
scalar) then treat this as unsupported broadcasting — set an error/unsupported
marker and return ctxt, False so the generic elementwise kernel is not used;
keep the existing operatorRepresentation fields (shape1, shape2, out_shape,
size*, is_scalar) and reference parseNodeCtxt, operatorRepresentation, size2,
and is_scalar when implementing the check.

Comment on lines 25 to +55
from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
from Deeploy.Targets.Snitch.Parsers import SnitchGEMMParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
from Deeploy.Targets.Snitch.Tiler import SnitchAddTileReadyBindings, SnitchGemmTilingReadyBindings, \
SnitchiNoNormTilingReadyBindings, SnitchiSoftmaxTilingReadyBindings, SnitchRQAddTilingReadyBindings, \
SnitchRqGemmTilingReadyBindings

# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)

RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])

MatMulMapper = NodeMapper(MatMulParser(), BasicMatMulBindings)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmTilingReadyBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmTilingReadyBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormTilingReadyBindings)
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddTilingReadyBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddTileReadyBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)

⚠️ Potential issue | 🟡 Minor

Avoid shadowing the imported RQAddMapper.

The local RQAddMapper = NodeMapper(...) redefines the imported symbol, which is already unused and triggers F811. Removing the import avoids confusion.

Proposed fix
-from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from Deeploy.Targets.PULPOpen.Platform import RQAddMapper
from Deeploy.Targets.Snitch.Parsers import SnitchGEMMParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
from Deeploy.Targets.Snitch.Tiler import SnitchAddTileReadyBindings, SnitchGemmTilingReadyBindings, \
SnitchiNoNormTilingReadyBindings, SnitchiSoftmaxTilingReadyBindings, SnitchRQAddTilingReadyBindings, \
SnitchRqGemmTilingReadyBindings
# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)
RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])
MatMulMapper = NodeMapper(MatMulParser(), BasicMatMulBindings)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmTilingReadyBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmTilingReadyBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxTilingReadyBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormTilingReadyBindings)
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddTilingReadyBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddTileReadyBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)
from Deeploy.Targets.Snitch.Bindings import BasicDivBindings, BasicHardSwishBindings, BasicMulBindings, \
BasicRMSNormBindings, SnitchAddBindings, SnitchGemmBindings, SnitchiNoNormBindings, SnitchiSoftmaxBindings, \
SnitchRQAddBindings, SnitchRqGemmBindings
from Deeploy.Targets.Snitch.Parsers import HardSwishParser, SnitchDivParser, SnitchGEMMParser, SnitchMulParser, \
SnitchRMSNormParser, SnitchRQGEMMParser
from Deeploy.Targets.Snitch.Templates import AllocateTemplate, FreeTemplate
# =============================================================================
# Mappers for UNTILED mode (using BasicBindings with BasicTransformer)
# These are used by generateNetwork.py (testRunner_snitch.py)
# =============================================================================
GatherMapper = NodeMapper(GatherParser(), BasicGatherBindings)
Pad1DMapper = NodeMapper(Pad1DParser(), BasicPad1DBindings)
Pad2DMapper = NodeMapper(Pad2DParser(), BasicPad2DBindings)
UnsqueezeMapper = NodeMapper(UnsqueezeParser(), BasicReshapeBindings)
ReshapeMapper = NodeMapper(ReshapeParser(), BasicReshapeBindings)
TransposeMapper = NodeMapper(TransposeParser(), BasicTransposeBindings)
ConcatMapper = NodeMapper(ConcatParser(), BasicConcatBindings)
RQIntegerDivMapper = NodeMapper(RQIntegerDivParser(), [BasicRQIntegerDivBinding])
# These use TiledTransformer but work in both modes (original upstream behavior)
GemmMapper = NodeMapper(SnitchGEMMParser(), SnitchGemmBindings)
RqGemmMapper = NodeMapper(SnitchRQGEMMParser(), SnitchRqGemmBindings)
iSoftmaxMapper = NodeMapper(iSoftmaxParser(), SnitchiSoftmaxBindings)
SoftmaxMapper = NodeMapper(SoftmaxParser(), SnitchiSoftmaxBindings)
iNoNormMapper = NodeMapper(iNoNormParser(), SnitchiNoNormBindings)
iLayerNormMapper = NodeMapper(iLayerNormParser(), BasicLayerNormBindings)
RQAddMapper = NodeMapper(RQAddParser(), SnitchRQAddBindings)
AddMapper = NodeMapper(AddParser(), SnitchAddBindings)
🧰 Tools
🪛 Ruff (0.14.14)

[error] 54-54: Redefinition of unused RQAddMapper from line 25: RQAddMapper redefined here

(F811)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Platform.py` around lines 25 - 55, The file imports
RQAddMapper from Deeploy.Targets.PULPOpen.Platform but then defines a local
RQAddMapper = NodeMapper(...) which shadows the import and triggers F811; fix it
by removing RQAddMapper from the import list on the first line (or alternatively
rename the local NodeMapper variable, e.g., LocalRQAddMapper) so there is no
name collision between the imported symbol and the local definition; ensure any
other code referencing the external RQAddMapper is updated if you choose to
rename the local variable.

Comment on lines +42 to +56
for(uint32_t i=0;i<${batch};i++){
    MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(
        ref_${data_out}_${A},
        ref_${data_out}_${B},
        ref_${data_out}_${data_out},
        ${M},
        ${N},
        ${O},
        ${A_offset}, ${B_offset}, ${C_offset}
    );

    ref_${data_out}_${A} += ${M} * ${N};
    ref_${data_out}_${B} += ${N} * ${O};
    ref_${data_out}_${data_out} += ${M} * ${O};
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for MatMul-related parser and binding logic
rg -n "MatMul" Deeploy/Targets/Snitch -g '*.py' -A 3

Repository: pulp-platform/Deeploy

Length of output: 8307


🏁 Script executed:

# Search in Generic targets for broadcast/batch handling patterns
rg -n "MatMul" Deeploy/Targets/Generic -g '*.py' -A 3

Repository: pulp-platform/Deeploy

Length of output: 10799


🏁 Script executed:

# Examine the MatMulTemplate.py file directly
cat -n Deeploy/Targets/Snitch/Templates/MatMulTemplate.py

Repository: pulp-platform/Deeploy

Length of output: 2731


🏁 Script executed:

# Search for batch and broadcast handling in the codebase
rg -n "broadcast" Deeploy/Targets -g '*.py' -B 2 -A 2

Repository: pulp-platform/Deeploy

Length of output: 22048


🏁 Script executed:

# Read MatMulParser implementation
sed -n '1775,1850p' Deeploy/Targets/Generic/Parsers.py

Repository: pulp-platform/Deeploy

Length of output: 3828


🏁 Script executed:

# Check MatMulLayer validation
sed -n '191,225p' Deeploy/Targets/Generic/Layers.py

Repository: pulp-platform/Deeploy

Length of output: 1374


🏁 Script executed:

# Search for batch variable computation in MatMul context
rg -n "batch" Deeploy/Targets/Generic/Parsers.py -B 3 -A 3 | grep -A 5 -B 5 "MatMul"

Repository: pulp-platform/Deeploy

Length of output: 47


🏁 Script executed:

# Check if there's any broadcast constraint validation in the parser
rg -n "broadcast.*MatMul\|MatMul.*broadcast" Deeploy/Targets -g '*.py'

Repository: pulp-platform/Deeploy

Length of output: 47


Guard against broadcasted A/B in batched MatMul.
The template unconditionally advances A and B pointers each batch iteration. The parser already computes A_batched and B_batched flags to indicate which inputs carry the full batch dimension, but the template ignores them. When either input is broadcast (batch dimension = 1), those unconditional increments will skip past valid data and corrupt results.

Conditionally increment only the inputs whose batched flag is true, similar to how FloatAddTemplate and other Snitch templates handle broadcasting.
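
A minimal sketch of that conditional advance in the Mako template, assuming the parser exposes A_batched and B_batched as boolean entries of the operator representation (flag names taken from this comment; the exact surrounding template context is an assumption):

% if A_batched:
    ref_${data_out}_${A} += ${M} * ${N};
% endif
% if B_batched:
    ref_${data_out}_${B} += ${N} * ${O};
% endif
    ref_${data_out}_${data_out} += ${M} * ${O};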

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/Templates/MatMulTemplate.py` around lines 42 - 56, The
loop always advances input pointers ref_${data_out}_${A} and
ref_${data_out}_${B} for each batch, which breaks broadcasting; update the
batched MatMul loop to only increment those pointers when the corresponding
A_batched or B_batched flag is true (leave ref_${data_out}_${data_out} always
incremented), i.e. after calling
MatMul_s${A_type.referencedType.typeWidth}_s${B_type.referencedType.typeWidth}_s${data_out_type.referencedType.typeWidth}(...)
wrap the pointer advances in conditional checks using the parser-provided
A_batched and B_batched flags so broadcasted inputs (batch dim == 1) are not
advanced.

Comment on lines +23 to +47
    def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel:

        inputBufferName = parseDict['data_in']
        outputBufferName = parseDict['data_out']

        pointer: List[str] = []

        for key, value in parseDict.items():
            if not isinstance(value, str):
                continue

            if ctxt.is_global(value) or ctxt.is_local(value):
                pointer.append(value)

        # Add I/O dimensions to the model as variables
        for bufferName in [inputBufferName, outputBufferName]:
            _buffer = ctxt.lookup(bufferName)
            tilerModel.addTensorDimToModel(ctxt, bufferName)

            for idx, shapeDim in enumerate(_buffer.shape):
                tilerModel.addConstraint(tilerModel.getTensorDimVar(tensorName = bufferName, dimIdx = idx) <= shapeDim)

        # Constrain total elements to be equal
        inputBuffer = ctxt.lookup(inputBufferName)
        outputBuffer = ctxt.lookup(outputBufferName)
⚠️ Potential issue | 🟡 Minor

Remove unused buffer lookups to avoid Ruff F841.

inputBuffer and outputBuffer are assigned but never used.

🧹 Suggested cleanup
-        inputBuffer = ctxt.lookup(inputBufferName)
-        outputBuffer = ctxt.lookup(outputBufferName)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     def addGeometricalConstraint(tilerModel: TilerModel, parseDict: Dict, ctxt: NetworkContext) -> TilerModel:
 
         inputBufferName = parseDict['data_in']
         outputBufferName = parseDict['data_out']
 
         pointer: List[str] = []
 
         for key, value in parseDict.items():
             if not isinstance(value, str):
                 continue
 
             if ctxt.is_global(value) or ctxt.is_local(value):
                 pointer.append(value)
 
         # Add I/O dimensions to the model as variables
         for bufferName in [inputBufferName, outputBufferName]:
             _buffer = ctxt.lookup(bufferName)
             tilerModel.addTensorDimToModel(ctxt, bufferName)
 
             for idx, shapeDim in enumerate(_buffer.shape):
                 tilerModel.addConstraint(tilerModel.getTensorDimVar(tensorName = bufferName, dimIdx = idx) <= shapeDim)
 
         # Constrain total elements to be equal
-        inputBuffer = ctxt.lookup(inputBufferName)
-        outputBuffer = ctxt.lookup(outputBufferName)
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 30-30: Loop control variable key not used within loop body

Rename unused key to _key

(B007)


[error] 46-46: Local variable inputBuffer is assigned to but never used

Remove assignment to unused variable inputBuffer

(F841)


[error] 47-47: Local variable outputBuffer is assigned to but never used

Remove assignment to unused variable outputBuffer

(F841)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py` around lines
23 - 47, In addGeometricalConstraint remove the two unused local variables by
deleting the assignments "inputBuffer = ctxt.lookup(inputBufferName)" and
"outputBuffer = ctxt.lookup(outputBufferName)" (or alternatively use them if
needed); the current ctxt.lookup calls create unused values causing Ruff
F841—ensure any required buffer lookups are either consumed (e.g., used in
subsequent logic) or removed from the function body.

Comment on lines +83 to +134
        for cube in outputCubes:
            # Calculate the flat offset and size for the output cube
            outSize = np.prod(cube.dims)
            replacements["size"].append(outSize)

            # For reshape, we need to map output cube to input cube
            # Calculate flat index range for output cube
            outOffset = 0
            outStrides = []
            stride = 1
            for dim in reversed(outputShape):
                outStrides.insert(0, stride)
                stride *= dim

            for i, (off, dim) in enumerate(zip(cube.offset, cube.dims)):
                outOffset += off * outStrides[i]

            # Convert flat offset to input coordinates
            inStrides = []
            stride = 1
            for dim in reversed(inputShape):
                inStrides.insert(0, stride)
                stride *= dim

            inOffset = []
            remaining = outOffset
            for i, stride in enumerate(inStrides):
                inOffset.append(remaining // stride)
                remaining = remaining % stride

            # Calculate input cube dimensions
            # For simplicity, treat as 1D cube in input space
            inCubeDims = list(inputShape)
            inCubeOffset = [0] * len(inputShape)

            # Set the last dimension to the size, and offset based on flat index
            totalSize = outSize
            if len(inputShape) > 0:
                # Compute proper input cube that covers the same elements
                # Use a simple approach: linearize the input
                inCubeOffset = list(inOffset)
                inCubeDims = [1] * len(inputShape)
                inCubeDims[-1] = min(totalSize, inputShape[-1] - inCubeOffset[-1])
                remaining = totalSize - inCubeDims[-1]

                for i in range(len(inputShape) - 2, -1, -1):
                    if remaining <= 0:
                        break
                    inCubeDims[i] = min(remaining // np.prod(inputShape[i + 1:]) + 1, inputShape[i])
                    remaining -= (inCubeDims[i] - 1) * np.prod(inputShape[i + 1:])

            inputCube = HyperRectangle(tuple(inCubeOffset), tuple(inCubeDims))
⚠️ Potential issue | 🟠 Major

Reshape tile → input cube mapping can drop or mis-map elements.

The current construction of inCubeOffset/inCubeDims assumes a contiguous flat segment always maps to a single axis-aligned input hyperrectangle. When an output tile crosses input row/plane boundaries, inCubeDims can under-cover outSize or include unrelated elements, leading to incorrect DMA loads. Consider either constraining tiling to contiguous last-dimension segments (or a flattened 1D representation) or splitting such tiles into multiple input cubes.
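
One way to do that splitting, sketched in plain Python (a standalone illustration only; the helper name is hypothetical, and turning the resulting (offset, dims) pairs into HyperRectangle instances is left to the existing serialization code):

def flat_range_to_input_pieces(inputShape, flatOffset, flatSize):
    """Split the flat, row-major element range [flatOffset, flatOffset + flatSize)
    into axis-aligned (offset, dims) pieces that never cross the last-dimension
    boundary, so every piece is a contiguous block of the input tensor."""
    strides = []
    s = 1
    for d in reversed(inputShape):
        strides.insert(0, s)
        s *= d
    pieces = []
    cursor, remaining = flatOffset, flatSize
    while remaining > 0:
        # Map the flat cursor back to multi-dimensional coordinates.
        coord, rest = [], cursor
        for st in strides:
            coord.append(rest // st)
            rest %= st
        # Longest contiguous run along the fastest dimension from this coordinate.
        run = min(remaining, inputShape[-1] - coord[-1])
        dims = [1] * len(inputShape)
        dims[-1] = run
        pieces.append((tuple(coord), tuple(dims)))
        cursor += run
        remaining -= run
    return pieces

# Example: a 7-element tile starting at flat offset 10 of a (2, 3, 4) input
# splits into three pieces that together cover exactly 7 elements.
assert flat_range_to_input_pieces((2, 3, 4), 10, 7) == [
    ((0, 2, 2), (1, 1, 2)),
    ((1, 0, 0), (1, 1, 4)),
    ((1, 1, 0), (1, 1, 1)),
]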

🧰 Tools
🪛 Ruff (0.14.14)

[warning] 97-97: Loop control variable dim not used within loop body

(B007)


[warning] 97-97: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


[warning] 109-109: Loop control variable i not used within loop body

(B007)

🤖 Prompt for AI Agents
In `@Deeploy/Targets/Snitch/TileConstraints/ReshapeTileConstraint.py` around lines
83 - 134, The mapping from output tiles to input HyperRectangle(s) in the loop
over outputCubes (using outputShape, inputShape, outOffset, outSize, inStrides,
inCubeOffset, inCubeDims and creating HyperRectangle) is incorrect for tiles
that cross input dimension boundaries; fix by detecting when the linear range
[outOffset, outOffset+outSize) spans multiple input strides and emit one or more
input HyperRectangle pieces that together cover exactly outSize elements instead
of assuming a single axis-aligned block. Concretely, replace the simplistic
inCubeDims/inCubeOffset construction with logic that: 1) computes the
remainingCount = outSize and a cursor = outOffset, 2) in a loop maps cursor to
multi-dimensional coords using inStrides to get the start coord, determines the
maximal contiguous run length along the last (fastest) dimension without
crossing that dimension limit (using inputShape and start coord), 3) create a
HyperRectangle for that run, subtract its size from remainingCount, advance
cursor by run length, and repeat until remainingCount == 0; ensure the created
HyperRectangle tuples (offset, dims) sum exactly to outSize and add all pieces
instead of a single inputCube.

Comment on lines +1 to +5
/*
* SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna
*
* SPDX-License-Identifier: Apache-2.0
*/
⚠️ Potential issue | 🟡 Minor

Copyright year appears to be incorrect.

As in the corresponding source file, the header is dated 2026; it should be 2024 or 2025.

🤖 Prompt for AI Agents
In `@TargetLibraries/Snitch/inc/kernel/HardSwish.h` around lines 1 - 5, Update the
SPDX header in HardSwish.h to correct the copyright year from 2026 to the proper
year (2024 or 2025); locate the top-of-file comment block in
TargetLibraries/Snitch/inc/kernel/HardSwish.h and change the year in the
SPDX-FileCopyrightText line so it matches the source file.

Comment on lines +10 to +49
 * Element-wise Division (FP32)
 *
 * Computes: output[i] = input1[i] / input2[i]
 *
 * Supports ONNX broadcasting rules:
 * - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
 * - If both have same size: element-wise division
 *
 * input1: Numerator tensor (float32)
 * input2: Denominator tensor (float32)
 * output: Output tensor (same shape as input1)
 * size: Total number of elements in input1
 *
 * multi-core = yes
 * parallelization = element-wise across input1
 */
void Div_fp32(float32_t *input1, float32_t *input2, float32_t *output,
              uint32_t size) {

  uint32_t core_id = snrt_global_compute_core_idx();
  uint32_t numThreads = snrt_global_compute_core_num();

  // Parallelize across elements
  uint32_t elements_per_core = size / numThreads;
  uint32_t remainder = size % numThreads;

  uint32_t start_elem, num_elems;
  if (core_id < remainder) {
    num_elems = elements_per_core + 1;
    start_elem = core_id * num_elems;
  } else {
    num_elems = elements_per_core;
    start_elem = core_id * elements_per_core + remainder;
  }

  // Check if input2 is a scalar (size=1, broadcasted)
  // Note: This assumes the parser has set input2_size correctly
  // For now, we assume element-wise division (same size)
  for (uint32_t i = start_elem; i < start_elem + num_elems; i++) {
    output[i] = input1[i] / input2[i];
⚠️ Potential issue | 🟡 Minor

Align Div_fp32 docs with the actual scalar handling.
The function always uses input2[i], so the broadcast claim is misleading and could mask out-of-bounds access if a scalar is ever routed here. Consider documenting that Div_fp32_scalar is required for scalar inputs.

💡 Suggested doc fix
- * Supports ONNX broadcasting rules:
- * - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
- * - If both have same size: element-wise division
+ * Expects input1 and input2 to have identical sizes (element-wise division).
+ * Use Div_fp32_scalar for scalar broadcasting.
@@
-  // Check if input2 is a scalar (size=1, broadcasted)
-  // Note: This assumes the parser has set input2_size correctly
-  // For now, we assume element-wise division (same size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

  * Element-wise Division (FP32)
  *
  * Computes: output[i] = input1[i] / input2[i]
  *
- * Supports ONNX broadcasting rules:
- * - If input2 is scalar (size=1): divides all elements of input1 by input2[0]
- * - If both have same size: element-wise division
+ * Expects input1 and input2 to have identical sizes (element-wise division).
+ * Use Div_fp32_scalar for scalar broadcasting.
  *
  * input1: Numerator tensor (float32)
  * input2: Denominator tensor (float32)
  * output: Output tensor (same shape as input1)
  * size: Total number of elements in input1
  *
  * multi-core = yes
  * parallelization = element-wise across input1
  */
 void Div_fp32(float32_t *input1, float32_t *input2, float32_t *output,
               uint32_t size) {
 
   uint32_t core_id = snrt_global_compute_core_idx();
   uint32_t numThreads = snrt_global_compute_core_num();
 
   // Parallelize across elements
   uint32_t elements_per_core = size / numThreads;
   uint32_t remainder = size % numThreads;
 
   uint32_t start_elem, num_elems;
   if (core_id < remainder) {
     num_elems = elements_per_core + 1;
     start_elem = core_id * num_elems;
   } else {
     num_elems = elements_per_core;
     start_elem = core_id * elements_per_core + remainder;
   }
 
-  // Check if input2 is a scalar (size=1, broadcasted)
-  // Note: This assumes the parser has set input2_size correctly
-  // For now, we assume element-wise division (same size)
   for (uint32_t i = start_elem; i < start_elem + num_elems; i++) {
     output[i] = input1[i] / input2[i];
🤖 Prompt for AI Agents
In `@TargetLibraries/Snitch/src/Div_fp32.c` around lines 10 - 49, The function
Div_fp32 currently performs element-wise division using input2[i] and does not
handle scalar broadcasting; update the top-of-file comment for Div_fp32 to
remove the "If input2 is scalar (size=1): divides all elements..." claim and
clearly state that Div_fp32 assumes input1 and input2 have the same size
(element-wise) and that scalar denominators must call Div_fp32_scalar (or
another scalar-specific routine); additionally, add a short runtime assert/check
in Div_fp32 (referencing the variables size, input2, and function name Div_fp32)
or document that input2_size must equal size to prevent out-of-bounds access if
scalar inputs are routed incorrectly.
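
A sketch of the kind of guard the prompt describes, with the scalar path inlined (the wrapper name and the explicit input2_size parameter are illustrative and not the existing kernel API; the snrt_* helpers mirror the ones already used in Div_fp32):

// Dispatch on the denominator's element count so a scalar denominator never
// reaches the element-wise loop and reads out of bounds.
static inline void Div_fp32_checked(float32_t *input1, float32_t *input2,
                                    float32_t *output, uint32_t size,
                                    uint32_t input2_size) {
  if (input2_size == 1) {
    // Scalar broadcast: divide every element of input1 by input2[0].
    uint32_t core_id = snrt_global_compute_core_idx();
    uint32_t numThreads = snrt_global_compute_core_num();
    float32_t denom = input2[0];
    for (uint32_t i = core_id; i < size; i += numThreads) {
      output[i] = input1[i] / denom;
    }
  } else {
    // Same-size inputs: the existing element-wise kernel is safe to call.
    Div_fp32(input1, input2, output, size);
  }
}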

Comment on lines +1 to +5
/*
* SPDX-FileCopyrightText: 2026 ETH Zurich and University of Bologna
*
* SPDX-License-Identifier: Apache-2.0
*/
⚠️ Potential issue | 🟡 Minor

Copyright year appears to be incorrect.

The SPDX header has year 2026, which is in the future.

🤖 Prompt for AI Agents
In `@TargetLibraries/Snitch/src/Mul_fp32.c` around lines 1 - 5, The SPDX header in
the file Mul_fp32.c has an incorrect future year (2026); update the copyright
year(s) in the top comment block (the SPDX header lines and the
SPDX-FileCopyrightText entry) to the correct current year or an appropriate
range (e.g., 2024 or "2023-2024") so the SPDX-FileCopyrightText and
SPDX-License-Identifier comment reflects accurate dates.
