
@MarijnS95
Member

@MarijnS95 MarijnS95 commented May 21, 2025

https://github.com/ispc/ispc/releases/tag/v1.29.1

TODO: Still need to compare performance, but perhaps this helps on newer architectures. Might also have to evaluate if we're simply missing some TargetISA flags relevant for newer SoCs?

@MarijnS95 MarijnS95 requested a review from Jasper-Bekkers May 21, 2025 08:36
@MarijnS95
Member Author

Turns out there are a bunch of new generic target ISAs to streamline which vector sizes/widths to select, as well as Apple-specific CPU targets :)

@MarijnS95
Member Author

MarijnS95 commented May 26, 2025

On the MacBook Air M4

Main @ 6e7b616 (ISPC 1.20...)

Downsample `square_test.png` using ispc_downsampler
                        time:   [38.827 ms 38.848 ms 38.884 ms]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

This ispc-1.27 PR @ 3556673

Downsample `square_test.png` using ispc_downsampler
                        time:   [48.220 ms 48.253 ms 48.287 ms]
                        change: [+24.103% +24.190% +24.278%] (p = 0.00 < 0.05)
                        Performance has regressed.

Recompiling locally on the M4 Air (ispc 1.27.0 from brew using cargo b -rF ispc):

Downsample `square_test.png` using ispc_downsampler
                        time:   [46.576 ms 46.586 ms 46.596 ms]
                        change: [-3.5237% -3.4550% -3.3855%] (p = 0.00 < 0.05)
                        Performance has improved.

That's a significant performance deficit, which we should investigate before merging. Even playing around with the new CPU flags from Twinklebear/ispc-rs#42, or the generic ISAs, or removing .target_isas() altogether to compile natively for the host yields no improvement.

Funny thing is, with the ISPC test this M4 Air whines a little, but it doesn't during resize 😓

@MarijnS95 MarijnS95 marked this pull request as draft May 26, 2025 10:31
@MarijnS95 MarijnS95 changed the title from "Regenerate binaries on ISPC 1.27" to "Regenerate binaries on ISPC 1.28" Aug 13, 2025
@MarijnS95
Member Author

MarijnS95 commented Aug 13, 2025

Looks like performance is not restored in 1.28, or we're still doing something wrong. Barely any change compared against 1.27 (which was 24% slower than main per the above):

This PR @ 00d0256

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [48.412 ms 48.475 ms 48.539 ms]
                        change: [+29.098% +29.471% +29.821%] (p = 0.00 < 0.05)
                        Performance has regressed.

@MarijnS95 MarijnS95 changed the title from "Regenerate binaries on ISPC 1.28" to "Regenerate binaries on ISPC 1.29.1" Dec 24, 2025
@MarijnS95
Member Author

MarijnS95 commented Dec 24, 2025

Re-running this test on my host, recompiled on this ISPC version:

ispc --version
Intel(r) Implicit SPMD Program Compiler (Intel(r) ISPC), 1.28.2 (build commit  @ 20250924, LLVM 20.1.8)

On latest main @ f2ddfab (but not using those prebuilts)

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [46.776 ms 46.875 ms 46.969 ms]

Then, following the suggestion from @Jasper-Bekkers in Traverse-Research/intel-tex-rs-2#42 to use only i32x4 (because NEON registers are 128 bits wide) slightly regresses performance:

cargo bench
Downsample `square_test.png` using ispc_downsampler
                        time:   [48.003 ms 48.101 ms 48.196 ms]
                        change: [+2.3395% +2.6161% +2.9034%] (p = 0.00 < 0.05)
                        Performance has regressed.
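
For reference, the reasoning behind the i32x4 suggestion is plain lane arithmetic: a 128-bit register holds four 32-bit lanes, so a wider gang has to span more than one register. The sketch below only does that arithmetic; the function names are made up for illustration, and it says nothing about how ISPC actually lowers wider targets on NEON.

```rust
/// Number of `element_bits`-wide lanes that fit in one SIMD register.
fn native_lanes(register_bits: u32, element_bits: u32) -> u32 {
    register_bits / element_bits
}

/// Registers needed to hold one gang of `gang_width` lanes
/// (rounded up; pure arithmetic, not a statement about ISPC codegen).
fn registers_per_gang(gang_width: u32, register_bits: u32, element_bits: u32) -> u32 {
    let lanes = native_lanes(register_bits, element_bits);
    (gang_width + lanes - 1) / lanes
}

fn main() {
    // NEON: 128-bit registers, 32-bit lanes.
    println!("native lanes: {}", native_lanes(128, 32));
    println!("registers for x4 gang: {}", registers_per_gang(4, 128, 32));
    println!("registers for x8 gang: {}", registers_per_gang(8, 128, 32));
}
```

That a gang matching the register width did not help here suggests the regression is not simply a gang-width mismatch.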

Also, this M4 chip is supposed to have SME (Scalable Matrix Extension) but not SVE (Scalable Vector Extension), which I confirmed with sysctl -a hw.optional (NEON is confirmed as well).
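
A minimal sketch of how that sysctl check could be scripted, assuming the output consists of `key: value` lines with `0`/`1` values. The key names used below (`hw.optional.neon`, `hw.optional.arm.FEAT_SME`, `hw.optional.arm.FEAT_SVE`) are assumptions and should be checked against the real output on the machine in question; a missing key is treated as "not supported".

```rust
use std::collections::HashMap;
use std::process::Command;

/// Parse `key: value` lines (as printed by `sysctl`) into a map of flags.
/// Lines without a colon or with a non-numeric value are skipped.
fn parse_sysctl_flags(output: &str) -> HashMap<String, bool> {
    output
        .lines()
        .filter_map(|line| {
            let (key, value) = line.split_once(':')?;
            let enabled = value.trim().parse::<u64>().ok()? != 0;
            Some((key.trim().to_string(), enabled))
        })
        .collect()
}

/// Run `sysctl -a hw.optional`; returns None if the command is missing or fails
/// (for example on a non-macOS host).
fn read_hw_optional() -> Option<String> {
    let out = Command::new("sysctl").args(["-a", "hw.optional"]).output().ok()?;
    out.status
        .success()
        .then(|| String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() {
    let Some(text) = read_hw_optional() else {
        eprintln!("sysctl hw.optional is not available on this host");
        return;
    };
    let flags = parse_sysctl_flags(&text);
    // Assumed key names; adjust to whatever the host actually reports.
    for key in [
        "hw.optional.neon",
        "hw.optional.arm.FEAT_SME",
        "hw.optional.arm.FEAT_SVE",
    ] {
        let supported = flags.get(key).copied().unwrap_or(false);
        println!("{key}: {supported}");
    }
}
```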

Perhaps this needs to be reported upstream, as I'm slightly out of ideas on how best to bisect this compiler performance regression.

@MarijnS95
Member Author

Just went back in history to generate the blobs for all missing versions:

ISPC 1.23 @ 754d4bf

Downsample `square_test.png` using ispc_downsampler
                        time:   [37.430 ms 37.550 ms 37.666 ms]

ISPC 1.24 @

Downsample `square_test.png` using ispc_downsampler
                        time:   [37.180 ms 37.317 ms 37.454 ms]
                        change: [-1.1514% -0.6188% -0.1473%] (p = 0.01 < 0.05)
                        Change within noise threshold.

ISPC 1.25.3

Downsample `square_test.png` using ispc_downsampler
                        time:   [38.024 ms 38.151 ms 38.315 ms]
                        change: [+1.6876% +2.2352% +2.8037%] (p = 0.00 < 0.05)
                        Performance has regressed.

ISPC 1.26

Downsample `square_test.png` using ispc_downsampler
                        time:   [49.251 ms 49.422 ms 49.588 ms]
                        change: [+29.523% +30.093% +30.690%] (p = 0.00 < 0.05)
                        Performance has regressed.

1.26 is where this regression happened.
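
Since criterion's "change" line compares against the previously stored run rather than a fixed baseline, the per-version percentages above do not directly give the total slowdown. A small sketch that takes the median times reported above and finds the first version exceeding a fixed baseline (ISPC 1.23) by more than a threshold; the 10% threshold and the helper names are arbitrary choices for illustration.

```rust
/// Median bench times in ms, copied from the criterion output above.
const MEDIANS_MS: [(&str, f64); 4] = [
    ("1.23", 37.550),
    ("1.24", 37.317),
    ("1.25.3", 38.151),
    ("1.26", 49.422),
];

/// Relative change of `new` versus `old`, in percent.
fn percent_change(old: f64, new: f64) -> f64 {
    (new - old) / old * 100.0
}

/// First version whose median is more than `threshold_pct` slower than the
/// first entry (the baseline), together with its slowdown in percent.
fn first_regression<'a>(
    runs: &[(&'a str, f64)],
    threshold_pct: f64,
) -> Option<(&'a str, f64)> {
    let (_, baseline) = *runs.first()?;
    runs.iter()
        .skip(1)
        .map(|&(version, ms)| (version, percent_change(baseline, ms)))
        .find(|&(_, pct)| pct > threshold_pct)
}

fn main() {
    match first_regression(&MEDIANS_MS, 10.0) {
        Some((version, pct)) => {
            println!("first regression: ISPC {version} ({pct:+.1}% vs ISPC 1.23)")
        }
        None => println!("no regression above threshold"),
    }
}
```

Measured against 1.23 rather than against the preceding run, the 1.26 median works out to roughly a 32% slowdown.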

Turns out the 1.26 release is exactly where a bunch of Apple improvements were announced. Unfortunately, playing with the new --darwin-version-min flag, or with the new CPU targets mentioned above (which are only available up to A17, the "predecessor" to M4 in the iPhone space), doesn't make a difference. I couldn't immediately find whether those iPhone SKUs support vector extensions at all..?

@Jasper-Bekkers
Member

Jasper-Bekkers commented Dec 24, 2025

Yeah, I closed that PR because I later realized why there was a big delta: I was profiling on battery.

@MarijnS95
Member Author

@Jasper-Bekkers Oh, I'm also exclusively developing on battery (the perks of Apple putting RTGs in these MacBooks 🤤), but the ~37 ms vs ~45 ms regression remains consistent.
