CUDA boost #40
neocoretechs started this conversation in Show and tell
Replies: 1 comment 1 reply
Hello @neocoretechs, did you try https://github.com/beehive-lab/GPULlama3.java?
I am finishing up custom CUDA kernels and device helpers to boost model runner performance. I am using FFI to call pinpointed crosscuts that move the critical data up to the device, leave it there, process it, then move it back down with minimal traffic. Device helpers: GGUF dequant, sdot. Kernels: matmul, softmax, rmsnorm. I am adding a USE_CUDA flag to turn the features on or off. Will provide a DLL and a .so for windoze and aarch64. Testbed: windoze 11 / Nvidia A2000, CUDA 13; Jetson Orin Nano, CUDA 13, Ubuntu.
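For anyone curious how the USE_CUDA flag and the FFI crosscuts could fit together, here is a minimal sketch using the Java FFM API. It is not the actual implementation: the library name (`llamacuda`), the exported symbol (`cuda_matmul`) and its signature, and the system property driving `USE_CUDA` are all assumptions for illustration. It shows only the downcall plumbing and the CPU fallback, not the device-resident buffer management described above; the real helpers (GGUF dequant, sdot) and kernels (softmax, rmsnorm) would be bound the same way.

```java
// Hypothetical sketch, not the author's code. Assumes a native library "llamacuda"
// (libllamacuda.so on Linux/aarch64, llamacuda.dll on Windows) that exports
// void cuda_matmul(float* out, float* a, float* b, int n, int d).
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public final class CudaBridge {
    // USE_CUDA toggle as described in the post; wiring it to a system property is an assumption.
    static final boolean USE_CUDA = Boolean.getBoolean("llama.use_cuda");

    private static final MethodHandle MATMUL;

    static {
        MethodHandle mh = null;
        if (USE_CUDA) {
            // Bind the exported CUDA kernel entry point through the FFM linker.
            Linker linker = Linker.nativeLinker();
            SymbolLookup lib = SymbolLookup.libraryLookup(
                    System.mapLibraryName("llamacuda"), Arena.global());
            mh = linker.downcallHandle(
                    lib.find("cuda_matmul").orElseThrow(),
                    FunctionDescriptor.ofVoid(
                            ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.ADDRESS,
                            ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));
        }
        MATMUL = mh;
    }

    /** Routes matmul to the CUDA kernel when USE_CUDA is set, otherwise runs a plain Java loop. */
    static void matmul(MemorySegment out, MemorySegment a, MemorySegment b, int n, int d) {
        if (USE_CUDA) {
            try {
                MATMUL.invokeExact(out, a, b, n, d);
                return;
            } catch (Throwable t) {
                throw new RuntimeException("cuda_matmul failed", t);
            }
        }
        // CPU fallback: out[i] = dot(a, row i of b), purely illustrative.
        for (int i = 0; i < d; i++) {
            float acc = 0f;
            for (int j = 0; j < n; j++) {
                acc += a.getAtIndex(ValueLayout.JAVA_FLOAT, j)
                     * b.getAtIndex(ValueLayout.JAVA_FLOAT, (long) i * n + j);
            }
            out.setAtIndex(ValueLayout.JAVA_FLOAT, i, acc);
        }
    }
}
```

In the design described in the post, the critical buffers stay device-resident between kernel calls; the sketch passes host segments per call only to keep the example short.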