BarraCUDA

An open-source CUDA compiler, from scratch, in C99

BarraCUDA started because I wanted to know whether I could take a .cu file and turn it into GPU machine code without asking anyone's permission. It turns out you can, though the AMD ISA reference manual is 2,400 pages and not all of them agree with each other.

BarraCUDA is an open-source CUDA compiler that takes the same .cu files you would feed to nvcc and compiles them to AMD GPU machine code, NVIDIA PTX, and Tenstorrent Tensix Metalium C++. It is written from scratch in C99 with zero dependencies. It does its own lexing, its own parsing, its own instruction selection, and its own binary encoding. You type make and it builds with gcc.

The entire pipeline is built from first principles: a preprocessor that handles the full C macro language, a recursive descent parser, semantic analysis with type checking and scope resolution, an SSA intermediate representation, a register allocator that understands GPU divergence, and three separate backend code generators that produce real binaries for real hardware. BarraCUDA-compiled kernels have been tested on AMD MI300X, NVIDIA RTX 4060 Ti, and AMD RDNA 3 silicon. They produce correct results.
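For readers who have not met the term, here is a rough illustration of what "recursive descent" means: one function per grammar rule, each calling the rules below it. This is a toy sketch and not BarraCUDA's parser. It handles only integer arithmetic, evaluates as it goes instead of building a syntax tree, and uses assert in place of real diagnostics. Every name in it (parser, parse_expr, eval and so on) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>

/* Cursor into the source text being parsed. */
typedef struct { const char *p; } parser;

static void skip_ws(parser *ps)
{
    while (isspace((unsigned char)*ps->p)) ps->p++;
}

static long parse_expr(parser *ps);

/* primary := NUMBER | '(' expr ')' */
static long parse_primary(parser *ps)
{
    skip_ws(ps);
    if (*ps->p == '(') {
        ps->p++;
        long inner = parse_expr(ps);
        skip_ws(ps);
        assert(*ps->p == ')');          /* toy error handling only */
        ps->p++;
        return inner;
    }
    assert(isdigit((unsigned char)*ps->p));
    long v = 0;
    while (isdigit((unsigned char)*ps->p)) v = v * 10 + (*ps->p++ - '0');
    return v;
}

/* term := primary ('*' primary)* */
static long parse_term(parser *ps)
{
    long v = parse_primary(ps);
    for (;;) {
        skip_ws(ps);
        if (*ps->p != '*') return v;
        ps->p++;
        v *= parse_primary(ps);
    }
}

/* expr := term (('+' | '-') term)* */
static long parse_expr(parser *ps)
{
    long v = parse_term(ps);
    for (;;) {
        skip_ws(ps);
        char op = *ps->p;
        if (op != '+' && op != '-') return v;
        ps->p++;
        long rhs = parse_term(ps);
        v = (op == '+') ? v + rhs : v - rhs;
    }
}

/* Convenience wrapper: parse and evaluate a whole string. */
static long eval(const char *src)
{
    parser ps = { src };
    return parse_expr(&ps);
}
```

Operator precedence falls out of the call structure: parse_expr calls parse_term, so multiplication binds tighter than addition without any precedence table.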

How It Got Here

I learned C from Low Level's Zero to Hero course on YouTube. That is not the typical origin story for someone who ends up writing a GPU compiler, but I wanted to build something real, something that took source code in one end and produced actual machine code out the other. A compiler that targeted GPUs seemed like the hardest thing I could attempt, which made it the most interesting.

The first version targeted AMD RDNA 3 and was released in February 2026. RDNA 2 and RDNA 4 followed within days. Then CDNA 2 and CDNA 3 for the data centre GPUs. Then the NVIDIA PTX backend, which compiles CUDA to NVIDIA's own intermediate format without involving any NVIDIA toolchain whatsoever, loaded via the CUDA Driver API and JIT-compiled by the NVIDIA driver itself. Then the Tenstorrent Tensix backend. The whole thing took about five weeks from first commit to three working backends.

The Register Allocator

The part I am most proud of is the register allocator, because it was the part where I learned the most and had the most help.

Fernando Magno Quintão Pereira, who runs the Compilers Lab at UFMG (Universidade Federal de Minas Gerais in Brazil), reached out after seeing the project and pointed me to the divergence analysis papers by Sampaio, Souza, Collange, and Pereira from 2013. GPUs are not CPUs. When threads in a warp (a wavefront, in AMD's terminology) take different paths through a branch, the values in their registers "diverge," which means a single register might hold 64 different values at once, one for each of the 64 threads in an AMD wavefront. If you need to spill that register to memory, you have to save all 64 values. But if every thread holds the same value, you can spill it with a single lane read, which is 64 times cheaper.
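The 64-to-1 asymmetry can be made concrete with a small simulation. This is a sketch for illustration only: a real compiler has to prove uniformity statically, at compile time, while this code simply inspects the lane values at run time. WAVE_SIZE, is_uniform and spill_words are hypothetical names, and the 64-lane width is taken from the example above.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 64   /* lanes per wavefront, as in the 64-thread example */

/* Returns 1 if every active lane holds the same value. */
static int is_uniform(const uint32_t lanes[WAVE_SIZE], uint64_t active_mask)
{
    int have_first = 0;
    uint32_t first = 0;
    for (int i = 0; i < WAVE_SIZE; i++) {
        if (!((active_mask >> i) & 1u)) continue;   /* skip inactive lanes */
        if (!have_first) {
            first = lanes[i];
            have_first = 1;
        } else if (lanes[i] != first) {
            return 0;
        }
    }
    return 1;
}

/* Number of 32-bit words a spill of this register must store:
 * one word if the value is uniform, one per lane otherwise. */
static int spill_words(const uint32_t lanes[WAVE_SIZE], uint64_t active_mask)
{
    assert(active_mask != 0);   /* at least one lane must be active */
    return is_uniform(lanes, active_mask) ? 1 : WAVE_SIZE;
}
```

A register holding a kernel argument would cost one word to spill here, while one holding a per-thread index would cost all 64.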

The SSA register allocator exploits this asymmetry. It analyses which values are uniform across the warp and which have diverged, and makes spill decisions accordingly. On a 654-line Monte Carlo neutron transport kernel, it eliminated all 186 register spills, dropped scratch memory traffic by 78%, and reduced total instructions by 28%. The allocator is about 1,300 lines of C99, all static memory, no malloc. None of the ideas are mine. The researchers did the hard thinking decades ago and wrote it down in papers that anyone can read, which is one of the quieter miracles of academia.
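A minimal sketch of the "which values are uniform" question, assuming a deliberately simplified model: a value is divergent if it is the thread ID or if any of its operands is divergent. The published analysis also tracks divergence introduced by control flow, which this sketch ignores. The ssa_value layout and the mark_divergent function are hypothetical and are not taken from BarraCUDA's source.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_VALUES 64      /* fixed capacity, no heap allocation */
#define NO_OPERAND (-1)

typedef struct {
    int  op0, op1;         /* indices of operand values, or NO_OPERAND */
    bool is_thread_id;     /* seed: differs per lane by definition */
} ssa_value;

/* Fills divergent[i] for each of the n values by iterating to a fixed point:
 * start from the thread-ID seeds and keep propagating along data
 * dependences until nothing changes. */
static void mark_divergent(const ssa_value *vals, int n, bool *divergent)
{
    assert(n >= 0 && n <= MAX_VALUES);
    for (int i = 0; i < n; i++) divergent[i] = vals[i].is_thread_id;

    bool changed = true;
    while (changed) {
        changed = false;
        for (int i = 0; i < n; i++) {
            if (divergent[i]) continue;
            int a = vals[i].op0, b = vals[i].op1;
            if ((a != NO_OPERAND && divergent[a]) ||
                (b != NO_OPERAND && divergent[b])) {
                divergent[i] = true;
                changed = true;
            }
        }
    }
}
```

Anything left unmarked at the end is uniform under this model, and so is a candidate for the cheap single-lane spill.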

Testing It with Nuclear Physics

Some might say that testing your compiler by simulating nuclear reactors is a touch unorthodox. Well, I have always been about the unorthodox here.

Moa is a Monte Carlo neutron transport code I wrote in C99. It tracks individual neutrons through constructive solid geometry, simulating their interactions with atomic nuclei using evaluated nuclear data from the ENDF/B-VII.1 library. Fission sites from each generation become the source for the next, and the eigenvalue converges over hundreds of batches through power iteration. It is validated against three international criticality safety benchmarks: Godiva, Jezebel, and Flattop. The computed results fall within statistical uncertainty of published reference values. The physics is correct.
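The eigenvalue bookkeeping behind power iteration can be sketched in a few lines. This assumes the simplest possible estimator, fission neutrons produced divided by source neutrons started, and assumes that some early batches are discarded while the fission source settles. Neither choice is confirmed by the text as what Moa does, and the names k_generation, k_mean and n_inactive are hypothetical.

```c
#include <assert.h>

/* Single-generation estimate of k: neutrons produced by fission
 * divided by the neutrons that started the generation. */
static double k_generation(long source_neutrons, long fission_neutrons)
{
    assert(source_neutrons > 0 && fission_neutrons >= 0);
    return (double)fission_neutrons / (double)source_neutrons;
}

/* Mean of the per-batch estimates, skipping the first n_inactive
 * batches (assumed here to be unconverged). */
static double k_mean(const double *k_batch, int n_batches, int n_inactive)
{
    assert(n_inactive >= 0 && n_inactive < n_batches);
    double sum = 0.0;
    for (int i = n_inactive; i < n_batches; i++) sum += k_batch[i];
    return sum / (double)(n_batches - n_inactive);
}
```

A value of k near 1 means each generation roughly replaces itself, which is the critical condition the benchmark assemblies are built around.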

The GPU kernel in Moa is compiled by BarraCUDA. On an NVIDIA RTX 4060 Ti running the Flattop benchmark, which involves tracking neutrons through an 18-centimetre uranium reflector where each particle undergoes about 200 collisions before escaping, the GPU processes 91,000 particles per second. (If you read that and went "I have no idea what this means," do not worry, neither do I half the time and I wrote the damn thing.) A single CPU core manages roughly 320. That is a 39x speedup, achieved by a compiler that is five weeks old, performing no optimisation beyond register allocation, running a nuclear physics simulation that produces the correct k-eigenvalue to three decimal places.

The reason this works so well on a GPU is that neutron transport is one of the few problems in physics where every particle genuinely does not care what every other particle is doing. Each neutron lives and dies on its own terms, which makes it a perfect fit for hardware that runs thousands of threads simultaneously, each one blissfully unaware of the others.
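The independence can be shown with a toy history loop in which each particle carries its own random number stream, so the outcome for particle i is the same whatever order the particles are run in. This is a sketch under stated assumptions and not Moa's code: the physics is reduced to a fixed absorption probability per collision, the generator is the well-known splitmix64, and seeding each stream directly from the particle index is a simplification. run_history and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64 step: advances the state and returns 64 mixed bits. */
static uint64_t next_u64(uint64_t *state)
{
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Uniform double in [0, 1), built from the top 53 bits. */
static double next_unit(uint64_t *state)
{
    return (double)(next_u64(state) >> 11) * (1.0 / 9007199254740992.0);
}

/* Follows one particle until it is absorbed or hits max_collisions.
 * Returns the number of collisions it underwent. No state is shared
 * with any other particle. */
static int run_history(uint64_t particle_id, double p_absorb, int max_collisions)
{
    assert(p_absorb > 0.0 && p_absorb <= 1.0);
    uint64_t rng = particle_id;   /* private stream for this particle */
    int collisions = 0;
    while (collisions < max_collisions) {
        collisions++;
        if (next_unit(&rng) < p_absorb) break;   /* absorbed: history ends */
    }
    return collisions;
}
```

Because run_history touches nothing outside its own arguments and locals, the same body maps naturally onto one GPU thread per particle.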

Moa is named after the bird, which stood three and a half metres tall and has been extinct for about six hundred years. The project logo is a moa inside atomic orbitals, which is the kind of thing that happens when you name a nuclear physics code after a flightless bird from New Zealand and then need to design an icon for it.

Error Messages in Your Language

Every diagnostic in BarraCUDA has a language-neutral error code. The actual error text is loaded from external translation files at runtime, which means you can get your compiler errors in any language someone has bothered to translate them into. The first non-English language was te reo Māori, not because it was strategic but because I live in Aotearoa New Zealand and these are my neighbours.
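The idea of a language-neutral code with swappable text can be sketched as a table lookup. Everything specific here is an assumption made for illustration: the "E001" style codes, the msg_entry layout, and the fallback order (translated text, then English, then the bare code) are hypothetical, and the real translation file format is not shown in this text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of a message table: error code and its text in one language. */
typedef struct {
    const char *code;
    const char *text;
} msg_entry;

/* Linear search for a code; returns NULL if the table has no entry. */
static const char *lookup(const msg_entry *table, int n, const char *code)
{
    for (int i = 0; i < n; i++)
        if (strcmp(table[i].code, code) == 0) return table[i].text;
    return NULL;
}

/* Picks the text to print: the translation if there is one,
 * otherwise English, otherwise the code itself. */
static const char *diag_text(const msg_entry *translated, int n_translated,
                             const msg_entry *english, int n_english,
                             const char *code)
{
    assert(code != NULL);
    const char *t = lookup(translated, n_translated, code);
    if (t) return t;
    t = lookup(english, n_english, code);
    return t ? t : code;
}
```

With this shape a partial translation is still useful, since untranslated codes fall through to English instead of failing.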

There are roughly 7,000 languages spoken on Earth and about 40% of them are endangered. When every error message and every compiler diagnostic is English-only, it sends a quiet message: this is not for you. Indigenous and endangered languages are especially welcome. Te reo Māori, Welsh, Hawaiian, Navajo, Scots Gaelic, Samoan, any of the hundreds of languages that technology has quietly decided do not matter. If you want to see your language in a compiler diagnostic, this is that project. The translation format is dead simple and needs zero compiler knowledge, which also makes it a rather nice entry point into compiler development if you have ever been curious but did not know where to start.

The People

Fernando Pereira and the Compilers Lab at UFMG, whose guidance made the register allocator possible. Steven Muchnick, whose book Advanced Compiler Design and Implementation is the reason this compiler does anything right. Cooper, Harvey, and Kennedy for dominators. Braun and Hack for SSA spilling. The academic community whose papers I read and whose ideas I turned into C, because I am just a hobbyist who reads papers and writes code, and the actual hard work was done by the researchers.

Abe Kornelis, for being an amazing teacher. His work on z390 is well worth your time. Low Level, for the C course and the YouTube channel. That is where I learned the language this compiler is written in. And to the people who have sent messages of kindness and critique, thank you from a forever student and a happy hobbyist.

He aha te mea nui o te ao? He tāngata, he tāngata, he tāngata.

What is the most important thing in the world? It is people, it is people, it is people.

The source is on GitHub. Apache 2.0. Issues and pull requests in any language are welcome; just include an English translation alongside.

Pen sketch of Rangitoto Island from across the harbour

Get in Touch

zanehambly@gmail.com

GitHub

Based in New Zealand. GMT+12. Replies arrive from the future.