The breakthrough came while debugging a sorting algorithm at 2 AM. Not my sorting algorithm - a transformer’s.
I’d been training models to sort short sequences, something any first-year computer science student could code in their sleep. The transformer learned it perfectly. 99.96% accuracy. But I had no idea how it was doing it. The attention patterns looked like abstract art. Beautiful, incomprehensible, utterly opaque.
That’s when I stumbled onto RASP - the Restricted Access Sequence Processing Language. A programming language that speaks transformer. Not metaphorically. Literally. Programs written in RASP compile directly to transformer architectures with exact layer counts and attention head specifications.
For the first time, I could write the sorting algorithm the way a transformer would actually implement it. Two lines of code that mapped perfectly to the attention patterns I’d been staring at for months.
position = selector_width(select(tokens, tokens, <))            # how many tokens sort before mine
sorted_tokens = aggregate(select(indices, position, ==), tokens)  # gather each token into its target slot
Suddenly, the abstract art made sense. Each line corresponded to a layer. Each comparison to an attention head. The transformer wasn’t doing magic - it was executing a very specific, very logical program.
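To convince myself the two steps really are that simple, I sketched them in ordinary Python - a toy rendering of the same idea (count how many tokens sort before yours, then gather), with an explicit index tie-break the two-line version glosses over. This is my illustration, not anything the RASP compiler emits:

# Toy Python sketch of the two RASP steps above (illustration only)
def toy_sort(tokens):
    n = len(tokens)
    # Step 1: each position counts how many tokens sort before its own
    # (ties broken by index so duplicates get distinct target slots)
    position = [sum(1 for j in range(n) if (tokens[j], j) < (tokens[i], i))
                for i in range(n)]
    # Step 2: gather each token into its target slot
    result = [None] * n
    for i in range(n):
        result[position[i]] = tokens[i]
    return result

print(toy_sort(list("dbca")))  # ['a', 'b', 'c', 'd']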
Here’s what nobody tells you about transformers: we built these incredibly sophisticated machines without understanding their programming model. RNNs had clear sequential processing and obvious parallels to state machines. Transformers? Everything happens in parallel, through attention mechanisms we barely comprehend.
We became masters at training them without understanding what they actually computed. Like discovering fire without understanding combustion. Useful, but missing something fundamental.
RASP changes that. It reveals transformers implement a very specific computational model - one with clear constraints and capabilities. Transformers do exactly three things: transform sequences element-wise, select positions to attend to based on pairwise relationships, and aggregate information from selected positions. That’s it. Everything else is composition.
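If that sounds too tidy, here is roughly what those three primitives look like when I mock them up in Python. This is a toy semantics for building intuition - my own names and simplifications, not the real RASP interpreter:

# Toy Python semantics for the three primitives (illustration only)
def elementwise(f, seq):
    # transform each position independently (roughly what feed-forward layers do)
    return [f(x) for x in seq]

def select(keys, queries, predicate):
    # pairwise relationships: one boolean row per query position
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    # average the values at the selected positions (roughly what an attention head outputs)
    return [sum(v for v, sel in zip(values, row) if sel) / max(1, sum(row))
            for row in selector]

print(aggregate(select([1, 2, 2], [1, 2, 2], lambda k, q: k == q), [10.0, 20.0, 30.0]))
# [10.0, 25.0, 25.0] - each position averages the values at matching positions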
Let me show you how profound this is with a simple example: computing histograms.
Traditional programming would use hash maps: iterate through the sequence, maintain counters. But transformers have no loops and no mutable state. So how do they count token frequencies?
The RASP solution is elegant:
same_token = select(tokens, tokens, ==)   # attend to every position holding the same token
hist = selector_width(same_token)         # count how many positions each row attends to
For the sequence “hello”, the first line creates a 5×5 boolean matrix where position (i,j) is true if tokens i and j match:
      h e l l o
h  [  T F F F F  ]
e  [  F T F F F  ]
l  [  F F T T F  ]
l  [  F F T T F  ]
o  [  F F F F T  ]
The second line counts True values in each row, producing [1, 1, 2, 2, 1] - the histogram. But here’s what’s beautiful: that boolean matrix is exactly what an attention head computes. The histogram operation maps directly to a single transformer layer.
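You can check that arithmetic in a few lines of plain Python - again just a sketch of the semantics, not the attention head itself:

tokens = list("hello")
same_token = [[k == q for k in tokens] for q in tokens]   # the 5x5 boolean matrix above
hist = [sum(row) for row in same_token]                   # selector_width: count True per row
print(hist)  # [1, 1, 2, 2, 1]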
This isn’t a programming trick. It’s how transformers actually think about counting.
The real magic happens when RASP programs compile to transformer architectures. Take that sorting example I mentioned. The compiler analyzes the dependency graph and determines it needs exactly 2 layers with 1 attention head each. Step 2 depends on step 1, so they can’t happen simultaneously.
When researchers trained transformers using those exact specifications, they achieved 99.96% accuracy. The theoretical model perfectly predicted empirical reality.
I’ve been working through increasingly complex examples. The “most frequent” task takes a sequence like “dbca” and returns its tokens ordered by decreasing frequency, with ties broken by first occurrence. It combines histogram computation, duplicate detection, and sorting:
# Compute histogram
same_token = select(tokens, tokens, ==)
hist = selector_width(same_token)
# Mark first occurrence of each token
first_occurrence = select(indices, indices, <=) and same_token
is_first = selector_width(first_occurrence) == 1
# Sort by frequency (descending), then by first occurrence
freq_comparison = select(hist, hist, >) or
                  (select(hist, hist, ==) and select(indices, indices, <))
freq_position = selector_width(freq_comparison)   # target slot = number of tokens that sort earlier
# Rearrange to sorted order
result = aggregate(select(indices, freq_position, ==), tokens)
This compiles to 3 layers and 2 attention heads. Tested on actual transformers: 99.96% accuracy.
What strikes me is how these solutions emerge naturally once you think in terms of attention patterns. You’re not fighting the transformer’s computational model - you’re working with it.
RASP has resolved longstanding questions about transformer capabilities. Can they recognize context-free languages like balanced parentheses? Yes - RASP provides explicit programs with exact compilation targets. Dyck-1 languages need 2 layers and 1 head. Dyck-2 needs 3 layers and 1 head.
The solutions become obvious once you can express them in RASP. Track the running balance of brackets, check for negative values, and use attention to propagate illegal states forward. Simple, elegant, provably correct.
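Here is the shape of that Dyck-1 check, sketched in Python with the same prefix-attention flavor. It is my own rendering of the idea, not the paper's exact program:

# Toy Dyck-1 check: running balance computed the way a prefix-attending head would see it
def dyck1_ok(seq):
    n = len(seq)
    # prefix selector: position i attends to every position j <= i
    prefix = [[j <= i for j in range(n)] for i in range(n)]
    # running balance at i = opens so far minus closes so far
    balance = [sum(1 if seq[j] == "(" else -1 for j in range(n) if prefix[i][j])
               for i in range(n)]
    # legal iff the balance never dips below zero and ends at exactly zero
    return all(b >= 0 for b in balance) and (n == 0 or balance[-1] == 0)

print(dyck1_ok("(()())"))  # True
print(dyck1_ok("())("))    # False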
Sorting was particularly interesting. Traditional algorithms need Ω(n log n) comparisons. RASP’s approach makes exactly n² comparisons - one for each position pair. This fits perfectly with attention mechanisms that naturally compute pairwise relationships.
But it also explains why sparse attention loses expressive power. Longformer’s local attention windows can’t make all pairwise comparisons needed for sorting. RASP predicted this limitation before anyone tested it empirically.
Counting reveals another insight. With a beginning-of-sequence token, counting needs only 1 layer and 1 head. Without it, you need 2 heads to create an artificial reference point. This explains why separator tokens are so crucial in practical applications - they’re computational necessities, not just organizational conveniences.
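My mental model of why the BOS token helps, in a short Python sketch: if a head always attends to BOS plus the k matching positions, uniform attention puts weight 1/(k+1) on BOS, and the count can be read straight back off that one weight. A sketch of the idea as I understand it, not the exact construction:

# Toy sketch of counting with a beginning-of-sequence reference point
def count_matches_with_bos(tokens, query):
    seq = ["<s>"] + list(tokens)                        # prepend the BOS token
    attended = [t == "<s>" or t == query for t in seq]  # head attends to BOS plus every match
    bos_weight = 1 / sum(attended)                      # uniform attention weight landing on BOS
    return round(1 / bos_weight) - 1                    # recover the count from that single weight

print(count_matches_with_bos("hello", "l"))  # 2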
The compilation process itself fascinates me. RASP analyzes dependency graphs and maps operations to either feed-forward transformations or attention heads. Aggregates must be placed after their dependencies. Multiple aggregates with different selectors need separate heads. Same selectors can share heads.
The compiler outputs minimal architectures. Not just sufficient - minimal. This gives tight bounds on transformer complexity for specific tasks.
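A toy version of that scheduling logic, the way I picture it - not the actual compiler:

# Toy layer assignment: an operation lands one layer after the deepest operation it depends on
def assign_layers(deps):
    layers = {}
    def layer_of(op):
        if op not in layers:
            layers[op] = 1 + max((layer_of(d) for d in deps[op]), default=0)
        return layers[op]
    for op in deps:
        layer_of(op)
    return layers

# the sorting example: the gather step depends on the position step, so two layers
print(assign_layers({"position": [], "sorted_tokens": ["position"]}))
# {'position': 1, 'sorted_tokens': 2}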
This fundamentally changes how we approach transformer research. Architecture design becomes principled. Instead of guessing layer counts, write the RASP program and compile it. Interpretability gets concrete - compare learned attention patterns to RASP solutions. New architectures become analyzable - RASP reveals exactly which capabilities a modification preserves or sacrifices.
I’m particularly excited about mechanistic interpretability implications. RASP gives us vocabulary for describing expected transformer internals. When attention heads learn histogram computation, we know what patterns to look for.
Of course, RASP has limitations. It only models encoder transformers - decoder causal masking adds constraints RASP doesn’t capture. It assumes symbolic, rule-based solutions rather than explaining how transformers learn statistical patterns from data. Embedding dimensions matter but RASP abstracts them away. It’s a high-level model that doesn’t explain gradient descent discovery processes.
But these limitations don’t diminish its value. RASP bridges symbolic programming and neural computation in ways I never thought possible.
I keep thinking about extensions. Causal RASP for decoders? RASP for multimodal transformers processing image patches? Languages for state space models, linear attention, mixture of experts? Program synthesis going backwards from trained transformers to RASP code?
The broader question fascinates me: what other architectures need their own programming languages? RNNs have automata theory. Transformers have RASP. What about graph neural networks? Diffusion models? Future architectures we haven’t invented?
RASP represents something new in AI research. Not just empirical evaluation or theoretical analysis, but computational understanding. The ability to program machine learning architectures in their native languages.
We’ve become obsessed with scaling laws and emergent capabilities while neglecting fundamental questions about the computational models our architectures implement. Understanding these models isn’t academic curiosity - it’s the foundation for principled design, reliable interpretability, and informed scaling decisions.
When we understand how transformers think, we design better transformers. When we understand limitations, we engineer solutions. When we understand computational primitives, we compose them more effectively.
This feels like the beginning of something larger. A research program bridging programming languages, theoretical computer science, and machine learning. Where understanding computation becomes as important as scaling it.
I want to help build this future. Not just more powerful models, but more understandable ones. Not just better performance, but deeper insight into computational processes making intelligence possible.
That 2 AM debugging session taught me something profound. We don’t just need AI that works. We need AI we can understand, predict, and design with intention. RASP might be the first step toward that goal.
Because if we’re building minds, artificial or otherwise, we should probably understand how they think.