Language models today can pass bar exams and craft poetry, yet they’re essentially performing the same task as your phone’s predictive text—guessing the next word.
This process is known as autoregression: the model predicts each token based on the tokens that came before it, probability math that boils down to asking "what's the most likely next word?"
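Formally, an autoregressive model factorizes the probability of a whole sequence into a chain of next-token conditionals:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$

Every factor conditions only on what came earlier, and once a token is emitted it is never revised.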
Yet, this feels more like a magic trick than genuine understanding—a sophisticated parlor game.
Consider this thought experiment: a 100-step random walk where each step is small, but there’s a 50% chance of a significant jump at a random position. An autoregressive model struggles here because the jump’s position isn’t related to previous steps—it’s a non-local dependency that breaks the sequential assumption.
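Here is a minimal sketch of that toy process; the step sizes, jump magnitude, and 100-step horizon are just concrete choices to make it runnable:

```python
import random

def jumpy_walk(n_steps=100, jump_prob=0.5, jump_size=50.0):
    """Random walk with small Gaussian steps and, with probability
    jump_prob, one large jump inserted at a uniformly random position."""
    jump_at = random.randrange(n_steps) if random.random() < jump_prob else None
    x, path = 0.0, []
    for t in range(n_steps):
        x += random.gauss(0.0, 1.0)   # small local step
        if t == jump_at:
            x += jump_size            # the non-local event
        path.append(x)
    return path

print(jumpy_walk()[:10])
```

Until the jump actually appears, nothing in the prefix tells a left-to-right predictor where it will land; the decisive variable is global, drawn once for the whole sequence.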
Or take a sequence where every term is a function of one hidden number $s$, say $x_i = s \bmod p_i$ for the $i$-th prime $p_i$: the values depend on that single secret, not on each other. To predict the next term, an autoregressive model must reverse-engineer $s$ from the previous terms, essentially solving a system of modular equations. It's like trying to guess someone's password by watching them type parts of it. Technically possible, but wildly inefficient.
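And a toy version of that hidden-number setup; the secret, the moduli, and the brute-force recovery are illustrative choices only:

```python
import random

# Each term is the same secret reduced modulo a different small prime,
# so the terms depend on s alone, not on one another.
MODULI = [7, 11, 13, 17, 19]
SPACE = 7 * 11 * 13 * 17                 # the first four terms pin down the secret
secret = random.randrange(SPACE)
terms = [secret % m for m in MODULI]

# Predicting the fifth term from the first four means recovering s first,
# i.e. solving s ≡ terms[i] (mod MODULI[i]) -- done here by brute force.
recovered = next(s for s in range(SPACE)
                 if all(s % m == t for m, t in zip(MODULI[:4], terms[:4])))
print("recovered secret:", recovered, "| true secret:", secret)
print("predicted next term:", recovered % 19, "| actual next term:", terms[4])
```

Any model that reliably predicts the next term is, implicitly, doing that kind of recovery.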
These aren’t just theoretical issues. They highlight fundamental limitations in how today’s language models operate. They’re like talented mimics who’ve never truly understood the language they’re speaking.
The real potential, I believe, lies in factorized models with meaningful latent variables. Models that don’t just predict surface patterns but capture the underlying concepts and relationships that generate language.
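One way to write down the distinction: instead of chaining token-level conditionals, a latent-variable model posits an unobserved $z$ (a topic, an intent, a plan) and generates the text conditioned on it:

$$p(x) = \int p(x \mid z)\, p(z)\, dz$$

The interesting structure lives in $z$; the tokens are just its projection onto the page.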
Think about it: humans don’t process language as a stream of probabilities. We understand ideas, relationships, concepts—latent variables that shape what we say and comprehend. When you’re reading this, you’re not just predicting my next word—you’re building a mental model of the concepts I’m trying to convey.
Transformers nudge us in this direction with self-attention. Each layer can attend to every token in the context at once, encoding some non-sequential relationships, and that's a big part of why they're better at in-context learning than purely recurrent models. They're catching glimpses of the deeper structure of language.
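For concreteness, here's a stripped-down NumPy sketch of scaled dot-product self-attention, the core operation, without the multi-head, masking, and output-projection machinery of a real transformer:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of token vectors.

    X: (seq_len, d_model) token representations.
    Each output row is a weighted mixture of every value vector, so
    information can flow between arbitrary positions in a single step.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # (seq_len, seq_len) pairwise affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over all positions
    return weights @ V

# Tiny example: 5 tokens, 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```

Nothing in that weighted sum cares about left-to-right order; position only enters through whatever positional encoding you add.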
But I’m particularly intrigued by diffusion language models (DLMs). They flip the script entirely. Instead of generating text token by token like building a sentence one word at a time, they start with noise and iteratively refine it into coherent text.
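To give a flavor, here's a heavily simplified sketch of one popular discrete variant, masked-diffusion-style decoding, where the "noise" is masked-out tokens and each refinement step commits a few more of them. The `denoiser` call and its interface are hypothetical placeholders, not any particular model's API:

```python
import math

MASK = "<mask>"

def diffusion_decode(denoiser, seq_len, n_steps=8):
    """Start from pure 'noise' (all mask tokens) and iteratively refine.

    `denoiser(tokens)` is a hypothetical model call: it returns a dict
    mapping each masked position to a (predicted_token, confidence) pair,
    scored with the whole current sequence in view.
    """
    tokens = [MASK] * seq_len
    per_step = math.ceil(seq_len / n_steps)   # positions to commit each round
    for _ in range(n_steps):
        masked = {i: p for i, p in denoiser(tokens).items() if tokens[i] == MASK}
        if not masked:
            break
        # Commit the most confident predictions; leave the rest for later rounds.
        best = sorted(masked.items(), key=lambda kv: -kv[1][1])[:per_step]
        for pos, (tok, _conf) in best:
            tokens[pos] = tok
    return tokens

# Dummy denoiser just to exercise the loop: fills every mask with "the".
filled = diffusion_decode(
    lambda toks: {i: ("the", 1.0) for i, t in enumerate(toks) if t == MASK},
    seq_len=12)
print(filled)
```

The key contrast with left-to-right generation: every position is predicted with the entire partially-denoised sequence in view, round after round.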
It’s a completely different paradigm. Like sculpting versus building with blocks. The former reveals structure by removing what doesn’t belong; the latter constructs by adding piece by piece.
The theory is that diffusion models might better capture global dependencies and underlying structures in language. They’re not constrained by the left-to-right sequential nature of AR models. They can potentially see the forest, not just the trees.
Of course, there are challenges. Diffusion models require significant compute, since each sample takes many refinement passes rather than one forward pass per token. Language is inherently discrete and carries fewer bits per token than continuous data like images, so the noising-and-denoising machinery has to be rethought rather than borrowed wholesale. And our evaluation metrics for language models, perplexity chief among them, were designed around autoregressive likelihoods.
But what’s exciting is that we’re exploring fundamentally different approaches to language modeling. It’s like early aviation—we’ve built impressive gliders, but we’re still figuring out powered flight.
I think the next breakthrough in language AI won’t come from merely scaling existing models. It’ll come from models that can form meaningful latent representations of language—capturing the essence rather than just the surface form.
True intelligence in language isn’t about predicting the next word. It’s about understanding concepts, relationships, and meaning—the hidden variables that shape what we say and how we say it.
Autoregression has taken us surprisingly far. But to achieve genuine machine understanding, we need to move beyond the parlor trick of next-token prediction and build models that capture the rich latent structure of human language and thought.