Quantity has a quality all its own

The aphorism “quantity has a quality all its own” is usually attributed to Josef Stalin, though the earliest traceable source appears to be an American defense newsletter from 1979.

Whoever said it, they were talking about armies. More soldiers don’t just let you fight the same battles better. Past a certain size, your army can fight a different kind of battle entirely, with new strategies and tactics. The observation applies at other scales, too: firing a machine gun is a very different experience from firing a musket—especially if you’re on the receiving end.

The aphorism might as well be the thesis statement for one of the most surprising discoveries in modern AI research. I touched on it in my last post when I compared Mark V. Shaney’s capabilities to modern LLMs. The discovery is this: bigger language models don’t just do the same things, only better. When they grow large enough, they start to be able to do entirely new things.

The scaling hypothesis

For years, the working assumption in AI research was that more training data and more parameters meant better performance on the tasks the model could already do. A model with twice as many parameters would write sentences twice as fluent, make half as many factual errors, and handle edge cases twice as gracefully (if you can measure any of that). But the improvement would be proportionate; no quantum leap¹ in capability was anticipated.

Researchers soon noticed that this assumption didn’t hold. Some LLM capabilities were simply absent in smaller models, and then, past some threshold of scale, they were present, as if the model had crossed a phase boundary—like water turning to ice. A landmark 2022 Google paper documented more than a hundred such emergent abilities across tasks ranging from arithmetic to logical reasoning to reading comprehension in multiple languages. If you’re wondering why AI researchers keep building bigger and bigger models, this is why.

Multi-step arithmetic, the kind where you carry values across several operations, is a good concrete example. Small language models fail utterly at this task. They don’t do it badly; they basically can’t do it at all. The failure rate barely budges as you scale the model up, until the parameter count passes a magical value. Then accuracy jumps, sharply and non-linearly, from “near zero” to “competent student.” Nothing changed in the training procedure; the model simply grew up. I’m anthropomorphizing, but the similarity to human maturation is uncanny.
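
To make the shape of that jump concrete, here is a toy sketch in Python. It is not fitted to any real benchmark. The threshold, steepness, and ceiling are hypothetical numbers chosen purely for illustration, and the logistic curve is just a convenient stand-in for "flat, then a sharp rise, then flat."

```python
import math

# Toy model only: every constant here is a made-up illustration, not a measurement.
THRESHOLD_PARAMS = 1e10  # hypothetical scale at which the ability appears
STEEPNESS = 6.0          # hypothetical sharpness of the transition, per decade of scale
CEILING = 0.95           # hypothetical accuracy of a "competent student"


def toy_accuracy(num_params: float) -> float:
    """Logistic curve in log10(parameter count): flat, then a sharp rise, then flat."""
    decades_past_threshold = math.log10(num_params) - math.log10(THRESHOLD_PARAMS)
    return CEILING / (1.0 + math.exp(-STEEPNESS * decades_past_threshold))


if __name__ == "__main__":
    # Walk up the scale one order of magnitude at a time.
    for exponent in range(7, 14):
        n = 10.0 ** exponent
        print(f"1e{exponent:<2} parameters -> toy accuracy {toy_accuracy(n):.3f}")
```

With these made-up constants, the printed accuracy sits near zero for the first few orders of magnitude, climbs most of the way to the ceiling within about two decades around the threshold, and then flattens out again.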

Earlier, I used the term “quantum leap” to describe the unexpected jump in capability, the change that LLM researchers didn’t see coming. The change in capability is, indeed, a lot like a change in the energy level of an electron in an atom, which can take only specific, discrete values. If you don’t put enough energy into the atom to push the electron to the next level, it doesn’t move. If you do, it does. That’s a quantum leap.
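
For the curious, the discreteness is easy to see in numbers. The sketch below assumes the simplest possible case, the textbook Bohr model of a hydrogen atom, where level n has energy of about -13.6 eV divided by n squared. The function names are my own, and real atoms are messier than this.

```python
RYDBERG_ENERGY_EV = 13.605693  # hydrogen ground-state binding energy, in electronvolts


def energy_level_ev(n: int) -> float:
    """Energy of level n in the Bohr model of hydrogen (negative means bound)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return -RYDBERG_ENERGY_EV / (n * n)


def leap_energy_ev(n_from: int, n_to: int) -> float:
    """Energy the electron must gain to move from level n_from to level n_to."""
    return energy_level_ev(n_to) - energy_level_ev(n_from)


if __name__ == "__main__":
    # Only these discrete energies exist; there is no "level 1.5" to land on.
    for n in range(1, 5):
        print(f"level {n}: {energy_level_ev(n):.3f} eV")
    print(f"leap from level 1 to level 2: {leap_energy_ev(1, 2):.3f} eV")
```

The leap from the first level to the second costs roughly 10.2 eV. There is nothing in between for the electron to occupy, which is the sense of "quantum leap" I mean here.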

Emergence doesn’t negate the LLM’s fundamental nature as a pattern-generation machine. An LLM can perform multi-step arithmetic because its training corpus, which includes things like mathematics textbooks, contains many examples of just that. Even a smallish LLM might be able to answer a question like “How do you do long division?” if it was trained on exactly that procedure. It just can’t apply those steps the way a larger model can.

How does emergence emerge?

The leading hypothesis is that emergent abilities arise from models being able to build increasingly abstract language representations as they grow. A small model learns surface statistics: word co-occurrences, common phrasings. A larger model, with more parameters to fill and more text to learn from, might be able to find deeper structure in language: grammar, logic, causality, maybe something resembling a model of the reader’s mental state, or even of the world itself. At some point, the theory goes, enough of those deeper structures exist to manifest a new capability—like how enough individual transistors, connected the right way, can suddenly perform computation.

Of course, in the case of the transistors, someone is doing the connecting. The freaky thing about emergent LLM capabilities is that they seem to speak themselves into existence. And neither we nor the models have the ability to inspect internal data structures and definitively say, “this is where arithmetic happens.” (Although interpretability is an area of active research.)

Among other implications, emergence would mean that evaluating an LLM’s capabilities based on older, smaller versions of itself is a fool’s errand. Safety becomes harder and harder to evaluate when you don’t even know what the model can do.

The researchers scaling up language models have learned that quantity has a quality all its own… much faster than anyone expected.

¹ Yes, I really mean “quantum leap,” I promise! Keep reading.