Why Creativity Cannot Be Interpolated
And Why Understanding Is the Path to Get There
Dr. Jeremy Budd (Assistant Professor of Mathematics and its Applications, University of Birmingham),
Dr. Tim Scarfe (MLST)
Summary
AI systems can produce individually novel outputs, but novelty alone is not creativity. We argue that genuine creativity requires respect for constraints -- the accumulated structure of prior discoveries -- and that current AI systems lack this capacity because their training takes greedy paths that preclude the right kind of representations.
Key claims
Creativity requires deep understanding and constraint-respecting extension. “Slop” is the conjugate failure mode -- what you get when novelty is unconstrained by understanding.
More intelligence and more agency can harm transformational creativity. As a corollary of Kenneth Stanley’s Why Greatness Cannot Be Planned, objective-driven optimisation helps with exploratory creativity but inhibits -- perhaps completely -- the transformational kind. This matters for AI efforts trying to square knowledge synthesis with genuine “unknown unknown” discovery.
Constraints operate at three levels: physical (baked into matter), concrete (instantiated in a fixed substrate), and modelled (represented so they can be manipulated, transferred, and counterfactually varied). Understanding is the cognitive form of this third level -- the capacity to navigate between constrained perspectives and integrate across them.
Current AI is only meaningfully creative under human supervision. LLMs are coherent within any single frame, but they possess no trajectory of their own; their aggregated voice resolves into a coherent perspective only when a competent human supplies the grounding.
Human-AI co-creativity is the most promising path forward, though the risk of underplaying human supervision is real -- as evidenced by wild extrapolations about near-term labour-market disruption.
We leave the door open for better architectures. Any system -- biological or artificial -- whose learned, factored, path-dependent representations let it extend its own phylogeny could be genuinely creative, regardless of substrate.
Note: you can also read this on our newly launched MLST Archive site.
“To understand human-level intelligence, we are going to need to understand creativity. It’s a big part of what being intelligent means from a human level, is our creative aspect.”
-- Kenneth Stanley, On Creativity, Objectives, and Open-Endedness -- HLAI Keynote
What are sparks without a fire? Microsoft researchers evaluating an early version of GPT-4 proclaimed “sparks of AGI”, but a fire was, and is still, nowhere to be found. Despite apparent recent breakthroughs, AI on its own is missing the fire of creative power. And without this fire, AI will never venture beyond the territory it was trained on. As Neuroevolution: Harnessing Creativity in AI Agent Design puts it: “While [neural networks] interpolate well within the space of their training, they do not extrapolate well outside it”. By “interpolation” we mean something broader than the mathematical sense: recombination within existing conceptual structure. A system that interpolates may produce individually novel outputs (new sentences, new images), but only by averaging what it has seen, without representing the domain’s actual structure. Creativity, by contrast, respects that structure -- the constraints of a domain -- well enough to extend it, opening up genuinely new dimensions in the space of possibilities.
Figure 1: Neuroevolution: Harnessing Creativity in AI Agent Design (MIT Press, 2025) by Risi, Tang, Ha, and Miikkulainen -- a comprehensive treatment of evolutionary approaches to neural network design and the open-ended creativity they enable. (Miikkulainen is a long-time collaborator of Kenneth Stanley, whom we will meet shortly.)
Creativity is not random. Many people picture it as chaotic: throw enough paint at the wall and eventually you get a Pollock. But it is more like fitting puzzle pieces together for a puzzle that never existed -- the pieces must still interlock, even as you invent the picture. Yes, there is serendipity. But the stumbling happens along paths carved by structure, not by chance.
We want AIs that can “think”, but what is thinking? Nobel laureate Daniel Kahneman’s 2011 bestseller Thinking, Fast and Slow (Kahneman 2011) divides thinking into two systems.[1] “System 1” thinking is fast, intuitive, and instinctive. It can make effective judgements when grounded in experience, but it operates within familiar territory. System 1 is what current AI systems do well: rapid pattern matching within their training distribution. But pattern matching fails when the territory is genuinely new. Every domain we care about (writing code, driving cars, doing science, counselling patients) demands handling unknown unknowns: situations no training set anticipated. As we shall see, more intelligence can paradoxically make this harder, not easier.
“System 2” thinking is slow and deliberate, and is epitomised by reasoning. Unlike System 1, reasoning can venture into unfamiliar terrain by breaking the unknown into familiar pieces, constrained by the logic of what must fit together. This is the constraint-respecting mode of thought: not free association, but structured exploration where each step must cohere with what came before.
For an AI to “reason”, then, it must engage in some kind of deliberate, structured, compositional process that is aimed at acquiring knowledge and understanding. Not reasoning is very different from reasoning poorly. For example, if you ask me to find the best move in a chess position, I might make lots of mistakes in my analysis and miss the best move, yet still be reasoning. By contrast, Magnus Carlsen might “see” the best move instantly, without doing any explicit reasoning. Thus, whether one is reasoning is determined neither by the task one is performing nor by the quality of knowledge one acquires (a non-reasoner may acquire better knowledge), but by the process one is using.
We do not acquire knowledge in a vacuum. You don’t really understand physics right after a lecture, or even after a degree -- you understand it after doing the exercises, after years of reflection, building bridges to your own experience.[2] Understanding is less “acquired” than it is synthesised and constructed.
Human understanding can be asymmetric: we often grasp things in a discriminative way that we cannot articulate generatively. This is what we call taste -- an ineffable sense of what works, even when we cannot say why or produce it on demand. Human creatives working in complex, ambiguous domains exploit this asymmetry: they generate many candidates and then discriminate, using their superior taste to select the better paths. Over time, this becomes self-adversarial -- each round of discrimination sharpens the generator, raising the bar for what taste will accept next.
Current AI systems suffer from a far more extreme asymmetry. They can often recognise good solutions, yet generate mediocrity. This is partly because generation requires the deep structural knowledge that constrains the search, while verification can lean on shallower pattern matching; and partly because discrimination focuses a model’s full capacity on a single judgement, while generation disperses it across the output space, representations, and context with a fixed computational budget per step. As we shall see, much recent progress has come from adding external constraints, but the understanding those constraints embody comes from outside the system, not from within. Humans too use constraints to navigate domains that exceed their generative grasp -- the difference is that our taste is far richer, so we can provide our own scaffolding.
But intelligent reasoning is not simply applying a deliberate, structured, compositional process. A calculator applies such a process, and might produce in you the new knowledge that 127,763 * 44,554 = 5,692,352,702 (aren’t you glad). Yet a calculator is hardly intelligent. More is needed, and we will argue that what separates robust generalisation from brittle skill is something that looks a lot like creativity -- the capacity to respect and extend the structure of what came before.
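To make the calculator point concrete, here is a minimal sketch of a "deliberate, structured, compositional process" that involves no understanding at all: schoolbook long multiplication. The function name `long_multiply` is our own illustrative choice; the sketch simply confirms the product quoted above.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers one digit of b at a time.

    Each step is fixed in advance: take a digit, scale by its place value,
    add to the running total. Nothing here adapts to new kinds of task.
    """
    total = 0
    for position, digit_char in enumerate(reversed(str(b))):
        total += a * int(digit_char) * 10 ** position
    return total


product = long_multiply(127_763, 44_554)
print(f"{product:,}")
```

The procedure is perfectly reliable and perfectly rigid: all skill, in the sense used later in this article, with no capacity to acquire new skill.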
Intelligent reasoning needs creativity (but not vice versa)
Why “but not vice versa”? Because creativity does not need intelligence. Evolution produced the entire tree of life through blind variation and selective retention, with no intelligence at all. Daniel Dennett had a name for this: competence without comprehension (Dennett 2017).
Competence Without Comprehension
One of Darwin’s 19th-century critics captured the idea perfectly, albeit in outrage: Darwin, “by a strange inversion of reasoning, seems to think Absolute Ignorance fully qualified to take the place of Absolute Wisdom in all of the achievements of creative skill” (Dennett 2009). As Dennett loved to point out: that’s exactly right. The eagle’s wing, the dolphin’s fin, the human eye -- all designed by a process with no insight, no purpose, no mind at all. Turing stumbled on the same strange inversion: a computing machine need not know what arithmetic is to perform it perfectly. Both showed that competence bubbles up from below: “understanding itself is a product of competence, not the other way around”. We “intelligent designers” are among the effects of this process, not its cause.
But evolution still has constraints -- physical and concrete, baked into the laws of nature and matter itself, rather than modelled in representations that can be manipulated, transferred, and varied. How creativity can operate through such constraints without understanding is a tension we will resolve through our analysis of AlphaGo and the hierarchy of constraint adherence.
Chollet and “strong” reasoning
In 2019, Keras author François Chollet proposed a framework for measuring intelligence, focusing on generalisation as the key idea.
Generalisation requires more than skill -- the ability to perform a static set of tasks. A calculator is all skill; it can only do what it was hard-wired to do. Generalisation requires the capacity to acquire capacity, on the fly, in response to new challenges. Chollet defines intelligence as:
“The intelligence of a system is a measure of its skill-acquisition efficiency over a scope of tasks, with respect to priors, experience, and generalization difficulty.”
Chollet has more recently called this “fluid intelligence”. Note that this measure is relative to a scope of tasks; Chollet rejects the idea of universal intelligence, in stark contrast to folks like Legg and Hutter who think a single dimension of intelligence could rank humans, animals, AIs, and aliens alike.
Figure 2: General intelligence as program synthesis: an intelligent system composes skill programs on-the-fly to handle novel tasks. The framework is deliberately capability-level -- it measures what a system can do (task in, skill out) without prescribing how the system achieves it. Source: Chollet 2019
To summarise, in Chollet’s own words, general intelligence is “being able to synthesise new programs on the fly to solve never-seen-before tasks”. Chollet gives a spectrum of generalisation: local generalisation handles known unknowns within a single task; broad generalisation handles unknown unknowns across related tasks; and extreme generalisation handles entirely novel tasks across wide domains. The late cognitive scientist Margaret Boden -- whose typology of creativity we will develop in Section 2 -- drew an influential distinction between exploratory creativity (finding new solutions within an existing framework) and transformational creativity (reshaping the framework itself). In her terms, Chollet’s intelligence is a powerful form of exploratory creativity. Chollet would argue that this covers broad and even extreme generalisation, but as we will see, his framework bounds these within fixed priors. Genuinely unknown-unknown territory requires transformational creativity: the capacity to extend or reshape the space of possibilities.
Figure 3: Chollet’s framework in brief. Source: Chollet 2024
Chollet does use the term “unknown unknowns” for his broader generalisation levels, yet his framework bounds them in two ways. First, he assumes that human-like general intelligence shares our innate Core Knowledge priors (basic cognitive capacities like objectness, agentness, number, and geometry), arguing that these priors “are not a limitation to our generalisation capabilities; to the contrary, they are their source”. Second, he explicitly limits scope to “human-centric extreme generalisation...the space of tasks and domains that fit within the human experience”. These two bounds are related: as Chollet himself writes, priors “determine what categories of skills we can acquire”. He sees this as enabling (the No Free Lunch theorem means you need assumptions to learn at all), but it also means the “wide domains” of extreme generalisation are still those that Core Knowledge lets you make sense of.
Chollet’s “unknown unknowns” are novel combinations within this prior-bounded space, not paradigmatically new discoveries that expand the space itself. Evolution produced Core Knowledge priors in the first place -- that is the meta-level creativity Chollet’s framework cannot account for. His measure presupposes the priors; it cannot explain their origin. Chollet himself is candid about this: “the exact nature of innate human prior knowledge is still an open problem” (Chollet 2019).
Three claims are in play: Core Knowledge (Spelke and Kinzler 2007) is psychological (innate capacities for objectness, agentness, number, geometry); Chollet’s measure is epistemological (skill-acquisition efficiency given those priors); his “kaleidoscope hypothesis” is ontological (reality itself is built from recurring patterns). Philosopher Mazviita Chirimuuta identifies this last claim as recognisably Platonic (Chirimuuta 2024). The position echoes Chomsky’s rationalism (Chomsky 2023): both treat intelligence as exploration within fixed innate priors, and remain silent about where those priors came from. Chirimuuta’s Kantian counter: the patterns may be “demands of human reason” rather than discoverables (Chirimuuta 2024) -- in which case, the benchmark measures fitness to a particular model of mind.
This matters for creativity. If Chollet’s priors are the bedrock of cognition, the distinction between exploratory and transformational creativity collapses -- all creativity becomes exploration within a fixed space. Our position requires that priors are contingent: evolved, path-dependent, and in principle revisable -- shaped by the same meta-level process that Chollet’s framework cannot account for.
We do not need to settle these questions here. What matters is what Chollet gets right. His core insight -- that general intelligence amounts to on-the-fly program synthesis -- has proved highly productive. The ARC Prize competition, built around his benchmark, has drawn thousands of participants (Chollet, Knoop, and Kamradt 2025), and Chollet has since founded ndea, a research lab dedicated to combining program synthesis with deep learning. Much of this article owes its framing to Chollet’s thinking.
Where we extend Chollet is in asking how a system builds the internal structure that makes program synthesis possible. Chollet’s framework measures skill-acquisition efficiency and screens off internal mechanism from the description. Unfortunately, that means that a system can score well on capability benchmarks and still lack anything we would recognise as understanding. As we saw above, synthesis is deeply linked to how we acquire knowledge and understanding. We will call this process of composing models on the fly (to handle novelty) strong reasoning, to distinguish it from the meagre processes used by the likes of a calculator. Understanding how a system builds the internal structure that makes such composition possible is one of the central questions of this article.
Stanley and the need for open-endedness
A key architectural omission from Chollet’s account is the notion of agency. When Tim interviewed him in 2024, Chollet expressed a strong interest in exploring the topic more deeply but said (after the interview) that he didn’t yet have a “crisp” way to do so. Curiously, the third version of Chollet’s ARC-AGI benchmark has been designed to target “exploration, goal-setting, and interactive planning”, which Chollet considers to be “beyond fluid intelligence”.
But computer scientist Kenneth Stanley, author of Why Greatness Cannot Be Planned and one of the deepest thinkers about AI creativity, sees things differently. His book deliberately avoided defining intelligence -- its target was the tyranny of objectives, the very paradigm that Chollet’s task-solving measure exemplifies. In later work, Stanley argued that “it was open-ended evolution in nature that designed our intellects the first time” (Stanley 2019), and in our interviews he has described creativity as “a big part of what being intelligent means at the human level” (Stanley 2021). Where Chollet treats exploration and goal-setting as beyond the scope of his benchmark, Stanley sees them as the heart of the problem.
There is a deeper connection here. Agency is goal-directed by definition: it takes actions to achieve goals. Intelligence, in Chollet’s sense, is about how efficiently you learn given priors and experience, not about what you are searching for. But Chollet’s picture of intelligence is still deployed toward objectives: you acquire skills in order to solve tasks. So both share the same vulnerability when those objectives are misspecified. When the objective is what Stanley calls a “false compass”, both become blinkers -- focusing attention on the goal while missing the stepping stones that don’t resemble it. More intelligence or more agency just means charging faster in the wrong direction, efficiently acquiring the wrong skills. Intelligence and agency only help if you happen to be solving the right problem or moving toward the right goal -- they are tools for exploratory creativity, not transformational creativity. But when the objective is genuine -- when constraints have accumulated and the problem is well-defined -- intelligence can actually help you. The more knowledge and structure you bring to a task, the more efficiently intelligence can exploit it. This is why Chollet’s measure includes “priors” and “experience”: intelligence leverages what you already have.
Stanley argues that convergent, goal-directed thinking limits the imagination; that divergent thinking is required to discover knowledge of unknown unknowns. Paradoxically, Stanley argues, this open-endedness is also essential for solving complex tasks. Complex and/or ambitious tasks are “deceptive”, which is to say that (some of) the stepping stones towards solving them are very strange, seemingly unrelated to the task. As the Neuroevolution textbook puts it, these approaches “are motivated by the idea that reaching innovative solutions often requires navigating through a sequence of intermediate ‘stepping stones’ -- solutions that may not resemble the final goal and are typically not identifiable in advance”. For example, the worst way to become a billionaire is to get a normal corporate job and incrementally maximise your salary. A great example of a strange path to greatness was YouTube, which was started as a video dating website!
In our interviews with Stanley, he has repeatedly emphasised this point.
“The smart part is the exploration. The dumb part is the objective part because it’s freaking easy. There’s nothing really insightful or interesting about just doing objective optimization. […] Once I say that what you need to be good at is if I define where I want you to go and then you can get there, then I’m basically training you not to be able to be smart if you don’t know where you’re going. But that’s what creativity is. It’s about being able to get somewhere and be intelligent even though you don’t know where your destination is.”
Prof. KENNETH STANLEY - Why Greatness Cannot Be Planned
Stanley therefore prescribes abandoning objectives, and becoming open-ended by searching for novelty.
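One common way to operationalise “searching for novelty” in the novelty-search literature is to score a candidate by how far its behaviour lies from behaviours already seen, typically as the mean distance to its nearest neighbours in an archive. The sketch below is a minimal illustration under our own assumptions: behaviours are points in a plane, and the function name, the value of `k`, and the toy archive are all hypothetical rather than taken from any particular implementation.

```python
import math


def novelty(behaviour, archive, k=3):
    """Mean Euclidean distance from `behaviour` to its k nearest archived behaviours.

    There is no objective here: a candidate scores highly simply by being
    unlike what has been seen before.
    """
    if not archive:
        return float("inf")  # everything is novel when nothing has been seen
    distances = sorted(math.dist(behaviour, other) for other in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


# Toy archive of previously seen behaviours (assumed data).
archive = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]

scores = {
    "familiar": novelty((0.05, 0.05), archive),
    "strange": novelty((3.0, 4.0), archive),
}
print(scores)
```

A search that keeps whichever candidates score highest, and adds them to the archive, is rewarded for reaching strange stepping stones rather than for resembling a goal.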
What exactly is open-endedness? In 2024, a team led by Tim Rocktäschel -- the open-endedness team lead at Google DeepMind and Professor at UCL -- formally defined an open-ended system as one which produces a sequence of artefacts which are:
Novel, i.e. “artifacts become increasingly unpredictable with respect to the observer’s model at any fixed time”.
Learnable, i.e. “conditioning on a longer history makes artifacts more predictable”.
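The two conditions above can be sketched as checks on an observer's prediction loss. This is a simplified, finite toy under our own assumptions, not the paper's formal statement: `loss(t, T)` is a hypothetical function giving the error of the observer's model at time `t` when predicting the artefact produced at a later time `T`, and strict step-by-step monotonicity stands in for the more careful conditions of the original definition.

```python
from typing import Callable

# loss(t, T): error of the observer's time-t model on the artefact at time T > t.
Loss = Callable[[int, int], float]


def looks_novel(loss: Loss, horizon: int) -> bool:
    """For every fixed observer time t, later artefacts are harder to predict."""
    return all(
        loss(t, T + 1) > loss(t, T)
        for t in range(horizon)
        for T in range(t + 1, horizon)
    )


def looks_learnable(loss: Loss, horizon: int) -> bool:
    """For every artefact T, a longer observed history makes it easier to predict."""
    return all(
        loss(t + 1, T) < loss(t, T)
        for T in range(2, horizon + 1)
        for t in range(T - 1)
    )


# Three toy observers' loss profiles (all assumed for illustration).
def gap_loss(t, T):
    return float(T - t)  # depends on how far ahead we predict


def history_only_loss(t, T):
    return 1.0 / (1 + t)  # history helps, but the future never gets harder


def horizon_only_loss(t, T):
    return float(T)  # the future gets harder, but history never helps


for name, fn in [("gap", gap_loss), ("history", history_only_loss), ("horizon", horizon_only_loss)]:
    print(name, looks_novel(fn, 6), looks_learnable(fn, 6))
```

Only the first profile passes both checks. The second is learnable but never surprising, and the third is surprising but never learnable, which is why the definition asks for both properties at once.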
We will return to this formal definition of open-endedness in Section 3, but for now notice what Chollet and Rocktäschel are both saying. Chollet’s general intelligence must “synthesize new programs” to “solve never-seen-before tasks”; Rocktäschel’s open-ended systems must produce “novel” and “learnable” artefacts. Both of these are describing creativity! The “standard definition of creativity” calls a work creative if it is (a) original or novel, and (b) effective or valuable. In our interview with Rocktäschel, Tim Scarfe observed: “I actually interpreted your definition of open-endedness as ... a definition of creativity”.
Open-Ended AI: The Key to Superhuman Intelligence?
Creativity is thus the key to efficient generalisation and to open-ended exploration.
Figure 4: Kenneth Stanley on creativity and LLMs. Source: Stanley 2025
Agency requires intelligence -- you cannot have directed, purposeful behaviour without some capacity to model and respond to the world (Schlosser 2019). In biological systems, intelligence and agency co-evolved and remain tightly coupled. But artificial intelligence need not be agentic; there is no reason a system with knowledge and reasoning capacity must also have future-pointing control. Even when intelligence is coupled with agency, Stanley’s point still holds: fixed goals can constrain the very creativity needed to find problems worth solving -- unless the agent happens to be pointing in the right direction already, as we will discuss later.
Is that all there is to AI creativity?
The “standard definition” lays out two criteria for creativity, but are those all you need? Creativity theorist Mark Runco thinks not. In two 2023 essays, Runco agreed that AI systems can, and indeed have, produced novel and effective outputs, but argued that we must not focus only on the products of a system and ignore the processes by which those are produced. Runco adds two more criteria: authenticity and intent.[3]
A system is authentic if it acts in accordance with beliefs, desires, motives etc. that are both its (rather than someone else’s) and express who it “really is”; authenticity is the opposite of being derivative. A system has intent if it is the reason why it does the things it does. If an AI system solves problems, but neither finds those problems nor has any intrinsic motivation to solve them, are those solutions really creative?
Both of Runco’s criteria speak to a key distinction: creative ideas are not just original (a property of the product) but must also originate (a process) from their creator. Runco argues that AI systems lack key processes of human creativity, such as intrinsic motivation, problem-finding, autonomy, and (most starkly) the expression of an experience of the world. Runco concludes:
“Given that artificial creativity lacks much of what is expressed in human creativity, and it uses wildly different processes, it is most accurate to view the ostensibly creative output of AI as a particular kind of pseudo-creativity.”
But is Runco right about the creativity needed for intelligent reasoning, rather than creative expression? Must this look like human creativity? To borrow a comment from Richard Feynman: our best machines don’t go fast along the ground the way that cheetahs do, nor fly like birds do. A jet aeroplane uses “wildly different processes” to fly than an albatross, but is it pseudo-flying? We are not claiming that different processes cannot work -- only that the particular processes used by current AI systems demonstrably fail in ways (adversarial brittleness, lack of transfer, derivative outputs) that reveal shallow pattern-matching rather than genuine comprehension. The principled distinction is this: understanding constrains and guides the creative search; without it, outputs are derivative or random. Intent merely motivates the search. You can be creative by accident (Spencer’s microwave, evolution itself), but you cannot be creative without respecting constraints. That is why we require understanding but not intent.
Remember our central question: what qualities do AI systems need to perform reasoning tasks (planning, science, coding, etc.) in generalisable and robust ways? As we have seen, something that looks like, and quite possibly quacks like, creativity is needed. We must now ask: are authenticity and intent required for this creativity?
Creativity needs to respect the phylogeny
“I believe that it is possible, in principle, for a computer to be creative. But I also believe that being creative entails being able to understand and judge what one has created. In this sense of creativity, no existing computer can be said to be creative.”
-- Melanie Mitchell, Artificial Intelligence: A Guide for Thinking Humans (Mitchell 2019)
Being inspired vs. being derivative
Can something derivative ever be creative? Is not a derivative system, in the end, merely laundering ideas from somewhere else? There is no creativity in the plagiarist. But one might object -- as Alan Turing noted -- with the old saw that “there is nothing new under the sun”. Is not all creation derivative? Do not all creatives, from Shakespeare to Newton, stand on the shoulders of giants?
To make sense of this, we must distinguish being inspired (where existing material flows through a creator, who makes it their own) from being derivative, where existing material is pieced together with little deliberate input from the creator. The quintessential derivative system is a photocopier, which copies with zero understanding. Mitchell was onto something: understanding is crucial for authentic human creativity.
Understanding of what, exactly? We can draw a wonderful illustration by looking at Kenneth Stanley’s 2007 Picbreeder website experiment. On Picbreeder, users could start from an image, get that image to produce “children”, then choose which child would be their new image, and so on. Behind the scenes, these images were being produced by neural networks, which evolved in response to the user’s choice via Stanley’s NEAT algorithm (NeuroEvolution of Augmenting Topologies) -- an evolutionary method that grows network structure incrementally. The project was collaborative: users could publish their images, and other users could start from published images rather than from scratch, creating a phylogeny of images.
Figure 5: Picbreeder phylogeny: the evolutionary tree showing how users collaboratively evolved images, including the famous “skull” lineage. Source: Kumar, Clune, et al. 2025
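The breeding loop described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not Picbreeder's or NEAT's actual code: the class and function names are ours, the `choose` callback stands in for the human user, and we include only a NEAT-style “add node” mutation (splitting an existing connection), leaving out weight mutation, add-connection mutation, innovation numbers, and image rendering.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Connection:
    source: int
    target: int
    weight: float
    enabled: bool = True


@dataclass
class Genome:
    node_count: int
    connections: list = field(default_factory=list)


def add_node(genome: Genome, rng: random.Random) -> Genome:
    """Return a child in which one enabled connection is split by a new node."""
    child = Genome(
        genome.node_count,
        [Connection(c.source, c.target, c.weight, c.enabled) for c in genome.connections],
    )
    candidates = [c for c in child.connections if c.enabled]
    if not candidates:
        return child
    old = rng.choice(candidates)
    old.enabled = False  # the old link is kept in the genome but switched off
    new_node = child.node_count
    child.node_count += 1
    # Weight 1.0 into the new node and the old weight out of it, so the
    # child starts out behaving much like its parent.
    child.connections.append(Connection(old.source, new_node, 1.0))
    child.connections.append(Connection(new_node, old.target, old.weight))
    return child


def breed(parent: Genome, choose, generations: int, litter_size: int = 4, seed: int = 0):
    """Repeatedly offer children to `choose` and record the chosen lineage."""
    rng = random.Random(seed)
    lineage = [parent]
    for _ in range(generations):
        children = [add_node(lineage[-1], rng) for _ in range(litter_size)]
        lineage.append(choose(children))
    return lineage


start = Genome(node_count=2, connections=[Connection(0, 1, 0.5)])
lineage = breed(start, choose=lambda children: children[0], generations=3)
print([g.node_count for g in lineage])
```

Each chosen child inherits everything its ancestors accumulated and adds a little structure on top, which is the sense in which the lineage, rather than any single network, carries the history of choices.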
In a 2025 paper, Akarsh Kumar and Kenneth Stanley point out that the networks producing these images have incredibly well-structured representations. Changing different parameters in the “skull” network could make the mouth open and close, or the eyes wink. In our interview with Stanley, he argued that the crucial ingredient was the open-ended process by which users arrived at these images:
“On the road to getting an image of a skull, they were not thinking about skulls. And so, like when they discovered a symmetric object like an ancestor to the skull, they chose it even though it didn’t look like a skull. But that caused symmetry to be locked into the representation. You know, from then on, symmetry was a convention that was respected as they then searched through the space of symmetric objects. And somehow this hierarchical locking in over time creates an unbelievably elegant hierarchy of representation.”
Deep Learning has “fractured” representations [Kenneth Stanley / Akarsh Kumar]
These remarkable representations were the result of users respecting the phylogeny of the images they manipulated. By contrast, when Kumar et al. trained the same network to produce a Picbreeder image directly via stochastic gradient descent (SGD), ignoring this phylogeny, the image was almost identical but the representations were “fractured and entangled” -- in a word, garbage. Where the evolved network had parameters mapping to meaningful features (symmetry, mouth shape, eyes), the SGD-trained network smeared these across its weights with no interpretable structure. As Stanley put it: the SGD skull is “an impostor underneath the hood”. The Neuroevolution textbook generalises this finding:
“Where SGD tends to entrench fractured and entangled representations, especially when optimizing toward a single objective, NEAT offers a contrasting developmental dynamic. By starting with minimal structures and expanding incrementally, NEAT encourages the emergence of modular, reusable, and semantically aligned representations.”
AI is SO Smart, Why Are Its Internals “Spaghetti”? - Kenneth Stanley & Akarsh Kumar
All ideas have a phylogeny in this way, most much subtler and more complex than in Picbreeder, and respect for this phylogeny is the difference between being inspired and being derivative. Inspiration is about understanding the phylogenies of the ideas one borrows, and thereby creating new works that deliberately extend those lineages. Ironically, to be “derivative” is to derive too little from one’s sources!
Among the riches of the phylogeny are what Daniel Dennett called “free-floating rationales”: reasons for a design’s structure that exist whether or not any mind grasps them. The eye has reasons for having a lens, but nobody had to understand them for the lens to evolve. In human creativity, by contrast, those same rationales become represented, manipulable, transferable.
This understanding comes in different levels. At the lowest is shallow, surface-level understanding, drawing very little from the riches of the phylogeny. A forger may paint a perfect copy of the Mona Lisa yet be hopeless at painting a new portrait, because all they understood was paint on canvas. Systems like Midjourney may produce impressive images, but their outputs are derivative of their vast training data (and users’ prompts), sometimes to the level of, in Marcus and Southern’s words, “visual plagiarism”. These systems consume billions of images, but only as collections of pixels, and often demonstrate basic misunderstandings of image content, such as struggling to draw watches at times other than 10:10. This shallow understanding leads only to a “creativity” that recombines and remixes existing ideas. In her essay “What is creativity?”, Boden called this “combinational creativity”, but because these systems recombine without understanding -- without grasping why the pieces fit -- we prefer to call it, at best, quasi-creativity. It may produce novel outputs, but there are no new ideas underlying those outputs -- just existing ones arranged in a new way.
The next level is domain-specific understanding. By understanding how the ideas and tools work within a domain (or what Boden calls a “conceptual space”) one obtains “exploratory creativity”, the ability to discover new possibilities within that space. This is the workhorse of human creativity. As Boden urges, “many creative achievements involve exploration, and perhaps tweaking, of a conceptual space, rather than radical transformation of it” -- Nobel Prizes reward “ingenious and imaginative problem solving”, not Kuhnian revolutions. Even some of our most celebrated creative achievements stem from thinking deeply “inside the box”.
Finally, the highest level is domain-general understanding. When one understands one’s tools in themselves, beyond their common or intended uses, one can use them in ever more creative ways. A wonderful example of this in action is the “square peg in a round hole” scene from Apollo 13. Domain-general understanding is the key to what Boden calls “transformational creativity”, the ability to create new conceptual spaces. To make sense of a new conceptual space, one must understand how to extend phylogenies into this new domain -- to understand gravity but not as a force, or harmony but without a tonal centre. To think “outside the box”, one needs to understand what happens to one’s tools when they are taken out of the box.
Apollo 13 (1995) - Square Peg in a Round Hole Scene
The boundary between exploration and transformation lies, somewhat, in the eye of the beholder. One person's "new domain" might be another's "new possibility within a domain". Stanley remarked on a draft of this very article that he thinks of combinatorial and exploratory creativity as ways to find a new location within the space you're in, whilst transformational creativity is about "adding new dimensions to the universe". In this view, NEAT's complexification operators -- which add new nodes and connections to an evolving network -- are a concrete realisation of transformational creativity. Boden argued that a prima facie transformatively creative AI was built as far back as 1991 by Karl Sims. Therefore, the key question is not "can we make transformatively creative AIs?" Instead, we should ask how deep the AI's understanding was that led to its surprising outputs, and what spaces it can and can't make sense of.
A derivative system (ironically: not derivative enough!) will not generalise -- it lacks the phylogenetic understanding needed to extend ideas into unfamiliar settings, and its reliance on surface features makes it brittle.
All this said, derivative systems may still be useful for reasoning: they might extract ideas or reasoning patterns which, whilst pre-existing in data (or the user!), were previously inaccessible. This may be very valuable in creative reasoning pipelines -- as we will soon explore. Not all AI systems are equally derivative. Google DeepMind's AlphaZero had, well, zero training data, and we will later explore the extent of AlphaZero's creativity.
Agency, intent, and Why Greatness Cannot Be Planned
What about Runco's criterion of "intent"? This, alongside the stronger sense of authenticity as expressing "who one really is", suggests that agency is needed for creativity. By agency we mean control over the expected future -- taking actions now to shape what comes next. As Claude Shannon, the founder of information theory, observed: "We know the past but cannot control it. We control the future but cannot know it."4 Agency operates in this gap: we act on our expectations, which may prove wrong, and we can acquire new goals as understanding evolves. Surely the more agency you have, the more creative you can be, right?
But here the plot thickens since, as Stanley says, greatness cannot be planned! Too much agency -- too much control -- is anathema to creativity. Stanley's insight is that creativity finds its most fertile ground when you are unfettered and open to serendipity. Serendipity doesn't imply greatness, but it's so often present when greatness occurs!
But we must be careful here. The point is not that you should have no agency at all -- quite the opposite. Follow someone else's objectives and you explore their search space, not your own; surrendering your agency is, on average, the worst way to be creative, because you are less likely to stumble upon spaces that only your particular trajectory could reach. The real insight is about the kind of agency that matters: agency diffused across many independent actors, each following their own gradient of interest.
Both creativity and intelligence use priors -- the difference is direction. Intelligence converges toward a known goal; creativity diverges into unknown territory, using constraints to keep the search coherent. Constraints enable rather than determine: grammar constrains what you can say without determining it; physics made eyes possible without encoding them as a destination. Evolution has no agency -- it cannot plan -- but exhibits teleonomy: apparent goal-directedness from selection pressure rather than intention (Pittendrigh 1958). For agents who can plan, a different kind of agency helps creativity: the "nose for the interesting" that Stanley emphasises -- taste-driven, intuitive orientation toward the unknown. As Stanley puts it:
"The gradient of interestingness is probably the best expression of the ideal divergent search. Not everything that's novel is interesting, but just about everything that's interesting is novel."
Prof. KENNETH STANLEY - Why Greatness Cannot Be Planned
The best ideas are often those you were not seeking. One day in 1945, the engineer Percy Spencer was working on a radar set, and when he stood near a cavity magnetron, the chocolate bar in his pocket melted! Spencer recognised this sticky misfortune for what it truly was: it was an unplanned experiment on what microwaves do to food, and he understood what it meant -- leading him to invent the microwave oven! Creativity is thus less about one's control over the world, and more about one's ability to adapt to the curveballs the world throws, grounded in one's deep understanding.
Intent is, therefore, not a necessary condition for creativity. Both purposeful and non-purposeful creativity can work; human creativity often involves unintended twists, and as we've seen, creativity doesn't require agency at all. It may not matter if an AI theorem prover does not care about the Riemann Hypothesis, or if a driverless car does not choose its destination. But a creative output must originate in a system for us to call that system creative for producing it, and this origination requires being grounded in and deliberately extending the phylogeny.
Can anything originate in an AI system? Ada Lovelace, the first ever computer programmer, famously argued that it couldn't:
"The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform."
Boden gives a key response to Lovelace: what if an AI system changes its own programming? We can order it to perform some task, but allow it to determine how exactly it does so. Boden points to evolutionary algorithms, such as in Bird and Layzell's 2002 "Evolved Radio", as permitting AI systems to give themselves genuinely novel (to the AI) capabilities.
Doing things one wasn't ordered to do is not enough, though. As Mitchell argues, creativity requires understanding and judging what one has created (Mitchell 2019). A monkey at a typewriter might produce Hamlet, but it could never repeat this miracle -- origination requires a process that systematically produces that sort of thing. Spencer's chocolate melting was an accident, but it was no accident that it led him to invent the microwave oven; had the bar melted some other day, he would have invented it just the same.
So we have a framework: creativity requires respecting the phylogeny, and origination requires understanding. How do today's AI systems measure up?
Are LLMs creative?
The test is whether these systems respect the phylogeny, and whether what they produce can be said to originate in them. We start with LLMs -- trained on vast quantities of human text -- then turn to game-playing systems like AlphaGo and AlphaZero, which learn from self-play alone. Each fails differently, and the contrast sharpens the picture.
Way back in 2019 -- when, as far as LLM history is concerned, dinosaurs roamed the Earth -- the lowly GPT-2 could write poems.
Fair is the Lake, and bright the wood,
With many a flower-full glamour hung:
Fair are the banks; and soft the flood
With golden laughter of our tongue
Not bad for such an antiquated model, right? Well, not exactly. This poem is a short extract from a list of a thousand samples, 99.999% of which is junk. One finds many patterns in clouds, but the clouds are not creative!
ChatGPT was something new. Suddenly, here was a system you could ask to write an email as a Shakespearean sonnet, and it just... would. It wouldn't be perfect, or even all that good, but you wouldn't have to sift through pages of nonsense. And then GPT-4 landed a few months later, and was so much better. The hype went into overdrive; the exponential was upon us. No wonder that within weeks of GPT-4's release, there were predictions of "AGI within 18 months"!
But now the hype has started to fade. The systems are more capable than ever, yet people are increasingly unimpressed. GPT-5 landed less with a bang and more with a shrug.5 What is going on? Are these systems showing any creativity, or even quasi-creativity? Are they wholly uncreative "stochastic parrots"? Why have LLMs lost their shine?
Can you measure LLM creativity?
Measuring creative thinking is not straightforward. One of the "6 P's of Creativity" is persuasion: a truly creative reasoner can produce "wrong" solutions just as valid as the "right" answer, and a benchmark that cannot be persuaded will reject them -- this has already happened. Still, some aspects can be tested. In a 2024 Nature study, GPT-4 outperformed humans on three standard divergent thinking tasks -- generating unusual uses, surprising consequences, and maximally different concepts. As computer scientist Subbarao Kambhampati emphasised:
"We think idea generation is the more important thing. LLMs are actually good for the idea generation [...] Mostly because ideas require knowledge. It's like ideation requires shallow knowledge and shallow knowledge of a very wide scope. [...] Compared to you and me, they have been trained on a lot more data that even if they're doing shallow, almost pattern match across their vast knowledge, to you it looks very impressive. And it's a very useful ability."
Do you think that ChatGPT can reason?
(Note Kambhampati's careful phrasing: "shallow" and "almost pattern match". LLMs often act as if they have knowledge, but they cannot distinguish truth from statistical association -- they lack the grounding that would make it knowledge proper.)
But divergent thinking is only half of creativity. Who cares if GPT-4 can list more uses of a fork than you can, if none of those uses are any good? The Allen Institute's MacGyver benchmark (Tian et al. 2024) tests creative problem solving -- e.g., heating leftover pizza in a hotel room using only an iron, foil sheets, a hairdryer, and similar everyday items. Humans outperformed all seven LLMs tested (including GPT-4), though GPT-4 came close.
LLMs, N-gram models, and stochastic parrots
Kambhampati has provocatively called LLMs just "N-gram models on steroids". N-gram models -- the "quintessential stochastic parrot" (DeepMind's Timothy Nguyen) -- predict the next token by pattern-matching against the previous N-1 tokens. In a 2024 NeurIPS paper, Nguyen found that LLM next-token predictions agreed with simple N-gram rules 78% of the time (160M model on TinyStories) and 68% (1.4B model on Wikipedia). Are LLMs creative at all?
But Nguyen carefully states his finding: he found that 78% of the time, the LLM's next-token prediction could be described by the application of one or more N-gram rules, from a bank of just under 400 rules. This does not explain the LLM's prediction: it does not say how or why that particular rule was selected. In our interview with Nguyen, he noted how Transformers cannot be a static N-gram model if they are to adapt to novel contexts:
"Famously one of the weaknesses of N-gram models is what do you do when you feed it a context it hasn't seen before? [...] The reason I have all these templates is in order to do robust prediction; the Transformer has to do some kind of negotiation between these different templates, because you can't get any one static template, that will just break."
Is ChatGPT an N-gram model on steroids?
A human writer constrained to match N-gram predictions 80% of the time could still write creative stories -- being describable by simple rules does not make one a parrot. But that does not mean LLMs are creative for the same reason. What they are doing comes from compression. As Kambhampati notes, the number of possible N-grams grows exponentially in N, and once you get to the context size of even "the lowly GPT-3.5", let alone recent LLMs, the number of N-grams is essentially infinite, dwarfing the parameter count of any LLM.
"So because there's this huge compression going on, interestingly, any compression corresponds to some generalization because, you know, you compress so some number of rows for which there would be zeros before now there might be non-zeros."
Do you think that ChatGPT can reason?
This generalisation corresponds to combinational quasi-creativity: the LLM will perform this compression by interpolating the N-grams in its training data.
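To make the N-gram picture concrete, here is a minimal sketch of a count-based N-gram predictor. It is an illustration only: the toy corpus, the function names, and the vocabulary size used at the end are assumptions of ours, not anything taken from Nguyen's or Kambhampati's work. It shows the two properties discussed above: a static N-gram table has nothing to say about a context it has never seen, and the number of possible N-grams grows as V to the power N.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count which token follows each (n-1)-token context."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def predict(table, context):
    """Most frequent continuation, or None if the context was never seen."""
    counts = table.get(tuple(context))
    return counts.most_common(1)[0][0] if counts else None

# Toy corpus (hypothetical, chosen only for illustration).
corpus = "the cat sat on the mat and the cat sat on the rug".split()
trigram = train_ngram(corpus, n=3)

seen = predict(trigram, ["the", "cat"])    # this context occurs in the corpus
unseen = predict(trigram, ["the", "dog"])  # this context does not

# A vocabulary of size V admits V**N distinct N-grams.
# These numbers are illustrative and not taken from any particular model.
vocab_size, n = 50_000, 8
possible_ngrams = vocab_size ** n
```

The lookup table simply breaks on the unseen context, which is the weakness Nguyen describes; a model with far fewer parameters than `possible_ngrams` cannot store the table at all, so it has to compress.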
LLM "creativity" is highly derivative
This interpolation, however, does not give a deeper, genuine creativity. As Kambhampati says, LLMs are doing a shallow pattern-match over vast data. Every idea in that data has a phylogeny -- a structured lineage of prior discoveries it builds on. LLMs consume the products of these lineages but not the lineages themselves, and by neglecting this phylogeny they fail to exhibit genuine creativity -- they do not understand beyond a surface level. This is why LLMs have lost their shine: at first, their surprising combinations were impressive. But as they made more and more stuff, their blandness and shallowness became more and more evident, even as their technical quality improved.
Recall Rocktäschel's formal definition of open-endedness: a system is open-ended when its artefacts are both novel and learnable from the observer's perspective.
Figure 6: Open-endedness requires both novelty and learnability from the observer's perspective. A mouse finds aircraft designs novel but not learnable; a superintelligent alien finds them learnable but not novel; only for a human aerospace engineer are they both. Source: Hughes et al. 2024.
In Rocktäschel's terms, LLM outputs may be learnable but lack genuine novelty -- they produce new artefacts without producing surprising ones. As Stanley puts it:
"It can do some level of creativity, what I would call derivative creativity, which is sort of like the bedtime story version of creativity. It's like you ask for a bedtime story, you get a new one. It's actually new. No one's ever told that story before, but it's not particularly notable. It's not gonna win a literary prize. It's not inventing a new genre of literature. Like, there's basically nothing new really going on other than that there's a new story."
Kenneth Stanley: The Power of Open-Ended Search Representations
Do these combinations originate in the LLMs? One might suspect they merely "render" ideas already in their prompts, but random-prompt experiments refute this -- generative models produce coherent, surprising images from pure gibberish:
Prompt: }?@%#{.;}{/$!?;,_:-%$/+$*=}+={ into DALL-E 3
These outputs plainly depend on training data, not prompts. That said, the more you prompt engineer an LLM, the more the "renderer" analogy applies: the creations originate more in you.
The novelty of LLM outputs is in a sense accidental: the global minimisers of the training objectives of generative AI models perfectly memorise their training data (Bonnaire et al. 2025). These systems produce novel outputs only because they aim at that target and miss; they compress an entirely plagiarising model into something their parameters can express, and thus produce novelty by accident of training. If you tried to write out The Lord of the Rings from memory, and of course failed, you would technically have written a novel book, but trying to plagiarise and failing (and not always failing!) is a very shallow form of "creativity". Just like the SGD-trained Picbreeder networks, the selectional history of LLMs -- the history of what their training process rewarded -- favours the wrong abilities.
Using the Allen Institute's Creativity Index, we can even measure how derivative LLMs are. Introduced in a 2025 study, the Creativity Index quantifies the "linguistic creativity" of a piece of text by how easily one can reconstruct that text by mixing and matching snippets (i.e., N-grams) from some large corpus of text.
Figure 7: The Creativity Index measures how easily text can be reconstructed from N-gram snippets. Source: Lu et al. 2025
Comparing writings by professional writers and historical figures to those of LLMs (including ChatGPT, GPT-4, and LLaMA 2 Chat), the study found that human-created texts consistently had a significantly higher Creativity Index than LLM-generated texts, across various types of writing. Curiously, it also found that RLHF (reinforcement learning from human feedback) alignment significantly lowered the Creativity Index. This provides empirical evidence that the originality displayed by LLMs is ultimately combinational -- by actually finding what might have been combined!
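The idea of scoring text by how much of it can be stitched together from existing snippets can be sketched in a few lines. To be clear, this is a toy proxy of our own and not the algorithm from Lu et al. 2025: it uses exact matching of fixed-length N-grams against a tiny reference string, and the function names and example texts are hypothetical. The score is simply one minus the fraction of words that sit inside some N-gram also found in the corpus.

```python
def ngrams(tokens, n):
    """Set of all length-n token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(text, corpus, n=3):
    """Fraction of words in `text` lying inside an n-gram that also occurs in `corpus`."""
    words = text.lower().split()
    corpus_ngrams = ngrams(corpus.lower().split(), n)
    covered = [False] * len(words)
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(words) if words else 0.0

def toy_creativity_score(text, corpus, n=3):
    """Toy proxy: 1.0 means nothing is reconstructible, 0.0 means everything is."""
    return 1.0 - coverage(text, corpus, n)

# Hypothetical reference corpus and two candidate texts.
corpus = "it was the best of times it was the worst of times"
remix = "it was the worst of times"
fresh = "the heron folded itself into the dusk"

remix_score = toy_creativity_score(remix, corpus)
fresh_score = toy_creativity_score(fresh, corpus)
```

A remix of corpus snippets scores at the bottom of the scale and an unrelated sentence at the top. Note what such a measure can and cannot tell us: it detects surface recombination, but a high score says nothing about whether the new text is any good.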
What about Large Reasoning Models?
But what about creative reasoning? Pure LLMs like GPT-4 struggled at reasoning. On Chollet's Abstraction and Reasoning Corpus (ARC-AGI) benchmark, GPT-4.5 managed just 10.3% on ARC-AGI-1 and 0.8% on ARC-AGI-2! It was pretty easy to come up with mathematics questions that would stump these LLMs. And Kambhampati demonstrated that GPT-4's performance on a planning benchmark could be utterly ruined by "obfuscating" the tasks in ways that preserved their underlying logic. Had GPT-4 been using a reasoning process, it would have been robust to this obfuscation; its failure demonstrated that it was not solving any of the tasks by reasoning.
But on December 20, 2024, OpenAI's o3 model landed with a bang, with OpenAI announcing a score of 87.5% on ARC-AGI-1. o3 was still an LLM at its core, but one fine-tuned via reinforcement learning to "think" at inference time, producing an internal "chain-of-thought" which it used to produce its answer. The coming weeks saw the release of OpenAI's o3-mini, DeepSeek's R1, and Google's Gemini Flash Thinking, and the age of the large reasoning model (LRM) had begun. Did these change the game? Can LRMs reason creatively?
Their progress in mathematics has certainly been dramatic, with both Google DeepMind and OpenAI announcing gold in the 2025 International Mathematical Olympiad (IMO). OpenAI researcher and mathematician Sébastien Bubeck claimed in an August 2025 tweet that GPT-5-pro could prove "new interesting mathematics" by improving a theorem in a provided convex optimisation paper. And on ARC, LRMs crowd the leaderboard, with Opus 4.6, GPT 5.2, and Gemini 3 all over 50% on ARC-AGI-2.
Figure 8: Sébastien Bubeck's claim that GPT-5-pro can prove new mathematics. Source: Bubeck 2025
However, these performances may be misleading. Greg Burnham at Epoch AI argues that the 2025 IMO was unfortunately lopsided, with the five questions that the LRMs could solve being comparatively easy (as judged by the USA IMO coach), and the one they couldn't solve being brutally hard.
Figure 9: 2025 International Mathematical Olympiad results comparing LRM performance across questions of varying difficulty. Source: Burnham 2025
For our topic, the only question Burnham judges as requiring "creativity and abstraction" was the one the LRMs couldn't do! The others, though far from simple, could be solved formulaically. Bubeck's example follows a similar pattern: although the improvement would indeed have been novel (had a version 2 of the paper with an even better improvement not already been uploaded), GPT-5's proof is a very standard application of convex analysis tricks; tricks it had already seen in the original paper. GPT-5 uses these tricks well, but not especially creatively. To co-author (and mathematician) Jeremy's eye, the v2 paper proves a better result and has a more creative proof. Perhaps these LRMs are simply teaching mathematicians the lesson Go world champion Lee Sedol learned from AlphaGo:
"What surprised me the most was that AlphaGo showed us that moves humans may have thought are creative, were actually conventional."
-- Lee Sedol, AlphaGo - The Movie
Except, unlike AlphaGo, so far in mathematics LRMs have "told us nothing profound we didn't know already", to quote mathematician Kevin Buzzard.
On ARC, an October 2025 paper by Beger, Mitchell, and colleagues (Beger et al. 2025) explored whether LRMs grasp the abstractions behind ARC puzzles. Using the ConceptARC benchmark, whose ARC-like puzzles follow very simple abstract rules, the authors tasked o3, o4-mini, Gemini 2.5 Pro, and Claude Sonnet 4 with solving the puzzles and explaining (in words) the rules which solve them. They found that although the LRMs scored as high as 77.7% on the tasks, beating the human accuracy of 73%, many more of the LRMs' correct answers than of the humans' relied on rules which did not correspond to the correct abstraction. This suggests that the LRMs were still reliant on superficial patterns, and did not fully understand the puzzles. However, it is possible this analysis could change with SOTA models like Opus 4.6, GPT 5.2, and Gemini 3.
When it comes to creativity, LRMs have the same core issues as LLMs. An LRM is an LLM which has been fine-tuned to produce -- instead of simply the most probable next token -- a "chain-of-thought" which resembles those it saw in training data. Done well, this enables the LRM to indeed produce, for example, very clean mathematical proofs, when those use standard techniques or patterns. But when presented with a novel problem, this generated chain-of-thought must not be mistaken for the model understanding that problem, and deliberately taking steps to solve it. Kambhampati warns against anthropomorphising (Kambhampati, Valmeekam, Gundawar, et al. 2025) these so-called "reasoning tokens", arguing that these mimic only the syntax of reasoning, and lack semantics. The chains of thought parrot the way humans write about thinking, but may not reflect the actual way the LRMs produce their answers. Even fine-tuning an LRM on incorrect or truncated reasoning traces has been found to improve performance vs. the base LLM (Li et al. 2025), suggesting that performance gains do not derive from the LRM learning to reason, but merely from learning to pantomime reasoning. LRMs technically synthesise new programs on-the-fly, but very inefficiently and shallowly.
LLM-Modulo: LLMs as an engine for creative reasoning
So, is that it? Are LLMs and LRMs a nothingburger when it comes to intelligent, creative reasoning? Well, let us not be too hasty. As we have argued, these systems fail because they lack deep understanding, lack semantics, lack grounding in the phylogeny. But what if you hooked an LLM up to something which did?
This is the key idea of Kambhampati's LLM-Modulo framework. In LLM-Modulo, an LLM (or LRM) is an engine which generates plans to solve some task, but these plans are then fed into external critics which evaluate their quality. These critiques then backprompt the LLM to produce better plans, until the critics are satisfied. This generate-and-test pattern echoes psychologist and philosopher Donald Campbell's "blind variation and selective retention" theory (Campbell 1960): knowledge and creative thought, biological or otherwise, require generating candidates without foresight and then selecting those with quality.
Figure 10: LLM-Modulo: LLMs generate plans, external critics evaluate them, feedback improves outputs. Source: Kambhampati, Valmeekam, Guan, et al. 2024
These critics can ground the system. Even if to an LLM the plans are just syntax, the critics, which potentially have rich representations of the task, can thereby imbue the LLM outputs with semantics. Do critics make the LLM more or less creative? The answer is nuanced: they bind the LLM to their specific domain, but this unlocks creativity within that domain. As we will later explore, constraints, not freedom, are the soul of creativity.
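The generate-and-test loop can be sketched as follows. This is a minimal illustration under loud assumptions: `propose` is a stand-in stub for an LLM call (it just walks a fixed list of candidate programs rather than reading the feedback), the critic simply runs each candidate against input-output examples, and every name here is hypothetical rather than part of Kambhampati's framework or any real API.

```python
def propose(feedback):
    """Stand-in for an LLM call (assumption: a real system would condition
    its next candidate on `feedback`; this stub walks a fixed list)."""
    candidates = ["lambda x: x + 1", "lambda x: x * 2", "lambda x: x * x"]
    return candidates[len(feedback) % len(candidates)]

def example_critic(program_src, examples):
    """External critic: execute the candidate and report the first failure.
    Returns None when every example passes. (eval is used only because the
    candidates are our own hard-coded strings.)"""
    fn = eval(program_src)
    for x, want in examples:
        got = fn(x)
        if got != want:
            return f"{program_src}: f({x}) = {got}, expected {want}"
    return None

def llm_modulo(examples, max_rounds=10):
    """Generate a candidate, test it, and back-prompt with the critique
    until the critic is satisfied or the round budget runs out."""
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        critique = example_critic(candidate, examples)
        if critique is None:
            return candidate, feedback
        feedback.append(critique)
    return None, feedback

# Hypothetical task: find a program consistent with these examples.
examples = [(2, 4), (3, 9), (5, 25)]
solution, history = llm_modulo(examples)
```

The generator never needs to know what the task means; the semantics live entirely in the critic. That is also the limitation: the loop can only be as grounded as the critics it is given.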
On ARC, this pattern has proved decisive. Ryan Greenblatt achieved 50% on ARC-AGI-1 by having GPT-4o generate Python programs and checking them against training examples -- the Python interpreter as critic. Jeremy Berman took SOTA on ARC-AGI-2 with a variant using English instructions and LLM-based checking.
29.4% ARC-AGI-2 (TOP SCORE!) - Jeremy Berman
Most recently, Johan Land reached 72.9% on ARC-AGI-2 by ensembling multiple LLMs with both Python and LLM-based critics. LLM-Modulo consistently gets LLMs to solve ARC puzzles far more accurately and efficiently than LLMs alone.
Beyond ARC, Google DeepMind's AlphaEvolve (building on FunSearch (Romera-Paredes et al. 2024)) applies the same pattern: an ensemble of LLMs iteratively generates and improves programs, evaluated by external critics, with an evolutionary algorithm selecting the best candidates.
Figure 11: Summary of AlphaEvolve, generated using Nano Banana Pro based on the description from Novikov et al. 2025
AlphaEvolve's crown jewel: a novel method for multiplying 4x4 complex-valued matrices using 48 scalar multiplications, beating the 49 required by Strassen's algorithm, a record that had stood since 1969.
Wild breakthrough on Math after 56 years... [Exclusive]
So if, as Buzzard said, LRMs have "told us nothing [mathematically] profound we didn't know already", LLM-Modulo systems like AlphaEvolve definitely have. LLM-Modulo allows these systems to be much more grounded in the phylogeny of their task, and evolutionary refinement means that these systems extend that phylogeny further. It is no coincidence that it is these systems which have produced more creative results than scaling LLMs and LRMs.
Nevertheless, these systems still rely on substantial engineering, and have so far only achieved success for narrow, well-defined tasks. To think about what that means for their creativity, let us leave LLMs behind us, and look at AlphaEvolve's older siblings...
Are AlphaGo and AlphaZero creative?
In March 2016, DeepMind made headlines when its AlphaGo model defeated Lee Sedol, one of the strongest players in the history of Go. Go had long been a major challenge for AI systems due to its vast depth, and until AlphaGo no AI system had ever beaten a professional player. But AlphaGo was remarkable not only in its strength, but also in the originality of some of its moves. Particularly, AlphaGo's move 37 in Game 2 amazed commentators, with Lee Sedol commenting:
"I thought AlphaGo was based on probability calculation and that it was merely a machine. But when I saw this move, I changed my mind. Surely AlphaGo is creative. This move was really creative and beautiful."
-- Lee Sedol, AlphaGo - The Movie
AlphaGo used data from human Go games to guide its play. But its even stronger successor AlphaGo Zero used no human data at all, learning only from the rules of Go. In December 2017, DeepMind went a step further and announced AlphaZero, a more general algorithm which could learn to play many games (e.g., Go, chess, and shogi) again just from self-play, with no human data. How was this done?
Monte Carlo Tree Search
At the heart of AlphaGo, AlphaGo Zero, and AlphaZero is Monte Carlo tree search (MCTS): from any position, the possible futures form a vast branching tree, and MCTS seeks the best path by sampling many branches in a guided way. AlphaZero's MCTS was guided by a neural network that provided "intuitive" estimates of move quality and win probability. The key training loop iteratively amplified this intuition via MCTS reasoning, then distilled the conclusions back into an enhanced intuition. Through self-play, AlphaZero climbed from random play to superhuman performance. MCTS reasoning is vital: switch it off, and the raw model plays far worse.
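For readers who have not met MCTS before, here is a minimal sketch on a toy game. It is emphatically not AlphaZero's search: we swap the neural network for random playouts and use the plain UCB1 selection rule (often called UCT), and the game is a tiny version of Nim (take 1 to 3 stones; whoever takes the last stone wins). The class and function names, the exploration constant, and the iteration budget are all our own choices for illustration.

```python
import math
import random

class Node:
    def __init__(self, stones, parent=None, move=None):
        self.stones = stones      # stones left for the player about to move
        self.parent = parent
        self.move = move          # move that led from the parent to this node
        self.children = []
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits = 0
        self.wins = 0.0           # wins for the player who moved INTO this node

def ucb(child, parent_visits, c=1.4):
    """UCB1: average result plus an exploration bonus for rarely tried moves."""
    return child.wins / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def rollout(stones, rng):
    """Random playout; True if the player to move at `stones` ends up winning."""
    mover_is_start_player = True
    while stones > 0:
        stones -= rng.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return mover_is_start_player
        mover_is_start_player = not mover_is_start_player
    return False  # no stones left: the player to move has already lost

def mcts(root_stones, iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(root_stones)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB1.
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
        # 2. Expansion: add one previously untried move.
        if node.untried:
            move = node.untried.pop()
            child = Node(node.stones - move, parent=node, move=move)
            node.children.append(child)
            node = child
        # 3. Simulation: estimate the new position by random play.
        mover_won = not rollout(node.stones, rng)
        # 4. Backpropagation: flip the result at each level up the tree.
        while node is not None:
            node.visits += 1
            node.wins += 1.0 if mover_won else 0.0
            mover_won = not mover_won
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move

best_move = mcts(5)
```

In this toy game the known winning strategy is to leave your opponent a multiple of four stones, so from five stones the search should settle on taking one. AlphaZero's variant replaces step 3 with a learned value estimate and biases step 1 with learned move priors, which is what lets the same loop scale to Go.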
The creativity of AlphaGo and AlphaZero
Are AlphaGo or AlphaZero really creative, or is this an illusion? According to Rocktäschel's framework, AlphaGo is indeed open-ended:
"After sufficient training, AlphaGo produces policies which are novel to human expert players [...] Furthermore, humans can improve their win rate against AlphaGo by learning from AlphaGo's behavior (Shin et al., 2023). Yet, AlphaGo keeps discovering new policies that can beat even a human who has learned from previous AlphaGo artifacts. Thus, so far as a human is concerned, AlphaGo is both novel and learnable."
The same is true of AlphaZero -- in chess, AlphaZero pioneered new strategies, famously loving to push pawns on the side of the board. AlphaGo Zero and AlphaZero cannot be recombining existing ideas -- they aren't given any! Unlike LLMs, which generalise somewhat by accident as a consequence of compressing their vast training data, AlphaZero plays positions it has never seen before by deliberately reasoning about them, via MCTS, and this ability was actively selected for by its training. But is this strong reasoning or weak reasoning?
There are key limits to AlphaGo/AlphaZero's reasoning. As philosopher Marta Halina argues (Halina 2021), the limit of AlphaGo's world is the standard game of Go; it is unable to play even mild variants of Go without retraining. Even AlphaZero, which can learn any two-player perfect-information game from its rules, can't be trained on one game and then transfer that knowledge to other games. Therefore, Halina argues that:
"Computer programmes like AlphaGo are not creative in the sense of having the capacity to solve novel problems through a domain-general understanding of the world. They cannot learn about the properties and affordances of objects in one domain and proceed to abstract away from the contingencies and idiosyncrasies of that domain in order to solve problems in a new context."
Rocktäschel concurs, calling AlphaGo a "narrow superhuman intelligence". Why can't it abstract away from Go's contingencies? The answer lies in how it learns. Self-play with a fixed objective -- win the game -- is still greedy optimisation. Gradient descent tends to take the direct path to the goal, without pausing to discover foundational regularities first. As the fractured entangled representations paper argues -- the same phenomenon we saw in Picbreeder's SGD-trained networks -- this creates representations like spaghetti code: redundant, entangled, with the same logic copy-pasted rather than factored into reusable modules. AlphaGo's implicit grasp of "territorial influence" isn't a separable concept it could apply elsewhere -- it's smeared across millions of weights, entangled with everything else it knows about Go. This is what we call concrete constraint adherence: the constraints are instantiated in AlphaGo's substrate and shape its play, but they are not represented in a format it can manipulate, transfer, or reason about -- they are the physics of its world, externally imposed via MCTS. It operates within constraints but cannot model them.
In a paper first posted in 2022, Tony Wang, Adam Gleave, and colleagues demonstrated an even more dramatic limit (Wang et al. 2023): KataGo (an even stronger Go AI than AlphaGo, developed in 2019) could be beaten a whopping 97% of the time, by using AlphaZero-style training to find adversarial strategies which exploited how KataGo approached the game:
"Critically, our adversaries do not win by playing Go well. Instead, they trick KataGo into making serious blunders that cause it to lose the game."
The KataGo team were able to mitigate this via adversarial training -- that is, having KataGo simulate adversarial strategies during training and learn to respond to them -- but only partially. Gleave's strategies still worked 17.5% of the time even against adversarially trained KataGo; very impressive for playing Go badly!
These adversarial strategies were not arcane computer nonsense: a human expert could learn to use them to consistently beat superhuman Go AIs (and not just KataGo). Therefore, by Rocktäschel's criteria, whilst Go AIs are "open-ended" relative to an unassisted human observer, relative to a human observer assisted by adversarial AI, they lack novelty in exploitable and learnable ways, and adversarial training only partially fixes this.
Does AlphaZero have phylogenetic understanding?
AlphaZero may disregard the human phylogeny of Go, chess, and so on, but via its self-play training loop, it creates and distils its own phylogeny: every move that it makes has a history in those millions of self-play games. Does this give it genuine understanding of the moves it makes, or merely an implicit grasp that falls short of understanding proper?
A 2022 DeepMind study investigated whether AlphaZero had learned to represent human chess concepts when learning to play chess. They defined a "concept" to be a function which assigns values to chess positions (e.g., the concept of "material" adds up the value of White's pieces and subtracts the value of Black's pieces). This notion was convenient, because such functions encoding many key chess concepts have been engineered to build traditional chess programs. Using a chess database, they then trained sparse linear probes to map the activations in AlphaZero's neural network head to the functions expressing these concepts. They found that initially these probes all had very low test accuracy, but over the course of AlphaZero's training they became much more accurate for many concepts, suggesting that AlphaZero was indeed acquiring representations of those concepts. For example, after hundreds of thousands of iterations AlphaZero eventually converged on the commonly accepted values for the chess pieces.
Figure 12: AlphaZero learning chess concepts over training iterations, including piece values. Source: McGrath, Kapishnikov, et al. 2022
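To make the notion of a concept-as-function concrete, here is a minimal sketch of the "material" concept described above. It is not the study's code: the FEN input format, the function name `material`, and the conventional piece values (1, 3, 3, 5, 9) are assumptions chosen for illustration.

```python
# Conventional piece values (an assumption for this sketch; kings are not counted).
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9}


def material(fen: str) -> int:
    """Return White's material minus Black's for the position in a FEN string."""
    board = fen.split()[0]  # the first FEN field is the piece placement
    score = 0
    for ch in board:
        value = PIECE_VALUES.get(ch.lower())
        if value is None:
            continue  # digits (empty squares), '/' separators, and kings
        # Upper-case letters are White's pieces, lower-case are Black's.
        score += value if ch.isupper() else -value
    return score


START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
NO_BLACK_QUEEN = "rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
```

A probe in the sense of the study would then be a model trained to predict a value like `material(position)` from the network's activations on that position; the sketch covers only the target function, not the probe.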
However, there are two key caveats to this result. First, this evidence is just from sparse linear probes, which are limited tools for interpretability. Second, defining chess "concepts" as functions conflates the positions those concepts refer to with what those concepts mean. Suppose that, in the chess database used, no position in which a player was in check also contained a 2x2 square full of queens (a very rare pattern). Then both "being in check" and "being in check with no 2x2 square of queens on the board" would correspond to the same function on that database, but they obviously do not mean the same thing.
As Fodor and Pylyshyn (1988) argued in their classic critique of connectionism -- though we do not require the full compositionality they demanded (see the Postscript) -- understanding these meanings requires grasping something of their systematic and compositional nature: if a system truly understands a concept, it should be able to recombine that concept with others in structured ways. Understanding "being in check" should be intrinsically tied up with understanding "being in check by a pawn", "blocking a check", "pinning a piece to the King" etc. The research does not explore such networks of interrelated understandings, and as such cannot demonstrate a deep abstract understanding of these concepts.
But is abstract understanding needed to understand chess (or Go)? Chess theory has increasingly favoured concrete explanations (see International Master John Watson's Secrets of Modern Chess Strategy). Grandmaster Matthew Sadler's 2025 article "Understanding and Knowing" describes three levels of understanding of a position, each more concrete than the last. Concrete understanding means grasping the key variations -- winning lines, refutations of alternatives, contrasts with similar positions -- without exhaustive enumeration.
On these concrete terms, AlphaZero represents some progress, as it looks at far fewer positions than older chess systems in its search. However, it still looks at thousands of times more positions than a human grandmaster, so it will still explore many irrelevant lines. More deeply, it lacks counterfactual understanding -- the grasp of not just what works, but why alternatives fail and how the analysis changes under different conditions. In the above Sadler article, a crucial piece of the highest level of understanding was seeing how (and why) the winning line in the position wouldn't work in a superficially similar position. AlphaZero's MCTS will never explore these sorts of counterfactual positions. The adversarial examples show that even in purely concrete terms, these systems can utterly fail to understand strange positions.
In summary, AlphaGo sits at the concrete level of our hierarchy of constraint adherence -- between evolution's physical adherence and modelled understanding, where constraints can be manipulated and transferred across contexts. Its representations are entangled rather than factored, so it cannot manipulate or transfer its implicit grasp of Go's logic. Move 37 was a genuine creative discovery, but the concepts underlying it cannot be extracted and reapplied. In Boden's terms, this is exploratory creativity within a fixed conceptual space, not the transformational creativity that would require learned, factored representations preserving their stepping-stone structure.
Stanley's false-compass problem bites AlphaGo at both levels: the fixed win objective blinds it to stepping stones, and gradient descent on dense networks precludes the modular structure that transfer demands. More intelligence does not help when the objective itself is the problem -- and so, despite their immense strength, these systems remain blind to deeper domain-general features, and can be bamboozled by spurious patterns even within their own domain.
But if AlphaZero still crushes humans, does domain-general understanding matter? It depends on what you want. AlphaZero is stronger than any human at chess -- but would fail at "chess with one rule change" without retraining from scratch. Carlsen, though weaker, could adapt instantly, and would be very hard to bamboozle by playing badly! For robust generalisation to unknown unknowns, the deeper understanding matters; for raw performance on a fixed task, it may not. DeepMind's work does suggest that AlphaZero learned to represent key chess concepts -- but as we saw, this falls short of the modelled understanding that genuine creativity would require.
Putting the humans back in the loop
None of today's AI systems -- LLMs, LRMs, or AlphaZero -- can, operating alone, handle the "unknown unknowns" that characterise human creativity.
But have we been asking the wrong question this whole time? So far, we have been focusing on whether AI systems, by themselves, can reason creatively. This framing echoes the dream (or nightmare, depending on who you ask) of fully autonomous AI systems, a dream infamously expressed by Nobel laureate Geoffrey Hinton in 2016:
"I think if you work as a radiologist you're like the coyote that's already over the edge of the cliff but hasn't yet looked down so doesn't realize there's no ground underneath him. People should stop training radiologists now. It's just completely obvious that within 5 years deep learning is going to do better than radiologists because it's going to be able to get a lot more experience. It might be 10 years but we've got plenty of radiologists already."
Geoff Hinton: On Radiology
History has not been kind to this prediction. But setting aside the inaccuracy of the timeline, notice how Hinton pictures deep learning as replacing radiologists, rendering them obsolete.
What if instead the future looks like radiologists and AI systems working together, to perform better than either could alone, or do radiology in more diverse settings? Then there might be a need for more radiologists than ever. Spreadsheets, after all, did not lead to fewer accountants.
CT and MRI scanners are expensive and immobile; sub-Saharan Africa has less than one MRI scanner per million people. AI-enhanced alternatives like photoacoustic imaging are cheaper and more portable -- but still need radiologists to interpret them. If these techniques expand medical imaging across the developing world, global demand for radiologists could increase, not disappear.
In terms of creative reasoning, we should therefore be thinking not only about AI creativity, but also human-AI co-creativity. Consider coding and science; these are inherently interactive endeavours: any AI coder or scientist will inevitably interface with humans throughout. Who commissioned the software? Who are its users? Who will perform its experiments? To quote the AlphaEvolve authors from our MLST interview:
"I think the thing that makes AlphaEvolve so cool and powerful is kind of this back and forth between humans and machines, right? And like, the humans ask questions. The system gives you some form of an answer. And then you, like, improve your intuition. You improve your question-asking ability, right? And you ask more questions. [...] We're exploring [the next level of human-AI interaction] a lot. And I think it's very exciting to see, like, what can be done in this kind of symbiosis space."
Wild breakthrough on Math after 56 years... [Exclusive]
DeepMind researchers Mathewson and Pilarski show how humans are embedded throughout the machine learning lifecycle, from data collection to deployment. The Neuroevolution textbook echoes this too: "humans and machines can work synergistically to construct intelligent agents", ultimately enabling "interactive neuroevolution where human knowledge and machine exploration work synergistically in both directions to solve problems". We have so far been focusing on the "I" of AI, but the "A" often hides the extensive reliance of these systems on humans.
Figure 13: All machine learning is interactive: humans are embedded throughout the AI development lifecycle. Source: Mathewson and Pilarski 2022
Consider the history. AlexNet's 2012 breakthrough depended on ImageNet, whose 14 million labels required years of Mechanical Turk labour. ChatGPT's self-supervised training consumed the internet (created by humans), and making it presentable required reinforcement learning with human feedback -- relying on significant Kenyan labour.
Will AI always rely on human labour? Could future AI systems not be trained on AI-generated data and supervised by AIs, without any humans in the loop? Anthropic have, after all, been pioneering reinforcement learning with AI feedback, and the big tech companies have reportedly turned to synthetic data because they are running out of internet to train on. However, a 2024 front-page Nature paper (Shumailov et al. 2024) warned that indiscriminately training AIs on AI-generated data leads to "model collapse" -- an irreversible disappearance of the tails (i.e., low-probability outputs) of the AI's distribution. This would especially kill creativity, since losing the tail means losing unexpected and novel outputs. Human-AI collaborations can exploit complementary strengths: we often find generation harder than evaluation, whilst AI systems often demonstrate the reverse. Thus, by delegating tasks, such as in LLM-Modulo, one can get the best of both worlds. As Stanley argues, the human ability to recognise interestingness is irreplaceable:
"We have a nose for the interesting. That's how we got this far. That's how civilization came out. That's why the history of innovation is so amazing for the last few thousand years."
Prof. KENNETH STANLEY - Why Greatness Cannot Be Planned
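The loss of tails described in the model-collapse discussion above can be illustrated with a toy simulation. This is a minimal sketch under strong simplifying assumptions, not the method of Shumailov et al.: the "model" is just a categorical distribution, and "training" means re-estimating it from a finite sample of its own outputs. All names (`resample_generation`, `simulate_collapse`) and parameter values are hypothetical.

```python
import random
from collections import Counter


def resample_generation(probs, n_samples, rng):
    """Sample n_samples outputs from probs, then re-estimate probs from the samples."""
    outputs = list(probs)
    weights = [probs[o] for o in outputs]
    draws = rng.choices(outputs, weights=weights, k=n_samples)
    counts = Counter(draws)
    # An output that was never sampled gets probability 0 and can never return.
    return {o: counts[o] / n_samples for o in outputs}


def simulate_collapse(probs, generations=30, n_samples=50, seed=0):
    """Return the number of outputs with non-zero probability at each generation."""
    rng = random.Random(seed)
    support_sizes = [sum(p > 0 for p in probs.values())]
    for _ in range(generations):
        probs = resample_generation(probs, n_samples, rng)
        support_sizes.append(sum(p > 0 for p in probs.values()))
    return support_sizes


# One common output plus twenty rare "tail" outputs.
initial = {"common": 0.8, **{f"rare_{i}": 0.01 for i in range(20)}}
sizes = simulate_collapse(initial)
```

Whatever the seed, the support size can only stay the same or shrink from one generation to the next, which is the irreversibility the paper warns about: once a rare output drops to zero probability, no amount of further self-training brings it back.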
What does human-AI co-creativity look like?
In 1997, Deep Blue beat chess world champion Garry Kasparov, and by 2006 computers had decisively overtaken human chess players: Hydra crushed Michael Adams 5½-½ in 2005, and Deep Fritz beat world champion Vladimir Kramnik 4-2 in 2006. (AlphaZero would later join the party with a bang in 2017.) As we saw, Go went the same way in 2016. Human-AI collaboration is now an integral part of high-level play in both games, with top players extensively preparing with computers. One might worry that this would atrophy these players' creative minds, but quite the opposite seems true. After the advent of AlphaGo, human Go players began to play both more accurately and more creatively. This really kicked in when open-source superhuman Go AIs arrived, as people could then learn not only from their actions, but also from their reasoning processes.
A similar story is true of chess: not only do players play much more accurately now than in the past, but computer analysis helped overturn dogmatic ideas of how chess could be played, and breathed new life into long abandoned strategies. AlphaZero has been used to explore new variant rules for chess, dramatically faster than humans could alone. Most recently, in a 2025 paper DeepMind showed how chess patterns uniquely recognised by AlphaZero could be extracted and taught to human grandmasters, demonstrating that these systems can continue to enhance the human understanding of chess.
AlphaZero in Chess | Reflections on Creative Play
Beyond board games, Stanley's Picbreeder (Section 2) remains the clearest case study: human selection plus machine variation produced vastly superior representations to anything SGD could reach alone.
Figure 14: Picbreeder networks learn semantically meaningful representations through open-ended evolution. Source: Stanley 2014
In experimental science, it may be more important than ever to keep humans in the loop. At the 2026 World Laureates Summit, Nobel laureate Omar Yaghi described coupling ChatGPT with a robotic platform to crystallise materials that had defied the chemistry community for a decade.6 The human contributes thirty-five years of domain knowledge -- reticular chemistry, the experimental scaffold, the judgement of what counts as "good crystallinity". ChatGPT explores the parameter space within those constraints. Three experimental cycles yielded crystals three times more crystalline than a decade of unaided effort. AlphaFold follows the same logic: it predicts protein structures in minutes rather than years, but as AlphaFold's lead developer John Jumper put it in our interview, "these machines let us predict. They let us control. We have to derive our own understanding at this moment."
Both illustrate what prediction alone cannot reach. At the same Summit, optimisation theorist Yurii Nesterov articulated the limit: AI conclusions "can be related only to a model of the bird [i.e. the object being studied] which exist in the corresponding virtual reality. If the model is done correctly, then this conclusion can be used in real life. If not, it could be a complete nonsense." And Turing Award laureate Robert Tarjan identified what no model can supply: "asking the right question is more important than finding the answer. To be a really great researcher, you have to develop a certain kind of taste." Taste -- the nose for the interesting -- is what the human brings to the collaboration.
Human-AI collaborations may also soon be fruitful in academia. Or so argued Fields medallist Terence Tao in a 2024 interview for Scientific American. Inspired by the success of automated proof assistants like Lean, Tao imagines mathematicians and AIs soon working together to produce proofs:
"I think in three years AI will become useful for mathematicians. It will be a great co-pilot. You're trying to prove a theorem, and there's one step that you think is true, but you can't quite see how it's true. And you can say, 'AI, can you do this stuff for me?' And it may say, 'I think I can prove this.'"
Tao sees this eventually transforming mathematical practice itself -- from "individual craftsmen" to a pipeline "proving hundreds of theorems or thousands of theorems at a time", with human mathematicians directing at a higher level and formalisation making explicit the vast tacit knowledge "trapped in the head of individual mathematicians".
The Structure of Creativity
The Semantic Graph
LLMs, LRMs, AlphaZero -- all of these display what we might call statistical creativity: they search through the space of possibilities, in training and at inference, and stumble upon interesting regions. But the heart of creativity is semantic -- grounded not in statistical search but in understanding the structure of the domain, the phylogeny. As Tim put it in conversation with neuroevolution researcher Risto Miikkulainen (co-author of the Neuroevolution textbook we cited in the introduction):
"We are describing a kind of statistical creativity where we want to make it more likely that we will find these tenuous, interesting regions. But could there be a kind of almost pure form of creativity where we know the semantic graph?"
A powerful intuition pump for these "semantic graphs" is this beautiful visualisation by the YouTuber 2swap:
Figure 15: The state space of Klotski (a classic sliding block puzzle where you manoeuvre pieces to free a larger block), visualised as a graph. Each node is a board configuration; edges connect positions one move apart. Source: 2swap I Solved Klotski
A semantic graph -- not a knowledge graph in the NLP sense, but the full space of possibilities in a domain, with its own topology -- is like this Klotski graph writ large. Particularly important are the intricate substructures -- local regions with their own logic -- connected by narrow paths. Semantic creativity is about traversing this semantic graph, discovering the logic of your local bubble, and finding those tenuous connections that lead to new substructures -- new conceptual spaces. Stanley's insight is spot on: in the semantic graph, discovering a new substructure literally "adds new dimensions to the universe", opening up a new logic to explore. In real creative domains, of course, the graph is shrouded in a "fog of war"; we discover new dimensions and subspaces as we go, rather than navigating a known topology. As Miikkulainen put it when shown this visualisation: creativity involves "pushing into another area of kind of solutions that you've never seen before" by finding those rare transitions between substructures.
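The construction behind such a state-space graph can be sketched in a few lines. The example below does not implement Klotski itself; as an assumption made for brevity it uses a far smaller stand-in, a 2x2 sliding puzzle with three tiles and one blank (written `0`). The function names are hypothetical.

```python
from collections import deque


def neighbours(state, width=2):
    """Yield every configuration one move away (slide a tile into the blank, 0)."""
    blank = state.index(0)
    row, col = divmod(blank, width)
    height = len(state) // width
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            target = r * width + c
            nxt = list(state)
            nxt[blank], nxt[target] = nxt[target], nxt[blank]
            yield tuple(nxt)


def build_state_graph(start, width=2):
    """Breadth-first search: map each reachable configuration to its neighbours."""
    graph = {}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in graph:
            continue  # already expanded
        graph[state] = list(neighbours(state, width))
        queue.extend(n for n in graph[state] if n not in graph)
    return graph


graph = build_state_graph((1, 2, 3, 0))
```

Each key of `graph` is a board configuration (a node) and each listed neighbour is a position one move apart (an edge), exactly as in the figure's description. Exhaustive enumeration of this kind is only possible because the toy domain is tiny and fully known; it is the opposite of the fog-of-war situation in real creative domains.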
How can we measure this semantic creativity? Perhaps the answer lies in the size of the subspace discovered. The stepping stone that leads to a vast new region of possibilities is more creative than one that leads to a small cul-de-sac. Looking at the Klotski graph, we can immediately see which clusters are large and which connections are most valuable. But in the real world, this is all covered by the fog of war: it takes time to realise where our stepping stones will lead, or just how big a new subspace actually is. Only in retrospect, once the phylogenetic tree has been expanded by subsequent discoveries, can we recognise how extraordinarily creative (or not!) a stepping stone was.
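On a fully known graph, the "size of the subspace discovered" can be given one simple, hedged reading: count the nodes that are reachable only by crossing a given connection. The sketch below assumes that reading; the toy graph, node names, and function names are all hypothetical.

```python
def reachable(graph, start, blocked_edge=None):
    """Return the set of nodes reachable from start, optionally ignoring one edge."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if blocked_edge in ((node, nxt), (nxt, node)):
                continue  # pretend this connection has not been discovered
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def region_opened(graph, start, edge):
    """Count the nodes that are reachable from start only by crossing edge."""
    return len(reachable(graph, start)) - len(reachable(graph, start, blocked_edge=edge))


# Hypothetical toy graph: a home cluster {a, b, c}, a small cul-de-sac {d},
# and a larger region {e, f, g, h} behind the narrow connection c-e.
toy = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
    "d": ["c"],
    "e": ["c", "f", "g"], "f": ["e", "g", "h"], "g": ["e", "f", "h"], "h": ["f", "g"],
}
```

Here `region_opened(toy, "a", ("c", "e"))` is larger than `region_opened(toy, "a", ("c", "d"))`, matching the intuition that the stepping stone into the large region counts for more than the one into the cul-de-sac. The measure needs the whole graph, so it can only ever be applied in retrospect.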
This is why creative solutions often seem obvious in hindsight -- what we might call the "McCorduck effect" for creativity (named after AI historian Pamela McCorduck, who documented the pattern in Machines Who Think). The narrow path becomes a well-worn road. But perhaps the obviousness is real: genuine creativity follows the constraints of the domain, and the solution was always there in the semantic graph, waiting to be discovered by someone who understood the graph deeply enough to find the tenuous connection.
This applies to what Boden calls exploratory creativity -- navigating within an existing conceptual space. Exploratory ideas, she notes, "may come to seem glaringly obvious ('Ah, what a foolish bird I have been!')". But transformational creativity is different: it outgrows the space itself, generating what Boden calls "impossibilist surprise": "the shock of the new may be so great that even fellow artists find it difficult to see value in the novel idea." Quantum mechanics still feels strange, not because we haven't understood it, but because classical intuitions cannot be patched to include it. The prior conceptual space was not extended but outgrown -- and the path that outgrew it was itself a phylogeny, each stepping stone respecting the constraints of what came before.
Can we tell in advance how creative a stepping stone will be? Kumar suggests the answer lies in evolvability -- the capacity to enable future discoveries:
"There's an implicit selection pressure for evolvable things. If there's two versions of the skull -- one is spaghetti and one is modular and composable -- after a few generations of evolution, the one that's more evolvable will win out. Just like in natural evolution, the evolution of evolvability. And this evolvability combined with serendipity is what gives you these nice representations."
AI is SO Smart, Why Are Its Internals "Spaghetti"?
Picbreeder illustrated this concretely: the modular skull representation was more evolvable than the SGD-trained spaghetti because regularities like symmetry had been locked in as building blocks for future variation.
Evolvability provides a future benefit, yet as the Neuroevolution textbook notes, "it needs to be developed implicitly based on only current and past information". How would you even measure it? The textbook proposes a direct test: mutate a representation many times and count how many distinct, viable offspring it produces. A representation is evolvable when small changes yield diverse, functional variants -- when there is gold upstream. Evolution discovers such representations through meta-selection: evolvable lineages outcompete rigid ones because their offspring fill niches faster, especially after extinction events clear the landscape.
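The mutate-and-count test can be sketched as follows. This is a toy rendering, not the textbook's procedure: for determinism it enumerates every single-bit mutation of a bit-string genome rather than sampling random mutations, and the viability rule and phenotype mapping are hypothetical placeholders.

```python
def single_bit_mutants(genome):
    """Return every genome that differs from the input in exactly one bit."""
    return [genome[:i] + (1 - genome[i],) + genome[i + 1:] for i in range(len(genome))]


def evolvability(genome, is_viable, phenotype):
    """Count the distinct viable phenotypes among the genome's one-step mutants."""
    return len({phenotype(m) for m in single_bit_mutants(genome) if is_viable(m)})


# Hypothetical toy domain: a genome is viable if it has at least two 1-bits,
# and its phenotype is simply the number of 1-bits.
def is_viable(genome):
    return sum(genome) >= 2


def phenotype(genome):
    return sum(genome)
```

Under these assumptions, `evolvability((1, 1, 1, 0), is_viable, phenotype)` exceeds `evolvability((1, 1, 0, 0), is_viable, phenotype)`: the first genome sits where small changes yield more distinct functional variants. In a real system the interesting work lies in defining viability and phenotype, which this sketch leaves as stubs.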
This is why path-dependent representations matter: they encode potential -- the latent capacity for future creative leaps -- alongside the solutions themselves. The Neuroevolution book extends this point:
"Neuroevolution gives us a rare opportunity to study representations not just as a byproduct of loss minimization, but as artifacts of open-ended exploration and accumulated structural regularities."
This echoes what Akarsh Kumar calls the difference between "statistical intelligence" and "regularity-based intelligence" -- the former perfect at pattern matching, the latter grounded in the actual structure of the world -- mirroring our distinction between statistical and semantic creativity (Kumar and Scarfe 2026). Statistics are wonderful for representing data, for memorising and compressing what already exists. But intelligence -- and creativity -- is fundamentally about building new representations, new models, constrained by the path that got us there.
Constraints make creativity possible
"Art lives from constraints and dies from freedom."
-- Leonardo da Vinci
This constrained understanding is the foundation of creativity. As Noam Chomsky argued in our interview:
"In fact, while it's true that our genetic program rigidly constrains us, I think the more important point is that the existence of that rigid constraint is what provides the basis for our freedom and creativity. [...] If we really were plastic organisms without an extensive preprogramming, then the state that our mind achieves would in fact be a reflection of the environment, which means it would be extraordinarily impoverished. Fortunately for us, we're rigidly preprogrammed with extremely rich systems that are part of our biological endowment. Correspondingly, a small amount of rather degenerate experience allows a kind of a great leap into a rich cognitive system. [...] We can say anything that we want over an infinite range. Other people will understand us, though they've heard nothing like that before. We're able to do that precisely because of that rigid programming."
As Miikkulainen put it: "It's respecting the constraints of the problem." That is the crux. Deep understanding of a domain's constraints is what you need to walk the narrow path to nearby domains, because you grasp structural regularities rather than surface features. This is why, as we saw with Carlsen and AlphaZero, deep structural grasp lets you transfer to variants that defeat a system trained on appearances alone. Creativity without constraints is noise.
The late Margaret Boden crystallised this in The Creative Mind: "Far from being the antithesis of creativity, constraints on thinking are what make it possible" -- they "map out a territory of structural possibilities which can then be explored, and perhaps transformed to give another one". "To drop all current constraints and refrain from providing new ones is to invite not creativity, but confusion. There, madness lies." The great creative minds, Boden observed, "respect constraints more than we do, not less" -- they soared further precisely because they understood the domain well enough to push beyond it.
Think of it this way: creativity is like assembling a jigsaw whose picture you discover only as you place each piece -- and you cannot interpolate your way to an image you have never seen.
AI slop, and the supervisor illusion
Current generative AI systems have broad information -- vast statistical associations extracted from training data -- but lack understanding. Recall our earlier distinction: coherence can emerge from mere constraint adherence (as in evolution), but understanding is cognitive -- it requires "grasping of explanatory and other coherence-making relationships" (Baumberger, Beisbart, and Brun 2017). Current generative AIs are like a child who can recite that "greenhouse gases cause warming" because a trusted adult told them. They can reproduce the explanation, but they do not understand it -- they cannot answer counterfactual questions or reason about the mechanism. They cannot even distinguish what they have been told from what is true. Generative AIs lack the coherence that would make those explanations understood. And to paraphrase Boden: there, slop lies.
"AI slop" is the opposite of coherence, and therefore the opposite of creativity. Slop is what happens when an artefact is generated without path-dependence, without understanding, without respecting the phylogeny or the constraints. As the fractured entangled representations paper argues, LLM outputs are incoherent because they took the wrong path -- or rather, no coherent path at all. Their representations lack the stepping-stone structure that would make outputs meaningful. They only produce non-slop when they are guided by supervision that provides the missing coherence.
There is a curious asymmetry here worth noting. Language models in generation mode are far more likely to produce slop than when operating in discrimination mode. The same LLM that confidently hallucinates a citation when asked to generate one can, when prompted to verify that citation, correctly identify it as nonexistent. What is going on?
Discrimination is a specific, constrained task: does this text exhibit certain statistical signatures? The constraints of the task impose coherence. But generation must conjure coherence from nothing; without external guidance the LLM defaults to the statistically average. The mediocre. The derivative. Slop. This explains why agentic workflows that decompose generation into smaller, more constrained subtasks -- like verifying each reference individually rather than generating a bibliography in one shot -- can dramatically reduce hallucination and improve coherence. The constraints of the subtask substitute for the understanding the model lacks.
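The decomposition described above, checking each reference on its own rather than trusting a bibliography generated in one shot, can be sketched as a simple filter. This is an illustration under stated assumptions: the verifier here is a lookup in a trusted index rather than an LLM call, and `KNOWN_TITLES`, `generated`, and `verify_references` are hypothetical names with hand-written stand-in data.

```python
def verify_references(candidates, known_titles):
    """Check each generated reference on its own against a trusted index."""
    accepted, rejected = [], []
    for ref in candidates:
        # One narrow yes/no question per reference, rather than one open-ended task.
        if ref.strip().lower() in known_titles:
            accepted.append(ref)
        else:
            rejected.append(ref)
    return accepted, rejected


# Hypothetical trusted index and hypothetical generator output.
KNOWN_TITLES = {"the creative mind", "machines who think"}
generated = ["The Creative Mind", "Machines Who Think", "A Plausible But Invented Title"]
accepted, rejected = verify_references(generated, KNOWN_TITLES)
```

The design point is that the constraint lives outside the generator: the loop never asks whether the bibliography as a whole "looks right", only whether each item passes a check whose answer does not depend on the generator's fluency.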
With increasing levels of specification, and in domains where outputs are verifiable -- even implicitly verifiable through execution or compilation -- language models perform dramatically better. Tools like Claude Code (an AI coding assistant), and indeed most of the recent practical advances in deploying LLMs, are fundamentally ways of adding constraints to the generation process. Agentic scaffolding, tool use, code execution, test suites, type systems, LLM-Modulo: all of these impose external structure that guides generation toward coherence. In effect, we are compensating for the models' lack of phylogenetic understanding by adding constraints that make them act as if they had such understanding. The constraints do the work that deep structural knowledge would otherwise provide.
This "acting as if" can be convincing -- uncannily so -- within any single frame. Ask an LLM to check logical coherence and it finds genuine problems; ask it to verify facts and it catches real errors. Within each constrained task, the output is hard to distinguish from understanding. But the LLM is coherent within a frame while possessing no perspective of its own -- like the blind men and the elephant, each accurately reporting what he feels, none integrating across perspectives.
Human understanding is perspectival too -- we see a problem differently depending on which constraints we foreground, which subspace of our knowledge we inhabit. The difference is that a phylogeny gives you a trajectory: a path through your topology of constraints that lets you move between frames and integrate what each reveals -- much as Microsoft's Photosynth7 reconstructed a 3D scene from overlapping photographs by finding shared vertices between them. The LLM's apparent perspective is an aggregate of every trajectory in its training data, which resolves into a coherent voice only when external constraints -- a system message, a prompt, a conversation history -- supply the frame. The model always inhabits a borrowed perspective; strip those constraints away, as early language models showed, and coherence dissolves. Every frame it occupies is lent, not built.
One could iterate -- logical consistency, then factual accuracy, then terminological coherence -- but the space of frames is inexhaustible, Protean in Chirimuuta's apt metaphor (see the Postscript). Running an LLM in a loop over its own outputs adds more blind men; none integrates. What the supervisor brings is taste -- internalised constraints that orient attention toward what is missing. The outer loop is understanding itself.
This creates what we might call the supervisor illusion. When a competent expert uses an AI system, they implicitly provide the constraints that guide generation toward coherence. They prompt engineer, iteratively refine, and know which outputs to reject. The result can be impressive, and it is tempting to credit the AI with creativity it does not possess.
The human-AI system can be genuinely creative -- but the creativity lives in the human's understanding, not in the AI's computation. The AI borrows the supervisor's constraints the way a pen borrows a writer's thoughts. The human contributes both agency and understanding -- but these pull in opposite directions. Agency directs the AI toward a goal; understanding constrains it toward coherence. And as we have argued throughout, goal-directed agency works against transformational creativity: constraints open new paths, while goal-pursuit narrows to familiar ones. The supervisor illusion gets the credit doubly wrong -- attributing to the AI what originates in the human, and attributing to the human's agency what originates in their understanding.
This illusion is particularly seductive in Silicon Valley, where technically sophisticated users routinely coax remarkable outputs from AI systems and extrapolate to world-changing predictions.
Anthropic CEO Dario Amodei (Amodei 2025), for instance, recently suggested that AI could "displace half of all entry-level white collar jobs in the next 1-5 years" while enabling "10-20% sustained annual GDP growth". In the same essay, Amodei notes that "top engineers now delegate almost all their coding to AI" -- but this inadvertently proves the point: it is precisely because they are top engineers that the delegation works. They provide the missing coherence, acting as the critic in an LLM-Modulo loop; the solutions originate in the expert-AI hybrid, not the AI. There is a second factor here too: top engineers can move fast with AI-generated code because they comprehend what is happening -- they breeze through without incurring understanding debt. When less experienced engineers attempt the same velocity, they outpace their own comprehension. The code works (for now), but they do not understand why, and this debt compounds. Every shortcut becomes a liability when something breaks. This is why extrapolations from expert productivity to market-wide transformation are likely to disappoint -- and the benchmarks underpinning such predictions may be equally unreliable. As Melanie Mitchell has argued, most AI benchmarks lack construct validity -- they fail to predict real-world performance because impressive results often stem from data contamination, approximate retrieval, or exploitable shortcuts rather than genuine capability (Mitchell 2026).
In a 2026 paper (Shen and Tamkin 2026), Anthropic researchers Shen and Tamkin ran a randomised controlled trial: junior software engineers learning a new Python library, half with AI assistance and half without. The AI group scored 17 percentage points worse on understanding, and were particularly worse at debugging -- the very skill required to verify AI-generated code. They weren't even faster: only those who delegated completely saw time savings.
The mechanism: errors force you to think critically about why your expectations disagree with reality; they are the friction that kindles understanding. The AI removed this friction. The participants knew it too; the AI group reported feeling "lazy" and aware of "gaps in understanding".
Ironically, Anthropic's own research undermines Amodei's extrapolation. Entry-level workers -- precisely those Amodei predicts will be displaced -- are the ones for whom AI assistance backfired. You cannot displace the junior engineers if the process that creates senior engineers depends on the struggle that AI removes.
But didn't AI make chess and Go players more creative? Shen and Tamkin found that those who delegated everything saw stunted understanding, while those who used AI for conceptual questions scored as well as the control group -- suggesting the difference lies in how AI is used, and perhaps in who uses it.
Human-AI co-creation is a double-edged sword. Done right, AI offers a fresh perspective, free from human bias and dogma -- it can challenge received wisdom without making us defensive. But done wrong, it stunts our understanding and reduces our work to slop. We must develop usage patterns that reward rather than atrophy understanding -- imposing friction, forcing us up against reality. AI amplifies what you bring to it; it does not substitute for what you lack.
The Argument in Brief
All the big things we want from AI require handling unknown unknowns -- and that needs creativity.
Chollet-style intelligence (Chollet 2019) handles known unknowns: novel instances of familiar tasks. Agency pursues goals -- but unknown unknowns precede any goal you could formulate. They require the capacity to discover stepping stones that nobody anticipated (Stanley and Lehman 2015). That capacity is creativity.
Authentic creativity needs respect for constraints, not intelligence and agency.
Understanding is the cognitive form of something more general: respect for constraints -- operating within and building on the structure that came before (Boden 2004). Constraints operate at three levels: physical (baked into matter), concrete (instantiated in a fixed substrate), and modelled (represented so they can be manipulated and transferred). Evolution built every organism on Earth through blind variation and selective retention (Campbell 1960) -- respect for constraints without cognition. Intelligence and agency help with exploratory creativity but are antithetical to transformational creativity: Stanley's "false compass" (Stanley and Lehman 2015) means any premature objective is potentially deceptive, and more cognitive power only accelerates the detour. Transformational creativity discovers the path; intelligence walks it faster (Schopenhauer 1844).
Current AIs take the wrong path; their training rewards the wrong abilities.
LLMs recombine training data without the respect for constraints that would make outputs genuinely new. AlphaZero discovers real structure, but it is concrete rather than modelled -- powerful within its domain, impossible to extract or transfer. Gradient descent's direct route to the objective bypasses the incremental, building-block structure creativity requires (Kumar, Clune, et al. 2025), producing fractured entangled representations too entangled to decompose or repurpose. This hypothesis is preliminary, and whether it persists at scale remains open, but the broader argument has a longer pedigree in neuroevolution research (Risi et al. 2025): gradient descent is a greedy hill-climber, and scaling it does not change its nature.
Human-AI collaboration is the path forward -- for now.
Humans supply the respect for constraints -- the understanding -- that current AI systems lack. In chess and Go, AI made humans more creative by challenging dogma whilst preserving the friction with reality that kindles understanding. If this success can be replicated, the scope for human-AI co-creativity is vast. But we do not discount that someday AI systems with the right kind of representations -- learned, factored, path-dependent, evolvable -- might have genuine understanding and collect stepping stones on their own. Such systems would be creative, not because they are intelligent, but because they respect the phylogeny.
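The "false compass" and "greedy hill-climber" claims in the argument above can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than anything from the cited work: the one-dimensional landscape `fitness`, the two search routines, and all parameter values are hypothetical, and the novelty-driven routine is only a simplified search loosely in the spirit of the stepping-stone argument. It says nothing directly about gradient descent in high-dimensional networks; it only shows how following an objective can stall where ignoring it does not.

```python
def fitness(x: int) -> int:
    """Hypothetical deceptive landscape on positions 0..20.

    A small peak at x=5 is separated by a valley from the global peak at x=20.
    """
    if x <= 5:
        return x                  # climbs to the local peak (fitness 5)
    if x <= 12:
        return 5 - (x - 5)        # descends into the valley (fitness -2 at x=12)
    return (x - 12) * 2 - 2       # climbs to the global peak (fitness 14)


def neighbours(x: int, lo: int = 0, hi: int = 20) -> list[int]:
    """Positions one step away from x that stay inside the landscape."""
    return [n for n in (x - 1, x + 1) if lo <= n <= hi]


def greedy_hill_climb(start: int) -> int:
    """Follow the objective: move only while a neighbour is strictly better."""
    x = start
    while True:
        best = max(neighbours(x), key=fitness)
        if fitness(best) <= fitness(x):
            return x              # no improving neighbour: stuck at this peak
        x = best


def novelty_search(start: int, steps: int = 25) -> int:
    """Ignore the objective: step to the neighbour farthest from anything visited.

    Fitness is consulted only to remember the best position seen so far.
    """
    x = start
    archive = {x}
    best_seen = x
    for _ in range(steps):
        x = max(neighbours(x), key=lambda n: min(abs(n - a) for a in archive))
        archive.add(x)
        if fitness(x) > fitness(best_seen):
            best_seen = x
    return best_seen


if __name__ == "__main__":
    print("greedy ends at", greedy_hill_climb(0))
    print("novelty-driven search finds", novelty_search(0))
```

In this toy, the valley positions play the role of stepping stones: each looks worse than the local peak under the objective, yet the global peak cannot be reached without passing through them.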
We should acknowledge that the concept of creativity is contested. Philosopher Shevlin (2021) argues that comparative psychology should abandon it entirely, in favour of operationalised notions like innovation and behavioural plasticity, given deep disagreements about whether creativity requires subjective experience, intentional agency, spontaneity, or valuable outputs. Our argument, however, rests on claims about representational structure and understanding that are tractable regardless of how one defines creativity per se. Whether we call the outcome "creativity" or "open-ended innovation" is somewhat terminological; what matters is whether a system's representations support transfer, counterfactual reasoning, and coherent extension of the phylogeny. These are engineering and cognitive science questions, not definitional debates. We maintain that evolution is creative -- but readers who prefer to reserve "creativity" for minded systems can substitute "open-ended innovation" without loss to our core argument.
Conclusions
A corollary of the preceding argument: there can be no robustly generalising intelligence without understanding. As Philip K. Dick put it, "reality is that which, when you stop believing in it, doesn't go away"; without access to reality's constraints, no amount of raw cognitive power will help you explore it.
This is not a claim that AI can never be creative -- only a claim about what AI creativity would require. If what matters is the structure of representations -- learned, factored, path-dependent -- rather than the biological substrate, then any system that can navigate its own topology of constraints could in principle achieve understanding. The question of grounding remains open: can a system that has never pushed against reality's constraints build a trajectory through them? Perhaps future systems, trained through interaction with the physical world, could develop what current systems lack -- and perhaps achieve creativity in domains we cannot access, even if not in ours.
For now, the most promising path forward is human-AI co-creativity. From board games to reticular chemistry to protein science, AI predicts in minutes what once took years -- but the understanding that frames the search remains human. Picbreeder showed how keeping humans in the loop can produce representations far richer than those achieved by standard training methods. And as Terence Tao suggests, mathematicians and AI systems working together may soon prove theorems that neither could reach alone. The human provides the coherence, the understanding, the taste for the interesting; the AI provides statistical power, tireless exploration, and freedom from cognitive biases.
If greatness cannot be interpolated, perhaps it cannot be fully automated either -- at least not yet. But it can be amplified.
Companion video discussions are forthcoming on the MLST channel.
Postscript: Must Representations Be Perfect?
Our critique of fractured entangled representations might seem to demand the alternative that Fodor and Pylyshyn championed in their classic critique of connectionism (Fodor and Pylyshyn 1988): perfectly compositional representations where complex meanings are built systematically from atomic parts, and the capacity to think one thought guarantees the capacity to think structurally related thoughts. That would be a misreading. The history of symbolic AI is a cautionary tale about assuming the world decomposes that neatly.
Chirimuuta -- whose Kantian critique of Chollet's kaleidoscope hypothesis we encountered in Section 1 -- invokes nature as Proteus, the shape-shifting sea god. Pin him down and he answers truthfully, but release your grip and he shifts; there are always other ways he could have been pinned (Chirimuuta 2024). If nature admits many valid decompositions but no single canonical one, then no representation will ever be perfectly factored. The "coarse-grained stabilities" Chirimuuta describes -- functional patterns that hold well enough to explain, without carving nature at its joints -- may be all there is.
In our opinion, this is part of the story of AI creativity. We have contrasted spaghetti representations and structured ones, but "structured" need not mean "perfectly compositional". Evolution itself works with leaky, context-dependent modules -- biological structures are "good enough", shaped by the path that produced them, reused opportunistically rather than designed from scratch. They are far from Fodorian symbols, yet they underwrite the entire tree of life.
The bull case for AI creativity, then, does not require solving metaphysics. It requires representations that are more factored, more path-dependent, and more evolvable than current spaghetti -- without reaching some Platonic ideal. Such a system would still be messy, still Protean, still resistant to any single clean decomposition. But it would have what current systems lack -- a structure that can grow.
References
See https://archive.mlst.ai/paper/why-creativity-cannot-be-interpolated/
Cite this paper
@article{mlst_2026_001,
title = {Why Creativity Cannot Be Interpolated},
author = {Dr. Jeremy Budd and Dr. Tim Scarfe},
journal = {MLST Archive},
year = {2026},
url = {https://archive.mlst.ai/paper/why-creativity-cannot-be-interpolated}
}