When AI Discovers the Next Transformer: Evolutionary LLM Systems and the Future of Automated Science

By AI Agent Engineering | 2026-03-15 | research

A founding researcher at Sakana AI made a claim that should unsettle anyone building with LLMs: "When we run LLMs autonomously, nothing interesting happens" [1]. The models generate output, sure. But novelty — genuine, surprising, useful novelty — doesn't emerge from running a language model in a loop.

That's a problem, because the most valuable thing AI could do next isn't answer questions faster. It's discover things humans haven't thought of yet. And the gap between "impressive autocomplete" and "automated scientific discovery" turns out to require something LLMs alone can't provide: evolution.

The Problem With Giving AI a Fixed Problem

Every major LLM-driven code generation system — AlphaEvolve, Jeremy Howard's approaches, the standard agent coding pipelines — shares a structural limitation. You give the system a problem. It optimizes a solution. It gets better at that specific problem. Then it stops.

Robert Lange, founding researcher at Sakana AI, calls this the "problem problem" [1]. AlphaEvolve can optimize circle packing or matrix multiplication brilliantly — but it needs a human to hand it the right problem in the first place. The system never asks: "What if I'm solving the wrong problem? What if solving a completely different problem first would unlock a better solution to this one?"

That's not a minor limitation. It's the fundamental difference between optimization and discovery.

Consider how human scientific breakthroughs actually happen. The insight that cracked a number theory problem came from linear algebra. The technique that revolutionized one field was borrowed from an unrelated one. Real progress requires inventing new problems, not just solving the ones handed to you [1].

Current LLM systems can't do this. They're parasitic on their starting conditions — deeply capable within the search space you define, but unable to redefine the search space itself [1].

Shinka Evolve: Evolution That Evolves Itself

Sakana AI's answer is Shinka Evolve, a framework that combines LLMs with evolutionary algorithms to do open-ended program search. The name is deliberately recursive: "shinka" is Japanese for "evolution," so the framework is, roughly, evolution that evolves. The evolutionary algorithm co-evolves alongside the programs it's optimizing [1].

The architecture works like this: an archive of programs is organized as islands. LLMs serve as mutation operators — proposing code diffs, full rewrites, or crossovers between programs from different islands. An evaluator scores each mutation. Good mutations propagate. Bad ones die. The archive grows [1].

Three types of mutation keep diversity high: targeted diffs that patch a parent program, full rewrites that replace it wholesale, and crossovers that recombine programs drawn from different islands.
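To make the shape of that loop concrete, here is a minimal, self-contained sketch in Python. The names (islands, llm_propose, evaluate) and the island-capping rule are illustrative assumptions, not the actual Shinka Evolve API.

```python
import random

def evolve(islands, llm_propose, evaluate, generations=100):
    """Toy island-based evolutionary loop with an LLM as the mutation operator.

    islands: list of islands, each a non-empty list of (program_source, score) tuples.
    llm_propose(parents, kind): asks an LLM for a new program; kind is one of
        "diff", "rewrite", or "crossover".
    evaluate(program_source): returns a numeric fitness score.
    """
    for _ in range(generations):
        island = random.choice(islands)                    # pick an island
        parent = max(island, key=lambda p: p[1])           # exploit its best program
        kind = random.choice(["diff", "rewrite", "crossover"])
        if kind == "crossover":
            other = random.choice(random.choice(islands))  # partner from any island
            child = llm_propose([parent[0], other[0]], kind)
        else:
            child = llm_propose([parent[0]], kind)
        score = evaluate(child)
        island.append((child, score))                      # good mutations join the archive
        island.sort(key=lambda p: p[1], reverse=True)
        del island[50:]                                    # cap island size; weak programs die
    return max((p for isl in islands for p in isl), key=lambda p: p[1])
```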

But the real innovation isn't the mutation operators. It's the model ensemble and adaptive selection.

The Multi-Model Bandit: When GPT-5 and Sonnet Take Turns

Shinka Evolve doesn't rely on a single frontier model. It runs GPT-5, Sonnet 4.5, Gemini, and others simultaneously — and uses a UCB (Upper Confidence Bound) bandit algorithm to figure out which model to deploy for each mutation [1].

The intuition sounds simple: just use the best model. But in practice, the best model on SWE-bench isn't always the best mutation proposer for a given program state. Sometimes GPT-5 lays a stepping stone that Sonnet builds on. The credit-assignment problem across models turns out to be genuinely hard — did the performance gain come from the model that made the current mutation, or the one that created the stepping stone three generations back? [1]

UCB handles this elegantly. Each model is an arm of a multi-armed bandit. The algorithm tracks which models have produced improvements for similar parent nodes, allocates more attempts to high-performing models, but never fully abandons any model — maintaining a probability floor that preserves the chance for serendipity [1].

The theoretical guarantee matters: UCB's regret is logarithmic, meaning it converges to near-optimal model selection without requiring perfect credit assignment upfront [1].
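As an illustration of how such a selector can work, here is a minimal UCB1 sketch over a pool of models. It is a toy under stated assumptions, not Shinka Evolve's implementation: the reward definition, the epsilon floor, and the class interface are invented for the example.

```python
import math
import random

class UCBModelSelector:
    """UCB1 bandit over a set of LLMs used as mutation proposers.

    select() returns the model with the highest upper confidence bound;
    update() credits a model with the normalized improvement its mutation
    produced. An epsilon floor keeps every model in play.
    """

    def __init__(self, models, epsilon=0.05):
        self.models = list(models)
        self.counts = {m: 0 for m in self.models}
        self.rewards = {m: 0.0 for m in self.models}
        self.total = 0
        self.epsilon = epsilon

    def select(self):
        # Probability floor: occasionally sample uniformly at random,
        # preserving the chance of serendipitous stepping stones.
        if random.random() < self.epsilon:
            return random.choice(self.models)
        # Try every model at least once before trusting the statistics.
        for m in self.models:
            if self.counts[m] == 0:
                return m
        def ucb(m):
            mean = self.rewards[m] / self.counts[m]
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[m])
            return mean + bonus
        return max(self.models, key=ucb)

    def update(self, model, improvement):
        # improvement should be normalized to roughly [0, 1] for UCB1.
        self.counts[model] += 1
        self.total += 1
        self.rewards[model] += max(0.0, improvement)
```

Usage would look like: pick a model with selector.select(), let it propose a mutation, evaluate the child program, then call selector.update(model, score_delta). The logarithmic exploration bonus is what gives the no-regret guarantee mentioned above.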

The Results That Matter

Circle packing is the canonical benchmark — pack circles into a square to maximize the sum of radii. Shinka Evolve achieved state-of-the-art results with dramatically fewer evaluations than AlphaEvolve [1]. In under 200 LLM interactions, it converged on a solution. That's not just good — it's sample-efficient enough to make evolutionary program search practical for researchers without massive compute budgets.
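For readers unfamiliar with the benchmark, a toy evaluator makes the objective concrete: a candidate packing scores the sum of its radii only if every circle fits inside the square and no two circles overlap. The function below is an illustrative sketch, not the benchmark's official scorer.

```python
import math

def circle_packing_score(circles, side=1.0, tol=1e-9):
    """Score a candidate packing inside a [0, side] x [0, side] square.

    circles: list of (x, y, r) triples. Returns the sum of radii if the
    packing is feasible, else 0.
    """
    for x, y, r in circles:
        # Reject circles that leave the square (or have non-positive radius).
        if r <= 0 or x - r < -tol or y - r < -tol or x + r > side + tol or y + r > side + tol:
            return 0.0
    for i in range(len(circles)):
        for j in range(i + 1, len(circles)):
            xi, yi, ri = circles[i]
            xj, yj, rj = circles[j]
            # Reject overlapping pairs (tangency is allowed within tolerance).
            if math.hypot(xi - xj, yi - yj) < ri + rj - tol:
                return 0.0
    return sum(r for _, _, r in circles)
```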

But three other applications reveal the framework's real range:

Evolving agent scaffolds. Using a framework called ADAS (Automatic Design of Agentic Systems), Shinka evolved the agent scaffolding itself — not the model weights, but the code that orchestrates how a model reasons through tasks. On AIME math benchmarks, it dramatically improved the performance of cheap models like GPT-4.1 Nano. The evolved scaffolds generalized across different language models and different years of AIME problems [1]. An agent that evolves agents.
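To clarify what "scaffold" means here, the sketch below shows the kind of orchestration code that sits around a frozen model: sample several solutions, then have the model adjudicate. It is a hypothetical hand-written baseline, not an evolved ADAS scaffold; in the evolutionary setting, this function's body is exactly what the LLM mutates.

```python
def solve_with_scaffold(problem, llm, n_candidates=4):
    """A hypothetical hand-written scaffold of the kind ADAS-style evolution rewrites.

    The model weights never change; only this orchestration code does. Evolution
    might alter the number of candidates, the critique prompt, the voting rule,
    or the entire control flow.
    """
    candidates = [
        llm(f"Solve step by step, then give a final answer.\n\nProblem: {problem}")
        for _ in range(n_candidates)
    ]
    # Self-critique pass: ask the model which attempt is best supported.
    joined = "\n\n---\n\n".join(candidates)
    return llm(
        "Here are several attempted solutions to the same problem.\n"
        f"{joined}\n\nWhich final answer is best supported? Reply with that answer only."
    )
```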

Competitive programming. Applied to ALE-Bench (AtCoder heuristic programming contests), Shinka optimized solutions on top of an existing agent's initial outputs. The combination would have placed second in the actual competition [1].

Mixture-of-experts loss functions. Shinka evolved load-balancing loss functions for MoE models, illuminating a full Pareto front of trade-offs between model performance and load balance — in roughly 20 generations [1]. Not one optimal solution, but an entire landscape of viable options.
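For context on what is being evolved, the snippet below implements the standard auxiliary load-balancing loss from the Switch Transformer paper, the sort of hand-designed baseline an evolved loss would be traded off against. The NumPy formulation and argument names are illustrative.

```python
import numpy as np

def load_balancing_loss(router_logits, expert_indices, num_experts):
    """Standard auxiliary load-balancing loss: num_experts * sum_i f_i * P_i.

    router_logits: (num_tokens, num_experts) raw router scores.
    expert_indices: (num_tokens,) index of the expert each token was routed to.
    """
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs = probs / probs.sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens dispatched to expert i.
    f = np.bincount(expert_indices, minlength=num_experts) / len(expert_indices)
    # P_i: mean router probability assigned to expert i.
    P = probs.mean(axis=0)
    return num_experts * float(np.dot(f, P))
```

Minimizing this term pushes the router toward a uniform token split across experts; an evolved alternative trades that balance against raw model performance along the Pareto front described above.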

The Stepping Stone Argument

Kenneth Stanley's book Why Greatness Cannot Be Planned provides the intellectual foundation here. The core claim: breakthrough innovations follow divergent paths that look stupid in hindsight. Natural evolution doesn't optimize toward a goal — it accumulates stepping stones, and some of those stepping stones turn out to be revolutionary in ways that couldn't have been predicted [1].

Shinka Evolve operationalizes this idea. By maintaining diverse islands of programs, allowing radical rewrites alongside incremental patches, and never fully converging on a single solution, it creates conditions where stepping stones can accumulate [1].

But Lange is honest about the current limits. When you start Shinka Evolve with an already-optimized solution, it gets stuck in local optima. Start with an impoverished solution, and there's much more room for genuine diversity — but the search takes longer. It's the classic exploration-exploitation trade-off, now playing out in program space [1].

The deeper limitation: Shinka Evolve still takes the problem as fixed. It doesn't yet co-evolve problems and solutions together. Lange points to Jeff Clune's POET framework — where environments and agents co-evolve in an auto-curriculum — as the direction that could unlock truly open-ended discovery [1]. The system that invents new problems as a way of solving the original one.

The AI Scientist Question

Sakana AI also built the AI Scientist — an autonomous system that generates research ideas, implements experiments, runs them, and writes papers. Version 2 replaced the linear experiment pipeline with a parallelizable agentic tree search, inspired by the scientific method itself: accumulate evidence, reject hypotheses, adapt the next experiment based on results [1].

The results are genuinely novel. One AI Scientist paper was accepted at an ICLR workshop, passing the acceptance threshold before meta-review [1]. That's not a Nature paper, but it's the first time an autonomous system produced work that cleared a peer-review bar.

Lange's self-assessment is refreshingly honest: not every paper the AI Scientist produces is a discovery. Some is what critics call "slop" — work that looks like science, follows the format, but lacks deep grounded understanding [1]. The system operates near the top of what Lange calls the "epistemic tree" — doing surface-level recombination of known ideas rather than reaching deep into the tree to synthesize genuinely novel insights.

But the trajectory matters more than the current snapshot. The gap between GPT-3 and GPT-4 was a massive increase in fidelity. There's no principled reason these systems can't develop deep grounded understanding — they just don't have it yet [1].

The Verification Bottleneck

Every evolutionary program search system shares one critical weakness: verification.

It's easier to generate a mountain of candidate solutions than to verify which ones actually work [1]. LLMs can do "soft verification" — reading code and mentally tracing execution — but it's not exact. Reward hacking is real: systems find shortcuts that satisfy the evaluator without achieving genuine progress [1].

This is where the "problem problem" bites hardest. If you're co-evolving problems and solutions, you also need to co-evolve the verification — and automated verification of novel scientific claims is an unsolved problem. For now, the ultimate verifier is still human judgment, and that creates a bottleneck that scales poorly [1].

Lange sees a path forward through systems like OpenAI's PaperBench and LLM-based soft verification, combined with physical experiment execution through robotic labs. But he's clear that this infrastructure will take years to mature [1].

What This Means for AI Agent Engineering

The connection between evolutionary program search and AI agent development is direct. Shinka Evolve already demonstrated evolving agent scaffolds — the code that orchestrates how models reason through tasks. As these evolutionary systems become more sample-efficient, three implications become unavoidable:

Agent architectures will be evolved, not designed. The ADAS application proved that LLM-evolved scaffolding can outperform hand-designed agent pipelines. Today it works on math benchmarks. Tomorrow it works on your production agent's task-routing logic, its retry strategies, its context management.

Multi-model orchestration needs adaptive selection. Shinka's UCB bandit for model selection foreshadows how production agent systems will work: not hardcoding which model handles which subtask, but dynamically routing based on observed performance. The model that's best for code generation isn't always best for planning, and the optimal assignment changes based on the specific problem state.

The "vibe coding" paradigm is a stepping stone. Lange describes the trajectory clearly: from chat assistants (single-threaded, human-in-the-loop) to what he calls "vibe researching" — distributed optimization where you steer during the day and parallel experiments run overnight [1]. The current cursor-style coding assistants are the beginning, not the end state.

The Uncomfortable Frontier

Lange raised a point that deserves more attention than it gets. These coding assistants might function like drugs: addictive, rationed only by your budget, and capable of making you accept outputs you haven't actually understood [1]. The autopilot problem is real: when models generate tokens faster than you can read them, you start pressing accept without thinking. You stop being the driver.

The counterargument — that humans will always provide the deep understanding and creativity — rests on an assumption that gets weaker every year. Lange still believes it. He points out that current AI systems understand things "a few levels down in the epistemic tree" while humans understand things deep down, giving us a wider cone of creative potential [1].

But he also says the Rubicon moment is clear: it arrives when AI discovers the next Transformer architecture — something massive, foundational, and universally adopted — and we're all using it. Not a marginal improvement. A paradigm shift discovered by a machine [1].

We're not there yet. But Shinka Evolve, running for under 200 evaluations on a math problem and achieving state-of-the-art results, suggests the distance is shorter than most people think. The systems that will cross that line won't be bigger language models. They'll be evolutionary frameworks that use language models as mutation operators — accumulating stepping stones, co-evolving problems and solutions, and running not for minutes but for months.

The researchers building those systems are working in the open. The code is available. The question is whether the AI agent engineering community recognizes this as the frontier it is — or keeps optimizing prompts while the real architecture of automated discovery takes shape somewhere else.


References

[1] Machine Learning Street Talk, "When AI Discovers the Next Transformer" (interview with Robert Lange), 2026-03-13. Video.