Not every discovery needs an LLM
TL;DR: AI-driven scientific discovery with systems such as AlphaTensor and AlphaEvolve is en vogue, and we too are heavily invested in that space. However, these LLM- and RL-driven approaches, while often impressive in themselves, are not always the right tool for the job. In this post I discuss three recent projects from our group that take different angles on LLM- and RL-driven mathematical discovery systems. Two of them revisit classical problems made famous through these AI systems and show that classical structured search matches or beats them on these specific problem classes, with a tiny fraction of the compute. The third example goes the other way around: AI serves not as the discovery engine; instead, an AI-accelerated subroutine sits inside a classical structured search and lets us settle a thirty-year-old conjecture on real algebraic plane curves of degree seven. This points to an interesting shift in the recent AI vs. classical discussion: it is not so much about whether to use AI, but where in the stack AI belongs.
Not a takedown
When AlphaTensor [AT] landed in 2022 and AlphaEvolve [AE] in 2025, several of my colleagues in the optimization community quietly asked themselves the same uncomfortable question: are we obsolete now? AlphaTensor found better tensor decompositions for matrix multiplication, a problem that had been a textbook example of algebraic complexity for more than fifty years. AlphaEvolve, later, produced new best-known solutions for circle packing, hexagon packing, and minimum-distance configurations, problems some of which go back to Bateman and Erdős in 1951. In both cases, a DeepMind-scale AI system took a classical problem, applied a lot of compute, and came out with a new record, making headlines in the process.
Now, the Zuse Institute Berlin (ZIB) is a research institute with a significant focus on AI, optimization, and model-based simulation, and it is our goal to push that even further. My group and I have also spent considerable time working at the intersection of optimization and machine learning, so the AI person in me celebrates these successes while the optimization person in me is perhaps a bit more reserved. These systems are genuinely impressive, and the broader agenda of using strong learned models to help propose, search, and evaluate mathematical constructions is something I strongly believe in. At the same time, it seems crucial to me where exactly we apply AI in the stack:
For example, in some cases the right place for AI in the stack is the formulation of the optimization problem, which is then solved with a classical optimization solver, rather than the solving of the problem itself. Similarly, sometimes the right place is the design of an improved subroutine on which the entire search hinges, rather than forcing the search itself through the AI system.
Full Disclosure. I am a co-author on all three of the projects I am about to discuss: the flip-graph paper on structured matrix multiplication [KGP26] which was just accepted for ISSAC 2026, the global-optimization paper on AlphaEvolve’s benchmark problems [BKMPP26], and the two companion papers on patchworked real plane curves of degree seven [GJKMPSWZ26a], [GJKMPSWZ26b]. So read this post with that in mind: I have skin in the game, just not always sure which side I’m on.
Two modes of search, three placements for AI
Both AI-driven discovery and classical optimization are, at the end of the day, search over a large space. The difference is what each method is able to exploit, which is often a function of how structured the space is.
AlphaTensor searches the space of candidate low-rank tensor decompositions using deep reinforcement learning. AlphaEvolve searches the space of programs by having an LLM propose code edits that an evaluator then scores in an evolutionary loop. FunSearch [FS], AlphaDev [AD], and AlphaGeometry [AG] operate along related axes. The common structure is: a strong learned model generates candidates, an external evaluator or environment checks them, and the loop is long enough for compute to discover behavior that was not explicitly programmed in.
Classical structured search looks different. It does not treat the search space as a black box. It uses decades, sometimes centuries, of theoretical development to constrain where to look: in the flip-graph setting, by only visiting provably correct tensor decompositions and only moving between them via rank-preserving or rank-reducing rewrites; in the global-optimization setting, by writing the problem as a nonlinear program and handing it to a spatial branch-and-bound solver that prunes entire regions using mathematical bounds. The same bitter-lesson machinery that helps RL and LLMs on unstructured problems becomes overhead when the problem’s structure is already available in closed form.
Once you think of it this way, the question is not really “AI or classical?” but “where does AI sit in the stack?” The answer depends heavily on the problem, and the three case studies below make that point by giving three different answers:
- AI as the search engine, no human structure injected. The AlphaTensor / AlphaEvolve default, which our first two cases revisit and find wanting on structured problems.
- Classical structured search, no AI at all. What flip-graph random walks and off-the-shelf NLP solvers actually look like, and what makes them hard to beat on these problem classes (Case 1, Case 2).
- AI inside a classical structured search, providing a fast learned subroutine. How we settled a thirty-year-old conjecture on degree-seven plane curves (Case 3).
Case 1: matrix multiplication schemes via flip graphs
The first paper is Faster Algorithms for Structured Matrix Multiplication via Flip Graph Search [KGP26] by Kirill Khoruzhii, Patrick Gelß, and me, to appear at ISSAC 2026.
Background. Since Strassen showed in 1969 that the naive $O(n^3)$ algorithm for matrix multiplication is suboptimal, researchers have chased two different goals. One is the asymptotic exponent $\omega$, currently pushed below $2.372$ via laser-method refinements; the other is the constant-factor story for small base sizes, where an improved low-rank decomposition recursed on itself yields faster practical algorithms. AlphaTensor aimed for this second goal, for general (unstructured) matrix multiplication. But much of what numerical libraries actually do is not general matmul. It is Level-3 BLAS primitives with specific structure. For example, SYRK computes $AA^T$ and TRMM multiplies a triangular by a general matrix. These show up everywhere, from Gram matrices in statistics to Newton-Schulz updates in modern LLM training to causal-attention masks.
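To make the two primitives concrete, here is a minimal NumPy illustration (mine, not from the paper) of what SYRK and TRMM compute, and of the structural redundancy that specialized schemes exploit:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 4))
L = np.tril(rng.standard_normal((4, 4)))  # lower-triangular factor

syrk = A @ A.T  # SYRK: symmetric rank-k update, the result is symmetric
trmm = L @ B    # TRMM: triangular matrix times general matrix

# The SYRK output is symmetric, so only about half its entries carry
# independent information; structured schemes exploit exactly this.
assert np.allclose(syrk, syrk.T)
```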
The flip graph idea. The search space is the set of correct decompositions of a target tensor. Its nodes are rank-$r$ decompositions that really do compute the target bilinear operation; its edges are local rewrites, “flips”, that take one correct decomposition to another, potentially decreasing $r$. Because edges preserve correctness by construction, every node we visit is a valid algorithm. A random walk on this graph, introduced by Kauers and Moosbauer in 2023 [KM23], turns out to be a remarkably efficient way to find low-rank schemes.
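To see what “every node is a valid algorithm” means in code, here is a small NumPy sketch (not the paper’s implementation) checking that Strassen’s rank-7 scheme is a correct decomposition of the $2 \times 2$ matrix-multiplication tensor; this is exactly the kind of certificate a flip-graph node carries:

```python
import numpy as np

# The 2x2 matrix-multiplication tensor T: T[a, b, c] = 1 iff the product of
# A-entry a and B-entry b contributes to C-entry c (row-major indexing).
T = np.zeros((4, 4, 4), dtype=int)
for i in range(2):
    for j in range(2):
        for k in range(2):
            T[2 * i + k, 2 * k + j, 2 * i + j] = 1

# Strassen's rank-7 scheme as coefficient matrices, one row per product M1..M7.
U = np.array([[1,0,0,1],[0,0,1,1],[1,0,0,0],[0,0,0,1],
              [1,1,0,0],[-1,0,1,0],[0,1,0,-1]])   # coefficients on A entries
V = np.array([[1,0,0,1],[1,0,0,0],[0,1,0,-1],[-1,0,1,0],
              [0,0,0,1],[1,1,0,0],[0,0,1,1]])     # coefficients on B entries
W = np.array([[1,0,0,1],[0,0,1,-1],[0,1,0,1],[1,0,1,0],
              [-1,1,0,0],[0,0,0,1],[1,0,0,0]])    # coefficients on C entries

# A node of the flip graph is exactly a triple (U, V, W) whose rank-1 terms
# sum to T; Strassen's scheme does, with rank 7 < 8.
recon = np.einsum('ra,rb,rc->abc', U, V, W)
assert np.array_equal(recon, T)
```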
flip-cpd repository (Kirill Khoruzhii). The 3D view shows the target tensor as a cube of entries, and the decomposition as a sum of rank-1 outer products. Try the presets (Strassen, Laderman) and press Reduce to run 1,000,000 flips — you can watch the rank decrease as the random walk finds lower-rank representations of the same bilinear operation.
What we added. Two technical ingredients. First, we search over $\mathbb{F}_2$ and $\mathbb{F}_3$, not just $\mathbb{F}_2$; $\mathbb{F}_3$ lets us discover schemes that fundamentally require the inverse of 2, and we then Hensel-lift them to $\mathbb{Z}$ or $\mathbb{Q}$. Second, for transpose products we introduce a corner-zeroing technique that increases the fraction of recursive calls, tightening the asymptotic factor.
Headline numbers. We systematically searched the 15 distinct structured format combinations that arise for base sizes $n \in \{2,3,4,5\}$ and improved the asymptotic complexity factor for 13 of them. A few concrete entries:
- $4 \times 4$ symmetric rank-k update ($AA^T$): rank 34, giving an asymptotic factor $\gamma = 22/37 \approx 0.595$, improving on the previous best $8/13 \approx 0.615$.
- $2 \times 2$ symmetric-symmetric: rank 5, improving on the previous rank-6 construction.
- $3 \times 3$ skew-symmetric $\times$ general: rank 14, improving on AlphaTensor’s rank 15 for the same problem.
Compute budget. The search used commodity 48-core Intel Xeon Gold nodes, 24-hour time limits per configuration, 1007 core-days in total. The headline $\langle 4,4,4:34 \rangle$ SYRK scheme was found in 10 minutes of wall-clock time on one node. The implementation sustains about $5 \times 10^6$ flips per second per thread. Code and schemes are released at github.com/khoruzhii/flip-cpd.
The right way to read this. It is not “random walks beat deep RL”: AlphaTensor solves a strictly harder problem (general matrix multiplication) on which this paper explicitly does not try to improve. What flip graphs do is encode the algebraic structure of correctness directly into the graph: every neighbor is a valid algorithm, so the search never wastes steps on invalid candidates, and the neighborhood relation is tailored to rank-reducing moves. A deep-RL agent that does not start with this structure has to learn, implicitly, to avoid the entire ocean of incorrect decompositions, which is most of the space. That is compute spent on a problem the flip-graph formulation eliminates by construction.
Case 2: combinatorial geometry with off-the-shelf NLP solvers
The second paper is Global Optimization for Combinatorial Geometry Problems Revisited in the Era of LLMs [BKMPP26] by Timo Berthold, Dominik Kamp, Gioni Mexi, me, and Imre Pólik.
Background. AlphaEvolve’s mathematical-discovery paper reports new best-known solutions for several classical problems, including circle packing (pack $n$ circles with variable radii into a unit square, maximize the sum of radii), hexagon packing (pack $n$ unit hexagons into a regular hexagon, minimize the side length of the container), and minimum-distance-ratio configurations (place $n$ points in the plane, minimize the ratio of max to min pairwise distance). These are classical extremal-geometry problems; best-known solutions were mostly due to Erich Friedman, David Cantrell, and related work, and some have been open since the 1950s. AlphaEvolve improved the state-of-the-art for several of them.
Our approach. We wrote the same problems as compact nonlinear programs, basically a few nonlinear constraints expressing non-overlap or containment together with some linear bookkeeping. Then we ran them through two off-the-shelf global NLP solvers: FICO’s commercial Xpress and the open-source SCIP. We used default settings with zero tuning: we wanted to see what generic global-optimization technology delivers on these problems, not some handcrafted problem-specific technique. In some sense, this is the optimization analog of using an LLM: a general-purpose technology (the solver) plus a problem formulation, here in the form of an NLP.
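As a toy illustration of what such an NLP formulation looks like (a local SLSQP solve via SciPy for a tiny $n = 3$ instance, not the global branch-and-bound solvers used in the paper; all names here are mine), maximize the sum of radii of circles packed in the unit square:

```python
import numpy as np
from scipy.optimize import minimize

n = 3  # toy instance; decision variables are (x_i, y_i, r_i) per circle

def neg_sum_radii(z):
    return -z[2::3].sum()  # maximize sum of radii = minimize its negative

cons = []
for i in range(n):
    # containment in the unit square: r_i <= x_i, y_i <= 1 - r_i
    cons += [
        {"type": "ineq", "fun": lambda z, i=i: z[3*i] - z[3*i+2]},
        {"type": "ineq", "fun": lambda z, i=i: 1 - z[3*i] - z[3*i+2]},
        {"type": "ineq", "fun": lambda z, i=i: z[3*i+1] - z[3*i+2]},
        {"type": "ineq", "fun": lambda z, i=i: 1 - z[3*i+1] - z[3*i+2]},
    ]
    for j in range(i):
        # non-overlap: ||center_i - center_j||^2 >= (r_i + r_j)^2
        cons.append({"type": "ineq", "fun": lambda z, i=i, j=j:
                     (z[3*i]-z[3*j])**2 + (z[3*i+1]-z[3*j+1])**2
                     - (z[3*i+2]+z[3*j+2])**2})

bounds = [(0, 1), (0, 1), (0, 0.5)] * n
z0 = np.array([0.5, 0.5, 0.3, 0.15, 0.15, 0.1, 0.85, 0.85, 0.1])  # feasible start
res = minimize(neg_sum_radii, z0, method="SLSQP", bounds=bounds, constraints=cons)
print("sum of radii:", -res.fun)
```

A local solver like SLSQP only polishes the configuration it is started from; the global solvers in the paper additionally prune regions of the search space with rigorous bounds.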
Results. With unmodified out-of-the-box solvers we reproduce or match every benchmark result we attempted, and in several cases we strictly improve upon the previous best-known solutions. A condensed view:
| Problem | Instance | Our result | Previous best |
|---|---|---|---|
| Circle packing, square | $n = 32$ | 2.93957 | 2.93794 (AE 2025) |
| Circle packing, rectangle | $n = 26$ | 2.63930 | 2.638 (Cantrell 2011) |
| Circle packing, rectangle | $n = 27$ | 2.69015 | 2.687 (Cantrell 2011) |
| Hexagon packing | $n = 11$ | 3.92485 | 3.93010 (AE 2025) |
| Hexagon packing | $n = 12$ | 3.94165 | 3.94192 (AE 2025) |
| Hexagon packing | $n = 14$ | 4.26900 | 4.27240 (Friedman 2015) |
| Hexagon packing | $n = 15$ | 4.44769 | 4.45406 (Friedman 2015) |
| Hexagon packing | $n = 16$ | 4.52788 | 4.53633 (Friedman 2015) |
Note. For Circle packing the objective is max sum of radii, so larger is better, whereas for Hexagon packing the objective is min side length of the outer hexagon, so smaller is better.
Wall-clock times are seconds to minutes on commodity hardware. Compare this with AlphaEvolve’s pipeline of LLM-driven code generation, execution, and evolutionary refinement per instance.
Constraint density matters. One observation from the paper is particularly interesting: the smallest improvements over AlphaEvolve were on the least constrained problem (min-max distance ratio), while the largest were on the most constrained one (hexagon packing, with its trigonometric rotations and Farkas-based non-overlap). So it seems the optimization advantage grows with constraint density.
The OpenEvolve footnote. The paper flags a particularly instructive anecdote. In an OpenEvolve experiment on packing 26 circles, the LLM-driven evolutionary loop converged to generating code that calls SciPy’s Sequential Least Squares Programming solver (SLSQP). The LLM, given freedom, chose to be a thin wrapper around a classical local NLP solver.
A 3D extension. The same NLP machinery extends to 3D. As an illustration, the widget below shows the current best-known solution (which is due to us) to the cube-in-cube packing problem from Erich Friedman’s classical packing benchmarks: pack $n = 11$ unit cubes (rotations allowed) into the smallest containing cube. Every cube is independently positioned and rotated in $\mathrm{SO}(3)$; the non-overlap constraints are pairwise separations between rotated boxes; the objective is to minimize the side length of the container.
Case 3: patchworked curves and AI as a subroutine
The first two cases aim at the “classical beats AI-driven discovery” narrative, at the top level of the stack. The third case is complementary in some sense, as it moves the question one layer down in the stack: a problem where AI is genuinely useful, but as a component inside a classical structured search, not as the search engine itself. The associated papers are 121 Patchworked Curves of Degree Seven [GJKMPSWZ26a] and its companion Fast Isotopy Computation for T-Curves [GJKMPSWZ26b], with the same eight authors (Geiselmann, Joswig, Kastner, Mundinger, me, Spiegel, Wack, Zimmer) from our Algebraic Curves MATH+ project.
The question. Hilbert’s 16th problem asks for the topological classification of smooth real plane projective algebraic curves. Viro [V84] classified the 121 real schemes (ambient isotopy types) of smooth degree-$7$ curves in 1984. A natural, much harder question was left open: can every one of the 121 real schemes actually be realized by a T-curve, i.e., constructed via Viro’s patchworking? This question was raised explicitly by Itenberg and Viro in 1996, and was open for almost thirty years.
Why it is hard. A T-curve of degree $d$ is specified by a regular unimodular triangulation of the dilated triangle $d \cdot \Delta_2$ together with a sign distribution on its lattice points. By the Patchworking Theorem, this combinatorial datum pins down the isotopy type of a smooth real plane curve of degree $d$. Settling the 121-classes question in the constructive direction is, at bottom, a structured search: enumerate triangulations and sign distributions, compute the resulting isotopy type, and check that every one of the 121 classes is hit.
Where AI came in. The bottleneck was not the search framework itself but the actual isotopy-type computation. This subroutine is at the core of the search, mapping “triangulation + signs” to “which of the 121 classes this is”. Classical implementations were far too slow to run at the scale needed. Using AI in the design process, we built a new, near-quadratic algorithm together with a highly efficient GPU implementation that evaluates roughly $10^9$ real schemes per second [GJKMPSWZ26b]. Equipped with that subroutine, the structured search (deemed impossible by many) suddenly became feasible, and the companion paper [GJKMPSWZ26a] provides explicit patchworks (and hence explicit polynomials) for every one of the 121 real schemes, settling the Itenberg–Viro question in the positive.
To understand the scale a little better: for a single degree-$8$ triangulation there are $2^{42}$ sign distributions to check, about $12.64$ base-10 orders of magnitude (OOMs). A compute-day buys us about $4.9$ OOMs (there are roughly $10^{4.9}$ seconds in a day), and for degree $8$ the “hash rate” drops to about $10^8$ real schemes per second, i.e., $8$ OOMs, so we need roughly one GPU-day to check a single degree-$8$ triangulation exhaustively. In contrast, when we started out (with what was already fairly optimized code), we were running at roughly $4$-$5$ OOMs, i.e., roughly somewhere between $2.7$ and $27.4$ years for a single degree-$8$ triangulation.
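The back-of-the-envelope arithmetic above can be checked in a few lines (nothing here is from the papers; it is just the rounded math):

```python
import math

sign_distributions = 2 ** 42                 # per degree-8 triangulation
ooms_needed = math.log10(sign_distributions)  # ~12.64 OOMs to enumerate
ooms_per_day = math.log10(86400)              # a compute-day: ~4.9 OOMs of seconds
rate_ooms = 8                                 # ~1e8 schemes/sec at degree 8

# Available per GPU-day: rate_ooms + ooms_per_day ~ 12.9 OOMs, just above the
# ~12.64 needed, so a single triangulation fits into about one GPU-day.
gpu_days = sign_distributions / (10 ** rate_ooms * 86400)
print(round(ooms_needed, 2), round(ooms_per_day, 2), round(gpu_days, 2))
```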
What is the difference? Notice what did and did not happen here. The AI component did not propose the curves, guide the enumeration, or rank candidates: a patchwork is either valid or not and the isotopy-type answer is exact. The AI component replaced a slow but conceptually classical subroutine with a much faster one. The discovery engine on top is a classical, exhaustive, structured search.
The widget gives an idea of how Viro patchworking works.
The pattern
The three case studies point in the same direction. Rich Sutton’s bitter lesson [BL] says that general methods that scale with compute eventually beat methods relying on human-crafted knowledge. On problems where structure is hard to exploit (image recognition, protein folding, natural-language understanding, game play with sparse and shifting heuristics), this has been demonstrated empirically more than once, and I would not argue with it.
But there is an inverse of this lesson on problems where structure is cheap to exploit. Tensor decomposition has a theory of correctness and a theory of local rewrites that preserve correctness. Circle packing has 200 years of extremal geometry behind it, and hexagon packing can be written as a compact NLP that a modern solver prunes very effectively. Real algebraic plane curves have a century of patchworking and tropical geometry behind them. When a problem’s structure is already available in closed form, an AI system that has to rediscover it by exploration is paying the compute tax without getting the benefit. It can succeed (as shown by the AlphaTensor and AlphaEvolve results) but it succeeds in spite of not using the structure, not because of it.
At the same time, benefiting from this inverse lesson depends on someone (or something) formulating the problem in code or as an optimization problem in the first place. This is exactly what happened in the OpenEvolve anecdote, where the AI formulated the actual optimization problem and then solved it with a classical solver. This is also a paradigm we have observed to be very powerful: the AI/LLM as a meta-orchestrator of tools (e.g., special-purpose solvers) that solve the actual problem of interest.
Below, for comparison, is a rough estimate of the compute utilized; take these numbers with a grain of salt (see description).
What this means for AI × math
I want to be careful here because I think the weak version of this story (“AI is overrated; pick up a solver”) is wrong, unhelpful, and just plain stupid. The strong version might run more along the following lines:
Mathematics has, at this point, a several-thousand-year head start on taxonomy, methods, and approaches. Whole subfields (algebraic complexity, global optimization, polyhedral geometry, numerical linear algebra) each represent a century or more of accumulated answers to the question: what structure does this problem have, and how do I exploit it? Having an LLM/AI rederive everything we know each time misses the point; it is not just wasteful but also inefficient and error-prone. An enormous amount of the “AI for math” debate, in my view, is really a debate about taxonomy (for a given problem, which tool fits?) plus a lot of “see what my LLM can do!” demonstrations. LLMs are spectacularly good at orchestrating the tools we already have, taking a “soft” input down to progressively harder formulations such as an optimization problem. A lot of the benefit comes from leveraging this incredible power to “harden” softer inputs into formal(ized) representations that can then be processed with more formal tools, such as solvers and formal verifiers.
As such, AI fits naturally into the stack between the researcher and formal science by “hardening” soft human thought. This leads to hybrid systems that use strong learned models for the parts they are good at (exploration, hypothesis generation, problem formulation, heuristic guidance) and then interface with downstream, often more classical, methods. For those of you interested, this is precisely the approach we follow in our agentic researcher workflow.
Open questions
At least two questions seem immediate.
- Meta-level routing. Given a new problem, can an AI system choose the right approach, i.e., determine the handover point? In some sense the AI becomes the “glue code” between the researcher and the downstream formal, classical, etc. methods.
- Formulation as the hybrid interface. The OpenEvolve-to-SLSQP anecdote suggests a pattern in which an LLM’s main job on a well-structured problem is to write the model, and a classical solver’s job is to solve it. This is exactly the direction of OptiChat [OC], ORLM [ORLM], and related work.
References
[AT] Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, Pushmeet Kohli, Discovering faster matrix multiplication algorithms with reinforcement learning, Nature 2022. link
[AE] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, Matej Balog, AlphaEvolve: a coding agent for scientific and algorithmic discovery, 2025. link
[FS] Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, Alhussein Fawzi, Mathematical discoveries from program search with large language models, Nature 2023. link
[AD] Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Köppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, David Silver, Faster sorting algorithms discovered using deep reinforcement learning, Nature 2023. link
[AG] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, He He, Thang Luong, Solving olympiad geometry without human demonstrations, Nature 2024. link
[KM23] Manuel Kauers, Jakob Moosbauer, Flip Graphs for Matrix Multiplication, in Proceedings of the 2023 International Symposium on Symbolic and Algebraic Computation (ISSAC ‘23), ACM, 2023. link
[KGP26] Kirill Khoruzhii, Patrick Gelß, Sebastian Pokutta, Faster Algorithms for Structured Matrix Multiplication via Flip Graph Search, ISSAC 2026. link
[BKMPP26] Timo Berthold, Dominik Kamp, Gioni Mexi, Sebastian Pokutta, Imre Pólik, Global Optimization for Combinatorial Geometry Problems Revisited in the Era of LLMs, 2026. link
[GJKMPSWZ26a] Zoe Geiselmann, Michael Joswig, Lars Kastner, Konrad Mundinger, Sebastian Pokutta, Christoph Spiegel, Marcel Wack, Max Zimmer, 121 Patchworked Curves of Degree Seven, 2026. link
[GJKMPSWZ26b] Zoe Geiselmann, Michael Joswig, Lars Kastner, Konrad Mundinger, Sebastian Pokutta, Christoph Spiegel, Marcel Wack, Max Zimmer, Fast Isotopy Computation for T-Curves, 2026. link
[V84] Oleg Viro, Gluing of plane real algebraic curves and constructions of curves of degrees 6 and 7, in Topology (Leningrad, 1982), Lecture Notes in Mathematics 1060, Springer, 1984. link
[BL] Richard S. Sutton, The Bitter Lesson, 2019. link
[OC] Hao Chen, Gonzalo Esteban Constante-Flores, Krishna Sri Ipsit Mantri, Sai Madhukiran Kompalli, Akshdeep Singh Ahluwalia, Can Li, OptiChat: Bridging Optimization Models and Practitioners with Large Language Models, 2025. link
[ORLM] Chenyu Huang, Zhengyang Tang, Shixi Hu, Ruoqing Jiang, Xin Zheng, Dongdong Ge, Benyou Wang, Zizhuo Wang, ORLM: A Customizable Framework in Training Large Models for Automated Optimization Modeling, Operations Research 73(6), 2025. link