TL;DR: Our ICLR 2026 paper Neural Sum-of-Squares by Nico Pelleriti, Christoph Spiegel, Shiwei Liu, David Martínez-Rubio, Max Zimmer, and Sebastian Pokutta uses a Transformer to predict compact monomial bases for sum-of-squares certificates, then verifies them with an exact SDP. On structured instances: approx. 1000x speedups. On large problems: we solve what standard methods can’t. Nico has written an excellent detailed walkthrough of the method; here I want to focus on the broader picture.

The design pattern

This paper is a great example of a pattern that has been emerging across our group’s recent work, which I colloquially refer to as “AI suggests, mathematics verifies” and which is strongly related to Algorithms with Predictions [MV20]. The idea is simple. Many exact mathematical methods (or, more broadly, algorithms) spend most of their time searching a (too) large space. If a trained ML model can narrow down that space, even imperfectly, the exact method often becomes much faster. The key, of course, is that this must not jeopardize the correctness of the method. Often it can be arranged so that if the prediction is good we are fast, and if it is not we fall back to the slow base method with at most a small penalty factor; the idea is also quite similar to optimism in first-order methods (see e.g. [DR25], [RMP25]).
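The pattern can be sketched in a few lines. This is a minimal, generic illustration, not the paper’s actual interface; `predict_candidates`, `exact_verify`, and `full_search_space` are hypothetical stand-ins, and the toy “certificate” (writing an integer as a sum of two squares) is chosen only because it is trivially checkable.

```python
# Minimal sketch of the "AI suggests, mathematics verifies" pattern.
# All names here are hypothetical stand-ins, not the paper's API.

def solve_with_prediction(problem, predict_candidates, exact_verify, full_search_space):
    """Try the learned proposal first; fall back to the exhaustive space."""
    candidates = predict_candidates(problem)         # fast, possibly wrong
    certificate = exact_verify(problem, candidates)  # slow, always correct
    if certificate is not None:
        return certificate                           # good prediction: fast path
    # Bad prediction: correctness is preserved, we only pay a time penalty.
    return exact_verify(problem, full_search_space(problem))

# Toy instantiation: "certify" that a target integer is a sum of two squares
# drawn from a candidate set.
def verify(target, candidates):
    for a in candidates:
        for b in candidates:
            if a * a + b * b == target:
                return (a, b)
    return None

cert = solve_with_prediction(
    25,
    predict_candidates=lambda t: [3, 4],   # an (imagined) model's guess
    exact_verify=verify,
    full_search_space=lambda t: range(t),
)
# cert is an exact certificate regardless of prediction quality.
```

The point of the structure: the learned component only ever touches the *candidate set*, never the verdict.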

We have seen this pattern work in several different settings now:

  • Basis Prediction for Border Bases [KPIZP25]: a Transformer predicts the structure of the order ideal of a border basis.
  • Learning heuristics for branch-and-bound [CKG+21]: a model learns when to call MIP heuristics; the solver still proves optimality.
  • The Agentic Researcher [ZPRP26]: AI agents propose conjectures and experiments; mathematicians verify and steer. This operates a couple of levels higher up, but the idea remains the same.

Neural Sum-of-Squares is the latest and perhaps cleanest instance of this pattern. A Transformer proposes where to look for a mathematical certificate; a semidefinite program decides whether the certificate is actually there. If it is not, the algorithm is still correct but slower, since some repair has to be done; if the prediction was good, the semidefinite program can be solved fast.

The problem in one paragraph

Given a polynomial $p(x)$, we want to certify that it is globally nonnegative by writing it as a sum of squares. The standard approach formulates this as a semidefinite program (SDP), but the SDP size depends on the monomial basis used to express the certificate. Choose too large a basis and the SDP becomes intractable. Choose too small a basis and the certificate may not exist in that search space. The bottleneck is selecting the right basis.
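To see why basis selection is the bottleneck, it helps to count. The standard candidate basis for a degree-$2d$ certificate in $n$ variables consists of all monomials of degree at most $d$, of which there are $\binom{n+d}{d}$, and the Gram matrix of the SDP has that many rows and columns. A small sketch of the growth (the mapping of table configurations to $(n, d)$ is my reading of “deg” as the polynomial degree $2d$):

```python
from math import comb

# Number of monomials of degree <= d in n variables: C(n + d, d).
# This is the side length of the Gram matrix in the standard SOS SDP.
def standard_basis_size(n_vars: int, half_degree: int) -> int:
    return comb(n_vars + half_degree, half_degree)

print(standard_basis_size(2, 2))    # 2 vars, degree-4 polynomial: 6 monomials
print(standard_basis_size(8, 10))   # "8 vars, deg 20" regime: 43758 monomials
print(standard_basis_size(100, 5))  # "100 vars, deg 10" regime: 96560646 monomials
```

A $96{,}560{,}646 \times 96{,}560{,}646$ Gram matrix is hopeless; any method that wants to touch such regimes has to find a drastically smaller basis first.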

A quick example

Consider the polynomial

\[p(x_1, x_2) = 4x_1^4 + 12x_1^2x_2^2 + 9x_2^4 + 1.\]

The standard degree-based basis would be

\[\mathbf{z}(\mathbf{x}) = [1, x_1, x_2, x_1x_2, x_1^2, x_2^2]^\top,\]

which leads to a $6 \times 6$ SDP. But the actual certificate only needs

\[\mathbf{z}'(\mathbf{x}) = [1, x_1^2, x_2^2]^\top,\]

giving a $3 \times 3$ SDP instead. Even in this toy example, that halves the matrix dimension and quarters the number of matrix entries. For real problems the reduction is far more dramatic, and it is often the difference between a practical solve and no solve at all.
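For this particular polynomial the certificate over the reduced basis can be written down and checked directly, with no SDP solver needed. The Gram matrix below is one concrete choice on my part, corresponding to the decomposition $p = (2x_1^2 + 3x_2^2)^2 + 1^2$; the check is the standard one, positive semidefiniteness of $Q$ plus $p(x) = \mathbf{z}'(x)^\top Q\, \mathbf{z}'(x)$:

```python
import numpy as np

# Gram matrix over the reduced basis z'(x) = [1, x1^2, x2^2], chosen here
# from the decomposition p = (2*x1^2 + 3*x2^2)^2 + 1^2 (my own worked choice).
Q = np.array([[1.0, 0.0, 0.0],
              [0.0, 4.0, 6.0],
              [0.0, 6.0, 9.0]])

# Check 1: Q is positive semidefinite (eigenvalues 0, 1, 13 up to rounding).
assert np.linalg.eigvalsh(Q).min() >= -1e-9

# Check 2: z'^T Q z' reproduces p, tested at random points.
rng = np.random.default_rng(0)
for _ in range(100):
    x1, x2 = rng.standard_normal(2)
    z = np.array([1.0, x1**2, x2**2])
    p = 4 * x1**4 + 12 * x1**2 * x2**2 + 9 * x2**4 + 1
    assert abs(z @ Q @ z - p) < 1e-6
```

Any PSD factorization $Q = L^\top L$ then reads off an explicit sum of squares, which is exactly what the SDP searches for in general.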

What we do

The method has three stages:

  1. Predict. A Transformer, trained on 200M+ synthetic SOS polynomials, autoregressively predicts a compact basis.
  2. Repair. A greedy step ensures the predicted basis covers all necessary monomials.
  3. Verify and expand. An exact SDP checks feasibility. If it fails, we iteratively expand the basis until it works, or eventually fall back to the classical candidate set.
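The three stages can be sketched as a loop. This is pseudocode under my own naming, not the paper’s implementation; `model.predict_basis`, `missing_support_monomials`, `sdp_feasible`, and `expand` are hypothetical placeholders for the components described above:

```python
# Hedged sketch of the predict / repair / verify-and-expand loop.
# All helper names are hypothetical placeholders.

def certify(poly, model, classical_basis, max_rounds=10):
    # 1. Predict: the Transformer proposes a compact monomial basis.
    basis = set(model.predict_basis(poly))

    # 2. Repair: greedily add monomials the certificate provably needs,
    #    so that products of basis monomials cover poly's support.
    basis |= missing_support_monomials(poly, basis)

    # 3. Verify and expand: the exact SDP has the final say.
    for _ in range(max_rounds):
        gram = sdp_feasible(poly, sorted(basis))
        if gram is not None:
            return gram                  # exact certificate found
        basis = expand(poly, basis)      # enlarge the search space

    # Fallback: the classical degree-based candidate set.
    return sdp_feasible(poly, classical_basis)
```

Note that every return path goes through `sdp_feasible`; the learned model can only influence how long the loop takes, never what it outputs.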

The critical point: the Transformer never certifies anything. It only suggests a smaller search space; the mathematics, i.e. the exact method, has the final say. For the full technical details, including tokenization, reverse sampling for training data, and the permutation-based expansion strategy, see Nico’s excellent blog post.

The results

| Configuration | Ours | Best baseline | Speedup |
| --- | --- | --- | --- |
| 8 vars, deg 20 | 1.5s | 2,675s (TSSOS) | 1,800x |
| 6 vars, deg 20 | 5.6s | 86.5s (TSSOS) | 15x |
| 6 vars, deg 40 | 42.8s | timeout | only we solve |
| 100 vars, deg 10 | 18.3s | timeout | only we solve |

On structured polynomial families, the Transformer learns to identify compact bases that reduce SDP size by an order of magnitude. The last two rows are the most important: these are problems that standard methods (SoS.jl, TSSOS, Chordal-TSSOS) simply cannot solve within time or memory limits. On small or unstructured problems, the method degrades gracefully back to the classical pipeline. There is some overhead from the Transformer and repair steps, but no loss of correctness.

Why I think this matters beyond SOS

What excites me about this work is not just the numbers but the template. The “propose-verify” pattern has a few properties that make it broadly useful for AI4Math:

Graceful degradation. If the learned model is bad, you just get the standard method. This means the downside is bounded, while the upside can be enormous. Compare this to end-to-end learned solvers where a bad prediction gives you a wrong answer.

Separation of concerns. The neural model handles pattern recognition (which basis elements are likely needed?) and the exact method handles certification (is this actually a valid SOS decomposition?). Each part does what it is best at.

Structure exploitation. The model learns to exploit algebraic structure that rule-based methods miss. Sparse and block-diagonal polynomial families have regularities in their coefficient matrices that the Transformer picks up. Dense, unstructured families do not — and the method correctly does not try to compress those.

Scalability. This is perhaps the most practically important point. SOS programming is fundamentally limited by basis size. Methods that can systematically reduce basis size open up problem classes that are currently out of reach. Going from “infeasible” to “18 seconds” is not an incremental improvement.

The bigger AI4Math picture

This paper fits into a broader program in our group: developing AI systems that augment mathematical reasoning without replacing its guarantees. In the Agentic Researcher framework, AI agents run computational experiments and propose conjectures, but humans and formal verifiers close the loop. In our neural discovery work on the Hadwiger-Nelson problem, neural networks discovered improved colorings of the plane, but the combinatorial verification is exact.

The common thread is that AI and mathematics are most powerful not when one replaces the other, but when they work in concert — each handling the part of the problem it is best suited for.

Neural Sum-of-Squares is a particularly clean instance of this: the Transformer handles search space reduction, the SDP handles certification. Neither could do the other’s job effectively.

Paper and code

References

[P03] Parrilo, P. A. (2003). Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming, 96(2), 293-320.

[CKG+21] Chmiela, A., Khalil, E., Gleixner, A., Lodi, A., and Pokutta, S. (2021). Learning to Schedule Heuristics in Branch-and-Bound. NeurIPS 2021. arxiv

[RMP25] Roux, C., Martínez-Rubio, D., and Pokutta, S. (2025). Implicit Riemannian Optimism with Applications to Min-Max Problems. Proceedings of the 42nd International Conference on Machine Learning (ICML), 267, 52139–52172. arxiv

[DR25] Martínez-Rubio, D., and Pokutta, S. (2026). Beyond Short Steps in Frank-Wolfe Algorithms. To appear in Proceedings of the International Conference on Learning Representations (ICLR). arxiv

[KPIZP25] Kera, H., Pelleriti, N., Ishihara, Y., Zimmer, M., and Pokutta, S. (2025). Computational Algebra with Attention: Transformer Oracles for Border Basis Algorithms. Advances in Neural Information Processing Systems (NeurIPS), 38. arxiv

[MV20] Mitzenmacher, M., and Vassilvitskii, S. (2020). Algorithms with Predictions. arXiv preprint. arxiv