Beyond Chain of Thought: How Padding Tokens Give Transformers More Power Without Sacrificing Speed

3 min read · Sep 17, 2025

Chain of thought reasoning is a popular method that increases the capabilities of transformer language models for solving sequential reasoning problems. However, chain of thought reasoning forces the model to spend many steps solving even simple problems, wasting computational resources on sequential thinking that could be done in parallel. CDS PhD alumnus William Merrill and his collaborator Ashish Sabharwal have proven that, in principle, a much simpler approach — adding blank “padding” tokens to the input — could also expand transformers’ computational power without sacrificing the speed advantages of parallel processing.

Their paper, “Exact Expressive Power of Transformers with Padding,” resolves an open question in the field by proving that transformers with polynomial padding can solve exactly the class of problems known as TC⁰ — the same ceiling that previous work had established as an upper bound, but had never shown was achievable.

The breakthrough reveals a fundamental insight about how transformers process information. While standard transformers are limited in their ability to reason about complex relationships between multiple elements in their input, adding padding tokens gives them enough “storage space” to track all possible combinations of variables needed to solve harder problems.

“A nice thing about the padding approach is that it’s more parallelizable than chain-of-thought,” Merrill said. “With chain-of-thought, you need to wait before you can generate the next output. But with padding, you don’t need to wait at all.”

The difference becomes clear through a concrete example. Consider the “3SUM” problem: given a list of numbers, determine whether any three of them add up to a specific target like 42.

“The natural way you would solve this problem is you would read the list and then at the last number, you would look back over all triples of positions in the list, and you would check for each one whether these three numbers sum to 42,” Merrill explained. Standard transformers “can only look back at one thing at a time. So you can’t really look back at all the triples in the list.”
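
To make the example concrete, here is a minimal brute-force 3SUM check in Python. It is only the naive algorithm Merrill describes, not the transformer construction from the paper, and the sample numbers are illustrative:

```python
from itertools import combinations

def three_sum(numbers, target=42):
    """Return True if some triple of positions in the list sums to the target."""
    # Mirror the "look back over all triples" step: check every
    # combination of three positions in the list.
    for a, b, c in combinations(numbers, 3):
        if a + b + c == target:
            return True
    return False

print(three_sum([10, 3, 29, 5, 7]))  # True: 10 + 3 + 29 = 42
print(three_sum([1, 2, 3]))          # False: the only triple sums to 6
```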

Padding tokens solve this limitation by providing dedicated space for the transformer to store and manipulate multiple variables simultaneously. The key insight is that nᵏ padding tokens give transformers enough computational space to evaluate logical formulas over k variables — sufficient to capture the full TC⁰ complexity class.
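
Mechanically, padding is nothing more than appending blank tokens to the input before running the model. The sketch below illustrates only that idea; the pad-token name and the exact nᵏ budget are assumptions for illustration, not the paper's construction:

```python
def pad_input(tokens, k, pad_token="<pad>"):
    """Append blank tokens until the sequence has length n**k.

    The extra n**k - n positions act as scratch space -- roughly one
    position per assignment of k variables over the n input positions.
    """
    n = len(tokens)
    return tokens + [pad_token] * (n ** k - n)

padded = pad_input(["7", "15", "20", "42"], k=2)
print(len(padded))  # 16 positions (4**2), 12 of them padding
```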

The work extends beyond basic padding to examine “looping” — repeating the same transformer layers a number of times that grows with input length. Merrill and Sabharwal proved that transformers with both padding and polylogarithmic looping can solve problems in TCᵈ, a significantly more powerful class that includes Boolean formula evaluation, graph connectivity, and context-free language recognition.
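
Looping, unlike adding depth, reuses one block of layers over and over, with the repetition count chosen at inference time. The snippet below is a rough sketch of that control flow, assuming a generic `block` function and treating the ⌈log₂ n⌉ᵈ repetition schedule as an illustrative choice rather than the paper's exact construction:

```python
import math

def looped_forward(hidden_states, block, d=2):
    """Apply the same transformer block repeatedly.

    The number of repetitions grows polylogarithmically with the
    sequence length n, on the order of (log n)**d, instead of being
    fixed by the model's depth at training time.
    """
    n = len(hidden_states)
    repetitions = max(1, math.ceil(math.log2(n)) ** d)
    for _ in range(repetitions):
        hidden_states = block(hidden_states)
    return hidden_states
```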

This combination approaches the theoretical ceiling of NC — the largest class of problems solvable with parallel computation under standard complexity theory conjectures. Chain of thought reasoning can solve problems beyond this ceiling, but only by sacrificing parallelism entirely.

The theoretical advances point toward practical applications in test-time compute — methods for giving models more computational resources during inference. Current approaches like chain of thought require sequential generation, making inference slow and expensive.

“Various works have suggested that chain-of-thought might be somewhat wasteful — that you need to spend a lot of steps to solve even some problems that are quite simple,” Merrill said. The padding approach “can gain expressive power in a more parallelizable way, which is practically appealing.”

The research connects transformer capabilities to classical results in logic and complexity theory, bridging machine learning with decades of theoretical computer science. By proving exact characterizations rather than just upper bounds, the work provides the mathematical foundation needed to reason systematically about transformer modifications using familiar tools like reductions and complete problems.

Merrill recently accepted a position as assistant professor at the Toyota Technological Institute at Chicago, where he will continue theoretical research on the computational power of language modeling architectures and ways to improve them. In particular, this work suggests that padding and looping deserve empirical investigation as more efficient alternatives to chain-of-thought for solving moderately parallelizable reasoning problems.

By Stephen Thomas

Have feedback on our content? Help us improve our blog by completing our (super quick) survey.
