The Illusion of State: Uncovering the Limitations of State-Space Models

NYU Center for Data Science
May 23, 2024

State-space models are a promising new alternative to the transformer architecture that powers ChatGPT — however, they may not be a panacea for all limitations of transformers. New research from CDS reveals that state-space models, much like transformers, struggle with state tracking tasks that are thought to be fundamental to many real-world applications of artificial intelligence.

This groundbreaking work builds upon previous research by CDS PhD student William Merrill and Ashish Sabharwal from the Allen Institute for AI, which used complexity theory to study the intrinsic capabilities and limitations of transformers. Their earlier findings were recently featured in a Quanta Magazine article titled “How Chain-of-Thought Reasoning Helps Neural Networks Compute,” which explored the impact of chain-of-thought prompting on the computational power of transformers.

“There’s a nice connection between that work and this work,” Merrill said. “That work analyzed transformers and the type of reasoning they can express. This new work focuses on asking the same questions about a new class of models: state-space models.”

Merrill and Sabharwal’s new research, undertaken with co-author Jackson Petty, a Linguistics PhD student at NYU, has been published in a paper titled “The Illusion of State in State-Space Models.” The study investigates the expressive power of state-space models for state tracking tasks, comparing them to the previously established limitations of transformers.

State-space models have recently gained popularity as an alternative to transformers, and they have been touted as a potential solution to transformers' inherent limitations in processing sequential data. However, the new research suggests that their apparent advantage in tracking state may be an illusion.

One of the motivations for state-space models is their close architectural similarity to recurrent neural networks (RNNs), which are known for their ability to handle sequential data and maintain state. In contrast, transformers, despite their ubiquity in natural language processing, have been shown to struggle with certain types of sequential computation and state tracking. The work demonstrates that both linear and Mamba-style state-space models, like transformers, cannot express simple state-tracking problems like permutation composition, which lies at the heart of real-world tasks such as tracking chess moves, evaluating code, and tracking entities in a narrative.
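To make permutation composition concrete, here is a minimal Python sketch of the kind of state tracking described above. It is an illustration only, not code from the paper: the cup-and-ball setup and the names `compose` and `track` are assumptions chosen for the example. Each move is a permutation of positions, and knowing where everything ends up means composing every move in order.

```python
from functools import reduce

def compose(p, q):
    """Return the permutation that applies p first, then q.

    A permutation is a tuple where p[i] is the new position of the
    item currently at position i.
    """
    return tuple(q[p[i]] for i in range(len(p)))

def track(moves, n):
    """Compose a sequence of moves, starting from 'nothing has moved'."""
    identity = tuple(range(n))
    return reduce(compose, moves, identity)

# Three cups; a ball starts under the cup at position 0.
swap01 = (1, 0, 2)  # swap the cups at positions 0 and 1
swap12 = (0, 2, 1)  # swap the cups at positions 1 and 2

final = track([swap01, swap12, swap01], 3)
ball_position = final[0]  # where the item that started at 0 ended up
print(final, ball_position)
```

The same pattern applies when the "cups" are chess pieces, program variables, or characters in a story: the answer depends on the whole history of moves, applied in order.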

“We think that state tracking is a fundamental capability that comes up in a lot of AI reasoning tasks,” Merrill explained. “So it seems like a pretty fundamental limitation that state-space models can’t express it.”

The researchers also conducted experiments that confirmed their theoretical predictions. Both transformers and state-space models struggled to learn permutation composition with a fixed number of layers, while RNNs could compose permutations with just a single layer.
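One way to see why a recurrent model has an easier time is to write the task as a single update rule applied once per input, carrying a state forward. The sketch below is a hand-written loop, not a trained RNN and not the experimental setup from the paper; `step`, `run_recurrent`, and `invert` are hypothetical names used only for illustration.

```python
import random

def step(state, perm):
    """One recurrent update: apply `perm` to the running arrangement."""
    return tuple(perm[s] for s in state)

def run_recurrent(perms, n=5):
    """Process the sequence left to right, one update per input."""
    state = tuple(range(n))  # identity: nothing has moved yet
    for perm in perms:
        state = step(state, perm)
    return state

def invert(perm):
    """Return the permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return tuple(inv)

# A random sequence of 50 permutations of five elements.
rng = random.Random(0)
seq = [tuple(rng.sample(range(5), 5)) for _ in range(50)]
forward = run_recurrent(seq)

# Sanity check: undoing the moves in reverse order restores the identity.
undo = [invert(p) for p in reversed(seq)]
restored = run_recurrent(seq + undo)
print(restored)
```

The loop applies the same update at every position, but each step must wait for the previous one to finish, which is the sequential dependence that parallel architectures are designed to avoid.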

Merrill and his colleagues also propose “extensions” to state-space models that allow them to solve permutation composition. However, these extensions may come at the cost of reduced parallelism and could negatively affect learning dynamics.

“We aren’t saying that state-space models are doomed,” Merrill clarified. “Also, transformers can’t express hard state tracking either, and we know transformers are still useful in practice. We are saying that if you want to claim that a model can actually be stateful, you need to be careful about how you design it and go beyond current approaches.”

The findings of this research have important implications for the design of language models, and Merrill is quick to note that there is still much exploration to be done with AI architecture design. “There are still unexplored techniques that could be really interesting,” he said. “A model could be created that has some state-space layers and some transformer layers — by interleaving them, you might be able to get the best of both worlds.”

As AI continues to advance and tackle increasingly complex tasks, understanding the strengths and limitations of different architectures is crucial. The work of Merrill and his colleagues sheds much-needed light on the challenges that lie ahead, paving the way for the development of more powerful and efficient language models.

By Stephen Thomas
