
LLMs Switch to Guesswork Once Instructions Get Long

3 min read · Sep 10, 2025

Even the most advanced language models abandon rigorous problem-solving strategies and switch to educated guessing when faced with complex instructions, much like humans who resort to shortcuts when overwhelmed. Linguistics PhD student Jackson Petty and colleagues at CDS, including CDS Associate Professor of Linguistics and Data Science Tal Linzen, CDS Faculty Fellow Shauli Ravfogel, and CDS PhD students Will Merrill, Michael Hu, and Wentao Wang, discovered this troubling pattern while testing how well AI systems follow compositional instructions, using a new framework called RELIC.

The research, detailed in the paper “RELIC: Evaluating Compositional Instruction Following via Language Recognition,” revealed that current language models — including OpenAI’s then-most sophisticated reasoning model o3 — consistently failed when asked to parse formal languages that require applying multiple rules in sequence. The task mirrors everyday scenarios where AI systems must follow complex, multi-step instructions without examples to guide them.

“One of the most common ways that we interact with language models is by asking them to complete a task based on a list of instructions that you type out in a prompt,” Petty said. The problem occurs when these instructions require sophisticated reasoning about information scattered throughout the model’s context window.

RELIC works by generating formal grammars — sets of rules that define a language — and asking models to determine whether specific strings of symbols can be generated by those grammars. This computational task requires models to compose multiple rules in the correct order, sometimes applying each rule multiple times.
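To make the task concrete, here is a minimal sketch in Python (not drawn from the RELIC codebase; the two-rule toy grammar and the function name are invented for illustration). It encodes a grammar in Chomsky normal form and uses the classic CYK algorithm to decide whether a string belongs to the language, which is the kind of yes/no membership judgment RELIC poses, except that RELIC’s grammars can be far larger.

```python
# Minimal illustration of grammar membership checking (not the RELIC code).
# Toy grammar in Chomsky normal form: S -> A B, A -> 'a', B -> 'b'.
from itertools import product

RULES = {
    ("A", "B"): {"S"},   # binary rules: (left, right) -> possible heads
}
TERMINALS = {
    "a": {"A"},          # terminal rules: character -> possible heads
    "b": {"B"},
}

def cyk_accepts(string: str, start: str = "S") -> bool:
    """Decide whether `string` can be derived from the start symbol."""
    n = len(string)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive string[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        table[i][0] = set(TERMINALS.get(ch, set()))
    for span in range(2, n + 1):              # substring length
        for i in range(n - span + 1):         # start position
            for split in range(1, span):      # where to cut the substring
                left = table[i][split - 1]
                right = table[i + split][span - split - 1]
                for b, c in product(left, right):
                    table[i][span - 1] |= RULES.get((b, c), set())
    return start in table[0][n - 1]

print(cyk_accepts("ab"))   # True: S -> A B -> 'a' 'b'
print(cyk_accepts("ba"))   # False: no rule produces this order
```

Answering correctly requires composing the rules in the right order across the whole string, which is precisely the systematic work the models stop doing as grammars grow.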

The team found that models initially attempt the correct approach for simple problems, carefully working through grammar rules to construct parse trees. But as the complexity increases — whether through longer strings or more grammar rules — the models abandon this systematic approach. “We find this interesting phenomenon, where, as prompts get more complicated, models change strategies,” Petty explained. “Models are ‘quiet-quitting’ when we give them hard tasks, kind of like some people might.”

This behavioral shift has serious implications for real-world AI applications. When models switch to heuristic reasoning, they might appear to solve problems correctly while actually relying on superficial patterns rather than following instructions. “If you ask ChatGPT something, and you give it data, and it gives you back an answer, you would hope that it’s actually doing the thing that you told it to do, and not just giving a vibes-based, hand-wavey response,” Petty said.

The research team tested eight different language models on grammars with up to 500 production rules and strings up to 50 symbols long. All models, including OpenAI’s reasoning-capable o3, approached chance performance as complexity increased. Most concerning was the discovery that models reduce their computational effort precisely when they should be working harder — the number of reasoning tokens generated actually decreased for longer, more complex examples.

The work represents a bridge between traditional computational linguistics and modern AI systems. “This project is sort of an instantiation of thinking about a kind of task that we would like language models to be good at, and thinking about ways that we can use these formal methodologies that linguists have cared about historically,” Petty noted.

RELIC’s design addresses two persistent problems in AI evaluation: data contamination and benchmark saturation. Because the framework generates synthetic grammars and examples on demand, it can produce unlimited novel test cases that no model could have encountered during training. This allows researchers to scale complexity as AI capabilities improve.
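A hypothetical sketch of that idea (again, not the authors’ implementation; the rule counts, depth limit, and function names below are invented) shows why contamination is a non-issue: every call draws a brand-new random grammar, so there is no fixed test set to memorize, and the number of rules can simply be dialed up as models improve.

```python
# Hypothetical on-demand test generation, for illustration only.
import random

def random_cnf_grammar(n_nonterminals=4, n_binary_rules=6, alphabet="ab"):
    """Sample a fresh toy grammar in Chomsky normal form."""
    nts = ["S"] + [f"N{i}" for i in range(1, n_nonterminals)]
    binary = {}   # (left, right) -> set of heads, i.e. head -> left right
    for _ in range(n_binary_rules):
        head, left, right = (random.choice(nts) for _ in range(3))
        binary.setdefault((left, right), set()).add(head)
    terminal = {ch: {random.choice(nts)} for ch in alphabet}  # char -> heads
    return binary, terminal

def sample_string(binary, terminal, symbol="S", depth=5):
    """Expand a nonterminal top-down into a string it derives (may fail)."""
    chars = [ch for ch, heads in terminal.items() if symbol in heads]
    expansions = [(l, r) for (l, r), heads in binary.items() if symbol in heads]
    if depth == 0 or not expansions:
        return random.choice(chars) if chars else None
    left, right = random.choice(expansions)
    l_str = sample_string(binary, terminal, left, depth - 1)
    r_str = sample_string(binary, terminal, right, depth - 1)
    return l_str + r_str if l_str and r_str else None

# Positive examples come from the grammar itself; negative examples can be
# random strings the grammar rejects, checked with a recognizer such as CYK.
binary, terminal = random_cnf_grammar()
samples = [sample_string(binary, terminal) for _ in range(200)]
print([s for s in samples if s][:5])   # a few strings this grammar generates
```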

The theoretical foundations suggest the ‘quiet-quitting’ problem may be fundamental to current transformer architectures. Context-free language recognition falls into computational complexity classes beyond what transformers can handle without chain-of-thought reasoning, and even with additional reasoning steps, the required computation time should grow substantially with input length. Instead, the researchers observed models giving up and switching to shortcuts.
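As a rough back-of-the-envelope illustration of that last point (this counts the work of the standard dynamic-programming recognizer used above, not the paper’s own analysis): the number of intermediate steps a faithful rule-by-rule check requires grows roughly with the cube of the string length, so reasoning effort should rise sharply on longer inputs rather than fall.

```python
# Illustrative only: count the (start, span, split) combinations a standard
# CYK-style recognizer visits for a string of n symbols. The count grows
# roughly as n**3 / 6, so longer strings demand many more reasoning steps.
def cyk_cell_updates(n: int) -> int:
    return sum((n - span + 1) * (span - 1) for span in range(2, n + 1))

for n in (10, 25, 50):
    print(n, cyk_cell_updates(n))   # 10 -> 165, 25 -> 2600, 50 -> 20825
```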

The findings reveal that current estimates of language models’ instruction-following capabilities may be overly optimistic. While models excel at needle-in-a-haystack tasks that require retrieving specific information, they struggle with compositional reasoning that demands systematic application of multiple rules. The team plans to explore whether training on formal language tasks could improve models’ general instruction-following abilities.

By Stephen Thomas

Have feedback on our content? Help us improve our blog by completing our (super quick) survey.
