Training Transformers: Formal Languages as the Key to Efficient Learning

Apr 17, 2025

Training a transformer language model from scratch typically demands vast quantities of natural language data, pushing computational limits and costs to extremes. New research, however, shows there might be a better starting point: formal languages.

A recent study led by CDS PhD student Michael Y. Hu and co-authored by CDS PhD student William Merrill, CDS MS grad Chuan Shi, NYU Linguistics PhD student Jackson Petty, and CDS Associate Professor Tal Linzen, demonstrated that training transformer models initially on formal languages, such as structured strings of brackets, can notably speed up subsequent training on natural language data. The paper is titled “Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases.”

The study reports a striking result: transformers pre-pretrained on certain structured formal languages required 33% fewer natural language tokens to reach equivalent performance on tasks such as next-word prediction and judging grammatical acceptability. “Pre-pretraining on formal languages gives models a significant head start,” said Michael Hu. “This initial step seems to efficiently teach the models fundamental linguistic structures.”

Prior work hinted that context-sensitive formal languages, which can represent hierarchical dependencies like those in human languages, were particularly effective for training. But as Hu explained, context sensitivity alone wasn’t enough. “We realized we had to pick languages that transformers found ‘easy’ to learn,” Hu said. By defining ease through computational constraints specific to transformers, they identified a formal language called “shuffle Dyck” — a structured system of interleaved brackets.
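
To make the shuffle Dyck idea concrete, here is a minimal Python sketch of a corpus sampler and a membership check. The bracket vocabulary, sequence lengths, and sampling scheme are illustrative assumptions rather than the paper’s exact setup; the defining property is that each bracket type must be balanced on its own, while different types may interleave freely instead of nesting.

```python
import random

# Hypothetical bracket vocabulary; the paper's exact token set may differ.
BRACKETS = [("(", ")"), ("[", "]"), ("{", "}")]

def sample_dyck1(open_tok, close_tok, n_pairs, rng):
    """Sample one balanced Dyck-1 string with n_pairs bracket pairs."""
    out, depth = [], 0
    remaining_open, remaining_close = n_pairs, n_pairs
    while remaining_open or remaining_close:
        # Must open when nothing is open; otherwise open or close at random.
        if remaining_open and (depth == 0 or rng.random() < 0.5):
            out.append(open_tok); remaining_open -= 1; depth += 1
        else:
            out.append(close_tok); remaining_close -= 1; depth -= 1
    return out

def sample_shuffle_dyck(n_pairs_per_type, rng=random):
    """Randomly interleave independently balanced Dyck-1 streams, one per type."""
    streams = [sample_dyck1(o, c, n_pairs_per_type, rng) for o, c in BRACKETS]
    out = []
    while any(streams):
        stream = rng.choice([s for s in streams if s])
        out.append(stream.pop(0))
    return " ".join(out)

def is_shuffle_dyck(tokens):
    """Valid iff each bracket type balances on its own; cross-type nesting is free."""
    close_to_open = {c: o for o, c in BRACKETS}
    depth = {o: 0 for o, _ in BRACKETS}
    for tok in tokens:
        if tok in depth:
            depth[tok] += 1
        else:
            depth[close_to_open[tok]] -= 1
            if depth[close_to_open[tok]] < 0:
                return False
    return all(d == 0 for d in depth.values())

print(sample_shuffle_dyck(4))              # e.g. "( [ { ( ) ] } [ ) ] ..."
print(is_shuffle_dyck("( [ ) ]".split()))  # True: "(" and "[" balance separately
print(is_shuffle_dyck("( ] [ )".split()))  # False: "]" closes before "[" opens
```

A corpus of such strings can then be tokenized and fed to the model exactly like any other pretraining corpus during the brief formal-language phase.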

Transformers pre-pretrained on shuffle Dyck not only learned faster but also generalized better on syntactic tasks than models trained only on natural language or pre-pretrained on the other formal languages tested. The researchers also found that certain computational sub-networks that developed during the formal language phase persisted into natural language training, underpinning improved performance on linguistic evaluations.

One intriguing possibility highlighted by the research involves potential implications for controlling variance in model training — the unpredictability of model performance due to factors like random initialization. “We suspect pre-pretraining could reduce this variance,” said Hu, though he noted that further studies are needed to confirm this benefit conclusively.

While the method isn’t yet standard in production-scale models, Hu suggested that the application could be straightforward and beneficial: “I would recommend incorporating a brief formal language training phase. Our findings suggest it could streamline training processes.”
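
In practice, that recommendation amounts to a small change in the training schedule. Below is a minimal, generic PyTorch sketch of what a two-phase run might look like, assuming the model maps token IDs to next-token logits; the step counts, learning rate, and loader names are placeholders, not the paper’s settings.

```python
import torch
import torch.nn.functional as F

def pre_pretrain_then_train(model, formal_loader, natural_loader,
                            formal_steps=10_000, natural_steps=100_000, lr=3e-4):
    """Two-phase schedule: a brief formal-language phase, then ordinary
    language-model training on natural text with the same weights and optimizer.
    Step counts and learning rate are illustrative placeholders."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def run_phase(loader, num_steps):
        model.train()
        step = 0
        while step < num_steps:
            for input_ids, labels in loader:      # labels = inputs shifted by one
                logits = model(input_ids)         # (batch, seq, vocab)
                loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       labels.reshape(-1))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                step += 1
                if step >= num_steps:
                    break

    run_phase(formal_loader, formal_steps)    # phase 1: e.g. shuffle Dyck strings
    run_phase(natural_loader, natural_steps)  # phase 2: natural language corpus
    return model
```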

The team plans to continue exploring precisely how formal languages enhance transformers’ abilities. For now, their results provide compelling evidence that a brief detour into the structured world of formal languages could substantially ease the immense computational demands of modern language modeling.

By Stephen Thomas
