Language Models Optimize Biologically Realistic Synthetic Sequences, Potentially Helping Drug Discovery

NYU Center for Data Science

Creating new drug candidates typically requires extensive and costly laboratory testing for each potential molecule. Now CDS PhD student Angelica Chen and collaborators, in “LLMs are Highly-Constrained Biophysical Sequence Optimizers,” have developed a powerful new method that could help optimize sequences while respecting strict constraints, with potential applications to drug discovery.

The research team, led by Chen along with CDS PhD alumnus Samuel D. Stanton and Nathan C. Frey at Prescient Design, developed a technique called Language Model Optimization with Margin Expectation (LLOME). LLOME trains large language models to iteratively refine discrete sequences while strictly adhering to constraints modeled on those found in real biological systems.

The approach leverages a key advantage of language models: their flexible interface. “One of the things that makes large language models so nice to use compared to a lot of other algorithms that might be used in drug design is that they are very flexible in terms of their user interface — you can use natural language to control the model, versus a lot of other algorithms or models where you might have to format your instructions in very specific ways,” Chen said.

The project brought together machine learning experts and structural biologists to identify the key molecular constraints that matter most when designing new drug candidates. CDS Professor of Computer Science and Data Science Kyunghyun Cho and former CDS Advisory Committee Faculty Richard Bonneau provided high-level guidance on both the machine learning and biological aspects of the work.

The team, which also included Robert G. Alberstein, Andrew M. Watkins, and Vladimir Gligorijević (all at Prescient Design), demonstrated that their approach finds better candidate sequences while requiring fewer oracle evaluations, that is, fewer calls to the expensive scoring step that judges each candidate, than traditional methods. On their test problems, LLOME achieved significantly lower error rates than genetic algorithms, the current standard approach.
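To make the idea of optimizing under constraints with a limited number of oracle evaluations concrete, here is a minimal Python sketch. It is not the team's method: the language model is replaced by a random single-symbol edit, and the alphabet, the constraint in `is_feasible`, the `CountingOracle` class, and the target string are all hypothetical stand-ins chosen for illustration.

```python
import random

ALPHABET = "ACDE"  # hypothetical toy alphabet, not from the paper


def is_feasible(seq):
    """Hypothetical constraint: no symbol may appear three times in a row."""
    return all(
        not (seq[i] == seq[i + 1] == seq[i + 2]) for i in range(len(seq) - 2)
    )


class CountingOracle:
    """Toy stand-in for an expensive evaluation; counts how often it is called."""

    def __init__(self, target):
        self.target = target
        self.calls = 0

    def __call__(self, seq):
        self.calls += 1
        # Score = number of positions that match the hidden target.
        return sum(a == b for a, b in zip(seq, self.target))


def propose(seq, rng):
    """Stand-in for a model's proposal: change one randomly chosen position."""
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(ALPHABET) + seq[i + 1:]


def optimize(start, oracle, budget, max_proposals=10_000, seed=0):
    """Greedy refinement that never exceeds `budget` oracle calls."""
    rng = random.Random(seed)
    best, best_score = start, oracle(start)
    for _ in range(max_proposals):
        if oracle.calls >= budget:
            break
        candidate = propose(best, rng)
        if not is_feasible(candidate):
            continue  # infeasible candidates are rejected without an oracle call
        score = oracle(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    oracle = CountingOracle("ADECADEC")
    best, score = optimize("ACACACAC", oracle, budget=50)
    print(best, score, oracle.calls)
```

The point of the sketch is the bookkeeping: candidates that break the constraint are discarded before the costly evaluation, and the loop stops once the evaluation budget is spent. A method that proposes better candidates reaches a given score with fewer oracle calls.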

A key innovation was a new training method called Margin-Aligned Expectation (MargE), which helps language models smoothly improve their outputs over multiple iterations. This allows LLOME to efficiently explore the vast space of possible molecules while staying within biological constraints.

While still in early stages, the research establishes important proof points about the potential of language models in drug discovery. “This paper helps show that language models are capable of this type of more highly constrained task,” Chen said, though she noted they “haven’t yet tackled the problem of whether or not success on this particular benchmark translates to success in the lab.”

The work represents an important step toward using AI to accelerate drug discovery. By reducing the number of physical experiments required while improving candidate quality, approaches like LLOME could help bring new treatments to patients faster and at lower cost.

“The eventual ambition is to develop this kind of single, general purpose model that can take very flexible inputs and accomplish all of these different parts of the drug design pipeline in basically one interface,” Chen said.

The research demonstrates the value of bringing together different types of expertise. The team’s ability to incorporate real biological constraints into their AI system stemmed directly from close collaboration between machine learning researchers and structural biologists who could identify which constraints mattered most for designing viable molecules.

By Stephen Thomas
