Research spotlight: Masked graph modeling for molecule generation

2 min read · Jun 17, 2021

In the turbulent times of a global pandemic, the importance of modern medical advances such as drugs and therapeutics is more apparent than ever. However, the process of discovering new drugs is time-consuming and expensive. One major hurdle is the sheer number of possible molecules that can be synthesized. With so many possibilities, it is difficult for models to generate molecules with desired properties. Machine learning approaches can help address this problem by generating molecules automatically rather than relying on explicitly enumerated heuristics or expert intuition alone.

A team of scientists comprising CDS PhD student Omar Mahmood, recent Computer Science PhD graduate Elman Mansimov, and CDS faculty Richard Bonneau and Kyunghyun Cho recently authored a paper titled “Masked graph modeling for molecule generation,” which was published in Nature Communications. In the paper, they propose a new approach that they refer to as “masked graph modeling.” Whereas many modern deep learning models serialize molecules as strings and work with these string representations, the masked graph model generates molecules directly in their graph representations. It does this by randomly masking out parts of each input molecular graph and then learning the conditional probability distribution of the masked-out parts given the rest of the graph. After training is complete, the model samples repeatedly from these learned distributions, in a manner analogous to Gibbs sampling, to generate novel molecular graphs.
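The mask-and-resample loop described above can be illustrated with a toy sketch. It is not the authors' implementation. The graph encoding (a list of atom symbols plus a dictionary of bond orders), the tiny vocabularies, and the names `uniform_predictor` and `gibbs_style_step` are all assumptions made for illustration. The uniform predictor stands in for a trained neural network, and no chemical validity checks are performed.

```python
import random

# Toy vocabularies (assumptions for illustration, not the paper's).
ATOM_TYPES = ["C", "N", "O"]
BOND_TYPES = [0, 1, 2]  # 0 = no bond, 1 = single, 2 = double
MASK = None             # placeholder value for a masked component


def uniform_predictor(atoms, bonds, kind, key):
    """Hypothetical stand-in for a trained model.

    A real model would condition on the unmasked parts of the graph;
    this placeholder ignores them and returns a uniform distribution.
    """
    values = ATOM_TYPES if kind == "atom" else BOND_TYPES
    return {v: 1.0 / len(values) for v in values}


def gibbs_style_step(atoms, bonds, mask_rate, rng, predict=uniform_predictor):
    """Mask a random subset of graph components, then resample each one."""
    atoms, bonds = list(atoms), dict(bonds)
    components = [("atom", i) for i in range(len(atoms))]
    components += [("bond", pair) for pair in bonds]

    # 1. Pick which components to mask (at least one).
    n_mask = max(1, round(mask_rate * len(components)))
    masked = rng.sample(components, n_mask)

    # 2. Replace the chosen components with the mask placeholder.
    for kind, key in masked:
        if kind == "atom":
            atoms[key] = MASK
        else:
            bonds[key] = MASK

    # 3. Resample each masked component from the predicted distribution.
    for kind, key in masked:
        probs = predict(atoms, bonds, kind, key)
        values, weights = zip(*probs.items())
        new_value = rng.choices(values, weights=weights, k=1)[0]
        if kind == "atom":
            atoms[key] = new_value
        else:
            bonds[key] = new_value
    return atoms, bonds


if __name__ == "__main__":
    rng = random.Random(0)
    # Three atoms; bond orders are stored for every pair (i, j) with i < j.
    atoms = ["C", "C", "O"]
    bonds = {(0, 1): 1, (0, 2): 0, (1, 2): 1}
    for _ in range(10):  # repeated sampling steps
        atoms, bonds = gibbs_style_step(atoms, bonds, 0.3, rng)
    print(atoms, bonds)
```

Swapping `uniform_predictor` for a function that actually inspects the unmasked atoms and bonds is what would make the resampled graphs resemble the training molecules.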

The authors evaluated this new approach on the GuacaMol distribution-learning benchmark using the QM9 and ChEMBL datasets. They found that, in general, many of the benchmark's metrics are correlated with one another. They examined how different models trade off these metrics and found that the masked graph model can trade them off more effectively than existing models, achieving a good balance between novelty and similarity to the set of molecules that the model was trained on. They also showed that their model can successfully generate molecules with specified properties without compromising physicochemical similarity to the training distribution.
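To make the novelty side of that trade-off concrete, here is a simplified sketch of two set-based scores, uniqueness and novelty. These are not the benchmark's exact implementations. The sketch assumes molecules are given as already-canonicalized identifier strings, and the function names and example strings are hypothetical.

```python
def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated molecules absent from the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


if __name__ == "__main__":
    # Made-up example strings, assumed to be canonical identifiers.
    generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
    training = ["CCO", "CC"]
    print(uniqueness(generated), novelty(generated, training))
```

A model that simply copies its training set would score zero on novelty while looking very similar to the training distribution, which is why the two goals have to be balanced.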

Omar Mahmood spoke to us about the research and remarked:

“Machine learning-based molecule generation is key to driving down the high failure rates of candidate drug compounds. We are excited to advance the frontier of graph generation and to see how our framework will impact the generation of molecules directly in their graph representations.”

NYU Center for Data Science

Written by NYU Center for Data Science

Official account of the Center for Data Science at NYU, home of the Undergraduate, Master’s, and Ph.D. programs in Data Science.
