Unveiling the Hidden Code of Life: A New Approach to Gene Regulatory Network Inference

NYU Center for Data Science
May 8, 2024

Inside every cell lies a hidden code — a complex program that determines everything from eye color to susceptibility to disease. This code is not written in the familiar ones and zeros of computers, but rather in the language of gene expression, controlled by the interactions of proteins called transcription factors with their target genes. Deciphering this code is the key to understanding the fundamental processes of life and unlocking new frontiers in medicine.

Now, in a new paper published in Genome Biology, researchers Claudia Skok Gibbs, a PhD student at CDS, and Omar Mahmood, a CDS PhD alumnus, have developed a novel approach called PMF-GRN (Probabilistic Matrix Factorization for Gene Regulatory Network Inference) that brings us one step closer to cracking this code.

“I got into this research five, six years ago now,” said Skok Gibbs. She started by working on an existing regulatory inference program called the Inferelator, developed by her advisor, Richard Bonneau, during his PhD. However, the Inferelator was designed for older sequencing technology and struggled with the sparse datasets generated by newer single-cell RNA sequencing methods.

To address this challenge, Skok Gibbs adapted the framework to single-cell technology, making it more scalable and able to incorporate additional genomic information. Along the way, she identified several key issues in gene regulatory network inference that needed to be addressed.

The first issue was the inflexibility of existing algorithms, which were designed for specific datasets and couldn’t easily adapt to new sequencing technologies or biological questions. “These frameworks are pretty inflexible,” explained Skok Gibbs. PMF-GRN tackles this by using a probabilistic approach with latent variables that can be easily modified as new data becomes available.
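The matrix factorization in the method's name can be pictured with a toy example. The sketch below is an illustration under simplifying assumptions, not the authors' model: it treats an expression matrix as the product of a per-cell transcription factor activity matrix and a gene-by-factor regulatory network, plus noise. All names, sizes, and noise levels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, n_tfs = 200, 30, 4  # hypothetical toy sizes

# Latent transcription factor (TF) activity in each cell (cells x TFs).
tf_activity = rng.normal(size=(n_cells, n_tfs))

# Latent regulatory network (genes x TFs): a 0/1 mask saying which TF
# regulates which gene, times a signed strength for each interaction.
mask = rng.random((n_genes, n_tfs)) < 0.2
strength = rng.normal(size=(n_genes, n_tfs))
network = mask * strength

# Observed expression is modeled as activity times network, plus noise.
noise = 0.1 * rng.normal(size=(n_cells, n_genes))
expression = tf_activity @ network.T + noise

# If TF activity were known, ordinary least squares would recover the
# network. In practice both factors are latent, which is why inference
# over latent variables is needed instead.
recovered, *_ = np.linalg.lstsq(tf_activity, expression, rcond=None)
recovered = recovered.T  # genes x TFs
max_error = np.abs(recovered - network).max()
print(f"largest error in a recovered weight: {max_error:.3f}")
```

Because the two factors are separate latent variables in this picture, changing an assumption about one of them (for example, the noise model for a new sequencing technology) does not require redesigning the other.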

The second issue was the lack of principled model selection. Most algorithms rely on heuristics, without justifying why a particular approach was chosen or whether it is the best fit for the data. PMF-GRN performs hyperparameter search and variational inference to find the optimal model, replacing heuristic model selection with a systematic evaluation of different generative models and hyperparameter configurations.
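As a rough illustration of what systematic selection means, the sketch below scores several candidate hyperparameter values with a single objective and keeps the best one. It is a stand-in, not the paper's procedure: it uses a one-gene Bayesian linear regression whose log evidence has a closed form, where a method like PMF-GRN would rely on variational inference. The candidate grid, the noise level, and the choice of log evidence as the selection criterion are all assumptions.

```python
import numpy as np

def log_evidence(x, y, prior_sd, noise_sd=0.5):
    """Log marginal likelihood of y when weights ~ N(0, prior_sd^2 I)
    and observation noise ~ N(0, noise_sd^2 I), weights integrated out."""
    n = len(y)
    cov = noise_sd**2 * np.eye(n) + prior_sd**2 * (x @ x.T)
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Hypothetical toy data: 100 cells, 3 TFs, one target gene.
rng = np.random.default_rng(2)
x = rng.normal(size=(100, 3))
y = x @ np.array([1.5, 0.0, -0.8]) + 0.5 * rng.normal(size=100)

# Score every candidate prior width with the same objective, keep the best.
candidates = [0.01, 0.1, 1.0, 10.0]
scores = {sd: log_evidence(x, y, sd) for sd in candidates}
best_prior_sd = max(scores, key=scores.get)
print("selected prior_sd:", best_prior_sd)
```

The point of the design is that every configuration is compared on the same footing, so the choice can be justified by the score rather than by habit.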

The third issue was the difficulty in assessing the accuracy of predicted regulatory interactions, especially in complex systems like humans where gold standard datasets are limited. By using a probabilistic approach and variational inference, PMF-GRN provides uncertainty estimates for each predicted interaction. “Uncertainty estimates can be useful in situations where there are limited validated interactions or a gold standard is incomplete,” said Skok Gibbs.
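To see what a per-interaction uncertainty estimate looks like, consider a much simpler stand-in for the full model: Bayesian linear regression for a single gene with a Gaussian prior, where the posterior over each TF-to-gene weight is available in closed form. This is a sketch under stated assumptions (known TF activity, known noise level, hypothetical names and values), not the variational procedure used in the paper.

```python
import numpy as np

def posterior(tf_activity, gene_expression, noise_sd=0.5, prior_sd=1.0):
    """Posterior mean and standard deviation of TF-to-gene weights under
    a Gaussian prior and Gaussian noise (conjugate, closed form)."""
    n_tfs = tf_activity.shape[1]
    precision = (tf_activity.T @ tf_activity / noise_sd**2
                 + np.eye(n_tfs) / prior_sd**2)
    cov = np.linalg.inv(precision)
    mean = cov @ tf_activity.T @ gene_expression / noise_sd**2
    return mean, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
true_weights = np.array([1.5, 0.0, -0.8])  # hypothetical ground truth

def simulate(n_cells):
    """Draw toy TF activities and one gene's noisy expression."""
    x = rng.normal(size=(n_cells, 3))
    y = x @ true_weights + 0.5 * rng.normal(size=n_cells)
    return x, y

mean_few, sd_few = posterior(*simulate(10))
mean_many, sd_many = posterior(*simulate(1000))
print("posterior sd with 10 cells:  ", sd_few)
print("posterior sd with 1000 cells:", sd_many)
```

Each weight comes with its own standard deviation, and that spread shrinks as more cells are observed. An interaction with a large estimated weight but a wide posterior is a weaker claim than one with a narrow posterior, which is useful information when no gold standard is available to check against.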

A key contributor to the development of PMF-GRN was Kyunghyun Cho, Professor of Computer Science and Data Science at CDS. Cho’s expertise in machine learning algorithms, particularly matrix factorization, was instrumental in shaping the project. “There was a moment around 2021, where all his students were walking around saying ‘matrix factorization’, which is when I got super excited about it,” recalled Skok Gibbs. The collaboration between Skok Gibbs, with her focus on biological problems, and Cho, with his algorithmic ideas, proved to be a powerful combination.

The researchers demonstrated the advantages of PMF-GRN by applying it to datasets from yeast, human peripheral blood mononuclear cells (PBMCs), and synthetic data from BEELINE. In the yeast datasets, which have good gold standards, PMF-GRN outperformed three state-of-the-art regression-based methods. For the PBMC data, Skok Gibbs manually searched the internet for evidence of thousands of predicted interactions, finding 60 papers that supported those regulatory relationships. “That took about a week of my life,” she joked.

The potential applications of accurate gene regulatory network inference are immense. In cancer patients, it could enable the development of precisely targeted chemotherapy drugs. More broadly, it could shed light on everything from the mechanisms of autoimmune diseases to the process of development from a cluster of cells into a fully formed human. “How did your cells know to become eye cells and brain cells?” mused Skok Gibbs. “There’s a program, and gene regulatory networks reveal this program to you.”

Achieving these ambitious goals will require further advances in both computational methods and sequencing technology. “One of the major things that’s preventing this from being an easily solvable problem is the available sequencing technology,” said Skok Gibbs. She explained that right now all we have are “little snapshots” of what’s happening inside cells, from which so much has to be inferred — though that’s beginning to change. “They’re starting to develop sequencing technology where cells stay alive, from which they take measurements over time. That’s a very exciting development.”

For Skok Gibbs, the motivation behind the work is clear. “I want to help people by reducing the burden of illness and improving overall health outcomes,” she said. With PMF-GRN, Skok Gibbs and her colleagues have taken an important step towards that goal, bringing us closer to a future where the hidden code of life is no longer a mystery, but a tool for improving human health and well-being.

By Stephen Thomas
