NetQuilt, a protein function prediction method: Interview with CDS PhD student Meet Barot

The body is an incredibly complex system, and its distinctive qualities are derived from an individual’s DNA. Think of DNA as a set of instructions; it gives your body directions to create all sorts of proteins needed for the structure, function, and regulation of the body’s tissues and organs. Proteins exist in all organisms because they are fundamental building blocks of life. But for all that is known about proteins, little is known about what most of them actually do.

Meet Barot is a PhD student at CDS focused on using data science to identify each protein’s function, given the limited amount of information that currently exists on proteins. Recently, he worked with CDS associated professor Richard Bonneau and CDS Associate Professor Kyunghyun Cho on NetQuilt, a protein function prediction method. We asked Meet to explain his work on the project.

(Interview is slightly edited for brevity.)

Tell us a little about your project, NetQuilt.

So, NetQuilt is a protein function prediction method that takes in protein interaction networks of different organisms. A lot of existing function prediction methods only use sequences, which gives you a limited amount of information to work with.

It’s similar to social networks; you’re able to determine something about a person based on whom they associate with or the types of things they like. That kind of information can tell you a lot about someone. In the same way, if you know what proteins associate with each other, that can tell you a great deal about what a protein does.

Another factor is homology, a term used to describe a type of sequence similarity. Homology implies some sort of evolutionary relationship; if two proteins have high sequence similarity, it’s more likely that they evolved from a common ancestor. That’s another kind of relationship that we wanted to encode in our model. We wanted a way to combine the evolutionary/homology relationships and the protein interaction network relationships in an informative way for a machine learning model to take as input and predict protein function as output.
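To make the idea concrete, here is a toy sketch of how interaction edges within each organism and homology edges between organisms could be merged into one graph, with a function guessed from labeled neighbors. This is not NetQuilt's actual implementation: the protein names, similarity scores, labels, and the simple weighted neighbor vote are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: every name, edge, score, and label here is invented.
interaction_edges = {
    "human": [("hA", "hB"), ("hB", "hC")],
    "yeast": [("yA", "yB")],
}
# Cross-species homology links with a made-up sequence-similarity score in [0, 1].
homology_edges = [("hA", "yA", 0.9), ("hC", "yB", 0.4)]

# Only the well-studied organism has known functions in this toy example.
known_functions = {"hA": "DNA binding", "hB": "DNA binding", "hC": "transport"}


def build_combined_graph(interactions, homology, interaction_weight=1.0):
    """Merge within-species interaction edges and cross-species homology
    edges into one weighted, undirected adjacency map."""
    graph = defaultdict(dict)
    for edges in interactions.values():
        for a, b in edges:
            graph[a][b] = interaction_weight
            graph[b][a] = interaction_weight
    for a, b, score in homology:
        graph[a][b] = score
        graph[b][a] = score
    return graph


def predict_function(graph, labels, protein):
    """Weighted vote over the labeled neighbors of `protein`.
    Returns None when no neighbor has a known label."""
    votes = Counter()
    for neighbor, weight in graph[protein].items():
        if neighbor in labels:
            votes[labels[neighbor]] += weight
    if not votes:
        return None
    return votes.most_common(1)[0][0]


graph = build_combined_graph(interaction_edges, homology_edges)
for protein in ("yA", "yB"):
    print(protein, "->", predict_function(graph, known_functions, protein))
```

In this sketch the unlabeled yeast proteins pick up labels only through their homology links to human proteins, which is the kind of cross-organism transfer described above. A real model would learn from the whole network rather than take a one-step vote.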

How did you get involved in establishing NetQuilt?

Previously, I had been working on protein function prediction with my co-author Vladimir Gligorijević. Eventually, Vlad, Richard Bonneau, and I developed a method called deepNF, a protein interaction network-based function prediction method for a single organism. However, we wanted to expand the model to use multiple organisms’ data. This would allow more knowledge to be transferred from proteins in well-studied organisms to proteins in organisms that are not well studied. My work with NetQuilt improved upon deepNF, as it analyzes the relationships between multiple organisms instead of just one, so you can transfer what we already know from existing organisms. It’s a pretty significant step up in terms of functionality and also accuracy. In general, it’s both a qualitative and quantitative kind of improvement.

What conclusions have you reached with your research, and what implications does your research have?

I am primarily interested in the constant improvement of these methods. In terms of the actual use of NetQuilt, I view it as mainly a stepping stone to better models. However, there are many potential applications for this method, not just protein function prediction. This is a very general network-based model. Because it’s a node labeling problem, you could replace the end task with predicting some other quality about proteins, rather than the Gene Ontology terms that we were originally predicting.

Potentially, someone could use this algorithm for protein function prediction. But the field is advancing and more sophisticated algorithms are coming out all the time. We’ll likely have improved versions in the near future.

Tell us more about what you’re currently working on!

The current work that I’m doing is to discover new functions entirely. We currently have systems that describe what proteins do in terms of previously known functions. Previously, we were trying to assign proteins a particular label that we already knew about. For example, we knew that it’s possible for a protein to bind DNA, and we could assign that label to a new protein. Now, the problem that I’m focusing on is describing the function of a protein without any such constraints, so that we can discover new functions entirely, by generating descriptions of a protein’s function in natural language. I think it’s pretty exciting. If we were able to make this kind of system, it could have many implications for biologists. Beyond an improvement in predictive performance, it’s a different paradigm.
