How Should Self-Supervised Learning Models Represent Their Data?

NYU Center for Data Science
Nov 3, 2023

Self-supervised learning (SSL) has emerged as a powerful technique for training deep neural networks without extensive labeled data. However, unlike supervised learning, where labels help identify relevant information, the optimal SSL representation depends heavily on assumptions made about the input data and the desired downstream task. This is the subject of two recent papers by Ravid Shwartz-Ziv, a Faculty Fellow at CDS, and CDS founding director Yann LeCun: “An Information Theory Perspective on Variance-Invariance-Covariance Regularization,” also with CDS Instructor Tim G. J. Rudner, among others, and “To Compress or Not to Compress — Self-Supervised Learning and Information Theory: A Review.” The former was recently accepted to NeurIPS, and the latter to JMLR.

In an interview, Shwartz-Ziv explains his motivation: “I truly believe SSL is the way to make better algorithms. This is how humans learn.” But there are many ways to represent information in neural networks, which raises the question of what kind of representation is optimal. The research of Shwartz-Ziv, LeCun, and their co-authors reviews optimal representation in supervised, unsupervised, semi-supervised, and multiview contexts, where it is readily articulable, and in self-supervised learning, where they call it “an open question.”

One complication is that, in an SSL context with unlabeled data, the representation relies heavily on assumptions made about your data’s connection to your downstream task. Part of Shwartz-Ziv and LeCun’s work was emphasizing the importance of understanding and explicitly defining these assumptions.

However, the ways people make assumptions about how to train their SSL models, and what trade-offs they’re making in doing so, are currently poorly understood. So it’s important to understand what you’re gaining and losing from each choice you make. As Shwartz-Ziv pointed out, “It’s similar to the ‘no free lunch’ theory — you can’t get something general that will work really well in any scenario without making any assumptions.”

Shwartz-Ziv’s background in information theory and neuroscience led him to apply these lenses to analyze this problem. In particular, Shwartz-Ziv and LeCun used the information bottleneck principle, first developed by Naftali Tishby twenty-five years ago and subsequently developed by Shwartz-Ziv.

The information bottleneck principle is a communication concept, where a sender aims to transmit a message to a receiver. The goal is to compress the original message (or input) in such a way that only the relevant parts of the information, in relation to some task or variable, are retained, while the irrelevant parts are discarded or ‘compressed out’.

Mathematically, it can be framed as a trade-off: On one hand, you want the compressed message (or representation) to keep as little information as possible about the original input. On the other hand, you want this representation to retain as much information as possible about a specific relevant variable or task, so that what gets discarded is only the irrelevant ‘noise’.
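For readers who want the trade-off in symbols, the information bottleneck is commonly written as a single objective. The notation here is not from the article and is assumed for illustration: X is the input, T is the compressed representation, Y is the relevant variable or task, I(·;·) is mutual information, and β is a positive weight that sets how much relevance is worth relative to compression.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

A small β favors aggressive compression of the input, while a large β favors keeping everything that helps predict Y.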

In the context of deep learning, the assumptions you make about the relationships between the input data, representations, and desired downstream tasks will inform what kinds of trade-offs you make, offering a way to formalize and compare different assumptions made in self-supervised learning algorithms about what information is relevant or irrelevant.

There is no right or wrong way to make these trade-offs in the abstract, but being informed about the consequences of these decisions will be highly important to deep learning engineers seeking to create better SSL models for specific purposes.

Shwartz-Ziv and LeCun present a unified framework to compare SSL methods’ objectives, assumptions, and challenges. A key insight is that under the MultiView assumption, which states that the most relevant information is that which is shared between multiple views, optimization involves maximizing mutual information between representations while compressing irrelevant information. However, as datasets grow and models serve multiple downstream tasks, the MultiView assumption becomes less applicable. Shwartz-Ziv cautions that new methods are needed to separate relevant and irrelevant information without this assumption.
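The MultiView idea can be made concrete with a toy calculation. The sketch below is not from either paper; it assumes a made-up world in which two views share one signal bit and each carries one private noise bit, and it uses a hypothetical helper, mutual_information, to compare two candidate representations of the first view.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, between the two coordinates of
    a list of equally likely (a, b) outcomes."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


# Toy world (assumed for illustration): a shared bit s plus one private
# noise bit per view, with all eight combinations equally likely.
worlds = list(product([0, 1], repeat=3))  # (s, n1, n2)
view1 = [(s, n1) for s, n1, n2 in worlds]
view2 = [(s, n2) for s, n1, n2 in worlds]

# Two candidate representations of view 1.
candidates = {
    "keep_all": view1,                       # copy the view unchanged
    "keep_shared": [s for s, _ in view1],    # keep only the shared bit
}

results = {}
for name, z in candidates.items():
    results[name] = {
        "about_other_view": mutual_information(list(zip(z, view2))),
        "about_own_view": mutual_information(list(zip(z, view1))),
    }
    print(name, results[name])
```

In this toy setup both representations carry the same information about the other view, but the one that keeps only the shared bit stores less about its own view. That is the sense in which, under the MultiView assumption, compression removes information that the other view cannot confirm.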

When asked about practical applications and areas where these insights might be immediately used, Shwartz-Ziv highlighted the potential in multi-modalities and tabular data. “Current neural networks are not working so well for tabular data,” he said, explaining that this is an area that their work, with further research, could benefit. On the other hand, Shwartz-Ziv and LeCun demonstrated that, by carefully selecting assumptions, they can improve the performance of current vision models.

Shwartz-Ziv’s collaboration with Yann LeCun has been instrumental in this research. Describing their working relationship, he said, “Yann is one of the smartest people, and he brings his thirty or forty years of multidisciplinary experience to every meeting. He really knows how to navigate the project and to ask the right questions.” Their combined expertise has undoubtedly paved the way for these groundbreaking papers.

By Stephen Thomas


