Maximum Manifold Capacity Representations: A Step Forward in Self-Supervised Learning

NYU Center for Data Science
5 min read · Sep 13, 2024


The world of multi-view self-supervised learning (SSL) can be loosely grouped into four families of methods: contrastive learning, clustering, distillation/momentum, and redundancy reduction. Now, a new approach called Maximum Manifold Capacity Representations (MMCR) is redefining what’s possible, and a recent paper by CDS members and their collaborators, “Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations,” pushes this framework forward. The paper, a collaboration between Stanford’s Rylan Schaeffer and Sanmi Koyejo, CDS Research Scientist Ravid Shwartz-Ziv, CDS founding director Yann LeCun, and their co-authors, examines MMCR through both statistical-mechanical and information-theoretic lenses, challenging the idea that these two perspectives are incompatible.

MMCR was first introduced in 2023 by CDS-affiliated Assistant Professor of Neural Science SueYeon Chung, CDS Professor of Neural Science, Mathematics, Data Science, and Psychology Eero Simoncelli, and their colleagues in the paper “Learning Efficient Coding of Natural Images with Maximum Manifold Capacity Representations.” Their work was rooted in neuroscience, specifically the efficient coding hypothesis, which suggests that biological sensory systems are optimized by adapting their representations to the statistics of the input signal, for example by reducing redundancy or dimensionality. The original MMCR framework carried this idea from neuroscience into artificial neural networks by building on “manifold capacity,” a measure of the number of object categories that can be linearly separated within a given representation space. Using this measure as a training objective, the authors learned MMCRs that demonstrated competitive performance on SSL benchmarks and were validated against neural data from the primate visual cortex.

Building on Chung’s foundational work, Shwartz-Ziv and LeCun’s recent research takes MMCR further by providing a more comprehensive theoretical framework that connects MMCR’s geometric basis with information-theoretic principles. While the 2023 paper focused on demonstrating MMCR’s ability to serve as both a model for visual recognition and a plausible model of the primate ventral stream, the new work explores the deeper mechanics of MMCR and extends its applications to multimodal data, such as image-text pairs.

Bridging Statistical Mechanics and Information Theory

Unlike most multi-view self-supervised learning (MVSSL) methods, MMCR does not rely on the usual suspects: contrastive learning, clustering, or redundancy reduction. Instead, it draws on concepts from statistical mechanics, specifically the linear separability of data manifolds, to form a novel approach. “We wanted to see if this old idea [of SSL] could be interpreted in a new way,” Shwartz-Ziv explained, noting their motivation to connect MMCR to established information-theoretic principles. This work not only brings new theoretical insights but also introduces practical tools to optimize self-supervised models.

The original MMCR framework simplified the computationally intensive calculations needed to measure manifold capacity, making it feasible to use this measure as an objective function in SSL. However, Shwartz-Ziv, LeCun, and their co-authors sought to show that the geometric perspective of MMCR can indeed be framed as an information-theoretic problem. By leveraging tools from high-dimensional probability, they demonstrated that MMCR can be understood within the same theoretical framework as other SSL methods, even though it originates from a distinct lineage. This connection bridges a gap between two seemingly different theoretical approaches, showing that MMCR aligns with the broader goals of maximizing mutual information between views.
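
To make the geometric objective concrete, here is a minimal PyTorch sketch of the centroid nuclear-norm loss described in the 2023 MMCR paper. The function name, tensor shapes, and random stand-in features are illustrative rather than drawn from the authors’ code.

```python
import torch
import torch.nn.functional as F

def mmcr_loss(view_embeddings: torch.Tensor) -> torch.Tensor:
    """Negative nuclear norm of the centroid matrix.

    view_embeddings: (B, K, D) tensor of D-dimensional features for
    B images, each encoded under K augmented views.
    """
    # Project each view's embedding onto the unit hypersphere.
    z = F.normalize(view_embeddings, dim=-1)
    # Average the K views of each image to get one centroid per image.
    centroids = z.mean(dim=1)  # shape (B, D)
    # The nuclear norm is the sum of singular values of the centroid
    # matrix; minimizing its negative encourages the centroids to
    # spread out across as many embedding dimensions as possible.
    return -torch.linalg.svdvals(centroids).sum()

# Random features standing in for an encoder's output (B=256, K=2, D=128).
loss = mmcr_loss(torch.randn(256, 2, 128))
```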

Predicting and Validating Complex Learning Behaviors

One of the standout contributions of Shwartz-Ziv and LeCun’s work lies in their prediction of “double descent” behavior in the pretraining loss of MMCR models. Double descent, a phenomenon observed in deep learning in recent years, describes how a model’s error first decreases, then increases, and finally decreases again as the number of parameters grows. This behavior seems to contradict the classical bias-variance tradeoff, which predicts a U-shaped error curve, and challenges conventional understanding of model complexity and generalization. What makes this finding particularly intriguing is that the MMCR double descent is not tied to the usual quantities like dataset or model size but rather to atypical factors: the number of data manifolds and the embedding dimension.

Through both theoretical analysis and empirical validation, the team showed that MMCR’s loss function exhibits this non-monotonic behavior under these unique conditions. “We could use our analysis to predict how the loss would behave,” said Shwartz-Ziv. “And when we ran real networks, they behaved just as our theory predicted.” This advancement builds on the earlier groundwork of Chung and her team, where MMCR was validated against neural data, but now allows for more targeted optimization of hyperparameters, potentially saving significant computational resources.

Scaling Laws and New Frontiers

Beyond these theoretical breakthroughs, the researchers also introduced compute scaling laws specific to MMCR. These laws enable the prediction of pretraining loss as a function of quantities like gradient steps, batch size, embedding dimensions, and the number of views. They could change how researchers plan model scaling, providing a more systematic way to optimize the performance of large models based on smaller, computationally cheaper runs.
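
As a rough illustration of how such a scaling law could be used, the sketch below fits a generic saturating power law to a handful of hypothetical small-scale runs and extrapolates to a larger compute budget. The functional form, the compute proxy, and the numbers are assumptions for illustration, not the fits reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical pretraining losses from a few cheap runs, indexed by a
# compute proxy (e.g., gradient steps x batch size); not real measurements.
compute = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
loss = np.array([2.59, 2.21, 1.88, 1.67, 1.51])

def saturating_power_law(c, a, b, l_inf):
    # Generic form: loss falls off as a power of compute toward a floor l_inf.
    return a * c ** (-b) + l_inf

params, _ = curve_fit(
    saturating_power_law, compute, loss, p0=(30.0, 0.2, 1.0), maxfev=10000
)

# Extrapolate to a compute budget that has not been run yet.
print("predicted loss at 1e9 compute:", saturating_power_law(1e9, *params))
```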

Moreover, Shwartz-Ziv, LeCun, and their team extended MMCR’s application from single-modality (images) to multimodal settings, such as image-text pairs. This adaptability suggests that MMCR could compete with or even surpass existing models like CLIP in certain cases, particularly with smaller batch sizes. This extension demonstrates MMCR’s potential versatility, opening doors to a wider range of applications in fields requiring robust multimodal representations.
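
One way to picture the multimodal extension, sketched here under the assumption that an image and its paired caption are simply treated as two “views” of the same underlying content, is to reuse the centroid nuclear-norm objective from the single-modality sketch above. The encoders and dimensions below are placeholders, not the models used in the paper.

```python
import torch
import torch.nn.functional as F

# Placeholder encoders standing in for real image and text backbones.
image_encoder = torch.nn.Linear(2048, 128)  # pooled image features -> shared space
text_encoder = torch.nn.Linear(768, 128)    # pooled text features -> shared space

image_features = torch.randn(256, 2048)     # B=256 image-text pairs
text_features = torch.randn(256, 768)

# Treat each (image, caption) pair as two views of one underlying concept
# by stacking the modality embeddings along a view axis: shape (B, 2, D).
views = torch.stack(
    [image_encoder(image_features), text_encoder(text_features)], dim=1
)

# Same centroid nuclear-norm objective as in the single-modality sketch.
centroids = F.normalize(views, dim=-1).mean(dim=1)
loss = -torch.linalg.svdvals(centroids).sum()
```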

The Future of Self-Supervised Learning

While the work of Shwartz-Ziv, LeCun, and their colleagues has provided valuable insights into the mechanics of MMCR, the researchers are cautious about its broader impact. “I don’t think it will replace existing algorithms,” Shwartz-Ziv noted. Instead, he emphasized that the value lies in how the analytical framework behind MMCR could inspire the development of new methods. By showing how theoretical analysis can yield practical, scalable tools, their work highlights the ongoing interplay between theory and application in advancing machine learning.

The exploration of MMCR is far from over. As researchers continue to probe its limits and potential, it may serve as a template for developing new models that combine ideas from different fields — a reminder that in machine learning, as in other sciences, sometimes the most interesting advances happen at the intersections.

By Stephen Thomas

