A Counterintuitive Hiccup In Multimodal Learning: Taro Makino’s New Research

NYU Center for Data Science
Oct 31, 2023

Multimodal learning trains machine learning models on several data modalities at once, such as images and text, and is widely seen as a promising approach. However, a recurring challenge has been observed: multimodal models often fail to use all of their data modalities effectively. Taro Makino, a fourth-year CDS PhD student, and his co-authors, including CDS faculty Kyunghyun Cho and CDS-affiliated assistant professor Krzysztof J. Geras, examine this issue in their latest paper, “Detecting incidental correlation in multimodal learning via latent variable modeling,” published in Transactions on Machine Learning Research (TMLR).

The problem can be illustrated as follows: Imagine training a breast cancer classifier using both X-ray and ultrasound images. Intuitively, one would expect a model that uses both modalities to outperform a model that relies on just one. Surprisingly, this isn’t always the case. Often, the single-modality model outperforms its multimodal counterpart. Why does this happen?

Makino and his team propose a novel explanation for this phenomenon: the concept of “incidental correlation.” As Makino explains, “Incidental correlation is a spurious correlation that emerges when data is insufficient.” This deceptive correlation can mislead models, causing them to rely on superficial patterns rather than the genuine relationships between data modalities. For example, if most pictures of cats in a training set were taken during the day, the model might conclude that cats appear only during the day, which is obviously not true of cats.
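The general statistical intuition behind that quote can be sketched with a small simulation. This is not the method from the paper; it is a minimal illustration that assumes two independent Gaussian variables stand in for unrelated features, and the function names `pearson` and `mean_abs_correlation` are hypothetical helpers written for this example. It measures how strongly two unrelated variables appear to be correlated, on average, at different sample sizes.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_correlation(n_samples, n_trials=200, seed=0):
    """Average |correlation| between two variables that are truly independent.

    Assumption for illustration only: both variables are standard Gaussians
    drawn independently, so their true correlation is exactly zero.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.gauss(0, 1) for _ in range(n_samples)]
        ys = [rng.gauss(0, 1) for _ in range(n_samples)]
        total += abs(pearson(xs, ys))
    return total / n_trials


if __name__ == "__main__":
    # Smaller datasets show larger apparent correlation purely by chance.
    for n in (10, 100, 1000):
        print(n, round(mean_abs_correlation(n), 3))
```

In this toy setting the apparent correlation shrinks as the sample size grows, even though the two variables are never actually related. A model trained on the small sample has no way to tell such a chance pattern from a real one.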

This kind of correlation can lead to modality underutilization even when conditions seem favorable. For instance, even when a dataset is free of systematic biases and the model uses a proven neural network architecture, the model can still underutilize a modality if there is not enough data.

The phenomenon of incidental correlation represents a significant, and possibly fatal, problem for multimodal learning. To detect it, Makino’s team developed a method based on a generative model called a variational autoencoder. In experiments with both synthetic data and real-world datasets such as VQA v2.0 and NLVR2, they demonstrated that incidental correlation emerges in real multimodal datasets when the dataset size is restricted. This insight underscores the importance of having sufficiently large datasets when pursuing multimodal learning.

For now, practitioners can in principle replicate the team’s method to determine whether incidental correlation is present in their own datasets and whether more data collection is necessary.

In the long run, however, a solution remains elusive. Makino’s research has shed light on the challenges of incidental correlation in multimodal learning and provided a method to detect it, but not a way to fix it. “We have shown that incidental correlation is problematic, and have proposed a method for detecting it,” said Makino. “But we do not have a way of resolving this issue yet.” Makino remains optimistic, though, seeing this as a promising direction for future research.

For those interested in diving deeper into the research, the full paper can be accessed at the “Detecting incidental correlation…” OpenReview.net page.

By Stephen Thomas


