New Framework Improves Multi-Modal AI Performance Across Diverse Tasks
Multi-modal AI models often perform worse than uni-modal models, contrary to expectations. A new framework developed by researchers at CDS aims to resolve this paradox.
CDS PhD student Taro Makino, along with fellow NYU PhD student Divyam Madaan, CDS Professor of Computer Science and Data Science Kyunghyun Cho, and Sumit Chopra from the NYU Grossman School of Medicine, proposed a novel approach to supervised multi-modal learning called inter- & intra-modality modeling (I2M2), which models both the dependencies between the modalities and the label and the dependencies between each individual modality and the label. Their paper, “A Framework for Multi-modal Learning: Jointly Modeling Inter- & Intra-Modality Dependencies,” accepted at NeurIPS 2024, demonstrated consistent performance improvements across healthcare and vision-language tasks.
The Paradox of Multi-Modal Learning
The motivation for this work stemmed from the team’s experience with a knee MRI dataset, which they were using to try to identify the most significant knee pathologies. Madaan explained, “At the time, we didn’t even know whether to consider it a multi-modal dataset or a unimodal dataset. It was a very new space for us.” The researchers found that conventional multi-modal approaches often failed to improve performance over single-modal models. This led them to question the fundamental assumptions of multi-modal learning. “We tried to consider: what exactly is the benefit that we get from considering something multi-modal?” Madaan said.
Explicitly Modeling Inter- and Intra-Modality Dependencies
To address this issue, the team developed a framework that explicitly separates intra-modality dependencies (between each individual modality and the label) from inter-modality dependencies (between the interactions across modalities and the label). Madaan clarified, “We provide a way where you don’t need to know the strength of these dependencies or rely on assumptions about what the model will capture — we make it explicit, and tell the model to capture both intra and inter separately.”
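To make that separation concrete, here is a minimal PyTorch-style sketch of the idea: per-modality heads capture intra-modality dependencies, a joint head over the fused features captures inter-modality dependencies, and their predictions are combined. The encoder modules, the two-modality setup, and the simple logit-sum combination are illustrative assumptions, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class I2M2Sketch(nn.Module):
    """Illustrative sketch of jointly modeling intra- and inter-modality
    dependencies for two modalities (hypothetical, not the paper's code)."""

    def __init__(self, encoder_a, encoder_b, feat_dim, num_classes):
        super().__init__()
        # Intra-modality branches: each modality predicts the label on its own.
        self.encoder_a, self.encoder_b = encoder_a, encoder_b
        self.head_a = nn.Linear(feat_dim, num_classes)
        self.head_b = nn.Linear(feat_dim, num_classes)
        # Inter-modality branch: a joint head over the fused representation
        # captures cross-modal interactions with the label.
        self.head_joint = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, x_a, x_b):
        z_a = self.encoder_a(x_a)
        z_b = self.encoder_b(x_b)
        logits_intra_a = self.head_a(z_a)
        logits_intra_b = self.head_b(z_b)
        logits_inter = self.head_joint(torch.cat([z_a, z_b], dim=-1))
        # Combine the three predictors; summing logits is one simple choice.
        return logits_intra_a + logits_intra_b + logits_inter
```

Training such a model with a standard cross-entropy loss on the combined logits pushes each branch to contribute whatever signal its dependencies carry, without the practitioner deciding in advance which kind matters more.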
“The strength of this paper is its very thorough experimental evaluation,” Makino said. “It shows that in many different contexts, including audio-vision, medical imaging, and visual question answering, this somewhat simple idea leads to very consistent gains.”
The researchers evaluated I2M2 on several datasets with varying strengths of inter- and intra-modality dependencies, including knee MRI exams for diagnosing conditions such as ACL tears, meniscus tears, and cartilage abnormalities; the MIMIC-III dataset for predicting patient mortality and diagnoses; and vision-language tasks such as visual question answering and natural language visual reasoning (NLVR2). The importance of these dependencies varied across datasets: for the NLVR2 task, inter-modality interactions were crucial, while for fastMRI, intra-modality patterns were more relevant. I2M2 excelled in both scenarios by effectively capturing whichever dependencies carried the most pertinent information.
“It’s applicable to pretty much every kind of multi-modal scenario,” Makino explained. The framework’s versatility stems from its ability to leverage both inter-modality and intra-modality dependencies without requiring prior knowledge of their relative importance for a given task. Madaan emphasized, “All you have to do is capture them separately, use whatever dataset you have, and if there is any benefit from the interaction, you’ll see it through the performance improvements.”
This work builds on years of research into the challenges of multi-modal learning. “Kyunghyun has been working on this problem for a while through multiple generations of PhD students,” Makino noted, highlighting the ongoing nature of this research direction at CDS.
Future Work and Applications
Looking ahead, Madaan says the team is exploring further healthcare applications of their work, such as early risk prediction for dementia and diagnosing prostate cancer from MRIs. The team is also investigating how to integrate the approach with existing state-of-the-art models. Madaan explained, “We are asking: how can we use existing state-of-the-art unimodal models, augment them with inter-modality, and see how those combinations would do on downstream tasks that people care about.”
While I2M2 represents a significant advance, Makino cautioned that further work is needed to fully understand and optimize multi-modal learning. The team found that pre-training individual modality models before joint fine-tuning was more effective than training from scratch, suggesting underlying optimization challenges that warrant further investigation.
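The two-stage recipe mentioned above can be sketched briefly. Continuing with the hypothetical I2M2Sketch module from the earlier example, the loops below show one way to pre-train each single-modality branch before fine-tuning everything jointly; the function names and training loops are assumptions for illustration, not the authors’ training code.

```python
import torch.nn.functional as F

def pretrain_branch(encoder, head, loader, optimizer):
    """Stage 1 (illustrative): train one modality's encoder and head alone."""
    for x, y in loader:
        loss = F.cross_entropy(head(encoder(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def finetune_jointly(model, loader, optimizer):
    """Stage 2 (illustrative): fine-tune all branches of the combined model."""
    for (x_a, x_b), y in loader:
        loss = F.cross_entropy(model(x_a, x_b), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```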
As AI systems increasingly need to integrate diverse types of data, frameworks like I2M2 that can effectively leverage multiple modalities are likely to grow in importance. This research takes an important step toward more robust and versatile multi-modal AI models.
The researchers have made their code publicly available and encourage others to explore using this work — in combination with existing inter-modality or intra-modality models — for their own multi-modal applications.
By Stephen Thomas