CDS Founder Yann LeCun, DeepMind Fellow Aishwarya Kamath and CIMS Post-Doctoral Fellow Nicolas Carion propose MDETR

If someone handed you an art history book and asked you to show them the famous painting of a woman in a black dress who is not quite smiling, you could probably identify Da Vinci’s Mona Lisa from the pictures within. Humans learn this ability to match images and objects to a written or oral description over many years. AI has proven capable of this as well, but not quite as capable as humans. This is especially true when an AI is asked to identify an object outside of its training vocabulary, or when it has only a textual description of the object to work from.

A team of scientists at NYU and Facebook, including CDS Founder Yann LeCun, CDS DeepMind Fellow Aishwarya Kamath and CIMS Post-Doctoral Fellow Nicolas Carion, came together to tackle this problem. The result is MDETR, or Modulated Detection for End-to-End Multi-Modal Understanding, which they describe in a paper available on arXiv.

In the paper, the team describes how they used a “transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.” This two-pronged approach, illustrated in the model diagram, expands the vocabulary to anything found in free-form text, making it possible to identify previously unseen combinations of objects and categories, such as “pink elephant,” when probed using only a textual description.
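To give a rough sense of what “fusing the two modalities at an early stage” means, the sketch below shows joint self-attention over a single sequence built by concatenating image features and text token embeddings, so that attention can flow freely between the two modalities. This is a simplified NumPy illustration of the early-fusion idea only, not MDETR's actual implementation (which uses a convolutional image backbone, a pretrained language model, and a DETR-style transformer); all function and variable names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(image_feats, text_feats, Wq, Wk, Wv):
    """Single-head self-attention over a fused image+text sequence.

    Early fusion: the two modalities are concatenated into one token
    sequence, so every token (visual or textual) can attend to every
    other token, regardless of modality.
    """
    tokens = np.concatenate([image_feats, text_feats], axis=0)  # (N_img + N_txt, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product attention
    return softmax(scores) @ v                # one fused vector per token

rng = np.random.default_rng(0)
d = 8
image_feats = rng.normal(size=(4, d))  # e.g. 4 image region features (toy data)
text_feats = rng.normal(size=(3, d))   # e.g. 3 text token embeddings (toy data)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

fused = joint_self_attention(image_feats, text_feats, Wq, Wk, Wv)
print(fused.shape)  # (7, 8): one output per image or text token
```

Because fusion happens inside the attention layers rather than after separate per-modality encoders, a text token like “pink” can directly modulate which visual features the model attends to.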

Based on this approach, the team obtained considerable boosts in performance on tasks such as phrase grounding and referring expression comprehension, in some cases halving the previously achieved error rate. Their method also paves the way for more interpretable systems, making it possible to “see” what the model is looking at while it reasons about the text and image. An example is shown below: when the model is asked the question “What is on the table?”, MDETR draws boxes around the objects relevant to the query — in this case, the table and the laptop.

Using the novel method introduced in this paper, it is possible to probe for fine-grained objects in images, whether they are salient objects such as the umbrellas in this image or background objects such as the car and the electricity box.

It is always exciting to see scientists at CDS and beyond use their knowledge and experience to further progress in the field of AI. We’re sure this is just the beginning of the innovations we’ll see from this team.