CDS students Aishwarya Kamath and Sara Price develop novel dataset to enhance computer vision


The research provides a challenging benchmark for testing models’ understanding of images

CDS PhD Student Aishwarya Kamath, CDS Master's Graduate Sara Price, and Former Courant Postdoc Nicolas Carion

An essential function of computer vision is to understand and reason over visual scenes. Several tasks have been developed to test how well current models understand the contents of an image, but these benchmarks have their limits. In a recent paper, "TRICD: Testing Robust Image Understanding Through Contextual Phrase Detection," NYU researchers highlight a blind spot in model evaluation by proposing a new task called Contextual Phrase Detection (CPD). To support evaluation of this task, the researchers assembled a dataset, Testing Robust Image Understanding Through Contextual Phrase Detection (TRICD), pronounced "tricked."

The work was led by CDS PhD student Aishwarya Kamath, whose research focuses on computer vision and natural language processing. She is advised by Yann LeCun, Silver Professor at the NYU Courant Institute of Mathematical Sciences, CDS Associate Professor of Data Science, and Vice-President and Chief AI Scientist at Meta, and by Kyunghyun Cho, Associate Professor of Computer Science and Data Science. CDS MS graduate Sara Price and Meta AI Research Scientist Nicolas Carion were joint core contributors to the work, which was supported by Jonas Pfeiffer, Research Scientist at Google Research, and LeCun.

A significant limitation of current object detection benchmarks is that they are typically restricted to a set of concepts drawn from a fixed vocabulary. Tasks that use natural language can probe a model's visual understanding more flexibly, but they still have limits. For example, phrase grounding probes the model with natural language, yet operates under the assumption that the text corresponds to objects actually present in the image. The paper shows that results on such benchmarks tend to significantly overestimate model capabilities, since models do not need to understand context, only to locate the named entities. CPD bridges traditional object detection with the flexibility that comes from using natural language.

The novel task (CPD) provides models with phrases taken from a larger context, and the model must detect a phrase only when the full context is depicted in the image. An example the paper provides is the sentence "cat on a table." The model is required to predict boxes (markers outlining the selected object in an image) for the cat and the table only if there is a cat on the table, and none otherwise. The team manually curated image-text pairs that are contextually related but partially contradictory: while the images and texts are semantically similar, each sentence is depicted in only one of the two images, not the other. The dataset consists of 2,672 image-text pairs with 1,101 unique phrases associated with 6,058 manually annotated bounding boxes.
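To make the evaluation logic concrete, here is a minimal Python sketch of how a contextual image pair and its scoring rule might look. The field names (`sentence`, `images`, `ground_truth`) and the `score_pair` helper are illustrative assumptions, not the dataset's actual schema or the paper's metric.

```python
# Hypothetical sketch of a TRICD-style contextual pair. Field names are
# illustrative, not the dataset's actual schema.
pair = {
    "sentence": "cat on a table",
    "images": [
        {   # Image that depicts the full sentence: boxes are expected.
            "id": "img_a",
            "ground_truth": {"cat": [[120, 40, 260, 180]],
                             "table": [[60, 150, 400, 320]]},
        },
        {   # Contextually related image that does NOT depict the sentence
            # (e.g. the cat is under the table): no boxes should be predicted.
            "id": "img_b",
            "ground_truth": {},
        },
    ],
}

def score_pair(pair, predictions):
    """Count an image as correct only when the model predicts boxes for
    every phrase if the sentence is depicted, and no boxes otherwise."""
    correct = 0
    for image in pair["images"]:
        preds = predictions.get(image["id"], {})
        truth = image["ground_truth"]
        if truth:
            # Sentence depicted: expect a detection for each phrase.
            correct += all(preds.get(p) for p in truth)
        else:
            # Sentence not depicted: expect no detections at all.
            correct += not any(preds.values())
    return correct / len(pair["images"])
```

Under this sketch, a model that grounds the phrases in both images regardless of context (as a standard phrase-grounding system would) is penalized on the contradictory image, which is exactly the blind spot CPD is designed to expose.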

“We believe this task is the next natural step in the quest to evaluate ever-more flexible and general detection systems,” write the authors. “We hope that this benchmark will pave the way for building stronger models with better fine-grained spatial and relational reasoning capabilities.”

More information is available on the TRICD project website and GitHub.

By Meryl Phair



NYU Center for Data Science

Official account of the Center for Data Science at NYU, home of the Undergraduate, Master’s, and Ph.D. programs in Data Science.