Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task


We describe a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identify the most suitable \emph{text} describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing ``keywords'' (or key-phrases) and their corresponding visual concepts, and instead requires an alignment between the representations of the two modalities that achieves a visually-grounded ``understanding'' of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-understood metric: the accuracy in detecting the true target among the decoys.
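As a minimal sketch of the evaluation metric, the following (hypothetical) scoring setup shows how accuracy is computed: for each image, a model assigns a score to the true caption and to each decoy, and it is counted correct only if the true caption receives the highest score. The scores and the number of decoys here are illustrative, not drawn from the dataset.

```python
def correct_among_decoys(scored_candidates):
    """scored_candidates: list of (model_score, is_target) pairs for one image.
    The prediction counts as correct iff the true caption outscores all decoys."""
    best = max(scored_candidates, key=lambda pair: pair[0])
    return 1.0 if best[1] else 0.0

# Hypothetical model scores: one target and three decoys per image.
examples = [
    [(0.9, True), (0.4, False), (0.3, False), (0.2, False)],  # target ranked first
    [(0.5, True), (0.7, False), (0.3, False), (0.1, False)],  # a decoy wins
]
accuracy = sum(correct_among_decoys(ex) for ex in examples) / len(examples)
print(accuracy)  # 0.5
```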

The paper makes several contributions: a generic mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; results of a human evaluation on this dataset, thus providing a performance ceiling; and several baseline and competitive learning approaches that illustrate the utility of the proposed framework in advancing both image and language machine comprehension. In particular, there is a large gap between human performance and state-of-the-art learning methods, suggesting a fruitful direction for future research.
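To make the decoy-generation idea concrete, here is a heavily simplified sketch of one plausible instantiation: select, as decoys for a target caption, the captions of \emph{other} images that are most similar to it, so that keyword spotting alone cannot separate target from decoys. The word-overlap (Jaccard) similarity and all captions below are illustrative assumptions, not the paper's actual procedure.

```python
def jaccard(a, b):
    """Word-overlap similarity between two captions (a simple proxy measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def pick_decoys(target, candidate_pool, k=3):
    """Choose the k captions (from other images) most similar to the target.
    High similarity yields hard decoys that share keywords with the target."""
    ranked = sorted(candidate_pool, key=lambda c: jaccard(target, c), reverse=True)
    return ranked[:k]

target = "a man riding a horse on a beach"
pool = [
    "a man riding a bicycle on a street",
    "a dog running on a beach",
    "a horse grazing in a field",
    "two people surfing in the ocean",
]
print(pick_decoys(target, pool, k=2))
# ['a man riding a bicycle on a street', 'a dog running on a beach']
```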