Author affiliation: Indian Inst Technol BHU, Dept Comp Sci & Engn, Varanasi 221005, Uttar Pradesh, India
Publication: IET COMPUTER VISION
Year, volume, issue: 2018, Vol. 12, No. 8
Pages: 1141-1150
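Subject classification: 0808 [Engineering - Electrical Engineering]; 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (Engineering or Science degrees may be conferred)]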
Funding: We would like to thank the anonymous reviewers and the editor for their insightful comments.
Topics: image sequences; question answering (information retrieval); text analysis; image coding; neural net architecture; object sequences; spatial object information encoding; categorical object information encoding; yes-no visual question answering task; VQA task; language information; text-based question; visual information encoding; visual features; language-based features; neural network architecture; GuessWhat dataset; Oracle task
Abstract: The task of visual question answering (VQA) has gained wide popularity in recent times. Solving the VQA task effectively requires understanding both the visual content of the image and the language information in the text-based question. In this study, the authors propose a novel method of encoding the visual information (categorical and spatial object information) of all the objects present in the image into a sequential format, called an object sequence. These object sequences can then be processed by a neural network. The authors experiment with multiple techniques for obtaining a joint embedding from the visual features (in the form of object sequences) and the language-based features obtained from the question. They also provide a detailed analysis of the performance of a neural network architecture that uses object sequences on the Oracle task of the GuessWhat dataset (a Yes/No VQA task) and benchmark it against the baseline.
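The abstract does not give the exact layout of an object sequence, so the following is only a minimal sketch of one plausible reading: each object becomes a vector holding a one-hot category code followed by normalized bounding-box coordinates, and the image becomes the ordered list of those vectors. The category vocabulary `CATEGORIES`, the function names, the box format, and the left-to-right ordering are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical category vocabulary; the abstract does not list the categories used.
CATEGORIES = ["person", "dog", "car"]

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def encode_object(category: str, box: Box, img_w: float, img_h: float) -> List[float]:
    """Return a one-hot category code followed by box coordinates scaled to [0, 1]."""
    one_hot = [1.0 if c == category else 0.0 for c in CATEGORIES]
    x0, y0, x1, y1 = box
    spatial = [x0 / img_w, y0 / img_h, x1 / img_w, y1 / img_h]
    return one_hot + spatial


def object_sequence(
    objects: Sequence[Tuple[str, Box]], img_w: float, img_h: float
) -> List[List[float]]:
    """Encode every object in the image, ordered by x_min (an assumed ordering)."""
    ordered = sorted(objects, key=lambda obj: obj[1][0])
    return [encode_object(cat, box, img_w, img_h) for cat, box in ordered]


# Two objects in a 400 x 400 image, given in arbitrary order.
objects = [
    ("dog", (200.0, 100.0, 400.0, 300.0)),
    ("person", (0.0, 0.0, 100.0, 400.0)),
]
seq = object_sequence(objects, img_w=400.0, img_h=400.0)
```

A sequence in this form could be fed to a recurrent or other sequence model, with its output combined with the question features to form the joint embedding the abstract mentions; how the authors actually do this is not stated in this record.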