The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. In this context, however, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, which results in VQA models that ignore visual information and leads to an inflated sense of their capability. We propose to counter these language priors for the task of VQA and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset (Antol et al., ICCV, 2015) by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://***/ as part of the 2nd iteration of the VQA Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners. We also present interesting insights from analysis of the participant entries in VQA Challenge 2017, organized by us on the proposed VQA v2.0 dataset. The results of the challenge were announced at the 2nd VQA Challenge Workshop at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example ba
Lip-reading is the task of recognizing speech from lip movements. It is difficult because some words produce similar lip movements when pronounced. Viseme is used to describe lip...
ISBN: 9783642022555 (print)
In this paper, an adaptation of the eikonal equation is proposed by considering the latter on weighted graphs of arbitrary structure. This novel approach is based on a family of discrete morphological local and nonlocal gradients expressed by partial difference equations (PdEs). Our formulation of the eikonal equation on weighted graphs generalizes local and nonlocal configurations in the context of image processing and extends this equation to the processing of any unorganized high-dimensional discrete data that can be represented by a graph. Our approach leads to a unified formulation for image segmentation and high-dimensional irregular data processing.
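As a rough illustration of what solving an eikonal-type equation on a weighted graph can look like, the sketch below propagates arrival times from seed vertices with a Dijkstra-style front. It rests on assumptions not stated in the abstract: the gradient is taken in an L-infinity form, so the local update reduces to f(u) = min over neighbors v of f(v) + P(u) / sqrt(w(u, v)), and the names `solve_graph_eikonal`, `neighbors`, `potential`, and `seeds` are hypothetical. It is not the scheme from the paper.

```python
import heapq
import math


def solve_graph_eikonal(neighbors, potential, seeds):
    """Dijkstra-style front propagation on a weighted graph (illustrative only).

    neighbors: dict mapping vertex u -> list of (v, w_uv) with w_uv > 0
    potential: dict mapping vertex u -> P(u) > 0 (assumed local cost)
    seeds:     iterable of vertices where the arrival time is fixed to 0
    """
    dist = {u: math.inf for u in neighbors}
    heap = []
    for s in seeds:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))

    done = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue  # stale heap entry, vertex already finalized
        done.add(v)
        for u, w in neighbors[v]:
            # Assumed L-infinity update: stepping from v to u costs P(u) / sqrt(w).
            cand = d + potential[u] / math.sqrt(w)
            if cand < dist[u]:
                dist[u] = cand
                heapq.heappush(heap, (cand, u))
    return dist


# Path graph 0 - 1 - 2 - 3; the middle edge has a larger weight (higher similarity).
graph = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 4.0)],
    2: [(1, 4.0), (3, 1.0)],
    3: [(2, 1.0)],
}
unit_potential = {u: 1.0 for u in graph}
arrival = solve_graph_eikonal(graph, unit_potential, seeds=[0])
```

Because the graph is given as a plain adjacency structure, the same routine applies to a pixel grid, a nonlocal patch graph, or a neighborhood graph built on unorganized data, which is the kind of generality the abstract describes.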
Jingdezhen ceramics have a long history and are world-famous, and thus often become the object of imitation. Because current ceramic anti-counterfeiting traceability technology is not precise enough, a new treat...
One of the most important stages in the fate of the embryo in in vitro fertilization (IVF) is the blastocyst stage. There is currently no way to diagnose the blastocyst. In this study, using ResNet and U-Net networks, the ...
This paper addresses the problem of spherical image processing. Thanks to projective geometry, the omnidirectional image can be represented as a function on the sphere S2. The target application is omnidirectional image smoothing. We describe a new method of smoothing for spherical images. For that purpose, we introduce a suitable Wiener filter and apply the Tikhonov method to these images. In order to compare their performance, we also present the most commonly used classical spherical kernels. We present several examples of filtering real and synthetic spherical images.
We implemented a real-time ensemble model for face detection by combining the results of YOLO v1 to v4. We used the WIDER FACE benchmark for training YOLOv1 to v4 in the Darknet framework. Then, we ensemble their resu...
CT images play a vital role in the diagnosis of liver cancer. However, CT images often have significant image noise, which is unfavourable for doctors' diagnoses. In response to this problem, this paper applies th...
Sign language detection has become crucial and effective for humans. Research in this area is ongoing, and it is one of the applications of computer vision. Earlier works included detection using static signs with ...
In MRF-based unsupervised segmentation, the MRF model parameters are typically estimated globally. Those global statistics are sometimes far from accurate for local areas if the image is highly non-stationary, and hen...