ISBN: (print) 9783030923099; 9783030923105
This paper addresses the problem of visual dialog, which aims to answer multi-round questions based on the dialog history and image content. This is a challenging task because a question may be answered in relation to any previous dialog round and to visual clues in the image. Existing methods mainly focus on the discriminative setting, designing various attention mechanisms to model the interaction between answer candidates and the multi-modal context. Despite the impressive results of attention-based models for visual dialog, a universal encoder-decoder for both answer understanding and generation remains challenging. In this paper, we propose UED, a unified framework that exploits answer candidates to jointly train discriminative and generative tasks. UED is unified in that (1) it fully exploits the interaction between different modalities to support answer ranking and generation in a single transformer-based model, and (2) it uses the answers as anchors to facilitate both settings. We evaluate the proposed UED on the VisDial dataset, where our model outperforms the state-of-the-art.
In the field of 3D face alignment, most researchers have focused on improving the prediction accuracy of algorithms and have ignored portability for practical applications. To this end, this study presents a real-time 3D face-alignment method that uses an encoder-decoder network with an efficient deconvolution layer. Fusing the encoding and decoding features gives the network a richer set of features. The efficient deconvolution layer at the decoding stage applies the L1 norm to select useful features and generates additional ones through linear operations. Experimental results on the standard AFLW2000-3D and AFLW-LFPA datasets show that our algorithm achieves low prediction errors with real-time applicability.
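The abstract does not spell out how the efficient deconvolution layer works, so the following is only a rough sketch of the general idea it names: rank feature channels by their L1 norm, keep the strongest ones, and derive extra channels from them with cheap linear operations. The function names, the `keep` parameter, and the use of plain scaling as the linear operation are all assumptions made for illustration, not the authors' implementation.

```python
def l1_norm(channel):
    """Sum of absolute values of one flattened feature channel."""
    return sum(abs(v) for v in channel)


def select_by_l1(channels, keep):
    """Keep the `keep` channels with the largest L1 norm (hypothetical rule).

    The surviving channels are returned in their original order.
    """
    ranked = sorted(range(len(channels)),
                    key=lambda i: l1_norm(channels[i]),
                    reverse=True)
    kept = sorted(ranked[:keep])
    return [list(channels[i]) for i in kept]


def expand_linear(channels, scales):
    """Append one scaled copy of every channel per scale factor.

    Scaling stands in for "linear operations"; the real layer may use a
    different (learned) linear map.
    """
    out = [list(c) for c in channels]
    for c in channels:
        for s in scales:
            out.append([s * v for v in c])
    return out


# Toy feature map: three channels, two spatial positions each.
features = [[0.1, -0.1], [2.0, -3.0], [1.0, 1.0]]
selected = select_by_l1(features, keep=2)
expanded = expand_linear(selected, scales=[0.5])
```

In this toy run the weakest channel is dropped and each surviving channel gains one scaled copy, so the layer ends with more channels than it selected at a small arithmetic cost.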
Pulmonary embolism (PE) must be diagnosed early and accurately to minimize the danger it poses at an advanced stage. This approach extends the advanced techniques for preprocessing, including normalization, slice filtering and re...
Many advanced models have been proposed for automatic surface defect inspection. Although CNN-based methods have achieved superior performance among these models, they are limited in extracting global semantic details due to the locality of the convolution operation. Global semantic details, however, are highly effective for detecting surface defects. Recently, inspired by the success of the Transformer, which models global semantic details well through global self-attention, some researchers have started to apply Transformer-based methods to many computer-vision challenges. However, as many researchers have noticed, transformers lose spatial details while extracting semantic features. To alleviate these problems, this paper proposes a transformer-based Hybrid Attention Gate (HAG) model that extracts both global semantic features and spatial features. The HAG model consists of a Transformer (Trans), a channel Squeeze-spatial Excitation (sSE) module, and a merge process. The Trans model extracts global semantic features and the sSE extracts spatial features. The merge process, which comes in different versions (concat, add, max, and mul), allows these two models to be combined effectively. Finally, four versions of the HAG-Feature Fusion network (HAG-FFN) were developed using the proposed HAG model for the detection of surface defects. Four different datasets were used to test the performance of the proposed HAG-FFN versions. In the experimental studies, the proposed model produced mIoU scores of 83.83%, 79.34%, 76.53%, and 81.78% on the MT, MVTec-Texture, DAGM, and AITEX datasets, respectively. These results show that the proposed HAGmax-FFN model performs better than the state-of-the-art models.
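The two building blocks named in the abstract can be illustrated with a small, dependency-free sketch. It assumes the usual definition of channel squeeze and spatial excitation (a per-position sigmoid gate computed from a weighted sum over channels) and treats the four merge versions as plain element-wise or concatenation operations. The function names, the flattened list layout, and the toy weights are illustrative assumptions; the paper's actual HAG wiring is not given in the abstract.

```python
import math


def spatial_excitation(feat, weights):
    """Channel squeeze and spatial excitation on a (channels x positions) map.

    For every spatial position, the channels are squeezed into one value by a
    weighted sum (the role a 1x1 convolution would play), passed through a
    sigmoid, and used to rescale all channels at that position.
    """
    n_pos = len(feat[0])
    gates = []
    for p in range(n_pos):
        z = sum(w * ch[p] for w, ch in zip(weights, feat))
        gates.append(1.0 / (1.0 + math.exp(-z)))
    return [[v * g for v, g in zip(ch, gates)] for ch in feat]


def merge(a, b, mode):
    """Combine two feature maps with one of the four merge versions."""
    if mode == "concat":
        # Stack the channels of both branches.
        return [list(ch) for ch in a] + [list(ch) for ch in b]
    ops = {
        "add": lambda x, y: x + y,
        "max": max,
        "mul": lambda x, y: x * y,
    }
    if mode not in ops:
        raise ValueError(f"unknown merge mode: {mode}")
    if len(a) != len(b):
        raise ValueError("element-wise merges need equal channel counts")
    op = ops[mode]
    return [[op(x, y) for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


# Toy inputs: one channel, two spatial positions per branch.
trans_feat = [[1.0, 2.0]]                              # stand-in for Trans output
sse_feat = spatial_excitation([[6.0, 1.0]], [0.0])     # zero weights: gate is 0.5
merged = {m: merge(trans_feat, sse_feat, m)
          for m in ("concat", "add", "max", "mul")}
```

The element-wise versions keep the channel count unchanged, while `concat` doubles it, which is one practical reason a network might prefer `add`, `max`, or `mul` when the downstream layers expect a fixed width.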
ISBN: (print) 9781538633540
Unlike approaches that classify a single gesture at a time, we propose a deep-learning-based technique that can classify multiple gestures in one shot. This is especially suitable for applications that involve seamless gesture sequences, such as sign language recognition, touch-less car assistance systems, and gaming systems. We propose a Long Short-Term Memory (LSTM) based deep network, along the lines of an encoder-decoder architecture, that classifies a gesture sequence accurately in one go. We also present an empirical training strategy for our architecture that can achieve good results even with a limited amount of collected data. Results from experiments performed on labelled datasets from Inertial Motion Units (IMU) demonstrate the efficiency and usefulness of the proposed method.
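The data flow of such an encoder-decoder classifier can be sketched without committing to any particular network. In the sketch below, `encode_step` and `decode_step` are hypothetical stand-ins for the LSTM encoder and decoder cells, and the start and end markers are assumed conventions; none of these names come from the paper. The point is only to show how one pass over the sensor frames can yield a whole sequence of gesture labels.

```python
def classify_sequence(frames, encode_step, decode_step, init_state,
                      start, end, max_len=10):
    """Read all frames, then emit gesture labels until an end marker.

    encode_step(state, frame) -> state          (stand-in for an encoder cell)
    decode_step(state, prev)  -> (state, label) (stand-in for a decoder cell)
    """
    # Encoder: fold the whole input sequence into a single state.
    state = init_state
    for frame in frames:
        state = encode_step(state, frame)

    # Decoder: unroll from that state, feeding back the previous label.
    labels, prev = [], start
    for _ in range(max_len):
        state, label = decode_step(state, prev)
        if label == end:
            break
        labels.append(label)
        prev = label
    return labels


# Toy stand-ins so the loop can be exercised; a trained model would replace
# these with LSTM cells operating on IMU feature vectors.
def toy_encode(state, frame):
    return state + [frame]


def toy_decode(state, prev):
    if not state:
        return state, "<end>"
    return state[1:], state[0]


gestures = classify_sequence(["wave", "swipe"], toy_encode, toy_decode,
                             [], "<start>", "<end>")
```

The `max_len` cap is a common safeguard in sequence decoders: it bounds the output even if the decoder never produces the end marker.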