ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Human emotion recognition contributes to the development of human-computer interaction. Machines that understand human emotions in the real world will contribute significantly to everyday life in the future. This paper describes our submission to the 3rd Affective Behavior Analysis in-the-wild (ABAW3) 2022 challenge. We focus on solving the problems of Valence-Arousal (VA) estimation and Action Unit (AU) detection. For valence-arousal estimation, we proceed in two stages: creating new features from multimodal data, and temporal learning to predict valence-arousal. First, we create new features: a Gated Recurrent Unit (GRU) and a Transformer are combined on top of a Regular Networks (RegNet) feature extracted from the image. In the next step, a GRU combined with local attention predicts valence-arousal. The Concordance Correlation Coefficient (CCC) was used to evaluate the model. The result reached 0.450 for valence and 0.445 for arousal on the test set, outperforming the baseline method, whose corresponding CCC is 0.180 for valence and 0.170 for arousal. We also performed additional experiments on the action unit task with simple Transformer blocks, achieving an F1 score of 49.04 on the test set, which outperforms the baseline method with a corresponding F1 score of 36.50. Our submission to ABAW3 2022 ranks 3rd for both tasks.
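Since the Concordance Correlation Coefficient is the headline metric for the valence-arousal task, here is a minimal NumPy sketch of how CCC is typically computed over a sequence of predictions; the function name and the toy arrays are illustrative and are not taken from the authors' code.

import numpy as np

def concordance_cc(pred: np.ndarray, target: np.ndarray) -> float:
    """Concordance Correlation Coefficient (CCC) between two 1-D series."""
    pred_mean, target_mean = pred.mean(), target.mean()
    covariance = np.mean((pred - pred_mean) * (target - target_mean))
    return 2.0 * covariance / (pred.var() + target.var() + (pred_mean - target_mean) ** 2)

# Toy example: identical sequences give CCC = 1.0; a constant offset lowers it.
valence_true = np.array([0.10, 0.35, -0.20, 0.40, 0.05])
valence_pred = valence_true + 0.1
print(round(concordance_cc(valence_pred, valence_true), 3))

Unlike plain Pearson correlation, CCC also penalizes differences in mean and scale between predictions and labels, which is why it is the standard evaluation measure for continuous valence and arousal.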
Polyp segmentation is a crucial step towards computer-aided diagnosis of colorectal cancer. However, most polyp segmentation methods require pixel-wise annotated datasets. Annotated datasets are tedious and tim...
With growing interest in conversational agents, two-way human-computer communication involving asking and answering visual questions has become an active area of research in AI. Thus, generation of visual question...
Image processing is a fundamental technique in low-level vision. However, with the development of deep learning over the past five years, most low-level vision methods have tended to ignore this technique. ...
Recent advancements in learning-based novel view synthesis enable users to synthesize a light field from a monocular image without special equipment. Moreover, state-of-the-art techniques including multiplane image ...
The aim of this paper is to propose a large-scale dataset for image restoration (LSDIR). Recent work in image restoration has focused on the design of deep neural networks. The datasets used to train these networ...
We propose a novel depth-aware joint attention target estimation framework that estimates the attention target in 3D space. Our goal is to mimic the human ability to understand where each person is looking in their ...
With the large-scale video-text datasets being collected, learning general visual-textual representation has gained increasing attention. While recent methods are designed with the assumption that the alt-text descrip...
This paper presents our Facial Action Unit (AU) detection submission to the fifth Affective Behavior Analysis in-the-wild Competition (ABAW). Our approach consists of three main modules: (i) a pre-trained facial rep...
ISBN (print): 9781665487399
Monocular omnidirectional depth estimation is receiving considerable research attention due to its broad applications for sensing 360-degree surroundings. Existing approaches in this field suffer from limitations in recovering small object details and in data lost during ground-truth depth map acquisition. In this paper, a novel monocular omnidirectional depth estimation model, HiMODE, is proposed based on a hybrid CNN+Transformer (encoder-decoder) architecture whose modules are efficiently designed to mitigate distortion and computational cost without performance degradation. First, we design a feature pyramid network based on the HNet block to extract high-resolution features near the edges. The performance is further improved by a self- and cross-attention layer and spatial/temporal patches in the Transformer encoder and decoder, respectively. In addition, a spatial residual block is employed to reduce the number of parameters. By jointly passing the deep features extracted from the input image at each backbone block, along with the raw depth maps predicted by the Transformer encoder-decoder, through a context adjustment layer, our model can produce depth maps with better visual quality than the ground truth. Comprehensive ablation studies demonstrate the significance of each individual module. Extensive experiments conducted on three datasets (Stanford3D, Matterport3D, and SunCG) demonstrate that HiMODE can achieve state-of-the-art performance for 360-degree monocular depth estimation. The complete project code and supplementary materials are available at https://***/himode5008/HiMODE.
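To make the final refinement step more concrete, below is a minimal PyTorch sketch of a context-adjustment-style layer that fuses backbone features with the raw predicted depth map and outputs a residually corrected depth map; the class name, channel widths, kernel sizes, and residual formulation are assumptions for illustration and do not reproduce HiMODE's actual implementation.

import torch
import torch.nn as nn

class ContextAdjustmentLayer(nn.Module):
    """Illustrative refinement head: fuses backbone features with the raw
    depth map predicted by the encoder-decoder and learns a residual
    correction. Channel sizes and depth are assumptions, not HiMODE's."""
    def __init__(self, feat_channels: int = 64, hidden: int = 32):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_channels + 1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, raw_depth: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # raw_depth: (B, 1, H, W) coarse depth; feats: (B, C, H, W) backbone features
        x = torch.cat([raw_depth, feats], dim=1)
        return raw_depth + self.fuse(x)  # residual refinement of the depth map

# Usage sketch with dummy tensors
depth = torch.rand(2, 1, 128, 256)
feats = torch.rand(2, 64, 128, 256)
refined = ContextAdjustmentLayer()(depth, feats)
print(refined.shape)  # torch.Size([2, 1, 128, 256])

The residual formulation lets the layer learn only a correction on top of the coarse prediction rather than regress depth from scratch, which is one common way such refinement heads are built.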