Bootstrap aggregating (bagging) and boosting are two popular ensemble learning approaches that combine multiple base learners into a composite model for more accurate and more reliable performance. They have ...
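The preview above is cut off, but the two paradigms it contrasts are standard; below is a minimal sketch with scikit-learn (the dataset, base learners, and hyperparameters are illustrative, not from the paper, and parameter names assume a recent scikit-learn):

```python
# Minimal sketch contrasting bagging and boosting with scikit-learn.
# Dataset and base learners are illustrative, not from the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: base learners are trained independently on bootstrap resamples
# and their predictions are averaged, which mainly reduces variance.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=50, random_state=0
)

# Boosting: base learners are trained sequentially, each reweighting the
# examples its predecessors misclassified, which mainly reduces bias.
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```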
ISBN (print): 9781665442084
Emotion recognition, which aims to identify an individual's emotional state from acquired physiological or body signals, is very important in affective computing. Emotions have two common representations: categorical, e.g., happy, sad, etc., and dimensional (continuous), e.g., valence, arousal, and dominance. Training a good emotion classification or regression model usually requires a large amount of labeled data. However, the labeling process is very difficult: because emotions are subtle and uncertain, it usually takes multiple assessors to label each emotional instance to obtain the ground-truth categorical label or dimensional values. In this paper, we propose a multi-task active learning (MTAL) framework to query the most useful samples for labeling, which enables the efficient training of an emotion classification model and multiple emotion regression models simultaneously. This is novel and challenging, as all previous research considered emotion classification or regression alone, but not both simultaneously. Experimental results on the IEMOCAP dataset demonstrate that MTAL outperformed random selection and several state-of-the-art single-task active learning approaches, i.e., with the same number of labeled samples, MTAL obtains better emotion classification and regression models simultaneously.
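The abstract does not specify MTAL's acquisition function, so the following is only a plausible sketch of how a joint classification/regression query score could be computed, assuming predictive entropy for the classifier and ensemble disagreement for the regressors (the function name, weighting scheme, and normalization are all assumptions):

```python
import numpy as np

def mtal_acquisition(clf_probs, reg_preds, alpha=0.5):
    """Score unlabeled samples for a joint classification/regression query.

    clf_probs: (n_samples, n_classes) class probabilities from the classifier.
    reg_preds: (n_models, n_samples, n_dims) predictions from an ensemble of
               regressors (e.g., for valence/arousal/dominance).
    alpha:     illustrative trade-off between the two task uncertainties.
    """
    # Classification uncertainty: predictive entropy.
    eps = 1e-12
    entropy = -np.sum(clf_probs * np.log(clf_probs + eps), axis=1)

    # Regression uncertainty: disagreement (variance) across the ensemble,
    # averaged over the emotion dimensions.
    disagreement = reg_preds.var(axis=0).mean(axis=1)

    # Normalize each term to [0, 1] so the two scales are comparable.
    def norm(x):
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)

    return alpha * norm(entropy) + (1 - alpha) * norm(disagreement)

# Query the top-k most useful samples for labeling:
# scores = mtal_acquisition(clf_probs, reg_preds)
# query_idx = np.argsort(scores)[-k:]
```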
This technical report presents our solution for the temporal action detection task in the ActivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The cruci...
ISBN (print): 9781665428132
We show that relation modeling between visual elements matters in cropping view recommendation. Cropping view recommendation addresses the problem of image recomposition conditioned on the composition quality and the ranking of views (cropped sub-regions). This task is challenging because the visual difference is subtle when a visual element is kept or removed. Existing methods represent visual elements by extracting region-based convolutional features inside and outside the cropping view boundaries, without probing a fundamental question: why are some visual elements of interest while others are discarded? In this work, we observe that the relations between different visual elements significantly affect their relative positions to the desired cropping view, and that such relations can be characterized by attraction inside/outside the cropping view boundaries and repulsion across the boundaries. By instantiating a transformer-based solution that represents visual elements as visual words and models the dependencies between them, we report not only state-of-the-art performance on public benchmarks, but also interesting visualizations that depict the attraction and repulsion between visual elements, which may shed light on what makes for effective cropping view recommendation.
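The paper's exact architecture is not given here; as a rough sketch of the idea, the module below scores one candidate view by marking each region token ("visual word") as inside or outside the crop and letting a transformer encoder model the dependencies between them (the token construction, fusion, and scoring head are assumptions):

```python
import torch
import torch.nn as nn

class ViewScorer(nn.Module):
    """Illustrative sketch: score a candidate cropping view by modeling
    dependencies between visual words (region tokens) with a transformer.
    """
    def __init__(self, dim=256, heads=8, layers=4):
        super().__init__()
        # A binary embedding marks whether a token lies inside or outside
        # the candidate crop, so attention can capture attraction/repulsion
        # relative to the view boundary.
        self.side_embed = nn.Embedding(2, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, 1)  # composition-quality score

    def forward(self, tokens, inside_mask):
        # tokens: (B, N, dim) visual-word features from a backbone.
        # inside_mask: (B, N) long tensor, 1 = inside the crop, 0 = outside.
        x = tokens + self.side_embed(inside_mask)
        x = self.encoder(x)
        return self.head(x.mean(dim=1)).squeeze(-1)  # (B,) ranking score

# Candidate views would then be ranked by this score:
# scores = model(tokens, inside_mask); best = scores.argmax()
```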
Human pose estimation is the task of localizing body keypoints from still images. The state-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and n...
In semantic segmentation, pixels often share the same mask label across a vast region. However, in recent prevalent transformer-based models, predictions frequently suffer from incompleteness or discontinuities. This dilemma is caused by the sparse activation of vanilla self-attention during feature extraction: each query over-focuses on a small number of relevant keys but neglects the many keys sharing its category, restricting the capture of a universal feature representation for that category. Such sparsely activated self-attention further causes non-negligible feature differences among tokens of the same class in the feature maps, introducing noise into the final predictions. To reduce these differences, we propose the Densely Activated self-attention Module (DAM), a novel pluggable module designed to generate densely activated self-attention. Inserted after the encoder, it encourages each query to attend to a broader range of keys, yielding more consistent features. Experimental results on three widely used benchmarks with six different baselines demonstrate that DAM consistently improves performance with a negligible increase in parameters and FLOPs. Our work provides a new perspective on the behavior of self-attention in semantic segmentation.
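The abstract does not spell out how dense activation is realized; one simple way to push each query toward a broader set of keys is to smooth the attention distribution with a softmax temperature, which the illustrative module below does (the temperature mechanism is an assumption, not necessarily the paper's design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenselyActivatedSelfAttention(nn.Module):
    """Illustrative pluggable module inserted after an encoder.
    Dense activation is approximated here by a softmax temperature > 1,
    which flattens each query's attention so it covers more keys; the
    paper's exact mechanism may differ.
    """
    def __init__(self, dim, temperature=2.0):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temperature = temperature
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, N, dim) token features from the segmentation encoder.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        # Dividing the logits by a temperature > 1 densifies the attention
        # map, pulling tokens of the same category toward similar features.
        attn = F.softmax(attn / self.temperature, dim=-1)
        return x + self.proj(attn @ v)  # residual keeps the module pluggable
```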
With the development and application of computer vision, many object detection networks have been applied to detecting floating objects in rivers. For detection problems such as small targets being easily missed and mi...
3D interacting hand pose estimation from a single RGB image is a challenging task, due to serious self-occlusion and inter-occlusion between hands, confusingly similar appearance patterns between the two hands, the ill-posed mapping of joint positions from 2D to 3D, etc. To address these issues, we propose to extend A2J, the state-of-the-art depth-based 3D single-hand pose estimation method, to the RGB domain under interacting-hand conditions. Our key idea is to equip A2J with strong local-global awareness to jointly capture interacting hands' local fine details and the global articulation clues among joints. To this end, A2J is evolved under the Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds three main advantages over A2J. First, self-attention across local anchor points is built to make them aware of global spatial context, better capturing joints' articulation clues to resist occlusion. Second, each anchor point is regarded as a learnable query with adaptive feature learning to facilitate pattern-fitting capacity, instead of sharing the same local representation with the others. Last but not least, anchor points are located in 3D space instead of 2D as in A2J, to better support 3D pose prediction. Experiments on the challenging InterHand 2.6M dataset demonstrate that A2J-Transformer achieves state-of-the-art model-free performance (a 3.38 mm MPJPE improvement in the two-hand case) and can also be applied to the depth domain with strong generalization. The code is available at https://***/ChanglongJiangGit/A2J-Transformer.
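Based on the description above, a rough sketch of the anchor-to-joint aggregation with anchors as learnable queries might look as follows (layer sizes, the decoder configuration, and the head design are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AnchorToJointHead(nn.Module):
    """Sketch of A2J-style aggregation with anchors as learnable queries.
    num_joints=42 assumes 21 joints per hand for two hands.
    """
    def __init__(self, num_anchors=256, num_joints=42, dim=256):
        super().__init__()
        # Each 3D anchor point is a learnable query with its own embedding,
        # rather than sharing one local representation across anchors.
        self.anchor_pos = nn.Parameter(torch.rand(num_anchors, 3))
        self.queries = nn.Embedding(num_anchors, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.offset_head = nn.Linear(dim, num_joints * 3)  # per-joint 3D offsets
        self.weight_head = nn.Linear(dim, num_joints)      # per-joint weights

    def forward(self, memory):
        # memory: (B, HW, dim) flattened image features from the backbone.
        B = memory.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        # Self-attention across anchor queries makes each anchor aware of
        # global spatial context, which helps resist occlusion.
        feats = self.decoder(q, memory)                    # (B, A, dim)
        A = feats.size(1)
        offsets = self.offset_head(feats).view(B, A, -1, 3)  # (B, A, J, 3)
        weights = self.weight_head(feats).softmax(dim=1)     # (B, A, J)
        # Each joint is a weighted sum over anchors of (anchor + offset).
        cand = self.anchor_pos.view(1, A, 1, 3) + offsets    # (B, A, J, 3)
        return (weights.unsqueeze(-1) * cand).sum(dim=1)     # (B, J, 3)
```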
Real-time eyeblink detection in the wild can serve widely for fatigue detection, face anti-spoofing, emotion analysis, etc. Existing research efforts generally focus on single-person cases in trimmed videos. H...
Emotion Recognition in Conversation (ERC) has attracted widespread attention in the natural language processing field due to its enormous potential for practical applications. Existing ERC methods struggle to generalize to diverse scenarios due to insufficient modeling of context, ambiguous capture of dialogue relationships, and overfitting in speaker modeling. In this work, we present a Hybrid Continuous Attributive Network (HCAN) to address these issues from the perspectives of emotional continuation and emotional attribution. Specifically, HCAN adopts a hybrid recurrent and attention-based module to model global emotion continuity. A novel Emotional Attribution Encoding (EAE) is then proposed to model intra- and inter-emotional attribution for each utterance. Moreover, to enhance the robustness of the model in speaker modeling and improve its performance in different scenarios, a comprehensive emotional cognitive loss $\mathcal{L}_{EC}$ is proposed to alleviate emotional drift and overcome the model's overfitting to speaker modeling. Our model achieves state-of-the-art performance on three datasets, demonstrating the superiority of our work. Extensive comparative experiments and ablation studies on the three benchmarks provide further evidence for the efficacy of each module. Experiments on generalization ability further show the plug-and-play nature of the EAE module.
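The abstract leaves the wiring of the hybrid module open; the sketch below pairs a GRU branch (emotional continuation) with a self-attention branch (emotional attribution) and fuses them by summation, purely as an assumed illustration of such a hybrid design:

```python
import torch
import torch.nn as nn

class HybridContextEncoder(nn.Module):
    """Illustrative hybrid recurrent/attention context module for ERC.
    The combination shown (GRU plus self-attention, fused by summation)
    is an assumption; HCAN's exact wiring is not given in the abstract.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # The recurrent branch tracks how emotion flows utterance by
        # utterance (emotional continuation)...
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # ...while the attention branch lets each utterance attribute its
        # emotion to any other utterance in the conversation.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, utt):
        # utt: (B, T, dim) utterance embeddings for one conversation.
        rec, _ = self.gru(utt)
        att, _ = self.attn(utt, utt, utt)
        return self.norm(utt + rec + att)  # fused context-aware features
```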