We present a new dataset for speech emotion recognition (SER) tasks called Dusha. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech with their transcripts. The...
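As a minimal sketch of how such a corpus is typically consumed (the manifest file name and column names below are hypothetical illustrations, not the released Dusha schema):

```python
import csv

import torchaudio

# Hypothetical manifest layout: one row per clip with its audio path,
# transcript, and emotion label. The actual Dusha release may differ.
with open("dusha_manifest.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        waveform, sample_rate = torchaudio.load(row["audio_path"])
        text, emotion = row["transcript"], row["emotion"]
        # ... feed (waveform, text, emotion) into an SER pipeline
```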
In this paper, we introduce a novel tool for speech emotion recognition, CA-SER, that leverages self-supervised learning to extract semantic speech representations from a pre-trained wav2vec 2.0 model and combines them w...
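A minimal sketch of the feature-extraction step the abstract describes, using the Hugging Face transformers wav2vec 2.0 API (the checkpoint name is one common choice, not necessarily the one used in CA-SER):

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Any pre-trained wav2vec 2.0 checkpoint can stand in here.
name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

waveform = torch.randn(16000)  # stand-in for 1 s of 16 kHz speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # Frame-level semantic representations, shape (1, num_frames, 768)
    features = model(**inputs).last_hidden_state
```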
In this paper, we demonstrate how the problem of optimal team choice in the popular computer game Dota Underlords can be reduced to a linear integer programming problem. We propose a model and solve it for th...
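To make the reduction concrete, here is a toy integer program in the same spirit (hero names, values, and the synergy bonus are invented; PuLP is one off-the-shelf solver interface, not necessarily the paper's): binary variables select heroes, and the product of two selections is linearized with an auxiliary variable.

```python
import pulp

heroes = ["axe", "mirana", "luna", "tiny"]             # hypothetical hero pool
value = {"axe": 3, "mirana": 2, "luna": 4, "tiny": 1}  # standalone strengths
synergy = {("axe", "luna"): 2}                         # hypothetical alliance bonus
team_size = 3

prob = pulp.LpProblem("team_choice", pulp.LpMaximize)
x = pulp.LpVariable.dicts("pick", heroes, cat="Binary")
y = pulp.LpVariable.dicts("pair", synergy.keys(), cat="Binary")

# Objective: individual values plus pairwise synergy bonuses.
prob += (pulp.lpSum(value[h] * x[h] for h in heroes)
         + pulp.lpSum(b * y[p] for p, b in synergy.items()))
# Roster limit.
prob += pulp.lpSum(x[h] for h in heroes) <= team_size
# Linearize y = x_a AND x_b; with a positive bonus under maximization,
# the two upper bounds alone force y to equal the product.
for a, b in synergy:
    prob += y[(a, b)] <= x[a]
    prob += y[(a, b)] <= x[b]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
team = [h for h in heroes if x[h].value() == 1]
print(team)  # ['axe', 'mirana', 'luna'] for the toy numbers above
```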
This paper considers the assessment and evaluation of pronunciation quality in computer-aided language learning systems. We propose a novel distortion measure for speech processing based on gain optimization ...
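The abstract is truncated, so the paper's exact measure is not visible here; below is a generic sketch of a gain-optimized spectral distortion (my own illustration, not the paper's formula): the gain g minimizing ||ref − g·test||² has the closed form g = ⟨ref, test⟩ / ⟨test, test⟩.

```python
import numpy as np

def gain_optimized_distortion(ref: np.ndarray, test: np.ndarray) -> float:
    """Squared-error distortion between spectra after scaling `test` by
    the best gain: g = <ref, test> / <test, test> (1-D least squares)."""
    g = float(ref @ test) / float(test @ test)
    return float(np.sum((ref - g * test) ** 2))

# The measure is invariant to the overall level of `test`:
r, t = np.random.rand(128), np.random.rand(128)
assert np.isclose(gain_optimized_distortion(r, t),
                  gain_optimized_distortion(r, 10 * t))
```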
A temporal graph G = (G_1, G_2, ..., G_T) is a graph represented by a sequence of T graphs over a common set of vertices, such that at the i-th time step only the edge set E_i is active. The temporal graph exploration prob...
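For intuition, a brute-force reference implementation of temporal exploration under the usual model (one move or wait per step, only edges of E_t usable at step t); it enumerates reachable (vertex, visited-set) states, so it is exponential and only illustrates the problem, not any algorithm from the paper:

```python
def exploration_time(edge_sets, n, start=0):
    """Earliest step by which an agent starting at `start` can visit all
    n vertices, moving along at most one edge of E_t (or waiting) at each
    step t. A state reachable at time t stays reachable later by waiting,
    so the reachable-state set only grows. Brute force, exponential in n."""
    states = {(start, frozenset([start]))}
    for t, edges in enumerate(edge_sets):
        if any(len(visited) == n for _, visited in states):
            return t
        new_states = set()
        for v, visited in states:
            for a, b in edges:  # undirected active edges at step t
                if a == v:
                    new_states.add((b, visited | {b}))
                elif b == v:
                    new_states.add((a, visited | {a}))
        states |= new_states
    if any(len(visited) == n for _, visited in states):
        return len(edge_sets)
    return None  # not explorable within T steps

# Example: path edges appear one per step, so exploration takes 2 steps.
assert exploration_time([{(0, 1)}, {(1, 2)}], n=3) == 2
```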
In this paper, we describe the results of the HSEmotion team in two tasks of the seventh Affective Behavior analysis in-the-wild (ABAW) competition, namely, multi-task learning for simultaneous prediction of facial ex...
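A minimal sketch of the kind of multi-task head such systems place on top of a shared face encoder (the feature size and task set are illustrative assumptions, not the HSEmotion team's exact architecture): one branch for expression logits, one for valence/arousal regression.

```python
import torch
from torch import nn

class MultiTaskHead(nn.Module):
    """Shared facial features -> expression logits + valence/arousal.
    Feature width and the 8-class expression set are assumed values."""

    def __init__(self, feat_dim: int = 1280, num_expressions: int = 8):
        super().__init__()
        self.expression = nn.Linear(feat_dim, num_expressions)
        self.valence_arousal = nn.Sequential(
            nn.Linear(feat_dim, 2),
            nn.Tanh(),  # valence and arousal are typically in [-1, 1]
        )

    def forward(self, features: torch.Tensor):
        return self.expression(features), self.valence_arousal(features)
```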
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
This article presents our results for the sixth Affective Behavior analysis in-the-wild (ABAW) competition. To improve the trustworthiness of facial analysis, we study the possibility of using pre-trained deep models that extract reliable emotional features without the need to fine-tune the neural networks for a downstream task. In particular, we introduce several lightweight models based on MobileViT, MobileFaceNet, EfficientNet, and DDAMFN architectures trained in multi-task scenarios to recognize facial expressions, valence, and arousal on static photos. These neural networks extract frame-level features that are fed into a simple classifier, e.g., a linear feed-forward neural network, to predict emotion intensity, compound expressions, and valence/arousal. Experimental results for three tasks from the sixth ABAW challenge demonstrate that our approach allows us to significantly improve quality metrics on validation sets compared with existing non-ensemble techniques. As a result, our solutions took second place in the compound expression recognition competition.
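A sketch of the frozen-backbone pipeline the abstract describes, with a generic EfficientNet from timm standing in for the paper's emotion-specific models (the checkpoint and the 7-class head are assumptions for illustration):

```python
import timm
import torch
from torch import nn

# Frozen backbone: num_classes=0 makes timm return pooled features.
backbone = timm.create_model("efficientnet_b0", pretrained=True, num_classes=0)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # no fine-tuning of the feature extractor

classifier = nn.Linear(backbone.num_features, 7)  # e.g., 7 expression classes

frames = torch.randn(4, 3, 224, 224)  # stand-in batch of face crops
with torch.no_grad():
    features = backbone(frames)        # (4, 1280) frame-level features
logits = classifier(features)          # only this linear head is trained
```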
In this article, we present our team's results for the fifth Affective Behavior analysis in-the-wild (ABAW) competition. We study the usage of pre-trained convolutional networks from the EmotiEffNet family for fram...
ISBN (digital): 9781728180533
ISBN (print): 9781728180540
This paper is focused on the finetuning of acoustic models for speaker adaptation on a given gender. We pretrained the Transformer baseline model on Librispeech-960 and conducted experiments with finetuning on the gender-specific test subsets. The obtained word error rate (WER) is up to 5% and 3% lower relative to the baseline on the male and female subsets, respectively, if the layers in the encoder and decoder are not frozen and the tuning is started from the last checkpoints. Moreover, we adapted our base model on the complete L2 Arctic dataset of accented speech and finetuned it for particular speakers and for the male and female genders separately. The models trained on the gender subsets obtained 1-2% lower WER compared to the model tuned on the whole L2 Arctic dataset. Finally, it was experimentally confirmed that the concatenation of pretrained voice embeddings (x-vectors) and embeddings from a conventional encoder cannot significantly improve speech recognition accuracy.
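As an illustration of the frozen-versus-unfrozen comparison in the abstract (a generic PyTorch pattern, not the paper's training code; the `encoder`/`decoder` submodule names are assumptions):

```python
import torch
from torch import nn

def set_finetuning_mode(model: nn.Module, freeze_encoder: bool) -> None:
    """Toggle which parts of a Transformer ASR model receive gradients.
    Assumes the model exposes `encoder` and `decoder` submodules."""
    for name, param in model.named_parameters():
        # Freeze only encoder weights, or leave everything trainable.
        param.requires_grad = not (freeze_encoder and name.startswith("encoder."))

# Optimize only the parameters left trainable:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-5)
```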