We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models, negatively affecting performance compared to monotonically increasing a...
详细信息
ISBN:
(数字)9798350368741
ISBN:
(纸本)9798350368758
We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models, negatively affecting performance compared to monotonically increasing attention weights. Further investigation shows that the Conformer encoder reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose methods and ideas of how this flipping can be avoided and investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
This study investigates the robustness of Transformer models in irony detection addressing various textual perturbations, revealing potential biases in training data concerning ironic and non-ironic classes. The pertu...
详细信息
This paper introduces the RWTH-PHOENIX-Weather 2014, a video-based, large vocabulary, German sign language corpus which has been extended over the last two years, tripling the size of the original corpus. The corpus c...
详细信息
ISBN:
(纸本)9782951740884
This paper introduces the RWTH-PHOENIX-Weather 2014, a video-based, large vocabulary, German sign language corpus which has been extended over the last two years, tripling the size of the original corpus. The corpus contains weather forecasts simultaneously interpreted into sign language which were recorded from German public TV and manually annotated using glosses on the sentence level and semi-automatically transcribed spoken German extracted from the videos using the open-source speech recognition system RASR. Spatial annotations of the signers' hands as well as shape and orientation annotations of the dominant hand have been added for more than 40k respectively 10k video frames creating one of the largest corpora allowing for quantitative evaluation of object tracking algorithms. Further, over 2k signs have been annotated using the SignWriting annotation system, focusing on the shape, orientation, movement as well as spatial contacts of both hands. Finally, extended recognition and translation setups are defined, and baseline results are presented.
Recent advances in Audio–Visual Speech recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task ...
详细信息
Recent advances in Audio–Visual Speech recognition (AVSR) have led to unprecedented achievements in the field, improving the robustness of this type of system in adverse, noisy environments. In most cases, this task has been addressed through the design of models composed of two independent encoders, each dedicated to a specific modality. However, while recent works have explored unified audio–visual encoders, determining the optimal cross-modal architecture remains an ongoing challenge. Furthermore, such approaches often rely on models comprising vast amounts of parameters and high computational cost training processes. In this paper, we aim to bridge this research gap by introducing a novel audio–visual framework. Our proposed method constitutes, to the best of our knowledge, the first attempt to harness the flexibility and interpretability offered by encoder architectures, such as the Branchformer, in the design of parameter-efficient AVSR systems. To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio–visual unified encoder based on the layer-level branch scores provided by the modality-specific models. Extensive experiments on English and Spanish AVSR benchmarks covering multiple data conditions and scenarios demonstrated the effectiveness of our proposed method. Even when trained on a moderate scale of data, our models achieve competitive word error rates (WER) of approximately 2.5% for English and surpass existing approaches for Spanish, establishing a new benchmark with an average WER of around 9.1%. These results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates while significantly reducing the model complexity w.r.t. the prevalent approach in the field. Code and pre-trained models are available at https://***/david-gimeno/tailored-avsr .
We propose to use energy minimization in MRFs for matching-based image recognition tasks. To this end, the Tree-Reweighted Message Passing algorithm is modified by geometric constraints and efficiently used by exploit...
详细信息
We propose to explicitly model white-spaces for Arabic handwriting recognition within different writing variants. Position-dependent character shapes in Arabic handwriting allow for large white-spaces between characte...
详细信息
Sign languages represent an interesting niche for statistical machine translation that is typically hampered by the scarceness of suitable data, and most papers in this area apply only a few, well-known techniques and...
详细信息
In this paper we present a novel transliteration technique which is based on deep belief networks. Common approaches use finite state machines or other methods similar to conventional machine translation. Instead of u...
详细信息
We analyze the usage of Speeded Up Robust Features (SURF) as local descriptors for face recognition. The effect of different feature extraction and viewpoint consistency constrained matching approaches are analyzed. F...
详细信息
暂无评论