Search Results - Inner Mongolia University Library

IMPORTANCE OF DIFFERENT TEMPORAL MODULATIONS OF SPEECH: A TALE OF TWO PERSPECTIVES

arXiv, 2022

How important are different temporal speech modulations for speech recognition? We answer this question from two complementary perspectives. First, we quantify the amount of phonetic information in the modulation spectrum of speech by computing the mutual information between temporal modulations and frame-wise phoneme labels. From the second perspective, we ask which speech modulations an Automatic Speech Recognition (ASR) system prefers for its operation. Data-driven weights are learned over the modulation spectrum and optimized for an end-to-end ASR task. Both methods agree that speech information is mostly contained in slow modulations. Maximum mutual information occurs around 3-6 Hz, which is also the range of modulations most preferred by the ASR system. In addition, we show that incorporating this knowledge into ASR systems significantly reduces their dependency on the amount of training data. © 2022, CC BY.
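The first perspective in this abstract rests on mutual information between a modulation feature and frame-wise phoneme labels. The sketch below is only an illustration of that quantity, not the authors' estimator: it assumes the modulation feature has already been quantized into discrete bins, and the function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(features, labels):
    """Plug-in estimate of I(X; Y) in bits from paired discrete observations."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(features)
    joint = Counter(zip(features, labels))  # counts of (x, y) pairs
    px = Counter(features)                  # marginal counts of x
    py = Counter(labels)                    # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: one quantized modulation-energy bin per frame,
# paired with that frame's phoneme label.
bins = [0, 0, 1, 1, 2, 2, 0, 1]
phones = ["a", "a", "s", "s", "t", "t", "a", "s"]
print(mutual_information(bins, phones))
```

Repeating such an estimate for each modulation frequency band would give a curve of phonetic information against modulation frequency; the abstract reports that this curve peaks around 3-6 Hz.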

Keywords: speech recognition

Complex Frequency Domain Linear Prediction: A Tool to Compute Modulation Spectrum of Speech

arXiv, 2022

The conventional Frequency Domain Linear Prediction (FDLP) technique models the squared Hilbert envelope of speech with varied degrees of approximation; the envelope can be sampled at the required frame rate and used as features for Automatic Speech Recognition (ASR). Although the complex cepstrum of the conventional FDLP model has previously been used as compact frame-wise speech features, it has lacked interpretability in the context of the Hilbert envelope. In this paper, we propose a modification of the conventional FDLP model that allows easy interpretability of the complex cepstrum as temporal modulations in an all-pole model approximation of the power of the speech signal. Additionally, our "complex" FDLP yields significant speed-ups in comparison to conventional FDLP for the same degree of approximation. © 2022, CC BY.

Keywords: Frequency domain analysis

Contextualization with SPLADE for High Recall Retrieval

arXiv, 2024

Authors: Yang, Eugene. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

High Recall Retrieval (HRR), such as eDiscovery and medical systematic review, is a search problem that optimizes the cost of retrieving most of the relevant documents in a given collection. Iterative approaches, such as iterative relevance feedback and uncertainty sampling, are shown to be effective under various operational scenarios. Despite neural models demonstrating success in other text-related tasks, linear models such as logistic regression are, in general, still more effective and efficient in HRR, since the model is trained on and retrieves documents from the same fixed collection. In this work, we leverage SPLADE, an efficient retrieval model that transforms documents into contextualized sparse vectors, for HRR. Our approach combines the best of both worlds, leveraging both the contextualization from pretrained language models and the efficiency of linear models. It reduces the review cost by 10% and 18% in two HRR evaluation collections under a one-phase review workflow with a target recall of 80%. The experiment is implemented with TARexp and is available at https://***/eugene-yang/LSR-for-TAR. © 2024, CC BY.
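The "review cost at a target recall of 80%" reported in this abstract can be illustrated with a small evaluation-time sketch. It assumes a unit cost per reviewed document and full knowledge of the relevance labels, which a reviewer would not have during a live review; the function name `review_cost_at_recall` and the toy rankings are hypothetical and are not taken from TARexp.

```python
def review_cost_at_recall(ranked_labels, target_recall=0.8):
    """Count the documents reviewed, in ranked order, until the target recall is met.

    ranked_labels: 1 for a relevant document, 0 otherwise, in review order.
    """
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0  # nothing to find, so nothing needs to be reviewed
    found = 0
    for reviewed, is_relevant in enumerate(ranked_labels, start=1):
        found += is_relevant
        if found / total_relevant >= target_recall:
            return reviewed
    return len(ranked_labels)


# Hypothetical review orders produced by two different rankers
# over the same ten-document collection (five relevant documents).
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
improved = [1, 1, 0, 1, 0, 1, 0, 0, 0, 1]

cost_base = review_cost_at_recall(baseline)
cost_new = review_cost_at_recall(improved)
relative_reduction = 1 - cost_new / cost_base
print(cost_base, cost_new, relative_reduction)
```

A cost reduction such as the 10% and 18% in the abstract would correspond to `relative_reduction` under this simplified cost model; the paper's own cost definition may differ.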

Extending Translate-Train for ColBERT-X to African Language CLIR

arXiv, 2024

Authors: Yang, Eugene; Lawrie, Dawn J.; McNamee, Paul; Mayfield, James. Affiliations: Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

This paper describes the submission runs from the HLTCOE team at the CIRAL CLIR tasks for African languages at FIRE 2023. Our submissions use machine translation models to translate the documents and the training passages, and ColBERT-X as the retrieval model. Additionally, we present a set of unofficial runs that use an alternative training procedure with a similar training setting. © 2024, CC BY.

Keywords: Fires

ACOUSTIC MODELING FOR OVERLAPPING SPEECH RECOGNITION: JHU CHIME-5 CHALLENGE SYSTEM

arXiv, 2024

Authors: Manohar, Vimal; Chen, Szu-Jui; Wang, Zhiqi; Fujita, Yusuke; Watanabe, Shinji; Khudanpur, Sanjeev. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD 21218, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD 21218, United States; Hitachi Ltd. Research & Development Group, Kokubunji-shi, Tokyo, Japan

This paper summarizes our acoustic modeling efforts in the Johns Hopkins University speech recognition system for the CHiME-5 challenge to recognize highly overlapped dinner party speech recorded by multiple microphone arrays. We explore data augmentation approaches, neural network architectures, front-end speech dereverberation, beamforming, and robust i-vector extraction, with comparisons of our in-house implementations and publicly available tools. We finally achieved a word error rate of 69.4% on the development set, which is an 11.7% absolute improvement over the previous baseline of 81.1%, and release this improved baseline with refined techniques/tools as an advanced CHiME-5 recipe. Copyright © 2024, The Authors. All rights reserved.
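The word error rate (WER) figures quoted in this abstract follow the standard definition: word-level edit distance divided by the number of reference words. The sketch below shows that definition and the difference between absolute and relative improvement; the function name `word_error_rate` and the example sentences are hypothetical, and challenge scoring tools apply text normalization that is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterance pair: one substitution and one deletion over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

# Absolute versus relative improvement, using the figures from the abstract.
baseline_wer, new_wer = 81.1, 69.4
absolute_gain = baseline_wer - new_wer
relative_gain = absolute_gain / baseline_wer
print(absolute_gain, relative_gain)
```

The 11.7% in the abstract is the absolute difference in percentage points; relative to the 81.1% baseline, the same change is a reduction of roughly 14%.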

Keywords: Acoustic Modeling

DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model

arXiv, 2023

Authors: Wang, Helin; Thebaud, Thomas; Villalba, Jesús; Sydnor, Myra; Lammers, Becky; Dehak, Najim; Moro-Velazquez, Laureano. Affiliations: Center for Language and Speech Processing, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States; Department of Physical Medicine and Rehabilitation, Johns Hopkins University School of Medicine, United States

We present a novel typical-to-atypical voice conversion approach (DuTa-VC), which (i) can be trained with nonparallel data, (ii) is the first to introduce a diffusion probabilistic model, (iii) preserves the target speaker identity, and (iv) is aware of the phoneme duration of the target speaker. DuTa-VC consists of three parts: an encoder transforms the source mel-spectrogram into a duration-modified speaker-independent mel-spectrogram, a decoder performs the reverse diffusion to generate the target mel-spectrogram, and a vocoder is applied to reconstruct the waveform. Objective evaluations conducted on UASpeech show that DuTa-VC is able to capture severity characteristics of dysarthric speech, preserves speaker identity, and significantly improves dysarthric speech recognition when used for data augmentation. Subjective evaluations by two expert speech pathologists validate that DuTa-VC can preserve the severity and type of dysarthria of the target speakers in the synthesized speech. © 2023, CC0.

Keywords: Diffusion

MultiVENT: Multilingual Videos of Events with Aligned Natural Text

arXiv, 2023

Authors: Sanders, Kate; Etter, David; Kriz, Reno; Van Durme, Benjamin. Affiliations: Johns Hopkins University; Human Language Technology Center of Excellence, United States

Everyday news coverage has shifted from traditional broadcasts towards a wide range of presentation formats such as first-hand, unedited video footage. Datasets that reflect the diverse array of multimodal, multilingual news sources available online could be used to teach models to benefit from this shift, but existing news video datasets focus on traditional news broadcasts produced for English-speaking audiences. We address this limitation by constructing MultiVENT, a dataset of multilingual, event-centric videos grounded in text documents across five target languages. MultiVENT includes both news broadcast videos and non-professional event footage, which we use to analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models. Finally, we provide a model for complex, multilingual video retrieval to serve as a baseline for information retrieval using MultiVENT. © 2023, CC BY.

Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: David Palzer; Matthew Maciejewski; Eric Fosler-Lussier. Affiliations: Computer Science and Engineering, The Ohio State University; Human Language Technology Center of Excellence, The Johns Hopkins University

In recent years, end-to-end approaches have made notable progress in addressing the challenge of speaker diarization, which involves segmenting and identifying speakers in multi-talker recordings. One such approach, Encoder-Decoder Attractors (EDA), has been proposed to handle variable speaker counts as well as better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead focus on representing more detailed ‘speaker attributes’ through a multi-stage process of intermediate representations. Additionally, we enhance the architecture by replacing transformers with conformers, a convolution-augmented transformer, to model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.

Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?

2022 Findings of the Association for Computational Linguistics: EMNLP 2022

Authors: Mueller, David; Andrews, Nicholas; Dredze, Mark. Affiliations: Department of Computer Science, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States

Traditional multi-task learning architectures learn a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks (Raffel et al., 2020). Does this new architecture suffer from task conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures. © 2022 Association for Computational Linguistics.
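The abstract does not define "directional conflict". A common way to quantify it in the multi-task learning literature is the cosine similarity between two tasks' gradients with respect to the shared parameters, with a negative value read as conflict. The sketch below assumes that definition; the function names and the toy gradient vectors are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Cosine of the angle between two flattened gradient vectors."""
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same number of entries")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero gradient has no direction; treat it as neutral
    return dot / (norm_a * norm_b)


def in_directional_conflict(grad_a, grad_b):
    """Assumed criterion: two tasks conflict when their gradients point apart."""
    return cosine_similarity(grad_a, grad_b) < 0.0


# Hypothetical per-task gradients over the same shared parameters.
task_a = [0.5, -1.0, 0.25]
task_b = [-0.4, 0.9, 0.1]
print(cosine_similarity(task_a, task_b), in_directional_conflict(task_a, task_b))
```

Tracking this similarity over training steps for each pair of tasks is one way to compare conflict across architectures, under the assumption stated above.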

Keywords: Decoding