Search Results - Inner Mongolia University Library

Unsupervised acoustic unit discovery by leveraging a language-independent subword discriminative feature representation

arXiv, 2021

Authors: Feng, Siyuan; Zelasko, Piotr; Moro-Velázquez, Laureano; Scharenborg, Odette

Affiliations: Multimedia Computing Group, Delft University of Technology, Delft, Netherlands; Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, United States; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation, and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. In the first stage, a recently proposed method in the task of unsupervised subword modeling is improved by replacing a monolingual out-of-domain (OOD) ASR system with a multilingual one to create a subword-discriminative representation that is more language-independent. In the second stage, segment-level k-means is adopted, and two methods to represent the variable-length speech segments as fixed-dimension feature vectors are compared. Experiments on a very low-resource Mboshi language corpus show that our approach outperforms state-of-the-art AUD in both normalized mutual information (NMI) and F-score. The multilingual ASR improved upon the monolingual ASR in providing OOD phone labels and in estimating the phone boundaries. A comparison of our systems with and without knowing the ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that the current approach can significantly benefit from improved phone boundary estimation. Copyright © 2021, The Authors. All rights reserved.

Keywords: Telephone sets
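The second stage described in the abstract above (fixed-dimension segment vectors followed by segment-level k-means) can be illustrated with a minimal sketch. The abstract does not name the two segment representations it compares, so mean pooling is used here purely as an assumption, and the toy two-dimensional "frames" and the helper names `mean_pool` and `kmeans` are hypothetical.

```python
import random

def mean_pool(frames):
    """Average a variable-length list of frame vectors into one fixed-size vector.
    Mean pooling is an assumed choice, not necessarily one the paper compares."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_pool(members)
    return labels

# Toy segments of different lengths (1 to 3 frames each), drawn from two
# well-separated regions that stand in for two phone-like units.
segments = [
    [[0.0, 0.1], [0.2, 0.0]],
    [[0.1, 0.1], [0.0, 0.2], [0.1, 0.0]],
    [[0.3, 0.2]],
    [[5.0, 5.1], [5.2, 4.9]],
    [[4.9, 5.0], [5.1, 5.1], [5.0, 5.2]],
    [[5.3, 5.0]],
]
segment_vectors = [mean_pool(s) for s in segments]
unit_labels = kmeans(segment_vectors, k=2)
print(unit_labels)
```

In this sketch the cluster indices play the role of discovered acoustic units; in practice the frames would be the learned subword-discriminative features and the segment boundaries would come from the boundary estimator.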

A parallelizable lattice rescoring strategy with neural language models

arXiv, 2021

Authors: Li, Ke; Povey, Daniel; Khudanpur, Sanjeev

Affiliations: Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, MD 21218, United States; Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, MD 21218, United States; Xiaomi Corp., Beijing, China

This paper proposes a parallel computation strategy and a posterior-based lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition. First, lattices from first-pass decoding are expanded by the proposed posterior-based lattice expansion algorithm. Second, each expanded lattice is converted into a minimal list of hypotheses that covers every arc. Each hypothesis is constrained to be the best path for at least one arc it includes. For each lattice, the neural LM scores of the minimal list are computed in parallel and are then integrated back to the lattice in the rescoring stage. Experiments on the Switchboard dataset show that the proposed rescoring strategy obtains comparable recognition performance and generates more compact lattices than a competitive baseline method. Furthermore, the parallel rescoring method offers more flexibility by simplifying the integration of PyTorch-trained neural LMs for lattice rescoring with Kaldi. Copyright © 2021, The Authors. All rights reserved.

Keywords: speech recognition
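The idea of converting a lattice into a list of hypotheses that covers every arc, where each hypothesis is the best path through at least one arc, can be sketched on a toy lattice. This is an illustration under stated assumptions, not the authors' algorithm: the lattice encoding, the function name `covering_hypotheses`, and the arc visiting order are hypothetical, and the resulting list is a covering list that is not proven to be minimal.

```python
def covering_hypotheses(arcs, start, final):
    """Return paths (lists of arc indices) that together cover every arc.

    arcs: list of (src, dst, word, cost) with integer states numbered so that
    src < dst (topological order). Lower cost is better. Assumes every state
    is reachable from `start` and can reach `final`.
    """
    inf = float("inf")
    states = {s for src, dst, _, _ in arcs for s in (src, dst)}
    best_to = {s: inf for s in states}     # best cost from start to state
    best_from = {s: inf for s in states}   # best cost from state to final
    back, fwd = {}, {}                     # best incoming / outgoing arc per state
    best_to[start], best_from[final] = 0.0, 0.0

    # Forward Viterbi pass: visit arcs by increasing source state.
    for i in sorted(range(len(arcs)), key=lambda i: arcs[i][0]):
        src, dst, _, cost = arcs[i]
        if best_to[src] + cost < best_to[dst]:
            best_to[dst], back[dst] = best_to[src] + cost, i
    # Backward Viterbi pass: visit arcs by decreasing destination state.
    for i in sorted(range(len(arcs)), key=lambda i: -arcs[i][1]):
        src, dst, _, cost = arcs[i]
        if best_from[dst] + cost < best_from[src]:
            best_from[src], fwd[src] = best_from[dst] + cost, i

    def best_path_through(i):
        """Best prefix into arc i, then arc i, then best suffix out of it."""
        prefix, state = [], arcs[i][0]
        while state != start:
            prefix.append(back[state])
            state = arcs[back[state]][0]
        suffix, state = [], arcs[i][1]
        while state != final:
            suffix.append(fwd[state])
            state = arcs[fwd[state]][1]
        return prefix[::-1] + [i] + suffix

    covered, hyps = set(), []
    for i in range(len(arcs)):
        if i not in covered:               # skip arcs already on a chosen path
            path = best_path_through(i)
            hyps.append(path)
            covered.update(path)
    return hyps

# Toy lattice: 4 distinct paths, but 3 hypotheses are enough to cover all arcs.
toy_arcs = [
    (0, 1, "a", 1.0), (0, 1, "b", 2.0),
    (1, 2, "c", 1.0), (1, 2, "d", 3.0),
    (2, 3, "e", 1.0),
]
hyps = covering_hypotheses(toy_arcs, start=0, final=3)
for path in hyps:
    print(" ".join(toy_arcs[i][2] for i in path))
```

Because the hypotheses are plain word sequences, they could be scored by a neural LM in one batch, which is the property the abstract attributes to the parallel strategy.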

CTC Alignments Improve Autoregressive Translation

arXiv, 2022

Authors: Yan, Brian; Dalmia, Siddharth; Higuchi, Yosuke; Neubig, Graham; Metze, Florian; Black, Alan W.; Watanabe, Shinji

Affiliations: Language Technologies Institute, Carnegie Mellon University, United States; Department of Communications and Computer Engineering, Waseda University, Japan; Human Language Technology Center of Excellence, Johns Hopkins University, United States

Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However, for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework wherein CTC’s core properties can counteract several key weaknesses of pure-attention models during training and decoding. To validate this conjecture, we modify the Hybrid CTC/Attention model originally proposed for ASR to support text-to-text translation (MT) and speech-to-text translation (ST). Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks. Copyright © 2022, The Authors. All rights reserved.

Keywords: Decoding
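The abstract above does not spell out the joint objective. As a point of reference, the Hybrid CTC/Attention formulation for ASR that it builds on is usually written as an interpolation of the two losses during training and of the two log-probabilities during decoding. The weight symbol λ and the subscripts below are notation chosen for this note, and whether the translation models use exactly this form is an assumption.

```latex
% Training: interpolate the CTC and attention losses with a weight 0 <= lambda <= 1
\mathcal{L}_{\text{joint}} = \lambda\,\mathcal{L}_{\text{CTC}} + (1-\lambda)\,\mathcal{L}_{\text{att}}

% Decoding: rank candidate outputs Y for input X by the interpolated log-probability
\hat{Y} = \arg\max_{Y}\;\Big[\lambda \log p_{\text{CTC}}(Y \mid X) + (1-\lambda)\log p_{\text{att}}(Y \mid X)\Big]
```

Setting λ = 0 recovers a pure-attention model, which corresponds to the baselines the abstract compares against.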

Do Text-to-Text Multi-Task Learners Suffer from Task Conflict?

arXiv, 2022

Authors: Mueller, David; Andrews, Nicholas; Dredze, Mark

Affiliations: Department of Computer Science, Johns Hopkins University, United States; Human Language Technology Center of Excellence, Johns Hopkins University, United States

Traditional multi-task learning architectures train a single model across multiple tasks through a shared encoder followed by task-specific decoders. Learning these models often requires specialized training algorithms that address task conflict in the shared parameter updates, which otherwise can lead to negative transfer. A new type of multi-task learning within NLP homogenizes multi-task architectures as a shared encoder and language model decoder, which does surprisingly well across a range of diverse tasks (Raffel et al., 2020). Does this new architecture suffer from task conflicts that require specialized training algorithms? We study how certain factors in the shift towards text-to-text models affect multi-task conflict and negative transfer, finding that both directional conflict and transfer are surprisingly constant across architectures. © 2022, CC BY.

Keywords: Signal encoding
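One common way to quantify directional conflict between two tasks is the cosine similarity of their gradients with respect to the shared parameters, with a negative value meaning the updates pull in opposing directions. The sketch below assumes that definition; the abstract does not state the exact metric used, and the function names and toy gradient vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def in_conflict(grad_a, grad_b):
    """Two task gradients conflict directionally when their cosine is negative."""
    return cosine(grad_a, grad_b) < 0.0

# Toy gradients on the shared parameters for three hypothetical tasks.
grad_task_1 = [1.0, 2.0]
grad_task_2 = [2.0, 1.0]    # roughly aligned with task 1
grad_task_3 = [-1.0, -1.5]  # points away from task 1

print(in_conflict(grad_task_1, grad_task_2))
print(in_conflict(grad_task_1, grad_task_3))
```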

Sources of Transfer in Multilingual Named Entity Recognition

arXiv, 2020

Authors: Mueller, David; Andrews, Nicholas; Dredze, Mark

Affiliations: Center for Language and Speech Processing, Johns Hopkins University; Human Language Technology Center of Excellence, Johns Hopkins University

Named entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomenon, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters. Copyright © 2020, The Authors. All rights reserved.

Keywords: (none listed)

An Asynchronous WFST-Based Decoder for Automatic Speech Recognition

IEEE International Conference on Acoustics, Speech and Signal Processing

Authors: Hang Lv; Zhehuai Chen; Hainan Xu; Daniel Povey; Lei Xie; Sanjeev Khudanpur

Affiliations: Audio, Speech and Language Processing Lab (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China; Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA; Shanghai Jiao Tong University; Xiaomi Corporation, Beijing, China; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, USA

We introduce an asynchronous dynamic decoder, which adopts an efficient A* algorithm to incorporate big language models into one-pass decoding for large vocabulary continuous speech recognition. Unlike standard one-pass decoding with an on-the-fly composition decoder, which might induce significant computation overhead, the asynchronous dynamic decoder has a novel design with two fronts, one performing "exploration" and the other "backfill". The computation of the two fronts alternates during decoding, resulting in more effective pruning than standard one-pass decoding with an on-the-fly composition decoder. Experiments show that the proposed decoder is notably faster than standard one-pass decoding with an on-the-fly composition decoder, and that the speedup grows as data complexity increases.

Keywords: Vocabulary; Heuristic algorithms; Conferences; Computational modeling; Signal processing algorithms; Signal processing; Decoding

DRAWING ORDER RECOVERY FOR HANDWRITING CHINESE CHARACTERS

44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Authors: Zhao, Bocheng; Yang, Minghao; Tao, Jianhua

Affiliations: Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, USA; Human Language Technology Center of Excellence, The Johns Hopkins University, Baltimore, USA

ISBN: (print) 9781479981311

Recovering the drawing order from a Chinese handwriting image is a challenging problem. Most English drawing order recovery (DOR) methods perform unsatisfactorily on Chinese. This paper proposes a novel image-to-sequence algorithm for the Chinese DOR problem. The proposed method uses two regression convolutional neural network (CNN) models to generate two corresponding pen-tip movement heat-maps. To estimate pen-tip movement for most of the normal states in the writing process, the algorithm analyzes these two heat-maps with a specifically designed framework. The drawing order is then restored through a simple iterative process based on the proposed framework. Experiments on a public online handwriting database show that our method achieves remarkable results on Chinese DOR tasks. In addition, on English tasks, our method compares favorably with state-of-the-art methods.

Keywords: Drawing order recovery; Chinese handwriting; Convolution neural network; image-to-sequence model

Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

arXiv, 2020

Authors: Bhati, Saurabhchand; Villalba, Jesús; Żelasko, Piotr; Dehak, Najim

Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phonetic properties of a frame more than other factors of variability. We achieve this via a self-expressing autoencoder framework. It consists of a single encoder and two decoders with shared weights. The encoder projects the input features into a latent representation. One of the decoders tries to reconstruct the input from these latent representations and the other from the self-expressed version of them. We use the obtained features to segment and cluster the speech data. We evaluate the performance of the proposed method in the Zero Resource 2020 challenge unit discovery task. The proposed system consistently outperforms the baseline, demonstrating the usefulness of the method in learning representations. Copyright © 2020, The Authors. All rights reserved.

Keywords: Signal encoding
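The segmentation assumption in the abstract above (frames are more similar within a segment than across segments) can be illustrated by placing a boundary wherever the similarity between adjacent frames drops. This is only a sketch of that assumption: the cosine measure, the fixed threshold of 0.5, and the function name `segment_boundaries` are hypothetical choices, and the paper's actual boundary detector may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_boundaries(frames, threshold=0.5):
    """Return indices t where frame t starts a new segment, i.e. where the
    similarity between frame t-1 and frame t falls below the threshold."""
    return [t for t in range(1, len(frames))
            if cosine(frames[t - 1], frames[t]) < threshold]

# Toy features: two frames pointing one way, then two pointing another way.
frames = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
boundaries = segment_boundaries(frames)
print(boundaries)
```

The better the features isolate phonetic content from other sources of variability, the sharper this similarity drop becomes, which is why the abstract stresses the feature representation.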

Learning speaker embedding from text-to-speech

arXiv, 2020

Authors: Cho, Jaejin; Zelasko, Piotr; Villalba, Jesús; Watanabe, Shinji; Dehak, Najim

Affiliations: Center for Language and Speech Processing; Human Language Technology Center of Excellence, Johns Hopkins University, Baltimore, MD, United States

Zero-shot multi-speaker text-to-speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesize that the embeddings will contain minimal phonetic information since the TTS decoder will obtain that information from the textual input. TTS reconstruction can also be combined with speaker classification to enhance these embeddings further. Once trained, the speaker encoder computes representations for the speaker verification task, while the rest of the TTS blocks are discarded. We investigated training TTS from either manual or ASR-generated transcripts. The latter allows us to train embeddings on datasets without manual transcripts. We compared ASR transcripts and Kaldi phone alignments as TTS inputs, showing that the latter performed better due to their finer resolution. Unsupervised TTS embeddings improved EER by 2.06% absolute with regard to i-vectors for the LibriTTS dataset. TTS with speaker classification loss improved EER by 0.28% and 0.73% absolute over a model using only speaker classification loss on LibriTTS and VoxCeleb1, respectively. Copyright © 2020, The Authors. All rights reserved.

Keywords: Embeddings
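The results above are reported as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how EER can be estimated from verification trial scores follows; the threshold sweep over observed scores is a simple approximation, and the function name and toy scores are hypothetical rather than taken from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold.
    FRR = fraction of target (same-speaker) trials rejected.
    FAR = fraction of non-target (different-speaker) trials accepted.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy trial scores: one target scores low and one non-target scores high.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.1f}%")
```

An "absolute" improvement, as used in the abstract, is a difference in these percentage points (for example, going from 5.00% to 2.94% EER would be a 2.06% absolute gain; these two figures are illustrative, not from the paper).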