ISBN (digital): 9798350368741
ISBN (print): 9798350368758
ASR systems are deployed across diverse environments, each with specific hardware constraints. We use supernet training to jointly train multiple encoders of varying sizes, enabling dynamic model size adjustment to fit hardware constraints without redundant training. Moreover, we introduce a novel method called OrthoSoftmax, which applies multiple orthogonal softmax functions to efficiently identify optimal subnets within the supernet, avoiding resource-intensive search. This approach also enables more flexible and precise subnet selection by allowing selection based on various criteria and levels of granularity. Our results with CTC on Librispeech and TED-LIUM-v2 show that FLOPs-aware component-wise selection achieves the best overall performance. With the same number of training updates from one single job, WERs for all model sizes are comparable to or slightly better than those of individually trained models. Furthermore, we analyze patterns in the selected components and reveal interesting insights.
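The abstract does not spell out the OrthoSoftmax formulation, so the following is only a minimal sketch of the general idea of selecting supernet components with multiple, approximately orthogonal softmax distributions; the variable names, the orthogonality penalty, and the thresholding rule are assumptions made for illustration, not the authors' method.

```python
# Illustrative sketch: one softmax per target subnet over the supernet's
# components, with a penalty that pushes the selections toward orthogonality.
import torch
import torch.nn.functional as F

num_subnets, num_components = 4, 12          # e.g. 4 target sizes, 12 encoder blocks
logits = torch.nn.Parameter(torch.randn(num_subnets, num_components))

def selection_matrix(logits):
    # One softmax per subnet over all supernet components.
    return F.softmax(logits, dim=-1)

def orthogonality_penalty(S):
    # Drive the Gram matrix of the selection rows toward the identity:
    # different subnets then commit to different components, and each
    # row is pushed toward a near-one-hot selection.
    gram = S @ S.t()
    return ((gram - torch.eye(S.size(0))) ** 2).sum()

S = selection_matrix(logits)
loss = orthogonality_penalty(S)              # would be added to the CTC loss
loss.backward()

# After training, a hard component-wise selection per subnet can be read off
# by thresholding (assumed rule); a FLOPs-aware variant would additionally
# constrain each subnet's kept components to a compute budget.
keep = S > (1.0 / num_components)
print(keep.int())
```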
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained using sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorpora...
ISBN (digital): 9798350354799
ISBN (print): 9798350354805
This paper explores the rapid development of a telephone call summarization system utilizing large language models (LLMs). Our approach involves initial experiments with prompting existing LLMs to generate summaries of telephone conversations, followed by the creation of a tailored synthetic training dataset using stronger frontier models. We place special focus on the diversity of the generated data and on the ability to control the length of the generated summaries to meet various use-case-specific requirements. The effectiveness of our method is evaluated using two state-of-the-art LLM-as-a-judge evaluation techniques to ensure the quality and relevance of the summaries. Our results show that a fine-tuned Llama-2-7B-based summarization model performs on par with GPT-4 in terms of factual accuracy, completeness, and conciseness. Our findings demonstrate the potential for quickly bootstrapping a practical and efficient call summarization system.
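The abstract leaves the prompting details open; the sketch below only illustrates the two moving parts it mentions, a length-controlled summarization prompt and an LLM-as-a-judge comparison. The prompt wording is invented for illustration, and `call_llm` is a hypothetical stand-in for whatever chat-completion client is available.

```python
# Hypothetical prompt templates for length-controlled call summarization and
# pairwise LLM-as-a-judge evaluation; not the paper's actual prompts.
SUMMARY_PROMPT = """You are a call-center assistant. Summarize the telephone
conversation below in at most {max_words} words. Cover the caller's intent,
the outcome, and any agreed follow-up actions.

Conversation:
{transcript}

Summary:"""

JUDGE_PROMPT = """Given the conversation and two candidate summaries, rate each
from 1 to 5 for factual accuracy, completeness, and conciseness, then name the
better summary.

Conversation:
{transcript}

Summary A:
{summary_a}

Summary B:
{summary_b}"""

def summarize(call_llm, transcript, max_words=60):
    # Length control is expressed directly in the prompt.
    return call_llm(SUMMARY_PROMPT.format(max_words=max_words, transcript=transcript))

def judge(call_llm, transcript, summary_a, summary_b):
    # Pairwise comparison, e.g. fine-tuned model vs. GPT-4 output.
    return call_llm(JUDGE_PROMPT.format(transcript=transcript,
                                        summary_a=summary_a, summary_b=summary_b))
```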
The ongoing research scenario for automatic speech recognition (ASR) envisions a clear division between end-to-end approaches and classic modular systems. Even though a high-level comparison between the two approaches...
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances decoding from one chunk to the next, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, makes our model equivalent to a transducer model that operates on chunks instead of frames, with EOC corresponding to the blank symbol. We further explore the remaining differences between a standard transducer and our model. Additionally, we examine relevant aspects such as long-form speech generalization, beam size, and length normalization. Through experiments on Librispeech and TED-LIUM-v2, and by concatenating consecutive sequences for long-form trials, we find that our streamable model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
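To make the role of the EOC symbol concrete, here is a minimal greedy chunk-wise decoding loop; `model.decode_step`, the symbol index, and the per-chunk label cap are hypothetical placeholders, not the paper's interface.

```python
# Greedy chunk-wise decoding: emit labels within a chunk until EOC appears,
# then move on. EOC plays the role of the transducer's blank symbol.
import torch

EOC = 0  # assumed vocabulary index of the end-of-chunk symbol

def greedy_chunk_decode(model, encoder_chunks, max_labels_per_chunk=20):
    hyp = []
    for chunk in encoder_chunks:          # pre-defined fixed-size windows
        for _ in range(max_labels_per_chunk):
            # Hypothetical step: score next symbol given chunk and history.
            scores = model.decode_step(chunk, hyp)
            symbol = int(torch.argmax(scores))
            if symbol == EOC:
                break                     # advance to the next chunk
            hyp.append(symbol)            # ordinary label, stay in this chunk
    return hyp
```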
Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neur...
ISBN (digital): 9798350392258
ISBN (print): 9798350392265
Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams on which ASR is performed. Recently, TF-GridNet has shown impressive performance in speech separation under real reverberant conditions. Furthermore, a mixture encoder was proposed that leverages the mixed speech to mitigate the effect of separation artifacts. In this work, we extend the mixture encoder from a static two-speaker scenario to a natural meeting context featuring an arbitrary number of speakers and varying degrees of overlap. We further probe its limits by integrating it with separators of varying strength, including TF-GridNet. Our experiments achieve new state-of-the-art performance on LibriCSS using a single microphone. They show that TF-GridNet largely closes the gap between previous methods and oracle separation, independent of mixture encoding. Finally, we investigate the remaining potential for improvement.
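As an illustration of the mixture-encoder idea (not the paper's exact architecture), the sketch below encodes each separated stream on its own and adds shared features of the unseparated mixture back in, so that the ASR decoder can compensate for separation artifacts; the linear layers stand in for full encoder stacks.

```python
# Toy mixture encoder: per-stream features plus shared mixture features.
import torch
import torch.nn as nn

class MixtureEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.stream_enc = nn.Linear(feat_dim, hidden)   # placeholder encoder
        self.mix_enc = nn.Linear(feat_dim, hidden)      # placeholder encoder

    def forward(self, separated, mixture):
        # separated: (num_speakers, time, feat_dim); mixture: (time, feat_dim)
        mix = self.mix_enc(mixture)                     # shared mixture features
        return [self.stream_enc(s) + mix for s in separated]

enc = MixtureEncoder()
streams = enc(torch.randn(2, 100, 80), torch.randn(100, 80))
print([s.shape for s in streams])                       # one output per speaker
```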
In statistical classification and machine learning, classification error is an important performance measure, which is minimized by the Bayes decision rule. In practice, the unknown true distribution is usually replac...
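For reference, the Bayes decision rule mentioned in the snippet picks the class with the largest true posterior probability, and the expected error it attains (the Bayes error) is:

```latex
% Bayes decision rule and the classification error it minimizes
\hat{c}(x) = \operatorname*{argmax}_{c}\, p(c \mid x),
\qquad
P_{\mathrm{err}} = \mathbb{E}_{x}\!\left[\, 1 - \max_{c}\, p(c \mid x) \,\right]
```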