Search results - Inner Mongolia University Library

The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data

arXiv, 2024

Authors: Paraskevopoulos, Georgios; Tsoukala, Chara; Katsamanis, Athanasios; Katsouros, Vassilis. Institute for Speech and Language Processing, Athena Research Center, Athens, Greece

The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly supervised corpora serves as a cost-effective strategy for advancing speech technologies in under-resourced languages. © 2024, CC BY.

Keywords: speech recognition
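
The abstract above reports its results as word error rate (WER). As a minimal sketch of how that metric is conventionally computed (word-level edit distance divided by the number of reference words), and not the authors' evaluation code, the function below uses a hypothetical name and skips the text normalization (casing, punctuation, accents) that ASR evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the corpus.
    gold = "καλημέρα σε όλους τους ακροατές"
    silver = "καλημέρα σε όλους ακροατές"
    print(f"WER = {word_error_rate(gold, silver):.2f}")
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many insertions.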

Meltemi: The first open Large Language Model for Greek

arXiv, 2024

Authors: Voukoutis, Leon; Roussis, Dimitris; Paraskevopoulos, Georgios; Sofianopoulos, Sokratis; Prokopidis, Prokopis; Papavasileiou, Vassilis; Katsamanis, Athanasios; Piperidis, Stelios; Katsouros, Vassilis. Institute for Speech and Language Processing, Athena Research Center, Artemidos 6 & Epidavrou, Athens, Greece

We describe the development and capabilities of Meltemi 7B, the first open Large Language Model for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. For the development of Meltemi 7B, we adapt Mistral by continuous pretraining on the Greek corpus. Meltemi 7B contains up-to-date information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which has been used for the instruction-tuning of a chat model, named Meltemi 7B Instruct. Special care has been given to alignment and the removal of toxic content for Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available under the Apache 2.0 license. © 2024, CC BY.

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Zeng, Xiao-Min; Song, Yan; Zhuo, Zhu; Zhou, Yu; Li, Yu-Hong; Xue, Hui; Dai, Li-Rong; McLoughlin, Ian. Alibaba Group, China; University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Hefei, China; Singapore Institute of Technology, ICT Cluster, Singapore

ISBN (print): 9781728163277

In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to perform frame-level prediction. The output of the PAE, together with the original normal samples, is used for supervised contrastive representation learning in a multi-task framework. Besides the cross-entropy loss between classes, a contrastive loss is used to separate the PAE output from the original samples within each class. GeCo aims to better capture context information among frames, thanks to the self-attention mechanism of the PAE model. Furthermore, GeCo combines generative and contrastive learning, with the aim of yielding more effective and informative representations than existing methods. Extensive experiments have been conducted on the DCASE2020 Task2 development dataset, showing that GeCo outperforms state-of-the-art generative and discriminative methods. © 2023 IEEE.

Keywords: anomalous sound detection; contrastive learning; predictive autoencoder; representation learning

Arduino Voice Control for Arabic Speech Recognition using Smartphone

6th International Hybrid Conference on Informatics and Applied Mathematics, IAM 2023

Authors: Bakri, Adil; Lounnas, Khaled; Lichouri, Mohamed. Scientific and Technical Research Centre on Arid Regions (CRSTRA), Biskra, Algeria; Scientific Research and Technical Center for the Development of Arabic Language (CRSTDLA), Algiers, Algeria; Speech Communication and Signal Processing Laboratory (LCPTS), Faculty of Electronics and Computer Science, USTHB, Algiers, Algeria

Interacting with our surroundings through voice control is of growing interest. The technology is increasingly common in daily life, whether in smart homes, mobile phones, or the control of comfort features in vehicles. An open question is whether voice control will find a place in the production industry, where its reliability remains a critical concern. This project demonstrates an alternative approach to voice processing by integrating Arabic speech recognition in MIT App Inventor for voice command control. In this setup, voice signals are processed to operate an LED light circuit over Bluetooth. The voice command is spoken in Arabic into a Bluetooth-enabled smartphone, converted to a string by the BT Voice Control for Arduino, and transferred to the Bluetooth module connected to the Arduino board, which controls the LED light circuit. © 2023 Copyright for this paper by its authors.

Keywords: Smartphones

RPO: Retrieval Preference Optimization for Robust Retrieval-Augmented Generation

arXiv, 2025

Authors: Yan, Shi-Qi; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, China

While Retrieval-Augmented Generation (RAG) has exhibited promise in utilizing external knowledge, its generation process depends heavily on the quality and accuracy of the retrieved context. Large language models (LLMs) struggle to evaluate the correctness of non-parametric knowledge retrieved externally when it differs from internal memorization, leading to knowledge conflicts during response generation. To this end, we introduce Retrieval Preference Optimization (RPO), a lightweight and effective alignment method that adaptively leverages multi-source knowledge based on retrieval relevance. An implicit representation of retrieval relevance is derived and incorporated into the reward model to integrate retrieval evaluation and response generation into a single model, removing the additional retrieval-quality assessment procedure that previous methods require. Notably, RPO is the only RAG-dedicated alignment approach that quantifies the awareness of retrieval relevance in training, overcoming mathematical obstacles. Experiments on four datasets demonstrate that RPO outperforms RAG by 4-10% in accuracy without any extra component, exhibiting its robust generalization. Copyright © 2025, The Authors. All rights reserved.

Keywords: Content based retrieval

SAMOS: A Neural MOS Prediction Model Leveraging Semantic Representations and Acoustic Features

International Symposium on Chinese Spoken Language Processing

Authors: Yu-Fei Shi; Yang Ai; Ye-Xin Lu; Hui-Peng Du; Zhen-Hua Ling. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei

ISBN (digital): 9798331516826

ISBN (print): 9798331516833

Assessing the naturalness of speech using mean opinion score (MOS) prediction models has positive implications for the automatic evaluation of speech synthesis systems. Early MOS prediction models took the raw waveform or amplitude spectrum of speech as input, whereas more advanced methods employed self-supervised-learning (SSL) based models to extract semantic representations from speech for MOS prediction. These methods utilized limited aspects of speech information for MOS prediction, resulting in restricted prediction accuracy. Therefore, in this paper, we propose SAMOS, a MOS prediction model that leverages both Semantic and Acoustic information of the speech to be assessed. Specifically, the proposed SAMOS leverages a pretrained wav2vec2 to extract semantic representations and uses the feature extractor of a pretrained BiVocoder to extract acoustic features. These two types of features are then fed into the prediction network, which includes multitask heads and an aggregation layer, to obtain the final MOS score. Experimental results demonstrate that the proposed SAMOS outperforms current state-of-the-art MOS prediction models on the BVCC dataset and achieves comparable performance on the BC2019 dataset, according to the results of system-level evaluation metrics.

Keywords: Measurement; Training; Accuracy; Semantics; Predictive models; Feature extraction; Acoustics; Speech synthesis
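
The abstract above refers to system-level evaluation metrics without naming them. As a hedged illustration of what "system-level" usually means in MOS prediction, the sketch below first averages true and predicted scores per synthesis system and then computes mean squared error and Pearson correlation on those averages. The function names, the record layout, and the choice of these two metrics are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, float, float]  # (system_id, true_mos, predicted_mos)


def system_level_means(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Average the true and predicted MOS of all utterances from each system."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for system_id, true_mos, predicted_mos in records:
        entry = totals[system_id]
        entry[0] += true_mos
        entry[1] += predicted_mos
        entry[2] += 1
    return {sid: (t / n, p / n) for sid, (t, p, n) in totals.items()}


def mean_squared_error(pairs: List[Tuple[float, float]]) -> float:
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


def pearson_correlation(pairs: List[Tuple[float, float]]) -> float:
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    return cov / sqrt(var_t * var_p)


if __name__ == "__main__":
    # Made-up scores for three hypothetical systems.
    utterances = [
        ("A", 4.0, 3.5), ("A", 3.0, 3.5),
        ("B", 2.0, 2.0), ("B", 2.0, 3.0),
        ("C", 5.0, 4.5),
    ]
    pairs = list(system_level_means(utterances).values())
    print("system-level MSE:", mean_squared_error(pairs))
    print("system-level LCC:", pearson_correlation(pairs))
```

Averaging before scoring rewards models that rank systems correctly even when individual utterance predictions are noisy.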

ERVQ: Enhanced Residual Vector Quantization with Intra-and-Inter-Codebook Optimization for Neural Audio Codecs

arXiv, 2024

Authors: Zheng, Rui-Chen; Du, Hui-Peng; Jiang, Xiao-Hang; Ai, Yang; Ling, Zhen-Hua. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, China

Current neural audio codecs typically use residual vector quantization (RVQ) to discretize speech signals. However, they often experience codebook collapse, which reduces the effective codebook size and leads to suboptimal performance. To address this problem, we introduce ERVQ, Enhanced Residual Vector Quantization, a novel enhancement strategy for the RVQ framework in neural audio codecs. ERVQ mitigates codebook collapse and boosts codec performance through both intra- and inter-codebook optimization. Intra-codebook optimization incorporates an online clustering strategy and a code balancing loss to ensure balanced and efficient codebook utilization. Inter-codebook optimization improves the diversity of quantized features by minimizing the similarity between successive quantizations. Our experiments show that ERVQ significantly enhances audio codec performance across different models, sampling rates, and bitrates, achieving superior quality and generalization capabilities. It also achieves 100% codebook utilization on one of the most advanced neural audio codecs. Further experiments indicate that audio codecs improved by the ERVQ strategy can improve unified speech-and-text large language models (LLMs). Specifically, there is a notable improvement in the naturalness of generated speech in downstream zero-shot text-to-speech tasks. Audio samples are available online. Copyright © 2024, The Authors. All rights reserved.

Keywords: Vector quantization
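
To make the terms in the abstract above concrete, here is a minimal sketch of plain residual vector quantization and of a codebook utilization measure (the fraction of codewords selected at least once). It illustrates the baseline RVQ setting only; it does not implement ERVQ's online clustering, code balancing loss, or inter-codebook optimization. All names and the toy codebooks are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest_codeword(codebook: Codebook, v: Vector) -> int:
    """Index of the codeword with the smallest squared Euclidean distance to v."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Codebook]) -> Tuple[List[int], List[float]]:
    """Quantize x stage by stage; each stage quantizes the previous residual."""
    residual = list(x)
    indices: List[int] = []
    for codebook in codebooks:
        k = nearest_codeword(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices, residual


def codebook_utilization(all_indices: Sequence[Sequence[int]],
                         codebook_sizes: Sequence[int]) -> List[float]:
    """Per-stage fraction of codewords used at least once over a set of inputs."""
    used = [set() for _ in codebook_sizes]
    for indices in all_indices:
        for stage, k in enumerate(indices):
            used[stage].add(k)
    return [len(u) / size for u, size in zip(used, codebook_sizes)]


if __name__ == "__main__":
    # Two toy 2-D codebooks with two codewords each.
    codebooks = [
        [[0.0, 0.0], [1.0, 1.0]],
        [[0.0, 0.0], [0.25, -0.25]],
    ]
    inputs = [[1.2, 0.8], [0.1, 0.0]]
    encoded = [rvq_encode(x, codebooks)[0] for x in inputs]
    print("indices:", encoded)
    print("utilization:", codebook_utilization(encoded, [len(cb) for cb in codebooks]))
```

Codebook collapse shows up in this measure as utilization well below 1.0: many codewords are never the nearest neighbour of any residual, so the effective codebook is smaller than its nominal size.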

APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm

International Symposium on Chinese Spoken Language Processing

Authors: Hui-Peng Du; Yang Ai; Rui-Chen Zheng; Zhen-Hua Ling. National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei

ISBN (digital): 9798331516826

ISBN (print): 9798331516833

This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. APCodec+ takes the audio amplitude and phase spectra as the coding object, and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the encoder and quantizer fix their parameters and provide high-quality training data for the decoder and discriminator. The decoder and discriminator are individually trained from scratch without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that our proposed APCodec+ at low bitrates achieves comparable performance with baseline codecs at higher bitrates, thanks to the proposed staged training paradigm.

Keywords: Training; Codecs; Quantization (signal); Bit rate; Training data; Encoding; Decoding

Weakly-Supervised Automated Audio Captioning via Text Only Training

arXiv, 2023

Authors: Kouzelis, Theodoros; Katsouros, Vassilis. Institute for Language and Speech Processing, Athena Research Center, Marousi 15125, Greece

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to 83% compared to fully supervised approaches trained with paired target data. Our code is available at: https://***/zelaki/wsac. © 2023, CC BY.

Keywords: Embeddings
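
The abstract above says that strategies are used to bridge the modality gap at training and inference time, but does not say which ones. The sketch below shows two generic techniques often used for this purpose, offered only as an assumption about what such strategies can look like: Gaussian noise injection on the text embedding during training, and replacing an audio embedding with its most similar stored text embedding at inference. Embeddings are plain lists of floats here; no CLAP model is loaded and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def noisy_text_embedding(text_emb: Sequence[float], sigma: float,
                         rng: random.Random) -> List[float]:
    """Training-time sketch: perturb the unit-norm text embedding with
    zero-mean Gaussian noise, then project back onto the unit sphere."""
    unit = l2_normalize(text_emb)
    perturbed = [x + rng.gauss(0.0, sigma) for x in unit]
    return l2_normalize(perturbed)


def nearest_text_embedding(audio_emb: Sequence[float],
                           memory: Sequence[Sequence[float]]) -> List[float]:
    """Inference-time sketch: return the stored text embedding with the
    highest cosine similarity to the audio embedding."""
    audio_unit = l2_normalize(audio_emb)
    candidates = [l2_normalize(m) for m in memory]
    return max(candidates,
               key=lambda m: sum(a * b for a, b in zip(audio_unit, m)))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    text_emb = [3.0, 4.0]   # stand-in for a CLAP text embedding
    print(noisy_text_embedding(text_emb, sigma=0.1, rng=rng))
    print(nearest_text_embedding([1.0, 0.1], [[0.0, 1.0], [2.0, 0.0]]))
```

In both cases the decoder only ever sees vectors that lie in, or near, the text-embedding region it was trained on, which is the general idea behind training on text alone and decoding from audio.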