Accurate career path prediction can support many stakeholders, like job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we...
Recognizing text from palm leaf manuscripts in low-resource, non-Latin languages like Balinese, Khmer, and Sundanese poses significant challenges due to limited annotated data and complex structures. Unlike modern lan...
In this paper, we describe the development of a part-of-speech (POS) tagger for Hadoti, a prominent language spoken in Rajasthan, India, despite its limited resources. For this, we manually tagged a corpus of 50,0...
ISBN: 9798350368741 (digital)
ISBN: 9798350368758 (print)
Being a form of biometric identification, the security of the speaker identification (SID) system is of utmost importance. To better understand the robustness of SID systems, we aim to perform more realistic attacks in SID, which are challenging for humans and machines to detect. In this study, we propose DiffAttack, a novel timbre-reserved adversarial attack approach that exploits the capability of a diffusion-based voice conversion (DiffVC) model to generate adversarial fake audio with distinct target speaker attribution. By introducing adversarial constraints into the diffusion-based voice conversion model’s generative process, we aim to craft fake samples that effectively mislead target models while preserving the speaker-wise characteristics. Specifically, inspired by the utilization of randomly sampled Gaussian noise in conventional adversarial attacks and diffusion processes, we incorporate adversarial constraints into the reverse diffusion process. As a result, these adversarial constraints subtly guide the reverse diffusion process toward aligning with the target speaker distribution. Our experiments on the LibriTTS dataset indicate that our proposed DiffAttack significantly improves the attack success rate compared to vanilla DiffVC or other methods. Furthermore, objective and subjective evaluations demonstrate that introducing adversarial constraints does not compromise the speech quality generated by the DiffVC model.
Code-switching automatic speech recognition (ASR) aims to transcribe speech that contains two or more languages accurately. To better capture language-specific speech representations and address language confusion in code-switching ASR, the mixture-of-experts (MoE) architecture and an additional language diarization (LD) decoder are commonly employed. However, most research still relies on simple operations such as weighted summation or concatenation to fuse language-specific speech representations, leaving considerable room to improve how language bias information is integrated. In this paper, we introduce CAMEL, a cross-attention-based MoE and language bias approach for code-switching ASR. Specifically, after each MoE layer, we fuse language-specific speech representations with cross-attention, leveraging its strong contextual modeling abilities. Additionally, we design a source attention-based mechanism to incorporate the language information from the LD decoder output into text embeddings. Experimental results demonstrate that our approach achieves state-of-the-art performance on the SEAME, ASRU200, and ASRU700+Librispeech460 Mandarin-English code-switching ASR datasets.
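The fusion step described in the abstract above can be illustrated with a minimal sketch of scaled dot-product cross-attention between two expert outputs. This is not the CAMEL implementation: the names `cross_attention` and `fuse_experts`, the omission of learned projections, the residual-sum fusion rule, and the toy shapes are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Scaled dot-product attention: each query frame attends over all context frames.

    query, context: arrays of shape (frames, dim).
    Learned query/key/value projections are omitted (assumption, for brevity).
    Returns the attended context (frames, dim) and the attention weights.
    """
    dim = query.shape[-1]
    scores = query @ context.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights

def fuse_experts(h_zh, h_en):
    """Hypothetical fusion rule: each expert attends to the other, then sum with residuals."""
    zh_from_en, _ = cross_attention(h_zh, h_en)
    en_from_zh, _ = cross_attention(h_en, h_zh)
    return (h_zh + zh_from_en) + (h_en + en_from_zh)

# Toy expert outputs: 5 frames, 8 dimensions each (random stand-ins, not real features).
rng = np.random.default_rng(0)
h_zh = rng.standard_normal((5, 8))
h_en = rng.standard_normal((5, 8))

fused = fuse_experts(h_zh, h_en)
_, attn = cross_attention(h_zh, h_en)
print(fused.shape)
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```

Unlike a fixed weighted sum or concatenation, the attention weights here depend on the content of both representations at every frame, which is the property the abstract refers to as contextual modeling.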
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowdsourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the underrepresented Oromo language. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://***/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.
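For readers unfamiliar with the metric reported above, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the number of reference words. The sketch below shows that standard definition; the function name `word_error_rate` and the English example sentences are illustrative assumptions and are not drawn from the dataset or the authors' evaluation scripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# Illustrative example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")  # 2 errors over 6 reference words
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.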
Language is a primary means of communication. It is a medium through which we can interact with society. Recognizing it, each language has its own set of grammatical rules. This study focused on the development of a r...
Generative models have attracted considerable attention for speech separation tasks, and among these, diffusion-based methods are being explored. Despite the notable success of diffusion techniques in generation tasks, their adaptation to speech separation has encountered challenges, notably slow convergence and suboptimal separation outcomes. To address these issues and enhance the efficacy of diffusion-based speech separation, we introduce EDSep, a novel single-channel method grounded in score matching via a stochastic differential equation (SDE). This method enhances generative modeling for speech source separation by optimizing training and sampling efficiency. Specifically, a novel denoiser function is proposed to approximate data distributions, which obtains ideal denoiser outputs. Additionally, a stochastic sampler is carefully designed to solve the reverse SDE during the sampling process, gradually separating speech from mixtures. Extensive experiments on databases such as WSJ0-2mix, LRS2-2mix, and VoxCeleb2-2mix demonstrate our proposed method’s superior performance over existing diffusion and discriminative models, validating its efficacy.
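The relationship between a denoiser, the score function, and a stochastic reverse-SDE sampler can be seen in a toy setting. The sketch below is not EDSep: it assumes a one-dimensional Gaussian data distribution (so the ideal denoiser has a closed form), a variance-exploding SDE with noise level sigma(t) = t, and a plain Euler-Maruyama discretization. All names and constants (`MU`, `S`, `ideal_denoiser`, `reverse_sde_sample`) are hypothetical.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2); illustrative values only

def ideal_denoiser(x, sigma):
    """Posterior mean E[x0 | x] when x = x0 + sigma * noise and x0 ~ N(MU, S^2)."""
    return (S**2 * x + sigma**2 * MU) / (S**2 + sigma**2)

def score(x, sigma):
    """Score of the noisy marginal, recovered from the denoiser: (D(x) - x) / sigma^2."""
    return (ideal_denoiser(x, sigma) - x) / sigma**2

def reverse_sde_sample(n_samples=20000, t_max=10.0, n_steps=1000, seed=0):
    """Euler-Maruyama integration of the reverse SDE from t_max down to 0."""
    rng = np.random.default_rng(seed)
    dt = t_max / n_steps
    x = t_max * rng.standard_normal(n_samples)  # start from the wide prior N(0, t_max^2)
    for i in range(n_steps):
        t = t_max - i * dt   # current noise level, never below dt
        g2 = 2.0 * t         # g(t)^2 = d(sigma^2)/dt for sigma(t) = t
        noise = rng.standard_normal(n_samples)
        # Drift follows the score toward the data; the noise term keeps the sampler stochastic.
        x = x + g2 * score(x, t) * dt + np.sqrt(g2 * dt) * noise
    return x

samples = reverse_sde_sample()
print(samples.mean(), samples.std())  # should land close to MU and S
```

In a real separation system the closed-form denoiser would be replaced by a trained network conditioned on the mixture, and the step schedule would typically be non-uniform; the toy only shows why an accurate denoiser is enough to drive the reverse process.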