Search Results - Inner Mongolia University Library

Cross-Domain Diffusion Based Speech Enhancement for Very Noisy Speech

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Heming Wang; DeLiang Wang. Affiliations: Department of Computer Science and Engineering, The Ohio State University, USA; Center for Cognitive and Brain Sciences, The Ohio State University, USA

Deep learning based speech enhancement has achieved remarkable success, but challenges remain in low signal-to-noise ratio (SNR) nonstationary noise scenarios. In this study, we propose to incorporate diffusion-based learning into an enhancement model and improve robustness in extremely noisy conditions. Specifically, a frequency-domain diffusion-based generative module is employed, and it accepts the enhanced signal obtained from a time-domain supervised enhancement module as an auxiliary input to learn to recover clean speech spectrograms. Experimental results on the TIMIT dataset demonstrate the advantage of this approach and show better enhancement performance over other strong baselines in both -5 and -10 dB SNR noisy conditions.

Keywords: Training; Deep learning; Frequency-domain analysis; Speech enhancement; Robustness; Noise measurement; Background noise

Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Hassan Taherian; DeLiang Wang. Affiliations: Department of Computer Science and Engineering, The Ohio State University, USA; Center for Cognitive and Brain Sciences, The Ohio State University, USA

The performance of automatic speech recognition (ASR) systems severely degrades when multi-talker speech overlap occurs. In meeting environments, speech separation is typically performed to improve the robustness of ASR systems. Recently, location-based training (LBT) was proposed as a new training criterion for multi-channel talker-independent speaker separation. Assuming fixed array geometry, LBT outperforms widely-used permutation-invariant training in fully overlapped utterances and matched reverberant conditions. This paper extends LBT to conversational multi-channel speaker separation. We introduce multi-resolution LBT to estimate the complex spectrograms from low to high time and frequency resolutions. With multi-resolution LBT, convolutional kernels are assigned consistently based on speaker locations in physical space. Evaluation results show that multi-resolution LBT consistently outperforms other competitive methods on the recorded LibriCSS corpus.

Keywords: Training; Geometry; Time-frequency analysis; Convolution; Robustness; Frequency estimation; Speech processing

Systematic Biases in LLM Simulations of Debates

2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024

Authors: Taubenfeld, Amir; Dover, Yaniv; Reichart, Roi; Goldstein, Ariel. Affiliations: The Hebrew University of Jerusalem, School of Computer Science and Engineering, Israel; Google Research, United States; The Hebrew University Business School, Jerusalem, Israel; Federmann Center for the Study of Rationality, Hebrew University, Jerusalem, Israel; Faculty of Data and Decision Sciences, Technion, Israel; Department of Cognitive and Brain Sciences, Hebrew University, Jerusalem, Israel

ISBN: (print) 9798891761643

The emergence of Large Language Models (LLMs) has opened exciting possibilities for constructing computational simulations designed to replicate human behavior accurately. Current research suggests that LLM-based agents become increasingly human-like in their performance, sparking interest in using these AI agents as substitutes for human participants in behavioral studies. However, LLMs are complex statistical learners without straightforward deductive rules, making them prone to unexpected behaviors. Hence, it is crucial to study and pinpoint the key behavioral distinctions between humans and LLM-based agents. In this study, we highlight the limitations of LLMs in simulating human interactions, particularly focusing on LLMs' ability to simulate political debates on topics that are important aspects of people's day-to-day lives and decision-making processes. Our findings indicate a tendency for LLM agents to conform to the model's inherent social biases despite being directed to debate from certain political perspectives. This tendency results in behavioral patterns that seem to deviate from well-established social dynamics among humans. We reinforce these observations using an automatic self-fine-tuning method, which enables us to manipulate the biases within the LLM and demonstrate that agents subsequently align with the altered biases. These results underscore the need for further research to develop methods that help agents overcome these biases, a critical step toward creating more realistic simulations. © 2024 Association for Computational Linguistics.

Keywords: Computational linguistics

Spiking Neural Network Analysis of MT-MST Pathways in Biological Motion Processing

SSRN, 2024

Authors: Zhang, Yun; Liu, Ying; Feng, Tingting; Zhang, Tao; Qu, Hong; Yi, Zhang. Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China; State Key Laboratory of Brain and Cognitive Science, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China; College of Computer Science, Sichuan University, Chengdu 610065, China

Understanding the neural mechanisms underlying biological motion perception remains a significant challenge in neuroscience. To further explore this mechanism, we construct the BioMotion-SNN model using real bioneural data from the MT to MST regions in macaques. To characterize neuron activity within specific time windows, we propose the windows learning strategy, which employs windowed learning to extract crucial information related to specific events or stimuli. By analyzing the connectivity structure of the BioMotion-SNN model, we identify regular projection patterns from MT to MST, reflected in the varying response characteristics of MT neurons based on their projection strength to different MST neuron populations. This work not only verifies some biological properties of MST neurons, but also underscores the potential of neurodevelopmental SNN models, driven by real electrophysiological data, in improving our understanding of neural processing. © 2024, The Authors. All rights reserved.

Keywords: Neural networks

Leveraging Laryngograph Data for Robust Voicing Detection in Speech

arXiv, 2023

Authors: Zhang, Yixuan; Wang, Heming; Wang, DeLiang. Affiliations: The Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210, United States; The Department of Computer Science and Engineering, The Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210, United States

Accurately detecting voiced intervals in speech signals is a critical step in pitch tracking and has numerous applications. While conventional signal processing methods and deep learning algorithms have been proposed for this task, their need to fine-tune threshold parameters for different datasets and limited generalization restrict their utility in real-world applications. To address these challenges, this study proposes a supervised voicing detection model that leverages recorded laryngograph data. The model is based on a densely-connected convolutional recurrent neural network (DC-CRN), and trained on data with reference voicing decisions extracted from laryngograph data sets. Pretraining is also investigated to improve the generalization ability of the model. The proposed model produces robust voicing detection results, outperforming other strong baseline methods, and generalizes well to unseen datasets. The source code of the proposed model with pretraining is provided along with the list of used laryngograph datasets to facilitate further research in this area. Copyright © 2023, The Authors. All rights reserved.

Keywords: Recurrent neural networks

Spiking representation learning for associative memories

arXiv, 2024

Authors: Ravichandran, Naresh; Lansner, Anders; Herman, Pawel. Affiliations: Computational Cognitive Brain Science Group, Department of Computational Science and Technology, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden; Department of Mathematics, Stockholm University, Stockholm, Sweden; Digital Futures, KTH Royal Institute of Technology, Stockholm, Sweden

Networks of interconnected neurons communicating through spiking signals offer the bedrock of neural computations. Our brain’s spiking neural networks have the computational capacity to achieve complex pattern recognition and cognitive functions effortlessly. However, solving real-world problems with artificial spiking neural networks (SNNs) has proved to be difficult for a variety of reasons. Crucially, scaling SNNs to large networks and processing large-scale real-world datasets have been challenging, especially when compared to their non-spiking deep learning counterparts. The critical operation that is needed of SNNs is the ability to learn distributed representations from data and use these representations for perceptual, cognitive and memory operations. In this work, we introduce a novel SNN that performs unsupervised representation learning and associative memory operations leveraging Hebbian synaptic and activity-dependent structural plasticity coupled with neuron-units modelled as Poisson spike generators with sparse firing (~1 Hz mean and ~100 Hz maximum firing rate). Crucially, the architecture of our model derives from the neocortical columnar organization and combines feedforward projections for learning hidden representations and recurrent projections for forming associative memories. We evaluated the model on properties relevant for attractor-based associative memories such as pattern completion, perceptual rivalry, distortion resistance, and prototype extraction. © 2024, CC BY-SA.

Keywords: Unsupervised learning
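
The abstract above describes neuron units modelled as Poisson spike generators with sparse firing. As a rough illustration only (not the authors' model or code), a Poisson spike train is commonly approximated in discrete time by drawing one Bernoulli sample per time step with probability rate × dt. The function name `poisson_spike_train` and the 1 ms default step are assumptions made for this sketch.

```python
import random


def poisson_spike_train(rate_hz, duration_s, dt_s=0.001, rng=None):
    """Approximate a Poisson spike train with one Bernoulli draw per time step.

    Returns a list of 0/1 values, one per step. The approximation requires
    rate_hz * dt_s <= 1 and is only accurate when that product is small.
    """
    if rate_hz < 0:
        raise ValueError("rate_hz must be non-negative")
    p_spike = rate_hz * dt_s
    if p_spike > 1:
        raise ValueError("rate_hz * dt_s must not exceed 1; use a smaller dt_s")
    rng = rng or random.Random()
    n_steps = int(round(duration_s / dt_s))
    return [1 if rng.random() < p_spike else 0 for _ in range(n_steps)]


if __name__ == "__main__":
    rng = random.Random(42)
    # Sparse regime (~1 Hz) versus the ~100 Hz maximum quoted in the abstract.
    for rate in (1.0, 100.0):
        spikes = poisson_spike_train(rate, duration_s=100.0, rng=rng)
        print(rate, sum(spikes) / 100.0)  # empirical rate in Hz
```

With a 1 ms step, a 1 Hz unit spikes in about one step per thousand, which gives a sense of how sparse the quoted mean activity is.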

AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling

arXiv, 2024

Authors: Kalkhorani, Vahid Ahmadi; Yu, Cheng; Kumar, Anurag; Tan, Ke; Xu, Buye; Wang, DeLiang. Affiliations: Department of Computer Science and Engineering, Ohio State University, Columbus, OH 43210, United States; Meta Reality Labs, Redmond, WA 20004, United States; Department of Computer Science and Engineering, The Center for Cognitive and Brain Sciences, Ohio State University, Columbus, OH 43210, United States

Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets. © 2024, CC BY.

Keywords: Extraction

Multi-Resolution Location-Based Training for Multi-Channel Continuous Speech Separation

arXiv, 2023

Authors: Taherian, Hassan; Wang, DeLiang. Affiliations: Department of Computer Science and Engineering, The Ohio State University, United States; Center for Cognitive and Brain Sciences, The Ohio State University, United States

Keywords: Location

Multi-channel Conversational Speaker Separation via Neural Diarization

arXiv, 2023

Authors: Taherian, Hassan; Wang, DeLiang. Affiliations: Department of Computer Science and Engineering, The Ohio State University, Columbus, OH 43210-1277, United States; Department of Computer Science and Engineering, The Center for Cognitive and Brain Sciences, The Ohio State University, Columbus, OH 43210-1277, United States

When dealing with overlapped speech, the performance of automatic speech recognition (ASR) systems substantially degrades as they are designed for single-talker speech. To enhance ASR performance in conversational or meeting environments, continuous speaker separation (CSS) is commonly employed. However, CSS requires a short separation window to avoid many speakers inside the window and sequential grouping of discontinuous speech segments. To address these limitations, we introduce a new multi-channel framework called "speaker separation via neural diarization" (SSND) for meeting environments. Our approach utilizes an end-to-end diarization system to identify the speech activity of each individual speaker. By leveraging estimated speaker boundaries, we generate a sequence of embeddings, which in turn facilitate the assignment of speakers to the outputs of a multi-talker separation model. SSND addresses the permutation ambiguity issue of talker-independent speaker separation during the diarization phase through location-based training, rather than during the separation process. This unique approach allows multiple non-overlapped speakers to be assigned to the same output stream, making it possible to efficiently process long segments—a task impossible with CSS. Additionally, SSND is naturally suitable for speaker-attributed ASR. We evaluate our proposed diarization and separation methods on the open LibriCSS dataset, advancing state-of-the-art diarization and ASR results by a large margin. Copyright © 2023, The Authors. All rights reserved.

Keywords: Speech recognition