Search Results - Inner Mongolia University Library

ZCS-CDiff: A Zero-Shot Code-Switching TTS System with Conformer-Based Diffusion Model

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Ke Chen; Zhihua Huang; Liang He; Yonghong Yan. Affiliations: School of Computer Science and Technology, Xinjiang University, Urumqi, China; Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi, China; Department of Electronic Engineering, Tsinghua University, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, CAS, Beijing, China

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Code-Switching (CS) Text-to-Speech (TTS) models have gained attention due to the increasing prevalence of multilingual communication. However, existing models struggle to meet the growing demand for personalized CS TTS, particularly in Zero-Shot (ZS) scenarios where the model should generate speech for an unseen speaker from reference speech. In this paper, we propose ZCS-CDiff, a zero-shot code-switching TTS system with a Conformer-based diffusion model. We disentangle speech features and use the diffusion model to precisely model these disentangled attributes, resulting in high-quality ZS CS speech. Additionally, we introduce a Conformer-based WaveNet as the denoising network within the diffusion model to further improve the accuracy of modeling different attributes. We also design a speaker-assist module to help the model better handle speaker information extracted from the reference speech, resulting in CS speech with higher speaker similarity. Experimental results and ablation studies demonstrate that the ZS CS speech generated by ZCS-CDiff has good naturalness, intelligibility, and speaker similarity, and confirm the effectiveness of our design choices. Audio samples are available (see footnote 1 of the paper).

Keywords: Training; Adaptation models; Accuracy; Speech coding; Noise reduction; Speech enhancement; Signal processing; Diffusion models; Probabilistic logic; Text to speech

Unified Domain Adaptive Semantic Segmentation

arXiv, 2023

Authors: Zhang, Zhe; Wu, Gaochang; Zhang, Jing; Zhu, Xiatian; Tao, Dacheng; Chai, Tianyou. Affiliations: The State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China; The Surrey Institute for People-Centred Artificial Intelligence, Centre for Vision Speech and Signal Processing, University of Surrey, Guildford, United Kingdom; The School of Computer Science, Faculty of Engineering, University of Sydney, Sydney, Australia

Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer supervision from a labeled source domain to an unlabeled and shifted target domain. Most existing UDA-SS works consider images, whilst recent attempts have extended further to tackle videos by modeling the temporal dimension. Although the two lines of research share the same major challenge, overcoming the underlying domain distribution shift, their studies are largely independent. This causes several issues: (1) The insights gained from each line of research remain fragmented, leading to a lack of holistic understanding of the problem and potential solutions. (2) The lack of unified methods and best practices across the two scenarios (images and videos) leads to redundant efforts and missed opportunities for cross-pollination of ideas. (3) Without a unified approach, the knowledge and advancements made in one scenario may not be effectively transferred to the other, leading to suboptimal performance and slower progress. Based on this observation, we advocate unifying the study of UDA-SS across video and image scenarios, enabling a more comprehensive understanding, synergistic advancements, and efficient knowledge sharing. To that end, we explore unified UDA-SS from a general domain augmentation perspective, which serves as a unifying framework, enables improved generalization, and offers potential for cross-pollination, ultimately contributing to practical impact and overall progress. Specifically, we propose a Quad-directional Mixup (QuadMix) method, which tackles intra-domain discontinuity, fragmented gap bridging, and feature inconsistencies through four-directional paths designed for intra- and inter-domain mixing within an explicit feature space. To deal with temporal shifts within videos, we incorporate optical flow-guided feature aggregation across spatial and temporal dimensions for fine-grained domain alignment, which is extendable to image scenarios.

Keywords: Semantic segmentation

Enhanced multi-stage network for defocus deblurring using dual-pixel images

13th International Conference on Signal Processing Systems, ICSPS 2021

Authors: Li, Ru; Xie, Junwei; Xue, Yuyang; Zou, Wenbin; Tong, Tong; Luo, Ming; Gao, Qinquan. Affiliations: College of Physics and Information Engineering, Fuzhou University, China; Fujian Key Lab of Medical Instrumentation & Pharmaceutical Technology, Fuzhou University, China; Imperial Vision Technology, Fujian, China; Department of Computer Science, University of Tsukuba, Tsukuba, Japan; Fujian Provincial Key Laboratory of Photonics Technology, Fujian Normal University, China

ISBN (digital): 9781510653184

ISBN (print): 9781510653177

Defocus blur, which arises from finite aperture size and exposure time, is an essential problem in the shooting process and seriously affects image quality. Studies of defocus deblurring in monocular images have yielded good results, but those on binocular images are rare. Current methods directly merge the left and right views regardless of their unique features. Objects within the camera's depth of field (DoF) show no difference in phase, while light rays from outside the DoF have a relative shift that is directly correlated with the amount of defocus blur. In this paper, we propose an enhanced multi-stage network for defocus deblurring using dual-pixel images. Taking into account the parallax between the left and right views, the first two stages learn the information of each view separately and correct the deviation of the images under the supervision of the ground truth. The third stage consists of EERG and ERGS. It merges with the feature map of the previous stage, so that the left and right views are mutually enhanced and a well-restored image is obtained. ERGS uses the residual block as its basic unit to restore the details of the blurred area while preserving the clear areas. Experimental results show that our proposed network achieves better accuracy than state-of-the-art approaches on the public DPD dataset. © COPYRIGHT SPIE.

Keywords: Geometrical optics

Deep Neural Decision Forest for Acoustic Scene Classification

European Signal Processing Conference (EUSIPCO)

Authors: Jianyuan Sun; Xubo Liu; Xinhao Mei; Jinzheng Zhao; Mark D. Plumbley; Volkan Kılıç; Wenwu Wang. Affiliations: Centre for Vision Speech and Signal Processing (CVSSP), University of Surrey, UK; College of Computer Science and Technology, Qingdao University, China; Department of Electrical and Electronics Engineering, Izmir Katip Celebi University, Turkey

ISBN (digital): 9789082797091

ISBN (print): 9781665467995

Acoustic scene classification (ASC) aims to classify an audio clip based on the characteristics of the recording environment. In this regard, deep learning based approaches have emerged as a useful tool for ASC problems. Conventional approaches to improving classification accuracy include integrating auxiliary methods such as attention mechanisms, pre-trained models, and ensembles of multiple sub-networks. However, due to the complexity of audio clips captured from different environments, it is difficult for existing deep learning models that use only a single classifier to distinguish their categories without any auxiliary methods. In this paper, we propose a novel approach for ASC using a deep neural decision forest (DNDF). DNDF combines a fixed number of convolutional layers with a decision forest as the final classifier. The decision forest consists of a fixed number of decision tree classifiers, which have been shown to offer better classification performance than a single classifier on some datasets. In particular, the decision forest differs substantially from traditional random forests in that it is stochastic, differentiable, and capable of using back-propagation to update and learn feature representations in the neural network. Experimental results on the DCASE2019 and ESC-50 datasets demonstrate that our proposed DNDF method improves ASC performance in terms of classification accuracy and shows competitive performance compared with state-of-the-art baselines.

Keywords: Deep learning; Image analysis; Neural networks; Stochastic processes; Training data; Forestry; Acoustics
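
The abstract above describes a decision forest that is "stochastic" and "differentiable". A minimal sketch of one common way to achieve this is soft routing: each internal node emits a probability of going left, each leaf holds a class distribution, and the tree output is the routing-weighted mixture of the leaves. The abstract does not confirm that DNDF uses exactly this formulation; the function names, the heap-ordered node layout, and the numbers below are illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Squash a raw network output into a left-routing probability."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probabilities(decisions: Sequence[float]) -> List[float]:
    """Probability of reaching each leaf of a complete binary tree.

    `decisions[n]` is the probability of routing LEFT at internal node n,
    with nodes stored in heap order (root at index 0). Assumes a complete
    tree, i.e. len(decisions) == 2**depth - 1 (hypothetical layout).
    """
    n_internal = len(decisions)
    probs = []
    for leaf in range(n_internal + 1):
        node = leaf + n_internal  # heap index of this leaf
        p = 1.0
        while node > 0:  # walk up to the root, multiplying branch probabilities
            parent = (node - 1) // 2
            went_left = node == 2 * parent + 1
            p *= decisions[parent] if went_left else 1.0 - decisions[parent]
            node = parent
        probs.append(p)
    return probs


def tree_predict(decisions: Sequence[float],
                 leaf_dists: Sequence[Sequence[float]]) -> List[float]:
    """Class distribution of one tree: leaf distributions weighted by routing."""
    mu = leaf_probabilities(decisions)
    n_classes = len(leaf_dists[0])
    return [sum(mu[l] * leaf_dists[l][c] for l in range(len(mu)))
            for c in range(n_classes)]


def forest_predict(trees: Sequence[Tuple[Sequence[float],
                                         Sequence[Sequence[float]]]]) -> List[float]:
    """Average the class distributions of all trees in the forest."""
    preds = [tree_predict(d, pi) for d, pi in trees]
    return [sum(p[c] for p in preds) / len(preds) for c in range(len(preds[0]))]


# Illustrative depth-2 tree: 3 internal nodes, 4 leaves, 2 classes.
raw_outputs = [2.0, -1.0, 0.5]            # stand-ins for network activations
decisions = [sigmoid(z) for z in raw_outputs]
leaves = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.1, 0.9]]
prediction = forest_predict([(decisions, leaves), (decisions, leaves)])
```

Because every step is a product or sum of smooth functions, gradients can flow from the class distribution back to the activations that produce `decisions`, which is what makes back-propagation through such a forest possible.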

Riemannian Self-Attention Mechanism for SPD Networks

arXiv, 2023

Authors: Wang, Rui; Wu, Xiao-Jun; Li, Hui; Kittler, Josef. Affiliations: School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China; Jiangsu Provincial Engineering Laboratory of Pattern Recognition and Computational Intelligence, Jiangnan University, China; Centre for Vision Speech and Signal Processing, University of Surrey, Guildford GU2 7XH, United Kingdom

The symmetric positive definite (SPD) matrix has been demonstrated to be an effective feature descriptor in many scientific areas, as it can adequately encode spatiotemporal statistics of the data on a curved Riemannian manifold, i.e., the SPD manifold. Although there are many different ways to design network architectures for nonlinear learning on SPD matrices, very few solutions explicitly mine the geometrical dependencies of features at different layers. Motivated by the great success of the self-attention mechanism in capturing long-range relationships, an SPD manifold self-attention mechanism (SMSA) is proposed in this paper using manifold-valued geometric operations, mainly the Riemannian metric, Riemannian mean, and Riemannian optimization. Then, an SMSA-based geometric learning module (SMSA-GLM) is designed to improve the discrimination of the generated deep structured representations. Extensive experimental results on three benchmarking datasets show that our modification of the baseline network further alleviates the information degradation problem and leads to improved accuracy. Copyright © 2023, The Authors. All rights reserved.

Keywords: Network architecture

Subspace Gaussian Mixture Modeling for low-resource non-native Punjabi Language Speech Recognition

6th International Conference on Futuristic Trends in Networks and Computing Technologies, FTNCT 2024

Authors: Bawa, Puneet; Kadyan, Virender; Chhabra, Gunjan; Chopra, Ashish. Affiliations: Centre of Excellence for Speech and Multimodal Laboratory, Chitkara University Institute of Engineering and Technology, Chitkara University, Punjab, India; Machine Intelligence Research Centre, School of Computer Science, UPES, Energy Acres, Bidholi, Uttarakhand, Dehradun 248007, India; Department of CSE, Graphic Era Hill University, Graphic Era Deemed to Be University, Uttarakhand, Dehradun 248007, India; Department of Computer Science and Applications, Seth Jai Parkash Mukand Lal Institute of Engineering and Technology, Haryana, Radaur, India

Non-native speech recognition is becoming more significant as interest in communicating across various languages has grown. However, because of the limited ability to include semantic-level contextual signals for non-native language, the resilience and performance of Automatic Speech Recognition (ASR) systems have been degraded. The objectives of this research include the collection and pre-processing of speech data, the training of an acoustic model using Gaussian Mixture Models (GMMs), the development of a Punjabi-specific language model, and the creation of a comprehensive pronunciation dictionary. The performance of the proposed ASR system has been evaluated using relevant measures that consider the unique characteristics of excellent non-native Punjabi speakers' speech. Furthermore, three distinct kinds of language models using a combined subspace Gaussian Mixture Modeling and Hidden Markov Modeling (sGMM-HMM) technique have been employed with the objective of tackling the issue of shared-state parameters due to a limited dataset. Overall, the sGMM-HMM system yielded a Word Error Rate (WER) of 16.41%, which appears respectable in comparison with the native-language ASR system. © 2025 The Author(s). Published by Elsevier B.V.

Keywords: Speech analysis
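
The abstract above reports its result as a Word Error Rate (WER). As a reminder of how that metric is conventionally computed, here is a minimal sketch: WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the example sentences are illustrative assumptions, not taken from the paper, and the paper's own scoring tool may normalize text differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
example = word_error_rate("the cat sat down", "the dog sat")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the denominator counts only reference words.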

Joint Design of Radar Receive Filter and Unimodular ISAC Waveform with Sidelobe Level Control

arXiv, 2025

Authors: Zhang, Kecheng; Liu, Ya-Feng; Wang, Zhongbin; Yuan, Weijie; Keskin, Musa Furkan; Wymeersch, Henk; Xia, Shuqiang. Affiliations: School of System Design and Intelligent Manufacturing, The Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen 518055, China; State Key Laboratory of Scientific and Engineering Computing, Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; Department of Electrical Engineering, Chalmers University of Technology, Gothenburg 41296, Sweden; ZTE Corporation, The State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China

Integrated sensing and communication (ISAC) has been considered a key feature of next-generation wireless networks. This paper investigates the joint design of the radar receive filter and dual-functional transmit waveform for the multiple-input multiple-output (MIMO) ISAC system. In addition to optimizing the mean square error (MSE) of the radar receive spatial response and maximizing the achievable rate at the communication receiver, subject to the constraints of a full-power radar receive filter and a unimodular transmit sequence, we control the maximum range sidelobe level, which is often overlooked in the existing ISAC waveform design literature, for better radar imaging performance. To solve the formulated optimization problem with convex and nonconvex constraints, we propose an inexact augmented Lagrangian method (ALM) algorithm. For each subproblem in the proposed inexact ALM algorithm, we custom-design a block successive upper-bound minimization (BSUM) scheme with closed-form solutions for all blocks of variables to enhance computational efficiency. Convergence analysis shows that the proposed algorithm is guaranteed to provide a stationary and feasible solution. Extensive simulations are performed to investigate the impact of different system parameters on communication and radar imaging performance. Comparison with existing works shows the superiority of the proposed algorithm. © 2025, CC BY-NC-SA.

Keywords: Mean square error

FIRST-SHOT UNSUPERVISED ANOMALOUS SOUND DETECTION WITH UNKNOWN ANOMALIES ESTIMATED BY METADATA-ASSISTED AUDIO GENERATION

arXiv, 2023

Authors: Zhang, Hejing; Zhu, Qiaoxi; Guan, Jian; Liu, Haohe; Xiao, Feiyang; Tian, Jiantong; Mei, Xinhao; Liu, Xubo; Wang, Wenwu. Affiliations: Group of Intelligent Signal Processing, College of Computer Science and Technology, Harbin Engineering University, Harbin, China; National Engineering Laboratory for Modeling and Emulation in E-Government, Harbin Engineering University, Harbin, China; Centre for Audio Acoustics and Vibration, University of Technology Sydney, Ultimo, Australia; Centre for Vision Speech and Signal Processing, University of Surrey, Guildford, United Kingdom

First-shot (FS) unsupervised anomalous sound detection (ASD) is a brand-new task introduced in DCASE 2023 Challenge Task 2, where the anomalous sounds for the target machine types are unseen in training. Existing methods often rely on the availability of normal and abnormal sound data from the target machines. However, due to the lack of anomalous sound data for the target machine types, it becomes challenging when adapting the existing ASD methods to the first-shot task. In this paper, we propose a new framework for the first-shot unsupervised ASD, where metadata-assisted audio generation is used to estimate unknown anomalies, by utilising the available machine information (i.e., metadata and sound data) to fine-tune a text-to-audio generation model for generating the anomalous sounds that contain unique acoustic characteristics accounting for each different machine type. We then use the method of Time-Weighted Frequency domain audio Representation with Gaussian Mixture Model (TWFR-GMM) as the backbone to achieve the first-shot unsupervised ASD. Our proposed FS-TWFR-GMM method achieves competitive performance amongst top systems in DCASE 2023 Challenge Task 2, while requiring only 1% model parameters for detection, as validated in our experiments. Copyright © 2023, The Authors. All rights reserved.

Keywords: Metadata

GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping

arXiv, 2023

Authors: Tang, Chao; Huang, Dehao; Ge, Wenqi; Liu, Weiyu; Zhang, Hong. Affiliations: Shenzhen Key Laboratory of Robotics and Computer Vision, Southern University of Science and Technology, Shenzhen, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China; Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, United States

Task-oriented grasping (TOG) refers to the problem of predicting grasps on an object that enable subsequent manipulation tasks. To model the complex relationships between objects, tasks, and grasps, existing methods incorporate semantic knowledge as priors into TOG pipelines. However, the existing semantic knowledge is typically constructed based on closed-world concept sets, restraining the generalization to novel concepts out of the pre-defined sets. To address this issue, we propose GraspGPT, a large language model (LLM) based TOG framework that leverages the open-end semantic knowledge from an LLM to achieve zero-shot generalization to novel concepts. We conduct experiments on Language Augmented TaskGrasp (LA-TaskGrasp) dataset and demonstrate that GraspGPT outperforms existing TOG methods on different held-out settings when generalizing to novel concepts out of the training set. The effectiveness of GraspGPT is further validated in real-robot experiments. Our code, data, appendix, and video are publicly available at https://***/view/graspgpt. Copyright © 2023, The Authors. All rights reserved.

Keywords: Zero-shot learning