Search Results - Inner Mongolia University Library

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Han, Bing; Chen, Zhengyang; Qian, Yanmin (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

The mismatch between closed-set training and open-set testing usually leads to significant performance degradation in speaker verification. Among existing loss functions, metric learning-based objectives depend strongly on finding effective pairs, which may hinder further improvement, while popular multi-class classification methods often degrade when evaluated on unseen speakers. In this work, we introduce the SphereFace2 framework, which uses several binary classifiers to train the speaker model in a pair-wise manner instead of performing multi-class classification. This learning paradigm efficiently narrows the gap between training and evaluation. Experiments conducted on VoxCeleb show that SphereFace2 outperforms other existing loss functions, especially on hard trials. In addition, the large margin fine-tuning strategy proves compatible with it and yields further improvements. Finally, SphereFace2 is also strongly robust to class-wise noisy labels, which gives it the potential to be applied in semi-supervised training scenarios with inaccurately estimated pseudo labels. © 2023 IEEE.
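The idea of replacing one multi-class decision with several binary classifiers can be sketched as follows. This is a minimal illustration under stated assumptions, not the formulation from the paper: the function names, the cosine-similarity scoring, and the hyperparameters `scale`, `margin`, `bias`, and `lam` are all hypothetical choices made for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def binary_pairwise_loss(embedding, class_weights, target,
                         scale=30.0, margin=0.2, bias=0.0, lam=0.7):
    """One binary logistic loss per speaker class (all values hypothetical).

    The classifier of the target speaker is trained to say "yes";
    every other classifier is trained to say "no". No softmax couples
    the classes, so each decision resembles a verification trial.
    """
    loss = 0.0
    for k, weight in enumerate(class_weights):
        cos = cosine(embedding, weight)
        if k == target:
            # Positive classifier: penalize similarity below the margin.
            loss += lam * softplus(-(scale * (cos - margin) + bias))
        else:
            # Negative classifiers: penalize similarity above -margin.
            loss += (1.0 - lam) * softplus(scale * (cos + margin) + bias)
    return loss / scale


weights = [[1.0, 0.0], [0.0, 1.0]]  # two toy speaker prototypes
matched = binary_pairwise_loss([1.0, 0.0], weights, target=0)
mismatched = binary_pairwise_loss([1.0, 0.0], weights, target=1)
```

In this toy setup the loss for the matching speaker label is smaller than the loss for the mismatched label, which is the behavior a verification-style objective should show.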

Keywords: binary classification; large margin fine-tuning; speaker verification; SphereFace2

Robust Audio-Visual ASR with Unified Cross-Modal Attention

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Li, Jiahong; Li, Chenda; Wu, Yifei; Qian, Yanmin (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

Audio-visual speech recognition (AVSR) takes advantage of noise-invariant visual information to improve the robustness of automatic speech recognition (ASR) systems. While previous work mainly focused on clean conditions, we believe the visual modality is more effective in noisy environments. The challenges arise from the difficulty of adaptively fusing audio-visual information and from possible interference within the training data. In this paper, we present a new audio-visual speech recognition model with a unified cross-modal attention mechanism. In particular, the auxiliary visual evidence is combined with the acoustic features along the temporal dimension in a unified space before the deep encoding network. This method provides a flexible cross-modal context and requires no forced alignment, so the model can learn to leverage the audio-visual information in relevant frames. In experiments, the proposed model is shown to be robust to the potential absence of the visual modality and to misalignment between audio and visual frames. On the large-scale audio-visual dataset LRS3, our new model further reduces the state-of-the-art WER for clean utterances and significantly improves performance under noisy conditions. © 2023 IEEE.
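Combining the two modalities "along the temporal dimension" can be pictured with a small sketch. It assumes both streams have already been projected to one shared feature size, and it adds a per-modality tag vector so that a downstream attention encoder could tell the streams apart; the function name and the tag vectors are hypothetical and are not taken from the paper.

```python
def unify_along_time(audio_frames, visual_frames, audio_tag, visual_tag):
    """Concatenate two frame sequences along the time axis.

    Each frame is a list of floats of the same length as the tags.
    The result holds len(audio_frames) + len(visual_frames) frames, so
    the sequences need no frame-level alignment and may differ in length.
    """
    dim = len(audio_tag)
    if len(visual_tag) != dim:
        raise ValueError("modality tags must share one dimension")
    for frame in [*audio_frames, *visual_frames]:
        if len(frame) != dim:
            raise ValueError("frames must be projected to the shared dimension")
    # Hypothetical modality tags, added element-wise to every frame.
    tagged_audio = [[x + t for x, t in zip(f, audio_tag)] for f in audio_frames]
    tagged_visual = [[x + t for x, t in zip(f, visual_tag)] for f in visual_frames]
    return tagged_audio + tagged_visual


audio = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # 3 audio frames
visual = [[1.0, 1.0], [1.0, 1.0]]              # 2 visual frames
unified = unify_along_time(audio, visual, [0.5, 0.0], [0.0, 0.5])
audio_only = unify_along_time(audio, [], [0.5, 0.0], [0.0, 0.5])
```

Because the visual frames are simply appended, passing an empty visual sequence still yields a valid input, which mirrors the missing-modality case mentioned in the abstract.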

Keywords: Speech recognition

DiffVoice: Text-to-Speech with Latent Diffusion

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Liu, Zhijun; Guo, Yiwei; Yu, Kai (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

In this work, we present DiffVoice, a novel text-to-speech model based on latent diffusion. We propose to first encode speech signals into a phoneme-rate latent representation with a variational autoencoder enhanced by adversarial training, and then jointly model the duration and the latent representation with a diffusion model. Subjective evaluations on the LJSpeech and LibriTTS datasets demonstrate that our method beats the best publicly available systems in naturalness. By adopting recent generative inverse problem solving algorithms for diffusion models, DiffVoice achieves state-of-the-art performance in text-based speech editing and zero-shot adaptation. © 2023 IEEE.

Keywords: diffusion probabilistic model; speech editing; speech synthesis; variational autoencoder; zero-shot adaptation

A Civil Aviation Customer Service Ontology and Its Applications

Data Intelligence, 2023, Vol. 5, No. 4, pp. 1063-1081

Authors: Meixiang Lv; Xudong Cao; Tianxing Wu; Yuehua Li (School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Zhejiang Lab, Hangzhou, China)

In the process of developing the C919 large aircraft customer service intelligence system, we find that heterogeneous and incomplete data cause the inefficient and inaccurate decision ***, to solve this problem, we propose to introduce the idea of ontology modeling and reasoning into competitive intelligence system building in this *** first present the building principles and methods of the civil aviation customer service *** then define the classes and properties to contribute a real-world civil aviation customer service ontology, which is published on the Web (http:/***/dataset/cacso). We finally design SWRL rules corresponding to different intelligence analysis targets to support reasoning in our designed competitive intelligence system.

Keywords: Ontology Building intelligence Service Civil Aviation Customer Service

ALPSolver: A Solver for Assumable Logic Programming

5th International Conference on Intelligent Computing and Human-computer Interaction, ICHCI 2024

Authors: Zhang, Zhizheng; Chen, Jiayi; Tian, Huangdezhong (School of Computer Science and Engineering, Southeast University, Nanjing, China; Key Lab. of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, China)

ISBN: (print) 9798350368284

Assumable Logic Programming (ALP), an extension of Answer Set Programming (ASP), has been theoretically demonstrated to possess significant advantages in addressing problems involving incomplete information. ALP solvers are therefore urgently needed to facilitate further research and applications. This paper proposes a solving algorithm named Answer Set Based View Search (ASBVS), together with an optimization of it, to compute the results of ALP programs. Based on this algorithm, ALPSolver has been implemented and integrated into an online platform for public use. This paper experimentally validates the correctness of the solving algorithm and confirms the advantages of ALP in handling default information, indirect exceptions, and abductive reasoning. Additionally, the experimental results demonstrate the effectiveness of the optimization algorithm, with a notable increase in efficiency as the problem scale increases. © 2024 IEEE.

Keywords: Inductive logic programming (ILP)

MetaSTC: A Backbone Agnostic Spatio-Temporal Framework for Traffic Forecasting

24th IEEE International Conference on Data Mining, ICDM 2024

Authors: Xu, Kexin; Yu, Zhemeng; Gao, Yucen; Zhang, Songjian; Fang, Jun; Gao, Xiaofeng; Chen, Guihai (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai, China; Didi Chuxing Technology Co., Beijing, China)

ISBN: (print) 9798331506681

Traffic flow prediction is a critical issue in transportation engineering and presents distinct challenges when handling large-scale real-world datasets. Existing complex spatio-temporal forecasting paradigms use the same parameters to fit traffic sequences with varying spatio-temporal features, and tend to train a model with average performance across different time series. This approach greatly reduces their accuracy when applied to larger road networks. Moreover, the significant differences in traffic data distribution from one city to another also pose great challenges: the same model may be excellent for one city and mediocre when applied to another. To this end, we propose MetaSTC, a Meta Backbone Agnostic Spatio-Temporal Clustering Framework for Traffic Forecasting on Large-Scale Road Networks. We tackle the disparities in the spatio-temporal features of traffic flow through a spatio-temporal clustering-based strategy. We design a meta-learner for large-scale road networks that dynamically extracts the information shared across roads in the same sub-task. In this way, the framework can represent task-specific details with a simpler model and make quick and accurate predictions. Our paradigm is backbone-agnostic and can be combined with different traffic prediction models, solving the problem caused by differences in data distribution. Extensive experiments conducted on real-world traffic datasets demonstrate the high accuracy and computational efficiency of our model compared with SOTA approaches. © 2024 IEEE.

Keywords: Large datasets

Adaptive Large Margin Fine-Tuning For Robust Speaker Verification

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Zhang, Leying; Chen, Zhengyang; Qian, Yanmin (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

Large margin fine-tuning (LMFT) is an effective strategy for improving the performance of speaker verification systems and is widely used in speaker verification challenge systems. Because a large margin in the loss function can make the training task too difficult, longer training segments are usually used in LMFT to alleviate this problem. However, the LMFT model can then have a duration mismatch with real verification scenarios, where the verification speech may be very short. In our experiments, we also find that LMFT fails in short-duration and other verification scenarios. To solve this problem, we propose the duration-based and similarity-based adaptive large margin fine-tuning (ALMFT) strategy. To verify its effectiveness, we constructed fixed-length, variable-length, and asymmetric verification trials based on VoxCeleb1. Experimental results demonstrate that the ALMFT algorithms are effective and robust: they not only achieve improvements comparable to LMFT on the official VoxCeleb evaluation trials but also overcome the performance degradation in short-duration and asymmetric scenarios. © 2023 IEEE.

Keywords: asymmetric scenario; duration mismatch; large margin fine-tuning; speaker verification

Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Gong, Xun; Wang, Wei; Shao, Hang; Chen, Xie; Qian, Yanmin (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

End-to-end (E2E) automatic speech recognition (ASR) systems have gained popularity given their simplified architecture and promising results. However, text-only domain adaptation remains a big challenge for E2E systems. Text-to-speech (TTS) based approaches fine-tune ASR models on speech synthesized by an auxiliary TTS model, thus increasing deployment costs. Language model (LM) fusion based approaches can achieve good performance but are sensitive to interpolation parameters. In order to factorize out the language component in the AED model, we propose the factorized attention-based encoder-decoder (Factorized AED) model, whose decoder takes as input the posterior probabilities of a jointly trained LM. Moreover, in the context of domain adaptation, the domain-specific LM serves as a plug-and-play component for a well-trained factorized AED model. In-domain experiments on LibriSpeech and out-of-domain experiments adapting from LibriSpeech to a variety of domains in GigaSpeech are conducted to validate the effectiveness of our proposed methods. Results show 20% / 24% relative word error rate (WER) reduction on the LibriSpeech test sets and 8-34% relative WER reduction on the test sets of 8 GigaSpeech target domains compared to the AED baseline. © 2023 IEEE.
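For readers unfamiliar with the metric, a relative WER reduction is the drop in WER expressed as a fraction of the baseline WER. The short sketch below shows the arithmetic; the function name and the example WER values are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return (baseline - new) / baseline as a fraction.

    Both arguments must use the same unit (for example, percent).
    """
    if baseline_wer <= 0.0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers: a baseline WER of 10.0% lowered to 8.0%.
reduction = relative_wer_reduction(10.0, 8.0)
print(f"{reduction:.0%} relative WER reduction")
```

Note that a relative figure says nothing about the absolute WER: the same 20% relative reduction could describe a drop from 10.0% to 8.0% or from 5.0% to 4.0%.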

Keywords: domain adaptation; end-to-end speech recognition; factorized AED; text-only

LongFNT: Long-Form Speech Recognition with Factorized Neural Transducer

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Gong, Xun; Wu, Yu; Li, Jinyu; Liu, Shujie; Zhao, Rui; Chen, Xie; Qian, Yanmin (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, China; Microsoft)

ISBN: (print) 9781728163277

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances and do not consider long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history in a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder, RoBERTa, to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input, and it achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively. © 2023 IEEE.

Keywords: Speech recognition

Emodiff: Intensity Controllable Emotional Text-to-Speech with Soft-label Guidance

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Guo, Yiwei; Du, Chenpeng; Chen, Xie; Yu, Kai (Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai, China)

ISBN: (print) 9781728163277

Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model in which emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label in which the values of the specified emotion and Neutral are set to α and 1 - α, respectively. Here α represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with a specified emotion intensity can be generated by sampling in the reverse denoising process. © 2023 IEEE.
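The soft label described in the abstract can be illustrated with a few lines of code. This is a sketch under stated assumptions rather than the authors' implementation: the emotion inventory `EMOTIONS`, the function names, and the use of a weighted sum of classifier log-probabilities as the guidance objective are hypothetical choices made for the example.

```python
EMOTIONS = ["Neutral", "Happy", "Sad", "Angry"]  # hypothetical label set


def soft_label(emotion: str, alpha: float, emotions=EMOTIONS):
    """Put weight alpha on `emotion` and 1 - alpha on Neutral."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if emotion not in emotions or "Neutral" not in emotions:
        raise ValueError("unknown emotion or missing Neutral class")
    label = [0.0] * len(emotions)
    label[emotions.index("Neutral")] += 1.0 - alpha
    label[emotions.index(emotion)] += alpha
    return label


def guidance_objective(log_probs, label):
    """Soft-label-weighted sum of classifier log-probabilities.

    Assumption: in classifier guidance, the gradient of a quantity like
    this with respect to the noisy sample would steer the sampler.
    """
    return sum(w * lp for w, lp in zip(label, log_probs) if w > 0.0)


half_happy = soft_label("Happy", 0.5)
fully_happy = soft_label("Happy", 1.0)
no_emotion = soft_label("Happy", 0.0)
```

With α = 1 the soft label reduces to the usual one-hot vector for the chosen emotion, and with α = 0 it reduces to the one-hot vector for Neutral, so intermediate values interpolate between the two.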

Keywords: classifier guidance; de-noising diffusion models; emotion intensity control; emotional TTS
