Encoding constraints into neural networks is attractive. This paper studies how to introduce the popular positive linear satisfiability to neural networks. We propose the first differentiable satisfiability layer base...
Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn t...
Discontinuous constituency parsing is still under active development, as its efficiency and accuracy remain far behind those of its continuous counterparts. Motivated by the observation that a discontinuous constituent tree can be simply ...
While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness of text has received insufficient attention, especially for ETTS of artistic works. In this paper, we introdu...
Although diffusion models have become a popular choice in text-to-speech due to their strong generative ability, the intrinsic complexity of sampling from diffusion models harms their efficiency. Alternatively, we propose VoiceFlow, an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps. VoiceFlow formulates the generation of mel-spectrograms as an ordinary differential equation conditioned on text inputs, whose vector field is then estimated. The rectified flow technique then effectively straightens the sampling trajectory for efficient synthesis. Subjective and objective evaluations on both single- and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to its diffusion counterpart. Ablation studies further verified the validity of the rectified flow technique in VoiceFlow.
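To make the flow-matching idea concrete, here is a minimal, illustrative PyTorch sketch (not the VoiceFlow implementation; the estimator model(x_t, t, text_cond) and all names are assumptions). It trains a network to predict the straight-line vector field between noise and mel-spectrograms, and samples by Euler-integrating the learned ODE in a small number of steps.

```python
# Minimal sketch of conditional flow matching for mel generation (illustrative,
# not the VoiceFlow codebase). `model(x_t, t, text_cond)` is an assumed estimator.
import torch

def flow_matching_loss(model, mel, text_cond):
    """Train the model to predict the straight-line vector field x1 - x0."""
    x1 = mel                                   # data sample (B, n_mels, T)
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                # linear interpolation path
    target_v = x1 - x0                         # constant vector field along the path
    pred_v = model(x_t, t.squeeze(), text_cond)
    return torch.nn.functional.mse_loss(pred_v, target_v)

@torch.no_grad()
def sample(model, text_cond, shape, n_steps=10):
    """Euler integration of the learned ODE from noise to a mel-spectrogram."""
    x = torch.randn(shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((shape[0],), i * dt)
        x = x + model(x, t, text_cond) * dt
    return x
```

Rectification, in this sketch, would amount to regenerating training pairs from the first model's own noise-to-sample trajectories and retraining on them, which is what straightens the sampling path and allows so few integration steps.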
Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden of establishing correlations between tokens on the model, further complicating the modeling process. To address this issue, we propose acoustic BPE, which encodes frequent audio token patterns using byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax-capturing capabilities. In addition, we propose a novel rescoring method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments show that rescoring selection aligns closely with human preference, which highlights acoustic BPE's potential for other speech generation tasks.
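As a rough illustration of the idea (not the paper's implementation; all function names are made up), the following Python sketch learns BPE-style merges directly over discrete audio token IDs: it repeatedly finds the most frequent adjacent pair and replaces it with a new token ID, shortening the sequences.

```python
# Illustrative acoustic-BPE sketch: merge frequent adjacent audio-token pairs.
from collections import Counter

def most_frequent_pair(seqs):
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

def learn_acoustic_bpe(seqs, base_vocab_size, num_merges):
    """Return the learned merges and the shortened token sequences."""
    merges, next_id = [], base_vocab_size
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append((pair, next_id))
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs
```

Applying the learned merges in order to new token sequences would then yield the shortened acoustic BPE sequences used for language modeling.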
The auto-regressive (AR) architecture, exemplified by models such as GPT, is extensively utilized in modern Text-to-Speech (TTS) systems. However, it often leads to considerable inference delays, primarily due to the ...
Authors:
Junjie Li, Yiwei Guo, Xie Chen, Kai Yu (X-LANCE Lab)
Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
Zero-shot voice conversion (VC) aims to transfer the timbre of a source speaker to that of an arbitrary unseen target speaker, while keeping the linguistic content unchanged. Although the voice of the generated speech can be controlled by providing the speaker embedding of the target speaker, the speaker similarity still lags behind that of ground-truth recordings. In this paper, we propose SEF-VC, a speaker-embedding-free voice conversion model, which is designed to learn and incorporate speaker timbre from reference speech via a powerful position-agnostic cross-attention mechanism, and then reconstruct the waveform from HuBERT semantic tokens in a non-autoregressive manner. The concise design of SEF-VC enhances its training stability and voice conversion performance. Objective and subjective evaluations demonstrate the superiority of SEF-VC in generating high-quality speech with better similarity to the target reference than strong zero-shot VC baselines, even for very short reference speech.
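A minimal sketch of what a position-agnostic cross-attention module could look like in PyTorch is given below; the module and tensor names are illustrative assumptions, not the SEF-VC code. The content stream attends over reference frames as keys and values, and the reference deliberately carries no positional encoding, so timbre is aggregated regardless of where it occurs in the reference utterance.

```python
# Sketch of position-agnostic cross-attention for absorbing reference timbre.
import torch
import torch.nn as nn

class ReferenceCrossAttention(nn.Module):
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, content, reference):
        # content:   (B, T_content, dim) -- features from HuBERT semantic tokens
        # reference: (B, T_ref, dim)     -- reference speech frames, intentionally
        #                                   given no positional encoding
        timbre, _ = self.attn(query=content, key=reference, value=reference)
        return self.norm(content + timbre)
```

Because the reference is treated as an unordered set of frames, the same module works for reference speech of any length, including very short clips.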
Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on the analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the property that sounds can be mixed and concatenated in the time domain, we control details in four aspects: temporal relationship, loudness, speaker identity, and occurrence number, when simulating audio mixtures. The corresponding details are transformed into captions by large language models. Audio-text pairs with rich details in the text descriptions are thereby obtained. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning.
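The simulation step could look roughly like the toy Python sketch below (illustrative only; the parameter names and metadata schema are assumptions, not the paper's pipeline): two sound events are mixed with a controlled onset offset, relative loudness, and occurrence count, and the controlling metadata is kept so that a large language model can later verbalize it into a detailed caption.

```python
# Toy sketch of detail-controlled audio mixture simulation.
import numpy as np

def simulate_mixture(event_a, event_b, sr=16000, offset_s=1.0,
                     gain_db_b=-6.0, repeats_b=2):
    gain_b = 10 ** (gain_db_b / 20)
    b = np.concatenate([event_b * gain_b] * repeats_b)   # repeat event B
    offset = int(offset_s * sr)
    length = max(len(event_a), offset + len(b))
    mix = np.zeros(length)
    mix[:len(event_a)] += event_a
    mix[offset:offset + len(b)] += b                      # B starts after A
    metadata = {
        "temporal_relationship": f"B starts {offset_s:.1f}s after A",
        "relative_loudness": f"B is {abs(gain_db_b):.0f} dB quieter than A",
        "occurrence_number": {"A": 1, "B": repeats_b},
    }
    return mix, metadata
```

The metadata dictionary is what would be handed to the language model to produce a caption mentioning order, loudness, and how many times each event occurs.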