Search Results - Inner Mongolia University Library

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI

2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022

Authors: Sun, Liangtai; Chen, Xingyu; Chen, Lu; Dai, Tianle; Zhu, Zichen; Yu, Kai (X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China)

Task-oriented dialogue (TOD) systems have been widely used by mobile phone intelligent assistants to accomplish tasks such as calendar scheduling or hotel reservation. Current TOD systems usually focus on multi-turn text/speech interaction and then call back-end APIs designed for TODs to perform the task. However, this API-based architecture greatly limits the information-searching capability of intelligent assistants and may even lead to task failure if TOD-specific APIs are not available or the task is too complicated to be executed by the provided APIs. In this paper, we propose a new TOD architecture: the GUI-based task-oriented dialogue system (GUI-TOD). A GUI-TOD system can directly perform GUI operations on real apps and execute tasks without invoking TOD-specific back-end APIs. Furthermore, we release META-GUI, a dataset for training a Multi-modal convErsaTional Agent on mobile GUI. We also propose a multi-model action prediction and response model, which shows promising results on META-GUI. The dataset, code, and leaderboard are publicly available. © 2022 Association for Computational Linguistics.

Keywords: Graphical user interfaces

LinSATNet: The Positive Linear Satisfiability Neural Networks

arXiv, 2024

Authors: Wang, Runzhong; Zhang, Yunhao; Guo, Ziao; Chen, Tianyi; Yang, Xiaokang; Yan, Junchi (Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, China; Shanghai AI Laboratory, China)

Encoding constraints into neural networks is attractive. This paper studies how to introduce the popular positive linear satisfiability to neural networks. We propose the first differentiable satisfiability layer based on an extension of the classic Sinkhorn algorithm for jointly encoding multiple sets of marginal distributions. We further theoretically characterize the convergence property of the Sinkhorn algorithm for multiple marginals. In contrast to sequential decision solvers, e.g. reinforcement learning-based ones, we showcase our technique in solving constrained (specifically satisfiability) problems by one-shot neural networks, including i) a neural routing solver learned without supervision of optimal solutions; ii) a partial graph matching network handling graphs with unmatchable outliers on both sides; iii) a predictive network for financial portfolios with continuous constraints. To our knowledge, there exists no one-shot neural solver for these scenarios when they are formulated as satisfiability problems. Source code is available at https://***/Thinklab-SJTU/LinSATNet. © 2024, CC BY.
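For readers unfamiliar with the classic Sinkhorn algorithm that the abstract builds on, the following is a minimal sketch of the standard version only: it alternately normalizes the rows and columns of a positive matrix until both sum to 1. It does not implement the paper's extension to multiple sets of marginals, and it is not the LinSATNet API; the function name `sinkhorn` and its parameters are illustrative assumptions.

```python
def sinkhorn(scores, n_iters=50, eps=1e-9):
    """Classic Sinkhorn normalization of a strictly positive matrix.

    Illustrative sketch only: single set of marginals (all ones),
    not the multi-marginal extension described in the abstract.
    """
    # Copy the input so the caller's matrix is not modified.
    m = [[float(v) for v in row] for row in scores]
    for _ in range(n_iters):
        # Row step: scale every row so that it sums to 1.
        m = [[v / (sum(row) + eps) for v in row] for row in m]
        # Column step: scale every column so that it sums to 1.
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / (c + eps) for v, c in zip(row, col_sums)] for row in m]
    return m


# Toy input: any matrix with strictly positive entries works.
doubly_stochastic = sinkhorn([[1.0, 2.0], [3.0, 4.0]])
```

Every step is a division by a sum, so the whole procedure is differentiable with respect to the input scores, which is the property that makes Sinkhorn-style layers usable inside neural networks.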

Keywords: Encoding (symbols)

LoRA-Whisper: Parameter-Efficient and Extensible Multilingual ASR

arXiv, 2024

Authors: Song, Zheshu; Zhuo, Jianheng; Yang, Yifan; Ma, Ziyang; Zhang, Shixiong; Chen, Xie (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Tencent AI Lab, United States)

Recent years have witnessed significant progress in multilingual automatic speech recognition (ASR), driven by the emergence of end-to-end (E2E) models and the scaling of multilingual datasets. Despite this, two main challenges persist in multilingual ASR: language interference and the incorporation of new languages without degrading the performance of the existing ones. This paper proposes LoRA-Whisper, which incorporates LoRA matrices into Whisper for multilingual ASR, effectively mitigating language interference. Furthermore, by leveraging LoRA and the similarities between languages, we can achieve better performance on new languages while upholding consistent performance on the original ones. Experiments on a real-world task across eight languages demonstrate that our proposed LoRA-Whisper yields relative gains of 18.5% and 23.0% over the baseline system for multilingual ASR and language expansion, respectively. © 2024, CC BY.
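The general LoRA idea behind the abstract can be sketched as a frozen base weight W plus a trainable low-rank update, y = W x + scale * B (A x). The sketch below is a toy illustration in plain Python, not the LoRA-Whisper implementation. The abstract does not say how adapters are assigned, so keeping one adapter per language is an assumption made here for illustration, and all names and values (`lora_forward`, `adapters`, `lang_a`, `lang_b`) are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(base_weight, lora_a, lora_b, x, scale=1.0):
    """Compute y = W x + scale * B (A x).

    W is treated as frozen; in training only A and B would be updated.
    """
    base_out = matvec(base_weight, x)
    low_rank_out = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base_out, low_rank_out)]


# Frozen 2x2 base weight shared by all languages (toy values).
W = [[1.0, 0.0], [0.0, 1.0]]

# Assumption: one rank-1 adapter per language. A is 1x2, B is 2x1.
adapters = {
    "lang_a": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "lang_b": ([[1.0, -1.0]], [[0.0], [0.5]]),
}


def forward_for_language(lang, x):
    """Select the adapter for `lang` and apply it on top of the shared W."""
    lora_a, lora_b = adapters[lang]
    return lora_forward(W, lora_a, lora_b, x)


out_a = forward_for_language("lang_a", [2.0, 4.0])  # [5.0, 4.0]
out_b = forward_for_language("lang_b", [2.0, 4.0])  # [2.0, 3.0]
```

Under this assumed layout, adding a language only adds a new entry to `adapters`; the shared weight and the existing adapters are left untouched, which is one way the outputs for existing languages could stay unchanged.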

Keywords: Expansion

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

arXiv, 2024

Authors: Liu, Sen; Guo, Yiwei; Chen, Xie; Yu, Kai (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China)

While acoustic expressiveness has long been studied in expressive text-to-speech (ETTS), the inherent expressiveness in text lacks sufficient attention, especially for ETTS of artistic works. In this paper, we introduce StoryTTS, a highly expressive TTS dataset that is rich in both acoustic and textual expressiveness, drawn from the recording of a Mandarin storytelling show. A systematic and comprehensive labeling framework is proposed for textual expressiveness. We analyze and define speech-related textual expressiveness in StoryTTS to include five distinct dimensions through linguistics, rhetoric, etc. Then we employ large language models and prompt them with a few manual annotation examples for batch annotation. The resulting corpus contains 61 hours of consecutive and highly prosodic speech equipped with accurate text transcriptions and rich textual expressiveness annotations. Therefore, StoryTTS can aid future ETTS research to fully mine the abundant intrinsic textual and acoustic features. Experiments are conducted to validate that TTS models can generate speech with improved expressiveness when integrated with the annotated textual labels in StoryTTS. Copyright © 2024, The Authors. All rights reserved.

Keywords: Large datasets

StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Sen Liu; Yiwei Guo; Xie Chen; Kai Yu (Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China)

Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance

arXiv, 2024

Authors: Zhang, Yaoyun; Xu, Xuenan; Wu, Mengyue (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China)

The video-to-audio (V2A) generation task has drawn attention in the field of multimedia due to its practicality in producing Foley sound. Semantic and temporal conditions are fed to the generation model to indicate sound events and temporal occurrence. Recent studies on synthesizing immersive and synchronized audio face challenges on videos with moving visual presence. The temporal condition is not accurate enough, while the low-resolution semantic condition exacerbates the problem. To tackle these challenges, we propose Smooth-Foley, a V2A generative model taking semantic guidance from the textual label across the generation to enhance both semantic and temporal alignment in audio. Two adapters are trained to leverage pre-trained text-to-audio generation models. A frame adapter integrates high-resolution frame-wise video features while a temporal adapter integrates temporal conditions obtained from similarities of visual frames and textual labels. The incorporation of semantic guidance from textual labels achieves precise audio-video alignment. We conduct extensive quantitative and qualitative experiments. Results show that Smooth-Foley performs better than existing models on both continuous sound scenarios and general scenarios. With semantic guidance, the audio generated by Smooth-Foley exhibits higher quality and better adherence to physical laws. © 2024, CC BY-NC-SA.

Keywords: Semantics

Acoustic BPE for Speech Generation with Discrete Tokens

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Feiyu Shen; Yiwei Guo; Chenpeng Du; Xie Chen; Kai Yu (Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China)

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, the current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of the token sequence. Additionally, this approach places the burden on the model to establish correlations between tokens, further complicating the modeling process. To address this issue, we propose acoustic BPE, which encodes frequent audio token patterns by utilizing byte-pair encoding. Acoustic BPE effectively reduces the sequence length and leverages the prior morphological information present in the token sequence, which alleviates the modeling challenges of token correlation. Through comprehensive investigations on a speech language model trained with acoustic BPE, we confirm the notable advantages it offers, including faster inference and improved syntax capturing capabilities. In addition, we propose a novel rescore method to select the optimal synthetic speech among multiple candidates generated by a rich-diversity TTS system. Experiments prove that rescore selection aligns closely with human preference, which highlights acoustic BPE’s potential for other speech generation tasks.
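The core mechanism named in the abstract, byte-pair encoding applied to discrete token IDs rather than characters, can be sketched as follows. This is a generic BPE illustration in plain Python under simplifying assumptions (a single toy sequence, merges learned greedily on that sequence); it is not the authors' code, and the function names and the starting ID `100` for merged tokens are hypothetical.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair of token IDs, or None."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seq, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (encoded, merges)."""
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        seq = merge_pair(seq, pair, new_id)
        merges.append((pair, new_id))
    return seq, merges


# Toy sequence of discrete audio token IDs (made-up values).
tokens = [7, 3, 7, 3, 5, 7, 3, 5]
encoded, merges = learn_bpe(tokens, num_merges=2, first_new_id=100)
# encoded is [100, 101, 101]: 8 tokens shortened to 3
```

Each merge replaces a frequent pair with a single new ID, so the sequence gets shorter while recurring patterns become single units, which is the length reduction the abstract refers to.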

A Detailed Audio-Text Data Simulation Pipeline Using Single-Event Sounds

arXiv, 2024

Authors: Xu, Xuenan; Xu, Xiaohang; Xie, Zeyu; Zhang, Pingyue; Wu, Mengyue; Yu, Kai (MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China)

Recently, there has been an increasing focus on audio-text cross-modal learning. However, most of the existing audio-text datasets contain only simple descriptions of sound events. Compared with classification labels, the advantages of such descriptions are significantly limited. In this paper, we first analyze the detailed information that human descriptions of audio may contain beyond sound event labels. Based on the analysis, we propose an automatic pipeline for curating audio-text pairs with rich details. Leveraging the property that sounds can be mixed and concatenated in the time domain, we control details in four aspects: temporal relationship, loudness, speaker identity, and occurrence number, in simulating audio mixtures. Corresponding details are transformed into captions by large language models. Audio-text pairs with rich details in text descriptions are thereby obtained. We validate the effectiveness of our pipeline with a small amount of simulated data, demonstrating that the simulated data enables models to learn detailed audio captioning. Copyright © 2024, The Authors. All rights reserved.

Keywords: Pipelines

Enhancing Audio Generation Diversity with Visual Information

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zeyu Xie; Baihan Li; Xuenan Xu; Mengyue Wu; Kai Yu (Department of Computer Science and Engineering, AI Institute, MoE Key Lab of Artificial Intelligence, X-LANCE Lab, Shanghai Jiao Tong University, Shanghai, China)

Audio and sound generation has garnered significant attention in recent years, with a primary focus on improving the quality of generated audio. However, there has been limited research on enhancing the diversity of generated audio, particularly when it comes to audio generation within specific categories. Current models tend to produce homogeneous audio samples within a category. This work aims to address this limitation by improving the diversity of generated audio with visual information. We propose a clustering-based method, leveraging visual information to guide the model in generating distinct audio content within each category. Results on seven categories indicate that extra visual input can largely enhance audio generation diversity. Audio samples are available at DemoWeb.
