Search Results - Inner Mongolia University Library

2024 Spoken Language Technology Workshop

Authors: Zhang, Xueyao; Xue, Liumeng; Gu, Yicheng; Wang, Yuancheng; Li, Jiaqi; He, Haorui; Wang, Chaoren; Liu, Songting; Chen, Xi; Zhang, Junan; Fang, Zihao; Chen, Haopeng; Tang, Tze Ying; Zou, Lexiao; Wang, Mingxuan; Han, Jun; Chen, Kai; Li, Haizhou; Wu, Zhizheng

Affiliations: Chinese Univ Hong Kong, Shenzhen, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Shenzhen Research Inst Big Data, Shenzhen, Peoples R China

ISBN: (print) 9798350392265; 9798350392258

Amphion is an open-source toolkit for Audio, Music, and Speech Generation that aims to ease the way for junior researchers and engineers into these fields. It presents a unified framework that covers diverse generation tasks and models and is easy to extend with new ones. The toolkit is designed with beginner-friendly workflows and pre-trained models, allowing both beginners and seasoned researchers to kick-start their projects with relative ease. The initial release, Amphion v0.1, supports a range of tasks including Text to Speech (TTS), Text to Audio (TTA), and Singing Voice Conversion (SVC), supplemented by essential components such as data preprocessing, state-of-the-art vocoders, and evaluation metrics. This paper presents a high-level overview of Amphion. Amphion is open-sourced at https://***/open-mmlab/Amphion.

Keywords: Speech generation; audio generation; music generation; vocoder; open-source software; audio toolkit

ControlVideo: conditional control for one-shot text-driven video editing and beyond

Science China (Information Sciences), 2025, vol. 68, no. 3, pp. 150-162

Authors: Min ZHAO; Rongzhen WANG; Fan BAO; Chongxuan LI; Jun ZHU

Affiliations: Department of Computer Science and Technology, Institute for AI, Tsinghua-Bosch Joint ML Center, Tsinghua Laboratory of Brain and Intelligence Lab, Tsinghua University; ShengShu Technology; Gaoling School of Artificial Intelligence, Renmin University of China; Beijing Key Laboratory of Big Data Management and Analysis Methods; Pazhou Laboratory (Huangpu)

This paper presents ControlVideo for text-driven video editing, that is, generating a video that aligns with a given text while preserving the structure of the source video. Building on a pre-trained text-to-image diffusion model, ControlVideo enhances fidelity and temporal consistency by incorporating additional conditions (such as edge maps) and fine-tuning the key-frame and temporal attention on the source video-text pair, following an in-depth exploration of the design space. Extensive experimental results demonstrate that ControlVideo outperforms various competitive baselines by delivering videos that exhibit high fidelity w.r.t. the source content and temporal consistency, all while aligning with the text. By incorporating low-rank adaptation layers into the model before training, ControlVideo is further empowered to generate videos that align seamlessly with reference images. More importantly, ControlVideo can be readily extended to the more challenging task of long video editing (e.g., with hundreds of frames), where maintaining long-range temporal consistency is crucial. To achieve this, we propose to construct a fused ControlVideo by applying basic ControlVideo to overlapping short video segments and key frame videos and then merging them by pre-defined weight functions. Empirical results validate its capability to create videos across 140 frames, which is approximately 5.83 to 17.5 times more than what previous studies achieved. The code is available at https://***/thu-ml/controlvideo.
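The merging step mentioned in the abstract (overlapping short segments combined by pre-defined weight functions) can be sketched generically as a weighted average over the frames where segments overlap. This is a minimal illustration only: the function name `merge_overlapping_segments`, the triangular weight, and the use of one scalar per frame are assumptions made here, not the weight functions or data layout of the ControlVideo code, and the key-frame videos mentioned in the abstract are not modelled.

```python
def merge_overlapping_segments(segments, starts, total_frames):
    """Blend overlapping per-frame values into one sequence.

    segments: list of lists, one value per frame (a scalar stands in for a frame).
    starts:   index of the first frame of each segment in the full video.
    Assumption: a triangular weight that peaks at the segment centre.
    """
    weighted_sum = [0.0] * total_frames
    weight_total = [0.0] * total_frames
    for segment, start in zip(segments, starts):
        length = len(segment)
        for offset, value in enumerate(segment):
            # Weight grows toward the middle of the segment and is never zero.
            weight = min(offset + 1, length - offset)
            weighted_sum[start + offset] += weight * value
            weight_total[start + offset] += weight
    # Frames not covered by any segment are left at 0.0.
    return [s / w if w > 0 else 0.0 for s, w in zip(weighted_sum, weight_total)]


# Two 4-frame segments that overlap on frames 2 and 3 of a 6-frame video.
merged = merge_overlapping_segments([[1, 1, 1, 1], [3, 3, 3, 3]], [0, 2], 6)
```

In the overlap, each merged frame leans toward the segment whose centre is closer, which is one simple way to avoid a visible seam between segments.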

Keywords: diffusion models; controllable generation; text-driven editing; video editing; long video editing

Transformer-Based Visual Segmentation: A Survey

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, vol. 46, no. 12, pp. 10138-10163

Authors: Li, Xiangtai; Ding, Henghui; Yuan, Haobo; Zhang, Wenwei; Pang, Jiangmiao; Cheng, Guangliang; Chen, Kai; Liu, Ziwei; Loy, Chen Change

Affiliations: Nanyang Technol Univ, S Lab, Singapore 639798, Singapore; Fudan Univ, Inst Big Data, Shanghai 200437, Peoples R China; Shanghai AI Lab, Shanghai 200240, Peoples R China; Univ Liverpool, Liverpool L69 7ZX, Merseyside, England

Visual segmentation seeks to partition images, video frames, or point clouds into multiple segments or groups. This technique has numerous real-world applications, such as autonomous driving, image editing, robot sensing, and medical analysis. Over the past decade, deep learning-based methods have made remarkable strides in this area. Recently, transformers, a type of neural network based on self-attention originally designed for natural language processing, have considerably surpassed previous convolutional or recurrent approaches in various vision processing tasks. Specifically, vision transformers offer robust, unified, and even simpler solutions for various segmentation tasks. This survey provides a thorough overview of transformer-based visual segmentation, summarizing recent advancements. We first review the background, encompassing problem definitions, datasets, and prior convolutional methods. Next, we summarize a meta-architecture that unifies all recent transformer-based approaches. Based on this meta-architecture, we examine various method designs, including modifications to the meta-architecture and associated applications. We also present several specific subfields, including 3D point cloud segmentation, foundation model tuning, domain-aware segmentation, efficient segmentation, and medical segmentation. Additionally, we compile and re-evaluate the reviewed methods on several well-established datasets. Finally, we identify open challenges in this field and propose directions for future research.

Keywords: Image segmentation; Transformers; Surveys; Task analysis; Measurement; Object detection; Visualization; Vision transformer; review; dense prediction; image segmentation; video segmentation; scene understanding

Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

14th International Symposium on Chinese Spoken Language Processing

Authors: Dong, Lukuan; Qin, Donghong; Bai, Fengbo; Song, Fanhua; Liu, Yan; Xu, Chen; Ou, Zhijian

Affiliations: Guangxi Minzu Univ, Sch Artificial Intelligence, AI & Big Data Int Cooperat Joint Lab, Nanning, Peoples R China; Tsinghua Univ, Speech Proc & Machine Intelligence SPMI Lab, Beijing, Peoples R China

ISBN: (print) 9798331516833; 9798331516826

Mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on three recently released backbone models pretrained over the 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision achieves better results than subword supervision and self-supervision, thereby providing higher data-efficiency. In particular, the Whistle models, i.e., those obtained by weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.

Keywords: speech recognition; Iu Mien language; low-resourced

USED: Universal Speaker Extraction and Diarization

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2025, vol. 33, pp. 96-110

Authors: Ao, Junyi; Yıldırım, Mehmet Sinan; Tao, Ruijie; Ge, Meng; Wang, Shuai; Qian, Yanmin; Li, Haizhou

Affiliations: Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China; Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore; Natl Univ Singapore, Saw Swee Hock Sch Publ Hlth, Singapore 117549, Singapore; Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China; Shanghai Jiao Tong Univ, Auditory Cognit & Computat Acoust Lab, AI Inst, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China; Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China

Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating 'who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to have knowledge about 'who spoke what and when', which is captured by the two tasks. The two tasks share a similar objective of disentangling speakers. Speaker extraction operates in the frequency domain, whereas diarization operates in the temporal domain. It is logical to believe that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified model called Universal Speaker Extraction and Diarization (USED) to address output inconsistency and scenario mismatch issues. It is designed to manage speech mixtures with varying overlap ratios and a variable number of speakers. We show that the USED model significantly outperforms the competitive baselines for speaker extraction and diarization tasks on LibriMix and SparseLibriMix datasets. We further validate the diarization performance on CALLHOME, a dataset based on real recordings, and experimental results indicate that our model surpasses recently proposed approaches.

Keywords: Speech recognition; data mining; Training; Multitasking; Speech enhancement; Time-domain analysis; Recording; Predictive models; Particle separators; Oral communication; Speaker extraction; speaker diarization; multi-talker scenario; LibriMix; CALLHOME

A survey on cross-user federated recommendation

Science China (Information Sciences), 2025, vol. 68, no. 4, pp. 7-32

Authors: Enyue YANG; Yudi XIONG; Wei YUAN; Weike PAN; Qiang YANG; Zhong MING

Affiliations: College of Computer Science and Software Engineering, Shenzhen University; School of Electrical Engineering and Computer Science, The University of Queensland; WeBank AI Lab, WeBank; Department of Computer Science and Engineering, Hong Kong University of Science and Technology; College of Big Data and Internet, Shenzhen Technology University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)

Recommender systems are effective in mitigating information overload, yet the centralized storage of user data raises significant privacy concerns. Cross-user federated recommendation (CUFR) provides a promising distributed paradigm to address these concerns by enabling privacy-preserving recommendations directly on user devices. In this survey, we review and categorize current progress in CUFR, focusing on four key aspects: privacy, security, accuracy, and efficiency. Firstly, we conduct an in-depth privacy analysis, discuss various cases of privacy leakage, and then review recent methods for privacy protection. Secondly, we analyze security concerns and review recent methods for untargeted and targeted attacks. For untargeted attack methods, we categorize them into data poisoning attack methods and parameter poisoning attack methods. For targeted attack methods, we categorize them into user-based methods and item-based methods. Thirdly, we provide an overview of the federated variants of some representative methods, and then review the recent methods for improving accuracy from two categories: data heterogeneity and high-order information. Fourthly, we review recent methods for improving training efficiency from two categories: client sampling and model compression. Finally, we conclude this survey and explore some potential future research topics in CUFR.

Keywords: cross-user federated recommendation; federated recommendation; federated learning; recommender systems; user privacy

SaliencyMix+: Noise-Minimized Image Mixing Method With Saliency Map in Data Augmentation

IEEE ACCESS, 2025, vol. 13, pp. 21734-21743

Authors: Lee, Hajeong; Jin, Zhixiong; Woo, Jiyoung; Noh, Byeongjoon

Affiliations: Soonchunhyang Univ, Dept AI & Big Data, Asan 31538, South Korea; Univ Gustave Eiffel, ENTPE, LICIT ECO7, F-69500 Lyon, France; Ecole Polytech Fed Lausanne EPFL, Urban Transport Syst Lab LUTS, CH-1015 Ecublens, Switzerland

Data augmentation is vital in deep learning for enhancing model robustness by artificially expanding training datasets. However, advanced methods like CutMix blend images and assign labels based on pixel ratios, often introducing label noise by neglecting the significance of blended regions, and SaliencyMix applies uniform patch generation across a batch, resulting in suboptimal augmentation. This paper introduces SaliencyMix+, a novel data augmentation technique that enhances the performance of deep-learning models using saliency maps for image mixing and label generation. It identifies critical patch coordinates in batch images and refines label generation based on target object proportions, reducing label noise. Experiments on CIFAR-100 and Oxford-IIIT Pet datasets show that SaliencyMix+ consistently outperforms CutMix and SaliencyMix, achieving the lowest Top-1 errors of 24.95% and 34.89%, and Top-5 errors of 7.00% and 12.13% on CIFAR-100 and Oxford-IIIT Pet, respectively. These findings highlight the effectiveness of SaliencyMix+ in boosting model accuracy and robustness across different models and datasets.
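The pixel-ratio labelling that the abstract attributes to CutMix can be illustrated with a minimal sketch. The function name `cutmix_label`, the `(x1, y1, x2, y2)` box convention, and the plain-list one-hot labels are illustrative assumptions, not code from the paper; the sketch covers only the CutMix-style label rule and does not implement the saliency-based refinement of SaliencyMix+.

```python
def cutmix_label(label_a, label_b, image_size, box):
    """Mix two label vectors by the share of pixels each image contributes.

    label_a:    label of the base image (list of floats, e.g. one-hot).
    label_b:    label of the image whose patch is pasted in.
    image_size: (width, height) of the base image.
    box:        (x1, y1, x2, y2) of the pasted patch, x2/y2 exclusive.
    """
    width, height = image_size
    x1, y1, x2, y2 = box
    # Clip the patch to the image so the area ratio stays in [0, 1].
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    patch_area = max(0, x2 - x1) * max(0, y2 - y1)
    # lam is the fraction of pixels that still come from the base image.
    lam = 1.0 - patch_area / (width * height)
    return [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]


# A 16x16 patch pasted into a 32x32 image covers one quarter of the pixels.
mixed = cutmix_label([1.0, 0.0], [0.0, 1.0], (32, 32), (0, 0, 16, 16))
# mixed == [0.75, 0.25]
```

Because the weight depends only on area, a patch that happens to cover background still receives its full share of the label, which is the source of label noise the abstract describes.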

Keywords: Data augmentation; Data models; Training; Computational modeling; Robustness; Noise; Image classification; Solid modeling; Predictive models; Object detection; SaliencyMix; saliency map; label noise minimization; image classification

Optimization of Cross-Lingual Voice Conversion With Linguistics Losses to Reduce Foreign Accents

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, vol. 31, pp. 1916-1926

Authors: Zhou, Yi; Wu, Zhizheng; Tian, Xiaohai; Li, Haizhou

Affiliations: Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 119077, Singapore; Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China; Bytedance AI lab, Speech & Audio Dept, Singapore 569933, Singapore

Cross-lingual voice conversion (XVC) transforms the speaker identity of a source speaker to that of a target speaker who speaks a different language. Due to the intrinsic differences between languages, the converted speech may carry an unwanted foreign accent. In this paper, we first investigate the intelligibility of the converted speech and confirm the performance degradation caused by the accent/intelligibility issue. With the goal of generating native-sounding speech, this paper further proposes a novel training scheme with two additional linguistic losses for speech waveform generation: 1) a frame-wise phonetic content loss derived from bottleneck features, and 2) an automatic speech recognition loss on characters. Experiments were conducted on conversions between English and Mandarin Chinese. The experimental results confirmed that the generated speech sounds more natural with the proposed linguistic losses and that the proposed solution significantly improves speech intelligibility.

Keywords: Cross-lingual voice conversion (XVC); speech intelligibility; linguistic loss

DocReal: Robust Document Dewarping of Real-Life Images via Attention-Enhanced Control Point Prediction

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Authors: Yu, Fangchen; Xie, Yina; Wu, Lei; Wen, Yafei; Wang, Guozhi; Ren, Shuai; Chen, Xiaoxin; Mao, Jianfeng; Li, Wenye

Affiliations: Chinese Univ Hong Kong, Shenzhen, Peoples R China; Vivo AI Lab, Shenzhen, Peoples R China; Shenzhen Res Inst Big Data, Shenzhen, Peoples R China

ISBN: (print) 9798350318920; 9798350318937

Document image dewarping is a crucial task in computer vision with numerous practical applications. The control point method, as a popular image dewarping approach, has attracted attention due to its simplicity and efficiency. However, inaccurate control point prediction due to varying background noises and deformation types can result in unsatisfactory performance. To address these issues, we propose a robust document dewarping approach for real-life images, namely DocReal, which utilizes Enet to effectively remove background noise and an attention-enhanced control point (AECP) module to better capture local deformations. Moreover, we augment the training data by synthesizing 2D images with 3D deformations and additional deformation types. Our proposed method achieves state-of-the-art performance on the DocUNet benchmark and a newly proposed benchmark of 200 Chinese distorted images, exhibiting superior dewarping accuracy, OCR performance, and robustness to various types of image distortion.

Keywords: Algorithms; datasets and evaluations; Image recognition and understanding; Low-level and physics-based vision

Understanding adversarial robustness against on-manifold adversarial examples

PATTERN RECOGNITION, 2025, vol. 159

Authors: Xiao, Jiancong; Yang, Liusha; Fan, Yanbo; Wang, Jue; Luo, Zhi-Quan

Affiliations: Chinese Univ Hong Kong, Shenzhen 518172, Peoples R China; Shenzhen Res Inst Big Data, Shenzhen 518172, Peoples R China; Tencent AI Lab, Shenzhen 518063, Peoples R China; Univ Penn, Philadelphia, PA, USA; Shenzhen Technol Univ, Shenzhen, Peoples R China; Ant Res, Shenzhen, Peoples R China; Dzine AI, Kortrijk, Belgium

Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples. A well-trained model can be easily attacked by adding small perturbations to the original data. One hypothesis for the existence of adversarial examples is the off-manifold assumption: adversarial examples lie off the data manifold. However, recent research has shown that on-manifold adversarial examples also exist. In this paper, we revisit the off-manifold assumption and study a question: to what extent is the poor adversarial robustness of neural networks due to on-manifold adversarial examples? Since the true data manifold is unknown in practice, we consider two approximated on-manifold adversarial examples on both real and synthetic datasets. On real datasets, we show that on-manifold adversarial examples have greater attack rates than off-manifold adversarial examples on both standard-trained and adversarially-trained models. On synthetic datasets, theoretically, we prove that on-manifold adversarial examples are powerful, yet adversarial training focuses on off-manifold directions and ignores the on-manifold adversarial examples. Furthermore, we provide analysis to show that the properties derived theoretically can also be observed in practice. Our analysis suggests that on-manifold adversarial examples are important. We should pay more attention to on-manifold adversarial examples to train robust models.

Keywords: Adversarial robustness; On-manifold adversarial examples
