Search Results - Inner Mongolia University Library

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

arXiv, 2024

Authors: Cong, Gaoxiang; Pan, Jiadong; Li, Liang; Qi, Yuankai; Peng, Yuxin; van den Hengel, Anton; Yang, Jian; Huang, Qingming

Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, CAS, China; Macquarie University, Australia; Peking University, China; University of Adelaide, Australia

Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to simultaneously maintain audio-visual sync and achieve clear pronunciation; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while satisfying high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which focuses on learning the inherent consistency between lip motion and prosody variation through duration-level contrastive learning to incorporate reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module aims to decode an acoustics prior and inject the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) is used to synthesize the waveform with a flow-matching prediction network conditioned on the acoustics prior. In this process, the FUEC determines the gradient direction and guidance scale based on the user’s emotion instructions via a positive and negative guidance mechanism, which focuses on amplifying the desired emotion while suppressing others. Extensive experimental results on three benchmark datasets demonstrate favorable performance compared to several state-of-the-art methods. Copyright © 2024, The Authors. All rights reserved.

Keywords: Speech intelligibility

EDGE: Unknown-aware Multi-label Learning by Energy Distribution Gap Expansion

arXiv, 2024

Authors: Sun, Yuchen; Xu, Qianqian; Wang, Zitai; Yang, Zhiyong; He, Junwei

Affiliations: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, CAS, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China

Multi-label Out-Of-Distribution (OOD) detection aims to discriminate the OOD samples from the multi-label In-Distribution (ID) ones. Compared with its multiclass counterpart, it is crucial to model the joint information among classes. To this end, JointEnergy, which is a representative multi-label OOD inference criterion, summarizes the logits of all the classes. However, we find that JointEnergy can produce an imbalance problem in OOD detection, especially when the model lacks enough discrimination ability. Specifically, we find that the samples only related to minority classes tend to be classified as OOD samples due to the ambiguous energy decision boundary. Besides, imbalanced multi-label learning methods, originally designed for ID ones, would not be suitable for OOD detection scenarios, even producing a serious negative transfer effect. In this paper, we resort to auxiliary outlier exposure (OE) and propose an unknown-aware multi-label learning framework to reshape the uncertainty energy space layout. In this framework, the energy score is separately optimized for tail ID samples and unknown samples, and the energy distribution gap between them is expanded, such that the tail ID samples can have a significantly larger energy score than the OOD ones. What’s more, a simple yet effective measure is designed to select more informative OE datasets. Finally, comprehensive experimental results on multiple multi-label and OOD datasets reveal the effectiveness of the proposed method. © 2024, CC BY.

Keywords: Contrastive Learning
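
The EDGE abstract above describes JointEnergy only as a criterion that "summarizes the logits of all the classes". As a rough illustration, the sketch below assumes the form commonly given for JointEnergy in the multi-label OOD literature, a sum of per-class terms log(1 + exp(logit)), where a higher score suggests an in-distribution sample. That formula, the function names, and the threshold value are assumptions for illustration and are not taken from this record; the sketch does not implement the EDGE framework itself.

```python
import numpy as np

def joint_energy(logits):
    """Sum log(1 + exp(logit)) over the label axis (assumed JointEnergy form)."""
    logits = np.asarray(logits, dtype=float)
    # np.logaddexp(0, x) evaluates log(1 + exp(x)) without overflow.
    return np.logaddexp(0.0, logits).sum(axis=-1)

def flag_ood(logits, threshold):
    """Flag samples whose score falls below a hypothetical threshold."""
    return joint_energy(logits) < threshold

# Two made-up logit vectors over three labels.
confident = np.array([6.0, -4.0, 5.0])     # two labels strongly active
uncertain = np.array([-3.0, -4.0, -2.5])   # no label active
scores = joint_energy(np.stack([confident, uncertain]))
```

Because the score adds up contributions from every label, a sample that activates only one weakly predicted label receives a low score, which is in line with the imbalance problem the abstract attributes to minority classes.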

A survey on deep learning for polyp segmentation: techniques, challenges and future trends

Visual Intelligence, 2025, vol. 3, no. 1, pp. 1-20

Authors: Mei, Jiaxin; Zhou, Tao; Huang, Kaiwen; Zhang, Yizhe; Zhou, Yi; Wu, Ye; Fu, Huazhu

Affiliations: PCA Lab, Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; School of Computer Science and Engineering, Southeast University, Nanjing, China; Institute of High Performance Computing, A*STAR, Singapore, Singapore

Early detection and assessment of polyps play a crucial role in the prevention and treatment of colorectal cancer (CRC). Polyp segmentation provides an effective solution to assist clinicians in accurately locating and segmenting polyp regions. In the past, people often relied on manually extracted lower-level features such as color, texture, and shape, which often had problems capturing global context and lacked robustness to complex scenarios. With the advent of deep learning, more and more medical image segmentation algorithms based on deep learning networks have emerged, making significant progress in the field. This paper provides a comprehensive review of polyp segmentation algorithms. We first review some traditional algorithms based on manually extracted features and deep segmentation algorithms, and then describe benchmark datasets related to the topic. Specifically, we carry out a comprehensive evaluation of recent deep learning models and results based on polyp size, taking into account the focus of research topics and differences in network structures. Finally, we discuss the challenges of polyp segmentation and future trends in the field. © The Author(s) 2025.

Keywords: Comprehensive evaluation; Deep learning; Medical imaging; Polyp segmentation

Mind Modeling in Intelligence Science

4th IFIP TC 12 International Conference on Intelligence Science, ICIS 2020

Author: Shi, Zhongzhi

Affiliation: Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

ISBN: (print) 9783030748258

Intelligence Science is an interdisciplinary subject dedicated to joint research on the basic theory and technology of intelligence by brain science, cognitive science, artificial intelligence and others. Mind modeling is the core of intelligence science. Here mind means a series of cognitive abilities, which enable individuals to have consciousness, sense the outside world, think, make judgments, and remember things. The mind model consciousness and memory (CAM) is proposed by the Intelligence Science laboratory. The CAM model is a framework for artificial general intelligence and will lead the development of a new generation of artificial intelligence. This paper outlines the age of intelligence, the mind model CAM, and brain-computer integration. © 2021, IFIP International Federation for Information Processing.

Keywords: Artificial intelligence

Synthesizing Knowledge-enhanced Features for Real-world Zero-shot Food Detection

arXiv, 2024

Authors: Zhou, Pengfei; Min, Weiqing; Song, Jiajun; Zhang, Yang; Jiang, Shuqiang

Affiliations: The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; The University of Chinese Academy of Sciences, Beijing 100049, China; The Institute of Intelligent Computing Technology, Chinese Academy of Sciences, Suzhou 215124, China

Food computing brings various perspectives to computer vision like vision-based food analysis for nutrition and health. As a fundamental task in food computing, food detection needs Zero-Shot Detection (ZSD) on novel unseen food objects to support real-world scenarios, such as intelligent kitchens and smart restaurants. Therefore, we first benchmark the task of Zero-Shot Food Detection (ZSFD) by introducing FOWA dataset with rich attribute annotations. Unlike ZSD, fine-grained problems in ZSFD like inter-class similarity make synthesized features inseparable. The complexity of food semantic attributes further makes it more difficult for current ZSD methods to distinguish various food categories. To address these problems, we propose a novel framework ZSFDet to tackle fine-grained problems by exploiting the interaction between complex attributes. Specifically, we model the correlation between food categories and attributes in ZSFDet by multi-source graphs to provide prior knowledge for distinguishing fine-grained features. Within ZSFDet, Knowledge-Enhanced Feature Synthesizer (KEFS) learns knowledge representation from multiple sources (e.g., ingredients correlation from knowledge graph) via the multi-source graph fusion. Conditioned on the fusion of semantic knowledge representation, the region feature diffusion model in KEFS can generate fine-grained features for training the effective zero-shot detector. Extensive evaluations demonstrate the superior performance of our method ZSFDet on FOWA and the widely-used food dataset UECFOOD-256, with significant improvements by 1.8% and 3.7% ZSD mAP compared with the strong baseline RRFS. Further experiments on PASCAL VOC and MS COCO prove that enhancement of the semantic knowledge can also improve the performance on general ZSD. Code and dataset are available at https://***/LanceZPF/KEFS. Copyright © 2024, The Authors. All rights reserved.

Keywords: Object detection

Combating Online Misinformation Videos: Characterization, Detection, and Future Directions

arXiv, 2023

Authors: Bu, Yuyan; Sheng, Qiang; Cao, Juan; Qi, Peng; Wang, Danding; Li, Jintao

Affiliations: Institute of Computing Technology, Chinese Academy of Sciences, University of Chinese Academy of Sciences, China; Institute of Computing Technology, Chinese Academy of Sciences, China; Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China

With information consumption via online video streaming becoming increasingly popular, misinformation video poses a new threat to the health of the online information ecosystem. Though previous studies have made much progress in detecting misinformation in text and image formats, video-based misinformation brings new and unique challenges to automatic detection systems: 1) high information heterogeneity brought by various modalities, 2) blurred distinction between misleading video manipulation and nonmalicious artistic video editing, and 3) new patterns of misinformation propagation due to the dominant role of recommendation systems on online video platforms. To facilitate research on this challenging task, we conduct this survey to present advances in misinformation video detection. We first analyze and characterize the misinformation video from three levels including signals, semantics, and intents. Based on the characterization, we systematically review existing works for detection from features of various modalities to techniques for clue integration. We also introduce existing resources including representative datasets and useful tools. Besides summarizing existing studies, we discuss related areas and outline open issues and future directions to encourage and guide more research on misinformation video detection. The corresponding repository is at https://***/ICTMCG/Awesome-Misinfo-Video-Detection. © 2023, CC BY-NC-SA.

Keywords: Semantics

Bridge Frame and Event: Common Spatiotemporal Fusion for High-Dynamic Scene Optical Flow

arXiv, 2025

Authors: Zhou, Hanyu; Wang, Haonan; Liu, Haoyue; Duan, Yuxing; Chang, Yi; Yan, Luxin

Affiliations: National Key Lab of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China; School of Computing, National University of Singapore, Singapore

High-dynamic scene optical flow is a challenging task, which suffers from spatial blur and temporally discontinuous motion due to large displacement in frame imaging, thus deteriorating the spatiotemporal features of optical flow. Existing methods mainly introduce an event camera to directly fuse the spatiotemporal features between the two modalities. However, this direct fusion is ineffective, since there exists a large gap due to the heterogeneous data representation between frame and event modalities. To address this issue, we explore a common latent space as an intermediate bridge to mitigate the modality gap. In this work, we propose a novel common spatiotemporal fusion between frame and event modalities for high-dynamic scene optical flow, including visual boundary localization and motion correlation fusion. Specifically, in visual boundary localization, we find that frame and event share similar spatiotemporal gradients, whose similarity distribution is consistent with the extracted boundary distribution. This motivates us to design the common spatiotemporal gradient to constrain the reference boundary localization. In motion correlation fusion, we discover that the frame-based motion possesses spatially dense but temporally discontinuous correlation, while the event-based motion has spatially sparse but temporally continuous correlation. This inspires us to use the reference boundary to guide the complementary motion knowledge fusion between the two modalities. Moreover, common spatiotemporal fusion can not only relieve the cross-modal feature discrepancy, but also make the fusion process interpretable for dense and continuous optical flow. Extensive experiments have been performed to verify the superiority of the proposed method. Copyright © 2025, The Authors. All rights reserved.

Keywords: Optical flows

Fan-Net: Fourier-Based Adaptive Normalization for Cross-Domain Stroke Lesion Segmentation

48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Authors: Yu, Weiyi; Lei, Yiming; Shan, Hongming

Affiliations: Fudan University, School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, China; Shanghai Center for Brain Science and Brain-Inspired Technology, Shanghai 201210, China

ISBN: (print) 9781728163277

Since stroke is the main cause of various cerebrovascular diseases, deep learning-based stroke lesion segmentation on magnetic resonance (MR) images has attracted considerable attention. However, the existing methods often neglect the domain shift among MR images collected from different sites, which has limited performance improvement. To address this problem, we intend to change style information without affecting high-level semantics via adaptively changing the low-frequency amplitude components of the Fourier transform so as to enhance model robustness to varying domains. Thus, we propose a novel FAN-Net, a U-Net-based segmentation network incorporated with a Fourier-based adaptive normalization (FAN) and a domain classifier with a gradient reversal layer. The FAN module is tailored for learning adaptive affine parameters for the amplitude components of different domains, which can dynamically normalize the style information of source images. Then, the domain classifier provides domain-agnostic knowledge to endow FAN with strong domain generalizability. The experimental results on the ATLAS dataset, which consists of MR images from 9 sites, show the superior performance of the proposed FAN-Net compared with baseline methods. © 2023 IEEE.

Keywords: Magnetic resonance
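
The FAN-Net abstract above speaks of changing style information by adjusting the low-frequency amplitude components of the Fourier transform while leaving high-level semantics untouched. The sketch below illustrates only that general idea on a single 2D image with NumPy: it applies a fixed affine map to the amplitudes in a small central band and keeps the phase. The function name, the fixed `gamma` and `beta` scalars, and the `ratio` band size are assumptions for illustration. In the paper the affine parameters are learned by the FAN module, which this sketch does not reproduce.

```python
import numpy as np

def adjust_low_freq_amplitude(image, gamma=1.0, beta=0.0, ratio=0.1):
    """Apply an affine map to the low-frequency amplitudes of a 2D image.

    gamma, beta and ratio are hypothetical fixed values, not learned ones.
    """
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    amplitude, phase = np.abs(spectrum), np.angle(spectrum)

    h, w = image.shape
    bh, bw = max(1, int(h * ratio)), max(1, int(w * ratio))
    ch, cw = h // 2, w // 2
    # Window centred on the zero-frequency term (the low-frequency band).
    band = (slice(ch - bh, ch + bh + 1), slice(cw - bw, cw + bw + 1))
    amplitude[band] = gamma * amplitude[band] + beta

    # Recombine the modified amplitude with the untouched phase.
    new_spectrum = amplitude * np.exp(1j * phase)
    return np.real(np.fft.ifft2(np.fft.ifftshift(new_spectrum)))

rng = np.random.default_rng(0)
image = rng.random((16, 16))
restyled = adjust_low_freq_amplitude(image, gamma=2.0)
```

With `gamma=1.0` and `beta=0.0` the function returns the input up to floating-point error, which is a convenient sanity check before plugging in other values.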

Rich-text document styling restoration via reinforcement learning

Frontiers of Computer Science, 2021, vol. 15, no. 4, pp. 93-103

Authors: Hongwei LI; Yingpeng HU; Yixuan CAO; Ganbin ZHOU; Ping LUO

Affiliations: Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Search Product Center, WeChat Search Application Department, Tencent, Beijing 100080, China

Richly formatted documents, such as financial disclosures, scientific articles, government regulations, widely exist on ***, since most of these documents are only for public reading, the styling information inside them is usually missing, making them improper or even burdensome to be displayed and edited in different formats and *** this study we formulate the task of document styling restoration as an optimization problem, which aims to identify the styling settings on the document elements, e.g., lines, table cells, text, so that rendering with the output styling settings results in a document, where each element inside it holds the (closely) exact position with the one in the original *** that each styling setting is a decision, this problem can be transformed as a multi-step decision-making task over all the document elements, and then be solved by reinforcement ***, Monte-Carlo Tree Search (MCTS) is leveraged to explore the different styling settings, and the policy function is learnt under the supervision of the delayed *** a case study, we restore the styling information inside tables, where structural and functional data in the documents are usually *** shows that our best reinforcement method successfully restores the stylings in 87.65% of the tables, with 25.75% absolute improvement over the *** also discuss the tradeoff between the inference time and restoration success rate, and argue that although the reinforcement methods cannot be used in real-time scenarios, it is suitable for the offline tasks with high-quality ***, this model has been applied in a PDF parser to support cross-format display.

Keywords: styling restoration; monte-carlo tree search; reinforcement learning; richly formatted documents; tables