检索结果-内蒙古大学图书馆

Delving into the Frequency: Temporally Consistent Human Motion Transfer in the Fourier Space

学校读者我要写书评

暂无评论

arXiv 2022年

作者： Yang, Guang Liu, Wu Liu, Xinchen Gu, Xiaoyan Cao, Juan Li, Jintao Institute of Computing Technology Chinese Academy of Sciences China Jd Explore Academy Institute of Information Engineering Chinese Academy of Sciences China The Key Lab of Intelligent Information Processing of Chinese Academy of Sciences The University of Chinese Academy of Sciences China

Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos that enable one person to imitate the motion of others. However, current synthetic videos suffer from the temporal inconsistency in sequential frames that significantly degrades the video quality, yet is far from solved by existing methods in the pixel domain. Recently, some works on DeepFake detection try to distinguish the natural and synthetic images in the frequency domain because of the frequency insufficiency of image synthesizing methods. Nonetheless, there is no work to study the temporal inconsistency of synthetic videos from the aspects of the frequency-domain gap between natural and synthetic videos. Therefore, in this paper, we propose to delve into the frequency space for temporally consistent human motion transfer. First of all, we make the first comprehensive analysis of natural and synthetic videos in the frequency domain to reveal the frequency gap in both the spatial dimension of individual frames and the temporal dimension of the video. To close the frequency gap between the natural and synthetic videos, we propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of the synthesized videos. FreMOTR explores two novel frequency-based regularization modules: 1) the Frequency-domain Appearance Regularization (FAR) to improve the appearance of the person in individual frames and 2) Temporal Frequency Regularization (TFR) to guarantee the temporal consistency between adjacent frames. Finally, comprehensive experiments demonstrate that the FreMOTR not only yields superior performance in temporal consistency metrics but also improves the frame-level visual quality of synthetic videos. In particular, the temporal consistency metrics are improved by nearly 30% than the state-of-the-art model. © 2022, CC BY.

关键词： Generative adversarial networks

Hard-instance learning for quantum adiabatic prime factorization

学校读者我要写书评

暂无评论

Physical Review A 2022年第6期105卷 062455-062455页

作者： Jian Lin Zhengfeng Zhang Junping Zhang Xiaopeng Li State Key Laboratory of Surface Physics Institute of Nanoelectronics and Quantum Computing and Department of Physics Fudan University Shanghai 200433 China Shanghai Key Lab of Intelligent Information Processing and School of Computer Science Fudan University Shanghai 200433 China Shanghai Qi Zhi Institute Xuhui District Shanghai 200032 China Shanghai Research Center for Quantum Sciences Shanghai 201315 China

Prime factorization is a difficult problem with classical computing, whose exponential hardness is the foundation of Rivest-Shamir-Adleman cryptography. With programable quantum devices, adiabatic quantum computing has been proposed as a plausible approach to solve prime factorization, having promising advantage over classical computing. Here, we find there are certain hard instances that are consistently intractable for both classical simulated annealing and unconfigured adiabatic quantum computing (AQC). Aiming at an automated architecture for optimal configuration of quantum adiabatic factorization, we apply a deep reinforcement learning (RL) method to configure the AQC algorithm. By setting the success probability of the worst-case problem instances as the reward to RL, we show the AQC performance on the hard instances is dramatically improved by RL configuration. The success probability also becomes more evenly distributed over different problem instances, meaning the configured AQC is more stable as compared to the unconfigured case. Through a technique of transfer learning, we find prominent evidence that the framework of AQC configuration is scalable—the configured AQC as trained on five qubits remains working efficiently on nine qubits with a minimal amount of additional training cost.

关键词： Machine learning Quantum algorithms Quantum computation Quantum simulation

OmniVL: one foundation model for image-language and video-language tasks 22

学校读者我要写书评

暂无评论

OmniVL: one foundation model for image-language and video-la...

Proceedings of the 36th International Conference on Neural information processing Systems

作者： Junke Wang Dongdong Chen Zuxuan Wu Chong Luo Luowei Zhou Yucheng Zhao Yujia Xie Ce Liu Yu-Gang Jiang Lu Yuan Shanghai Key Lab of Intell. Info. Processing School of CS Fudan University and Shanghai Collaborative Innovation Center on Intelligent Visual Computing Microsoft Cloud + AI Microsoft Research Asia

ISBN: (纸本)9781713871088

This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and thus can perform joint image-language and video-language pretraining. We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., use image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose the vision-language modeling into spatial and temporal dimensions and obtain performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multi-modal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.

关键词：

SVFormer: Semi-supervised Video Transformer for Action Recognition

学校读者我要写书评

暂无评论

arXiv 2022年

作者： Xing, Zhen Dai, Qi Hu, Han Chen, Jingjing Wu, Zuxuan Jiang, Yu-Gang Shanghai Key Lab of Intell. Info. Processing School of CS Fudan University China Shanghai Collaborative Innovation Center of Intelligent Visual Computing China Microsoft Research Asia China

Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, yet current revolutionary vision transformer models have been less explored. In this paper, we investigate the use of transformer models under the SSL setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data where video clips are mixed via a mask with consistent masked tokens over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets Kinetics-400, UCF-101, and HMDB-51 verify the advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future search on semi-supervised action recognition with Transformer networks. Code is released at https://***/ChenHsing/SVFormer. Copyright © 2022, The Authors. All rights reserved.

关键词： Machine learning

DISTANCE RESTRICTED TRANSFORMER ENCODER FOR MULTI-labEL CLASSIFICATION

学校读者我要写书评

暂无评论

DISTANCE RESTRICTED TRANSFORMER ENCODER FOR MULTI-LABEL CLAS...

2021 IEEE International Conference on Multimedia and Expo, ICME 2021

作者： Wang, Xiaomei Li, Yaqian Luo, Tong Guo, Yandong Fu, Yanwei Xue, Xiangyang School of Computer Science Fudan University Shanghai China OPPO Research Institute Shanghai China The School of Data Science MOE Frontiers Center for Brain Science Shanghai Key Lab of Intelligent Information Processing Fudan University Shanghai China

ISBN: (纸本)9781665438643

Multi-label image classification is a fundamental but challenging task in Multimedia *** aims to predict a set of labels presented in an image. Great progress has been made by exploring convolutional neural network with binary cross-entropy loss recently. However, conventional approaches are limited to highlight the key visual contents associated with target labels and pay little attention to confining the distances between visual and positive/negative label representations. To target these aspects, we firstly introduce a variant transformer encoder model for acquiring the underlying and crucial visual information related to ground truth labels. Specifically, a novel primal feature guided net is designed to maintain the original visual features during encoding process. Secondly, we exploit a distance restricted learning strategy in a common semantic space to shrink the distances of images with positive labels while expand with the negative ones during training stage. Extensive experiments are executed on MSCOCO and WIDER Attribute datasets and outstanding performance is achieved compared with other state-of-the-art models. © 2021 IEEE

关键词： Semantics

Stable Attribute Group Editing for Reliable Few-shot Image Generation

学校读者我要写书评

暂无评论

arXiv 2023年

作者： Ding, Guanqi Han, Xinzhe Wang, Shuhui Wu, Shuzhe Jin, Xin Tu, Dandan Huang, Qingming The School of Computer Science and Technology University of Chinese Academy of Sciences Beijing101408 China The Key Laboratory of Intelligent Information Processing Institute of Computing Technology Chinese Academy of Sciences Beijing100190 China Peng Cheng Laboratory Shenzhen518066 China Huawei Cloud EI Innovation Lab China

Few-shot image generation aims to generate data of an unseen category based on only a few samples. Apart from basic content generation, a bunch of downstream applications hopefully benefit from this task, such as low-data detection and few-shot classification. To achieve this goal, the generated images should guarantee category retention for classification beyond the visual quality and diversity. In our preliminary work, we present an "editing-based" framework Attribute Group Editing (AGE) for reliable few-shot image generation, which largely improves the performance compared with existing methods that require re-training a GAN with limited data. Nevertheless, AGE's performance on downstream classification is not as satisfactory as expected. This paper investigates the class inconsistency problem and proposes Stable Attribute Group Editing (SAGE) for more stable class-relevant image generation. Different from AGE which directly edits from a one-shot image, SAGE takes use of all given few-shot images and estimates a class center embedding based on the category-relevant attribute dictionary. Meanwhile, according to the projection weights on the category-relevant attribute dictionary, we can select category-irrelevant attributes from the similar seen categories. Consequently, SAGE injects the whole distribution of the novel class into StyleGAN's latent space, thus largely remains the category retention and stability of the generated images. Going one step further, we find that class inconsistency is a common problem in GAN-generated images for downstream classification. Even though the generated images look photo-realistic and requires no category-relevant editing, they are usually of limited help for downstream classification. We systematically discuss this issue from both the generative model and classification model perspectives, and propose to boost the downstream classification performance of SAGE by enhancing the pixel and frequency components. Extensive experime

关键词： Generative adversarial networks

Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization

学校读者我要写书评

暂无评论

arXiv 2022年

作者： Yu, Jiashuo Pu, Junfu Cheng, Ying Feng, Rui Shan, Ying The School of Computer Science Shanghai Key Lab of Intelligent Information Processing Shanghai Collaborative Innovation Center of Intelligent Visual Computing Fudan University Shanghai200438 China The Academy for Engineering and Technology Fudan University Shanghai200438 China The Applied Research Center PCG Tencent Shenzhen518000 China

Although audio-visual representation has been proven to be applicable in many downstream tasks, the representation of dancing videos, which is more specific and always accompanied by music with complex auditory contents, remains challenging and uninvestigated. Considering the intrinsic alignment between the cadent movement of the dancer and music rhythm, we introduce MuDaR, a novel Music-Dance Representation learning framework to perform the synchronization of music and dance rhythms both in explicit and implicit ways. Specifically, we derive the dance rhythms based on visual appearance and motion cues inspired by the music rhythm analysis. Then the visual rhythms are temporally aligned with the music counterparts, which are extracted by the amplitude of sound intensity. Meanwhile, we exploit the implicit coherence of rhythms implied in audio and visual streams by contrastive learning. The model learns the joint embedding by predicting the temporal consistency between audio-visual pairs. The music-dance representation, together with the capability of detecting audio and visual rhythms, can further be applied to three downstream tasks: (a) dance classification, (b) music-dance retrieval, and (c) music-dance retargeting. Extensive experiments demonstrate that our proposed framework outperforms other self-supervised methods by a large margin. © 2022, CC BY.

关键词： Supervised learning

Improving Adversarial Transferability with Neighbourhood Gradient information

学校读者我要写书评

暂无评论

arXiv 2024年

作者： Guo, Haijing Wang, Jiafeng Chen, Zhaoyu Jiang, Kaixun Hong, Lingyi Guo, Pinxue Li, Jinglun Zhang, Wenqiang Shanghai Key Lab of Intelligent Information Processing School of Computer Science Fudan University Shanghai200433 China Shanghai Engineering Research Center of AI & Robotics Academy for Engineering & Technology Fudan University Shanghai200433 China Engineering Research Center of Robotics Ministry of Education Academy for Engineering & Technology Fudan University Shanghai200433 China

Deep neural networks (DNNs) are known to be susceptible to adversarial examples, leading to significant performance degradation. In black-box attack scenarios, a considerable attack performance gap between the surrogate model and the target model persists. This work focuses on enhancing the transferability of adversarial examples to narrow this performance gap. We observe that the gradient information around the clean image, i.e. Neighbourhood Gradient information, can offer high transferability. Leveraging this, we propose the NGI-Attack, which incorporates Example Backtracking and Multiplex Mask strategies, to use this gradient information and enhance transferability fully. Specifically, we first adopt Example Backtracking to accumulate Neighbourhood Gradient information as the initial momentum term. Multiplex Mask, which forms a multi-way attack strategy, aims to force the network to focus on non-discriminative regions, which can obtain richer gradient information during only a few iterations. Extensive experiments demonstrate that our approach significantly enhances the adversarial transferability. Especially, when attacking numerous defense models, we achieve an average attack success rate of 95.8%. Notably, our method can plugin with any off-the-shelf algorithm to improve their attack performance without additional time cost. Copyright © 2024, The Authors. All rights reserved.

关键词：

Evidence-Aware Multi-Modal Data Fusion and its Application to Total Knee Replacement Prediction

学校读者我要写书评

暂无评论

Evidence-Aware Multi-Modal Data Fusion and its Application t...

Proceedings of the Digital Image computing: Technqiues and Applications (DICTA)

作者： Xinwen Liu Jing Wang S. Kevin Zhou Craig Engstrom Shekhar S. Chandra School of Electrical Engineering and Computer Science The University of Queensland Brisbane Australia The Commonwealth Scientific and Industrial Research Organisation Canberra Australia Center for Medical Imaging Robotics Analytic Computing & Learning (MIRACLE) School of Biomedical Engineering & Suzhou Institute for Advanced Research University of Science and Technology of China Suzhou China Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS) Institute of Computing Technology CAS Beijing China School of Human Movement and Nutrition Sciences The University of Queensland Brisbane Australia

ISBN: (数字)9798350379037

ISBN: (纸本)9798350379044

Deep neural networks have been widely studied to predict a medical condition, such as total knee replacement (TKR). It has shown that data of different modalities, such as imaging data, clinical variables, and demographic information, provide complementary information and thus can improve the prediction accuracy together. However, the data sources of various modalities may not always be of high quality, and each modality may have only partial information of medical condition. Thus, predictions from different modalities can be in conflict, and the final prediction may fail in the presence of such a conflict. Therefore, it is important to account for the reliability of each source data and the prediction output when making a final decision. In this paper, we propose an evidence-aware multimodal data fusion framework based on the Dempster-Shafer theory (DST). The backbone models contain an image branch, a non-image branch and a fusion branch. For each branch, there is an evidence network that takes the extracted features as input and outputs an evidence score, which is designed to represent the reliability of the output from the current branch. The output probabilities along with the evidence scores from multiple branches are combined with the Dempster's combination rule to make a final prediction. Experimental results on the public OA initiative (OAI) dataset for the TKR prediction task show that the proposed method has better performance by accounting for conflicts from various modalities.

关键词： Medical conditions Soft sensors Digital images Data integration Imaging Artificial neural networks Reliability theory Feature extraction Reliability engineering Knee replacement