Search results - Inner Mongolia University Library

FreeCloth: Free-form Generation Enhances Challenging Clothed Human Modeling

arXiv, 2024

Authors: Ye, Hang; Ma, Xiaoxuan; Ci, Hai; Zhu, Wentao; Wang, Yizhou. Affiliations: Center on Frontiers of Computing Studies, School of Computer Science, Peking University, China; Inst. for Artificial Intelligence, Peking University, China; Nat’l Eng. Research Center of Visual Technology, China; State Key Laboratory of General Artificial Intelligence, Peking University, China

Achieving realistic animated human avatars requires accurate modeling of pose-dependent clothing deformations. Existing learning-based methods heavily rely on the Linear Blend Skinning (LBS) of minimally-clothed human models like SMPL to model deformation. However, they struggle to handle loose clothing, such as long dresses, where the canonicalization process becomes ill-defined when the clothing is far from the body, leading to disjointed and fragmented results. To overcome this limitation, we propose FreeCloth, a novel hybrid framework to model challenging clothed humans. Our core idea is to use dedicated strategies to model different regions, depending on whether they are close to or distant from the body. Specifically, we segment the human body into three categories: unclothed, deformed, and generated. We simply replicate unclothed regions that require no deformation. For deformed regions close to the body, we leverage LBS to handle the deformation. As for the generated regions, which correspond to loose clothing areas, we introduce a novel free-form, part-aware generator to model them, as they are less affected by movements. This free-form generation paradigm brings enhanced flexibility and expressiveness to our hybrid framework, enabling it to capture the intricate geometric details of challenging loose clothing, such as skirts and dresses. Experimental results on the benchmark dataset featuring loose clothing demonstrate that FreeCloth achieves state-of-the-art performance with superior visual fidelity and realism, particularly in the most challenging cases. Copyright © 2024, The Authors. All rights reserved.
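The Linear Blend Skinning (LBS) step that the abstract above uses for regions close to the body can be sketched as follows. This is a minimal NumPy illustration of standard LBS, not the authors' implementation; the function name `linear_blend_skinning`, the array shapes, and the toy two-joint setup are assumptions made for illustration.

```python
import numpy as np


def linear_blend_skinning(vertices, weights, transforms):
    """Pose vertices with standard Linear Blend Skinning.

    vertices:   (N, 3) rest-pose positions.
    weights:    (N, J) skinning weights, each row summing to 1.
    transforms: (J, 4, 4) homogeneous transform of each joint.
    Returns the (N, 3) posed positions.
    """
    num_vertices = vertices.shape[0]
    # Append a homogeneous coordinate so translations can be applied.
    homogeneous = np.concatenate([vertices, np.ones((num_vertices, 1))], axis=1)
    # Blend the joint transforms per vertex: T_i = sum_j w_ij * T_j.
    blended = np.einsum("nj,jab->nab", weights, transforms)
    # Apply each vertex's blended transform to that vertex.
    posed = np.einsum("nab,nb->na", blended, homogeneous)
    return posed[:, :3]


if __name__ == "__main__":
    # Hypothetical toy case: one vertex shared equally by two joints,
    # where the second joint is translated along the y axis.
    rest = np.array([[1.0, 0.0, 0.0]])
    skin_weights = np.array([[0.5, 0.5]])
    joint_transforms = np.stack([np.eye(4), np.eye(4)])
    joint_transforms[1, :3, 3] = [0.0, 2.0, 0.0]
    print(linear_blend_skinning(rest, skin_weights, joint_transforms))
```

Because each vertex follows a weighted mix of joint transforms, the result depends on assigning sensible weights to every point, which is plausibly where clothing far from the body becomes hard to handle.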

Keywords: Human form models

CAMG: Context-Aware Moment Graph Network for Multimodal Temporal Activity Localization via Language

12th National CCF Conference on Natural Language Processing and Chinese computing, NLPCC 2023

Authors: Hu, Yuelin; Xu, Yuanwu; Zhang, Yuejie; Feng, Rui; Zhang, Tao; Lu, Xuequan; Gao, Shang. Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai 200433, China; School of Information Management and Engineering, Shanghai Key Laboratory of Financial Information Technology, Shanghai University of Finance and Economics, Shanghai 200433, China; School of Information Technology, Deakin University, Waurn Ponds, VIC 3216, Australia

ISBN: (print) 9783031446924

Temporal Activity Localization via Language (TALL) is a challenging task in language-based video understanding, especially when a video contains multiple moments of interest and the language query has words describing complex context dependencies between the moments. Recent studies have proposed various ways to exploit the temporal context of adjacent moments, but two apparent limitations remain. First, only limited context information was encoded based on RNNs or 2-D convolutions, which depended heavily on the pre-sorting of proposals and lacked flexibility. Second, semantically correlated content in different moments, i.e., the semantic context, was ignored. To address these limitations, we propose a novel GCN-based framework, i.e., the Context-Aware Moment Graph (CAMG) network, to jointly model the temporal context and the semantic context. We also design a multi-step fusion scheme to aggregate object, motion and textual features. A Query-Gated Integration Module is further designed to select queried objects and filter out noisy ones. Our model achieves superior performance to state-of-the-art methods on two widely-used benchmark datasets. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords: Semantics

Multi-Modality Deep Network for Extreme Learned Image Compression

arXiv, 2023

Authors: Jiang, Xuhao; Tan, Weimin; Tan, Tian; Yan, Bo; Shen, Liquan. Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China; School of Communication, Shanghai University, Shanghai, China

Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of text is used as prior information to guide image compression for better compression performance. We fully study the role of text description in different components of the codec, and demonstrate its effectiveness. In addition, we adopt the image-text attention module and image-request complement module to better fuse image and text features, and propose an improved multimodal semantic-consistent loss to produce semantically complete reconstructions. Extensive experiments, including a user study, prove that our method can obtain visually pleasing results at extremely low bitrates, and achieves a comparable or even better performance than state-of-the-art methods, even though these methods are at 2× to 4× bitrates of ours. Copyright © 2023, The Authors. All rights reserved.

Keywords: Image compression

PlugAT: A Plug and Play Module to Defend against Textual Adversarial Attack

29th International Conference on Computational Linguistics, COLING 2022

Authors: Zheng, Rui; Bao, Rong; Liu, Qin; Gui, Tao; Zhang, Qi; Huang, Xuanjing; Xie, Rui; Wu, Wei. Affiliations: School of Computer Science, Fudan University, China; Ant Group, China; Viterbi School of Engineering, University of Southern California, China; Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China; Shanghai Collaborative Innovation Center of Intelligent Visual Computing, China; Meituan Inc., Beijing, China

Adversarial training, which minimizes the loss of adversarially perturbed examples, has received considerable attention. However, these methods require modifying all model parameters and optimizing the model from scratch, which is parameter inefficient and unfriendly to the already deployed models. As an alternative, we propose a pluggable defense module PlugAT, to provide robust predictions by adding a few trainable parameters to the model inputs while keeping the original model frozen. To reduce the potential side effects of using defense modules, we further propose a novel forgetting restricted adversarial training, which filters out bad adversarial examples that impair the performance of original ones. The PlugAT-equipped BERT model substantially improves robustness over several strong baselines on various text classification tasks, whilst training only 9.1% parameters. We observe that defense modules trained under the same model architecture have domain adaptation ability between similar text classification datasets. © 2022 Proceedings - International Conference on Computational Linguistics, COLING. All rights reserved.

Keywords: Classification (of information)

Multi-Modality Deep Network for JPEG Artifacts Reduction

arXiv, 2023

Authors: Jiang, Xuhao; Tan, Weimin; Lin, Qing; Ma, Chenxi; Yan, Bo; Shen, Liquan. Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China; School of Communication, Shanghai University, Shanghai, China

In recent years, many convolutional neural network-based models have been designed for JPEG artifacts reduction and have achieved notable progress. However, few methods are suitable for reducing artifacts in extremely low-bitrate image compression. The main challenge is that a highly compressed image loses too much information, which makes it difficult to reconstruct a high-quality image. To address this issue, we propose a multimodal fusion learning method for text-guided JPEG artifacts reduction, in which the corresponding text description not only provides the potential prior information of the highly compressed image, but also serves as supplementary information to assist in image deblocking. We fuse image features and text semantic features from the global and local perspectives respectively, and design a contrastive loss built upon contrastive learning to produce visually pleasing results. Extensive experiments, including a user study, prove that our method can obtain better deblocking results compared to the state-of-the-art methods. Copyright © 2023, The Authors. All rights reserved.

Keywords: Semantics

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Jilan Xu; Junlin Hou; Yuejie Zhang; Rui Feng; Yi Wang; Yu Qiao; Weidi Xie. Affiliations: Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University; Shanghai AI Laboratory; CMIC, Shanghai Jiao Tong University

This paper considers the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes beyond a pre-defined, closed set of categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention-based binding module, then aligns the group tokens to the corresponding caption embeddings. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, encouraging the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M with frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on four benchmark datasets: PASCAL VOC, PASCAL Context, COCO Object, and ADE20K. OVSegmentor achieves superior results over state-of-the-art approaches on PASCAL VOC using only 3% of the data (4M vs 134M) for pre-training.

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network

arXiv, 2023

Authors: Su, Yuchen; Chen, Zhineng; Shao, Zhiwen; Du, Yuning; Ji, Zhilong; Bai, Jinfeng; Zhou, Yong; Jiang, Yu-Gang. Affiliations: Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China; Baidu Inc., Iran; China University of Mining and Technology, China; Tomorrow Advancing Life, China

Recently, regression-based methods, which predict parameterized text shapes for text localization, have gained popularity in scene text detection. However, the existing parameterized text shape methods still have limitations in modeling arbitrary-shaped texts due to ignoring the utilization of text-specific shape information. Moreover, the time consumption of the entire pipeline has been largely overlooked, leading to a suboptimal overall inference speed. To address these issues, we first propose a novel parameterized text shape method based on low-rank approximation. Unlike other shape representation methods that employ data-irrelevant parameterization, our approach utilizes singular value decomposition and reconstructs the text shape using a few eigenvectors learned from labeled text contours. By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation. Next, we propose a dual assignment scheme for speed acceleration. It adopts a sparse assignment branch to accelerate the inference speed, and meanwhile, provides ample supervised signals for training through a dense assignment branch. Building upon these designs, we implement an accurate and efficient arbitrary-shaped text detector named LRANet. Extensive experiments are conducted on several challenging benchmarks, demonstrating the superior accuracy and efficiency of LRANet compared to state-of-the-art methods. Code is available at: https://***/ychensu/***. Copyright © 2023, The Authors. All rights reserved.
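The low-rank shape representation described in the abstract above (learning a few basis vectors from labeled contours with singular value decomposition, then reconstructing a shape from a handful of coefficients) can be sketched as follows. This is only an illustration of the general idea, not LRANet's actual pipeline: the function names, the fixed-length flattened contour layout, the absence of mean-centering, and the synthetic ellipse data are all assumptions.

```python
import numpy as np


def fit_contour_basis(contours, rank):
    """Learn `rank` basis vectors from an (M, 2K) matrix of flattened contours."""
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(contours, full_matrices=False)
    return vt[:rank]  # (rank, 2K), rows are orthonormal


def encode_contour(contour, basis):
    """Project a flattened (2K,) contour onto the basis, giving (rank,) coefficients."""
    return basis @ contour


def decode_contour(coefficients, basis):
    """Rebuild a flattened (2K,) contour from its (rank,) coefficients."""
    return coefficients @ basis


def make_ellipse(a, b, num_points=16):
    """Hypothetical stand-in for a labeled contour: x coordinates, then y coordinates."""
    t = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return np.concatenate([a * np.cos(t), b * np.sin(t)])


if __name__ == "__main__":
    axes = [(1, 1), (2, 1), (1, 2), (3, 2), (2, 3), (4, 1)]
    training = np.stack([make_ellipse(a, b) for a, b in axes])
    basis = fit_contour_basis(training, rank=2)

    unseen = make_ellipse(3.0, 1.5)
    rebuilt = decode_contour(encode_contour(unseen, basis), basis)
    print("max reconstruction error:", np.abs(unseen - rebuilt).max())
```

In this toy setting every ellipse is a combination of two fixed vectors, so two basis vectors suffice; real text contours would need a larger rank and would only be reconstructed approximately.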

Keywords: Parameterization

The NeRF Signature: Codebook-Aided Watermarking for Neural Radiance Fields

arXiv, 2025

Authors: Luo, Ziyuan; Rocha, Anderson; Shi, Boxin; Guo, Qing; Li, Haoliang; Wan, Renjie. Affiliations: Department of Computer Science, Hong Kong Baptist University, Hong Kong; Institute of Computing, University of Campinas, Brazil; State Key Laboratory of Multimedia Information Processing and National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing 100871, China; A*STAR, Singapore; Department of Electrical Engineering, City University of Hong Kong, Hong Kong

Neural Radiance Fields (NeRF) have been gaining attention as a significant form of 3D content representation. With the proliferation of NeRF-based creations, the need for copyright protection has emerged as a critical issue. Although some approaches have been proposed to embed digital watermarks into NeRF, they often neglect essential model-level considerations and incur substantial time overheads, resulting in reduced imperceptibility and robustness, along with user inconvenience. In this paper, we extend the previous criteria for image watermarking to the model level and propose NeRF Signature, a novel watermarking method for NeRF. We employ a Codebook-aided Signature Embedding (CSE) that does not alter the model structure, thereby maintaining imperceptibility and enhancing robustness at the model level. Furthermore, after optimization, any desired signatures can be embedded through the CSE, and no fine-tuning is required when NeRF owners want to use new binary signatures. Then, we introduce a joint pose-patch encryption watermarking strategy to hide signatures into patches rendered from a specific viewpoint for higher robustness. In addition, we explore a Complexity-Aware Key Selection (CAKS) scheme to embed signatures in high visual complexity patches to enhance imperceptibility. The experimental results demonstrate that our method outperforms other baseline methods in terms of imperceptibility and robustness. The source code is available at: https://***/luo-ziyuan/NeRF_Signature. Copyright © 2025, The Authors. All rights reserved.

Keywords: Image watermarking

SAT-HMR: Real-Time Multi-Person 3D Mesh Estimation via Scale-Adaptive Tokens

arXiv, 2024

Authors: Su, Chi; Ma, Xiaoxuan; Su, Jiajun; Wang, Yizhou. Affiliations: Center on Frontiers of Computing Studies, School of Computer Science, Peking University, China; Inst. for Artificial Intelligence, Peking University, China; Nat’l Eng. Research Center of Visual Technology, China; State Key Laboratory of General Artificial Intelligence, Peking University, China

We propose SAT-HMR, a one-stage framework for real-time multi-person 3D human mesh estimation from a single RGB image. While current one-stage methods, which follow a DETR-style pipeline, achieve state-of-the-art (SOTA) performance with high-resolution inputs, we observe that this particularly benefits the estimation of individuals in smaller scales of the image (e.g., those of young age or far from the camera), but at the cost of significantly increased computation overhead. To address this, we introduce scale-adaptive tokens that are dynamically adjusted based on the relative scale of each individual in the image within the DETR framework. Specifically, individuals in smaller scales are processed at higher resolutions, larger ones at lower resolutions, and background regions are further distilled. These scale-adaptive tokens more efficiently encode the image features, facilitating subsequent decoding to regress the human mesh, while allowing the model to allocate computational resources more effectively and focus on more challenging cases. Experiments show that our method preserves the accuracy benefits of high-resolution processing while substantially reducing computational cost, achieving real-time inference with performance comparable to SOTA methods. Copyright © 2024, The Authors. All rights reserved.

Keywords: Image coding