Search Results - Inner Mongolia University Library

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Zhenxin Li, Shiyi Lan, Jose M. Alvarez, Zuxuan Wu

Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; NVIDIA

ISBN: (digital) 9798350353006

ISBN: (print) 9798350353013

Recently, the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However, we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization, depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components, including a CRF-modulated depth estimation module enforcing object-level consistencies, a long-term temporal aggregation module with extended receptive fields, and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a “modernized” dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark, BEVNeXt outperforms both BEV-based and query-based frameworks under various settings, achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set.

Keywords: Location awareness, Three-dimensional displays, Estimation, Object detection, Detectors, Benchmark testing, Transformers

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Wujian Peng, Sicheng Xie, Zuyao You, Shiyi Lan, Zuxuan Wu

Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; NVIDIA

ISBN: (digital) 9798350353006

ISBN: (print) 9798350353013

Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://***/wjpoom/SPEC.

Keywords: Visualization, Computer vision, Codes, Computational modeling, Pipelines, Benchmark testing, Linguistics

Multi-Modality Deep Network for Extreme Learned Image Compression

arXiv, 2023

Authors: Xuhao Jiang, Weimin Tan, Tian Tan, Bo Yan, Liquan Shen

Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China; School of Communication, Shanghai University, Shanghai, China

Image-based single-modality compression learning approaches have demonstrated exceptionally powerful encoding and decoding capabilities in the past few years, but suffer from blur and severe semantics loss at extremely low bitrates. To address this issue, we propose a multimodal machine learning method for text-guided image compression, in which the semantic information of text is used as prior information to guide image compression for better compression performance. We fully study the role of text description in different components of the codec, and demonstrate its effectiveness. In addition, we adopt the image-text attention module and image-request complement module to better fuse image and text features, and propose an improved multimodal semantic-consistent loss to produce semantically complete reconstructions. Extensive experiments, including a user study, prove that our method can obtain visually pleasing results at extremely low bitrates, and achieves a comparable or even better performance than state-of-the-art methods, even though these methods are at 2× to 4× bitrates of ours. Copyright © 2023, The Authors. All rights reserved.

Keywords: Image compression

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Lingchen Meng, Xiyang Dai, Yinpeng Chen, Pengchuan Zhang, Dongdong Chen, Mengchen Liu, Jianfeng Wang, Zuxuan Wu, Lu Yuan, Yu-Gang Jiang

Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Microsoft

Combining multiple datasets enables a performance boost on many computer vision tasks. But a similar trend has not been witnessed in object detection, due to two inconsistencies among detection datasets: taxonomy difference and domain gap. In this paper, we address these challenges with a new design (named Detection Hub) that is dataset-aware and category-aligned. It not only mitigates the dataset inconsistency but also provides coherent guidance for the detector to learn across multiple datasets. In particular, the dataset-aware design is achieved by learning a dataset embedding that is used to adapt object queries as well as convolutional kernels in detection heads. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings and leveraging the semantic coherence of language embeddings. Detection Hub fulfills the benefits of large data on object detection. Experiments demonstrate that joint training on multiple datasets achieves significant performance gains over training on each dataset alone. Detection Hub further achieves SoTA performance on the UODB benchmark, which covers a wide variety of datasets.

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

Affiliations: Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University; Shanghai Collaborative Innovation Center of Intelligent Visual Computing; Meituan

Fusing LiDAR and camera information is essential for accurate and reliable 3D object detection in autonomous driving systems. This is challenging due to the difficulty of combining multi-granularity geometric and semantic features from two drastically different modalities. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as “seeds”) into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques. However, depth information is under-investigated in these approaches when lifting points into 3D space, so 2D semantics cannot be reliably fused with 3D points. Moreover, their multi-modal fusion strategy, which is implemented as concatenation or attention, either cannot effectively fuse 2D and 3D information or is unable to perform fine-grained interactions in the voxel space. To this end, we propose a novel framework with better utilization of the depth information and fine-grained cross-modal interaction between LiDAR and camera, which consists of two important components. First, a Multi-Depth Unprojection (MDU) method is used to enhance the depth quality of the lifted points at each interaction level. Second, a Gated Modality-Aware Convolution (GMA-Conv) block is applied to modulate voxels involved with the camera modality in a fine-grained manner and then aggregate multi-modal features into a unified space. Together they provide the detection head with more comprehensive features from LiDAR and camera. On the nuScenes test benchmark, our proposed method, abbreviated as MSMDFusion, achieves state-of-the-art results on both 3D object detection and tracking tasks without using test-time augmentation or ensemble techniques. The code is available at https://***/SxJyJay/MSMDFusion.

Multi-Modality Deep Network for JPEG Artifacts Reduction

arXiv, 2023

Authors: Xuhao Jiang, Weimin Tan, Qing Lin, Chenxi Ma, Bo Yan, Liquan Shen

Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University, Shanghai, China; School of Communication, Shanghai University, Shanghai, China

In recent years, many convolutional neural network-based models have been designed for JPEG artifacts reduction and have achieved notable progress. However, few methods are suitable for reducing the artifacts of extremely low-bitrate image compression. The main challenge is that a highly compressed image loses too much information, which makes it difficult to reconstruct a high-quality image. To address this issue, we propose a multimodal fusion learning method for text-guided JPEG artifacts reduction, in which the corresponding text description not only provides the potential prior information of the highly compressed image, but also serves as supplementary information to assist in image deblocking. We fuse image features and text semantic features from the global and local perspectives respectively, and design a contrastive loss built upon contrastive learning to produce visually pleasing results. Extensive experiments, including a user study, prove that our method can obtain better deblocking results compared to the state-of-the-art methods. Copyright © 2023, The Authors. All rights reserved.

Keywords: Semantics

SPTNET: Span-based Prompt Tuning for Video Grounding

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Yiren Zhang, Yuanwu Xu, Mohan Chen, Yuejie Zhang, Rui Feng, Shang Gao

Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University; School of Information Technology, Deakin University

When a Pre-trained Language Model (PLM) is adopted in the video grounding task, it usually acts as a text encoder without having its knowledge fully utilized. Also, there is an inconsistency between the pre-training and downstream objectives. To solve these issues, we propose a new paradigm, named Span-based Prompt Tuning (SPTNet), which converts the video grounding task into a cloze form. Specifically, a query is first changed by a template into a form with a mask token, then the video and the query embeddings are integrated through a cross-modal transformer. The start and end points of the time span matching the query are predicted from the embedding of the mask token. Experimental results on two public benchmarks, ActivityNet Captions and Charades-STA, show that our SPTNet achieves superior performance compared with state-of-the-art methods.

Conditional Video-Text Reconstruction Network with Cauchy Mask for Weakly Supervised Temporal Sentence Grounding

IEEE International Conference on Multimedia and Expo (ICME)

Authors: Jueqi Wei, Yuanwu Xu, Mohan Chen, Yuejie Zhang, Rui Feng, Shang Gao

Affiliations: School of Computer Science, Shanghai Key Laboratory of Intelligent Information Processing, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University; School of Information Technology, Deakin University

Temporal sentence grounding aims to detect the target segment most related to a given query in an untrimmed video. To alleviate the expensive annotation cost of temporal labels, researchers have paid more attention to the weakly supervised setting. Prior studies neglected the utilization of video representation reconstruction, which led to unbalanced alignment learning. Moreover, they used different strategies to generate proposals, which ignored the temporal structure in a query. In this paper, we propose a novel Conditional Video-Text Reconstruction Network (CVTRN). It supports conditional reconstruction of video and text representations. Specifically, video and text features are fused to compute semantic alignment, which is the condition of reconstruction. A new mask strategy for mask-conditioned sentence reconstruction is also devised. This strategy focuses more on boundary regions than the Gaussian mask widely used in previous methods. Experimental results on two public benchmark datasets show that our CVTRN outperforms the state-of-the-art methods.

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

Affiliations: Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Shanghai Collaborative Innovation Center of Intelligent Visual Computing, Fudan University; Shanghai AI Laboratory; CMIC, Shanghai Jiao Tong University

This paper considers the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes beyond a pre-defined, closed set of categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, then aligns the group tokens to corresponding caption embeddings. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given group tokens, which enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, encouraging the model to learn visual invariance. Third, we construct the CC4M dataset for pre-training by filtering CC12M with frequently appearing entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on four benchmark datasets: PASCAL VOC, PASCAL Context, COCO Object, and ADE20K. OVSegmentor achieves superior results over state-of-the-art approaches on PASCAL VOC using only 3% of the data (4M vs 134M) for pre-training.

M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection

arXiv, 2021

Authors: Junke Wang, Zuxuan Wu, Wenhao Ouyang, Xintong Han, Jingjing Chen, Ser-Nam Lim, Yu-Gang Jiang

Affiliations: Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University, China; Shanghai Collaborative Innovation Center on Intelligent Visual Computing, China; Huya Inc; Meta AI

The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain to complement RGB information through a carefully designed cross modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 DeepFake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins. Copyright © 2021, The Authors. All rights reserved.

Keywords: Frequency domain analysis
