Search Results - Inner Mongolia University Library

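Help

Field codes:

T = title (book title, article title), A = author (responsible party), K = subject term, P = publication name, PU = publisher name, O = organization (author affiliation, degree-granting institution, patent applicant), L = Chinese Library Classification number, C = subject classification number, U = all fields, Y = year (publication year, degree year, standard release year)

Search rules:

AND means "and"; OR means "or"; NOT means "does not contain". (Operators must be uppercase, with one space on each side.)

Search examples:

Example 1: (K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016
Example 2: P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016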
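
The rules above can be sketched as a small query builder. This is only an illustration, not part of the library's system: the helper names `term`, `combine`, and `group` are hypothetical, and the sketch assumes nothing beyond the field codes and operator rules listed in the help text (uppercase operators, one space on each side).

```python
# Hypothetical helpers for composing an advanced-search string.
# Field codes and operators are taken from the help text above.
FIELD_CODES = {"T", "A", "K", "P", "PU", "O", "L", "C", "U", "Y"}
OPERATORS = {"AND", "OR", "NOT"}


def term(field: str, value: str) -> str:
    """Build a single FIELD=value term, rejecting unknown field codes."""
    if field not in FIELD_CODES:
        raise ValueError(f"unknown field code: {field!r}")
    return f"{field}={value}"


def combine(operator: str, *parts: str) -> str:
    """Join parts with an uppercase operator padded by one space per side."""
    if operator not in OPERATORS:
        raise ValueError(f"operator must be one of {sorted(OPERATORS)}")
    return f" {operator} ".join(parts)


def group(expression: str) -> str:
    """Wrap an expression in parentheses so it is evaluated as a unit."""
    return f"({expression})"


# Rebuild Example 1 from the help text.
subjects = group(combine("OR", term("K", "图书馆学"), term("K", "情报学")))
query = combine("AND", subjects, term("A", "范并思"), term("Y", "1982-2016"))
print(query)
```

Rejecting lowercase operators in `combine` mirrors the rule that operators must be uppercase; whether the real search form rejects or silently ignores them is not stated in the help text.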


Search condition: "Any field = 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024"

4,655 records in total; showing records 341-350


Test-Time Adaptation for Depth Completion

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Park, Hyoungseob; Gupta, Anjali; Wong, Alex
Affiliations: Yale Vision Lab, New Haven, CT 06501, USA

ISBN: (print) 9798350353006

It is common to observe performance degradation when transferring models trained on some (source) datasets to target testing data due to a domain gap between them. Existing methods for bridging this gap, such as domain adaptation (DA), may require the source data on which the model was trained (often not available), while others, i.e., source-free DA, require many passes through the testing data. We propose an online test-time adaptation method for depth completion, the task of inferring a dense depth map from a single image and associated sparse depth map, that closes the performance gap in a single pass. We first present a study on how the domain shift in each data modality affects model performance. Based on our observations that the sparse depth modality exhibits a much smaller covariate shift than the image, we design an embedding module trained in the source domain that preserves a mapping from features encoding only sparse depth to those encoding image and sparse depth. During test time, sparse depth features are projected using this map as a proxy for source domain features and are used as guidance to train a set of auxiliary parameters (i.e., adaptation layer) to align image and sparse depth features from the target test domain to that of the source domain. We evaluate our method on indoor and outdoor scenarios and show that it improves over baselines by an average of 21.1%. Code available at ***/seobbro/TTA-depth-completion.

Keywords: 3D computer vision; Depth completion; Depth estimation; Multi-modal fusion; Test time adaptation

On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Zanella, Maxime; Ben Ayed, Ismail
Affiliations: UCLouvain, Louvain, Belgium; UMons, Mons, Belgium; ETS Montreal, Montreal, PQ, Canada

ISBN: (print) 9798350353006

The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which utilizes multiple augmented views of a single image to enhance zero-shot generalization, is emerging as a significant area of interest. This has predominantly directed research efforts toward test-time prompt tuning. In contrast, we introduce a robust MeanShift for Test-time Augmentation (MTA), which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally, our method does not rely on ad hoc rules (e.g., confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead, MTA incorporates a quality assessment variable for each view directly into its optimization process, termed the inlierness score. This score is jointly optimized with a density mode seeking process, leading to an efficient training- and hyperparameter-free approach. We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency. Deployed easily as a plug-and-play module on top of zero-shot models and state-of-the-art few-shot methods, MTA shows systematic and consistent improvements.

Keywords: CLIP; test-time augmentation; training-free; vision-language; zero-shot

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Kaul, Prannay; Li, Zhizhong; Yang, Hao; Dukler, Yonatan; Swaminathan, Ashwin; Taylor, C. J.; Soatto, Stefano
Affiliations: Univ Oxford, VGG, Oxford, England; AWS AI Labs, Oxford, England

ISBN: (print) 9798350353006

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats (typically a multiple-choice response regarding a particular object or attribute), which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

Keywords: benchmark; hallucination; large language model; large vision-language model; LLM; LVLM

Sharingan: A Transformer Architecture for Multi-Person Gaze Following

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Tafasca, Samy; Gupta, Anshul; Odobez, Jean-Marc
Affiliations: Idiap Res Inst, Martigny, Switzerland; Ecole Polytech Fed Lausanne, Lausanne, Switzerland

ISBN: (print) 9798350353013; 9798350353006

Gaze is a powerful form of non-verbal communication that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures, but they have been constrained by the need to process one person at a time, which proves to be highly inefficient. In this paper, we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction [38, 39], they use a fixed set of learnable embeddings to decode both the person and its gaze target, which requires a matching step afterward to link the predictions with the annotations. Thus, it is difficult to quantitatively evaluate these methods reliably with the available benchmarks, or integrate them into a larger human behavior understanding system. Instead, we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget, and ChildPlay datasets and outperforms comparable multi-person architectures with a notable margin. Our code, checkpoints, and data extractions will be made publicly available soon.

Keywords: computer vision; deep learning; gaze following

Distilling Vision-Language Models on Millions of Videos

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Zhao, Yue; Zhao, Long; Zhou, Xingyi; Wu, Jialin; Chu, Chun-Te; Mia, Hui; Schroff, Florian; Adam, Hartwig; Liu, Ting; Gong, Boqing; Krahenbuhl, Philipp; Yuan, Liangzhe
Affiliations: Google Res, Mountain View, CA 94043, USA; Univ Texas Austin, Austin, TX 78712, USA

ISBN: (print) 9798350353006

The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.

Keywords: Video analysis

3D Human Pose Perception from Egocentric Stereo Videos

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Akada, Hiroyasu; Wang, Jian; Golyanik, Vladislav; Theobalt, Christian
Affiliations: Max Planck Inst Informat, SIC, Saarbrucken, Germany

ISBN: (print) 9798350353013; 9798350353006

While head-mounted devices are becoming more compact, they provide egocentric views with significant self-occlusions of the device user. Hence, existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work, we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation, which leverages the scene information and temporal context of egocentric stereo videos. Specifically, we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames, and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. Furthermore, we introduce two new benchmark datasets, i.e., UnrealEgo2 and UnrealEgo-RW (RealWorld). The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets, allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. UnrealEgo2, UnrealEgo-RW, and trained models are available on our project page and Benchmark Challenge.

Keywords: Egocentric 3D human pose estimation; First-person view; Stereo vision

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Peng, Wujian; Xi, Sicheng; You, Zuyao; Lan, Shiyi; Wu, Zuxuan
Affiliations: Fudan Univ, Sch CS, Shanghai Key Lab Intell Info Proc, Shanghai, Peoples R China; Shanghai Collaborat Innovat Ctr Intelligent Visua, Shanghai, Peoples R China; NVIDIA, Shenzhen, Guangdong, Peoples R China

ISBN: (print) 9798350353006

Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. However, understanding fine-grained visual-linguistic concepts, such as attributes and inter-object relationships, remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity, their primary focus remains on the linguistic aspect, neglecting the visual dimension. Here, we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine, we carefully design a benchmark, SPEC, to diagnose the comprehension of object size, position, existence, and count. Subsequently, we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly, their performance is close to random guess, revealing significant limitations. With this in mind, we propose a simple yet effective approach to optimize VLMs in fine-grained understanding, achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements, further validating the transferability of our approach. Code and data are available at https://***/wjpoom/SPEC.

Keywords: Fine-grained understanding; vision language model

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Chen, Boyuan; Xu, Zhuo; Kirman, Sean; Ichter, Brian; Sadigh, Dorsa; Guibas, Leonidas; Xia, Fei
Affiliations: Google DeepMind, London, England; Google Res, Mountain View, CA, USA; MIT, 77 Massachusetts Ave, Cambridge, MA 02139, USA

ISBN: (print) 9798350353006

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size difference. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in training recipe including data quality, training pipeline and VLM architecture. Our work features the first Internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Website: https://***/

Keywords: large language model; multimodal; spatial reasoning; vision language model

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Cai, Mu; Liu, Haotian; Mustikovela, Siva Karthik; Meyer, Gregory P.; Chai, Yuning; Park, Dennis; Lee, Yong Jae
Affiliations: Univ Wisconsin, Madison, WI 53706, USA; Cruise LLC, San Francisco, CA, USA

ISBN: (print) 9798350353006

While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.

Keywords: Large Language Models; Large Multimodal Models; Multimodal Benchmark; Region-level Understanding; Vision-Language Models; Visual Commonsense Reasoning; Visual Prompts

Forecasting of 3D Whole-body Human Poses with Grasping Objects

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Yan, Haitao; Cui, Qiongjie; Xie, Jiexin; Guo, Shijie
Affiliations: Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China; Nanjing Univ Sci & Technol, Nanjing, Peoples R China

ISBN: (print) 9798350353013; 9798350353006

In the context of computer vision and human-robot interaction, forecasting 3D human poses is crucial for understanding human behavior and enhancing the predictive capabilities of intelligent systems. While existing methods have made significant progress, they often focus on predicting major body joints, overlooking fine-grained gestures and their interaction with objects. Human hand movements, particularly during object interactions, play a pivotal role and provide more precise expressions of human poses. This work fills this gap and introduces a novel paradigm: forecasting 3D whole-body human poses with a focus on grasping objects. This task involves predicting activities across all joints in the body and hands, encompassing the complexities of internal heterogeneity and external interactivity. To tackle these challenges, we also propose a novel approach, C3HOST (cross-context cross-modal consolidation for 3D whole-body pose forecasting), which effectively handles the complexities of internal heterogeneity and external interactivity. C3HOST involves distinct steps, including heterogeneous content encoding and alignment, and cross-modal feature learning and interaction. These enable us to predict activities across all body and hand joints, ensuring high-precision whole-body human pose prediction, even during object grasping. Extensive experiments on two benchmarks demonstrate that our model significantly enhances the accuracy of whole-body human motion prediction. The project page is available at https://***/view/c3host.

Keywords: 3D computer vision; cross-modal learning; human motion analysis; human motion prediction
