
Refine Results

Document Type

  • 11,267 conference papers
  • 14 journal articles

Collection

  • 11,281 electronic documents
  • 0 print holdings


Subject Classification

  • 7,859 Engineering
    • 7,418 Computer Science and Technology...
    • 799 Mechanical Engineering
    • 390 Electrical Engineering
    • 377 Software Engineering
    • 224 Control Science and Engineering
    • 68 Optical Engineering
    • 32 Information and Communication Engineering
    • 26 Bioengineering
    • 10 Biomedical Engineering (...
    • 8 Chemical Engineering and Technology
    • 7 Electronic Science and Technology (...
    • 6 Transportation Engineering
    • 5 Safety Science and Engineering
    • 3 Instrument Science and Technology
    • 2 Mechanics (...
    • 2 Materials Science and Engineering (...
    • 2 Power Engineering and Engineering Therm...
    • 2 Aerospace Science and Tech...
  • 3,103 Medicine
    • 3,102 Clinical Medicine
    • 4 Basic Medicine (...
  • 297 Science
    • 199 Systems Science
    • 69 Physics
    • 27 Biology
    • 24 Mathematics
    • 9 Statistics (...
    • 7 Chemistry
  • 23 Management
    • 14 Library, Information and Archives Manage...
    • 9 Management Science and Engineering (...
    • 4 Business Administration
  • 6 Law
    • 6 Sociology
  • 2 Agriculture
  • 1 Education
  • 1 Art

Topics

  • 5,461 computer vision
  • 2,564 training
  • 2,118 pattern recognit...
  • 1,632 computational mo...
  • 1,454 visualization
  • 1,325 three-dimensiona...
  • 1,070 semantics
  • 972 codes
  • 968 benchmark testin...
  • 930 computer archite...
  • 885 deep learning
  • 831 task analysis
  • 729 feature extracti...
  • 541 conferences
  • 530 neural networks
  • 526 face recognition
  • 503 transformers
  • 480 object detection
  • 478 image segmentati...
  • 469 cameras

Institutions

  • 169 univ sci & techn...
  • 146 tsinghua univ pe...
  • 142 univ chinese aca...
  • 142 carnegie mellon ...
  • 132 chinese univ hon...
  • 122 peng cheng lab p...
  • 102 zhejiang univ pe...
  • 96 sensetime res pe...
  • 95 swiss fed inst t...
  • 90 shanghai ai lab ...
  • 86 tsinghua univers...
  • 86 stanford univ st...
  • 84 shanghai jiao to...
  • 80 zhejiang univers...
  • 79 alibaba grp peop...
  • 79 univ hong kong p...
  • 76 peng cheng labor...
  • 76 tech univ munich...
  • 74 australian natl ...
  • 73 peking univ peop...

Authors

  • 67 timofte radu
  • 60 van gool luc
  • 50 zhang lei
  • 43 yang yi
  • 36 loy chen change
  • 36 tao dacheng
  • 31 liu yang
  • 30 zhou jie
  • 30 chen chen
  • 30 tian qi
  • 29 sun jian
  • 28 zha zheng-jun
  • 27 qi tian
  • 27 boxin shi
  • 26 li xin
  • 26 vasconcelos nuno
  • 26 pollefeys marc
  • 24 liu xiaoming
  • 24 zheng wei-shi
  • 24 luo ping

Languages

  • 11,273 English
  • 7 Other
  • 1 Chinese

Search query: "Any field = 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020"
11,281 records; showing 351-360
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Yue, Tongtian; Cheng, Jie; Guo, Longteng; Dai, Xingyuan; Zhao, Zijia; He, Xingjian; Xiong, Gang; Lv, Yisheng; Liu, Jing — CASIA Lab Cognit & Decis Intelligence Complex Syst Beijing Peoples R China; CASIA State Key Lab Multimodal Artificial Intelligence Beijing Peoples R China; Univ Chinese Acad Sci Sch Artificial Intelligence Beijing Peoples R China
Recent trends in Large Vision Language Models (LVLMs) research have increasingly focused on advancing beyond general image understanding towards more nuanced, object-level referential comprehension. In this pape...
DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Zhao, Chen; Zhang, Tong; Dang, Zheng; Salzmann, Mathieu — Ecole Polytech Fed Lausanne Lausanne Switzerland; ClearSpace SA Renens Switzerland
Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a larg...
Random Entangled Tokens for Adversarially Robust Vision Transformer
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Gong, Huihui; Dong, Mingjing; Mao, Siqi; Camtepe, Seyit; Nepal, Surya; Xu, Chang — Univ Sydney Sydney NSW Australia; CSIRO Data61 Eveleigh Australia; City Univ Hong Kong Hong Kong Peoples R China; Univ New South Wales Sydney NSW Australia
Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) in the realm of computer vision, showcasing tremendous potential. However, recent research has unveiled a su...
Segment and Caption Anything
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Huang, Xiaoke; Wang, Jianfeng; Tang, Yansong; Zhang, Zheng; Hu, Han; Lu, Jiwen; Wang, Lijuan; Liu, Zicheng — Tsinghua Univ Shenzhen Int Grad Sch Shenzhen Peoples R China; Microsoft Shanghai Peoples R China; Tsinghua Univ Dept Automat Beijing Peoples R China; Adv Micro Devices Inc Beijing Peoples R China
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while falling short on semantic understan...
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Zhang, Chaoyi; Lin, Kevin; Yang, Zhengyuan; Wang, Jianfeng; Li, Linjie; Lin, Chung-Ching; Liu, Zicheng; Wang, Lijuan — Univ Sydney Sydney NSW Australia; Microsoft Corp Redmond WA 98052 USA; Adv Micro Devices Inc Santa Clara CA USA
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with ...
VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Wasim, Syed Talal; Naseer, Muzammal; Khan, Salman; Yang, Ming-Hsuan; Khan, Fahad Shahbaz — Mohamed Bin Zayed Univ AI Abu Dhabi U Arab Emirates; Australian Natl Univ Canberra Australia; Univ Calif Merced Merced CA USA; Google Res Mountain View CA USA; Linkoping Univ Linkoping Sweden
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vo...
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Luo, Jiayun; Khandelwal, Siddhesh; Sigal, Leonid; Li, Boyang — Nanyang Technol Univ Singapore Singapore; Univ British Columbia Vector Inst AI Vancouver BC Canada
From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering. However, leveraging the learned ...
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cheng, Sijie; Guo, Zhicheng; Wu, Jingwen; Fang, Kechen; Li, Peng; Liu, Huaping; Liu, Yang — Tsinghua Univ Dept Comp Sci & Technol Beijing Peoples R China; Tsinghua Univ Inst AI Ind Res AIR Beijing Peoples R China; Univ Toronto Dept Elect & Comp Engn Toronto ON Canada; Tsinghua Univ Zhili Coll Beijing Peoples R China; 01 Ai Beijing Peoples R China
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspectiv...
Generative Bias for Robust Visual Question Answering
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cho, Jae Won; Kim, Dong-Jin; Ryu, Hyeonggon; Kweon, In So — Korea Adv Inst Sci & Technol Daejeon South Korea; Hanyang Univ Seoul South Korea
The task of Visual Question Answering (VQA) is known to be plagued by VQA models exploiting biases within the dataset to make their final predictions. Various previous ensemble-based debiasing methods have b...
MMA: Multi-Modal Adapter for Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Yang, Lingxiao; Zhang, Ru-Yuan; Wang, Yanchen; Xie, Xiaohua — Sun Yat Sen Univ Guangzhou Peoples R China; Shanghai Jiao Tong Univ Shanghai Peoples R China; Stanford Univ Stanford CA USA
Pre-trained Vision-Language Models (VLMs) have served as excellent foundation models for transfer learning in diverse downstream tasks. However, tuning VLMs for few-shot generalization tasks faces a discrimination-g...