
Refine Results

Document Type

  • 95 conference papers
  • 74 journal articles
  • 1 dissertation

Collection Scope

  • 170 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 162 Engineering
    • 130 Computer Science and Technology...
    • 37 Electrical Engineering
    • 17 Software Engineering
    • 13 Information and Communication Engineering
    • 13 Control Science and Engineering
    • 11 Biomedical Engineering (...
    • 9 Electronic Science and Technology (...
    • 9 Surveying and Mapping Science and Technology
    • 4 Mechanical Engineering
    • 4 Instrument Science and Technology
    • 4 Materials Science and Engineering (...
    • 4 Bioengineering
    • 2 Transportation Engineering
    • 1 Aeronautical and Astronautical Science and Tech...
    • 1 Environmental Science and Engineering (...
  • 35 Medicine
    • 25 Clinical Medicine
    • 12 Special Medicine
    • 4 Basic Medicine (...
    • 4 Medical Technology (...
    • 1 Integrated Traditional Chinese and Western Medicine
  • 26 Science
    • 11 Physics
    • 10 Geophysics
    • 9 Chemistry
    • 5 Biology
    • 3 Geography
    • 1 Astronomy
    • 1 Geology
  • 7 Management
    • 6 Management Science and Engineering (...
    • 1 Library, Information and Archives Man...
  • 1 Philosophy
    • 1 Philosophy
  • 1 Agriculture

Topics

  • 170 vision-language ...
  • 17 large language m...
  • 15 prompt learning
  • 12 few-shot learnin...
  • 11 clip
  • 10 visualization
  • 7 contrastive lear...
  • 6 foundation model...
  • 6 remote sensing
  • 6 training
  • 6 adaptation model...
  • 5 object detection
  • 5 deep learning
  • 5 feature extracti...
  • 5 image classifica...
  • 4 long-tailed reco...
  • 4 computational mo...
  • 4 artificial intel...
  • 4 computer vision
  • 4 domain generaliz...

Institutions

  • 4 chinese acad sci...
  • 4 carnegie mellon ...
  • 4 univ chinese aca...
  • 3 inesc tec porto
  • 3 sichuan univ col...
  • 3 univ chinese aca...
  • 3 zhejiang univ pe...
  • 3 chinese univ hon...
  • 2 shanghai ai lab ...
  • 2 ecole polytech f...
  • 2 tsinghua univ de...
  • 2 harbin inst tech...
  • 2 univ porto fac e...
  • 2 cent south univ ...
  • 2 beijing univ pos...
  • 2 city univ hong k...
  • 2 china univ geosc...
  • 2 sichuan univ col...
  • 2 tech univ munich...
  • 2 westlake univ sc...

Authors

  • 4 banerjee biplab
  • 4 zhang yi
  • 4 jha ankit
  • 3 wang donglin
  • 3 singha mainak
  • 3 ding kun
  • 3 zhang ce
  • 3 tuia devis
  • 2 men aidong
  • 2 li haifeng
  • 2 mahapatra dwarik...
  • 2 zhang min
  • 2 liu xuyang
  • 2 chen honggang
  • 2 ma chao
  • 2 guo miaotian
  • 2 yang yang
  • 2 ricci elisa
  • 2 ye mao
  • 2 tian liang

Language

  • 164 English
  • 5 Other
Search query: Subject = "Vision-language Models"
170 records; showing 111-120
VTR: Bidirectional Video-Textual Transmission Rail for CLIP-based Video Recognition
IEEE International Conference on Multimedia and Expo (ICME)
Authors: Yu, Shaoqi; Chen, Lili; Zhang, Xiaolin; Li, Jiamao (Chinese Acad Sci Shanghai Inst Microsyst & Informat Technol Shanghai Peoples R China; Univ Chinese Acad Sci Beijing Peoples R China; ShanghaiTech Univ Shanghai Peoples R China)
There are two key issues when transferring vision-language models like CLIP to video recognition: bidirectional video-textual transmission and temporal modeling. To address these issues, we propose a novel framework name...
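The record above concerns transferring CLIP to video recognition. For orientation only, the sketch below shows the common zero-shot baseline such work builds on, not the VTR framework from the paper: sampled frames are encoded with the CLIP image encoder, mean-pooled over time as a naive form of temporal modeling, and matched against text embeddings of class prompts. The checkpoint name, prompt template, and class list are illustrative assumptions.

```python
# Minimal sketch, assuming a Hugging Face CLIP checkpoint; not the paper's VTR method.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def classify_video(frames, class_names):
    """frames: list of PIL images sampled from the video; class_names: list of strings."""
    prompts = [f"a video of {c}" for c in class_names]   # assumed prompt template
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        frame_feats = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_feats = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])
    # Normalize per-frame features, then mean-pool over time into one video embedding.
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    video_feat = frame_feats.mean(dim=0, keepdim=True)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = video_feat @ text_feats.T                    # cosine similarity per class
    return class_names[scores.argmax().item()]
```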
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cai, Mu; Liu, Haotian; Mustikovela, Siva Karthik; Meyer, Gregory P.; Chai, Yuning; Park, Dennis; Lee, Yong Jae (Univ Wisconsin Madison WI 53706 USA; Cruise LLC San Francisco CA USA)
While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatia...
PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Dorkenwald, Michael; Barazani, Nimrod; Snoek, Cees G. M.; Asano, Yuki M. (Univ Amsterdam Amsterdam Netherlands)
Vision-language models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer...
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Sun, Shuyang; Li, Runjia; Torr, Philip; Gu, Xiuye; Li, Siyang (Univ Oxford Oxford England; Google Res Mountain View CA 94043 USA)
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets...
TRAINING VISUAL LANGUAGE MODELS WITH OBJECT DETECTION: GROUNDED CHANGE DESCRIPTIONS IN SATELLITE IMAGES
IEEE International Geoscience and Remote Sensing Symposium (IGARSS)
Authors: Prado, Joao Luis; Montariol, Syrielle; Castillo-Navarro, Javiera; Tuia, Devis; Bosselut, Antoine (Ecole Polytech Fed Lausanne EPFL Lausanne Switzerland)
Recently, generalist vision-language models (VLMs) have shown exceptional progress in tasks previously dominated by specialized computer vision models. This becomes more prevalent when visual grounding capabilities, s...
FASN: Feature Aggregate Side-Network for Open-Vocabulary Semantic Segmentation
International Joint Conference on Neural Networks (IJCNN)
Authors: Jia, Daixi; Chen, Lipeng; Su, Xingzhe; Wu, Fengge; Zhao, Junsuo (Univ Chinese Acad Sci Chinese Acad Sci Inst Software Beijing 100190 Peoples R China)
In this paper, we introduce a Feature Aggregate Side Network (FASN), a simple, efficient, and easy-to-train method for open-vocabulary semantic segmentation. Building upon existing models based on the CLIP-Side Netwo...
DARA: DOMAIN- AND RELATION-AWARE ADAPTERS MAKE PARAMETER-EFFICIENT TUNING FOR VISUAL GROUNDING
IEEE International Conference on Multimedia and Expo (ICME)
Authors: Liu, Ting; Liu, Xuyang; Huang, Siteng; Chen, Honggang; Yin, Quanjun; Qin, Long; Wang, Donglin; Hu, Yue (Natl Univ Def Technol Coll Syst Engn Changsha Peoples R China; Sichuan Univ Coll Elect & Informat Engn Chengdu Peoples R China; Westlake Univ Sch Engn Hangzhou Peoples R China)
Visual grounding (VG) is the challenging task of localizing an object in an image based on a textual description. The recent surge in the scale of VG models has substantially improved performance, but has also introduced a signif...
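The DARA record above concerns adapter-based parameter-efficient tuning. The following is a minimal, generic sketch of the bottleneck-adapter idea such methods build on, not the paper's domain- and relation-aware adapters: a small down-project/up-project block with a residual connection is attached to a frozen backbone module, and only the adapter weights are trained. Module names and dimensions are illustrative assumptions.

```python
# Minimal sketch of a generic bottleneck adapter on a frozen block; not the DARA design.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)              # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

class AdaptedBlock(nn.Module):
    """Wrap a frozen backbone block so that only the adapter is fine-tuned."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False                 # backbone stays frozen
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return self.adapter(self.block(x))

# Example: adapt a toy frozen layer; only the small adapter weights are trainable.
layer = AdaptedBlock(nn.Linear(512, 512), dim=512)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```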
Spuriousness-Aware Meta-Learning for Learning Robust Classifiers  24
30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Authors: Zheng, Guangtao; Ye, Wenqian; Zhang, Aidong (Univ Virginia Charlottesville VA 22903 USA)
Spurious correlations are brittle associations between certain attributes of inputs and target variables, such as the correlation between an image background and an object class. Deep image classifiers often leverage...
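The record above defines spurious correlations via the background/object example. As background only, and not the paper's spuriousness-aware meta-learning, the sketch below shows how such reliance is commonly exposed: group test examples by (class label, spurious attribute) and report per-group and worst-group accuracy, which collapses when a classifier keys on the attribute rather than the label. All data in the example are synthetic placeholders.

```python
# Minimal sketch of per-group / worst-group accuracy on synthetic data.
import torch

def group_accuracies(preds, labels, spurious_attr):
    """Return accuracy per (label, attribute) group and the worst-group accuracy."""
    accs = {}
    for y in labels.unique():
        for a in spurious_attr.unique():
            mask = (labels == y) & (spurious_attr == a)
            if mask.any():
                accs[(int(y), int(a))] = (preds[mask] == labels[mask]).float().mean().item()
    return accs, min(accs.values())

# Toy data: predictions track the background attribute, not the class label, so the
# average accuracy looks acceptable while the rare groups fail completely.
preds   = torch.tensor([1, 1, 1, 0, 0, 1, 0, 0])
labels  = torch.tensor([1, 1, 0, 0, 1, 1, 0, 0])
bg_attr = torch.tensor([1, 1, 1, 0, 0, 1, 0, 0])
per_group, worst = group_accuracies(preds, labels, bg_attr)
print(per_group, "worst-group accuracy:", worst)
```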
Classes Are Not Equal: An Empirical Study on Image Recognition Fairness
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cui, Jiequan; Zhu, Beier; Wen, Xin; Qi, Xiaojuan; Yu, Bei; Zhang, Hanwang (Nanyang Technol Univ Singapore Singapore; Univ Hong Kong Hong Kong Peoples R China; Chinese Univ Hong Kong Hong Kong Peoples R China)
In this paper, we present an empirical study on image recognition unfairness, i.e., extreme class accuracy disparity on balanced data like ImageNet. We demonstrate that classes are not equal and that unfairness is prevalent for image classifi...
VGDIFFZERO: TEXT-TO-IMAGE DIFFUSION MODELS CAN BE ZERO-SHOT VISUAL GROUNDERS  49
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Authors: Liu, Xuyang; Huang, Siteng; Kang, Yachen; Chen, Honggang; Wang, Donglin (Sichuan Univ Coll Elect & Informat Engn Chengdu Peoples R China; Westlake Univ Sch Engn Hangzhou Peoples R China)
Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks r...