
Refine Search Results

Document Type

  • 91 conference papers
  • 62 journal articles
  • 1 dissertation

Collection Scope

  • 154 electronic documents
  • 0 print holdings

Date Distribution

Subject Classification

  • 146 papers: Engineering
    • 120 papers: Computer Science and Technology...
    • 30 papers: Electrical Engineering
    • 15 papers: Software Engineering
    • 13 papers: Control Science and Engineering
    • 11 papers: Information and Communication Engineering
    • 8 papers: Electronic Science and Technology (...
    • 7 papers: Biomedical Engineering (...
    • 6 papers: Surveying and Mapping Science and Technology
    • 4 papers: Mechanical Engineering
    • 4 papers: Instrument Science and Technology
    • 4 papers: Materials Science and Engineering (...
    • 3 papers: Transportation Engineering
    • 1 paper: Aeronautical and Astronautical Science and Tech...
    • 1 paper: Environmental Science and Engineering (...
    • 1 paper: Bioengineering
    • 1 paper: Safety Science and Engineering
  • 28 papers: Medicine
    • 19 papers: Clinical Medicine
    • 8 papers: Special Medicine
    • 4 papers: Basic Medicine (...
  • 21 papers: Science
    • 8 papers: Physics
    • 7 papers: Geophysics
    • 6 papers: Chemistry
    • 5 papers: Biology
    • 2 papers: Geography
    • 1 paper: Mathematics
    • 1 paper: Astronomy
    • 1 paper: Geology
    • 1 paper: Statistics (...
  • 6 papers: Management
    • 5 papers: Management Science and Engineering (...
  • 1 paper: Philosophy
    • 1 paper: Philosophy
  • 1 paper: Agronomy

Topics

  • 154 papers: vision-language ...
  • 15 papers: large language m...
  • 12 papers: prompt learning
  • 10 papers: clip
  • 10 papers: few-shot learnin...
  • 6 papers: contrastive lear...
  • 6 papers: foundation model...
  • 6 papers: visualization
  • 5 papers: deep learning
  • 4 papers: multimodal learn...
  • 4 papers: object detection
  • 4 papers: long-tailed reco...
  • 4 papers: remote sensing
  • 4 papers: image classifica...
  • 4 papers: artificial intel...
  • 4 papers: computer vision
  • 4 papers: domain generaliz...
  • 4 papers: prompt tuning
  • 3 papers: representation l...
  • 3 papers: image captioning

Institutions

  • 4 papers: carnegie mellon ...
  • 4 papers: univ chinese aca...
  • 3 papers: inesc tec porto
  • 3 papers: sichuan univ col...
  • 3 papers: univ chinese aca...
  • 3 papers: chinese univ hon...
  • 3 papers: chinese acad sci...
  • 2 papers: shanghai ai lab ...
  • 2 papers: ecole polytech f...
  • 2 papers: tsinghua univ de...
  • 2 papers: harbin inst tech...
  • 2 papers: zhejiang univ pe...
  • 2 papers: univ porto fac e...
  • 2 papers: beijing univ pos...
  • 2 papers: city univ hong k...
  • 2 papers: sichuan univ col...
  • 2 papers: tech univ munich...
  • 2 papers: westlake univ sc...
  • 2 papers: univ elect sci &...
  • 2 papers: johns hopkins un...

Authors

  • 4 papers: banerjee biplab
  • 4 papers: zhang yi
  • 4 papers: jha ankit
  • 3 papers: wang donglin
  • 3 papers: singha mainak
  • 3 papers: zhang ce
  • 3 papers: tuia devis
  • 2 papers: men aidong
  • 2 papers: zhang min
  • 2 papers: liu xuyang
  • 2 papers: chen honggang
  • 2 papers: guo miaotian
  • 2 papers: yang yang
  • 2 papers: ricci elisa
  • 2 papers: ye mao
  • 2 papers: tian liang
  • 2 papers: patricio cristia...
  • 2 papers: wang haiying
  • 2 papers: teixeira luis f.
  • 2 papers: mukhopadhyay sou...

Language

  • 152 papers: English
  • 2 papers: Other

Search query: Subject = "Vision-language Models"
154 records; showing results 21-30

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Lin, L.; Guan, Haoyan; Qiu, Jianing; Spratling, Michael. Kings Coll London, London, England; Imperial Coll London, London, England
Large pre-trained vision-language models (VLMs) like CLIP, despite having remarkable generalization ability, are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the...
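
The vulnerability described in the abstract above is easy to illustrate: a small gradient-based perturbation of the input image is typically enough to flip CLIP's zero-shot prediction. The sketch below is a minimal FGSM-style example using the Hugging Face transformers CLIP implementation; it is not the prompt-based defense proposed in the paper, and the model checkpoint, image path, label set, and epsilon value are illustrative assumptions.

```python
# Minimal FGSM-style sketch: perturb an image so that CLIP's zero-shot
# prediction flips. Illustrative only; not the defense from the paper above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a cat", "a photo of a dog"]   # candidate classes (placeholder)
image = Image.open("cat.jpg")                        # placeholder image path
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)

pixel_values = inputs["pixel_values"].clone().requires_grad_(True)
logits = model(input_ids=inputs["input_ids"],
               attention_mask=inputs["attention_mask"],
               pixel_values=pixel_values).logits_per_image   # shape (1, num_labels)

loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0]))  # 0 = true class "cat"
loss.backward()

# One FGSM step in the normalized pixel space (a simplification; real attacks
# project back to the valid image range and usually iterate, e.g. PGD).
epsilon = 0.03
adv_pixels = pixel_values + epsilon * pixel_values.grad.sign()

with torch.no_grad():
    adv_logits = model(input_ids=inputs["input_ids"],
                       attention_mask=inputs["attention_mask"],
                       pixel_values=adv_pixels).logits_per_image

print("clean prediction:      ", labels[logits.argmax(-1).item()])
print("adversarial prediction:", labels[adv_logits.argmax(-1).item()])
```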

Language Models as Black-Box Optimizers for Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Liu, Shihong; Yu, Samuel; Lin, Zhiqiu; Pathak, Deepak; Ramanan, Deva. Carnegie Mellon Univ, Pittsburgh, PA 15213, USA
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However, many VLMs rely on proprietary data and are not ...

LifeGraph 4-Lifelog Retrieval using Multimodal Knowledge Graphs and Vision-Language Models
7th Annual ACM Workshop on the Lifelog Search Challenge (LSC)
Authors: Rossetto, Luca; Kyriakou, Athina; Lange, Svenja; Ruosch, Florian; Wang, Ruijie; Wardatzky, Kathrin; Bernstein, Abraham. Univ Zurich, Dept Informat, Zurich, Switzerland
In the scope of the 7th Lifelog Search Challenge (LSC'24), we present the 4th iteration of LifeGraph, a multimodal knowledge-graph approach with data augmentations using vision-language models (VLM). We extend the...

SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Howard, Phillip; Madasu, Avinash; Le, Tiep; Moreno, Gustavo Lujan; Bhiwandiwalla, Anahita; Lal, Vasudev. Intel Labs, Santa Clara, CA 95052, USA
While vision-language models (VLMs) have achieved remarkable performance improvements recently, there is growing evidence that these models also possess harmful biases with respect to social attributes such as gender a...

InsightSee: Advancing Multi-agent Vision-Language Models for Enhanced Visual Understanding
21st IEEE International Conference on Mechatronics and Automation (IEEE ICMA)
Authors: Zhang, Huaxiang; Mu, Yaojia; Zhu, Guo-Niu; Gan, Zhongxue. Fudan Univ, Acad Engn & Technol, Shanghai 200433, Peoples R China
Accurate visual understanding is imperative for advancing autonomous systems and intelligent robots. Despite the powerful capabilities of vision-language models (VLMs) in processing complex visual scenes, precisely re...

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Zhang, Yabin; Zhu, Wenjie; Tang, Hui; Ma, Zhiyuan; Zhou, Kaiyang; Zhang, Lei. HKPolyU, Hong Kong, Peoples R China; OPPO, Hong Kong, Peoples R China; HKUST, Hong Kong, Peoples R China; HKBU, Hong Kong, Peoples R China
With the emergence of pre-trained vision-language models like CLIP, how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be ...
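
As context for the adaptation strategies surveyed in the abstract above, the snippet below shows the plain zero-shot CLIP baseline that such methods build on: class names are wrapped in a prompt template, and the image is assigned to the class with the highest image-text similarity. This is only the training-free starting point, not the dual-memory approach of the paper; the checkpoint name, image path, and label set are illustrative assumptions.

```python
# Zero-shot CLIP classification: the training-free baseline that adaptation
# methods (prompt tuning, adapters, cache/memory modules, ...) start from.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class_names = ["cat", "dog", "car"]                    # placeholder label set
prompts = [f"a photo of a {c}" for c in class_names]   # hand-crafted prompt template
image = Image.open("example.jpg")                      # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image          # (1, num_classes) similarity scores
probs = logits.softmax(dim=-1)[0]
print({name: round(p, 3) for name, p in zip(class_names, probs.tolist())})
```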

Neural Collapse Anchored Prompt Tuning for Generalizable Vision-Language Models
30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Authors: Zhu, Didi; Li, Zexi; Zhang, Min; Yuan, Junkun; Liu, Jiashuo; Kuang, Kun; Wu, Chao. Zhejiang Univ, Hangzhou, Peoples R China; Tsinghua Univ, Beijing, Peoples R China
Large-scale vision-language (V-L) models have demonstrated remarkable generalization capabilities for downstream tasks through prompt tuning. However, the mechanisms behind the learned text representations are unknown...

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cheng, Sijie; Guo, Zhicheng; Wu, Jingwen; Fang, Kechen; Li, Peng; Liu, Huaping; Liu, Yang. Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China; Tsinghua Univ, Inst AI Ind Res (AIR), Beijing, Peoples R China; Univ Toronto, Dept Elect & Comp Engn, Toronto, ON, Canada; Tsinghua Univ, Zhili Coll, Beijing, Peoples R China; 01 Ai, Beijing, Peoples R China
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspectiv...

TAGGAR: General-Purpose Task Guidance from Natural Language in Augmented Reality using Vision-Language Models
12th Symposium on Spatial User Interaction (SUI)
Authors: Stover, Daniel; Bowman, Doug A. Virginia Tech, Dept Comp Sci, Ctr Human Comp Interact, Blacksburg, VA 24061, USA
Augmented reality (AR) task guidance systems provide assistance for procedural tasks by rendering virtual guidance visuals within the real-world environment. Current AR task guidance systems are limited in that they r...

Experiential Views: Towards Human Experience Evaluation of Designed Spaces using Vision-Language Models
CHI Conference on Human Factors in Computing Systems (CHI)
Authors: Aseniero, Bon Adriel; Lee, Michael; Wang, Yi; Zhou, Qian; Shahmansouri, Nastaran; Goldstein, Rhys. Autodesk Res, Toronto, ON, Canada
Experiential Views is a proof-of-concept in which we explore a method of helping architects and designers predict how building occupants might experience their designed spaces using AI technology based on Vision-Langu...