ISBN (print): 9798350353006
While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
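To make the overlay idea concrete, here is a minimal sketch of drawing a visual prompt directly onto the image pixels before handing the image to a multimodal model. The file name, marker style, and follow-up question are illustrative assumptions, not the authors' released pipeline.

```python
# Minimal sketch of overlaying a visual marker directly on the RGB image.
from PIL import Image, ImageDraw

def overlay_box_prompt(image_path, box, color="red", width=4):
    """Draw a bounding-box visual prompt directly onto the image pixels."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    draw.rectangle(box, outline=color, width=width)  # box = (x0, y0, x1, y1)
    return img

# Hypothetical usage: the marked image is then fed to the multimodal model
# together with text that refers to the marker, e.g.
# "What is the object inside the red bounding box?"
marked = overlay_box_prompt("street.jpg", box=(120, 80, 260, 220))
marked.save("street_marked.jpg")
```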
ISBN (print): 9798350353013; 9798350353006
Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery [17] to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.
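As a rough illustration of the self-self attention idea (not the released GEM code), the sketch below lets tokens attend to their own projections rather than to a separate key set, which pushes tokens from the same object toward the same cluster. The cosine normalization, temperature, and shapes are assumptions, and the weights are applied to the input tokens here for simplicity.

```python
# Rough sketch of self-self (q-q / k-k / v-v) attention acting as clustering.
import torch
import torch.nn.functional as F

def self_self_attention(x, proj, tau=0.07):
    """x: (N, D) token features; proj: a projection such as frozen q, k, or v weights."""
    z = F.normalize(proj(x), dim=-1)                # project and L2-normalize tokens
    attn = torch.softmax(z @ z.t() / tau, dim=-1)   # tokens attend to themselves
    return attn @ x                                 # tokens of one object become similar

tokens = torch.randn(196, 768)                      # e.g. ViT patch tokens (assumed size)
proj = torch.nn.Linear(768, 768, bias=False)
grouped = self_self_attention(tokens, proj)
```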
ISBN (print): 9798350307443
This paper presents EventCAP, i.e., event-based captions, which refines and enriches qualitative and quantitative captions by combining Deep Learning (DL) models and Vision Language Models (VLMs) on different tasks in a complementary manner. Indoor and outdoor images are used for object recognition and captioning. However, outdoor images of events vary widely due to natural phenomena such as weather changes. Such dynamic changes in illumination and object shape may degrade segmentation and increase the number of unseen objects and scenes under adverse conditions. On the other hand, single state-of-the-art (SOTA) DL models and VLMs handle only single or limited tasks. Therefore, this paper proposes EventCAP, which produces captions that include physical scales and objects' surface properties. Moreover, an iterative VQA model is proposed to refine incompletely segmented images with prompts. Experiments show that the resulting captions describe real-world scenes at a higher semantic level than those of SOTA VLMs.
ISBN (print): 9798350353006
Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) in the realm of computer vision, showcasing tremendous potential. However, recent research has unveiled a susceptibility of ViTs to adversarial attacks, akin to their CNN counterparts. Adversarial training and randomization are two representative effective defenses for CNNs. Some researchers have attempted to apply adversarial training to ViTs and achieved robustness comparable to CNNs, while it is not easy to directly apply randomization to ViTs because of the architectural differences between CNNs and ViTs. In this paper, we delve into the structural intricacies of ViTs and propose a novel defense mechanism termed Random entangled image Transformer (ReiT), which seamlessly integrates adversarial training and randomization to bolster the adversarial robustness of ViTs. Recognizing the challenge posed by the structural disparities between ViTs and CNNs, we introduce a novel module, input-independent random entangled self-attention (II-ReSA). This module optimizes random entangled tokens that lead to "dissimilar" self-attention outputs by leveraging model parameters and the sampled random tokens, thereby synthesizing the self-attention module outputs and random entangled tokens to diminish adversarial similarity. ReiT incorporates two distinct random entangled tokens and employs dual randomization, offering an effective countermeasure against adversarial examples while ensuring comprehensive deduction guarantees. Through extensive experiments conducted on various ViT variants and benchmarks, we substantiate the superiority of our proposed method in enhancing the adversarial robustness of Vision Transformers.
ISBN (print): 9798350353006
In this paper, we present a tensor decomposition and low-rank recovery approach (LowRankOcc) for vision-based 3D semantic occupancy prediction. Conventional methods model outdoor scenes with fine-grained 3D grids, but the sparsity of non-empty voxels introduces considerable spatial redundancy, leading to potential overfitting risks. In contrast, our approach leverages the intrinsic low-rank property of 3D occupancy data, factorizing voxel representations into low-rank components to efficiently mitigate spatial redundancy without sacrificing performance. Specifically, we present a Vertical-Horizontal (VH) decomposition block that factorizes 3D tensors into vertical vectors and horizontal matrices. With our "decomposition-encoding-recovery" framework, we encode 3D contexts with only 1D/2D convolutions and poolings, and subsequently recover the encoded compact yet informative context features back to voxel representations. Experimental results demonstrate that LowRankOcc achieves state-of-the-art performance in semantic scene completion on the SemanticKITTI dataset and 3D occupancy prediction on the nuScenes dataset.
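To picture the low-rank recovery step, the sketch below reconstructs a dense voxel feature volume from per-rank horizontal matrices and vertical vectors. The rank, tensor shapes, and summation over rank-one components are illustrative assumptions, not the paper's exact layout.

```python
# Illustrative vertical-horizontal (VH) low-rank recovery of a voxel volume.
import torch

def vh_recover(horizontal, vertical):
    """horizontal: (R, X, Y, C) matrices, vertical: (R, Z) vectors -> dense (X, Y, Z, C) volume."""
    # Sum over R components of the outer product along the vertical (Z) axis.
    return torch.einsum("rxyc,rz->xyzc", horizontal, vertical)

R, X, Y, Z, C = 4, 128, 128, 16, 32     # assumed sizes for illustration
dense = vh_recover(torch.randn(R, X, Y, C), torch.randn(R, Z))
print(dense.shape)                      # torch.Size([128, 128, 16, 32])
```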
ISBN (print): 9798350353006
Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI(1), the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs), which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene, we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene, guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches, GenZI circumvents the conventional need for captured 3D interaction data, and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality, making it applicable to diverse scene types, including both indoor and outdoor environments.
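The consistency-guided optimization can be pictured with the toy sketch below: 3D points are fitted so that their projections match 2D hypotheses across several views. The linear projection, MSE loss, and joint count are stand-ins for the real renderer, parametric human model, and robust losses used by the method.

```python
# Toy sketch of iterative optimization against multi-view 2D hypotheses.
import torch
import torch.nn.functional as F

def project(points3d, view):
    """Toy linear projection (assumption): (J, 3) points, (3, 3) view -> (J, 2)."""
    return (view @ points3d.t()).t()[:, :2]

def optimize_pose(points3d, hypotheses, steps=200, lr=1e-2):
    """hypotheses: list of (view_matrix (3, 3), target_2d (J, 2)) pairs."""
    params = points3d.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        # Penalize disagreement between projections and the 2D hypotheses.
        loss = sum(F.mse_loss(project(params, v), t) for v, t in hypotheses)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()

views = [(torch.randn(3, 3), torch.randn(24, 2)) for _ in range(3)]  # dummy data
fitted = optimize_pose(torch.randn(24, 3), views)
```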
ISBN (print): 9798350353006
Medical vision-language pre-training (VLP) has emerged as a frontier of research, enabling zero-shot pathological recognition by comparing the query image with the textual descriptions of each disease. Due to the complex semantics of biomedical texts, current methods struggle to align medical images with key pathological findings in unstructured reports, leading to misalignment with the target disease's textual representation. In this paper, we introduce a novel VLP framework designed to dissect disease descriptions into their fundamental aspects, leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Integrating a Transformer module, our approach aligns an input image with the diverse elements of a disease, generating aspect-centric image representations. By consolidating the matches from each aspect, we improve the compatibility between an image and its associated disease. Additionally, capitalizing on the aspect-oriented representations, we present a dual-head Transformer tailored to process known and unknown diseases, optimizing the comprehensive detection efficacy. In experiments on seven downstream datasets, our method improves the accuracy of recent methods by up to 8.56% and 17.26% for seen and unseen categories, respectively. Our code is released at https://***/HieuPhan33/MAVL.
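A hedged sketch of the aspect-centric matching: each disease is represented by several textual aspect embeddings, each aspect is matched to the image, and the matches are consolidated into a single image-disease score. The encoders are omitted, and the cosine-similarity scoring and mean-pooling consolidation are assumptions for illustration.

```python
# Sketch of consolidating per-aspect matches into an image-disease score.
import torch
import torch.nn.functional as F

def disease_score(image_emb, aspect_embs):
    """image_emb: (D,) image feature; aspect_embs: (A, D) aspect features -> scalar score."""
    image_emb = F.normalize(image_emb, dim=-1)
    aspect_embs = F.normalize(aspect_embs, dim=-1)
    sims = aspect_embs @ image_emb      # one similarity per disease aspect
    return sims.mean()                  # consolidate aspect matches into one score

score = disease_score(torch.randn(512), torch.randn(5, 512))  # dummy features
```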
ISBN (print): 9798350353006
Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However, in the fashion domain, datasets often exhibit a disparity between the information conveyed in the image and the text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text, leading to cases where some textual details are not visible in individual images. This mismatch, particularly when non-co-occurring elements are masked, undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling, thereby hindering the model's ability to accurately align fine-grained visual and textual features. To address this problem, we propose Synchronized Attentional Masking (SyncMask), which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model, ensuring a precise alignment between the two modalities. Additionally, we enhance grouped batch sampling with semi-hard negatives, effectively mitigating false-negative issues in the Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach, outperforming existing methods in three downstream tasks.
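The synchronization step can be sketched as follows: cross-attention scores from a momentum model indicate which word tokens are actually grounded in the image, and only those co-occurring tokens are selected as masking candidates. The top-k selection rule and tensor shapes are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of picking masking candidates from text-to-image cross-attention.
import torch

def synchronized_mask_candidates(cross_attn, k):
    """cross_attn: (num_words, num_patches) attention from word tokens to image patches.
    Returns indices of the k word tokens best grounded in the image."""
    grounding = cross_attn.max(dim=1).values   # peak attention per word token
    return torch.topk(grounding, k).indices    # mask only well-grounded words

word_ids_to_mask = synchronized_mask_candidates(torch.rand(20, 196), k=3)  # dummy scores
```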
ISBN (print): 9798350353006
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image, including all the details, even those irrelevant to specific tasks. However, for a finer understanding and controlled editing of images, it becomes crucial to focus on specific regions of interest, which can be indicated as points, masks, or boxes by humans or perception models. To fulfill these requirements, we introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions, fine-tuned on millions of constructed RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D/3D generation. It has strong potential to serve as a versatile tool for image-related tasks. Our project, with code and models, is available at https://***/alpha-clip/.
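A minimal sketch of the auxiliary alpha channel, assuming a ViT-style patch embedding: an extra convolution embeds the per-pixel alpha map marking the region of interest, and its output is added to the RGB patch embedding. Zero-initializing the alpha branch (so the model starts from plain CLIP behaviour before fine-tuning) and the specific dimensions are assumptions, not the released architecture.

```python
# Sketch of adding an alpha-channel branch to a ViT-style patch embedding.
import torch
import torch.nn as nn

class AlphaPatchEmbed(nn.Module):
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.alpha_proj = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        nn.init.zeros_(self.alpha_proj.weight)   # start equivalent to RGB-only CLIP (assumption)
        nn.init.zeros_(self.alpha_proj.bias)

    def forward(self, rgb, alpha):
        # (B, 3, H, W) image + (B, 1, H, W) alpha map -> (B, N, dim) patch tokens
        x = self.rgb_proj(rgb) + self.alpha_proj(alpha)
        return x.flatten(2).transpose(1, 2)

tokens = AlphaPatchEmbed()(torch.randn(1, 3, 224, 224), torch.rand(1, 1, 224, 224))
```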
ISBN (print): 9798350320565
It is broadly accepted that there is a "gender gap" in face recognition accuracy, with females having lower accuracy. However, relatively little is known about the cause(s) of this gender gap. We first demonstrate that female and male hairstyles have important differences that impact face recognition accuracy. In particular, variation in male facial hair contributes to a greater average difference in appearance between different male faces. We then demonstrate that when the data used to evaluate recognition accuracy is gender-balanced for how hairstyles occlude the face, the initially observed gender gap in accuracy largely disappears. We show this result for two different matchers, and for a Caucasian image dataset and an African-American dataset. Our results suggest that research on demographic variation in accuracy should include a check for balanced quality of the test data as part of the problem formulation. This new understanding of the causes of the gender gap in recognition accuracy will hopefully promote rational consideration of what might be done about it. To promote reproducible research, the matchers, attribute classifiers, and datasets used in this work are available to other researchers.