
Refine Search Results

Document Type

  • 11,885 Conference papers
  • 5 Journal articles

Holdings

  • 11,890 Electronic documents
  • 0 Print holdings

Date Distribution

Subject Classification

  • 8,059 Engineering
    • 7,617 Computer Science and Technology...
    • 796 Mechanical Engineering
    • 688 Electrical Engineering
    • 360 Software Engineering
    • 228 Control Science and Engineering
    • 40 Optical Engineering
    • 19 Bioengineering
    • 17 Information and Communication Engineering
    • 12 Biomedical Engineering (conferrable in...
    • 6 Electronic Science and Technology (con...
    • 6 Architecture
    • 6 Transportation Engineering
    • 5 Instrument Science and Technology
    • 5 Chemical Engineering and Technology
    • 5 Safety Science and Engineering
    • 4 Civil Engineering
  • 3,347 Medicine
    • 3,346 Clinical Medicine
    • 4 Basic Medicine (conferrable in Medicine...
    • 4 Public Health and Preventive Medicine...
  • 253 Science
    • 198 Systems Science
    • 32 Physics
    • 21 Biology
    • 18 Mathematics
    • 9 Statistics (conferrable in Science,...
    • 7 Chemistry
  • 17 Management
    • 12 Management Science and Engineering (con...
    • 7 Library, Information and Archives Manag...
    • 5 Business Administration
  • 3 Law
    • 3 Sociology
  • 3 Education
    • 3 Education
  • 2 Agriculture
  • 1 Economics
  • 1 Military Science

Topics

  • 5,633 computer vision
  • 2,668 training
  • 2,203 pattern recognit...
  • 1,747 computational mo...
  • 1,502 visualization
  • 1,360 three-dimensiona...
  • 1,074 semantics
  • 999 benchmark testin...
  • 986 codes
  • 959 computer archite...
  • 891 deep learning
  • 777 conferences
  • 754 task analysis
  • 700 feature extracti...
  • 561 transformers
  • 533 face recognition
  • 527 neural networks
  • 495 object detection
  • 490 image segmentati...
  • 468 cameras

Institutions

  • 174 univ sci & techn...
  • 145 carnegie mellon ...
  • 144 univ chinese aca...
  • 144 tsinghua univ pe...
  • 134 chinese univ hon...
  • 110 zhejiang univ pe...
  • 109 peng cheng lab p...
  • 99 swiss fed inst t...
  • 91 tsinghua univers...
  • 90 shanghai ai lab ...
  • 87 sensetime res pe...
  • 86 shanghai jiao to...
  • 83 zhejiang univers...
  • 82 tech univ munich...
  • 79 university of sc...
  • 79 stanford univ st...
  • 78 univ hong kong p...
  • 77 australian natl ...
  • 76 alibaba grp peop...
  • 75 peng cheng labor...

Authors

  • 75 timofte radu
  • 64 van gool luc
  • 50 zhang lei
  • 43 yang yi
  • 37 loy chen change
  • 36 tao dacheng
  • 32 zhou jie
  • 31 chen chen
  • 30 liu yang
  • 30 tian qi
  • 29 sun jian
  • 29 zha zheng-jun
  • 28 li xin
  • 27 qi tian
  • 26 vasconcelos nuno
  • 25 liu xiaoming
  • 25 darrell trevor
  • 24 zheng wei-shi
  • 24 luo ping
  • 24 ying shan

Language

  • 11,863 English
  • 26 Other
  • 1 Chinese

Search query: "Any field = 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024"
11,890 records; results 251-260 are shown below.

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Yu, Qifan; Li, Juncheng; Wei, Longhui; Pang, Liang; Ye, Wentao; Qin, Bosheng; Tang, Siliang; Tian, Qi; Zhuang, Yueting
Affiliations: Zhejiang Univ, Hangzhou, Peoples R China; Huawei Cloud, Suzhou, Peoples R China; Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucin...

YOLO-World: Real-Time Open-Vocabulary Object Detection
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Cheng, Tianheng; Song, Lin; Ge, Yixiao; Liu, Wenyu; Wang, Xinggang; Shan, Ying
Affiliations: Tencent AI Lab, Shenzhen, Guangdong, Peoples R China; Tencent PCG ARC Lab, Shenzhen, Guangdong, Peoples R China; Huazhong Univ Sci & Technol, Sch EIC, Wuhan, Hubei, Peoples R China
The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open sc...

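As a concrete illustration of the real-time open-vocabulary detection this entry describes, here is a minimal sketch built on the ultralytics package's YOLOWorld wrapper, which ships pretrained YOLO-World checkpoints. The checkpoint file name, image path, confidence threshold, and class list are illustrative assumptions, not values taken from the paper.

```python
# Minimal open-vocabulary detection sketch using the ultralytics YOLOWorld
# wrapper (assumed setup; file names and classes are placeholders).
from ultralytics import YOLOWorld

# Load a pretrained YOLO-World checkpoint (downloaded on demand by ultralytics).
model = YOLOWorld("yolov8s-world.pt")

# Swap in an arbitrary vocabulary at inference time; no retraining is needed.
model.set_classes(["person", "bicycle", "traffic light"])

# Detect on an arbitrary image and print the resulting boxes.
results = model.predict("street.jpg", conf=0.25)
for box in results[0].boxes:
    label = results[0].names[int(box.cls)]
    print(label, float(box.conf), box.xyxy.tolist())
```

In this wrapper, set_classes simply re-encodes the text prompts, which is what lets the detection vocabulary change at inference time without retraining.
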
On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Zanella, Maxime; Ben Ayed, Ismail
Affiliations: UCLouvain, Louvain, Belgium; UMons, Mons, Belgium; ETS Montreal, Montreal, PQ, Canada
The development of large vision-language models, notably CLIP, has catalyzed research into effective adaptation techniques, with a particular focus on soft prompt tuning. Conjointly, test-time augmentation, which util...

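The entry above weighs prompt learning against lighter-weight test-time strategies for zero-shot CLIP. Purely to illustrate the test-time-augmentation side of that comparison (this is not the paper's method), the sketch below averages CLIP image features over a few random crops before zero-shot classification; the backbone name, label prompts, image path, and number of views are assumptions.

```python
# Hedged sketch: zero-shot CLIP classification with simple test-time
# augmentation (feature averaging over random crops).
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assumed label prompts for illustration.
class_names = ["a photo of a dog", "a photo of a cat", "a photo of a bird"]
text_tokens = clip.tokenize(class_names).to(device)

# Random resized crops act as the test-time augmentation.
augment = transforms.RandomResizedCrop(224, scale=(0.7, 1.0))
image = Image.open("example.jpg").convert("RGB")
views = torch.stack([preprocess(augment(image)) for _ in range(8)]).to(device)

with torch.no_grad():
    image_features = model.encode_image(views)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    mean_feature = image_features.mean(dim=0, keepdim=True)  # average the views
    mean_feature = mean_feature / mean_feature.norm(dim=-1, keepdim=True)

    text_features = model.encode_text(text_tokens)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * mean_feature @ text_features.T).softmax(dim=-1).squeeze(0)

for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```
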
Test-Time Adaptation for Depth Completion
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Park, Hyoungseob; Gupta, Anjali; Wong, Alex
Affiliation: Yale Vision Lab, New Haven, CT 06501, USA
It is common to observe performance degradation when transferring models trained on some (source) datasets to target testing data due to a domain gap between them. Existing methods for bridging this gap, such as domai...

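Test-time adaptation, in the general sense used by this entry, means taking a few self-supervised gradient steps on the incoming target data before predicting. The toy PyTorch loop below sketches that generic pattern for depth completion with a made-up two-layer network and a sparse-depth consistency loss; it is a schematic illustration under those assumptions, not the method proposed in the paper.

```python
# Generic test-time adaptation sketch (not the paper's method): adapt a depth
# completion model on a test sample by minimizing a self-supervised loss that
# compares its prediction against the sparse depth it is conditioned on.
import torch
import torch.nn as nn

class TinyDepthCompletionNet(nn.Module):
    """Toy stand-in for a depth completion net: image + sparse depth -> dense depth."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Softplus(),  # keep depth positive
        )

    def forward(self, image, sparse_depth):
        return self.net(torch.cat([image, sparse_depth], dim=1))

def adapt_and_predict(model, image, sparse_depth, steps=5, lr=1e-4):
    """Take a few gradient steps on the test sample itself, then predict."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    valid = (sparse_depth > 0).float()  # pixels that have a sparse measurement
    for _ in range(steps):
        optimizer.zero_grad()
        pred = model(image, sparse_depth)
        # Self-supervised signal: agree with the sparse measurements where available.
        loss = (valid * (pred - sparse_depth).abs()).sum() / valid.sum().clamp(min=1.0)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model(image, sparse_depth)

# Usage with random stand-in data.
model = TinyDepthCompletionNet()
image = torch.rand(1, 3, 64, 64)
sparse_depth = torch.rand(1, 1, 64, 64) * (torch.rand(1, 1, 64, 64) > 0.95)
print(adapt_and_predict(model, image, sparse_depth).shape)  # torch.Size([1, 1, 64, 64])
```
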
AnimalFormer: Multimodal Vision Framework for Behavior-based Precision Livestock Farming
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Qazi, Ahmed; Razzaq, Taha; Iqbal, Asim
Affiliation: Tibbling Technol, Redmond, WA 98052, USA
We introduce a multimodal vision framework for precision livestock farming, harnessing the power of GroundingDINO, HQSAM, and ViTPose models. This integrated suite enables comprehensive behavioral analytics from video...

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Li, Rongjie; Zhang, Songyang; Lin, Dahua; Chen, Kai; He, Xuming
Affiliations: ShanghaiTech Univ, Sch Informat Sci & Technol, Shanghai, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Shanghai Engn Res Ctr Intelligent Vis & Imaging, Shanghai, Peoples R China
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements, existing methods struggle to generate scene graphs with...

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Wan, Jianqiang; Song, Sibo; Yu, Wenwen; Liu, Yuliang; Cheng, Wenqing; Huang, Fei; Bai, Xiang; Yao, Cong; Yang, Zhibo
Affiliations: Alibaba Grp, Hangzhou, Peoples R China; Huazhong Univ Sci & Technol, Hangzhou, Peoples R China
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) c...

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Luo, Jiayun; Khandelwal, Siddhesh; Sigal, Leonid; Li, Boyang
Affiliations: Nanyang Technol Univ, Singapore, Singapore; Univ British Columbia, Vector Inst AI, Vancouver, BC, Canada
From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which prove effective for tasks like visual question answering. However, leveraging the learned ...

Distilling Vision-Language Models on Millions of Videos
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Zhao, Yue; Zhao, Long; Zhou, Xingyi; Wu, Jialin; Chu, Chun-Te; Miao, Hui; Schroff, Florian; Adam, Hartwig; Liu, Ting; Gong, Boqing; Krahenbuhl, Philipp; Yuan, Liangzhe
Affiliations: Google Res, Mountain View, CA 94043, USA; Univ Texas Austin, Austin, TX 78712, USA
The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-t...

Active Prompt Learning in Vision Language Models
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Authors: Bang, Jihwan; Ahn, Sumyeong; Lee, Jae-Gil
Affiliations: Korea Adv Inst Sci & Technol, Daejeon, South Korea; Michigan State Univ, E Lansing, MI, USA
Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks, such as classification and retrieval. Despite their performance, because improving performance on new tasks requ...