ISBN (Print): 9798350353006
Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, and attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance the data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method yields a 44.6% relative reduction in hallucinations while maintaining performance competitive with LLaVA. The data and code for this paper are publicly available.
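To make the cross-checking idea concrete, here is a minimal sketch of such a filter. All helper names (extract_claims, claim_to_question, vqa_experts) are hypothetical stand-ins; the actual HalluciDoctor pipeline derives questions per object, relation, and attribute claim and cross-checks several expert VQA models against the machine-generated answers.

```python
# Minimal sketch of a cross-checking hallucination filter in the spirit of
# HalluciDoctor. All helper callables are hypothetical stand-ins.

def cross_check(caption: str, image, extract_claims, claim_to_question,
                vqa_experts, agree_threshold: float = 0.5):
    """Split a caption's claims into kept and hallucinated sets."""
    kept, removed = [], []
    for claim in extract_claims(caption):        # e.g. ("dog", "on", "sofa")
        question = claim_to_question(claim)      # e.g. "Is the dog on the sofa?"
        votes = [expert(image, question) for expert in vqa_experts]
        support = sum(1 for v in votes if v == "yes") / max(len(votes), 1)
        (kept if support >= agree_threshold else removed).append(claim)
    return kept, removed
```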
ISBN (Print): 9798350353013; 9798350353006
In the context of computer vision and human-robot interaction, forecasting 3D human poses is crucial for understanding human behavior and enhancing the predictive capabilities of intelligent systems. While existing methods have made significant progress, they often focus on predicting major body joints, overlooking fine-grained gestures and their interaction with objects. Human hand movements, particularly during object interactions, play a pivotal role and provide more precise expressions of human poses. This work fills this gap and introduces a novel paradigm: forecasting 3D whole-body human poses with a focus on grasping objects. This task involves predicting activities across all joints in the body and hands, encompassing the complexities of internal heterogeneity and external interactivity. To tackle these challenges, we propose a novel approach, C3HOST: cross-context cross-modal consolidation for 3D whole-body pose forecasting, which effectively handles the complexities of internal heterogeneity and external interactivity. C3HOST involves distinct steps, including heterogeneous content encoding and alignment, and cross-modal feature learning and interaction. These enable us to predict activities across all body and hand joints, ensuring high-precision whole-body human pose prediction, even during object grasping. Extensive experiments on two benchmarks demonstrate that our model significantly enhances the accuracy of whole-body human motion prediction. The project page is available at https://***/view/c3host.
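The abstract names two stages, heterogeneous content encoding/alignment and cross-modal feature learning/interaction; the sketch below is one plausible reading of the latter, not the paper's architecture: separate encoders for the body and hand joint streams followed by bidirectional cross-attention. All layer sizes and joint counts are illustrative.

```python
import torch
from torch import nn

class CrossContextBlock(nn.Module):
    """Illustrative body/hand cross-attention; sizes are made up."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.body_enc = nn.Linear(3, dim)   # per-joint (x, y, z) -> feature
        self.hand_enc = nn.Linear(3, dim)
        self.body_from_hand = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hand_from_body = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, body, hand):            # (B, J_body, 3), (B, J_hand, 3)
        b, h = self.body_enc(body), self.hand_enc(hand)
        b2, _ = self.body_from_hand(b, h, h)   # body queries attend to hands
        h2, _ = self.hand_from_body(h, b, b)   # hand queries attend to body
        return b + b2, h + h2                  # residual fusion per stream

block = CrossContextBlock()
body = torch.randn(2, 22, 3)    # e.g. 22 body joints
hand = torch.randn(2, 30, 3)    # e.g. two 15-joint hands
print([t.shape for t in block(body, hand)])
```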
ISBN (Print): 9798350365474
In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras lack the ability to provide scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to enhance depth estimation remains a challenge. In our study, we introduce a unified transformer that combines both event and RGB modalities to achieve precise depth prediction. In contrast to individual transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions. This approach exceeds the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode the temporal information from events, convLSTMs are used before the transformer, further improving depth estimation. Our proposed architecture outperforms existing approaches in terms of absolute mean depth error, achieving state-of-the-art results in most cases, and also improves on other metrics such as RMSE, absolute relative difference, and depth thresholds. The source code is available at: https://***/anusha-devulapally/ER-F2D.
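A rough sketch of the fusion idea (hypothetical sizes; the ConvLSTM temporal encoder for events mentioned in the abstract is elided): patch tokens from the event representation and the RGB frame are concatenated and processed by a single shared transformer, so self-attention spans both modalities.

```python
import torch
from torch import nn

dim = 96
embed_rgb   = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # RGB -> patches
embed_event = nn.Conv2d(5, dim, kernel_size=16, stride=16)   # 5 voxel bins
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=4)

rgb    = torch.randn(1, 3, 224, 224)
events = torch.randn(1, 5, 224, 224)          # time-binned event voxel grid
tok = lambda x: x.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, N, C)
tokens = torch.cat([tok(embed_rgb(rgb)), tok(embed_event(events))], dim=1)
fused = encoder(tokens)                       # cross-modal self-attention
print(fused.shape)                            # (1, 392, 96): both token sets mixed
```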
ISBN (Print): 9798350353006
As a new embodied vision task, Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt a simple Exploration-Exploitation framework and ignore the identification of specific instances during navigation. In this work, we propose to imitate the human behaviour of "getting closer to confirm" when distinguishing objects from a distance. Specifically, we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image-goal navigation. Our method allows for active switching among exploration, verification, and exploitation actions, thereby facilitating the agent in making reasonable decisions under different situations. On the challenging Habitat-Matterport 3D semantic (HM3D-SEM) dataset, our method surpasses the previous state of the art, with both a classical segmentation model (0.684 vs. 0.561 success) and a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://***/XiaohanLei/IEVE.
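The switching behaviour can be caricatured as a tiny decision rule; the thresholds and inputs below are invented for illustration, and the actual IEVE policy is a learned modular system rather than this if-ladder.

```python
# Toy sketch of exploration / verification / exploitation switching.
# Thresholds and inputs are hypothetical, not the paper's.

def select_action(match_score: float, distance: float,
                  verify_thresh: float = 0.4, confirm_thresh: float = 0.8,
                  near: float = 1.0) -> str:
    if match_score < verify_thresh:
        return "explore"                 # no plausible candidate in view yet
    if match_score < confirm_thresh or distance > near:
        return "verify"                  # get closer to confirm the instance
    return "exploit"                     # confident match: head to the goal

for score, dist in [(0.2, 5.0), (0.6, 3.0), (0.9, 0.5)]:
    print(score, dist, "->", select_action(score, dist))
```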
ISBN (Print): 9798350353006
We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM generalizes strongly to segmenting anything but falls short at semantic understanding. By introducing a lightweight query-based feature mixer, we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically on the order of tens of millions), our approach requires less computation, less memory, and less communication bandwidth, resulting in fast and scalable training. To address the scarcity of regional caption data, we propose to first pretrain our model on object detection and segmentation tasks. We call this step weak supervision pretraining since the pretraining data contains only category names rather than full-sentence descriptions. Weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. This work serves as a stepping stone towards scaling up regional captioning data and sheds light on efficient ways to augment SAM with regional semantics. The project page, along with the associated code, can be accessed via the following link.
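A minimal sketch of what a "lightweight query-based feature mixer" could look like: a small set of learnable queries cross-attends to frozen region features and is projected into the language model's embedding space. Dimensions are illustrative, not the paper's; SAM and the language model themselves are assumed frozen and omitted.

```python
import torch
from torch import nn

class RegionMixer(nn.Module):
    """Hypothetical query-based mixer aligning region features to an LM."""
    def __init__(self, feat_dim=256, lm_dim=4096, n_queries=8, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, feat_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.to_lm = nn.Linear(feat_dim, lm_dim)    # align to LM token space

    def forward(self, region_feats):                # (B, N, feat_dim) from SAM
        q = self.queries.unsqueeze(0).expand(region_feats.size(0), -1, -1)
        mixed, _ = self.attn(q, region_feats, region_feats)
        return self.to_lm(mixed)                    # (B, n_queries, lm_dim)

mixer = RegionMixer()
print(mixer(torch.randn(2, 64, 256)).shape)         # torch.Size([2, 8, 4096])
```

Only the mixer and projection train, which is what keeps the parameter count in the tens of millions the abstract mentions.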
ISBN (Print): 9798350353006
Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve the detector's generalization ability by expanding its knowledge with 'positive' pseudo-labels for additional 'class' names, e.g., sock, iPod, and alligator. To extend these methods in two aspects, we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments the loss functions, and it augments visual features with 'verbalized concepts' of classes, e.g., worn on the feet, handheld music player, and sharp teeth. Specifically, RALF consists of two modules: Retrieval-Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL comprises two losses reflecting the semantic similarity with negative vocabularies, while RAF augments visual features with verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on the COCO and LVIS benchmark datasets. We achieve improvements of up to 3.4 box AP_N on novel categories of the COCO dataset and 3.6 mask AP_r on the LVIS dataset. Code is available at https://***/mlvlab/RALF.
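In the spirit of RAL, a generic contrastive loss against retrieved negative vocabularies might look as follows; this is an assumption-laden stand-in, not the paper's exact pair of losses.

```python
import torch
import torch.nn.functional as F

def retrieval_augmented_loss(region, pos_text, neg_texts, tau: float = 0.07):
    """Push a region embedding toward its GT class text and away from
    retrieved negative class texts (generic InfoNCE form, not RALF's exact loss)."""
    region    = F.normalize(region, dim=-1)       # (B, D) region embedding
    pos_text  = F.normalize(pos_text, dim=-1)     # (B, D) GT class text
    neg_texts = F.normalize(neg_texts, dim=-1)    # (B, K, D) retrieved negatives
    pos = (region * pos_text).sum(-1, keepdim=True) / tau        # (B, 1)
    neg = torch.einsum("bd,bkd->bk", region, neg_texts) / tau    # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(region.size(0), dtype=torch.long)       # positive at 0
    return F.cross_entropy(logits, target)

print(retrieval_augmented_loss(torch.randn(4, 512),
                               torch.randn(4, 512),
                               torch.randn(4, 10, 512)))
```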
ISBN (Print): 9798350353006
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across many downstream tasks. Our project page can be found at https://***/VistaLLM/.
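One way to read "gradient-aware adaptive sampling": when a binary mask contour is serialized into a fixed-length point sequence, spend more of the point budget where the contour bends sharply instead of sampling uniformly. The weighting below is a guess at the idea, not the paper's formula.

```python
import numpy as np

def adaptive_sample(contour: np.ndarray, n_points: int = 16) -> np.ndarray:
    """contour: (N, 2) closed polygon; returns (n_points, 2) points,
    denser where the contour changes direction sharply."""
    d1 = np.gradient(contour, axis=0)            # first derivative along contour
    d2 = np.gradient(d1, axis=0)                 # second derivative
    weight = np.linalg.norm(d2, axis=1) + 1e-6   # high where the curve bends
    cdf = np.cumsum(weight) / weight.sum()       # warp arc-position by weight
    targets = np.linspace(0, 1, n_points, endpoint=False)
    idx = np.searchsorted(cdf, targets)
    return contour[np.clip(idx, 0, len(contour) - 1)]

theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
square_ish = np.stack([np.sign(np.cos(theta)) * np.abs(np.cos(theta)) ** 0.3,
                       np.sign(np.sin(theta)) * np.abs(np.sin(theta)) ** 0.3], 1)
print(adaptive_sample(square_ish).round(2))      # points cluster at the corners
```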
ISBN (Print): 9798350353006
We explore the boundaries of scaling up a multilingual vision and language model, both in terms of the size of its components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks, including multiple image-based captioning and question-answering tasks, image-based document understanding and few-shot (in-context) learning, as well as object detection, video question answering, and video captioning. Our model advances the state of the art on most vision-and-language benchmarks considered (20+ of them). Finally, we observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
ISBN (Print): 9798350301298
In this paper, we identify pattern imbalance from several aspects and develop a new training scheme to avert pattern preference as well as spurious correlation. In contrast to prior methods, which are mostly concerned with category or domain granularity and ignore the finer structure that exists within datasets, we give a new definition of seed category as an appropriate optimization unit to distinguish different patterns within the same category or domain. Extensive experiments on domain generalization datasets of diverse scales demonstrate the effectiveness of the proposed method.
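A loose illustration of using seed categories (pattern clusters finer than a class) as the balancing unit: cluster each class's features and weight samples inversely to their cluster's size. The clustering and weighting choices here are stand-ins; the paper's definition and training scheme differ in detail.

```python
import torch

def seed_category_weights(features, labels, seeds_per_class=3, iters=10):
    """Assign each sample a loss weight inverse to its pattern-cluster size."""
    weights = torch.ones(len(labels))
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        x = features[idx]
        k = min(seeds_per_class, len(idx))
        centers = x[torch.randperm(len(x))[:k]]        # naive k-means init
        for _ in range(iters):
            assign = torch.cdist(x, centers).argmin(1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = x[assign == j].mean(0)
        counts = torch.bincount(assign, minlength=k).float()
        weights[idx] = 1.0 / counts[assign]            # rare patterns weigh more
        weights[idx] *= len(idx) / weights[idx].sum()  # renormalize per class
    return weights

w = seed_category_weights(torch.randn(100, 16), torch.randint(0, 5, (100,)))
print(w.mean(), w.min(), w.max())
```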
ISBN (Print): 9798350353006
Due to the depth degradation effect in residual connections, many efficient vision Transformer models that rely on stacking layers for information exchange often fail to form sufficient information mixing, leading to unnatural visual perception. To address this issue, we propose Aggregated Attention, a biomimetic token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore, we incorporate learnable tokens that interact with conventional queries and keys, which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. Our approach does not rely on stacking for information exchange, thus effectively avoiding depth degradation and achieving natural visual perception. Additionally, we propose Convolutional GLU, a channel mixer that bridges the gap between GLU and the SE mechanism, empowering each token to have channel attention based on its nearest-neighbor image features and enhancing local modeling capability and model robustness. We combine Aggregated Attention and Convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of 224², TransNeXt-Tiny attains an ImageNet accuracy of 84.0%, surpassing ConvNeXt-B with 69% fewer parameters. TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of 384², a COCO object detection mAP of 57.1, and an ADE20K semantic segmentation mIoU of 54.7.
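A sketch of a Convolutional GLU channel mixer as the abstract describes it: a GLU whose gating branch passes through a 3x3 depthwise convolution, so each token is gated by its nearest-neighbor image features, an SE-like local channel attention. The exact placement of norms and activations in TransNeXt may differ from this.

```python
import torch
from torch import nn

class ConvGLU(nn.Module):
    """GLU with a depthwise-conv gating branch (illustrative layout)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden * 2)            # value + gate branches
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                          # x: (B, H*W, dim)
        v, g = self.fc1(x).chunk(2, dim=-1)
        B, N, C = g.shape
        g = g.transpose(1, 2).reshape(B, C, H, W)        # tokens -> feature map
        g = self.dwconv(g).flatten(2).transpose(1, 2)    # local gating signal
        return self.fc2(v * self.act(g))                 # gated channel mixing

m = ConvGLU(dim=64, hidden=128)
print(m(torch.randn(2, 14 * 14, 64), 14, 14).shape)      # (2, 196, 64)
```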