ISBN (Print): 9798350353006
Multi-modality image fusion is a technique that combines information from different sensors or modalities, enabling the fused image to retain complementary features from each modality, such as functional highlights and texture details. However, effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue, we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently, we introduce a novel training paradigm that encompasses a fusion module, a pseudo-sensing module, and an equivariant fusion module. These components enable the network training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images, concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://***/Zhaozixiang1228/MMIF-EMMA.
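As an illustration of the equivariant-imaging idea above, the self-supervision can be sketched as a consistency loss: fusing transformed inputs should match transforming the fused output. The minimal sketch below assumes a toy two-modality fusion network and 90-degree rotations as the transformation group; it is illustrative only and is not the authors' EMMA implementation (which additionally involves a pseudo-sensing module).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFusion(nn.Module):
    """Hypothetical stand-in for a fusion network: two modalities in, one image out."""
    def __init__(self, ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * ch, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, ch, 3, padding=1),
        )

    def forward(self, ir, vis):
        return self.net(torch.cat([ir, vis], dim=1))

def transform(x, k):
    """A simple transformation group: rotate NCHW tensors by k * 90 degrees."""
    return torch.rot90(x, k, dims=(2, 3))

def equivariance_loss(fusion, ir, vis, k):
    """Fusing transformed inputs should equal transforming the fused output."""
    fused = fusion(ir, vis)
    fused_of_transformed = fusion(transform(ir, k), transform(vis, k))
    return F.l1_loss(fused_of_transformed, transform(fused, k))

if __name__ == "__main__":
    fusion = SimpleFusion()
    ir = torch.rand(2, 1, 64, 64)   # infrared modality (dummy data)
    vis = torch.rand(2, 1, 64, 64)  # visible modality (dummy data)
    loss = equivariance_loss(fusion, ir, vis, k=1)
    loss.backward()
    print(float(loss))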
ISBN (Print): 9798350353013; 9798350353006
Gaze is a powerful form of non-verbal communication that humans develop from an early age. As such, modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular, the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures, but they have been constrained by the need to process one person at a time, which proves to be highly inefficient. In this paper, we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction [38, 39], they use a fixed set of learnable embeddings to decode both the person and its gaze target, which requires a matching step afterward to link the predictions with the annotations. Thus, it is difficult to quantitatively evaluate these methods reliably with the available benchmarks, or to integrate them into a larger human behavior understanding system. Instead, we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget, and ChildPlay datasets and outperforms comparable multi-person architectures by a notable margin. Our code, checkpoints, and data extractions will be made publicly available soon.
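The core person-token idea can be sketched as follows: a single token carrying person-specific information (e.g., a head-crop feature plus a head bounding box) is processed by a transformer together with image patch tokens, and that token's output is decoded to a 2D gaze point. All module names, feature dimensions, and the simple linear decoder below are assumptions for illustration; the paper's multiscale decoding mechanism is not reproduced.

import torch
import torch.nn as nn

class PersonTokenGazeNet(nn.Module):
    """Toy sketch: project a person crop feature + head box into one token,
    run it through a transformer alongside image patch tokens, and regress
    a normalised 2D gaze point from the person token's output."""
    def __init__(self, dim=128):
        super().__init__()
        self.patch_embed = nn.Linear(768, dim)       # assumes precomputed patch features
        self.person_embed = nn.Linear(768 + 4, dim)  # crop feature + head bbox (x, y, w, h)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.gaze_head = nn.Linear(dim, 2)           # normalised (x, y) gaze point

    def forward(self, patch_feats, person_feat, head_bbox):
        img_tokens = self.patch_embed(patch_feats)                     # (B, N, dim)
        person_tok = self.person_embed(
            torch.cat([person_feat, head_bbox], dim=-1)).unsqueeze(1)  # (B, 1, dim)
        out = self.encoder(torch.cat([person_tok, img_tokens], dim=1))
        return self.gaze_head(out[:, 0]).sigmoid()                     # decode the person token

if __name__ == "__main__":
    net = PersonTokenGazeNet()
    gaze = net(torch.rand(2, 196, 768), torch.rand(2, 768), torch.rand(2, 4))
    print(gaze.shape)  # torch.Size([2, 2])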
ISBN (Print): 9798350353006
Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities, with the majority focusing on the third-person perspective, and only a few addressing specific tasks from the first-person perspective. However, the capability of VLMs to "think" from a first-person perspective, a crucial attribute for advancing autonomous agents and robotics, remains largely unexplored. To bridge this research gap, we introduce EgoThink, a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos, with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs, we evaluate twenty-one popular VLMs on EgoThink. Moreover, given the open-ended format of the answers, we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions, all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile, enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion, EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs, providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics.
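Single-answer grading with an LLM judge, as used for the open-ended answers above, can be sketched as below. The prompt wording, the 0-10 scale, and the placeholder call_llm function are assumptions for illustration, not EgoThink's exact protocol.

# Minimal sketch of LLM-as-judge single-answer grading. `call_llm` is a
# placeholder for whatever chat-completion client is available.
import re

JUDGE_PROMPT = """You are grading an answer to a first-person visual question.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Give a single integer score from 0 to 10, formatted as: Score: <n>"""

def call_llm(prompt: str) -> str:
    """Placeholder: plug in a real chat-completion call here."""
    raise NotImplementedError

def grade(question: str, reference: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question,
                                         reference=reference,
                                         answer=answer))
    match = re.search(r"Score:\s*(\d+)", reply)
    return int(match.group(1)) if match else 0

# Example (requires a real `call_llm` backend):
# score = grade("What am I holding?", "A red mug.", "A coffee cup.")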
ISBN (Digital): 9798350353006
ISBN (Print): 9798350353006
Grasp detection is a persistent and intricate challenge with various industrial applications. Recently, many methods and datasets have been proposed to tackle the grasp detection problem. However, most of them do not consider using natural language as a condition to detect the grasp poses. In this paper, we introduce Grasp-Anything++, a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task, we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective, which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We illustrate that our approach is theoretically supported. The intensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally, we demonstrate that our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work.
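Treating grasp detection as conditional generation can be sketched with a standard DDPM-style training step in which a denoiser predicts the noise added to a grasp pose, conditioned on a text embedding. The 5-D pose parameterisation, network shape, and noise schedule below are assumptions for illustration; the paper's contrastive training objective is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspDenoiser(nn.Module):
    """Toy epsilon-prediction network conditioned on a text embedding."""
    def __init__(self, pose_dim=5, text_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + text_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, noisy_pose, t, text_emb):
        t = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([noisy_pose, text_emb, t], dim=-1))

def ddpm_training_step(model, pose, text_emb, alphas_cumprod):
    """Standard DDPM objective: predict the noise added to the clean grasp pose."""
    B = pose.size(0)
    t = torch.randint(0, len(alphas_cumprod), (B,))
    a_bar = alphas_cumprod[t].unsqueeze(-1)
    noise = torch.randn_like(pose)
    noisy = a_bar.sqrt() * pose + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, t, text_emb), noise)

if __name__ == "__main__":
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    model = GraspDenoiser()
    loss = ddpm_training_step(model, torch.rand(8, 5), torch.rand(8, 512), alphas_cumprod)
    loss.backward()
    print(float(loss))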
ISBN (Print): 9798350353006
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at https://***/aimmemotion/EmoVIT.
ISBN (Print): 9798350353006
The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather small amount of training samples (e.g., IN-1K), without labels or captions, within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost, without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://***/showlab/sparseformer
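The inherit-and-freeze recipe described above can be sketched generically: copy every weight whose name and shape match from the pre-trained ViT, then leave gradients enabled only for the lightweight focusing transformer and the few early blocks to be fine-tuned. The name-matching convention below is hypothetical and is not the released code.

import torch.nn as nn

def bootstrap_and_freeze(sparseformer: nn.Module, pretrained_vit: nn.Module,
                         trainable_names=("focusing", "early")):
    """Copy matching weights from a pre-trained ViT into a SparseFormer-like model,
    then freeze everything except modules whose names mark them as trainable.
    `trainable_names` is a hypothetical naming convention, not the paper's code."""
    vit_state = pretrained_vit.state_dict()
    own_state = sparseformer.state_dict()
    # Inherit weights wherever parameter names and shapes line up.
    matched = {k: v for k, v in vit_state.items()
               if k in own_state and own_state[k].shape == v.shape}
    own_state.update(matched)
    sparseformer.load_state_dict(own_state)

    # Freeze everything, then re-enable only the designated trainable parts.
    for name, p in sparseformer.named_parameters():
        p.requires_grad = any(tag in name for tag in trainable_names)
    return sparseformer

# Usage sketch: only parameters whose names contain "focusing" or "early" receive
# gradients, so the optimiser is built from that small trainable subset, e.g.
# params = [p for p in model.parameters() if p.requires_grad]
# optimiser = torch.optim.AdamW(params, lr=1e-4)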
ISBN (Print): 9798350353006
Open-set recognition (OSR) methods aim to identify whether or not a test example belongs to a category observed during training. Depending on how visually similar a test example is to the training categories, the OSR task can be easy or extremely challenging. However, the vast majority of previous work has studied OSR in the presence of large, coarse-grained semantic shifts. In contrast, many real-world problems are inherently fine-grained, which means that test examples may be highly visually similar to the training categories. Motivated by this observation, we investigate three aspects of OSR: label granularity, similarity between the open- and closed-sets, and the role of hierarchical supervision during training. To study these dimensions, we curate new open-set splits of a large fine-grained visual categorization dataset. Our analysis results in several interesting findings, including: (i) the best OSR method to use is heavily dependent on the degree of semantic shift present, and (ii) hierarchical representation learning can improve coarse-grained OSR, but has little effect on fine-grained OSR performance. To further enhance fine-grained OSR performance, we propose a hierarchy-adversarial learning method to discourage hierarchical structure in the representation space, which, perhaps counter-intuitively, yields a relative improvement in fine-grained OSR of up to 2% in AUROC and 7% in AUPR over standard training. Code and data are available: ***/fine-grained-osr.
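One standard way to realise an adversarial objective that discourages a given structure in the representation is a gradient reversal layer placed before an auxiliary coarse-label classifier. The sketch below uses that generic construction and is not necessarily the authors' exact hierarchy-adversarial formulation.

import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class HierarchyAdversarialHead(nn.Module):
    """Coarse-label classifier trained through gradient reversal, so the backbone
    is pushed to remove coarse (hierarchical) information from its features."""
    def __init__(self, feat_dim=512, num_coarse=13, lam=1.0):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_coarse)
        self.lam = lam

    def forward(self, feats):
        return self.classifier(GradReverse.apply(feats, self.lam))

if __name__ == "__main__":
    feats = torch.rand(4, 512, requires_grad=True)
    head = HierarchyAdversarialHead()
    loss = nn.CrossEntropyLoss()(head(feats), torch.randint(0, 13, (4,)))
    loss.backward()
    print(feats.grad.shape)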
ISBN (Print): 9798350353006
Domain Generalization (DG) aims to resolve distribution shifts between source and target domains, and current DG methods default to the setting that data from source and target domains share identical categories. Nevertheless, unseen classes may emerge from target domains in practical scenarios. To address this issue, Open Set Domain Generalization (OSDG) has emerged, and several dedicated methods have been proposed. However, most existing methods adopt complex architectures with only slight improvement over DG methods. Recently, vision-language models (VLMs) have been introduced in DG following the fine-tuning paradigm, but they consume huge training overhead with large vision models. Therefore, in this paper, we innovate to transfer knowledge from VLMs to lightweight vision models and improve the robustness by introducing Perturbation Distillation (PD) from three perspectives, including Score, Class and Instance (SCI), named SCI-PD. Moreover, previous methods are oriented by benchmarks with identical and fixed splits, ignoring the divergence between source domains. These methods are revealed to suffer from sharp performance decay with our proposed new benchmark Hybrid Domain Generalization (HDG) and a novel metric H-2-CV, which construct various splits to comprehensively assess the robustness of algorithms. Extensive experiments demonstrate that our method outperforms state-of-the-art algorithms on multiple datasets, especially improving the robustness when confronting data scarcity.
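Score-level knowledge transfer from a frozen VLM teacher to a lightweight student under input perturbations can be sketched with a standard temperature-scaled KL distillation loss, as below. This is a generic distillation sketch under an assumed perturbation (additive Gaussian noise), not the paper's SCI-PD formulation.

import torch
import torch.nn.functional as F

def perturbation_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Generic score-level distillation: KL between softened teacher and student
    class distributions, scaled by tau^2 as in standard knowledge distillation."""
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * tau ** 2

def perturb(images, sigma=0.05):
    """A simple input perturbation (additive Gaussian noise) used for illustration."""
    return images + sigma * torch.randn_like(images)

# Usage sketch (teacher = frozen VLM encoder + classifier, student = lightweight net):
# with torch.no_grad():
#     t_logits = teacher(perturb(images))
# s_logits = student(perturb(images))
# loss = perturbation_distillation_loss(s_logits, t_logits)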
ISBN (Print): 9798350353013; 9798350353006
In the context of computer vision and human-robot interaction, forecasting 3D human poses is crucial for understanding human behavior and enhancing the predictive capabilities of intelligent systems. While existing methods have made significant progress, they often focus on predicting major body joints, overlooking fine-grained gestures and their interaction with objects. Human hand movements, particularly during object interactions, play a pivotal role and provide more precise expressions of human poses. This work fills this gap and introduces a novel paradigm: forecasting 3D whole-body human poses with a focus on grasping objects. This task involves predicting activities across all joints in the body and hands, encompassing the complexities of internal heterogeneity and external interactivity. To tackle these challenges, we also propose C3HOST, a cross-context cross-modal consolidation approach for 3D whole-body pose forecasting that effectively handles the complexities of internal heterogeneity and external interactivity. C3HOST involves distinct steps, including heterogeneous content encoding and alignment, and cross-modal feature learning and interaction. These enable us to predict activities across all body and hand joints, ensuring high-precision whole-body human pose prediction, even during object grasping. Extensive experiments on two benchmarks demonstrate that our model significantly enhances the accuracy of whole-body human motion prediction. The project page is available at https://***/view/c3host.
ISBN (Print): 9798350353006
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring accurate tracking and depiction of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on the MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both existing fine-tuning-based approaches and LLM-based approaches in most scenarios, as measured by standard evaluation metrics. Additionally, we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4, this evaluator comprehensively reasons and scores AD generation performance across various extendable dimensions.
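The register-and-recall mechanism can be sketched as a rolling memory of past segment descriptions that is queried for the most relevant entries before each new AD is generated. The embedding function, cosine-similarity recall, and helper names below are placeholders for illustration, not MM-Narrator's implementation.

# Minimal register-and-recall sketch for long-video AD generation. `embed` and
# `generate_ad` are placeholders for a text embedder and an LLM call.
import numpy as np

class NarrationMemory:
    def __init__(self, embed, top_k=5):
        self.embed, self.top_k = embed, top_k
        self.texts, self.vecs = [], []

    def register(self, text: str):
        """Store a generated description together with its embedding."""
        self.texts.append(text)
        self.vecs.append(self.embed(text))

    def recall(self, query: str):
        """Return the top-k past descriptions most similar to the current context."""
        if not self.texts:
            return []
        q = self.embed(query)
        vecs = np.stack(self.vecs)
        sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-8)
        order = np.argsort(-sims)[: self.top_k]
        return [self.texts[i] for i in order]

# Autoregressive loop sketch (clip.caption, recent_subtitles, generate_ad are placeholders):
# memory = NarrationMemory(embed)
# for clip in video_clips:
#     context = memory.recall(clip.caption) + recent_subtitles(clip)
#     ad = generate_ad(clip, context)   # LLM call with visual + textual context
#     memory.register(ad)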