检索结果-内蒙古大学图书馆

您好，读者！请登录

内蒙古大学图书馆

首页
概况
党建
资源
服务
科研支持
- 论文收录引用证明
- 科技查新
知识产权
档案馆
帮助

咨询与建议

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

您的常用邮箱：*

您的手机号码：*

问题描述：

当前已输入0个字，您还可以输入200个字

全部搜索
期刊论文
图书
学位论文
标准
纸本馆藏
外文资源发现
数据库导航
超星发现

高级检索

时间限定

出版年份：

文献类型

图书期刊文献学位论文多媒体

馆藏选择

电子馆藏纸本馆藏

核心期刊

全部期刊 SCI 收录期刊 SSCI 收录期刊 EI 收录期刊 CSCD 收录期刊 CSSCI 收录期刊

语言

中文英文

文献类型

期刊文献图书学位论文标准纸本馆藏

帮助

文字说明：

T=题名（书名、题名），A=作者（责任者），K=主题词，P=出版物名称，PU=出版社名称，O=机构（作者单位、学位授予单位、专利申请人），L=中图分类号，C=学科分类号，U=全部字段，Y=年（出版发行年、学位年度、标准发布年）

检索规则说明：

AND代表“并且”；OR代表“或者”；NOT代表“不包含”；(注意必须大写,运算符两边需空一格)

检索范例：

范例一：(K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016
范例二：P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016

分类表

所选分类

>> <<

限定检索结果

文献类型

20,860 篇 会议
105 篇 期刊文献
43 册 图书

馆藏范围

21,007 篇 电子文献
1 种 纸本馆藏

日期分布

学科分类号

13,620 篇 工学
- 11,056 篇 计算机科学与技术...
- 2,652 篇 机械工程
- 2,252 篇 软件工程
- 914 篇 光学工程
- 885 篇 电气工程
- 529 篇 控制科学与工程
- 477 篇 信息与通信工程
- 216 篇 测绘科学与技术
- 135 篇 生物工程
- 127 篇 生物医学工程（可授...
- 98 篇 电子科学与技术（可...
- 92 篇 仪器科学与技术
- 46 篇 安全科学与工程
- 40 篇 建筑学
- 40 篇 化学工程与技术
- 39 篇 土木工程
- 37 篇 交通运输工程
- 35 篇 力学（可授工学、理...
- 33 篇 航空宇航科学与技...
3,494 篇 医学
- 3,489 篇 临床医学
- 32 篇 基础医学(可授医学...
2,247 篇 理学
- 1,145 篇 物理学
- 1,081 篇 数学
- 401 篇 生物学
- 384 篇 统计学（可授理学、...
- 245 篇 系统科学
- 46 篇 化学
343 篇 管理学
- 176 篇 管理科学与工程(可...
- 168 篇 图书情报与档案管...
- 34 篇 工商管理
31 篇 法学
19 篇 农学
15 篇 教育学
8 篇 经济学
5 篇 艺术学
2 篇 军事学
1 篇 文学

主题

8,141 篇 computer vision
2,886 篇 training
2,841 篇 pattern recognit...
1,809 篇 computational mo...
1,715 篇 visualization
1,493 篇 cameras
1,433 篇 three-dimensiona...
1,433 篇 feature extracti...
1,366 篇 shape
1,360 篇 face recognition
1,243 篇 image segmentati...
1,135 篇 robustness
1,124 篇 semantics
992 篇 computer archite...
985 篇 object detection
982 篇 layout
959 篇 benchmark testin...
935 篇 codes
900 篇 computer science
898 篇 object recogniti...

机构

174 篇 univ sci & techn...
158 篇 univ chinese aca...
153 篇 carnegie mellon ...
145 篇 chinese univ hon...
109 篇 microsoft resear...
103 篇 zhejiang univ pe...
99 篇 swiss fed inst t...
95 篇 tsinghua univers...
90 篇 microsoft res as...
90 篇 tsinghua univ pe...
88 篇 shanghai ai lab ...
81 篇 zhejiang univers...
77 篇 alibaba grp peop...
74 篇 hong kong univ s...
73 篇 university of sc...
72 篇 peking univ peop...
72 篇 university of ch...
68 篇 shanghai jiao to...
66 篇 univ oxford oxfo...
65 篇 google res mount...

作者

80 篇 van gool luc
70 篇 zhang lei
58 篇 timofte radu
48 篇 yang yi
47 篇 luc van gool
46 篇 xiaoou tang
44 篇 tian qi
43 篇 darrell trevor
42 篇 loy chen change
42 篇 sun jian
41 篇 qi tian
40 篇 li stan z.
38 篇 li fei-fei
37 篇 chen xilin
36 篇 shan shiguang
35 篇 zhou jie
35 篇 vasconcelos nuno
35 篇 liu yang
35 篇 torralba antonio
34 篇 liu xiaoming

语言

20,982 篇 英文
10 篇 中文
7 篇 其他
5 篇 土耳其文
2 篇 日文
2 篇 葡萄牙文

检索条件"任意字段=2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016"

共 21008 条记录，以下是461-470 订阅

全选清除本页清除全部题录导出标记到"检索档案"

详细简洁

排序：

PromptSync: Bridging Domain Gaps in vision-Language Models through Class-Aware Prototype Alignment and Discrimination

PromptSync: Bridging Domain Gaps in Vision-Language Models t...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Khandelwal, Anant Glance AI Bangalore Karnataka India

ISBN: (纸本)9798350365474

The potential for zero-shot generalization in vision-language (V-L) models such as CLIP has spurred their widespread adoption in addressing numerous downstream tasks. Previous methods have employed test-time prompt tuning to adapt the model to unseen domains, but they overlooked the issue of imbalanced class distributions. In this study, we explicitly address this problem by employing class-aware prototype alignment weighted by mean class probabilities obtained for the test sample and filtered augmented views. Additionally, we ensure that the class probabilities are as accurate as possible by performing prototype discrimination using contrastive learning. The combination of alignment and discriminative loss serves as a geometric regularizer, preventing the prompt representation from collapsing onto a single class and effectively bridging the distribution gap between the source and test domains. Our method, named PromptSync, synchronizes the prompts for each test sample on both the text and vision branches of the V-L model. In empirical evaluations on the domain generalization benchmark, our method outperforms previous best methods by 2.33% in overall performance, by 1% in base-to-novel generalization, and by 2.84% in cross-dataset transfer tasks.

关键词： Visual languages

来源：评论

学校读者我要写书评

暂无评论

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Panda-70M: Captioning 70M Videos with Multiple Cross-Modalit...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Chen, Tsai-Shien Siarohin, Aliaksandr Menapace, Willi Deyneka, Ekaterina Chao, Hsiang-wei Jeon, Byung Eun Fang, Yuwei Lee, Hsin-Ying Ren, Jian Yang, Ming-Hsuan Tulyakov, Sergey Snap Inc Santa Monica CA 90405 USA Univ Calif Merced Merced CA 95343 USA Univ Trento Trento Italy Snap Santa Monica CA USA

ISBN: (纸本)9798350353006

The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs, high-quality video-text data is much harder to collect. First of all, manual labeling is more time-consuming, as it requires an annotator to watch an entire video. Second, videos have a temporal dimension, consisting of several scenes stacked together, and showing multiple actions. Accordingly, to establish a video dataset with high- quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video description, subtitles, and individual video frames. Specifically, we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips, and apply multiple cross-modality teacher models to obtain captions for each video. Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model in the whole dataset to select the best caption as the annotation. In this way, we get 70M videos paired with high-quality text captions. We dub the dataset as Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks.

关键词： Multimodal learning Video captioning vision-language dataset

来源：评论

学校读者我要写书评

暂无评论

Multi-criteria Token Fusion with One-step-ahead Attention for Efficient vision Transformers

Multi-criteria Token Fusion with One-step-ahead Attention fo...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Lee, Sanghyeok Choi, Joonmyung Kim, Hyunwoo J. Korea Univ Dept Comp Sci & Engn Seoul South Korea

ISBN: (纸本)9798350353006

vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs, recent works lessen the quadratic cost of the self- attention layer by pruning or fusing the redundant tokens. However, these works faced the speed-accuracy trade-off caused by the loss of information. Here, we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper, we propose a Multi-criteria Token Fusion (MCTF), that gradually fuses the tokens based on multi-criteria (i.e., similarity, informativeness, and size of fused tokens). Further, we utilize the one-step-ahead attention, which is the improved approach to capture the informativeness of the tokens. By training the model equipped with MCTF using a token reduction consistency, we achieve the best speed-accuracy tradeoff in the image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically, DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5%, and +0.3%) over the base model, respectively. We also demonstrate the applicability of MCTF in various vision Transformers (e.g., T2T-ViT, LV-ViT), achieving at least 31% speedup without performance degradation. Code is available at https://***/mlvlab/MCTF.

关键词： Efficient ViTs Token Fusion Token Merging Token Reduction

来源：评论

学校读者我要写书评

暂无评论

Doubly Right Object recognition: A Why Prompt for Visual Rationales

Doubly Right Object Recognition: A Why Prompt for Visual Rat...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Mao, Chengzhi Teotial, Revant Sundari, Amrutha Menon, Sachit Yang, Junfeng Wang, Xin Vondrick, Carl Columbia Univ New York NY 10027 USA Microsoft Res Redmond WA USA

ISBN: (纸本)9798350301298

Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.

关键词： Explainable computer vision

来源：评论

学校读者我要写书评

暂无评论

SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction

SIFU: Side-view Conditioned Implicit Function for Real-world...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Zhang, Zechuan Yang, Zongxin Yang, Yi Zhejiang Univ CCAL ReLER Hangzhou Peoples R China

ISBN: (纸本)9798350353006

Creating high-quality 3D models of clothed humans from single images for real-world applications is crucial. Despite recent advancements, accurately reconstructing humans in complex poses or with loose clothing from in-the-wild images, along with predicting textures for unseen areas, remains a significant challenge. A key limitation of previous methods is their insufficient prior guidance in transitioning from 2D to 3D and in texture prediction. In response, we introduce SIFU (Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction), a novel approach combining a Side-view Decoupling Transformer with a 3D Consistent Texture Refinement pipeline. SIFU employs a cross-attention mechanism within the transformer, using SMPL-X normals as queries to effectively decouple side-view features in the process of mapping 2D features to 3D. This method not only improves the precision of the 3D models but also their robustness, especially when SMPL-X estimates are not perfect. Our texture refinement process leverages text-to-image diffusion-based prior to generate realistic and consistent textures for invisible views. Through extensive experiments, SIFU surpasses SOTA methods in both geometry and texture reconstruction, showcasing enhanced robustness in com-lex scenarios and achieving an unprecedented Chamfer and P2S measurement. Our approach extends to practical applications such as 3D printing and scene building, demonstrating its broad utility in real-world scenarios.

关键词： 3D reconstruction 3D vision Clothed human diffusion model single image to 3D

来源：评论

学校读者我要写书评

暂无评论

Making Visual Sense of Oracle Bones for You and Me

Making Visual Sense of Oracle Bones for You and Me

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Qiao, Runqi Yang, Lan Pang, Kaiyue Zhang, Honggang Beijing Univ Posts & Telecommun Sch Artificial Intelligence Beijing Peoples R China Univ Surrey CVSSP SketchX Guildford Surrey England

ISBN: (纸本)9798350353006

Visual perception evolves over time. This is particularly the case of oracle bone scripts, where visual glyphs seem intuitive to people from distant past prove difficult to be understood in contemporary eyes. While semantic correspondence of an oracle can be found via a dictionary lookup, this proves to be not enough for public viewers to connect the dots, i.e., why does this oracle mean that? Common solution relies on a laborious curation process to collect visual guide for each oracle (Fig. 1), which hinges on the case-by-case effort and taste of curators. This paper delves into one natural follow-up question: can AI take over? Begin with a comprehensive human study, we show participants could indeed make better sense of an oracle glyph subjected to a proper visual guide and its efficacy can be approximated via a novel metric termed TransOV (Transferable Oracle Visuals). We then define a new conditional visual generation task based on an oracle glyph and its semantic meaning and importantly approach it by circumventing any form of model training in the presence of fatal lack of oracle data. At its heart is to leverage foundation model like GPT-4V to reason about the visual cues hidden inside an oracle and take advantage of an existing text-to-image model for final visual guide generation. Extensive empirical evidence shows our AI-enabled visual guides achieve significantly comparable TransOV performance compared with those collected under manual efforts. Finally, we demonstrate the versatility of our system under a more complex setting, where it is required to work alongside with an AI image denoiser to cope with raw oracle scan image inputs (cf. processed clean oracle glyphs). Code is available at https://***/RQ-Lab/OBS-Visual.

关键词： vision

来源：评论

学校读者我要写书评

暂无评论

What, when, and where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

What, when, and where? Self-Supervised Spatio-Temporal Groun...

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Chen, Brian Shvetsova, Nina Rouditchenko, Andrew Kondermann, Daniel Thomas, Samuel Chang, Shih-Fu Feris, Rogerio Glass, James Kuehne, Hilde Columbia Univ New York NY 10027 USA Goethe Univ Frankfurt Frankfurt Germany Univ Bonn Bonn Germany MIT CSAIL Cambridge MA USA Qual Match GmbH Heidelberg Germany IBM Res AI Yorktown Hts NY USA MIT IBM Watson Lab Cambridge MA USA

ISBN: (纸本)9798350353006

Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a frame-work for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed, providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks, showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.

关键词： Grounding Multimodal learning Self-supervised learning Video understanding vision and language

来源：评论

学校读者我要写书评

暂无评论

Single-Model and Any-Modality for Video Object Tracking

Single-Model and Any-Modality for Video Object Tracking

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Wu, Zongwei Zheng, Jilai Ren, Xiangxuan Vasluianu, Florin-Alexandru Ma, Chao Paudel, Danda Pani Luc Van Gool Timofte, Radu Univ Wurzburg Comp Vision Lab CAIDAS & IFI Wurzburg Germany Shanghai Jiao Tong Univ AI Inst Shanghai Peoples R China Sofia Univ INSAIT Sofia Bulgaria Swiss Fed Inst Technol CVL Zurich Switzerland

ISBN: (纸本)9798350353006

In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multimodality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs - each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGBX pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https:// github. com/ Zongwei97/ UnTrack.

关键词：

来源：评论

学校读者我要写书评

暂无评论

Behavioral Analysis of vision-and-Language Navigation Agents

Behavioral Analysis of Vision-and-Language Navigation Agents

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Yang, Zijiao Majumdar, Arjun Lee, Stefan Oregon State Univ Corvallis OR 97331 USA Georgia Inst Technol Atlanta GA USA

ISBN: (纸本)9798350301298

To be successful, vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis - examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach is based on generating skill-specific interventions and measuring changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models are able to ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance.

关键词： language reasoning vision

来源：评论

学校读者我要写书评

暂无评论

Deep Video Codec Control for vision Models

Deep Video Codec Control for Vision Models

引用

ieee/CVF conference on computer vision and pattern recognition (cvpr)

作者： Reich, Christoph Debnath, Biplob Patel, Deep Prangemeier, Tim Cremers, Daniel Chakradhar, Srimat NEC Labs Amer Inc San Jose CA 95110 USA Tech Univ Munich Munich Germany Tech Univ Darmstadt Darmstadt Germany Munich Ctr Machine Learning MCML Munich Germany

ISBN: (纸本)9798350365474

Standardized lossy video coding is at the core of almost all real-world video processing pipelines. Rate control is used to enable standard codecs to adapt to different network bandwidth conditions or storage constraints. However, standard video codecs (e.g., H.264) and their rate control modules aim to minimize video distortion w.r.t. human quality assessment. We demonstrate empirically that standard-coded videos vastly deteriorate the performance of deep vision models. To overcome the deterioration of vision performance, this paper presents the first end-to-end learnable deep video codec control that considers both bandwidth constraints and downstream deep vision performance, while adhering to existing standardization. We demonstrate that our approach better preserves downstream deep vision performance than traditional standard video coding.

关键词： Codec Coding for Machines Deep Video Codec Control Deep vision Models End-to-end Learning Rate Control Self-supervised Learning Standardization Video Coding

来源：评论

学校读者我要写书评

暂无评论

没有更多数据了...

全选清除本页清除全部题录导出标记到“检索档案”

共500页 << < 43 44 45 46 47 48 49 50 51 52 > >>

检索报告对象比较合并检索0

隐藏清空

合并搜索

回到顶部

执行限定条件

内容：

评分：

请选择保存的检索档案：

请选择收藏分类：

订阅名称：

通借通还

温馨提示：

图书名称：

借书校区：

取书校区：

手机号码：

邮箱地址：

一卡通帐号：

电话和邮箱必须正确填写，我们会与您联系确认。

联系人：

所在院系：

联系邮箱：

联系电话：

内蒙古自治区呼和浩特市赛罕区大学西街235号邮编: 010021

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：