Search Results - Inner Mongolia University Library

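Help

Field codes:

T = title (book title, article title), A = author (responsible party), K = subject term, P = publication name, PU = publisher name, O = organization (author affiliation, degree-granting institution, patent applicant), L = Chinese Library Classification number, C = subject classification number, U = all fields, Y = year (year of publication, degree year, standard release year)

Search rules:

AND means "and"; OR means "or"; NOT means "does not contain". (Operators must be uppercase, with one space on each side.)

Search examples:

Example 1: (K=图书馆学 OR K=情报学) AND A=范并思 AND Y=1982-2016
Example 2: P=计算机应用与软件 AND (U=C++ OR U=Basic) NOT K=Visual AND Y=2011-2016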
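
The search rules above can be sketched as a small string builder. This is only an illustration in Python: the helper names `term`, `year_range`, `group` and `build` are hypothetical and not part of the library's system, and the sketch only composes an expression string (field code, `=`, value, uppercase operators with one space on each side). It does not submit anything to the catalog.

```python
# Field codes and operators as listed in the help text above.
FIELD_CODES = {"T", "A", "K", "P", "PU", "O", "L", "C", "U", "Y"}
OPERATORS = {"AND", "OR", "NOT"}


def term(field: str, value: str) -> str:
    """Return one FIELD=value term (hypothetical helper)."""
    if field not in FIELD_CODES:
        raise ValueError(f"unknown field code: {field!r}")
    return f"{field}={value}"


def year_range(start: int, end: int) -> str:
    """Return a Y=start-end term, as in the two examples."""
    if start > end:
        raise ValueError("start year must not exceed end year")
    return term("Y", f"{start}-{end}")


def group(expression: str) -> str:
    """Wrap a sub-expression in parentheses."""
    return f"({expression})"


def build(*tokens: str) -> str:
    """Join alternating terms and operators with single spaces.

    Expects: term, operator, term, operator, term, ...
    Operators must be uppercase AND / OR / NOT.
    """
    if len(tokens) % 2 == 0:
        raise ValueError("expected term, operator, term, ...")
    for operator in tokens[1::2]:
        if operator not in OPERATORS:
            raise ValueError(f"operator must be one of {sorted(OPERATORS)}: {operator!r}")
    return " ".join(tokens)


# Rebuild the first example from the help text.
example_1 = build(
    group(build(term("K", "图书馆学"), "OR", term("K", "情报学"))),
    "AND", term("A", "范并思"),
    "AND", year_range(1982, 2016),
)
print(example_1)
```

Keeping the operators as separate tokens makes the uppercase and single-space rules easy to check before the expression is pasted into the advanced search box.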

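Refine search results

Document type

11,886 records: Conference papers
5 records: Journal articles

Holdings

11,891 records: Electronic documents
0 titles: Print holdings

Subject classification

8,060 records: Engineering
- 7,618 records: Computer Science and Technology...
- 796 records: Mechanical Engineering
- 688 records: Electrical Engineering
- 361 records: Software Engineering
- 228 records: Control Science and Engineering
- 41 records: Optical Engineering
- 19 records: Bioengineering
- 17 records: Information and Communication Engineering
- 12 records: Biomedical Engineering (degrees awardable in...
- 7 records: Transportation Engineering
- 6 records: Electronic Science and Technology (degrees...
- 6 records: Architecture
- 5 records: Instrument Science and Technology
- 5 records: Chemical Engineering and Technology
- 5 records: Safety Science and Engineering
- 4 records: Civil Engineering
3,347 records: Medicine
- 3,346 records: Clinical Medicine
- 4 records: Basic Medicine (degrees awardable in medicine...
- 4 records: Public Health and Preventive Medi...
254 records: Science
- 198 records: Systems Science
- 32 records: Physics
- 21 records: Biology
- 19 records: Mathematics
- 9 records: Statistics (degrees awardable in science, ...
- 7 records: Chemistry
17 records: Management
- 12 records: Management Science and Engineering (degrees...
- 7 records: Library, Information and Archives Manage...
- 5 records: Business Administration
3 records: Law
- 3 records: Sociology
3 records: Education
- 3 records: Education
2 records: Agriculture
1 record: Economics
1 record: Military Science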
Topic

5,633 records: computer vision
2,668 records: training
2,203 records: pattern recognit...
1,747 records: computational mo...
1,502 records: visualization
1,360 records: three-dimensiona...
1,074 records: semantics
999 records: benchmark testin...
986 records: codes
959 records: computer archite...
892 records: deep learning
777 records: conferences
754 records: task analysis
700 records: feature extracti...
561 records: transformers
533 records: face recognition
527 records: neural networks
495 records: object detection
490 records: image segmentati...
468 records: cameras

Institution

174 records: univ sci & techn...
145 records: carnegie mellon ...
144 records: univ chinese aca...
144 records: tsinghua univ pe...
134 records: chinese univ hon...
110 records: zhejiang univ pe...
109 records: peng cheng lab p...
99 records: swiss fed inst t...
91 records: tsinghua univers...
90 records: shanghai ai lab ...
87 records: sensetime res pe...
86 records: shanghai jiao to...
83 records: zhejiang univers...
82 records: tech univ munich...
79 records: university of sc...
79 records: stanford univ st...
78 records: univ hong kong p...
77 records: australian natl ...
76 records: alibaba grp peop...
75 records: peng cheng labor...

Author

75 records: timofte radu
64 records: van gool luc
50 records: zhang lei
43 records: yang yi
37 records: loy chen change
36 records: tao dacheng
32 records: zhou jie
31 records: chen chen
30 records: liu yang
30 records: tian qi
29 records: sun jian
29 records: zha zheng-jun
28 records: li xin
27 records: qi tian
26 records: vasconcelos nuno
25 records: liu xiaoming
25 records: darrell trevor
24 records: zheng wei-shi
24 records: luo ping
24 records: ying shan

Language

11,850 records: English
40 records: Other
1 record: Chinese

Search condition: "Any field=2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024"

11,891 records in total; showing records 1401-1410

Event-guided Person Re-Identification via Sparse-Dense Complementary Learning

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Cao, Chengzhi; Fu, Xueyang; Liu, Hongjian; Huang, Yukun; Wang, Kunyu; Luo, Jiebo; Zha, Zheng-Jun
Affiliations: Univ Sci & Technol China, Hefei, Peoples R China; Univ Rochester, Rochester, NY 14627, USA

ISBN: (print) 9798350301298

Video-based person re-identification (Re-ID) is a prominent computer vision topic due to its wide range of video surveillance applications. Most existing methods utilize spatial and temporal correlations in frame sequences to obtain discriminative person features. However, inevitable degradation, e.g., motion blur contained in frames, leads to the loss of identity-discriminating cues. Recently, a new bio-inspired sensor called the event camera, which can asynchronously record intensity changes, has brought new vitality to the Re-ID task. With microsecond resolution and low latency, it can accurately capture the movements of pedestrians even in degraded environments. In this work, we propose a Sparse-Dense Complementary Learning (SDCL) Framework, which effectively extracts identity features by fully exploiting the complementary information of dense frames and sparse events. Specifically, for frames, we build a CNN-based module to aggregate the dense features of pedestrian appearance step by step, while for event streams, we design a bio-inspired spiking neural network (SNN) backbone, which encodes event signals into sparse feature maps in a spiking form, to extract the dynamic motion cues of pedestrians. Finally, a cross feature alignment module is constructed to fuse motion information from events and appearance cues from frames to enhance identity representation learning. Experiments on several benchmarks show that by employing events and SNN in Re-ID, our method significantly outperforms competitive methods. The code is available at https://***/ChengzhiCao/SDCL.

Keywords: detection recognition: Categorization retrieval

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Liu, Xubo; Lakomkin, Egor; Vougioukas, Konstantinos; Ma, Pingchuan; Chen, Honglie; Xie, Ruiming; Doulaty, Morrie; Moritz, Niko; Kolar, Jachym; Petridis, Stavros; Pantic, Maja; Fuegen, Christian
Affiliations: Univ Surrey, Guildford, Surrey, England; Meta AI, New York, NY 58051, USA

ISBN: (print) 9798350301298

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and could be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches using thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method.

Keywords: and reasoning language vision

IKEA Ego 3D Dataset: Understanding furniture assembly actions from ego-view 3D Point Clouds

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Authors: Ben-Shabat, Yizhak; Paul, Jonathan; Segev, Eviatar; Shrout, Oren; Gould, Stephen
Affiliations: Australian Natl Univ, Canberra, ACT, Australia; Technion Israel Inst Technol, Haifa, Israel

ISBN: (print) 9798350318920; 9798350318937

We propose a novel dataset for ego-view 3D point cloud action recognition. While there has been extensive research on understanding human actions in RGB videos in recent years, the exploration of its 3D point cloud counterpart has been relatively limited. Furthermore, while RGB ego-view datasets are rapidly growing, 3D point cloud ego-view datasets are scarce at best. Existing 3D datasets are limited in several ways: some include actions that are distinguishable by full-body motion, while others use a distant static sensor that hinders the recognition of small objects. We introduce a new point cloud action recognition dataset, the IKEA Ego 3D dataset. It includes sequences of point clouds captured from an ego-view using a HoloLens 2 device. The dataset consists of approximately 493k frames and 56 classes of intricate furniture assembly actions of four different furniture types. We evaluate the performance of various state-of-the-art 3D action recognition methods on the proposed dataset and show that it is very challenging.

Keywords: 3D computer vision Algorithms Algorithms Datasets and evaluations

A*: Atrous Spatial Temporal Action Recognition for Real Time Applications

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Authors: Kim, Myeongjun; Spinola, Federica; Benz, Philipp; Kim, Tae-hoon
Affiliations: Deeping Source Inc, Seoul, South Korea

ISBN: (print) 9798350318920; 9798350318937

Deep learning has become a popular tool across various fields and is increasingly being integrated into real-world applications such as autonomous driving cars and surveillance cameras. One area of active research is recognizing human actions, including identifying unsafe or abnormal behaviors. Temporal information is crucial for action recognition tasks. Global context, as well as the target person, are also important for judging human behaviors. However, larger networks that can capture all of these features face difficulties operating in real-time. To address these issues, we propose A*: Atrous Spatial Temporal Action Recognition for Real Time Applications. A* includes four modules aimed at improving action detection networks. First, we introduce a Low-Level Feature Aggregation module. Second, we propose the Atrous Spatio-Temporal Pyramid Pooling module. Third, we suggest fusing all extracted image and video features in an Image-Video Feature Fusion module. Finally, we integrate the Proxy Anchor Loss for action features into the loss function. We evaluate A* on three common action detection benchmarks, and achieve state-of-the-art performance on JHMDB and UCF101-24, while staying competitive on AVA. Furthermore, we demonstrate that A* can achieve real-time inference speeds of 33 FPS, making it suitable for real-world applications.

Keywords: Algorithms Video recognition and understanding

Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Lu, Ming Y.; Chen, Bowen; Zhang, Andrew; Williamson, Drew F. K.; Chen, Richard J.; Ding, Tong; Le, Long Phi; Chuang, Yung-Sung; Mahmood, Faisal
Affiliations: MIT, Cambridge, MA 02139, USA; Harvard Univ, Cambridge, MA 02138, USA; Mass Gen Brigham, Boston, MA 02199, USA

ISBN: (print) 9798350301298

Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium sized-images, neither of which are applicable to the emerging field of computational pathology where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels. In this paper we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models on gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pre-train our text encoder. By effectively leveraging strong pre-trained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://***/mahmoodlab/MI-Zero.

Keywords: cell microscopy Medical and biological vision

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Liu, Jihao; Huang, Xin; Zheng, Jinliang; Liu, Yu; Li, Hongsheng
Affiliations: CUHK MMLab, Shenzhen, Peoples R China; SenseTime Res, Hong Kong, Peoples R China; InnoHK CPII, Hong Kong, Peoples R China

ISBN: (print) 9798350301298

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical vision Transformers. Existing masked image modeling (MIM) methods for hierarchical vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable for hierarchical vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to huge model size (to reach 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performances on the other 6 datasets show that MixMAE has better FLOPs / performance tradeoff than previous popular MIM methods.

Keywords: Self-supervised or unsupervised representation learning

Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Lu, Xiaocheng; Guo, Song; Liu, Ziming; Guo, Jingcai
Affiliations: Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China; Hong Kong Polytech Univ, Shenzhen Res Inst, Hong Kong, Peoples R China

ISBN: (print) 9798350301298

Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts formed by known states and objects during training. Existing methods either learn the combined state-object representation, challenging the generalization of unseen compositions, or design two classifiers to identify state and object separately from image features, ignoring the intrinsic relationship between them. To jointly eliminate the above issues and construct a more robust CZSL system, we propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features. Notably, being fused with the decomposed features, the image features can be more expressive for learning the relationship with states and objects, respectively, to improve the response of unseen compositions in the pair space, hence narrowing the domain gap between seen and unseen sets. Experimental results on three challenging benchmarks demonstrate that our approach significantly outperforms other state-of-the-art methods by large margins.

Keywords: Low-level vision

Masked Image Training for Generalizable Deep Image Denoising

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Chen, Haoyu; Gu, Jinjin; Liu, Yihao; Magid, Salma Abdel; Dong, Chao; Wang, Qiong; Pfister, Hanspeter; Zhu, Lei
Affiliations: Hong Kong Univ Sci & Technol Guangzhou, Hong Kong, Peoples R China; Shanghai AI Lab, Shanghai, Peoples R China; Univ Sydney, Sydney, Australia; Chinese Acad Sci, Shenzhen Inst Adv Technol, ShenZhen Key Lab Comp Vis & Pattern Recognit, Beijing, Peoples R China; Univ Chinese Acad Sci, Beijing, Peoples R China; Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Comp Vision & Virtual Rea, Beijing, Peoples R China; Harvard Univ, Cambridge, MA, USA; Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

ISBN: (print) 9798350301298

When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method.

Keywords: Low-level vision

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Chen, Zhihong; Zhang, Ruifei; Song, Yibing; Wan, Xiang; Li, Guanbin
Affiliations: Chinese Univ Hong Kong, Shenzhen, Peoples R China; Sun Yat Sen Univ, Guangzhou, Peoples R China; Shenzhen Res Inst Big Data, Shenzhen, Peoples R China; Tencent AI Lab, Shenzhen, Peoples R China; Fudan Univ, AI3 Inst, Shanghai, Peoples R China

ISBN: (print) 9798350301298

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study [27], where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at https://***/zhjohnchan/SK-VG.

Keywords: language reasoning vision

Context-aware Alignment and Mutual Masking for 3D-Language Pre-training

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Jin, Zhao; Hayat, Munawar; Yang, Yuwei; Guo, Yulan; Lei, Yinjie
Affiliations: Sichuan Univ, Chengdu, Peoples R China; Monash Univ, Melbourne, Vic, Australia; Sun Yat Sen Univ, Guangzhou, Peoples R China

ISBN: (print) 9798350301298

3D visual language reasoning plays an important role in effective human-computer interaction. The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training is still an open issue due to limited 3D-language paired data, highly sparse and irregular structure of point clouds and ambiguities in spatial relations of 3D objects with viewpoint changes. In this paper, we present a generic 3D-language pre-training approach, that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective constitutes two main parts. 1) Context aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts. It reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-Language Masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering. Our codes are available at https://***/leolyj/3D-VLP

Keywords: language reasoning vision

235 Daxue West Street, Saihan District, Hohhot, Inner Mongolia Autonomous Region. Postal code: 010021
