Vision Transformers (ViTs) have recently demonstrated remarkable performance in computer vision tasks. However, their parameter-intensive nature and reliance on large amounts of data for effective performance have shi...
Despite the remarkable success of deep learning systems over the last decade, a key difference still remains between neural-network and human decision-making: as humans, we can not only form a decision on the spot but also ponder, revisiting an initial guess from different angles, distilling relevant information, and arriving at a better decision. Here, we propose RecycleNet, a latent feature recycling method that instils in neural networks the capability to ponder and refine initial decisions over a number of recycling steps, where outputs are fed back into earlier network layers in an iterative fashion. This approach makes minimal assumptions about the neural network architecture and thus can be implemented in a wide variety of contexts. Using medical image segmentation as the evaluation environment, we show that latent feature recycling enables the network to iteratively refine initial predictions even beyond the iterations seen during training, converging towards an improved decision. We evaluate this across a variety of segmentation benchmarks and show consistent improvements, even compared with top-performing segmentation methods. This allows trading increased computation time for improved performance, which can be beneficial, especially for safety-critical applications.
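As a rough illustration of the recycling idea described above, the sketch below splits a segmentation network into early and late stages and repeatedly feeds the latent output back into the early features. The module names, the fusion-by-addition choice, and the point where the loop re-enters the network are assumptions made for illustration, not the published RecycleNet architecture.

```python
import torch
import torch.nn as nn

class RecyclingSegNet(nn.Module):
    """Toy latent-feature-recycling wrapper (illustrative, not the authors' RecycleNet)."""

    def __init__(self, early: nn.Module, late: nn.Module, head: nn.Module, feat_ch: int):
        super().__init__()
        self.early = early      # early layers: image -> features with feat_ch channels
        self.late = late        # remaining layers: features -> features with feat_ch channels
        self.head = head        # features -> per-pixel class logits
        # maps the recycled latent output back into the early feature space
        self.recycle = nn.Conv2d(feat_ch, feat_ch, kernel_size=1)

    def forward(self, x: torch.Tensor, n_recycle: int = 3) -> torch.Tensor:
        base = self.early(x)                 # features from the early layers
        feats = self.late(base)              # initial latent "decision"
        for _ in range(n_recycle):           # pondering: refine over recycling steps
            feats = self.late(base + self.recycle(feats))   # feed latents back into earlier layers
        return self.head(feats)


# Example usage with placeholder single-layer stages:
net = RecyclingSegNet(
    early=nn.Conv2d(1, 32, 3, padding=1),
    late=nn.Conv2d(32, 32, 3, padding=1),
    head=nn.Conv2d(32, 4, 1),
    feat_ch=32,
)
logits = net(torch.randn(2, 1, 64, 64), n_recycle=5)
```

Because the loop only touches latent tensors, the number of recycling steps at inference can exceed the number used during training, which is how the abstract describes trading extra computation for improved performance.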
Deep neural networks have achieved remarkable successes in learning feature representations for visual classification. However, deep features learned by the softmax cross-entropy loss generally show excessive intra-cl...
In this paper, we address zero-shot learning (ZSL), the problem of recognizing categories for which no labeled visual data are available during training. We focus on the transductive setting, in which unlabelled visua...
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audio-visual scene understanding. Each pixel in such images is characterized b...
Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to render desired attributes, which has received significant attention due to its broad practical applications ranging from digital e...
It is challenging to learn discriminative representations from images and videos because of the large local redundancy and complex global dependencies in these visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local token affinity in shallow layers and global token affinity in deep layers, allowing it to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our blocks into a new powerful backbone and adopt it for various vision tasks, from image to video domains and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on the ImageNet-1K classification task. With only ImageNet-1K pre-training, it achieves state-of-the-art performance on a broad range of downstream tasks: 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Something-Something V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. Moreover, we build an efficient UniFormer with a concise hourglass design of token shrinking and recovering, which achieves 2-4× higher throughput than recent lightweight models. Code is available.
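The following sketch illustrates the kind of block the abstract describes: a transformer-format block whose relation aggregator is a local, convolution-based token affinity in shallow stages and global self-attention in deep stages. Normalization layers, kernel sizes, and the FFN expansion ratio are illustrative assumptions rather than the released UniFormer implementation.

```python
import torch
import torch.nn as nn

class UniFormerStyleBlock(nn.Module):
    """Transformer-format block with a swappable relation aggregator (illustrative)."""

    def __init__(self, dim: int, num_heads: int = 8, local: bool = True):
        super().__init__()
        self.local = local
        # depth-wise conv as a simple positional encoding
        self.pos = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm2d(dim) if local else nn.GroupNorm(1, dim)
        if local:
            # local token affinity: aggregate over a small neighborhood with convolutions
            self.aggregator = nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=1),
                nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim),
                nn.Conv2d(dim, dim, kernel_size=1),
            )
        else:
            # global token affinity: self-attention over all tokens
            self.aggregator = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.GroupNorm(1, dim)
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, dim * 4, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * 4, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        x = x + self.pos(x)
        if self.local:
            x = x + self.aggregator(self.norm1(x))
        else:
            b, c, h, w = x.shape
            tokens = self.norm1(x).flatten(2).transpose(1, 2)   # (B, H*W, C)
            attended, _ = self.aggregator(tokens, tokens, tokens)
            x = x + attended.transpose(1, 2).reshape(b, c, h, w)
        return x + self.ffn(self.norm2(x))
```

Shallow stages would stack `local=True` blocks and deep stages `local=False` blocks, mirroring the shallow/deep split of local and global token affinity described in the abstract.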
In the image fusion field, the design of deep learning-based fusion methods is far from routine. It is invariably fusion-task specific and requires a careful consideration. The most difficult part of the design is to ...
ISBN: 9781713845393 (print)
In the Vision-and-Language Navigation (VLN) task, an agent is asked to navigate inside 3D indoor environments following given instructions. Cross-modal alignment is one of the most critical challenges in VLN because the predicted trajectory needs to match the given instruction accurately. In this paper, we address the cross-modal alignment challenge from a fine-grained perspective. First, to alleviate the weak cross-modal alignment supervision provided by coarse-grained data, we introduce a human-annotated fine-grained VLN dataset, namely Landmark-RxR. Second, to further enhance local cross-modal alignment under fine-grained supervision, we investigate focal-oriented rewards in soft and hard forms, focusing on the critical points sampled from the fine-grained Landmark-RxR. Moreover, to evaluate the navigation process more fully, we also propose a re-initialization mechanism that makes metrics insensitive to difficult points, which can cause the agent to deviate from the correct trajectories. Experimental results show that our agent has superior navigation performance on Landmark-RxR, en-RxR and R2R.
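The soft and hard focal-oriented rewards could, for instance, be shaped around the sampled critical points as sketched below; the distance measure, the threshold, and the function names are assumptions made for illustration, since the abstract does not give the exact reward definitions.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # a 3D position in the navigation environment

def euclidean(a: Point, b: Point) -> float:
    return math.dist(a, b)

def soft_focal_reward(prev_pos: Point, cur_pos: Point, critical_point: Point) -> float:
    """Soft form: reward the reduction in distance to the current critical landmark point."""
    return euclidean(prev_pos, critical_point) - euclidean(cur_pos, critical_point)

def hard_focal_reward(cur_pos: Point, critical_point: Point, threshold: float = 3.0) -> float:
    """Hard form: binary reward for coming within `threshold` meters of the critical point."""
    return 1.0 if euclidean(cur_pos, critical_point) <= threshold else 0.0
```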
Current pre-trained language models (PLM) are typically trained with static data, ignoring that in real-world scenarios, streaming data of various sources may continuously grow. This requires PLMs to integrate the inf...