ISBN (print): 9781665445092
We present a new vision-language (VL) pre-training model dubbed Kaleido-BERT, which introduces a novel kaleido strategy for fashion cross-modality representations from transformers. In contrast to the random masking strategy of recent VL models, we design alignment-guided masking to jointly focus more on image-text semantic relations. To this end, we carry out five novel tasks, i.e., rotation, jigsaw, camouflage, grey-to-color, and blank-to-color, for self-supervised VL pre-training on patches of different scales. Kaleido-BERT is conceptually simple and easy to extend to the existing BERT framework, and it attains state-of-the-art results by large margins on four downstream tasks, including text retrieval (R@1: 4.03% absolute improvement), image retrieval (R@1: 7.13% absolute improvement), category recognition (ACC: 3.28% absolute improvement), and fashion captioning (BLEU-4: 1.2 absolute improvement). We validate the efficiency of Kaleido-BERT on a wide range of e-commerce websites, demonstrating its broader potential in real-world applications.
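To make the multi-scale idea concrete, below is a minimal sketch of how an image could be split into the five patch scales the abstract mentions (1x1 up to 5x5 grids). The function name kaleido_patches and the exact grid handling are our illustrative assumptions, not the paper's released code.

```python
import numpy as np

def kaleido_patches(image, scales=(1, 2, 3, 4, 5)):
    """Split an image into multi-scale grids: 1x1, 2x2, ..., 5x5 patches."""
    h, w = image.shape[:2]
    patches = {}
    for s in scales:
        ph, pw = h // s, w // s  # equal-sized cells; remainders are cropped
        patches[s] = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
                      for i in range(s) for j in range(s)]
    return patches

# A 224x224 image yields 1 + 4 + 9 + 16 + 25 = 55 patches in total.
img = np.zeros((224, 224, 3), dtype=np.uint8)
print({s: len(p) for s, p in kaleido_patches(img).items()})
```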
ISBN (print): 9798350388978; 9798350388961
In sign languages, where communication is achieved through hand gestures, facial expressions, and body language, signs are the subject of many studies due to the diversity in the positions of different body parts. These diversities are also encountered in emotion detection in Turkish Sign Language (TID), making direct translation of hand gestures inadequate for emotion detection. Accordingly, in this study, for the first time in the literature, sentiment analysis in TID was performed using both facial expressions and hand gestures. For this purpose, a model specialized for emotion extraction from facial expressions and gesture recognition from hand gestures was fine-tuned on the dataset collected in this study. As a result, facial expressions were found to be more significant than hand gestures for sentiment analysis in TID, but performance improved further when they were supported with hand gestures.
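The finding that face cues dominate but hand cues still help suggests a simple late-fusion baseline. The sketch below is our illustration of that idea, assuming two already-trained per-modality classifiers; face_weight=0.7 is an arbitrary illustrative value, not the study's tuned setting.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_sentiment(face_logits, hand_logits, face_weight=0.7):
    """Weighted late fusion; the face branch gets the larger vote,
    reflecting the abstract's finding. face_weight is illustrative."""
    probs = (face_weight * softmax(face_logits)
             + (1 - face_weight) * softmax(hand_logits))
    return int(np.argmax(probs)), probs

# Three toy sentiment classes: negative / neutral / positive.
label, probs = fuse_sentiment(np.array([0.2, 1.5, 0.1]),
                              np.array([1.0, 0.3, 0.2]))
print(label, probs.round(2))
```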
Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low power consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current top-performing methods often ignore specific event-data properties, leading to generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, which improves our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still benefiting from event-data sparsity to increase its efficiency. Additionally, we show how our system can work with different data modalities and propose task-specific output heads for event-stream classification (i.e., action recognition) and per-pixel prediction (dense depth estimation). Evaluation results show better performance than the state of the art while requiring minimal computational resources, on both GPU and CPU.
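To illustrate how a patch-based representation can exploit event sparsity, here is a minimal sketch that voxelizes raw events and keeps only the patches that actually received events. All names, shapes, and thresholds are our assumptions; the paper's actual tokenization is more refined.

```python
import numpy as np

def event_patches(events, sensor_hw=(128, 128), patch=16, bins=4):
    """Voxelize (t, x, y, polarity) events, keep only active patches."""
    H, W = sensor_hw
    t, x, y, p = events.T
    t_idx = np.minimum((bins * (t - t.min()) / max(np.ptp(t), 1e-9)).astype(int),
                       bins - 1)
    grid = np.zeros((bins, H, W))
    np.add.at(grid, (t_idx, y.astype(int), x.astype(int)),
              np.where(p > 0, 1.0, -1.0))
    tokens, positions = [], []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            cell = grid[:, i:i + patch, j:j + patch]
            if np.abs(cell).sum() > 0:  # empty patches produce no token
                tokens.append(cell.reshape(-1))
                positions.append((i // patch, j // patch))
    return np.stack(tokens), positions

events = np.array([[0.0, 5, 7, 1], [0.5, 6, 7, -1], [1.0, 100, 90, 1]])
tok, pos = event_patches(events)
print(tok.shape, pos)  # 2 active patches out of 64
```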
ISBN (print): 9781665448994
Recent years have seen a surge in work finding associations between faces and voices within cross-modal biometric applications and speaker recognition. Inspired by this, we introduce the challenging task of establishing associations between faces and voices across multiple languages spoken by the same set of persons. The aim of this paper is to answer two closely related questions: "Is face-voice association language independent?" and "Can a speaker be recognized irrespective of the spoken language?". These two questions are important for understanding the effectiveness of, and boosting the development of, multilingual biometric systems. To answer them, we collected a Multilingual Audio-Visual dataset containing human speech clips of 154 identities with three language annotations, extracted from various videos uploaded online. Extensive experiments on the two splits of the proposed dataset were performed to investigate and answer these novel research questions, which clearly point out the relevance of the multilingual problem.
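Cross-modal verification of this kind typically reduces to comparing embeddings in a shared space. The sketch below shows only that comparison step, assuming face and voice encoders trained elsewhere (e.g., with a contrastive loss); the threshold and embedding size are illustrative.

```python
import numpy as np

def same_identity(face_emb, voice_emb, threshold=0.5):
    """Cosine similarity between a face and a voice embedding,
    assumed to live in a shared space learned elsewhere."""
    cos = float(face_emb @ voice_emb /
                (np.linalg.norm(face_emb) * np.linalg.norm(voice_emb)))
    return cos >= threshold, cos

rng = np.random.default_rng(0)
face = rng.standard_normal(256)
voice = face + 0.1 * rng.standard_normal(256)  # near-match for the demo
print(same_identity(face, voice))
```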
ISBN (print): 9798350344868; 9798350344851
In many real-world applications of Deep Neural Networks (DNNs) in visual recognition, data augmentation stands out as a premier tool for enhancing model robustness. Stemming from an understanding of the common mechanisms of data augmentation methods, we introduce the mask-based "data augmentation boost" (DaBoost) method, a strategic approach that exploits control over game-interaction strength. Our empirical results are telling: DaBoost not only consistently surpasses the state-of-the-art PixMix method but also achieves impressive robustness metrics, with a vanilla WideResNet registering a mere 6.5% mCE and a 2.3% RMS calibration error on CIFAR-10. An intriguing observation from our study is the Long-Rope Effect: penalizing high-order interactions inadvertently boosts mid-order interactions, mirroring patterns inherent to human cognitive processes. This interplay hints at potential avenues for further optimizing DNNs' performance.
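As a rough picture of what a mask-based augmentation does, the sketch below zeroes out random grid cells of an input image. This is a generic baseline in the spirit of the abstract; DaBoost's actual masks are designed to control game-interaction strength, which this sketch does not attempt.

```python
import numpy as np

def mask_augment(image, grid=8, drop_prob=0.3, rng=None):
    """Zero out a random subset of grid cells (generic masking baseline)."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    ch, cw = h // grid, w // grid
    out = image.copy()
    for i in range(grid):
        for j in range(grid):
            if rng.random() < drop_prob:
                out[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw] = 0
    return out

img = np.ones((32, 32, 3), dtype=np.float32)
aug = mask_augment(img, rng=np.random.default_rng(0))
print(f"{(aug == 0).mean():.0%} of pixels masked")
```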
ISBN (print): 9781665445092
Existing research on action recognition treats activities as monolithic events occurring in videos. Recently, formulating actions as combinations of atomic actions has shown promise in improving action understanding, with the emergence of datasets containing such annotations allowing us to learn representations that capture this information. However, there remains a lack of studies that extend action composition and leverage multiple viewpoints and multiple modalities of data for representation learning. To promote research in this direction, we introduce Home Action Genome (HOMAGE): a multi-view action dataset with multiple modalities and viewpoints, supplemented with hierarchical activity and atomic-action labels together with dense scene composition labels. Leveraging this rich multi-modal and multi-view setting, we propose Cooperative Compositional Action Understanding (CCAU), a cooperative learning framework for hierarchical action recognition that is aware of compositional action elements. CCAU shows consistent performance improvements across all modalities. Furthermore, we demonstrate the utility of co-learning compositions in few-shot action recognition, achieving 28.6% mAP with just a single sample.
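One simple way to co-learn hierarchical labels is to combine a video-level activity loss with an atomic-action loss across modalities. The toy sketch below shows such a combination; the weighting and the per-modality loss values are placeholders, not CCAU's actual objective.

```python
import numpy as np

def hierarchical_loss(video_losses, atomic_losses, alpha=0.5):
    """Average per-modality activity and atomic-action losses, then mix.
    alpha is a placeholder weighting, not CCAU's actual objective."""
    return (1 - alpha) * np.mean(video_losses) + alpha * np.mean(atomic_losses)

# Toy cross-entropy values for, e.g., RGB and audio streams.
print(hierarchical_loss(video_losses=[0.9, 1.1], atomic_losses=[0.4, 0.6]))
```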
ISBN (print): 9781665445092
Blind face restoration usually relies on facial priors, such as a facial geometry prior or a reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer an accurate geometric prior, while high-quality references are often inaccessible, limiting applicability in real-world scenarios. In this work, we propose GFP-GAN, which leverages the rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, GFP-GAN can jointly restore facial details and enhance colors with just a single forward pass, whereas GAN inversion methods require image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.
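Spatial feature transform (SFT) layers modulate features with spatially varying scale and shift maps. The sketch below shows only that modulation step, with the scale/shift maps passed in directly rather than predicted from the generative prior by convolutions, as a full implementation would do.

```python
import numpy as np

def spatial_feature_transform(features, scale, shift):
    """SFT modulation: element-wise affine transform of feature maps."""
    return features * scale + shift

feat = np.random.randn(64, 32, 32)  # (channels, H, W) decoder features
scale = 1.0 + 0.1 * np.random.randn(*feat.shape)  # stand-ins for maps that
shift = 0.1 * np.random.randn(*feat.shape)        # convs would predict
print(spatial_feature_transform(feat, scale, shift).shape)
```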
ISBN (print): 9781665448994
One of the major challenges of style transfer is appropriate supervision of image features between the output image and the input images (style and content). An efficient strategy would be to define an object map between the objects of the style and content images. However, such a mapping is not well established when the style and content images contain semantic objects of different types and numbers. This also leads to content mismatch in the style transfer output, which can reduce the visual quality of the results. We propose an object-based style transfer approach, called DeepObjStyle, for style supervision in a training-data-independent framework. DeepObjStyle preserves the semantics of the objects and achieves better style transfer in the challenging scenario where the style and content images have a mismatch of image features. We also perform style transfer on images containing a word cloud to demonstrate that DeepObjStyle enables appropriate image-feature supervision. We validate the results using quantitative comparisons and user studies.
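Object-level style supervision can be approximated by restricting the usual Gram-matrix style statistics to an object's mask. The sketch below computes such a masked Gram matrix; how style and content objects are paired, which is the paper's harder problem, is not shown.

```python
import numpy as np

def masked_gram(features, mask):
    """Gram matrix of (C, H, W) features restricted to a boolean
    (H, W) object mask, normalized by the region size."""
    C = features.shape[0]
    f = features.reshape(C, -1)[:, mask.ravel()]
    return f @ f.T / max(f.shape[1], 1)

feat = np.random.randn(8, 16, 16)
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True  # one object's region
print(masked_gram(feat, mask).shape)  # (8, 8)
```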
ISBN (print): 9781665445092
In this paper, we present ViP-DeepLab, a unified model attempting to tackle the long-standing and challenging inverse projection problem in vision, which we model as restoring point clouds from perspective image sequences while providing each point with instance-level semantic interpretations. Solving this problem requires vision models to predict the spatial location, semantic class, and temporally consistent instance label for each 3D point. ViP-DeepLab approaches it by jointly performing monocular depth estimation and video panoptic segmentation. We name this joint task Depth-aware Video Panoptic Segmentation and propose a new evaluation metric along with two derived datasets for it, which will be made available to the public. On the individual sub-tasks, ViP-DeepLab also achieves state-of-the-art results, outperforming previous methods by 5.1% VPQ on Cityscapes-VPS, and ranking 1st on the KITTI monocular depth estimation benchmark and on KITTI MOTS pedestrian tracking. The datasets and evaluation code are publicly available.
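The inverse projection step itself is standard pinhole geometry: each pixel with a depth estimate lifts to a 3D point that inherits the pixel's panoptic label. The sketch below shows this back-projection under assumed known camera intrinsics; it is not the paper's code.

```python
import numpy as np

def backproject(depth, panoptic, fx, fy, cx, cy):
    """Lift a depth map to a point cloud; each 3D point inherits the
    panoptic label of its source pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = depth * (u - cx) / fx
    y = depth * (v - cy) / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points, panoptic.reshape(-1)

depth = np.full((4, 4), 2.0)           # toy depth map: 2 m everywhere
pan = np.arange(16).reshape(4, 4) % 3  # toy panoptic IDs
pts, labels = backproject(depth, pan, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(pts.shape, labels.shape)  # (16, 3) (16,)
```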
ISBN (print): 9781665445092
Image and video descriptors are an omnipresent tool in computer vision and its application fields, such as mobile robotics. Many hand-crafted and, in particular, learned image descriptors are numerical vectors with a potentially (very) large number of dimensions. Practical considerations like memory consumption or comparison time call for compact representations. In this paper, we use hyperdimensional computing (HDC) as an approach to systematically combine information from a set of vectors into a single vector of the same dimensionality. HDC is an established technique for symbolic processing with distributed representations in numerical vectors with thousands of dimensions. We present an HDC implementation that is suitable for processing the output of existing and future (deep-learning-based) image descriptors. We discuss how this can be used as a framework to process descriptors together with additional knowledge using simple and fast vector operations. A concrete outcome is a novel HDC-based approach that aggregates a set of local image descriptors, together with their image positions, into a single holistic descriptor. Comparison with available holistic descriptors and aggregation methods on a series of standard mobile-robotics place-recognition experiments shows a 20% improvement in average performance and more than 2x better worst-case performance than the runner-up.
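The core HDC operations are binding (element-wise multiplication with a pseudo-random vector) and bundling (summation). The sketch below aggregates local descriptors with their grid positions this way; it assumes descriptors already live in the high-dimensional space, glossing over the projection step a real pipeline would need.

```python
import numpy as np

def hdc_aggregate(descriptors, positions, dim=4096, seed=0):
    """Bind each descriptor to a random bipolar position vector
    (element-wise product), bundle by summation, then binarize."""
    rng = np.random.default_rng(seed)
    codebook = {}
    holistic = np.zeros(dim)
    for desc, pos in zip(descriptors, positions):
        if pos not in codebook:  # one fixed random vector per grid position
            codebook[pos] = rng.choice([-1.0, 1.0], size=dim)
        holistic += desc * codebook[pos]  # bind, then bundle
    return np.sign(holistic)

descs = [np.sign(np.random.randn(4096)) for _ in range(3)]
agg = hdc_aggregate(descs, positions=[(0, 0), (0, 1), (1, 0)])
print(agg.shape)  # (4096,) holistic descriptor
```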