ISBN (print): 9798350365474
Federated Learning (FL) enables multiple machines to collaboratively train a machine learning model without sharing private training data. Yet, especially for heterogeneous models, a key bottleneck remains transferring the knowledge gained by each client model to the server. One popular method, FedDF, tackles this task with distillation over a common, shared dataset on which predictions are exchanged. However, in many contexts such a dataset may be difficult to acquire due to privacy concerns, and the clients may not allow storage of a large shared dataset. To this end, in this paper we introduce a new method that improves this knowledge distillation scheme to rely on only a single image shared between the clients and the server. In particular, we propose a novel adaptive dataset pruning algorithm that selects the most informative crops generated from that single image. With this, we show that federated learning with distillation under a limited shared-dataset budget works better with a single image than with multiple individual ones. Finally, we extend our approach to training heterogeneous client architectures by incorporating a non-uniform distillation schedule and client-model mirroring on the server side.
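As a concrete illustration of the single-image pruning idea, the sketch below generates candidate crops from one shared image and keeps those on which a client ensemble is most informative. The crop generator, the entropy-based informativeness score, and all function names are assumptions made for illustration; the abstract does not specify the paper's exact pruning criterion.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

def generate_crops(image, n_crops=512, size=224):
    """Sample random crops from a single shared image (hypothetical helper)."""
    cropper = transforms.RandomResizedCrop(size, scale=(0.05, 0.6))
    return torch.stack([cropper(image) for _ in range(n_crops)])

@torch.no_grad()
def prune_crops(crops, client_models, budget=128):
    """Keep the `budget` crops on which the client ensemble is most informative.

    Informativeness is scored here as the entropy of the averaged client
    predictions (an assumption; the paper's criterion may differ).
    """
    probs = torch.stack([F.softmax(m(crops), dim=-1) for m in client_models]).mean(0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    keep = entropy.topk(budget).indices
    return crops[keep]
```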
ISBN (print): 9798350365474
The detection and recognition of distracted driving behaviors has emerged as a new vision task with the rapid development of computer vision, and is considered a challenging temporal action localization (TAL) problem. The primary goal of temporal localization is to determine the start and end times of actions in untrimmed videos. Currently, most state-of-the-art temporal localization methods adopt complex architectures, which are cumbersome and time-consuming. In this paper, we propose a robust and efficient two-stage framework for distracted-behavior classification and localization based on the sliding-window approach, suitable for untrimmed naturalistic driving videos. To address the high similarity among different behaviors and the interference from background classes, we propose a multi-view fusion and adaptive thresholding algorithm that effectively reduces missed detections. To address fuzzy behavior-boundary localization, we design a post-processing procedure that refines coarse localization into fine localization through post-connection and candidate-behavior merging criteria. In the AICITY2024 Task3 TestA, our method performs well, achieving an Average Intersection over Union (AIOU) of 0.6080 and ranking eighth in AICITY2024 Task3. Our code will be released in the near future.
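A minimal sketch of the sliding-window pipeline described above: per-window class scores from multiple camera views are fused by averaging, an adaptive per-class threshold suppresses background interference, and consecutive surviving windows of the same class are merged into segments. The relative-threshold rule and merging criteria are assumptions; the paper's exact post-processing may differ.

```python
import numpy as np

def localize(view_probs, win_starts, win_len, rel_thresh=0.5):
    """Coarse-to-fine localization sketch (details assumed, not the paper's exact rules).

    view_probs: list of (n_windows, n_classes) score arrays, one per camera view.
    Returns [start, end, class] segments.
    """
    probs = np.mean(view_probs, axis=0)        # multi-view fusion
    thresh = rel_thresh * probs.max(axis=0)    # adaptive, per-class threshold
    labels = probs.argmax(axis=1)
    keep = probs.max(axis=1) >= thresh[labels]

    segments, cur = [], None
    for i, start in enumerate(win_starts):
        if keep[i]:
            if cur and cur[2] == labels[i] and start <= cur[1]:  # overlap: extend
                cur[1] = start + win_len
            else:
                if cur:
                    segments.append(cur)
                cur = [start, start + win_len, labels[i]]
        elif cur:
            segments.append(cur)
            cur = None
    if cur:
        segments.append(cur)
    return segments
```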
ISBN (print): 9798350365474
In this paper, we introduce an approach for recognizing and classifying gestures that accompany mathematical terms, in a new collection we name the "GAMT" dataset. Our method uses language as a means of providing context to classify gestures. Specifically, we use a CLIP-style framework to construct a shared embedding space for gestures and language, experimenting with various methods for encoding gestures within this space. We evaluate our method on our new dataset, which contains a wide array of gestures associated with mathematical terms. The shared embedding space leads to a substantial improvement in gesture classification. Furthermore, we identify an efficient model that excels at classifying gestures from our unique dataset, contributing to the further development of gesture recognition in diverse interaction scenarios.
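The "CLIP-style framework" referenced above is commonly implemented as a symmetric contrastive (InfoNCE) loss over matched pairs; a minimal sketch, with the gesture and text encoders left abstract:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(gesture_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched gesture/text pairs (standard CLIP recipe;
    the encoders producing these embeddings are left abstract)."""
    g = F.normalize(gesture_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature                   # (B, B) similarity matrix
    targets = torch.arange(len(g), device=g.device)  # i-th gesture <-> i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```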
ISBN (print): 9798350365474
Compound Expression Recognition (CER), a sub-field of affective computing, is a novel task in intelligent human-computer interaction and multimodal user interfaces. We propose a novel audio-visual method for CER. Our method relies on emotion recognition models that fuse modalities at the emotion-probability level, while decisions regarding the prediction of compound expressions are based on the pair-wise sum of weighted emotion probability distributions. Notably, our method does not use any training data specific to the target task, so the problem is a zero-shot classification task. The method is evaluated in multi-corpus training and cross-corpus validation setups. We achieve F1 scores of 32.15% and 25.56% on the AffWild2 and C-EXPR-DB test subsets, respectively, without training on the target corpus or target task. Our method is therefore on par with methods trained on the target corpus or target task. The source code is publicly available at https://***/AVCER/.
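The zero-shot decision rule described above (pair-wise sums of weighted basic-emotion probabilities) can be sketched as follows; the variable names and the candidate-pair list are illustrative, not taken from the paper:

```python
import numpy as np

def predict_compound(emotion_probs, weights, compound_pairs):
    """Zero-shot compound prediction sketch: score each candidate emotion pair
    by the sum of its weighted basic-emotion probabilities.

    emotion_probs:  (n_emotions,) fused probabilities from audio-visual models.
    weights:        (n_emotions,) per-emotion weights.
    compound_pairs: list of (i, j) index pairs defining compound expressions.
    """
    w = weights * emotion_probs
    scores = [w[i] + w[j] for i, j in compound_pairs]
    return compound_pairs[int(np.argmax(scores))]
```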
ISBN (print): 9798350365474
Padel is a rapidly growing racquet sport that has gained global popularity due to its accessibility and exciting gameplay dynamics. Effective coordination between teammates hinges on maintaining an appropriate distance, allowing for seamless transitions between offensive and defensive maneuvers. Balanced inter-player and player-to-net distances not only facilitate efficient communication but also enhance the team's ability to exploit openings in the opponent's defense while minimizing vulnerabilities. We introduce a new open dataset of padel rallies with annotations for hits and player-ball interactions, a predictive model for detecting hits from audio signals, a re-identification algorithm for pose tracking, and a framework for calculating inter-player and player-net distances during rallies. Our predictive model achieves an average F1-score of 92% for hit detection, demonstrating robust performance across different match conditions. Furthermore, we develop a system for accurately assigning hits to individual players, achieving an overall accuracy of 83.70% for player-specific assignment and 86.83% for team-based assignment.
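A minimal sketch of the distance-computation step, assuming player positions have already been projected onto the court plane (e.g., via a homography from tracked keypoints, a detail the abstract does not specify); the coordinate convention and net position are assumptions:

```python
import numpy as np

NET_Y = 10.0  # net line in court coordinates (metres); assumed convention

def rally_distances(p1_xy, p2_xy):
    """Per-frame inter-player and player-to-net distances for one team.

    p1_xy, p2_xy: (T, 2) court-plane trajectories of the two teammates.
    """
    inter = np.linalg.norm(p1_xy - p2_xy, axis=1)  # teammate separation
    net1 = np.abs(p1_xy[:, 1] - NET_Y)             # player 1 to net
    net2 = np.abs(p2_xy[:, 1] - NET_Y)             # player 2 to net
    return inter, net1, net2
```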
ISBN (print): 9798350365474
Image copy detection is one of the pivotal tools for safeguarding online information integrity. The challenge lies in determining whether a query image is an edited copy, which necessitates identifying candidate source images through a retrieval process. This process requires discriminative features comprising both global descriptors, designed to be augmentation-invariant, and local descriptors that capture salient foreground objects, in order to assess whether a query image is an edited copy of some source reference image. This work describes an end-to-end solution that leverages a vision Transformer model to learn such discriminative features and perform implicit matching between the query image and the reference image. Experimental results on two benchmark datasets demonstrate that the proposed solution outperforms state-of-the-art methods. Case studies illustrate the effectiveness of our approach in matching reference images from which the query images have been copy-edited.
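In its simplest form, the retrieval stage described above reduces to nearest-neighbor search over augmentation-invariant global descriptors; a generic sketch follows (the paper's ViT additionally performs implicit query-reference matching, which is not reproduced here):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_candidates(query_feats, ref_feats, k=10):
    """Candidate retrieval for copy detection via cosine similarity between
    global descriptors (generic sketch, not the paper's full pipeline)."""
    q = F.normalize(query_feats, dim=-1)
    r = F.normalize(ref_feats, dim=-1)
    sims = q @ r.T                       # (n_query, n_ref) similarity matrix
    scores, idx = sims.topk(k, dim=-1)   # top-k reference candidates per query
    return scores, idx
```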
ISBN (print): 9798350365474
Handwritten Document Recognition (HDR) has emerged as a challenging task integrating text and layout-information recognition to tackle manuscripts end-to-end. Despite advancements, the computational efficiency of processing entire documents remains a critical challenge, limiting the practical applicability of these models. This paper presents the Document Attention Network for Computationally Efficient Recognition (DANCER). The model differs from existing approaches through its unique encoder-decoder structure, where the encoder reduces spatial redundancy and enhances spatial attention, and the decoder, comprising transformer layers, efficiently decodes the text using optimized attention operations. This design results in a fast, memory-efficient model capable of effectively transcribing and understanding complex manuscript layouts. We evaluate DANCER's efficacy on the ICFHR 2016 READ competition dataset, focusing on recognizing single- and double-page historical documents. We demonstrate that DANCER can triple the training batch size compared to prior models within the same memory limits and reduce memory usage by up to 65% without compromising recognition quality. The proposed approach sets new standards in efficiency and accuracy for HDR solutions, paving the way for practical and scalable applications in diverse contexts.
ISBN (print): 9798350365474
By using few-shot data and labels, prompt learning obtains optimal prompts capable of achieving high performance on downstream tasks. Existing prompt learning methods generate high-quality prompts that are suitable for downstream tasks but tend to perform poorly when only very limited data (e.g., one shot) is available. We address this challenging one-shot scenario and propose a novel architecture for prompt learning, called the Image-Text Feature Alignment Branch (ITFAB). ITFAB pulls text features closer to the centroids of image features and separates text features of different classes to resolve misalignment in the feature space, thereby facilitating the acquisition of high-quality prompts with very limited data. In the one-shot setting, our method outperforms the existing CoOp and CoCoOp methods and in some cases even surpasses CoCoOp's 16-shot performance. Tests on different datasets and domains show that ITFAB almost matches CoCoOp's effectiveness. It also works with current prompt learning methods such as MapLe and PromptSRC, improving their performance in the one-shot setting.
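A hedged sketch of the alignment objective the abstract describes: pull each class's text feature toward the centroid of its image features and push text features of different classes apart. The specific loss form and the margin are assumptions; the paper's exact ITFAB formulation may differ.

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_feats, image_feats, image_labels, margin=0.5):
    """Centroid-alignment sketch (assumed loss form, not the paper's exact ITFAB).

    text_feats:  (C, D) one prompt-derived feature per class.
    image_feats: (N, D) few-shot image features with labels in [0, C).
    """
    C = text_feats.size(0)
    centroids = torch.stack(
        [image_feats[image_labels == c].mean(0) for c in range(C)])
    attract = (1 - F.cosine_similarity(text_feats, centroids)).mean()

    t = F.normalize(text_feats, dim=-1)
    sim = t @ t.T
    off_diag = sim - torch.eye(C, device=sim.device)  # zero out self-similarity
    repel = F.relu(off_diag - margin).mean()          # penalize close class pairs
    return attract + repel
```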
ISBN (print): 9798350365474
Conversational facial expression recognition entails challenges such as handling facial dynamics, small available datasets, low-intensity and fine-grained emotional expressions, and extreme face angles. To address these challenges, we propose Masking Action Units and Reconstructing multiple Angles (MAURA) pre-training. MAURA is an efficient self-supervised method that permits the use of small datasets while preserving end-to-end conversational facial expression recognition with a vision Transformer. MAURA masks videos at the locations of active Action Units and reconstructs synchronized multi-view videos, thus learning the dependencies between muscle movements and encoding information that might only be visible in a few frames and/or in certain views. Based on one view (e.g., frontal), the encoder reconstructs other views (e.g., top, down, laterals). Such a masking-and-reconstruction strategy provides a powerful representation that is beneficial in facial-expression downstream tasks. Our experimental analysis shows that we consistently outperform the state of the art in the challenging settings of low-intensity and fine-grained conversational facial expression recognition on four datasets, including in-the-wild DFEW, CMU-MOSEI, MFA, and multi-view MEAD. Our results suggest that MAURA learns robust and generic video representations.
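A minimal sketch of AU-guided masking, assuming per-frame Action Unit heatmaps are available; the sampling rule (mask patches in proportion to AU activity) is an assumption, as the abstract only states that masking follows active AU locations:

```python
import torch

def au_guided_mask(au_heatmap, n_patches=14, mask_ratio=0.75):
    """Choose patch-grid cells to mask, biased toward active AU regions.

    au_heatmap: (H, W) Action Unit activation map for a frame.
    Returns indices into the n_patches*n_patches grid to mask.
    """
    # Pool the heatmap onto the patch grid and sample masks proportional
    # to AU activity (sampling rule is an assumption).
    grid = torch.nn.functional.adaptive_avg_pool2d(
        au_heatmap[None, None], n_patches).flatten()
    n_mask = int(mask_ratio * grid.numel())
    probs = (grid + 1e-6) / (grid + 1e-6).sum()
    return torch.multinomial(probs, n_mask, replacement=False)
```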
ISBN (print): 9798350365474
Face Image Quality Assessment (FIQA) estimates the utility of face images for automated face recognition (FR) systems. In this work, we propose a novel approach to assessing the quality of face images based on inspecting the changes required in the pre-trained FR model weights to minimize differences between testing samples and the distribution of the FR training dataset. To achieve that, we quantify the discrepancy in Batch Normalization statistics (BNS), including mean and variance, between those recorded during FR training and those obtained by processing testing samples through the pre-trained FR model. We then generate gradient magnitudes of the pre-trained FR weights by backpropagating the BNS discrepancy through the pre-trained model. The cumulative absolute sum of these gradient magnitudes serves as the face image quality score in our approach. Through comprehensive experimentation, we demonstrate the effectiveness of our training-free and quality-labeling-free approach, achieving performance competitive with recent state-of-the-art FIQA approaches without relying on quality labeling, training regression networks, specialized architectures, or designing and optimizing specific loss functions.
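Since the approach is training-free, it can be sketched compactly in PyTorch: compute the discrepancy between each BatchNorm layer's stored statistics and the statistics of the test sample's features, backpropagate it, and sum the absolute weight gradients. The L2 discrepancy form below is an assumption; the paper may weight or normalize the terms differently.

```python
import torch

def bns_quality_score(model, image):
    """Training-free FIQ sketch: BatchNorm-statistics discrepancy is
    backpropagated to the weights, and the cumulative absolute gradient
    magnitude serves as the quality score (L2 discrepancy is assumed)."""
    model.eval()        # use stored running statistics in forward passes
    model.zero_grad()
    feats, hooks = {}, []

    def hook(mod, inp, out):
        x = inp[0]      # features entering this BN layer
        mu = x.mean(dim=(0, 2, 3))
        var = x.var(dim=(0, 2, 3), unbiased=False)
        feats[mod] = ((mu - mod.running_mean) ** 2 +
                      (var - mod.running_var) ** 2).sum()

    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(hook))

    model(image.unsqueeze(0))                       # single test sample
    loss = torch.stack(list(feats.values())).sum()  # total BNS discrepancy
    loss.backward()
    for h in hooks:
        h.remove()
    return sum(p.grad.abs().sum().item() for p in model.parameters()
               if p.grad is not None)
```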