ISBN: 9798350365474 (print)
Naturalistic driving action recognition and computer vision techniques provide crucial solutions for identifying and eliminating distracted driving behavior. Existing methods often extract features through fixed-size sliding windows and predict an action's start and end times. However, the information in a fixed-size window may be incomplete or redundant, and the connections between different windows are insufficient. To alleviate this problem, we propose a novel Augmented Self-Mask Attention (AMA) architecture that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order. We employ an ensemble technique and use weighted boundaries fusion to combine and refine the predicted action boundaries that have high confidence scores. On the test dataset of AI City Challenge 2024 Track 3, the proposed model ranks first on the public leaderboard of the challenge. Code is available at https://***/wolfworld6/AIcity2024-track3.
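The fusion of action boundaries by confidence can be illustrated with a minimal sketch. It is not the authors' implementation: the `Segment` layout, the clustering by temporal IoU, the `iou_thr` threshold, and the helper names are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical layout: (start_sec, end_sec, confidence), confidence > 0.
Segment = Tuple[float, float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def fuse_boundaries(segments: List[Segment], iou_thr: float = 0.5) -> List[Segment]:
    """Cluster overlapping predictions, then average boundaries by confidence."""
    clusters: List[List[Segment]] = []
    # Visit predictions from most to least confident.
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        for cluster in clusters:
            # Compare against the cluster's highest-confidence member.
            if temporal_iou(cluster[0], seg) >= iou_thr:
                cluster.append(seg)
                break
        else:
            clusters.append([seg])

    fused: List[Segment] = []
    for cluster in clusters:
        total = sum(s[2] for s in cluster)
        start = sum(s[0] * s[2] for s in cluster) / total
        end = sum(s[1] * s[2] for s in cluster) / total
        # Fused confidence is the mean confidence of the cluster (an assumption).
        fused.append((start, end, total / len(cluster)))
    return fused
```

Here two overlapping predictions from different ensemble members are merged into one segment whose start and end lean toward the more confident prediction, while non-overlapping predictions pass through unchanged.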
ISBN: 9798350301298 (print)
A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map. We present an effective way to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder. Our main insights are that these maps resemble localization maps and that, by combining such maps, one can obtain useful pseudo-labels for self-training. Our results surpass the state of the art in weakly supervised phrase grounding by a large margin. A similar gap in performance is obtained for a recently proposed downstream task called WWbL, in which only the image is given as input, without any text. Our code is available at https://***/talshaharabany/Similarity-Maps-forSelf-Training-Weakly-Supervised-Phrase-Grounding.
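To make the notion of a self-similarity map concrete, the sketch below computes the cosine similarity between one query location of a feature grid and every other location. The abstract does not say how the maps are combined into pseudo-labels, so the plain averaging in `average_maps` is purely an assumption, as are all function names.

```python
import math
from typing import List

Vector = List[float]
Grid = List[List[Vector]]  # H x W grid of C-dimensional encoder features


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def self_similarity_map(features: Grid, qy: int, qx: int) -> List[List[float]]:
    """Similarity of the feature at (qy, qx) to the feature at every location."""
    query = features[qy][qx]
    return [[cosine(query, cell) for cell in row] for row in features]


def average_maps(maps: List[List[List[float]]]) -> List[List[float]]:
    """Combine several maps by element-wise averaging (an assumed combination rule)."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[y][x] for m in maps) / len(maps) for x in range(w)]
            for y in range(h)]
```

A location whose feature points in the same direction as the query scores close to 1, which is why such a map can look like a coarse localization map for the region containing the query.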
ISBN: 9798350301298 (print)
We present a novel method to provide efficient and highly detailed reconstructions. Inspired by wavelets, we learn a neural field that decomposes the signal both spatially and frequency-wise. We follow the recent grid-based paradigm for spatial decomposition but, unlike existing work, encourage specific frequencies to be stored in each grid via Fourier feature encodings. We then apply a multi-layer perceptron with sine activations, taking these Fourier-encoded features in at appropriate layers so that higher-frequency components are accumulated sequentially on top of lower-frequency components, which we sum up to form the final output. We demonstrate that our method outperforms the state of the art in model compactness and convergence speed on multiple tasks: 2D image fitting, 3D shape reconstruction, and neural radiance fields. Our code is available at https://***/ubc-vision/NFFB.
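As a rough illustration of tying one frequency band to each grid level, the sketch below encodes a scalar coordinate with a sine/cosine pair per level. The band choice `2**level * pi` follows the common positional-encoding convention and is an assumption, not the paper's stated encoding; the function names are hypothetical.

```python
import math
from typing import List


def fourier_features(x: float, level: int) -> List[float]:
    """Encode a scalar coordinate at the single frequency band 2**level * pi."""
    freq = (2.0 ** level) * math.pi
    return [math.sin(freq * x), math.cos(freq * x)]


def encode_levels(x: float, num_levels: int) -> List[List[float]]:
    """One (sin, cos) pair per level; higher levels carry higher frequencies."""
    return [fourier_features(x, level) for level in range(num_levels)]
```

In a design like the one described, each level's encoding would be fed to a different layer of the sine-activated network so that fine detail is layered on top of the coarse signal; that wiring is not shown here.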
ISBN: 9798350365474 (print)
Low-light image enhancement (LLIE) plays a significant role in edge vision applications (EVA). Despite their broad applicability, existing LLIE methods are impractical on edge devices due to their high computational costs. This study proposes a framework for learning optimized low-light image enhancement that addresses the limitations of existing enhancement methods and accelerates EVA. The proposed framework incorporates a lightweight, mobile-friendly deep network. We quantized the proposed model to INT8 precision using a post-training quantization strategy and deployed it on an edge device. The LLIE model achieves over 199 frames per second (FPS) on a low-power edge board. Additionally, we evaluated the practicability of the optimized model for accelerating vision applications in an edge environment. The experimental results illustrate that our optimized method can significantly accelerate the performance of SOTA vision algorithms in challenging low-light conditions for numerous everyday vision tasks, including object detection and image registration.
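The core arithmetic of post-training INT8 quantization can be sketched as follows. This is a generic symmetric per-tensor scheme chosen for illustration; the abstract does not state which quantization scheme or toolchain was used, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the integer range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Fall back to scale 1.0 for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]
```

Because the scale is derived from already-trained weights, no retraining is needed; the price is a rounding error of at most half a quantization step per value.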
ISBN: 9798350365474 (print)
Heterogeneous face recognition (HFR) involves the intricate task of matching face images across the visible (VIS) and near-infrared (NIR) visual domains. While much of the existing literature on HFR identifies the domain gap as a primary challenge and directs efforts towards bridging it at either the input or feature level, our work deviates from this trend. We observe that large neural networks, unlike their smaller counterparts, demonstrate exceptional zero-shot performance in HFR when pretrained on large-scale homogeneous VIS data, suggesting that the domain gap might be less pronounced than previously believed. By approaching the HFR problem as one of low-data fine-tuning, we introduce a straightforward framework, comprehensive pre-training followed by a regularized fine-tuning strategy, that matches or surpasses the current state of the art on four publicly available benchmarks. Given its simplicity and demonstrably strong performance, our method could be used as a practical solution for adapting face recognition models to HFR as well as a new baseline for future HFR research. The corresponding training and evaluation code can be found at https://***/michaeltrs/RethinkNIRVIS.
ISBN: 9798350353006 (print)
The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However, progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address this limitation, we introduce M3Act, a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine, M3Act features multiple semantic groups, highly diverse and photorealistic images, and a comprehensive set of annotations, which facilitates the learning of human-centered tasks across single-person, multi-person, and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably, M3Act improves the state-of-the-art MOTRv2 on the DanceTrack dataset, leading to a jump on the leaderboard from 10th to 2nd place. Moreover, M3Act opens new research on controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for this novel task. Our code and data are available at our project page: http://***/M3Act.
ISBN: 9798350353006 (print)
Foot contact is an important cue for human motion capture, understanding, and generation. Existing datasets tend to annotate dense foot contact using visual matching with thresholding or by incorporating pressure signals. However, these approaches either suffer from low accuracy or are designed only for small-range, slow motion. There is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion, as well as accurate and dense foot-contact annotation. To fill this gap, we propose a Multimodal MoCap Dataset with Vision and Pressure sensors, named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations, which is especially useful for plausible shape estimation, robust pose fitting without foot drifting, and accurate global translation tracking. To validate the dataset, we propose an RGBD-P SMPL fitting method and a monocular-video-based baseline framework, VP-MoCap, for human motion capture. Experiments demonstrate that our RGBD-P SMPL fitting results significantly outperform pure visual motion capture. Moreover, VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://***/MMVP-Dataset/
ISBN: 9798350365474 (print)
This paper presents RallyTemPose, a transformer encoder-decoder model for predicting future badminton strokes based on previous rally actions. The model uses court position, skeleton poses, and player-specific embeddings to learn stroke-specific and player-specific latent representations in a spatiotemporal encoder module. The representations are then used to condition the subsequent strokes in a decoder module through rally-aware fusion blocks, which provide additional strategic and technical context for making more informed predictions. RallyTemPose shows improved forecasting accuracy compared to traditional sequential methods on two real-world badminton datasets. The performance boost can also be attributed to improved stroke embeddings extracted from the latent representation of a pre-trained large language model given detailed text descriptions of the strokes. In the discussion, the latent representations learned by the encoder module show useful properties for player analysis and comparison. The code is publicly available.
ISBN: 9798350365474 (print)
Object detection on images can benefit from coupling multiple spectra, each presenting specific useful features. However, building an efficient architecture that couples the different modalities is a complex task. Transformers, due to their ability to extract meaningful correlations between different regions of the inputs, appear to be a promising way to perform feature fusion across different spectra. This work presents a multi-spectral object detection architecture based on cross-attention feature fusion (CAFF), combined with a transformer-based detector (DINO). We demonstrate the performance of the proposed approach in object detection compared with state-of-the-art approaches on infrared-visible multi-spectral datasets. Moreover, the robustness to systematic misalignment between image pairs is studied. The proposed approach is applicable to any mono-spectrum transformer-based detector. The model developed in this study will be available in a dedicated GitHub repository.
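The general mechanism of cross-attention between two spectra can be sketched with plain scaled dot-product attention, where tokens from one spectrum act as queries and tokens from the other supply keys and values. This is a generic illustration, not the CAFF module itself: learned projections, multiple heads, and residual connections are omitted, and the choice of visible tokens as queries is an assumption.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query token attends over all key tokens."""
    dim = len(keys[0])
    width = len(values[0])
    fused: Matrix = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(width)])
    return fused
```

For example, `cross_attention(visible_tokens, infrared_tokens, infrared_tokens)` returns, for each visible token, a confidence-weighted mixture of infrared features, which is one way a region in one spectrum can pull in correlated evidence from the other.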
ISBN: 9798350301298 (print)
Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxiliary heads for each client, greatly improving training efficiency in the case of heterogeneous data. This approach allows individual models to preserve and enhance performance on their private tasks while also dramatically improving their performance on the global aggregated data distribution. We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency and show that our agents can significantly improve their performance compared to learning in isolation.
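One way agents can learn from each other without sharing data or weights is to exchange only their predictions on the shared public dataset and match them with a distillation loss. The sketch below shows a standard temperature-softened KL divergence for a single public example; it is an assumed, generic formulation and does not reproduce the paper's loss or its auxiliary heads. All names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float],
                      peer_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(peer || student) on one public example; assumes moderate logit ranges."""
    peer = softmax(peer_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(peer, student) if p > 0)
```

Only the peer's logits on public inputs cross the network in this scheme, so private data, weights, and weight updates stay local to each agent.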