ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
The Sports Action Recognition (SAR) domain is of significant importance in research, with diverse applications ranging from aiding coaches in strategic decision-making to empowering athletes and contributing to real-time commercial entertainment. Despite the existence of extensive large-scale and small-scale datasets, the direct application of these datasets to specific sports domains, such as cricket, poses challenges. Existing datasets predominantly center around daily-life actions, lacking the necessary granularity for in-depth sports analyses. Current Cricket Action Analysis (CAA) datasets have limitations, including their small scale, modality constraints, and narrow focus on specific aspects, such as cricket batting. Recognizing the need for a more comprehensive benchmark, this article introduces the Cricket Excited Actions (CEA) dataset. Developed in collaboration with professional cricket players, the CEA dataset encompasses challenging multi-person actions within realistic cricket scenarios. The selected activity classes, such as Clean Bowled, Six, Four, and Catches, adhere to official standards and represent pivotal moments in cricket matches. Through precise annotation and empirical studies utilizing state-of-the-art action recognition model architectures, this study provides a valuable resource for further research and makes significant contributions by offering support essential to advancing CAA within the cricket sports community. The data and code are available at https://***/Altaf-hucn/Cricket-Excited-Actions-Benchmark.
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Though multimodal emotion recognition has achieved significant progress over recent years, the potential of rich synergic relationships across the modalities is not fully exploited. In this paper, we introduce Recursive Joint Cross-Modal Attention (RJCMA) to effectively capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition. In particular, we compute the attention weights based on cross-correlation between the joint audio-visual-text feature representations and the feature representations of individual modalities to simultaneously capture intra- and inter-modal relationships across the modalities. The attended features of the individual modalities are again fed as input to the fusion model in a recursive mechanism to obtain more refined feature representations. We have also explored Temporal Convolutional Networks (TCNs) to improve the temporal modeling of the feature representations of individual modalities. Extensive experiments are conducted to evaluate the performance of the proposed fusion model on the challenging Affwild2 dataset. By effectively capturing the synergic intra- and inter-modal relationships across audio, visual, and text modalities, the proposed fusion model achieves a Concordance Correlation Coefficient (CCC) of 0.585 (0.542) and 0.674 (0.619) for valence and arousal, respectively, on the validation set (test set). This shows a significant improvement over the baseline of 0.240 (0.211) and 0.200 (0.191) for valence and arousal, respectively, in the validation set (test set), achieving second place in the valence-arousal challenge of the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition. The code is available on GitHub: https://***/praveena2j/RJCMA.
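The fusion mechanism described above lends itself to a compact sketch. The PyTorch snippet below only illustrates the recursive joint cross-modal attention idea, not the authors' implementation (which is linked above); the layer sizes, the single learnable correlation matrix per modality, and the residual refinement are assumptions.

```python
# Illustrative sketch of recursive joint cross-modal attention (not the RJCMA code).
import torch
import torch.nn as nn


class RecursiveJointCrossAttention(nn.Module):
    def __init__(self, dim: int, n_iters: int = 2):
        super().__init__()
        self.n_iters = n_iters
        # Joint representation from the concatenated audio, visual, and text features.
        self.joint_proj = nn.Linear(3 * dim, dim)
        # One learnable correlation matrix per modality (an assumption).
        self.corr = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(3)])

    def forward(self, audio, visual, text):
        # audio / visual / text: (batch, time, dim)
        feats = [audio, visual, text]
        for _ in range(self.n_iters):
            joint = self.joint_proj(torch.cat(feats, dim=-1))          # (B, T, D)
            refined = []
            for m, x in enumerate(feats):
                # Cross-correlation between the joint features and this modality's
                # features, used as attention weights over time.
                scores = torch.matmul(self.corr[m](joint), x.transpose(1, 2))
                attn = torch.softmax(scores / x.shape[-1] ** 0.5, dim=-1)
                refined.append(x + torch.matmul(attn, x))               # residual refinement
            # Attended features are fed back into the fusion on the next iteration.
            feats = refined
        return torch.cat(feats, dim=-1)                                 # fused representation


if __name__ == "__main__":
    fuse = RecursiveJointCrossAttention(dim=128)
    a, v, t = (torch.randn(2, 50, 128) for _ in range(3))
    print(fuse(a, v, t).shape)  # torch.Size([2, 50, 384])
```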
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
This paper proposes H³Net, which detects people in irregular postures by utilizing human structures and characters. To handle both features, we introduce two attention modules: 1) the Human Structure Attention Module (HSAM), which focuses on the spatial aspects of a person, and 2) the Human Character Attention Module (HCAM), which is designed to address the issue of repetitive appearance. HSAM effectively handles both foreground and background information about a human instance and utilizes keypoints to provide additional guidance for predicting irregular postures. Meanwhile, HCAM employs ID information obtained from the tracking head, enriching the posture prediction with high-level semantic information. Furthermore, gathering images of people in irregular postures is a challenging task, so many conventional datasets consist of images of the same actors simulating varying postures in distinct images. To address this problem, we propose a Human ID Dependent Posture (HID²) loss that handles repeated instances. The HID² loss generates a regularization term that accounts for duplicated instances to reduce bias. Our experiments demonstrate the effectiveness of H³Net compared to existing algorithms on irregular posture datasets. Furthermore, we show qualitative results using color-coded masks and bounding boxes, and provide ablation studies to highlight the significance of the proposed methods.
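As a rough illustration of how an ID-dependent regularization of this kind could work, the sketch below re-weights per-instance losses so that actors who appear in many images do not dominate training. The grouping-and-averaging scheme and the function name are hypothetical; the paper defines the actual HID² formulation.

```python
# Hedged sketch of an ID-balanced regularization in the spirit of the HID² loss
# (the exact formulation in the paper differs).
import torch


def id_balanced_loss(per_instance_loss: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """per_instance_loss: (N,) losses; ids: (N,) integer actor IDs from the tracking head."""
    group_means = []
    for uid in ids.unique():
        mask = ids == uid
        # Average over duplicated instances of the same actor first...
        group_means.append(per_instance_loss[mask].mean())
    # ...then average over actors, so each identity contributes equally.
    return torch.stack(group_means).mean()


if __name__ == "__main__":
    losses = torch.tensor([1.0, 0.8, 0.9, 2.0])   # three crops of actor 7, one of actor 3
    ids = torch.tensor([7, 7, 7, 3])
    print(id_balanced_loss(losses, ids))          # 0.5 * (0.9 + 2.0) = 1.45
```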
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Capturing the 3D human body is one of the important tasks in computer vision, with a wide range of applications such as virtual reality and sports analysis. However, conventional frame cameras are limited by their temporal resolution and dynamic range, which imposes constraints in real-world application setups. Event cameras have the advantages of high temporal resolution and high dynamic range (HDR), but event-based methods must be developed to handle data with different characteristics. This paper proposes a novel event-based method for 3D pose estimation and human mesh recovery. Prior work on event-based human mesh recovery requires frames (images) as well as event data. The proposed method relies solely on events; it carves 3D voxels by moving the event camera around a stationary body, reconstructs the human pose and mesh from attenuated rays, and fits statistical body models, preserving high-frequency details. The experimental results show that the proposed method outperforms conventional frame-based methods in the estimation accuracy of both pose and body mesh. We also demonstrate results in challenging situations where frame-based methods suffer from motion blur. This is the first work of its kind to demonstrate event-only human mesh recovery, and we hope it is a first step toward robust and accurate 3D human body scanning from vision sensors.
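For intuition about the carving step, the snippet below shows a heavily simplified, silhouette-style space-carving pass over a voxel grid, assuming binary per-view event masks and known camera projection matrices. The paper's attenuated-ray reconstruction and statistical body-model fitting are not reproduced here.

```python
# Simplified space-carving sketch under strong assumptions (not the paper's method).
import numpy as np


def carve_voxels(voxel_centers, event_masks, projections):
    """voxel_centers: (V, 3); event_masks: list of (H, W) bool arrays;
    projections: list of (3, 4) camera matrices, one per view."""
    keep = np.ones(len(voxel_centers), dtype=bool)
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])  # (V, 4)
    for mask, P in zip(event_masks, projections):
        uvw = homog @ P.T                                  # project all voxels into this view
        u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)
        v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
        h, w = mask.shape
        inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(len(voxel_centers), dtype=bool)
        hit[inside] = mask[v[inside], u[inside]]
        keep &= hit                                        # a voxel survives only if every view observes events there
    return keep
```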
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Though diffusion models have been successfully applied to various image restoration (IR) tasks, their performance is sensitive to the choice of training datasets. Typically, diffusion models trained on specific datasets fail to recover images that have out-of-distribution degradations. To address this problem, this work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR). More specifically, all low-quality images are simulated with a synthetic degradation pipeline that contains multiple common degradations such as blur, resize, noise, and JPEG compression. Then we introduce robust training for a degradation-aware CLIP model to extract enriched image content features to assist high-quality image restoration. Our base diffusion model is the image restoration SDE (IR-SDE). Built upon it, we further present a posterior sampling strategy for fast noise-free image generation. We evaluate our model on both synthetic and real-world degradation datasets. Moreover, experiments on the unified image restoration task illustrate that the proposed posterior sampling improves image generation quality for various degradations.
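A synthetic degradation pipeline of the kind described (blur, resize, noise, JPEG) can be sketched in a few lines with Pillow and NumPy. The parameter ranges and the fixed ordering below are assumptions for illustration; the authors' pipeline may randomize or compose degradations differently.

```python
# Minimal sketch of a blur -> resize -> noise -> JPEG degradation pipeline.
import io
import random

import numpy as np
from PIL import Image, ImageFilter


def degrade(img: Image.Image) -> Image.Image:
    img = img.convert("RGB")
    w, h = img.size
    # Gaussian blur with a random radius.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))
    # Downscale then upscale back to simulate resolution loss.
    scale = random.uniform(0.25, 0.75)
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))), Image.BICUBIC)
    img = img.resize((w, h), Image.BICUBIC)
    # Additive Gaussian noise.
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, random.uniform(1, 15), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    # JPEG compression at a random quality.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(30, 90))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```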
ISBN (Print): 9781665448994
The AI City Challenge was created with two goals in mind: (1) pushing the boundaries of research and development in intelligent video analysis for smarter-city use cases, and (2) assessing tasks where the level of performance is sufficient to drive real-world adoption. Transportation is a segment ripe for such adoption. The fifth AI City Challenge attracted 305 participating teams across 38 countries, who leveraged city-scale real traffic data and high-quality synthetic data to compete in five challenge tracks. Track 1 addressed video-based automatic vehicle counting, with the evaluation conducted on both algorithmic effectiveness and computational efficiency. Track 2 addressed city-scale vehicle re-identification with augmented synthetic data to substantially increase the training set for the task. Track 3 addressed city-scale multi-target multi-camera vehicle tracking. Track 4 addressed traffic anomaly detection. Track 5 was a new track addressing vehicle retrieval using natural language descriptions. The evaluation system shows a general leaderboard of all submitted results and a public leaderboard of results limited to the contest participation rules, where teams are not allowed to use external data in their work. The public leaderboard shows results closer to real-world situations where annotated data is limited. Results show the promise of AI in Smarter Transportation. State-of-the-art performance on some tasks shows that these technologies are ready for adoption in real-world systems.
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Neural Radiance Fields (NeRFs) have emerged as promising tools for advancing autonomous driving (AD) research, offering scalable closed-loop simulation and data augmentation capabilities. However, to trust the results achieved in simulation, one needs to ensure that AD systems perceive real and rendered data in the same way. Although the performance of rendering methods is increasing, many scenarios will remain inherently challenging to reconstruct faithfully. To this end, we propose a novel perspective for addressing the real-to-simulated data gap. Rather than solely focusing on improving rendering fidelity, we explore simple yet effective methods to enhance perception model robustness to NeRF artifacts without compromising performance on real data. Moreover, we conduct the first large-scale investigation into the real-to-simulated data gap in an AD setting using a state-of-the-art neural rendering technique. Specifically, we evaluate object detectors and an online mapping model on real and simulated data, and study the effects of different fine-tuning strategies. Our results show notable improvements in model robustness to simulated data, even improving real-world performance in some cases. Last, we delve into the correlation between the real-to-simulated gap and image reconstruction metrics, identifying FID and LPIPS as strong indicators.
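For readers who want to probe the real-to-simulated gap on their own data, one of the reconstruction metrics named above, LPIPS, can be computed on paired real and rendered frames as in the sketch below (using the publicly available lpips package). The detector and online-mapping evaluations from the paper are out of scope here.

```python
# Hedged sketch: perceptual distance between paired real and NeRF-rendered frames.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")          # AlexNet-based LPIPS


def perceptual_gap(real: torch.Tensor, rendered: torch.Tensor) -> float:
    """real, rendered: (N, 3, H, W) tensors scaled to [-1, 1], frame-aligned pairs."""
    with torch.no_grad():
        return loss_fn(real, rendered).mean().item()
```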
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Although stereo image super-resolution has been extensively studied, many existing works rely only on attention in a single epipolar direction to reconstruct stereo images. In the case of asymmetric parallax images, these methods often struggle to capture reliable stereo correspondence, resulting in reconstructed images that suffer from blurring and artifacts. In this paper, we propose a novel method called the Cross-View Aggregation Network for Stereo Image Super-Resolution (CANSSR) and explore the relationship between multi-directional epipolar lines to construct reliable stereo correspondence. Specifically, we propose a multi-directional cross-view aggregation module (MCAM) that effectively captures multi-directional stereo correspondence and obtains cross-view complementary information. Furthermore, we design a channel-spatial aggregation module (CSAM) that aggregates multi-order global-local information within each view to reconstruct clearer texture features. In addition, we equip the feed-forward network with a large-kernel convolution to acquire richer detailed texture information. Extensive experiments conclusively demonstrate that CANSSR outperforms state-of-the-art methods both qualitatively and quantitatively for stereo image super-resolution on the Flickr1024 and Middlebury datasets.
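To make the multi-directional idea concrete, the toy module below lets left-view features attend to right-view features along the horizontal epipolar axis and, additionally, along the vertical axis, then fuses the two aggregations. It only illustrates the concept; the actual MCAM and CSAM designs in the paper differ.

```python
# Toy sketch of multi-directional cross-view attention (illustration only).
import torch
import torch.nn as nn


class MultiDirCrossViewAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def _axis_attention(self, left, right, horizontal: bool):
        if not horizontal:                                  # attend along the vertical axis by swapping H and W
            left, right = left.transpose(2, 3), right.transpose(2, 3)
        b, c, h, w = left.shape
        q = self.q(left).permute(0, 2, 3, 1).reshape(b * h, w, c)   # queries along each row
        k = self.k(right).permute(0, 2, 1, 3).reshape(b * h, c, w)
        v = right.permute(0, 2, 3, 1).reshape(b * h, w, c)
        attn = torch.softmax(torch.bmm(q, k) / c ** 0.5, dim=-1)    # (b*h, w, w)
        out = torch.bmm(attn, v).reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out if horizontal else out.transpose(2, 3)

    def forward(self, left, right):
        horiz = self._axis_attention(left, right, horizontal=True)
        vert = self._axis_attention(left, right, horizontal=False)
        return left + self.fuse(torch.cat([horiz, vert], dim=1))


if __name__ == "__main__":
    m = MultiDirCrossViewAttention(channels=32)
    print(m(torch.randn(1, 32, 24, 40), torch.randn(1, 32, 24, 40)).shape)  # (1, 32, 24, 40)
```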
ISBN (Digital): 9798350365474
ISBN (Print): 9798350365481
Terrain classification is an important problem for mobile robots operating in extreme environments as it can aid downstream tasks such as autonomous navigation and planning. While RGB cameras are widely used for terrain identification, vision-based methods can suffer due to poor lighting conditions and occlusions. In this paper, we propose the novel use of Ground Penetrating Radar (GPR) for terrain characterization for mobile robot platforms. Our approach leverages machine learning for surface terrain classification from GPR data. We collect a new dataset consisting of four different terrain types, and present qualitative and quantitative results. Our results demonstrate that classification networks can learn terrain categories from GPR signals. Additionally, we integrate our GPR-based classification approach into a multimodal semantic mapping framework to demonstrate a practical use case of GPR for surface terrain classification on mobile robots. Overall, this work extends the usability of GPR sensors deployed on robots to enable terrain classification in addition to GPR’s existing scientific use cases.
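As a minimal illustration of learning terrain categories from GPR signals, the sketch below treats each GPR trace (A-scan) as a 1-D signal and classifies it into the four terrain types with a small 1-D CNN. The input format and architecture are assumptions, not the network used in the paper.

```python
# Illustrative 1-D CNN terrain classifier over GPR traces (assumed data format).
import torch
import torch.nn as nn


class GPRTerrainClassifier(nn.Module):
    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, 1, samples_per_trace)
        return self.head(self.features(x).squeeze(-1))


if __name__ == "__main__":
    model = GPRTerrainClassifier()
    logits = model(torch.randn(8, 1, 512))     # 8 traces, 512 time samples each
    print(logits.shape)                        # torch.Size([8, 4])
```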
Blind image inpainting is a crucial restoration task that does not demand additional mask information to restore corrupted regions. Yet it is a relatively underexplored research area due to the difficulty of discriminating between corrupted and valid regions. The few existing approaches to blind image inpainting sometimes fail to produce plausible inpainted images, since they follow the common practice of first predicting the corrupted regions and then inpainting them. To skip the corrupted-region prediction step and obtain better results, in this work we propose a novel end-to-end architecture for blind image inpainting consisting of a wavelet query multi-head attention transformer block and omni-dimensional gated attention. The proposed wavelet query multi-head attention in the transformer block provides encoder features via processed wavelet coefficients as the query to the multi-head attention. Further, the proposed omni-dimensional gated attention effectively provides all-dimensional attentive features from the encoder to the respective decoder. Our proposed approach is compared numerically and visually with existing state-of-the-art methods for blind image inpainting on different standard datasets. The comparative and ablation studies prove the effectiveness of the proposed approach for blind image inpainting. The testing code is available at: https://***/shrutiphutke/Blind_Omni_Wav_Net
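The wavelet-query idea can be illustrated with a short sketch: queries for multi-head attention are derived from wavelet coefficients of the encoder features rather than from the features themselves. The single-level Haar low-pass band and the layer sizes below are illustrative assumptions and not the authors' Blind_Omni_Wav_Net implementation.

```python
# Hedged sketch: multi-head attention whose queries come from Haar wavelet coefficients.
import torch
import torch.nn as nn
import torch.nn.functional as F


def haar_lowpass(x):
    """Single-level 2-D Haar low-low band: average of each 2x2 block -> (B, C, H/2, W/2)."""
    return F.avg_pool2d(x, kernel_size=2)


class WaveletQueryAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.q_proj = nn.Conv2d(channels, channels, 1)

    def forward(self, feat):                               # feat: (B, C, H, W), H and W even
        b, c, h, w = feat.shape
        q = self.q_proj(haar_lowpass(feat))                 # queries from wavelet coefficients
        q = q.flatten(2).transpose(1, 2)                    # (B, H/2*W/2, C)
        kv = feat.flatten(2).transpose(1, 2)                # (B, H*W, C)
        out, _ = self.attn(q, kv, kv)
        out = out.transpose(1, 2).reshape(b, c, h // 2, w // 2)
        return F.interpolate(out, size=(h, w), mode="bilinear", align_corners=False)


if __name__ == "__main__":
    block = WaveletQueryAttention(channels=64)
    print(block(torch.randn(1, 64, 32, 32)).shape)          # torch.Size([1, 64, 32, 32])
```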