ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) it deliberately targets a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently via buffering of layer outputs; and 3) it can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model to the AIS 2024 event-based eye tracking challenge, reaching a score of 0.9916 p10 accuracy on the Kaggle private test set.
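A minimal sketch of the output-buffering idea for online inference, assuming a hypothetical single causal layer; this illustrates the general technique, not the authors' network:

```python
import torch
import torch.nn as nn

class BufferedCausalConv1d(nn.Module):
    """Causal temporal convolution that caches its receptive field so
    online inference processes one new time step per call (illustrative)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        # Buffer holds the last (kernel_size - 1) inputs between calls.
        self.register_buffer("cache", torch.zeros(1, channels, kernel_size - 1))

    def forward(self, x_t: torch.Tensor) -> torch.Tensor:
        # x_t: (batch=1, channels, 1) -- a single new time step.
        window = torch.cat([self.cache, x_t], dim=2)  # full receptive field
        self.cache = window[:, :, 1:].detach()        # slide the buffer forward
        return torch.relu(self.conv(window))          # (1, channels, 1)

layer = BufferedCausalConv1d(channels=8)
for _ in range(5):                                    # streaming, step by step
    out = layer(torch.randn(1, 8, 1))
print(out.shape)  # torch.Size([1, 8, 1])
```

For the sparsity objective, an L1 penalty on the ReLU outputs (e.g., `loss = task_loss + lam * act.abs().mean()`) is one common regularizer that pushes activations toward zero; the abstract does not specify the authors' exact formulation.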
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Trauma is a leading cause of mortality worldwide, with about 20% of these deaths being preventable. Most of these preventable deaths result from errors during the initial resuscitation of injured patients. Decision support has been evaluated as an approach to assist teams during this phase and reduce errors. Existing systems require manual data entry and monitoring, which makes these tasks challenging to accomplish in a time-critical setting. This paper identifies the specific challenges of achieving effective decision support in trauma resuscitation with computer-vision techniques: complex backgrounds, crowded scenes, fine-grained activities, and a scarcity of labeled data. To address the first three challenges, the proposed system uses an actor tracker that identifies individuals, allowing the system to focus on actor-specific features. A Video Masked Autoencoder (Video-MAE) was used to overcome the issue of insufficient labeled data; this approach enables self-supervised learning on unlabeled video content, improving feature representations for medical activities. For more reliable performance, an ensemble fusion method was introduced that combines predictions from consecutive video clips and different actors. Our method outperformed existing approaches in identifying fine-grained activities, providing a solution for activity recognition in trauma resuscitation and similar complex domains.
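As a rough illustration of the clip-and-actor ensemble fusion step; the combination rule here (simple averaging) is an assumption, since the abstract does not give the exact scheme:

```python
import numpy as np

def fuse_predictions(clip_probs: np.ndarray) -> int:
    """Fuse per-clip, per-actor class probabilities into one activity label.

    clip_probs: array of shape (num_clips, num_actors, num_classes)
    holding softmax outputs from the recognition model.
    """
    fused = clip_probs.mean(axis=(0, 1))  # average over clips and actors
    return int(fused.argmax())            # most probable activity class

# Example: 4 consecutive clips, 3 tracked actors, 5 activity classes.
probs = np.random.dirichlet(np.ones(5), size=(4, 3))
print(fuse_predictions(probs))
```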
ISBN (print): 9781665448994
Learning continually from non-stationary data streams is a long-standing goal and a challenging problem in machine learning. Recently, we have witnessed a renewed and fast-growing interest in continual learning, especially within the deep learning community. However, algorithmic solutions are often difficult to re-implement, evaluate, and port across different settings, and even results on standard benchmarks are hard to reproduce. In this work, we propose Avalanche, an open-source end-to-end library for continual learning research based on PyTorch. Avalanche is designed to provide a shared and collaborative codebase for fast prototyping, training, and reproducible evaluation of continual learning algorithms.
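To give a flavour of the library's intent, here is a minimal training loop in the style of Avalanche's documented examples; module paths and signatures have shifted across Avalanche releases (e.g., `avalanche.training.strategies` became `avalanche.training.supervised`), so treat this as a sketch rather than version-exact code:

```python
import torch
from avalanche.benchmarks.classic import SplitMNIST
from avalanche.models import SimpleMLP
from avalanche.training.strategies import Naive  # avalanche.training.supervised in newer releases

benchmark = SplitMNIST(n_experiences=5)          # MNIST split into 5 sequential tasks
model = SimpleMLP(num_classes=10)
strategy = Naive(
    model,
    torch.optim.SGD(model.parameters(), lr=0.001),
    torch.nn.CrossEntropyLoss(),
    train_mb_size=32, train_epochs=1, eval_mb_size=32,
)

for experience in benchmark.train_stream:        # train on each task in turn
    strategy.train(experience)
    strategy.eval(benchmark.test_stream)         # evaluate on the full test stream
```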
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Deepfake detection aims to counter the spread of deep-generated media that undermines trust in online content. While existing methods focus on large and complex models, the need for real-time detection demands greater efficiency. With this in mind, and unlike previous work, we introduce a novel deepfake detection approach for images that uses Binary Neural Networks (BNNs) for fast inference with minimal accuracy loss. Moreover, our method incorporates Fast Fourier Transform (FFT) and Local Binary Pattern (LBP) features as additional channels to uncover manipulation traces in the frequency and texture domains. Evaluations on the COCOFake, DFFD, and CIFAKE datasets demonstrate our method's state-of-the-art performance in most scenarios, with a significant efficiency gain of up to a 20× reduction in FLOPs during inference. Finally, by exploring BNNs for deepfake detection to balance accuracy and efficiency, this work paves the way for future research on efficient deepfake detection.
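A sketch of how FFT and LBP maps can be stacked as extra input channels, using NumPy and scikit-image; the normalization and LBP settings are assumptions, as the abstract does not specify them:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import local_binary_pattern

def add_fft_lbp_channels(rgb: np.ndarray) -> np.ndarray:
    """Stack FFT-magnitude and LBP maps onto an RGB image: (H, W, 3) -> (H, W, 5)."""
    gray = rgb2gray(rgb)  # (H, W) in [0, 1]
    # Centered log-magnitude spectrum; periodic generation artifacts show up here.
    fft_mag = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray))))
    fft_mag /= fft_mag.max() + 1e-8
    # Uniform LBP with 8 neighbors at radius 1 encodes local texture patterns.
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    lbp /= lbp.max() + 1e-8
    return np.dstack([rgb, fft_mag, lbp]).astype(np.float32)

image = np.random.rand(224, 224, 3).astype(np.float32)
print(add_fft_lbp_channels(image).shape)  # (224, 224, 5)
```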
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for foundation models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. By leveraging large-scale image-text pairs for training instead of manual annotations, RAM introduces a new paradigm for image tagging. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing. Subsequently, a preliminary model is trained for automatic annotation by unifying the captioning and tagging tasks, supervised by the original texts and parsed tags, respectively. Thirdly, a data engine is employed to generate additional annotations and clean incorrect ones. Lastly, the model is retrained with the processed data and fine-tuned using a smaller but higher-quality subset. We evaluate the tagging capability of RAM on numerous benchmarks and observe an impressive zero-shot performance, which significantly outperforms CLIP and BLIP. Remarkably, RAM even surpasses fully supervised models and exhibits competitive performance compared with the Google tagging API. We have released RAM at https://***/ to foster the advancement of foundation models in computer vision.
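The first step, obtaining annotation-free tags by parsing caption text, can be illustrated with a toy parser; RAM uses a dedicated semantic parser and a curated tag vocabulary, so this spaCy noun-chunk sketch is only indicative:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def caption_to_tags(caption: str) -> list[str]:
    """Extract candidate tags (head nouns and verbs) from a free-text caption."""
    doc = nlp(caption.lower())
    tags = {chunk.root.lemma_ for chunk in doc.noun_chunks}       # object tags
    tags.update(tok.lemma_ for tok in doc if tok.pos_ == "VERB")  # action tags
    return sorted(tags)

print(caption_to_tags("A dog catches a frisbee on the grassy park lawn"))
# e.g. ['catch', 'dog', 'frisbee', 'lawn']
```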
ISBN (print): 9781665448994
The Image Signal Processor (ISP) is a customized device that restores RGB images from the pixel signals of a CMOS image sensor. To realize this function, a series of processing units is used to tackle the various artifacts introduced by the capture device, such as color shifts, signal noise, and moiré effects. However, tuning each processing unit is highly complicated and requires substantial experience and effort from imaging experts. In this paper, a novel network architecture, CSANet, with an emphasis on inference speed and high PSNR, is proposed for the end-to-end learned ISP task. The proposed CSANet applies a double attention module employing both channel and spatial attention. In particular, its spatial attention is simplified to a lightweight dilated depth-wise convolution and still performs as well as alternatives. As proof of performance, CSANet won 2nd place in the Mobile AI 2021 Learned Smartphone ISP Challenge, with the 1st place PSNR score.
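One plausible reading of the double attention module, with SE-style channel attention and a dilated depth-wise spatial gate; both specifics are assumptions beyond what the abstract states:

```python
import torch
import torch.nn as nn

class DoubleAttention(nn.Module):
    """Channel attention plus spatial attention via a dilated depth-wise
    convolution -- an illustrative reading, not CSANet's exact block."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Depth-wise (groups=channels) dilated conv yields a per-pixel gate.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_att(x)     # reweight channels
        return x * self.spatial_att(x)  # reweight spatial positions

out = DoubleAttention(16)(torch.randn(1, 16, 64, 64))
print(out.shape)  # torch.Size([1, 16, 64, 64])
```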
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
This paper reviews the NTIRE 2024 Portrait Quality Assessment Challenge, highlighting the proposed solutions and results. The challenge aims to obtain an efficient deep neural network capable of estimating the perceptual quality of real portrait photos. The methods must generalize to diverse scenes, diverse lighting conditions (indoor, outdoor, low-light), movement, blur, and other challenging conditions. In the challenge, 140 participants registered and 35 submitted results during the challenge period. The performance of the top 5 submissions is reviewed here as a gauge of the current state of the art in Portrait Quality Assessment.
ISBN (print): 9781665448994
Event cameras are sensors whose pixels respond independently and asynchronously to changes in scene illumination. Event cameras have a number of advantages over conventional cameras: low latency, high temporal resolution, high dynamic range, low power consumption, and sparse data output. However, existing event cameras also suffer from comparatively low spatial resolution and are sensitive to noise. Recently, it has been shown that it is possible to reconstruct an intensity frame stream from an event stream. These reconstructions preserve the high temporal rate of the event stream but tend to suffer from significant artifacts and low image quality due to the shortcomings of event cameras. In this work we demonstrate that it is possible to combine the best of both worlds by fusing a color frame stream at low temporal resolution and high spatial resolution with an event stream at high temporal resolution and low spatial resolution, generating a video stream with both high temporal and high spatial resolution while preserving the original color information. We utilize a novel event frame interpolation network (EFI-Net), a multi-phase convolutional neural network that fuses the frame and event streams. EFI-Net is trained using only simulated data and generalizes exceptionally well to real-world experimental data. We show that our method is able to interpolate frames where traditional video interpolation approaches fail, while also outperforming event-only reconstructions. We further contribute a new dataset containing event camera data synchronized with high-speed video. This work opens the door to a new application for event cameras, enabling high-fidelity fusion with frame-based image streams for the generation of high-quality, high-speed video. The dataset is available at https://***/file/d/1UIGVBqNER_5KguYPAu5y7TVg-JlNhz3-/view?usp=sharing.
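Fusion networks of this kind typically consume events as a spatio-temporal tensor; a common voxel-grid encoding is sketched below (the abstract does not state EFI-Net's exact input representation):

```python
import numpy as np

def events_to_voxel_grid(events: np.ndarray, num_bins: int,
                         height: int, width: int) -> np.ndarray:
    """Accumulate an event stream into a (num_bins, H, W) voxel grid.

    events: rows of (t, x, y, polarity) with polarity in {-1, +1}.
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    t = events[:, 0]
    # Normalize timestamps into [0, num_bins) and assign each event to a bin.
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9) * (num_bins - 1e-6)
    bins = t_norm.astype(np.int64)
    x = events[:, 1].astype(np.int64)
    y = events[:, 2].astype(np.int64)
    np.add.at(grid, (bins, y, x), events[:, 3])  # signed accumulation
    return grid

events = np.array([[0.00, 5, 7, +1], [0.01, 5, 7, -1], [0.02, 9, 3, +1]])
print(events_to_voxel_grid(events, num_bins=3, height=16, width=16).shape)
# (3, 16, 16)
```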
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Performing hyperparameter tuning in federated learning is often prohibitively expensive due to the substantial communication overhead associated with training a single configuration, especially with a large hyperparameter search space. To overcome this challenge, recent works have explored reward-based approaches that learn a policy distribution over a set of hyperparameter configurations. These approaches enable the concurrent exploration of multiple hyperparameter configurations within a single communication round, thereby accelerating the search process. In this paper, we take a deeper look at reward-based strategies and systematically analyze them, uncovering several issues and challenges associated with their adoption in practice. Furthermore, motivated by the insights from our analysis, we propose an in-depth evaluation of the policy distribution with metrics that capture the rankings of standalone configurations. We contribute this critical examination and the proposed evaluation metrics in order to raise awareness of the challenges and hidden issues that reward-based federated hyperparameter optimization might face, and to enable a more rigorous evaluation and therefore faster progress in this research area. We expect that the identified challenges will serve as inspiration for the development of more robust and hyperparameter-free federated hyperparameter tuning approaches.
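The reward-based strategy under analysis can be caricatured as a bandit-style policy over configurations; the REINFORCE-style update and reward model below are illustrative assumptions, not any specific published algorithm:

```python
import numpy as np

# Toy reward-based federated HPO: keep a softmax policy over candidate
# configurations and update it with noisy client rewards (e.g., accuracy).
rng = np.random.default_rng(0)
true_quality = np.array([0.60, 0.72, 0.68, 0.55])  # hidden config quality
logits = np.zeros(len(true_quality))
baseline, lr = 0.0, 0.5

for communication_round in range(200):
    policy = np.exp(logits - logits.max())
    policy /= policy.sum()                            # softmax over configs
    arm = rng.choice(len(policy), p=policy)           # config tried this round
    reward = true_quality[arm] + rng.normal(0, 0.02)  # noisy client feedback
    baseline = 0.9 * baseline + 0.1 * reward          # running-mean baseline
    # REINFORCE update: grad of log pi(arm) is (one_hot(arm) - policy).
    one_hot = np.eye(len(policy))[arm]
    logits += lr * (reward - baseline) * (one_hot - policy)

# Probability mass should concentrate on the best configuration.
print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))
```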
We propose a novel depth-aware joint attention target estimation framework that estimates the attention target in 3D space. Our goal is to mimic the human ability to understand where each person is looking in their proximity. In this work, we tackle the previously unexplored problem of utilising a depth prior along with a 3D joint field-of-view (FOV) probability map to estimate the joint attention target of people in the scene. We leverage the insight that, besides the 2D image content, strong gaze-related constraints exist in the depth order of the scene and in different subject-specific attributes. Extensive experiments show that our method performs favourably against existing joint attention target estimation methods on the VideoCoAtt benchmark dataset. Although the proposed framework is designed for joint attention target estimation, we show that it also outperforms single attention target estimation methods on both the GazeFollow image and the VideoAttentionTarget video benchmark datasets.
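A toy version of a 3D FOV probability map: candidate 3D points are scored against each person's viewing cone, and two people's maps are intersected for joint attention. The exponential-cosine falloff and its parameters are assumptions, not the paper's formulation:

```python
import numpy as np

def fov_probability(points: np.ndarray, eye: np.ndarray, gaze_dir: np.ndarray,
                    kappa: float = 6.0) -> np.ndarray:
    """Score 3D points by how well they fall inside a person's viewing cone.

    points: (N, 3) candidate locations (e.g., back-projected from depth);
    eye: (3,) eye position; gaze_dir: (3,) unit gaze direction.
    """
    rays = points - eye
    rays /= np.linalg.norm(rays, axis=1, keepdims=True) + 1e-9
    cos_angle = rays @ gaze_dir                  # 1 along the gaze axis
    return np.exp(kappa * (cos_angle - 1.0))     # peaks at cos_angle = 1

# Joint attention: multiply per-person FOV maps so only targets inside
# everyone's cone score highly.
pts = np.random.randn(1000, 3) * 2.0
p1 = fov_probability(pts, np.array([0.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0]))
p2 = fov_probability(pts, np.array([1.0, 0.0, 0.0]), np.array([-0.6, 0.0, 0.8]))
joint = p1 * p2
print(pts[joint.argmax()])                       # most likely joint target
```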