ISBN (paperback): 9781665487399
This paper reviews the challenge on constrained high dynamic range (HDR) imaging that was part of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2022. This manuscript focuses on the competition set-up, datasets, the proposed methods and their results. The challenge aims at estimating an HDR image from multiple low dynamic range (LDR) observations, which might suffer from under- or over-exposed regions and different sources of noise. The challenge is composed of two tracks with an emphasis on fidelity and complexity constraints: in Track 1, participants are asked to optimize objective fidelity scores while respecting a low-complexity constraint (i.e., solutions cannot exceed a given number of operations); in Track 2, participants are asked to minimize the complexity of their solutions while meeting a fidelity constraint (i.e., solutions are required to obtain a higher fidelity score than the prescribed baseline). Both tracks use the same data and metrics: fidelity is measured by means of PSNR with respect to a ground-truth HDR image (computed both directly and after a canonical tonemapping operation), while complexity metrics include the number of Multiply-Accumulate (MAC) operations and runtime (in seconds).
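The two fidelity scores can be sketched in a few lines: PSNR on the linear HDR values, and PSNR after a tonemapping step. As a minimal illustration, the sketch below uses the μ-law operator common in HDR benchmarks; the exact tonemapping and normalization chosen by the challenge may differ, and the names `mu_law_tonemap`, `psnr`, and `psnr_mu` are introduced here for illustration only.

```python
import math

def mu_law_tonemap(x, mu=5000.0):
    """Compress an HDR value in [0, 1] with the mu-law operator."""
    return math.log(1.0 + mu * x) / math.log(1.0 + mu)

def psnr(pred, target, peak=1.0):
    """PSNR between two equally sized lists of pixel values."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

def psnr_mu(pred, target, mu=5000.0):
    """PSNR computed after tonemapping both images (often called PSNR-mu)."""
    return psnr([mu_law_tonemap(v, mu) for v in pred],
                [mu_law_tonemap(v, mu) for v in target])
```

Note that the tonemapped score emphasizes errors in dark regions, since the μ-law curve expands low intensities before the squared error is taken.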
ISBN (paperback): 9781665448994
Monitoring of land cover and land use is crucial in natural resources management. Automatic visual mapping can carry enormous economic value for agriculture, forestry, or public administration. Satellite or aerial images combined with computer vision and deep learning enable precise assessment and can significantly speed up change detection. Aerial imagery usually provides images with much higher pixel resolution than satellite data, allowing more detailed mapping. However, there is still a lack of aerial datasets made for segmentation that cover rural areas at a resolution of tens of centimeters per pixel, with manual fine-grained labels and environmental instances of high public importance such as buildings, woods, water, or roads. Here we introduce the *** (Land Cover from Aerial Imagery) dataset for semantic segmentation. We collected images of 216.27 km² of rural areas across Poland, a country in Central Europe: 39.51 km² at a resolution of 50 cm per pixel and 176.76 km² at a resolution of 25 cm per pixel, and manually annotated the following four classes of objects: buildings, woodlands, water, and roads. Additionally, we report simple benchmark results, achieving 85.56% mean intersection over union on the test set. This shows that automatic mapping of land cover is possible with a relatively small, cost-efficient, RGB-only dataset. The dataset is publicly available at https://***/
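The reported benchmark metric, mean intersection over union, averages per-class IoU over the annotated classes. A minimal sketch over flattened label maps (the name `mean_iou` and the toy inputs are illustrative, not the dataset's evaluation code):

```python
def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt.

    pred, gt : flat lists of integer class labels of equal length.
    Classes absent from both maps are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)
```

For example, with `pred = [0, 0, 1, 1]` and `gt = [0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, giving a mean IoU of 7/12.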
ISBN (paperback): 9781665448994
Many deep learning based video compression artifact removal algorithms have been proposed to recover high-quality videos from low-quality compressed videos. Recently, methods were proposed to mine spatiotemporal information by utilizing multiple neighboring frames as reference frames. However, these post-processing methods use adjacent frames directly and neglect exploitable information from the rest of the video. In this paper, we propose an effective reference frame proposal strategy to boost the performance of existing multi-frame approaches. Besides, we introduce a loss based on the fast Fourier transform (FFT) to further improve the effectiveness of restoration. Experimental results show that our method achieves better fidelity and perceptual performance on the MFQE 2.0 dataset than the state-of-the-art methods. Our method won Track 1 and Track 2, and ranked 2nd in Track 3, of the NTIRE 2021 Challenge on Quality Enhancement of Heavily Compressed Videos.
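The idea of an FFT-based loss can be illustrated with a toy 1-D version: transform prediction and target into the frequency domain and penalize the distance between the spectra. The paper's exact formulation (2-D transform, weighting, choice of norm) may differ; `dft` and `fft_l1_loss` are names introduced here, and the naive O(n²) DFT stands in for a real FFT implementation.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a 1-D real signal."""
    n = len(signal)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(signal))
            for k in range(n)]

def fft_l1_loss(pred, target):
    """Mean L1 distance between the complex spectra of pred and target."""
    return sum(abs(a - b) for a, b in zip(dft(pred), dft(target))) / len(pred)
```

Because the transform is linear, this penalizes the same residual as a spatial loss, but redistributed across frequencies, which is why such terms are often used to discourage the over-smoothing typical of purely pixel-wise losses.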
ISBN (paperback): 9781665448994
Multimodal representations and continual learning are two areas closely related to human intelligence. The former considers the learning of shared representation spaces where information from different modalities can be compared and integrated (we focus on cross-modal retrieval between language and visual representations). The latter studies how to prevent forgetting a previously learned task when learning a new one. While humans excel in these two aspects, deep neural networks are still quite limited. In this paper, we combine both problems into a continual cross-modal retrieval setting, where we study how the catastrophic interference caused by new tasks impacts the embedding spaces and the cross-modal alignment required for effective retrieval. We propose a general framework that decouples the training, indexing and querying stages. We also identify and study different factors that may lead to forgetting, and propose tools to alleviate it. We found that the indexing stage plays an important role and that simply avoiding reindexing the database with updated embedding networks can lead to significant gains. We evaluated our methods on two image-text retrieval datasets, obtaining significant gains with respect to the fine-tuning baseline.
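The decoupling of indexing and querying can be made concrete with a toy nearest-neighbor retrieval: the index stores embeddings produced at indexing time, and queries are scored against those fixed vectors. Avoiding reindexing means the stored vectors stay frozen even after the embedding network is updated on a new task. This is a minimal sketch under that reading; `cosine` and `query` are illustrative names, not the paper's framework.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

def query(index, q):
    """Return the id of the indexed embedding most similar to query q.

    `index` maps item id -> embedding computed at *indexing* time;
    skipping reindexing keeps these vectors fixed across tasks.
    """
    return max(index, key=lambda i: cosine(index[i], q))
```

Under this scheme, only the query-side encoder drifts after continual updates, so retrieval degrades only through query/index misalignment rather than through a fully shifted database.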
ISBN (paperback): 9781665448994
Video analysis in tackle-collision based sports is highly subjective and exposed to bias, which is inherent in human observation, especially under time constraints. This limitation of match analysis in tackle-collision based sports can be seen as an opportunity for computer vision applications. Objectively tracking, detecting and recognising an athlete's movements and actions during match play from a distance using video, along with our improved understanding of injury aetiology and skill execution, will enhance our understanding of how injury occurs, assist match-day injury management, and reduce referee subjectivity. In this paper, we present a system for objectively evaluating in-game tackle risk in rugby union matches. First, a ball detection model is trained using the You Only Look Once (YOLO) framework, and these detections are then tracked by a Kalman filter (KF). Following this, a separate YOLO model is used to detect persons/players within a tackle segment, and the ball-carrier and tackler are identified. Subsequently, we utilize OpenPose to determine the poses of the ball-carrier and tackler, and their relative pose is then used to evaluate the risk of the tackle. We tested the system on a diverse collection of rugby tackles and achieved an evaluation accuracy of 62.50%. These results will enable referees in tackle-contact based sports to make more objective decisions, ultimately making these sports safer.
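The Kalman-filter tracking step can be sketched in miniature. The version below is a deliberately simplified 1-D constant-velocity filter with a scalar position variance; the real tracker presumably operates on 2-D pixel coordinates with a full state covariance matrix, and `kalman_step` is a name introduced here for illustration.

```python
def kalman_step(x, v, p, z, dt=1.0, q=1e-3, r=1.0):
    """One predict/update cycle of a simplified 1-D Kalman filter.

    x, v : current position and velocity estimate
    p    : scalar position variance (simplification of the covariance)
    z    : new measurement (e.g. a detected ball x-coordinate)
    q, r : process and measurement noise variances
    """
    # Predict: advance the state with the constant-velocity model.
    x_pred = x + v * dt
    p_pred = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p_pred / (p_pred + r)
    innovation = z - x_pred
    x_new = x_pred + k * innovation
    v_new = v + k * innovation / dt
    p_new = (1.0 - k) * p_pred
    return x_new, v_new, p_new
```

Feeding a sequence of detections through repeated calls smooths jitter and bridges frames where the ball detector misses (by falling back on the predict step alone).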
ISBN (digital): 9798350365474
ISBN (paperback): 9798350365481
Malicious attackers generate adversarial instances by introducing imperceptible perturbations into data. Even in the black-box setting where model details are concealed, attackers still exploit networks with cross-model transferability. Despite the notable success of untargeted attacks, achieving targeted attack transferability remains a challenging endeavor. Recent investigations have demonstrated the efficacy of ensemble-based techniques. However, utilizing additional models to carry out ensemble attacks brings extra costs. To reduce the number of white-box models required, model augmentation methods augment the given network to produce a variety of diverse models that contribute useful gradients for the attack. In this work, we propose Diversified Weight Pruning (DWP) as an innovative model augmentation technique specifically designed to facilitate the generation of transferable targeted attacks. In contrast to prior techniques, DWP preserves essential connections while simultaneously ensuring diversity among the pruned models, both of which are identified as pivotal factors for targeted transferability. DWP is shown to be effective in experiments on ImageNet under challenging conditions, with enhancements of up to 10.1%, 6.6%, and 7.0% across adversarially trained models, non-CNN architectures, and Google Cloud Vision, respectively.
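The two ingredients named in the abstract, preserving essential connections while keeping pruned variants diverse, can be sketched as follows. This is one hypothetical reading of the idea, not the paper's algorithm: here "essential" is approximated by weight magnitude, and diversity comes from seeding the random choice among the remaining weights. The name `diversified_prune` is introduced for illustration.

```python
import random

def diversified_prune(weights, keep_top=2, prune_frac=0.5, seed=0):
    """Return a pruned copy of `weights` (a flat list of floats).

    The `keep_top` largest-magnitude weights are always preserved
    (a stand-in for "essential connections"); among the rest, a
    seed-dependent random subset is zeroed, so different seeds yield
    diverse pruned model variants for gradient ensembling.
    """
    rng = random.Random(seed)
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    essential = set(order[:keep_top])
    prunable = [i for i in order if i not in essential]
    to_prune = set(rng.sample(prunable, int(len(prunable) * prune_frac)))
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]
```

An attacker would average gradients over several variants (different seeds) of the same white-box model instead of collecting additional pretrained networks.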
ISBN (digital): 9798350365474
ISBN (paperback): 9798350365481
The detection and recognition of distracted driving behaviors has emerged as a new vision task with the rapid development of computer vision, and is considered a challenging temporal action localization (TAL) problem. The primary goal of temporal localization is to determine the start and end time of actions in untrimmed videos. Currently, most state-of-the-art temporal localization methods adopt complex architectures, which are cumbersome and time-consuming. In this paper, we propose a robust and efficient two-stage framework for distracted behavior classification and localization based on the sliding window approach, which is suitable for untrimmed naturalistic driving videos. To address the issues of high similarity among different behaviors and interference from background classes, we propose a multi-view fusion and adaptive thresholding algorithm, which effectively reduces missed detections. To address the problem of fuzzy behavior boundary localization, we design a post-processing procedure that refines coarse localizations through post-connection and candidate behavior merging criteria. On the AICITY2024 Task 3 TestA set, our method performs well, achieving an Average Intersection over Union (AIoU) of 0.6080 and ranking eighth in AICITY2024 Task 3. Our code will be released in the near future.
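The candidate-merging part of such post-processing can be sketched simply: sliding-window classification yields fragmented candidate segments, and temporally close segments with the same behavior label are fused into one localization. The exact merging criteria in the paper are not specified in the abstract; `merge_candidates` and the gap threshold are illustrative assumptions.

```python
def merge_candidates(segments, max_gap=1.0):
    """Merge temporally close candidate segments of the same behavior.

    `segments` is a list of (label, start, end) tuples sorted by start
    time; consecutive segments with the same label separated by at most
    `max_gap` seconds are fused into a single localization.
    """
    merged = []
    for label, start, end in segments:
        if merged and merged[-1][0] == label and start - merged[-1][2] <= max_gap:
            prev_label, prev_start, prev_end = merged[-1]
            merged[-1] = (label, prev_start, max(prev_end, end))
        else:
            merged.append((label, start, end))
    return merged
```

Merging both sharpens boundaries (one span instead of several fragments) and suppresses spurious short detections produced by overlapping windows.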
ISBN (digital): 9798350365474
ISBN (paperback): 9798350365481
Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they are still unsuitable for live videos, one of the last steps toward making them completely convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. In our approach, we propose a framework that adapts existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining decent inpainting quality. Using this framework with some of the most recent inpainting models, we obtain strong online results with a consistent throughput above 20 frames per second.
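The "memorizing redundant computations" idea can be illustrated with a per-frame feature cache: in a sliding-window online setting, consecutive windows overlap heavily, so only the newest frame needs a fresh forward pass. This is a minimal sketch under that assumption; `FeatureMemo` and the `encode` callable are illustrative stand-ins, not the paper's framework.

```python
class FeatureMemo:
    """Cache per-frame encoder features so an online sliding-window
    inpainter recomputes only the newly arrived frame.

    `encode` stands in for an expensive per-frame computation
    (e.g. a transformer encoder pass on one frame).
    """

    def __init__(self, encode, window=4):
        self.encode = encode
        self.window = window
        self.cache = {}   # frame index -> cached features
        self.calls = 0    # counts actual encoder invocations

    def features_for(self, t):
        """Features for the window of frames ending at time t."""
        out = []
        for i in range(max(0, t - self.window + 1), t + 1):
            if i not in self.cache:
                self.calls += 1
                self.cache[i] = self.encode(i)
            out.append(self.cache[i])
        return out
```

With a window of W frames, a naive online loop costs W encoder passes per step, while the cached version amortizes to one pass per step, which is the kind of saving needed to reach real-time throughput.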
ISBN (digital): 9798350365474
ISBN (paperback): 9798350365481
Existing image restoration models have limited performance on high-resolution image shadow removal tasks, particularly in handling complex background information and unevenly distributed shadows. To address this challenge, we propose a novel two-stage approach called HirFormer for high-resolution image shadow removal. The first stage, Dynamic High-Resolution Transformer, reconstructs the high-resolution background information and removes a significant portion of the shadows based on the Transformer architecture. The second stage, Large-Scale Image Refinement, incorporates the NAFNet model to further eliminate residual shadows and address block artifacts introduced by the first stage. Experimental results on the official datasets validate the superiority of our method compared to existing approaches, and our approach won 1st place in the fidelity track of the NTIRE 2024 Shadow Removal Challenge in the final testing phase.
ISBN (digital): 9798350365474
ISBN (paperback): 9798350365481
Facial deepfakes are becoming more and more realistic, to the point that it is often difficult for humans to distinguish between a fake and a real video. However, it is acknowledged that deepfakes contain artifacts at different levels; we hypothesize a connection between manipulations and visible or non-visible artifacts, especially where the subject's movements are difficult to reproduce in detail. Accordingly, our approach relies on different quality measures, No-Reference (NR) and Full-Reference (FR), computed over the detected faces in the video. These measurements allow us to adopt a frame-by-frame approach and build an effective matrix-based representation of a video sequence. We show that the results obtained with this basic feature set and a neural network architecture are an encouraging first step, motivating extensions of this representation to further deepfake classes. The FaceForensics++ dataset is chosen for experiments, as it allows the evaluation of the proposed approach over different deepfake generation algorithms.