To address the difficulty of fully and effectively using features to represent target information completely during target tracking in complex scenes, this paper proposes an ECO-HC target tra...
ISBN: 9781665490627 (digital); 9781665490627 (print)
Vision Transformer (ViT) has recently been introduced into the computer vision (CV) field, where its self-attention mechanism has achieved remarkable performance. However, ViT cannot simply be applied to hyperspectral image (HSI) classification, for four reasons: 1) ViT is a spatial-only self-attention model, whereas HSI contains rich spectral information; 2) ViT needs sufficient training samples, whereas HSI suffers from limited samples; 3) ViT does not learn local features well; and 4) ViT does not consider multi-scale features. Furthermore, methods that combine a convolutional neural network (CNN) with ViT generally suffer from a large computational burden. Hence, this paper designs a pure ViT-based model suited to HSI classification, built on the following components: 1) a spectral-only vision transformer with aggregation of all tokens; 2) a spatial-only local-global transformer; 3) cross-scale local-global feature fusion; and 4) a cooperative loss function to unify the spectral and spatial features. As a result, the proposed approach achieves competitive classification performance on three public datasets compared with other state-of-the-art methods.
ISBN: 9798350365474 (digital); 9798350365481 (print)
This paper introduces a novel multi-task transformer for detecting synthetic speech. The network encodes magnitude and phase of the input speech with a feature bottleneck, used to autoencode the input magnitude, to predict the trajectory of the first phonetic formants (F0, F1, F2), and to distinguish whether the input speech is synthetic or natural. The approach achieves state-of-the-art performance on the ASVspoof 2019 LA dataset with an AUC score of 0.932, while ensuring interpretability at the same time.
ISBN: 9781665448994 (print)
Lossy image compression causes a loss of texture, especially at low bitrate. To mitigate this problem, we propose a novel image compression method that utilizes a reference-based image super-resolution model. We use two image compression models and a self texture transfer model. The image compression models encode and decode a whole input image and selected reference patches. The reference patches are small but compressed with high quality. The self texture transfer model transfers the texture of reference patches into similar regions in the compressed image. The experimental results show that our method can reconstruct accurate texture by transferring the texture of reference patches.
ISBN: 9783031245374; 9783031245381 (print)
The current rate of biodiversity decline calls for ecological conservation. In response, camera traps are increasingly being deployed to survey wildlife. Analysis of camera trap data can aid in curbing species extinction. However, a substantial amount of time is lost to manual review, which limits the use of camera traps for prompt decision-making. Severe visual challenges, together with the tendency of camera traps to record empty frames (frames that are natural backdrops with no wildlife presence), make wildlife detection and species recognition a demanding task. Thus, we propose a pipeline for wildlife detection and species recognition to expedite the processing of camera trap sequences. The proposed pipeline consists of three stages: (i) empty frame removal, (ii) wildlife detection, and (iii) species recognition and classification. For these stages we leverage vision transformer (ViT), DEtection TRansformer (DETR), vision and detection transformer (ViDT), faster region-based convolutional neural network (Faster R-CNN), Inception v3, and ResNet-50. We examine how well these algorithms perform at new and unseen locations, that is, under the challenges of domain generalisation. We demonstrate the effectiveness of the proposed pipeline using the Caltech camera trap (CCT) dataset.
ISBN: 9798331539856 (print)
Nowadays, camera traps are widely employed in monitoring biodiversity and assessing the population density of animal species. A challenge in animal recognition in camera trap images is the detection of small animals in complex environments and the identification of heavily obscured animals. This paper presents two novel methods that leverage sequentially captured images to improve animal recognition accuracy: one utilizing optical flow information and the other a motion-based algorithm based on the principle of median filtering. In experiments, we used two new real-world sequence-based camera trap image datasets to evaluate these methods. Our findings indicate that optical flow information effectively reduces false positive cases, while the motion-based algorithm significantly improves the accuracy of detecting animal presence and counting by substantially reducing false negative cases. Specifically, using the MegaDetector with a confidence threshold of 0.5 as the baseline, the motion-based method reduced false negative cases by over 70% while only slightly increasing false positive cases, and improved animal counting accuracy by more than 25%.
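The abstract above does not spell out its motion-based algorithm, so the following is only a minimal sketch of the general principle it names, median filtering over a sequence: the per-pixel median of several frames approximates the static background, and pixels that deviate strongly from it are flagged as motion. The function names, the toy frames, and the threshold of 30 are illustrative assumptions, not details from the paper.

```python
from statistics import median

def median_background(frames):
    """Per-pixel median over a sequence of equally sized grayscale frames."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(width)]
            for r in range(height)]

def motion_mask(frame, background, threshold=30):
    """Flag pixels whose absolute difference from the background exceeds
    the (assumed) threshold."""
    return [[abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

# Toy sequence: a static 3x3 scene of intensity 100 in which a single
# bright pixel (a stand-in for an animal) moves between frames.
frames = [[[100] * 3 for _ in range(3)] for _ in range(5)]
frames[0][0][0] = 200
frames[1][1][1] = 200
frames[2][2][2] = 200

background = median_background(frames)
mask = motion_mask(frames[1], background)
```

Because the bright pixel occupies each position in only one of the five frames, the median ignores it and the background stays uniform; only the moving pixel is flagged in `mask`.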
While machine learning powers impressive computer vision systems, they lack the human advantage of general world knowledge. This means they struggle to interpret visual data with humans' same richness of understan...
ISBN: 9781665448994 (print)
An important goal across most scientific fields is the discovery of causal structures underlying a set of observations. Unfortunately, causal discovery methods that are based on correlation or mutual information can often fail to identify causal links in systems that exhibit dynamic relationships. Such dynamic systems (including the famous coupled logistic map) exhibit 'mirage' correlations which appear and disappear depending on the observation window. This means not only that correlation is not causation but, perhaps counter-intuitively, that causation may occur without correlation. In this paper we describe Neural Shadow-Mapping, a neural-network-based method which embeds high-dimensional video data into a low-dimensional shadow representation, for subsequent estimation of causal links. We demonstrate its performance at discovering causal links from video representations of dynamic systems.
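To make the window dependence of correlation concrete, here is a small sketch that simulates a coupled logistic map and computes the Pearson correlation of the two variables in consecutive windows. It illustrates only the phenomenon the abstract mentions, not Neural Shadow-Mapping itself. The update rule is a commonly used two-species form, and the parameter values, initial conditions, and window length are assumptions chosen for illustration.

```python
from math import sqrt

def coupled_logistic(n, rx=3.8, ry=3.5, bxy=0.02, byx=0.1, x0=0.4, y0=0.2):
    """Simulate n steps of a two-variable coupled logistic map.
    All parameter values are illustrative assumptions."""
    xs, ys = [x0], [y0]
    for _ in range(n - 1):
        x, y = xs[-1], ys[-1]
        xs.append(x * (rx - rx * x - bxy * y))
        ys.append(y * (ry - ry * y - byx * x))
    return xs, ys

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    denom = sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0

def windowed_correlation(xs, ys, window):
    """Correlation of xs and ys in consecutive non-overlapping windows."""
    return [pearson(xs[s:s + window], ys[s:s + window])
            for s in range(0, len(xs) - window + 1, window)]

xs, ys = coupled_logistic(1000)
correlations = windowed_correlation(xs, ys, window=50)
```

Printing `correlations` lets a reader check how much the estimate changes from one window to the next, even though the two variables are causally coupled throughout the simulation.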
ISBN: 9781665445092 (print)
Neural Networks require large amounts of memory and compute to process high resolution images, even when only a small part of the image is actually informative for the task at hand. We propose a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high resolution images. Our method may be interfaced with any downstream neural network, is able to aggregate information from different patches in a flexible way, and allows the whole model to be trained end-to-end using backpropagation. We show results for traffic sign recognition, inter-patch relationship reasoning, and fine-grained recognition without using object/part bounding box annotations during training.
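As a rough illustration of patch selection, the sketch below splits an image into tiles, scores each tile, and keeps the k highest-scoring ones. It uses a hard, non-differentiable Top-K, which is a deliberate simplification: the differentiable operator described in the abstract is not reproduced here. The mean-intensity score is a placeholder assumption standing in for a learned scoring network, and all function names are hypothetical.

```python
def split_into_patches(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch tiles,
    listed in row-major order."""
    height, width = len(image), len(image[0])
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, height - patch + 1, patch)
        for c in range(0, width - patch + 1, patch)
    ]

def score(tile):
    """Placeholder relevance score (mean intensity); a learned scorer
    would take its place in practice."""
    values = [v for row in tile for v in row]
    return sum(values) / len(values)

def top_k_patches(image, patch, k):
    """Return the indices and contents of the k highest-scoring tiles,
    with indices in ascending order."""
    tiles = split_into_patches(image, patch)
    ranked = sorted(range(len(tiles)), key=lambda i: score(tiles[i]), reverse=True)
    keep = sorted(ranked[:k])
    return keep, [tiles[i] for i in keep]

# Toy 4x4 image with two informative pixels in different tiles.
image = [[0] * 4 for _ in range(4)]
image[0][2] = 9  # falls in the top-right tile (index 1)
image[3][0] = 5  # falls in the bottom-left tile (index 2)

indices, selected = top_k_patches(image, patch=2, k=2)
```

Only the two selected tiles would then be passed to a downstream network, which is where the memory and compute savings come from.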
ISBN: 9781665448994 (print)
The goal of this paper is to model the fashion compatibility of an outfit and to provide explanations. We first extract features of all attributes of all items via convolutional neural networks, and then train a bidirectional Long Short-Term Memory (Bi-LSTM) model to learn the compatibility of an outfit by treating these attribute features as a sequence. Gradient penalty regularization is used to train an inter-factor compatibility net, which computes the loss for the judgment and provides an explanation generated from the recognized reasons related to that judgment. To train and evaluate the proposed approach, we expanded the EVALUATION3 dataset in terms of the number of items and attributes. Experimental results show that our approach can successfully evaluate compatibility and give reasons.