ISBN (digital): 9798350365474
ISBN (print): 9798350365481
This paper introduces a framework for Audio Provenance Analysis, addressing the complex challenge of analyzing heterogeneous sets of audio items without requiring any prior knowledge of their content. Our framework applies a novel approach that combines partial audio matching and phylogeny techniques. It constructs directed acyclic graphs to capture the origins and evolution of content within near-duplicate audio clusters, identifying the least altered versions and tracing the reuse of content within these clusters. The approach is evaluated for two selected application scenarios, demonstrating that it can accurately determine the direction of content reuse and identify parent-child relationships, while also offering a dedicated dataset for benchmarking future research in this area.
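As a rough illustration of the kind of graph construction described above (this is a generic sketch, not the paper's algorithm): given a hypothetical asymmetric dissimilarity matrix d[i][j] for one near-duplicate cluster, where a low d[i][j] suggests item j was derived from item i, each item can be attached to its cheapest plausible parent while keeping the result acyclic.

```python
# Minimal, hypothetical sketch: derive a provenance forest for one near-duplicate
# audio cluster from an assumed asymmetric dissimilarity matrix d[i][j].
from typing import List

def build_provenance_graph(d: List[List[float]], threshold: float) -> List[int]:
    """Return parent[j] for every item j (-1 marks a root / least-altered candidate)."""
    n = len(d)
    parent = [-1] * n

    def would_create_cycle(child: int, cand: int) -> bool:
        # Each node has at most one parent, so walking up the chain is enough.
        node = cand
        while node != -1:
            if node == child:
                return True
            node = parent[node]
        return False

    for j in range(n):
        # Try candidate parents from cheapest to most expensive derivation cost.
        for cost, i in sorted((d[i][j], i) for i in range(n) if i != j):
            if cost >= threshold:
                break  # no plausible parent: j stays a root
            if not would_create_cycle(j, i):
                parent[j] = i
                break
    return parent
```

The threshold plays the role of the decision boundary between "derived from another item in the cluster" and "least altered version"; the actual framework combines this with partial audio matching scores.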
ISBN (print): 9781665448994
Motivated by applications from computer vision to bioinformatics, the field of shape analysis deals with problems where one wants to analyze geometric objects, such as curves, while ignoring actions that preserve their shape, such as translations, rotations, scalings, or reparametrizations. Mathematical tools have been developed to define notions of distances, averages, and optimal deformations for geometric objects. One such framework, which has proven to be successful in many applications, is based on the square root velocity (SRV) transform, which allows one to define a computable distance between spatial curves regardless of how they are parametrized. This paper introduces a supervised deep learning framework for the direct computation of SRV distances between curves, which usually requires an optimization over the group of reparametrizations that act on the curves. The benefits of our approach in terms of computational speed and accuracy are illustrated via several numerical experiments on both synthetic and real data.
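For context, a minimal numerical sketch of the SRV transform q(t) = c'(t)/sqrt(|c'(t)|) and the unaligned L2 distance between two sampled curves is shown below; the optimization over reparametrizations, which the paper's network learns to bypass, is deliberately omitted, and the sampling details are assumptions.

```python
# Sketch: discrete SRV transform and the unaligned L2 distance between two curves.
import numpy as np

def srv_transform(curve: np.ndarray) -> np.ndarray:
    """curve: (T, d) samples on a uniform grid -> SRV q(t) = c'(t) / sqrt(|c'(t)|)."""
    dt = 1.0 / (len(curve) - 1)
    velocity = np.gradient(curve, dt, axis=0)                 # finite-difference derivative
    speed = np.linalg.norm(velocity, axis=1, keepdims=True)
    return velocity / np.sqrt(np.maximum(speed, 1e-8))

def srv_l2_distance(c1: np.ndarray, c2: np.ndarray) -> float:
    """L2 distance between SRV representations, without reparametrization alignment."""
    q1, q2 = srv_transform(c1), srv_transform(c2)
    dt = 1.0 / (len(c1) - 1)
    return float(np.sqrt(np.sum((q1 - q2) ** 2) * dt))

# Example: a quarter circle vs. the same shape traced with a different speed profile.
t = np.linspace(0.0, 1.0, 200)
circle = np.stack([np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)], axis=1)
warped = np.stack([np.cos(0.5 * np.pi * t**2), np.sin(0.5 * np.pi * t**2)], axis=1)
print(srv_l2_distance(circle, warped))  # nonzero; minimizing over reparametrizations would shrink it
```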
ISBN (print): 9781665448994
Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use a simple pretraining task of predicting the pairings between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. We propose a data-efficient contrastive distillation method that uses soft labels to learn from noisy image-text pairs. Our model transfers knowledge from pre-trained image and sentence encoders and achieves strong performance with only 3M image-text pairs, 133x smaller than CLIP. Our method exceeds the previous SoTA for general zero-shot learning on ImageNet 21k+1k by a relative 73% with a ResNet50 image encoder and DeCLUTR text encoder. We also outperform CLIP by a relative 10.5% on zero-shot evaluation on Google Open Images (19,958 classes).
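A hedged sketch of the soft-label contrastive distillation idea follows (names, shapes, and temperatures are assumptions, not the authors' code): the student's image-text similarity distribution is pushed toward soft targets produced by the frozen pre-trained encoders, rather than toward a one-hot diagonal.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_distillation_loss(student_img, student_txt,
                                        teacher_img, teacher_txt,
                                        tau_s: float = 0.07, tau_t: float = 0.1):
    """All inputs are (B, D) embeddings; rows with the same index form a (possibly noisy) pair."""
    s_logits = F.normalize(student_img, dim=-1) @ F.normalize(student_txt, dim=-1).T / tau_s
    with torch.no_grad():
        t_logits = F.normalize(teacher_img, dim=-1) @ F.normalize(teacher_txt, dim=-1).T / tau_t
        soft_targets = t_logits.softmax(dim=-1)   # soft labels instead of a one-hot diagonal
    # Cross-entropy of student rows against the teacher distribution (image -> text);
    # a symmetric text -> image term would typically be added as well.
    return -(soft_targets * s_logits.log_softmax(dim=-1)).sum(dim=-1).mean()

# usage sketch with random embeddings
B, D = 8, 256
loss = soft_contrastive_distillation_loss(torch.randn(B, D), torch.randn(B, D),
                                           torch.randn(B, D), torch.randn(B, D))
```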
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
We aim to provide a comprehensive view of the inference efficiency of DETR-style detection models. We explore the effect of basic efficiency techniques and identify the factors that are easy to implement yet effectively improve the efficiency-accuracy trade-off. Specifically, we investigate the effect of input resolution, multi-scale feature enhancement, and backbone pre-training. Our experiments show that 1) adjusting the input resolution is a simple yet effective way to achieve a better efficiency-accuracy trade-off, 2) multi-scale feature enhancement can be lightened with only a marginal decrease in accuracy, and 3) improved backbone pre-training can further improve the trade-off.
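To make the input-resolution knob concrete, here is a small, generic measurement sketch (model and image names are placeholders, not the paper's code): sweep the shorter-side resolution, record per-image latency, and pair each latency with the corresponding accuracy to chart the trade-off.

```python
import time
import torch
import torch.nn.functional as F

@torch.no_grad()
def latency_at_resolution(model, images, short_side: int, device: str = "cpu"):
    """images: list of (3, H, W) tensors; returns mean per-image latency in milliseconds."""
    model.eval().to(device)
    total = 0.0
    for img in images:
        scale = short_side / min(img.shape[-2:])
        resized = F.interpolate(img[None], scale_factor=scale,
                                mode="bilinear", align_corners=False).to(device)
        start = time.perf_counter()
        model(resized)                       # DETR-style forward pass
        if device == "cuda":
            torch.cuda.synchronize()
        total += time.perf_counter() - start
    return 1000.0 * total / len(images)

# e.g. compare 480 / 640 / 800 shorter-side inputs on the same validation images,
# then pair each latency with the corresponding AP to plot the trade-off curve.
```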
ISBN (digital): 9798350365474
ISBN (print): 9798350365481
Irrigation systems can vary widely in scale, from small-scale subsistence farming to large commercial agriculture (see Fig. 1). The heterogeneity in irrigation practices and systems across different regions adds to the complexity of mapping. Distinguishing between irrigated and non-irrigated areas is challenging because the spectral characteristics of irrigation systems and practices vary across regions, further complicating the task of mapping different types of irrigation. For example, rainfed agriculture is prevalent in the Midwest, Southeast, and parts of the Northeast U.S., while irrigation is common in arid Western and Southwestern states. Rainfed farming can result in highly variable patterns of cultivation. Farmers may practice rainfed agriculture in some fields while irrigating others, leading to a complex mosaic of irrigated and non-irrigated areas within the same region.
ISBN (print): 9781665448994
3D scene flow estimation is a vital tool for perceiving our environment with depth or range sensors. Unlike optical flow, the data is usually sparse and in most cases partially occluded between two temporal samplings. Here we propose a new scene flow architecture called OGSF-Net, which tightly couples the learning of both flow and occlusions between frames. Their coupled symbiosis results in a more accurate prediction of flow in space. Unlike a traditional multi-action network, our unified approach is fused throughout the network, boosting performance for both occlusion detection and flow estimation. Our architecture is the first to gauge occlusion in 3D scene flow estimation on point clouds. On key datasets such as FlyingThings3D and KITTI, we achieve state-of-the-art results.
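A simplified sketch of one way flow and occlusion can be coupled in training (this is an illustration under assumed shapes, not the OGSF-Net architecture): a per-point occlusion probability gates the data term of the flow loss, so points with no correspondence in the second frame do not dominate training.

```python
import torch

def occlusion_gated_flow_loss(pred_flow, gt_flow, occ_logits, gt_occ, alpha: float = 0.5):
    """pred_flow, gt_flow: (B, N, 3); occ_logits, gt_occ: (B, N) floats, 1.0 = occluded."""
    occ_prob = torch.sigmoid(occ_logits)
    visible = 1.0 - gt_occ                                     # supervise flow only where visible
    flow_err = (pred_flow - gt_flow).norm(dim=-1)              # per-point end-point error
    flow_loss = (visible * flow_err).sum() / visible.sum().clamp(min=1.0)
    occ_loss = torch.nn.functional.binary_cross_entropy(occ_prob, gt_occ)
    return flow_loss + alpha * occ_loss
```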
ISBN (print): 9781665448994
This paper proposes an attention-based multi-level model with a multi-scale backbone for thermal image super-resolution. The thermal image dataset is provided by PBVS 2020 in their thermal image super-resolution challenge. This dataset contains images at three different resolution scales (low, medium, high) [1]; however, only the medium- and high-resolution images are used to train the proposed architecture to generate super-resolution images at x2 and x4 scales. The proposed architecture uses Res2Net blocks as the backbone of the network. Along with this, a coordinate convolution layer and dual attention are also used in the architecture. Further, multi-level supervision is implemented so that each block's output is supervised for similarity with the real image during training. To test the robustness of the proposed model, we evaluated it on the Thermal-6 dataset [20]. The results show that our model achieves state-of-the-art results on the PBVS dataset, and the results on the Thermal-6 dataset show that the model has decent generalization capacity.
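For readers unfamiliar with the coordinate convolution layer mentioned above, a minimal sketch of the general idea follows (channel sizes here are illustrative, not the paper's configuration): two normalized coordinate channels are concatenated to the input before an ordinary convolution, giving the filters explicit positional information.

```python
import torch
import torch.nn as nn

class CoordConv2d(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3, padding: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 2, out_channels, kernel_size, padding=padding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        ys = torch.linspace(-1.0, 1.0, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1.0, 1.0, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, ys, xs], dim=1))

# usage sketch: a CoordConv block on a single-channel thermal input
layer = CoordConv2d(1, 64)
out = layer(torch.randn(2, 1, 64, 64))   # -> (2, 64, 64, 64)
```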
ISBN (digital): 9781665469463
ISBN (print): 9781665469463
The objective of this paper is few-shot object detection (FSOD) - the task of expanding an object detector to a new category given only a few instances for training. We introduce a simple pseudo-labelling method to source high-quality pseudo-annotations from the training set for each new category, vastly increasing the number of training instances and reducing class imbalance; our method finds previously unlabelled instances. Naively training with model predictions yields suboptimal performance; we present two novel methods to improve the precision of the pseudo-labelling process: first, we introduce a verification technique to remove candidate detections with incorrect class labels; second, we train a specialised model to correct poor-quality bounding boxes. After these two novel steps, we obtain a large set of high-quality pseudo-annotations that allow our final detector to be trained end-to-end. Additionally, we demonstrate that our method maintains base-class performance, and the utility of simple augmentations in FSOD. When benchmarked on PASCAL VOC and MS-COCO, our method achieves state-of-the-art or second-best performance compared to existing approaches across all numbers of shots.
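A hedged sketch of the pseudo-labelling loop described above is given below; the function names and thresholds are placeholders, not the authors' code, but the two precision-improving steps (label verification, then box correction) follow the order stated in the abstract.

```python
def mine_pseudo_annotations(detector, verifier, box_refiner, images,
                            novel_classes, score_thresh=0.8, verify_thresh=0.5):
    """Collect high-quality pseudo-annotations for novel classes from unlabelled training images."""
    pseudo_annotations = []
    for image in images:
        for box, label, score in detector(image):            # initial candidate detections
            if label not in novel_classes or score < score_thresh:
                continue
            # Step 1: drop candidates whose class label the verification model does not confirm.
            if verifier(image, box, label) < verify_thresh:
                continue
            # Step 2: correct poor-quality boxes with the specialised box-regression model.
            refined_box = box_refiner(image, box)
            pseudo_annotations.append((image, refined_box, label))
    return pseudo_annotations
```

The resulting pseudo-annotations are then merged with the few ground-truth instances so the final detector can be trained end-to-end.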
ISBN (print): 9781665448994
In this paper, we propose an image quality transformer (IQT) that successfully applies a transformer architecture to a perceptual full-reference image quality assessment (IQA) task. Perceptual representations have become increasingly important in image quality assessment. In this context, we extract perceptual feature representations from each input image using a convolutional neural network (CNN) backbone. The extracted feature maps are fed into the transformer encoder and decoder in order to compare the reference and distorted images. Following the approach of transformer-based vision models [18, 55], we use an extra learnable quality embedding and position embedding. The output of the transformer is passed to a prediction head in order to predict a final quality score. The experimental results show that our proposed model achieves outstanding performance on the standard IQA datasets. On a large-scale IQA dataset containing output images of generative models, our model also shows promising results. The proposed IQT was ranked first among 13 participants in the NTIRE 2021 perceptual image quality assessment challenge [23]. Our work will be an opportunity to further expand the approach for the perceptual IQA task.
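To illustrate the overall pipeline shape (CNN features from both images, a learnable quality token, a transformer, and a scalar head), here is a heavily simplified, encoder-only sketch; the backbone, dimensions, and the feature-difference comparison are assumptions and differ from the actual IQT encoder-decoder design.

```python
import torch
import torch.nn as nn
import torchvision

class TinyIQT(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])   # (B, 512, h, w) feature maps
        self.proj = nn.Conv2d(512, dim, kernel_size=1)
        self.quality_token = nn.Parameter(torch.zeros(1, 1, dim))   # extra learnable quality embedding
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True), layers)
        self.head = nn.Linear(dim, 1)                                # prediction head -> quality score

    def tokens(self, img):
        feat = self.proj(self.cnn(img))                              # (B, dim, h, w)
        return feat.flatten(2).transpose(1, 2)                       # (B, h*w, dim)

    def forward(self, reference, distorted):
        # Compare the two images through a difference of their token sequences.
        tok = self.tokens(reference) - self.tokens(distorted)
        tok = torch.cat([self.quality_token.expand(tok.size(0), -1, -1), tok], dim=1)
        return self.head(self.encoder(tok)[:, 0])                    # scalar score per image pair

score = TinyIQT()(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```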
ISBN (print): 9781665448994
Humans are arguably among the most important subjects in video streams, and many real-world applications such as video summarization or video editing workflows often require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person re-identification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio (voices) and visual (faces) streams are densely annotated. APES contains over 1.9K identities labeled across 36 hours of video, making it the largest dataset available for untrimmed audiovisual person search. A key property of APES is that it includes dense temporal annotations that link faces to speech segments of the same identity. To showcase the potential of our new dataset, we propose an audiovisual baseline and benchmark for person retrieval. Our study shows that modeling audiovisual cues benefits the recognition of people's identities.
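A hypothetical sketch of an audiovisual retrieval baseline in the spirit described above (the encoders and fusion weighting are assumptions, not the APES baseline itself): face and voice embeddings for a segment are fused into a single identity embedding, and gallery items are ranked by cosine similarity to the query.

```python
import numpy as np
from typing import List

def fuse(face_emb: np.ndarray, voice_emb: np.ndarray, w_face: float = 0.6) -> np.ndarray:
    """Combine L2-normalized face and voice embeddings into one identity embedding."""
    face = face_emb / (np.linalg.norm(face_emb) + 1e-8)
    voice = voice_emb / (np.linalg.norm(voice_emb) + 1e-8)
    fused = w_face * face + (1.0 - w_face) * voice
    return fused / (np.linalg.norm(fused) + 1e-8)

def rank_gallery(query: np.ndarray, gallery: List[np.ndarray]) -> List[int]:
    """Return gallery indices sorted by cosine similarity to the fused query embedding."""
    sims = [float(query @ g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)
```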