This work is devoted to the development and evaluation of methods for enhancing the accuracy of visual navigation systems for unmanned aerial vehicles (UAVs) using advanced object detection and recognition algorithms....
ISBN: 9781665445092 (print)
The performance of face recognition systems degrades as the variability of the acquired faces increases. Prior work alleviates this issue either by monitoring face quality during pre-processing or by predicting data uncertainty along with the face feature. This paper proposes MagFace, a category of losses that learn a universal feature embedding whose magnitude measures the quality of the given face. Under the new loss, it can be proven that the magnitude of the feature embedding increases monotonically as the subject becomes more likely to be recognized. In addition, MagFace introduces an adaptive mechanism that learns well-structured within-class feature distributions by pulling easy samples toward class centers while pushing hard samples away. This prevents models from overfitting on noisy, low-quality samples and improves face recognition in the wild. Extensive experiments on face recognition, quality assessment, and clustering demonstrate its superiority over state-of-the-art methods.
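As a rough illustration of how a magnitude-aware embedding could be consumed downstream, the sketch below splits each embedding into a magnitude (read as a quality score) and a unit-length direction (used for identity comparison). It does not implement the MagFace loss; the function name `split_magnitude_direction` and the toy vectors are assumptions made for this example only.

```python
import numpy as np

def split_magnitude_direction(embeddings):
    """Split embeddings into per-sample magnitude and unit direction.

    Hypothetical helper: under a MagFace-style embedding, the magnitude
    would be read as a quality score and the direction as the identity
    feature. This is only a post-processing sketch, not the training loss.
    """
    emb = np.asarray(embeddings, dtype=float)
    magnitude = np.linalg.norm(emb, axis=1)
    # Guard against division by zero for degenerate all-zero embeddings.
    direction = emb / np.maximum(magnitude, 1e-12)[:, None]
    return magnitude, direction

# Toy data: two embeddings pointing the same way with different magnitudes.
toy = [[3.0, 4.0], [0.3, 0.4]]
quality, identity = split_magnitude_direction(toy)
```

Here both toy vectors share the same direction, so a cosine comparison would treat them as the same identity, while the larger magnitude of the first would mark it as the higher-quality sample.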
Object detection is a common application of computer vision. It detects and localizes objects within the range of a specific image-capturing device. Utilizing the technology for various functionalities has been in pract...
ISBN: 9781665448994 (print)
When creating a new labeled dataset, human analysts or data reductionists must review and annotate large numbers of images. This process is time-consuming and a barrier to the deployment of new computer vision solutions, particularly for rarely occurring objects. To reduce the number of images requiring human attention, we evaluate the utility of images created from 3D models refined with a generative adversarial network to select confidence thresholds that significantly reduce false alarm rates. The resulting approach has been demonstrated to cut the number of images needing review by 50% while preserving a 95% recall rate, with only 6 labeled examples of the target.
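The threshold-selection idea can be sketched as follows: given detector confidence scores for known positives (assumed here to come from synthetic renderings of the target), pick the highest threshold that still retains a target fraction of them, then measure how many images would remain for human review. The function names and the toy scores are hypothetical and not taken from the paper.

```python
import numpy as np

def threshold_for_recall(pos_scores, target_recall=0.95):
    """Highest confidence threshold that keeps >= target_recall of positives.

    pos_scores: detector scores on known positive examples (assumed to be
    synthetic target images in this sketch).
    """
    scores = np.sort(np.asarray(pos_scores, dtype=float))[::-1]  # descending
    keep = int(np.ceil(target_recall * len(scores)))  # positives to retain
    return scores[keep - 1]

def review_fraction(all_scores, threshold):
    """Fraction of images whose score passes the threshold (sent to review)."""
    return float(np.mean(np.asarray(all_scores, dtype=float) >= threshold))

# Toy usage with made-up scores.
thr = threshold_for_recall([0.9, 0.1, 0.8, 0.3], target_recall=0.5)
frac = review_fraction([0.9, 0.85, 0.2, 0.1], thr)
```

A higher threshold sends fewer images to review but risks dropping true targets, so the recall target fixes that trade-off before any real data is scored.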
Megapixel imagers with computer vision capabilities exhibit high acquisition performance at the expense of power consumption that prevents always-on usage. This paper presents ATIM, a wake-up solution for both 3-scale ...
Missing person searches are typically initiated with a description of a person that includes their age, race, clothing, and gender, possibly supported by a photo. Unmanned Aerial Systems (sUAS) imbued with computer Vi...
First, data gloves were made. The STM32 microcontroller was selected for the data glove to collect hand motion pose data from the MPU6050 sensor, and the data was transferred to the Unity3D engine via a wireless Bluet...
ISBN: 9781665445092 (print)
The problem of expert model selection deals with choosing the appropriate pretrained network ("expert") to transfer to a target task. Existing methods, however, generally depend on two separate assumptions: the presence of labeled images and access to powerful "probe" networks that yield useful features. In this work, we demonstrate the current reliance on both of these aspects and develop algorithms that operate when either assumption fails. In the unlabeled case, we show that pseudolabels from the probe network provide sufficiently discriminative gradients to perform nearly equal task selection, even when the probe network is trained on imagery unrelated to the tasks. To compute the embedding with no probe network at all, we introduce the Task Tangent Kernel (TTK), which uses a kernelized distance across multiple random networks to achieve performance over double that of other methods with randomly initialized models.
ISBN: 9781665409155 (print)
This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes, which solely utilize the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time, and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with state-of-the-art CNN counterparts that rely on computationally heavy optical flow.
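To make the idea of factorized self-attention concrete, the sketch below attends first over spatial tokens within each frame and then over time at each spatial position, instead of attending jointly over all tokens. It is a simplified illustration and not the MM-ViT architecture: learned query/key/value projections, multiple heads, and the modality axis are all omitted, and every function name is an assumption for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, D). For brevity Q = K = V = x (no learned
    projections), which is an assumption of this sketch.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x

def factorized_space_time_attention(tokens):
    """tokens: (T, S, D) = frames x spatial tokens x channels."""
    x = self_attention(tokens)   # spatial: S tokens attend within each frame
    x = np.swapaxes(x, 0, 1)     # (S, T, D)
    x = self_attention(x)        # temporal: T tokens attend per spatial position
    return np.swapaxes(x, 0, 1)  # back to (T, S, D)

tokens = np.random.default_rng(0).normal(size=(4, 6, 8))
out = factorized_space_time_attention(tokens)
```

With T frames and S spatial tokens, the two attention maps cost on the order of T·S² + S·T² score entries rather than (T·S)² for joint attention, which is the motivation for factorizing; a modality axis could in principle be handled with a third pass of the same kind.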
In recent years, the rapid growth of Internet of Things (IoT) technology and computer vision (CV) technology has provided strong support for the application of deep learning (DL) in various fields. Among them, DL base...