ISBN: (digital) 9781665487399
ISBN: (print) 9781665487399
The ability to learn richer network representations generally boosts the performance of deep learning models. To improve representation-learning in convolutional neural networks, we present a multi-branch architecture, which applies channel-wise attention across different network branches to leverage the complementary strengths of both feature-map attention and multi-path representation. Our proposed Split-Attention module provides a simple and modular computation block that can serve as a drop-in replacement for the popular residual block, while producing more diverse representations via cross-feature interactions. Adding a Split-Attention module into the architecture design space of RegNet-Y and FBNetV2 directly improves the performance of the resulting network. Replacing residual blocks with our Split-Attention module, we further design a new variant of the ResNet model, named ResNeSt, which outperforms EfficientNet in terms of the accuracy/latency trade-off.
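As a rough illustration of channel-wise attention across branches, the sketch below normalizes per-channel attention logits over the branches with a softmax and uses the resulting weights to combine the branches' channel features. The function name `split_attention_fuse`, the plain-list data layout, and the assumption that the attention logits are already given are illustrative choices only; how the logits are produced is outside this sketch, and it should not be read as the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def split_attention_fuse(branch_features, branch_logits):
    """Fuse per-branch channel features with channel-wise attention.

    branch_features[r][c]: feature value of channel c in branch r
        (hypothetical layout, one scalar per channel for brevity).
    branch_logits[r][c]: attention logit for channel c in branch r
        (assumed to be given).
    Returns one fused value per channel.
    """
    num_branches = len(branch_features)
    num_channels = len(branch_features[0])
    fused = []
    for c in range(num_channels):
        # For each channel, the attention weights compete across branches.
        weights = softmax([branch_logits[r][c] for r in range(num_branches)])
        fused.append(
            sum(weights[r] * branch_features[r][c] for r in range(num_branches))
        )
    return fused
```

With equal logits every branch contributes equally, so the fused output is the plain per-channel average of the branches; strongly skewed logits make a channel follow a single branch.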
Natural language (NL) based vehicle retrieval aims to find a specific vehicle given a text description. Unlike image-based vehicle retrieval, NL-based vehicle retrieval requires considering not only vehicle appearance, but also the surrounding environment and temporal relations. In this paper, we propose a Symmetric Network with Spatial Relationship Modeling (SSM) method for NL-based vehicle retrieval. Specifically, we design a symmetric network to learn unified cross-modal representations between text descriptions and vehicle images, in which vehicle appearance details and global vehicle trajectory information are preserved. In addition, to make better use of location information, we propose a spatial relationship modeling method that takes the surrounding environment and the mutual relationships between vehicles into consideration. Qualitative and quantitative experiments verify the effectiveness of the proposed method. We achieve 43.92% MRR accuracy on the test set of the natural language-based vehicle retrieval track of the 6th AI City Challenge, placing 4th on the public leaderboard.
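For readers unfamiliar with the reported metric, the following is a minimal sketch of mean reciprocal rank (MRR) for a retrieval task. The function name, the list-based inputs, and the convention that a query whose target is missing from the ranking contributes 0 are assumptions made for illustration; the challenge's official evaluation script may differ in its details.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Compute MRR over a set of queries.

    ranked_lists[i]: candidate ids for query i, best match first.
    targets[i]: the correct id for query i.
    """
    if len(ranked_lists) != len(targets):
        raise ValueError("each query needs exactly one target")
    if not targets:
        raise ValueError("at least one query is required")
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        if target in ranking:
            # Rank is 1-based: a target in first place scores 1.0.
            total += 1.0 / (ranking.index(target) + 1)
        # Assumption: a target absent from the ranking contributes 0.
    return total / len(targets)
```

For example, if the correct vehicle is ranked first, second, and third for three queries, the MRR is (1 + 1/2 + 1/3) / 3.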
Artificial intelligence technology is increasingly used in games, especially wargames. The addition of artificial intelligence algorithms enables these games to solve decision-making problems in complex environments more quickly, surpassing the vast majority of experienced human players in competitive games. However, because neural networks are vulnerable to adversarial examples, all modules that use artificial intelligence algorithms are at risk of being attacked. In wargames, adversarial examples can prevent units from following their established routes or actions when performing tasks. Based on such risks, this paper proposes a conceptual deception scheme for attacking intelligent modules in wargames through adversarial examples, and outlines challenges and prospects for current technologies. To our knowledge, we are the first team to analyze the impact of adversarial examples on the running process of wargames, namely the OODA loop, and to simulate them in the corresponding wargaming software. We find that when artificial intelligence technology is widely used in wargames, adversarial examples have a subversive impact on several activities in several steps, which directly leads to failure to complete the established game goals.
Facial manipulation by deep fake has caused major security risks and raised severe societal concerns. As a countermeasure, a number of deep fake detection methods have been proposed recently. Most of them model deep fake detection as a binary classification problem using a backbone convolutional neural network (CNN) architecture pretrained for the task. These CNN-based methods have demonstrated very high efficacy in deep fake detection, with an Area Under the Curve (AUC) as high as 0.99. However, the performance of these methods degrades significantly when evaluated across datasets. In this paper, we formulate deep fake detection as a hybrid combination of supervised and reinforcement learning (RL) to improve its cross-dataset generalization performance. The proposed method uses an RL agent to choose the top-k augmentations for each test sample in an image-specific manner. The classification scores that the CNN produces for all chosen augmentations of each test image are averaged for the final real or fake classification. Through extensive experimental validation, we demonstrate the superiority of our method over existing published research in cross-dataset generalization of deep fake detectors, thus obtaining state-of-the-art performance.
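The test-time procedure described above (select the top-k augmentations, score each augmented image, average the scores) can be sketched as follows. Everything here is a hypothetical stand-in: `agent_scores` represents whatever per-augmentation preference the RL agent outputs, `classifier` is any callable returning a fake-probability in [0, 1], and the 0.5 decision threshold is an assumption rather than a value taken from the paper.

```python
def classify_with_top_k_augmentations(
    image, augmentations, agent_scores, classifier, k, threshold=0.5
):
    """Average classifier scores over the k augmentations the agent prefers.

    augmentations: list of callables, each mapping an image to an image.
    agent_scores: one preference value per augmentation (hypothetical
        stand-in for the RL agent's image-specific output).
    classifier: callable mapping an image to a fake-probability in [0, 1].
    Returns (mean_score, label).
    """
    if len(augmentations) != len(agent_scores):
        raise ValueError("one agent score per augmentation is required")
    if not 1 <= k <= len(augmentations):
        raise ValueError("k must be between 1 and the number of augmentations")
    # Indices of augmentations sorted by agent preference, best first.
    ranked = sorted(
        range(len(augmentations)), key=lambda i: agent_scores[i], reverse=True
    )
    scores = [classifier(augmentations[i](image)) for i in ranked[:k]]
    mean_score = sum(scores) / len(scores)
    # Assumed decision rule: scores at or above the threshold mean "fake".
    label = "fake" if mean_score >= threshold else "real"
    return mean_score, label
```

Passing the augmentations and the classifier as callables keeps the sketch independent of any particular image library or network.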
Deep neural networks have shown promising results in image super-resolution by learning a complex mapping from low-resolution to high-resolution images. However, most approaches learn to upsample using convolution in the spatial domain and are confined to local features. This restricts the receptive field of the network and therefore degrades the overall quality of the high-resolution image. To alleviate this issue, we propose an architecture that learns both local and global features and fuses them to generate high-quality images. The network uses non-local attention-aided Fast Fourier Convolutions (NL-FFC) to widen the receptive field and learn long-range dependencies. The analyses further show that these Fourier features implicitly provide faster convergence on low-frequency components only to learn a prior for unobserved high-frequency components. The model generalizes well to different datasets. In the ablation study, we further investigate the role of non-local attention and the ratio of local to global features in maximizing the performance gain.
Adversarial training (AT) is a simple yet effective defense against adversarial attacks on image classification systems, which is based on augmenting the training set with attacks that maximize the loss. However, the effectiveness of AT as a defense for video classification has not been thoroughly studied. Our first contribution is to show that generating optimal attacks for video requires carefully tuning the attack parameters, especially the step size. Notably, we show that the optimal step size varies linearly with the attack budget. Our second contribution is to show that using a smaller (sub-optimal) attack budget at training time leads to more robust performance at test time. Based on these findings, we propose three defenses against attacks with variable attack budgets. The first, Adaptive AT, is a technique in which the attack budget is drawn from a distribution that is adapted as training iterations proceed. The second, Curriculum AT, is a technique in which the attack budget is increased as training iterations proceed. The third, Generative AT, further couples AT with a denoising generative adversarial network to boost robust performance. Experiments on the UCF101 dataset demonstrate that the proposed methods improve adversarial robustness against multiple attack types.
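To make the two scheduling ideas concrete, here is a small sketch of a curriculum that increases the attack budget over training, paired with a step size tied linearly to that budget. The linear ramp, the pure proportionality (no offset), the `ratio` value of 0.25, and the function names are all assumptions chosen for illustration; the abstract states only that the budget is increased and that the optimal step size varies linearly with it.

```python
def curriculum_budget(iteration, total_iterations, eps_start, eps_end):
    """Attack budget for a given training iteration.

    Assumption: the budget ramps linearly from eps_start at iteration 0
    to eps_end at the last iteration.
    """
    if total_iterations < 2:
        return eps_end
    progress = iteration / (total_iterations - 1)
    progress = min(max(progress, 0.0), 1.0)  # clamp out-of-range iterations
    return eps_start + progress * (eps_end - eps_start)


def step_size_for(budget, ratio=0.25):
    """Attack step size as a linear function of the budget.

    Assumption: simple proportionality with a hypothetical ratio.
    """
    return ratio * budget
```

A training loop would call `curriculum_budget` once per iteration and pass both the budget and `step_size_for(budget)` to whatever attack generates the adversarial examples.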
The lack of large-scale labelled datasets in word-level sign language recognition (WSLR) poses a challenge to detecting sign language from videos. Most WSLR approaches operate on datasets that do not model real-world settings very well, as they do not have a high degree of variability in terms of signers, background, lighting and inter-signer variation. We chose the MS-ASL dataset to overcome these limitations, as it models open-world settings very well. This paper benchmarks successful action recognition architectures on the MS-ASL dataset using transfer learning. We achieve a new state-of-the-art accuracy (92.35%), an improvement of 7.03% over the previous state of the art introduced by the MS-ASL paper. We analyze how action-recognition architectures fare in the task of WSLR, and we propose SlowFast 8x8 ResNet 101 as a robust and suitable architecture for this task.
In real-world applications for video editing, humans are arguably the most important objects. When editing videos of humans, the efficient tracking of fine-grained masks and body joints is a fundamental requirement. In this paper, we propose a simple and efficient system for jointly tracking pose and segmenting high-quality masks for all humans in a video. We design a pipeline that globally tracks pose and locally segments fine-grained masks. Specifically, CenterTrack is first employed to track human poses by viewing the whole scene, and then the proposed local segmentation network leverages the pose information as a powerful query to carry out high-quality segmentation. Furthermore, we adopt a highly lightweight MLP-Mixer layer within the segmentation network that can efficiently propagate the query pose throughout the region of interest with minimal overhead. For evaluation, we collect a new benchmark called KineMask, which includes various appearances and actions. The experimental results demonstrate that our method has superior fine-grained segmentation performance. Moreover, it runs at 33 fps, achieving a strong balance of speed and accuracy compared to prevailing online Video Instance Segmentation methods.
Gait recognition is a promising biometric with unique properties for identifying individuals from a long distance by their walking patterns. In recent years, most gait recognition methods have used the person's silhouette to extract gait features. However, silhouette images can lose fine-grained spatial information, suffer from (self) occlusion, and be challenging to obtain in real-world scenarios. Furthermore, these silhouettes also contain other visual cues that are not actual gait features and that can be used for identification, but also to fool the system. Model-based methods do not suffer from these problems and are able to represent the temporal motion of body joints, which are actual gait features. Advances in human pose estimation have started a new era for model-based gait recognition with skeleton-based gait recognition. In this work, we propose an approach based on Graph Convolutional Networks (GCNs) that combines higher-order inputs and residual networks into an efficient architecture for gait recognition. Extensive experiments on two popular gait datasets, CASIA-B and OUMVLP-Pose, show a massive improvement (3x) over the state of the art (SotA) on the largest gait dataset, OUMVLP-Pose, and strong temporal modeling capabilities. Finally, we visualize our method to better understand skeleton-based gait recognition and to show that we model real gait features.
Unsupervised domain adaptation approaches have recently succeeded in various medical image segmentation tasks. The reported works often tackle the domain shift problem by aligning the domain-invariant features and minimizing the domain-specific discrepancies. That strategy works well when the differences within a specific domain and between different domains are slight. However, the generalization ability of these models on diverse imaging modalities remains a significant challenge. This paper introduces UDA-VAE++, an unsupervised domain adaptation framework for cardiac segmentation with a compact loss function lower bound. To estimate this new lower bound, we develop a novel Structure Mutual Information Estimation (SMIE) block with a global estimator, a local estimator, and a prior information matching estimator to maximize the mutual information between the reconstruction and segmentation tasks. Specifically, we design a novel sequential reparameterization scheme that enables information flow and variance correction from the low-resolution latent space to the high-resolution latent space. Comprehensive experiments on benchmark cardiac segmentation datasets demonstrate that our model outperforms the previous state of the art both qualitatively and quantitatively.