ISBN: 9798350365474 (digital)
ISBN: 9798350365481 (print)
We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM’s lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. Training begins with knowledge distillation from the SAM-ViT-H image encoder to EfficientViT, followed by end-to-end training on the SA-1B dataset. Benefiting from EfficientViT’s efficiency and capacity, EfficientViT-SAM delivers a 48.9× measured TensorRT speedup on an A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://***/mit-han-lab/efficientvit.
ISBN: 9781665448994 (print)
In continual learning, a system learns from non-stationary data streams or batches without catastrophic forgetting. While this problem has been heavily studied in supervised image classification and reinforcement learning, continual learning in neural networks designed for abstract reasoning has not yet been studied. Here, we study continual learning of analogical reasoning. Analogical reasoning tests such as Raven's Progressive Matrices (RPMs) are commonly used to measure non-verbal abstract reasoning in humans, and offline neural networks for the RPM problem have recently been proposed. In this paper, we establish experimental baselines, protocols, and forward and backward transfer metrics to evaluate continual learners on RPMs. We employ experience replay to mitigate catastrophic forgetting. Prior work using replay for image classification tasks has found that selectively choosing the samples to replay offers little, if any, benefit over random selection. In contrast, we find that selective replay can significantly outperform random selection for the RPM task.
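The contrast between random and selective replay can be illustrated with a minimal sketch. It assumes a fixed-capacity memory filled by reservoir sampling and a caller-supplied `score` function (for example, a per-sample loss); the class name `ReplayBuffer`, both sampling methods, and the scoring criterion are hypothetical and are not taken from the paper.

```python
import random


class ReplayBuffer:
    """Fixed-capacity memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.samples = []
        self.seen = 0  # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Keep every sample seen so far with equal probability."""
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.samples[j] = sample

    def sample_random(self, k):
        """Random replay: draw up to k stored samples uniformly."""
        return self.rng.sample(self.samples, min(k, len(self.samples)))

    def sample_selective(self, k, score):
        """Selective replay: the k stored samples with the highest score."""
        return sorted(self.samples, key=score, reverse=True)[:k]
```

During training on a new task, a batch drawn from either method would be mixed with the current data; which selection criterion works best is an empirical question.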
ISBN: 9781665448994 (print)
In recent years, significant progress has been made in human face synthesis. It is now possible, and easy for anyone, to generate credible high-resolution images of non-existent people. This calls for effective detection methods. In this paper, three state-of-the-art deep learning-based methods are evaluated with respect to their robustness and generalizability, two factors that must be taken into consideration for methods intended to be deployed in the wild. The robustness experiments show that it is possible to achieve near-perfect performance when discriminating between real and synthetic facial images that have been heavily post-processed with various perturbation techniques, especially when similar perturbations are incorporated during training of the detection models. The generalization experiments show that already trained detection models can achieve high performance on images from sources not known during training, provided that the models are fine-tuned on such images. One model achieved an average accuracy of 96.8% after being fine-tuned on 3 training images from each unknown source considered (one real and one synthetic source). However, additional images were required when fine-tuning using a different approach aimed at preventing catastrophic forgetting. Furthermore, in general, no method generalized well without fine-tuning. Hence, the limited generalization capability remains a shortcoming that must be overcome before the detection methods can be utilized in the wild.
ISBN: 9781665448994 (print)
The optimization of Binary Neural Networks (BNNs) relies on approximating the real-valued weights with their binarized representations. Current weight-update techniques use the same approaches as traditional Neural Networks (NNs), with the extra requirement of approximating the derivative of the sign function for back-propagation, since that derivative is the Dirac delta function; thus, efforts have focused on adapting full-precision techniques to work on BNNs. In the literature, only one previous effort (Bop) has tackled the problem of training BNNs directly with bit-flips, using the first raw moment estimate of the gradients and comparing it against a threshold to decide when to flip a weight. In this paper, we take an approach parallel to Adam, which also uses the second raw moment estimate to normalize the first raw moment before the comparison with the threshold; we call this method Bop2ndOrder. We present two versions of the proposed optimizer: a biased one and a bias-corrected one, each with its own applications. We also present a complete ablation study of the hyperparameter space, as well as the effect of using schedulers on each hyperparameter. For these studies, we tested the optimizer on CIFAR10 using the BinaryNet architecture. We also tested it on ImageNet 2012 with the XnorNet and BiRealNet architectures for accuracy. On both datasets, our approach converged faster, was robust to changes in the hyperparameters, and achieved better accuracy.
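The flip rule described above can be sketched for a single binary weight. This is an illustrative reading of the abstract, not the authors' implementation: the function name `bop2nd_step`, the parameter names `gamma`, `sigma`, `tau` and `eps`, their default values, and the exact flip condition (flip when the normalized moment exceeds the threshold and has the same sign as the weight) are all assumptions.

```python
import math


def bop2nd_step(w, g, m, v, gamma=1e-3, sigma=1e-3, tau=1e-6, eps=1e-10, t=None):
    """One hypothetical Bop2ndOrder-style update for a weight w in {-1, +1}.

    g is the current gradient; m and v are running first and second raw
    moment estimates. Passing a step count t >= 1 applies an Adam-style
    bias correction; t=None gives the biased variant.
    """
    m = (1 - gamma) * m + gamma * g          # first raw moment
    v = (1 - sigma) * v + sigma * g * g      # second raw moment
    m_hat, v_hat = m, v
    if t is not None:
        m_hat = m / (1 - (1 - gamma) ** t)
        v_hat = v / (1 - (1 - sigma) ** t)
    signal = m_hat / (math.sqrt(v_hat) + eps)  # normalized flip signal
    # Assumed rule: flip when the signal is strong and points the same way
    # as the weight, so that the flip moves against the gradient.
    if abs(signal) > tau and math.copysign(1.0, signal) == w:
        w = -w
    return w, m, v
```

Dropping the division by `math.sqrt(v_hat) + eps` reduces the sketch to a first-moment-only rule in the spirit of Bop.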
ISBN: 9798350365474 (digital)
ISBN: 9798350365481 (print)
Compound Expression Recognition (CER) plays a crucial role in interpersonal interactions. Because the complexity of human emotional expression gives rise to compound expressions, recognition must consider both local and global facial expressions comprehensively. In this paper, we address this issue with a solution for compound expression recognition based on ensemble learning, treating the task as classification. We trained three expression classification models based on convolutional networks (ResNet50), vision Transformers, and multi-scale local attention networks, respectively. Then, using late fusion, we integrated the outputs of the three models to predict the final result, leveraging the strengths of the different models. Our method achieves high accuracy on RAF-DB and, in the sixth Affective Behavior Analysis in-the-wild (ABAW) Challenge, achieves an F1 score of 0.224 on the test set of C-EXPR-DB.
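Late fusion of the three classifiers can be sketched as a weighted average of their per-class probabilities followed by an argmax. The abstract does not state the fusion rule, so the averaging scheme, the function name `late_fusion` and the optional `weights` argument are assumptions made for illustration.

```python
def late_fusion(prob_lists, weights=None):
    """Fuse per-model class probabilities and return (label, fused_probs).

    prob_lists holds one probability vector per model, all of equal length.
    With weights=None every model contributes equally.
    """
    n_models = len(prob_lists)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(prob_lists[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists))
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Example: three models scoring three classes for one face image.
label, fused = late_fusion([
    [0.6, 0.4, 0.0],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
])
```

Unequal weights would let a stronger model dominate the vote; how to choose them is left open here.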
ISBN: 9781665448994 (print)
Tracking players in sports videos is commonly done in a tracking-by-detection framework: players are first detected in each frame and then associated over time. While tracking players is sufficient for game analysis in some sports, sports like hockey, tennis and polo may require additional detections that include the object the player is holding (e.g. racket, stick). The baseline solution for this problem involves detecting these objects as separate classes and matching them to player detections based on the intersection over union (IoU). This approach, however, leads to poor matching performance in crowded situations, as it does not model the relationship between players and objects. In this paper, we propose a simple yet efficient way to detect and match players and related objects at once without extra cost, by considering an implicit association for prediction of multiple objects through the same proposal box. We evaluate the method on a dataset of broadcast ice hockey videos, and also on a new public dataset we introduce called COCO +Torso. On the ice hockey dataset, the proposed method boosts matching performance from 57.1% to 81.4%, while also improving the mean AP of player+stick detections from 68.4% to 88.3%. On the COCO +Torso dataset, matching improves from 47.9% to 65.2%. The COCO +Torso dataset, code and pre-trained models will be released at https://***/foreverYoungGitHub/detectand-match-related-objects.
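The IoU-based baseline can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the greedy one-to-one assignment in `match_objects_to_players` is a hypothetical choice; the abstract only says that objects are matched to players by IoU.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_objects_to_players(players, objects, min_iou=0.0):
    """Greedily pair each object with at most one player, best IoU first.

    Returns a dict mapping object index to player index.
    """
    pairs = sorted(
        ((iou(p, o), pi, oi)
         for pi, p in enumerate(players)
         for oi, o in enumerate(objects)),
        reverse=True,
    )
    used_players, matches = set(), {}
    for score, pi, oi in pairs:
        if score <= min_iou or pi in used_players or oi in matches:
            continue
        matches[oi] = pi
        used_players.add(pi)
    return matches
```

In a crowded scene a stick can overlap several player boxes with similar IoU, which is where a purely geometric rule like this one becomes ambiguous.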
ISBN: 9781665448994 (print)
We developed and tested the architecture of a bio-inspired Spiking Neural Network for motion estimation. The computation performed by the retina is emulated by the neuromorphic event-based image sensor DAVIS346, which provides the input to our network. We obtained neurons highly tuned to the spatial frequency and orientation of the stimulus through a combination of feed-forward excitatory connections, modeled as an elongated Gaussian kernel, and recurrent inhibitory connections from two clusters of neurons within the same cortical layers. Sums over adjacent nodes weighted by time-variable synapses are used to attain Gabor-like spatio-temporal V1 receptive fields with selectivity to the stimulus' motion. To gain invariance to the stimulus phase, we exploited the two polarities of the events provided by the neuromorphic sensor, which allowed us to build two pairs of quadrature filters from which we obtain Motion Energy detectors as described in [2]. Finally, a decoding stage allows us to compute optic flow from the Motion Detector layers. We tested the proposed approach with both synthetic and natural stimuli.
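The phase invariance obtained from a quadrature pair can be illustrated with a simplified, non-spiking sketch in one spatial dimension: the squared responses of an even (cosine) and an odd (sine) Gabor filter are summed into an energy that barely changes with the stimulus phase. The function names, the 1D setting, and the parameter values are assumptions for illustration and do not reproduce the network described in the abstract.

```python
import math


def gabor_pair_response(signal, omega, sigma):
    """Responses of an even and an odd Gabor filter centred on the signal."""
    half = len(signal) // 2
    even = odd = 0.0
    for i, s in enumerate(signal):
        x = i - half
        envelope = math.exp(-x * x / (2.0 * sigma * sigma))
        even += envelope * math.cos(omega * x) * s
        odd += envelope * math.sin(omega * x) * s
    return even, odd


def quadrature_energy(signal, omega, sigma):
    """Sum of squared quadrature responses (phase-invariant energy)."""
    even, odd = gabor_pair_response(signal, omega, sigma)
    return even * even + odd * odd


# A sinusoidal grating at the filter's preferred frequency, at three phases.
omega, sigma = 2.0 * math.pi / 8.0, 8.0
xs = range(-32, 33)
energies = [
    quadrature_energy([math.cos(omega * x + phase) for x in xs], omega, sigma)
    for phase in (0.0, 1.0, 2.5)
]
```

Each filter on its own responds strongly or weakly depending on the phase; only the summed squares stay approximately constant.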
ISBN: 9781665448994 (print)
Conventionally, gaining explainability in AI models is thought to come at the cost of lower accuracy. We develop a training strategy that not only leads to a more explainable AI system for object classification but also suffers no perceptible accuracy degradation as a consequence. Explanations are defined as regions of visual evidence upon which a deep classification network makes a decision. They are represented in the form of a saliency map conveying how much each pixel contributed to the network's decision. Our training strategy enforces periodic saliency-based feedback to encourage the model to focus on the image regions that directly correspond to the ground-truth object. We quantify explainability using both an automated metric and human judgement. We propose explainability as a means of bridging the visual-semantic gap between different domains, where model explanations are used to disentangle domain-specific information from otherwise relevant features. We demonstrate that this leads to improved generalization to new domains without hindering performance on the original domain.
ISBN: 9781665448994 (print)
With the development of convolutional neural networks (CNNs), the super-resolution results of CNN-based methods have far surpassed those of traditional methods. In particular, CNN-based single image super-resolution methods have achieved excellent results. Video sequences contain more abundant information than single images, but few video super-resolution methods can be applied to mobile devices because of their heavy computational requirements, which limits the application of video super-resolution. In this work, we propose the Efficient Video Super-Resolution Network (EVSRNet), designed with neural architecture search for real-time video super-resolution. Extensive experiments show that our method achieves a good balance between quality and efficiency. Finally, we achieve a competitive result of 7.36, with a PSNR of 27.85 dB and an inference time of 11.3 ms/f on the target Snapdragon 865 SoC, resulting in 2nd place in the Mobile AI (MAI) 2021 real-time video super-resolution challenge. Notably, our method is the fastest and outperforms other competitors by large margins.
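For reference, PSNR is computed from the mean squared error between a reference frame and a restored frame as 10·log10(MAX² / MSE). The sketch below assumes frames flattened to lists of 8-bit pixel values; the function name `psnr` and the `max_value` default are illustrative, and the challenge's own evaluation script may differ in details such as colour space or cropping.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(restored):
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to roughly halving the mean squared error.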
ISBN: 9781665448994 (print)
Complex deep convolutional neural networks such as ResNet require expensive hardware such as powerful GPUs to achieve real-time performance. This problem is critical for applications that run on low-end embedded GPU or CPU systems with limited resources. As a result, model compression for deep neural networks has become an important research topic. Popular compression methods such as weight pruning remove redundant neurons from the CNN without affecting the network's output accuracy. While these pruning methods work well on simple networks such as VGG or AlexNet, they are not suitable for compressing current state-of-the-art networks such as ResNets because of these networks' complex architectures with dimensionality dependencies. Because of this dependency, filter pruning breaks the structure of ResNets, leading to an untrainable network. In this paper, we first apply weight pruning to only a selected subset of layers in the ResNet architecture to avoid breaking the network structure. Second, we introduce a knowledge distillation architecture and a loss function to compress the layers left untouched by pruning. We test our method on both image-based regression and classification networks for head-pose estimation and image classification. Our compression method reduces model size significantly while keeping accuracy very close to that of the baseline model.
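Pruning only a chosen subset of layers can be sketched with magnitude-based weight pruning on a toy model stored as a dict of flat weight lists. The layer names, the magnitude criterion, and the functions `prune_layer` and `prune_selected` are hypothetical; the abstract does not specify how layers or weights are selected.

```python
def prune_layer(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_selected(model, prunable, sparsity):
    """Prune only the layers named in `prunable`; copy the rest unchanged."""
    return {
        name: prune_layer(w, sparsity) if name in prunable else list(w)
        for name, w in model.items()
    }


# Toy example: prune one convolution, leave the shortcut layer untouched.
model = {"conv1": [0.1, -0.5, 0.05, 0.9], "shortcut": [0.01, 0.02]}
pruned = prune_selected(model, prunable={"conv1"}, sparsity=0.5)
```

Layers whose shapes are tied together by residual connections would simply be left out of `prunable` in this scheme.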