ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Deep learning on microcontroller (MCU)-based IoT devices is extremely challenging due to memory constraints. Prior approaches use either internal or external memory exclusively, which limits accuracy or latency, respectively. We find that a hybrid method using both internal and external MCU memory outperforms either approach in accuracy and latency. We develop TinyOps, an inference engine that accelerates inference for models in slow external memory using a partitioning and overlaying scheme driven by the available Direct Memory Access (DMA) peripheral, combining the advantages of external memory (size) and internal memory (speed). Experimental results show that architectures deployed with TinyOps significantly outperform models designed for internal memory, with up to 6% higher accuracy and, importantly, 1.3-2.2x faster inference, setting the state of the art in TinyML ImageNet classification. Our work shows that the TinyOps design space is more efficient than the internal- or external-memory design spaces and should be explored further for TinyML applications.
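To make the overlay idea concrete, here is a minimal Python sketch of how a partition-and-overlay schedule can hide external-memory fetch time behind compute, in the spirit of TinyOps. The buffer sizes, transfer rate, and helper names are illustrative assumptions, not the authors' implementation (which runs on an MCU with a real DMA peripheral):

```python
# Hypothetical sketch of TinyOps-style partitioning and overlaying.
# Layer weights live in large, slow "external" memory; compute reads from a
# small, fast "internal" scratch buffer. While partition i is being computed,
# partition i+1 is prefetched (the role the DMA peripheral plays on an MCU).

EXTERNAL_KB_PER_MS = 8    # assumed external-memory transfer rate
INTERNAL_BUDGET_KB = 64   # assumed internal scratch size (two 32 KB halves)

def partition(weights_kb: list[int], half_kb: int) -> list[list[int]]:
    """Greedily group layer weights into partitions that fit one buffer half."""
    parts, cur, used = [], [], 0
    for w in weights_kb:
        if used + w > half_kb and cur:
            parts.append(cur)
            cur, used = [], 0
        cur.append(w)
        used += w
    if cur:
        parts.append(cur)
    return parts

def overlay_latency_ms(parts, compute_ms_per_kb=0.05):
    """Latency with double buffering: fetching partition i+1 overlaps compute of i."""
    fetch = [sum(p) / EXTERNAL_KB_PER_MS for p in parts]
    compute = [sum(p) * compute_ms_per_kb for p in parts]
    total = fetch[0]  # the first partition must arrive before any compute
    for i in range(len(parts)):
        nxt = fetch[i + 1] if i + 1 < len(parts) else 0.0
        total += max(compute[i], nxt)  # DMA and CPU run concurrently
    return total

layers_kb = [12, 8, 20, 10, 24, 6, 16]  # toy per-layer weight sizes in KB
parts = partition(layers_kb, INTERNAL_BUDGET_KB // 2)
print(f"{len(parts)} partitions, overlaid latency ~{overlay_latency_ms(parts):.2f} ms")
```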
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
To understand the genuine emotions expressed by humans during social interactions, it is necessary to recognize the subtle facial changes (micro-expressions) demonstrated by an individual. Facial micro-expressions are brief, rapid, spontaneous gestures and involuntary facial muscle movements beneath the skin; classifying them is therefore challenging. This paper presents a novel end-to-end three-stream graph attention network model that captures the subtle changes on the face and recognizes micro-expressions (MEs) by exploiting the relationships among optical flow magnitude, optical flow direction, and node location features. A facial graph representation extracts spatial and temporal information from the three frames. Optical flow features computed over dynamically varying patch sizes capture local texture information around each landmark point. The network uses only the landmark location features and the optical flow information around these points, and yields strong results for ME classification. A comprehensive evaluation on the SAMM and CASME II datasets demonstrates the high efficacy, efficiency, and generalizability of the proposed approach, which achieves better results than state-of-the-art methods.
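For illustration, a minimal sketch of how the three per-node feature streams (flow magnitude, flow direction, landmark location) might be assembled for a facial landmark graph. The patch size, input shapes, and function names are assumptions for this example, not the paper's code:

```python
# Build three-stream node features from an optical flow field and landmarks.
import numpy as np

def node_features(flow: np.ndarray, landmarks: np.ndarray, patch: int = 5):
    """flow: (H, W, 2) optical flow; landmarks: (N, 2) pixel coords (x, y).
    Returns per-node (magnitude, direction, location) feature streams."""
    h, w, _ = flow.shape
    mags, dirs = [], []
    r = patch // 2
    for x, y in landmarks.astype(int):
        # Average the flow over a small patch around the landmark.
        patch_flow = flow[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        dx, dy = patch_flow[..., 0].mean(), patch_flow[..., 1].mean()
        mags.append(np.hypot(dx, dy))    # optical-flow magnitude stream
        dirs.append(np.arctan2(dy, dx))  # optical-flow direction stream
    locs = landmarks / np.array([w, h])   # normalized node-location stream
    return np.array(mags), np.array(dirs), locs

flow = np.random.randn(128, 128, 2).astype(np.float32)  # toy flow field
landmarks = np.random.randint(8, 120, size=(68, 2))     # 68 facial landmarks
m, d, l = node_features(flow, landmarks)
print(m.shape, d.shape, l.shape)  # (68,) (68,) (68, 2)
```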
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
3D human motion prediction requires making sense of the complex spatio-temporal dynamics that underpin human motion in order to make highly accurate predictions. Part of this complexity stems from the trade-off between long-term (>400 ms) and short-term (<400 ms) predictions, which require different levels of granularity to observe patterns. Several works have explored ways to improve long-term prediction by utilizing longer motion histories, but this typically comes at the cost of very short-term (<200 ms) performance. Inspired by high-resolution network architectures, we propose a novel high-resolution spatio-temporal attention network (HR-STAN) that leverages parallel feature branches and dilated convolutions to observe human motion at different scales. Furthermore, we augment this architecture with split spatial and temporal attention mechanisms to efficiently capture spatio-temporal dependencies within a given motion. We evaluate the ability of HR-STAN to incorporate long-term motion histories while producing short-term predictions and show that it improves over several state-of-the-art methods on both the AMASS and Human3.6M benchmarks.
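A minimal sketch of "split" spatial and temporal attention over a pose-sequence tensor, in the spirit of the mechanism described above. The shapes and module layout are assumptions for illustration, not the HR-STAN implementation:

```python
# Spatial attention mixes joints within a frame; temporal attention mixes
# frames per joint. Splitting the two is cheaper than full joint-time attention.
import torch
import torch.nn as nn

class SplitSpatioTemporalAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames T, joints J, dim) pose features
        b, t, j, d = x.shape
        # Spatial attention: joints attend to each other within a frame.
        s = x.reshape(b * t, j, d)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(b, t, j, d)
        # Temporal attention: each joint attends across frames.
        tt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        tt, _ = self.temporal(tt, tt, tt)
        x = x + tt.reshape(b, j, t, d).permute(0, 2, 1, 3)
        return x

x = torch.randn(2, 10, 22, 64)  # 2 sequences, 10 frames, 22 joints
print(SplitSpatioTemporalAttention(64)(x).shape)  # torch.Size([2, 10, 22, 64])
```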
ISBN (print): 9798350365474
This work reviews the results of the NTIRE 2024 Challenge on Shadow Removal. Building on last year's edition, the current challenge was organized in two tracks: one focused on increased-fidelity reconstruction and a separate ranking for high-performing perceptual-quality solutions. Track 1 (fidelity) had 214 registered participants, with 17 teams submitting in the final phase, while Track 2 (perceptual) registered 185 participants, resulting in 18 final-phase submissions. Both tracks were based on data from the WSRD dataset, simulating interactions between self-shadows and cast shadows with a large variety of represented objects, textures, and materials. Improved image alignment enabled increased-fidelity reconstruction, with restored frames from top-performing solutions mostly indistinguishable from the reference images.
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Deep convolutional neural networks now perform increasingly well in various fields, while their parameter counts grow massive as advanced networks become deeper. Among model compression methods, quantization is one of the most potent: it compresses neural networks by compacting model weights and activations to lower bit-widths. Data-free quantization has also been proposed; it is specialized for privacy- and security-sensitive scenarios and enables quantization without access to real data. In this work, we find that the tuning robustness of existing data-free quantization is flawed. We conduct an empirical study and identify hyperparameter settings under which models converge stably during the data-free quantization process. Our study evaluates the overall tuning robustness of current data-free quantization systems and shows that existing methods are significantly affected by hyperparameter fluctuations during tuning. We hope that data-free quantization methods with tuning robustness will appear in the future.
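For context, a sketch of the basic operation that data-free quantization builds on: uniform affine quantization of a weight tensor to a lower bit-width. This is illustrative only; the paper studies the tuning robustness of full data-free pipelines, not this single step:

```python
# Quantize a float weight tensor to b bits and map it back, so the rounding
# error introduced by the lower bit-width can be inspected directly.
import numpy as np

def quantize_dequantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniform affine quantization of `w` to `bits`, then dequantization."""
    qmin, qmax = 0, 2**bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = round(-w.min() / scale)
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized approximation of w

w = np.random.randn(256, 256).astype(np.float32)
w_hat = quantize_dequantize(w, bits=4)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```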
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
With the demand for analyzing and predicting traffic flow in smart-city applications, Multi-Target Multi-Camera vehicle Tracking (MTMCT) at the city scale has become a fundamental problem. MTMCT is challenging due to view variations, frequent occlusions, and visually similar vehicle models within the same camera. This work proposes an MTMCT framework based on occlusion-aware and inter-vehicle information that can effectively match vehicle tracklets. The occlusion-aware module segments the tracklets of occluded and occluding vehicle pairs and recalculates the similarity of the completed tracklets, which handles occlusions and suppresses false detections. An inter-vehicle information module improves matching accuracy by enhancing the ability to distinguish similar vehicles under the same camera at different times. The full framework consists of four modules: (1) vehicle detection and feature extraction with re-identification models, (2) single-camera tracking (SCT) with the occlusion-aware module to produce initial tracklets, (3) tracklet similarity computation via inter-vehicle association, and (4) clustering across adjacent cameras for multi-camera tracklet matching. The proposed method obtains an IDF1 score of 0.8285 on the Track 1 multi-camera vehicle tracking task of the 2022 AI City Challenge.
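An illustrative sketch of the baseline step such frameworks build on: matching tracklets across cameras by re-identification appearance similarity. The greedy matching, threshold, and names here are hypothetical; the paper's framework adds the occlusion-aware and inter-vehicle modules on top of features like these:

```python
# Greedy cross-camera tracklet matching on cosine similarity of averaged
# per-frame re-ID features.
import numpy as np

def tracklet_embedding(frame_feats: np.ndarray) -> np.ndarray:
    """Average per-frame re-ID features into one L2-normalized vector."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def match_tracklets(cam_a: list, cam_b: list, thr: float = 0.7):
    """Greedily pair tracklets whose cosine similarity exceeds `thr`."""
    ea = np.stack([tracklet_embedding(t) for t in cam_a])
    eb = np.stack([tracklet_embedding(t) for t in cam_b])
    sim = ea @ eb.T
    pairs = []
    while sim.max() > thr:
        i, j = np.unravel_index(sim.argmax(), sim.shape)
        pairs.append((int(i), int(j)))
        sim[i, :], sim[:, j] = -1, -1  # each tracklet matched at most once
    return pairs

cam_a = [np.random.randn(20, 128) for _ in range(5)]  # toy re-ID features
cam_b = [np.random.randn(15, 128) for _ in range(4)]
print(match_tracklets(cam_a, cam_b))
```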
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Class-imbalanced datasets can severely degrade the performance of semi-supervised learning (SSL). This is due to confirmation bias, especially when the pseudo labels are highly biased towards the majority classes. Traditional resampling or reweighting techniques may not be directly applicable when the unlabeled data distribution is unknown. Inspired by the threshold-moving method, which performs well in supervised binary classification, we provide a simple yet effective scheme to address the multiclass imbalance issue in SSL. The scheme, named SaR, is a Self-adaptive Refinement of soft labels applied before pseudo labels are generated. Pseudo labels generated after SaR are less biased, yielding higher-quality data for training the classifier. We show that SaR consistently improves recent consistency-based SSL algorithms on various image classification problems across different imbalance ratios. We also show that SaR is robust to situations where the unlabeled data are distributed differently from the labeled data; hence, SaR does not rely on the assumption that unlabeled data share the same distribution as the labeled data.
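A hypothetical sketch of the threshold-moving idea applied to soft labels before pseudo-labeling. SaR's actual refinement rule is defined in the paper; the inverse-frequency scaling below is only an illustration of the general mechanism:

```python
# Refine soft labels by down-weighting majority classes, then apply the
# usual confidence-thresholded pseudo-labeling step from consistency SSL.
import numpy as np

def refine_soft_labels(probs: np.ndarray, class_freq: np.ndarray, tau=1.0):
    """probs: (N, C) model soft labels; class_freq: (C,) estimated class
    frequencies. Rescale against frequency and renormalize per sample."""
    scaled = probs / (class_freq ** tau)
    return scaled / scaled.sum(axis=1, keepdims=True)

def pseudo_labels(probs: np.ndarray, conf_thr: float = 0.95):
    """Keep only predictions whose top probability clears the threshold."""
    conf, labels = probs.max(axis=1), probs.argmax(axis=1)
    mask = conf >= conf_thr
    return labels[mask], mask

probs = np.random.dirichlet(np.ones(10) * 0.3, size=1000)  # toy soft labels
freq = np.linspace(0.5, 0.05, 10)                          # imbalanced estimate
freq /= freq.sum()
labels, mask = pseudo_labels(refine_soft_labels(probs, freq))
print(f"kept {mask.sum()} / {len(probs)} pseudo labels")
```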
ISBN (print): 9798350301298
Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by attributing per-pixel relevance to the decision. Yet interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between concepts and model decisions. This paper proposes the Bottleneck Concept Learner (BotCL), which represents an image solely by the presence or absence of concepts learned through training on the target task, without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that the learned concepts are human-understandable. Using several image classification tasks as our testbed, we demonstrate BotCL's potential to rebuild neural networks for better interpretability.
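A minimal concept-bottleneck sketch in the spirit of BotCL, where the classifier sees only concept presence scores. The prototype-similarity formulation, sizes, and names are assumptions for illustration, not the paper's implementation:

```python
# An image is represented only by the presence/absence of learned concepts;
# classification is a linear read-out of those concept activations, so the
# activations themselves double as the explanation.
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    def __init__(self, feat_dim=512, n_concepts=20, n_classes=10):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(n_concepts, feat_dim))
        self.classifier = nn.Linear(n_concepts, n_classes, bias=False)

    def forward(self, feats: torch.Tensor):
        # Similarity between image features and each learned concept
        # prototype, squashed to a presence score in (0, 1).
        presence = torch.sigmoid(feats @ self.concepts.t())
        return self.classifier(presence), presence  # logits + explanation

feats = torch.randn(8, 512)            # toy backbone features for 8 images
logits, presence = ConceptBottleneck()(feats)
print(logits.shape, presence.shape)    # torch.Size([8, 10]) torch.Size([8, 20])
```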
ISBN (digital): 9781665487399
ISBN (print): 9781665487399
Vision transformers (ViTs) have recently attracted considerable attention, but their huge computational cost remains an issue for practical deployment. Previous ViT pruning methods tend to prune the model along a single dimension only, which may cause excessive reduction and lead to sub-optimal model quality. In contrast, we advocate a multi-dimensional ViT compression paradigm and propose to harness redundancy reduction across the attention-head, neuron, and sequence dimensions jointly. First, we propose a statistical-dependence-based pruning criterion that generalizes to different dimensions for identifying deleterious components. Moreover, we cast multi-dimensional ViT compression as an optimization problem whose objective is to learn an optimal pruning policy across the three dimensions that maximizes the compressed model's accuracy under a computational budget. The problem is solved with an adapted Gaussian process search using expected improvement. Experimental results show that our method effectively reduces the computational cost of various ViT models. For example, it reduces FLOPs by 40% without top-1 accuracy loss for DeiT and T2T-ViT models on the ImageNet dataset, outperforming previous state-of-the-art ViT pruning methods.
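To sketch the search component, here is a toy Gaussian-process loop with expected improvement over a three-dimensional pruning policy (head, neuron, sequence ratios). The accuracy proxy, budget constraint, and ranges are fabricated stand-ins; the paper's search space and evaluation are more elaborate:

```python
# Bayesian-optimization-style search for a pruning policy: fit a GP to
# observed (policy, accuracy) pairs, then pick the candidate maximizing EI.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def accuracy_proxy(policy):  # placeholder for "prune, fine-tune, evaluate"
    head, neuron, seq = policy
    return 1.0 - 0.6 * head**2 - 0.3 * neuron**2 - 0.4 * seq**2

def expected_improvement(gp, X, best):
    mu, sigma = gp.predict(X, return_std=True)
    z = (mu - best) / np.maximum(sigma, 1e-9)
    return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(0, 0.6, size=(5, 3))            # initial pruning ratios
y = np.array([accuracy_proxy(p) for p in X])
gp = GaussianProcessRegressor(normalize_y=True)
for _ in range(20):
    gp.fit(X, y)
    cand = rng.uniform(0, 0.6, size=(256, 3))
    cand = cand[cand.sum(axis=1) >= 0.8]        # toy FLOPs-budget constraint
    x = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
    X, y = np.vstack([X, x]), np.append(y, accuracy_proxy(x))
print("best policy (head, neuron, seq):", X[np.argmax(y)].round(3))
```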
ISBN (print): 9798350301298
Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse, multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests video, multiple characters, and dialog utterances to make joint predictions. Leveraging annotations from the MovieGraphs dataset [72], we aim to predict classic emotions (e.g., happy, angry) and other mental states (e.g., honest, helpful). We conduct experiments on the 10 and 25 most frequently occurring labels, and on a mapping that clusters 181 labels into 26. Ablation studies and comparisons against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that expressive emotions often attend to character tokens, while other mental states rely on video and dialog cues.
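A rough sketch of the multimodal, multi-label setup described above: concatenate video, character, and dialog tokens, encode them jointly with a Transformer, and read independent per-emotion probabilities off a classification token. Token layout and sizes here are assumptions, not the EmoTx architecture:

```python
# Multi-label emotion tagging over concatenated multimodal token streams.
import torch
import torch.nn as nn

class MultimodalEmotionTagger(nn.Module):
    def __init__(self, dim=256, n_emotions=25, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.randn(1, 1, dim))  # scene-level query
        self.head = nn.Linear(dim, n_emotions)

    def forward(self, video, chars, dialog):
        # Prepend a classification token to the concatenated modality tokens
        # and read the multi-label logits off that token after encoding.
        b = video.shape[0]
        tokens = torch.cat([self.cls.expand(b, -1, -1), video, chars, dialog], 1)
        return self.head(self.encoder(tokens)[:, 0])

video = torch.randn(2, 30, 256)   # toy clip features
chars = torch.randn(2, 8, 256)    # toy character-track features
dialog = torch.randn(2, 12, 256)  # toy utterance features
logits = MultimodalEmotionTagger()(video, chars, dialog)
probs = torch.sigmoid(logits)     # independent per-emotion probabilities
print(probs.shape)                # torch.Size([2, 25])
```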