Matrix-vector multiplication (MvM) operations play an important role in applications such as data processing and artificial neural networks. To meet the growing demand for computing power, the photonic MvM processor provides what we believe to be a new computing architecture. In this paper, we propose a reconfigurable parallel MvM (RP-MvM) processor. To further increase the parallel computing dimension, wavelength division multiplexing (WDM) and digital subcarrier multiplexing (DSM) technologies are incorporated together into the photonic MvM for the first time. Compared with the traditional WDM-MvM architecture, the parallelism of the RP-MvM scheme is increased by a factor of N, where N is the number of carriers in the DSM signal. Moreover, the input data channel can be dynamically adjusted without changing the hardware scale, which improves the flexibility of the computing system. The simulation results show that the RP-MvM scheme can perform eight MvMs in parallel, with a computing speed of 128 GOPS. For a random data sequence with 6-bit resolution, the root mean square error (RMSE) of the calculation results is on the order of 10^-3. In addition, for an image edge extraction task based on the Roberts operator, the scheme can process four grayscale images in parallel. The proposed scheme therefore provides an alternative approach for realizing a highly parallel and reconfigurable large-scale photonic MvM architecture.
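The abstract does not say how the edge extraction task is mapped onto MvM operations. As a purely digital illustration (not the photonic implementation), the Roberts operator can be written as one 2x4 matrix applied to each flattened 2x2 image patch. The function names, the gradient-magnitude output, and the test image below are assumptions made for this sketch.

```python
import math

# Roberts cross kernels, each flattened row-major into one row of a 2x4 matrix.
ROBERTS = [
    [1, 0, 0, -1],   # Gx: [[1, 0], [0, -1]]
    [0, 1, -1, 0],   # Gy: [[0, 1], [-1, 0]]
]


def matvec(matrix, vector):
    """Plain matrix-vector product: one dot product per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def roberts_edges(image):
    """Apply the Roberts operator as one MvM per 2x2 patch (hypothetical helper)."""
    rows, cols = len(image), len(image[0])
    edges = []
    for r in range(rows - 1):
        out_row = []
        for c in range(cols - 1):
            # Flatten the 2x2 patch row-major so it matches the kernel layout.
            patch = [image[r][c], image[r][c + 1],
                     image[r + 1][c], image[r + 1][c + 1]]
            gx, gy = matvec(ROBERTS, patch)
            out_row.append(math.hypot(gx, gy))  # gradient magnitude
        edges.append(out_row)
    return edges


# Toy grayscale image with a vertical step edge between columns 1 and 2.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edges = roberts_edges(image)
```

In this sketch only the patches that straddle the step edge give a non-zero response. Processing several images in parallel would amount to running the same matrix over several independent patch streams.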
Quantum machine learning (QML) promises transformative potential in computer vision by utilizing quantum computing to facilitate faster high-dimensional data processing. In this paper, we will go through some of t...
This paper presents a method for synthesizing 2D and 3D sensor data for various machine vision tasks. Depending on the task, different processing steps can be applied to a 3D model of an object. For object detection, ...
Visual Question Answering (VQA) is pivotal in various industries, including medicine. Current approaches typically rely on identifying patterns between image regions and questions, using attention-learning techniques to highlight essential information and suppress noise. However, existing VQA systems often overlook crucial foreground and background-related features in images, limiting their ability to tackle complex questions effectively. Most VQA models employ either spatial or channel attention mechanisms. Spatial attention localizes the region of interest (ROI) but may overlook global semantic relationships between salient objects. Conversely, channel attention enhances feature representation but disregards spatial dynamics within images. To address these limitations, we propose "ENVQA" (Enriching V in VQA), a novel VQA model that integrates enriched visual features by leveraging both spatial and object-level features, alongside spatial and channel attention networks. Our model aims to enhance understanding by capturing both local and global contexts within images. Experimental evaluations on benchmark datasets such as VQA 2.0, TDIUC, and GQA demonstrate that ENVQA outperforms state-of-the-art (SOTA) models that use attention mechanisms.
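The difference between the two attention types can be made concrete with a toy example. The sketch below is not the model from the abstract: it uses parameter-free sigmoid gates in place of learned attention networks, and all function names and the 2-channel feature map are assumptions. Channel attention produces one gate per channel, while spatial attention produces one gate per location.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Scale each channel by a gate computed from its global average.

    feat is a list of channels, each a 2D list (C x H x W). No learned weights.
    """
    out = []
    for channel in feat:
        values = [v for row in channel for v in row]
        gate = sigmoid(sum(values) / len(values))
        out.append([[v * gate for v in row] for row in channel])
    return out


def spatial_attention(feat):
    """Scale each location by a gate computed from the mean across channels."""
    n_ch = len(feat)
    height, width = len(feat[0]), len(feat[0][0])
    gates = [[sigmoid(sum(feat[k][i][j] for k in range(n_ch)) / n_ch)
              for j in range(width)]
             for i in range(height)]
    return [[[feat[k][i][j] * gates[i][j] for j in range(width)]
             for i in range(height)]
            for k in range(n_ch)]


# Hypothetical 2-channel, 2x2 feature map.
features = [
    [[2.0, 0.0], [0.0, 2.0]],    # channel 0, mean 1.0
    [[-2.0, 0.0], [0.0, -2.0]],  # channel 1, mean -1.0
]
channel_gated = channel_attention(features)
attended = spatial_attention(channel_gated)
```

Applying both gates in sequence, as in the last line, is one simple way to combine channel-level and location-level weighting. How the abstract's model combines them is not specified there.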
Medical image analysis based on deep learning has important research significance for accurately locating and identifying lesion targets. This article aims to address the issues of improving the detection efficiency a...
In the last few years, the abundance of available plankton images has significantly increased due to advancements in acquisition system technology. Consequently, a growing interest in automatic plankton image classif...
This paper investigates the optimization and deployment of the YOLOv7 deep learning model on the NVIDIA Jetson Nano, an AI-focused edge computing platform, for object detection in various computer vision applications. The work...
Event cameras record sparse illumination changes with high temporal resolution and high dynamic range. Thanks to their sparse recording and low power consumption, they are increasingly used in applications such as AR/VR and autonomous driving. Current top-performing methods often ignore specific event-data properties, leading to the development of generic but computationally expensive algorithms, while event-aware methods do not perform as well. We propose Event Transformer+, which improves on our seminal work EvT with a refined patch-based event representation and a more robust backbone to achieve more accurate results, while still benefiting from event-data sparsity to increase its efficiency. Additionally, we show how our system can work with different data modalities and propose specific output heads for event-stream classification (i.e., action recognition) and per-pixel predictions (dense depth estimation). Evaluation results show better performance than the state of the art while requiring minimal computation resources, both on GPU and CPU.
ISBN: (print) 9783031434143; 9783031434150
The vision transformer (ViT) is a model that breaks down each image into a sequence of tokens with a fixed length and processes them similarly to words in natural language processing. Although increasing the number of tokens typically results in better performance, it also leads to a considerable increase in computational cost. Motivated by the saying "A picture is worth a thousand words," we propose an innovative approach to accelerate the ViT model by shortening long images. Specifically, we introduce a method for adaptively assigning a token length to each image at test time to accelerate inference. First, we train a Resizable-ViT (ReViT) model capable of processing input with diverse token lengths. Next, we extract token-length labels from ReViT that indicate the minimum number of tokens required to achieve accurate predictions. We then use these labels to train a lightweight Token-Length Assigner (TLA) that allocates the optimal token length for each image during inference. The TLA enables ReViT to process images with the minimum sufficient number of tokens, reducing the token count in the ViT model and improving inference speed. Our approach is general and compatible with modern vision transformer architectures, significantly reducing computational costs. We verified the effectiveness of our method on multiple representative ViT models for image classification and action recognition.
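The label-extraction step described above can be sketched as follows, assuming the resizable model has already been evaluated on each image at every candidate token length. The candidate lengths (49, 100, 196), the sample predictions, and the fallback to the largest length when no prediction is correct are all assumptions for illustration and are not taken from the abstract.

```python
def token_length_label(predictions, true_label, lengths):
    """Return the smallest token length whose prediction is correct.

    predictions maps token length -> predicted class for one image.
    If no length gives a correct prediction, fall back to the largest
    length (an assumption made for this sketch).
    """
    for length in sorted(lengths):
        if predictions[length] == true_label:
            return length
    return max(lengths)


# Hypothetical candidate token lengths (7x7, 10x10, and 14x14 patch grids).
TOKEN_LENGTHS = [49, 100, 196]

# Hypothetical per-image predictions at each token length.
samples = [
    {"truth": "cat", "preds": {49: "cat", 100: "cat", 196: "cat"}},
    {"truth": "dog", "preds": {49: "cat", 100: "dog", 196: "dog"}},
    {"truth": "owl", "preds": {49: "cat", 100: "dog", 196: "fox"}},
]

labels = [token_length_label(s["preds"], s["truth"], TOKEN_LENGTHS)
          for s in samples]
```

Labels produced this way would then serve as classification targets for a lightweight assigner that predicts a token length from the image alone.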
Recently, achieving accurate remote sensing image (RSI) classification has been a primary goal in deep learning, given its extensive applications, including urban planning and disaster management. The performance of existing convolutional neural network (CNN)-based strategies is primarily influenced by their parameter settings, necessitating automated hyperparameter tuning through metaheuristic methods. The proposed BWODLF-RSI technique integrates black widow optimisation with a deep learning feature fusion model for enhanced RSI analysis. The preprocessing step improves RSI quality through noise reduction with a Gaussian filter (GF), contrast enhancement with contrast limited adaptive histogram equalisation (CLAHE), and data augmentation to prevent overfitting. Inception v3 and DenseNet201 are then employed to extract and fuse potent features. A critical aspect of this strategy is the use of black widow optimisation to fine-tune the kernel extreme learning machine (KELM) model, attaining a notable RSI classification accuracy of 94.05%. When tested on the UCM and AID datasets, the BWODLF-RSI approach demonstrated superior feature selection and RSI analysis performance.
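For context, the standard KELM formulation has a closed-form output, which shows what a metaheuristic such as black widow optimisation could tune. The abstract does not state which parameters are optimised. Treating the regularisation constant C and a kernel parameter (here the RBF width gamma) as the tuned quantities is an assumption of this sketch, as is the choice of an RBF kernel.

```latex
% Standard KELM output for a test sample x, given N training samples x_1..x_N
% with target matrix T. C is the regularisation constant, K is the kernel.
f(\mathbf{x}) =
\begin{bmatrix} K(\mathbf{x}, \mathbf{x}_1) & \cdots & K(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}
\left( \frac{\mathbf{I}}{C} + \boldsymbol{\Omega} \right)^{-1} \mathbf{T},
\qquad
\Omega_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)

% Assumed kernel for illustration (RBF with width parameter \gamma):
K(\mathbf{u}, \mathbf{v}) = \exp\!\left( -\gamma \, \lVert \mathbf{u} - \mathbf{v} \rVert^2 \right)
```

Under these assumptions, each candidate solution in the optimiser would encode a pair (C, gamma), with validation accuracy as the fitness value.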