ISBN (Print): 9781665440660
The Fast Fourier Transform (FFT) is a fundamental algorithm in signal processing; significant efforts have been made to improve its performance using software optimizations and specialized hardware accelerators. Computational imaging modalities, such as MRI, often rely on the Non-uniform Fast Fourier Transform (NuFFT), a variant of the FFT for processing data acquired from non-uniform sampling patterns. The most time-consuming step of the NuFFT algorithm is "gridding," wherein non-uniform samples are interpolated to allow a uniform FFT to be computed over the data. Each non-uniform sample affects a window of non-contiguous memory locations, resulting in poor cache and memory bandwidth utilization. As a result, gridding can account for more than 99.6% of the NuFFT computation time, while the FFT requires less than 0.4%. We present Slice-and-Dice, a novel approach to the NuFFT's gridding step that eliminates the presorting operations required by prior methods and maps more efficiently to hardware. Our GPU implementation achieves gridding speedups of over 250x and 16x vs. prior state-of-the-art CPU and GPU implementations, respectively. We achieve further speedup and energy efficiency gains by implementing Slice-and-Dice in hardware with JIGSAW, a streaming hardware accelerator for non-uniform data gridding. JIGSAW uses stall-free fixed-point pipelines to process M non-uniform samples in approximately M cycles, irrespective of sampling pattern, yielding speedups of over 1500x the CPU baseline and 36x the state-of-the-art GPU implementation, consuming ~200 mW power and ~12 mm² area in 16 nm technology. Slice-and-Dice GPU and JIGSAW ASIC implementations achieve unprecedented end-to-end NuFFT speedups of 8x and 36x compared to the state-of-the-art GPU implementation, respectively.
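For readers unfamiliar with the gridding step, the sketch below shows a conventional, naive gridding loop in Python/NumPy: each non-uniform sample is spread over a small window of uniform grid points before a standard FFT is applied. This is only a baseline illustration of the operation being accelerated, not the Slice-and-Dice algorithm or the JIGSAW pipeline; the Gaussian kernel, window width, and grid size are assumptions made for the example.

```python
import numpy as np

def grid_nonuniform_1d(coords, values, grid_size, width=4, sigma=1.0):
    """Naive gridding: spread each non-uniform sample onto a window of
    uniform grid points using a truncated Gaussian interpolation kernel.
    Each sample touches 2*width+1 non-contiguous grid locations, which is
    why this step is memory-bandwidth bound."""
    grid = np.zeros(grid_size, dtype=np.complex128)
    for x, v in zip(coords, values):            # coords lie in [0, grid_size)
        center = int(np.round(x))
        for k in range(center - width, center + width + 1):
            w = np.exp(-0.5 * ((k - x) / sigma) ** 2)   # kernel weight
            grid[k % grid_size] += w * v                # wrap at the edges
    return grid

# Toy usage: 1000 samples at random (non-uniform) positions.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 256, size=1000)
values = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
grid = grid_nonuniform_1d(coords, values, grid_size=256)
spectrum = np.fft.fft(grid)   # the cheap, uniform-FFT part of the NuFFT
```

The scattered writes in the inner loop are the source of the poor cache behavior the abstract describes.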
Deep convolutional neural networks have shown great potential in image recognition tasks. However, the fact that the mechanism of deep learning is difficult to explain hinders its development. It involves a large amount of parameter learning, which results in high computational complexity. Moreover, deep convolutional neural networks are often limited by overfitting in regimes in which the number of training samples is limited. Conversely, kernel learning methods have a clear mathematical theory, fewer parameters, and can contend with small sample sizes; however, they are not able to handle high-dimensional data, e.g., images. It is important to achieve a performance and complexity trade-off in complicated tasks. In this paper, we propose a novel scalable deep convolutional random kernel learning in Gaussian process architecture called SDCRKL-GP, which is characterized by excellent performance and low complexity. First, we incorporated the deep convolutional architecture into kernel learning by implementing the random Fourier feature transform for Gaussian processes, which can effectively capture hierarchical and local image-level features. This approach enables the kernel method to effectively handle image-processing problems. Second, we optimized the parameters of the deep convolutional filters and Gaussian kernels by stochastic variational inference. Then, we derived the variational lower bound of the marginal likelihood. Finally, we explored the model architecture design space selection method to determine the appropriate network architecture for different datasets. The design space consists of the number of layers, the channels per layer, and so on. Different design space selections improved the scalability of the SDCRKL-GP architecture. We evaluated SDCRKL-GP on the MNIST, FMNIST, CIFAR10, and CALTECH4 benchmark datasets. Taking MNIST as an example, the classification error rate is 0.60%, and the number of parameters, number of computations, and memo
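For context, the sketch below shows the standard random Fourier feature (RFF) approximation of an RBF kernel, the building block the abstract refers to. It is a generic illustration only; the convolutional feature extractor, stochastic variational inference, and design-space search of SDCRKL-GP are not reproduced, and the dimensions and bandwidth parameter are assumptions.

```python
import numpy as np

def random_fourier_features(X, n_features=4096, gamma=1.0, seed=0):
    """Map inputs X of shape (n, d) into a feature space whose linear kernel
    approximates the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# The Monte Carlo estimate z(x)·z(y) converges to the exact RBF kernel.
X = np.random.default_rng(1).standard_normal((5, 10))
Z = random_fourier_features(X)
approx = Z @ Z.T                                  # approximate Gram matrix
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact = np.exp(-1.0 * sq_dists)                   # exact RBF Gram matrix
print(np.max(np.abs(approx - exact)))             # small approximation error
```

In the paper's setting, the inputs to such a feature map would be the outputs of convolutional filters rather than raw vectors.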
Image data is expanding rapidly along with technology development, so efficient solutions must be considered to achieve high, real-time performance when processing large image datasets. Parallel processing is increasingly used as an attractive alternative to improve performance, both on existing distributed architectures and on sequential commodity computers. It can provide speedup, efficiency, reliability, incremental growth, and flexibility. We present such an alternative and stress the effectiveness of the methods in accelerating computations on a small cluster of PCs compared to a single CPU. Our paper focuses on applying edge detection to large image datasets, a fundamental and challenging task in image processing and computer vision. Five different techniques, namely Sobel, Prewitt, LoG, Canny, and Roberts, are compared in a simple experimental setup that uses the OpenCV library functions for image pixel manipulation. A Gaussian blur is used to reduce high-frequency components and manage the noise that edge detection is sensitive to. Overall, this work is part of a more extensive investigation of image segmentation methods on large image datasets, but the results presented are relevant and show the effectiveness of our approach.
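A minimal single-image sketch of the five detectors compared in the study, using the OpenCV Python bindings, is shown below. The file name, blur kernel, and thresholds are illustrative assumptions; the study's cluster-level parallelization over many images is not reproduced here.

```python
import cv2
import numpy as np

def detect_edges(path):
    """Run the five compared edge detectors on one grayscale image."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(img, (5, 5), 1.4)   # suppress high-frequency noise

    sobel = cv2.magnitude(cv2.Sobel(blurred, cv2.CV_64F, 1, 0),
                          cv2.Sobel(blurred, cv2.CV_64F, 0, 1))
    canny = cv2.Canny(blurred, 50, 150)
    log = cv2.Laplacian(blurred, cv2.CV_64F, ksize=5)   # Laplacian of the blurred image

    # Prewitt and Roberts have no dedicated OpenCV call; apply their kernels via filter2D.
    px = cv2.filter2D(blurred, cv2.CV_64F, np.array([[-1, 0, 1]] * 3, dtype=np.float64))
    py = cv2.filter2D(blurred, cv2.CV_64F, np.array([[-1, 0, 1]] * 3, dtype=np.float64).T)
    prewitt = cv2.magnitude(px, py)
    rx = cv2.filter2D(blurred, cv2.CV_64F, np.array([[1, 0], [0, -1]], dtype=np.float64))
    ry = cv2.filter2D(blurred, cv2.CV_64F, np.array([[0, 1], [-1, 0]], dtype=np.float64))
    roberts = cv2.magnitude(rx, ry)

    return {"sobel": sobel, "prewitt": prewitt, "log": log,
            "canny": canny, "roberts": roberts}
```

In a cluster setting, each worker would run this routine on its own shard of the image dataset.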
ISBN (Print): 9798400709036
Artificial intelligence has shown great potential in a variety of applications, from natural language models to audio-visual recognition, classification, and manipulation. AI researchers have to work with massive amounts of collected data for use in machine learning, raising challenges in effectively managing and utilizing the collected data during the training phase to develop and iterate on more accurate and more generalized models. In this paper, we conduct a review of parallel and distributed machine learning methods and challenges. We also propose a distributed and scalable deep learning model architecture that can span multiple processing nodes. We tested the model on the MIT Indoor dataset to evaluate its performance and scalability across multiple hardware nodes, and we show the scaling characteristics for different model sizes. We find that distributed training is 80% faster using two GPUs than one GPU. We also find that the model keeps the benefits of distributed training, such as speed and accuracy, regardless of its size or training batch size.
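As a rough illustration of data-parallel training across multiple GPUs, the sketch below uses PyTorch DistributedDataParallel. It is a generic template, not the paper's architecture: the toy model, batch size, and the 67-class output (matching MIT Indoor's scene categories) are assumptions, and the random tensors stand in for a real dataset loader.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # One process per GPU, launched e.g. with `torchrun --nproc_per_node=2 train.py`.
    dist.init_process_group("nccl")
    device = dist.get_rank() % torch.cuda.device_count()

    # Placeholder data; a real run would load the image dataset instead.
    data = TensorDataset(torch.randn(4096, 3, 224, 224), torch.randint(0, 67, (4096,)))
    sampler = DistributedSampler(data)          # shards the data across workers
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    model = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(16, 67)).to(device)
    model = DDP(model, device_ids=[device])     # gradients are all-reduced automatically
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)                # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```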
We propose parallel implementations on GPU (graphics processing unit) systems of some generic algorithms applied to the superpixel image segmentation problem. The aim is to provide standard algorithms, based on generic decentralized data structures, that can easily be improved and customized for many optimization problems on parallel platforms. Note that superpixel segmentation methods are clustering algorithms applied to image processing. Two types of algorithms are presented and implemented on the GPU, based on common parallel data structures. First, we present a parallel implementation of the well-known k-means algorithm with application to 3D data. It is based on a cellular grid subdivision of space that allows closest-point finding in constant optimal time for bounded distributions. Second, we present an application of the parallel Boruvka minimum spanning forest algorithm to compute watershed segmentation. Both techniques are fully executed on the GPU and share the same data structures, which embed disjoint-set trees and distributed linked lists. We evaluate our k-means approach against state-of-the-art methods, namely the well-known SLIC algorithm and the adaptive segmentation approach SPASM. We argue that our implementation has the shortest execution time among the tested methods, with near real-time performance and a quasi-linear acceleration factor, while it provides more regular-shaped superpixel segmentation based on hexagonal tessellation. The watershed minimum spanning forest method is presented and evaluated within the same experimental framework.
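To make the k-means-based superpixel idea concrete, here is a plain NumPy sketch of SLIC-style clustering in a joint (color, position) feature space, with centers seeded on a regular grid. This is a CPU toy for clarity only, not the paper's GPU implementation with its cellular-grid data structure; the compactness weight and segment count are assumptions.

```python
import numpy as np

def kmeans_superpixels(image, n_segments=64, iters=10, compactness=10.0):
    """SLIC-style k-means: cluster pixels by color and position."""
    h, w, _ = image.shape
    step = max(1, int(np.sqrt(h * w / n_segments)))       # grid spacing of the seeds
    yy, xx = np.mgrid[0:h, 0:w]
    spatial = (compactness / step) * np.stack([yy, xx], axis=-1)
    feats = np.concatenate([image.astype(np.float64), spatial], axis=-1)   # (h, w, 5)
    sy, sx = np.mgrid[step // 2:h:step, step // 2:w:step]
    centers = feats[sy.ravel(), sx.ravel()].copy()                         # (K, 5)
    flat = feats.reshape(-1, 5)

    for _ in range(iters):
        # Assignment: nearest center per pixel (brute force, for clarity only).
        d = ((flat ** 2).sum(1, keepdims=True) - 2.0 * flat @ centers.T
             + (centers ** 2).sum(1))
        labels = d.argmin(1)
        # Update: recompute each center as the mean of its assigned pixels.
        for k in range(len(centers)):
            members = flat[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    return labels.reshape(h, w)

# Toy usage on a random "image"; real superpixels would typically use CIELAB colors.
labels = kmeans_superpixels(np.random.default_rng(0).random((120, 160, 3)) * 255)
```

On the GPU, the assignment step is restricted to nearby cells of the cellular grid instead of the brute-force search used here.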
Generative Adversarial Networks (GAN) are approaches that are utilized for data augmentation, which facilitates the development of more accurate detection models for unusual or unbalanced datasets. Computer-assisted d...
ISBN (Digital): 9798350303582
ISBN (Print): 9798350303599
As Deep Neural Networks (DNNs) are evolving in complexity to meet the demands of novel applications, a single device becomes insufficient for training, leading to the emergence of distributed DNN training. However, this evolution exposes a gap in research surrounding security vulnerabilities to model poisoning attacks, especially in model-parallel setups, an area that has been scarcely studied. To bridge this gap, we introduce Patronus, an approach that counters model poisoning attacks in distributed DNN training, accommodating both data and model parallelism. Using Loss-aware Credit Evaluation, Patronus scores each participating client. Based on the continuously updated credit, malicious clients are isolated and detected after multiple epochs by the Shuffling-based Isolation Mechanism. Additionally, the training system is reinforced by Byzantine Fault-tolerant Aggregation to minimize the impact of malicious clients. Comprehensive experiments confirm Patronus's superior reliability and efficiency over existing methods under attack scenarios.
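For intuition about the aggregation component, the sketch below shows one generic Byzantine fault-tolerant aggregator, a coordinate-wise trimmed mean over client updates. It is not Patronus's credit-based scheme, only an illustration of how a robust aggregator limits a poisoned update; the client count and trim level are assumptions.

```python
import torch

def trimmed_mean_aggregate(grads, trim_k=1):
    """Coordinate-wise trimmed mean of per-client updates: sort every
    coordinate across clients and drop the trim_k largest and smallest
    values before averaging."""
    stacked = torch.stack(grads)                            # (n_clients, ...)
    sorted_vals, _ = stacked.sort(dim=0)
    kept = sorted_vals[trim_k: stacked.shape[0] - trim_k]   # drop extremes on each side
    return kept.mean(dim=0)

# Toy usage: 6 benign clients plus 1 client sending a poisoned (huge) update.
honest = [torch.randn(10) for _ in range(6)]
poisoned = [torch.full((10,), 1e6)]
agg = trimmed_mean_aggregate(honest + poisoned, trim_k=1)
print(agg)   # the poisoned update is trimmed away instead of skewing the mean
```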
ISBN (Digital): 9798331509712
ISBN (Print): 9798331509729
Failure recovery is one of the most essential problems in Internet of Things (IoT) systems, and the conventional snapshot method is an effective way to solve it. However, snapshot methods lack specialized designs for heterogeneous IoT devices, and when implemented on edge devices, they cause serious system interruptions and degrade performance. To address these problems, a dynamic checkpointing strategy is proposed for IoT systems that consist of heterogeneous devices. First, an anomaly detection network for snapshots (ADSnet) that combines long short-term memory networks with multilayer convolutional networks is used to learn the multidimensional features of system resource usage. Second, ADSnet is tuned during deployment to learn the behaviors of target devices, so that ADSnet can report anomalies of the target devices in the near future. Finally, a dynamic checkpointing strategy is proposed to dynamically create snapshots on the basis of the anomaly detection results. The experimental results show that the proposed ADSnet achieves 97.73% accuracy in detecting anomalies on the target device; furthermore, our proposed dynamic checkpointing strategy creates 25.4% fewer snapshots than the recently proposed ResCheck.
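A minimal sketch of the decision loop such a strategy implies is shown below: snapshots are taken on a long baseline interval, but immediately when the anomaly detector flags imminent trouble. The detector (ADSnet in the paper) is passed in as a callable; its internals, the interval, and the file path are assumptions, not the paper's implementation.

```python
import pickle
import time

def checkpoint_loop(read_metrics, predict_anomaly, get_state,
                    base_interval=300.0, path="snapshot.pkl"):
    """Dynamic checkpointing sketch driven by an anomaly predictor."""
    last = time.monotonic()
    while True:
        metrics = read_metrics()                 # CPU, memory, I/O usage, etc.
        risky = predict_anomaly(metrics)         # True if an anomaly is expected soon
        due = time.monotonic() - last >= base_interval
        if risky or due:
            with open(path, "wb") as f:
                pickle.dump(get_state(), f)      # persist application state
            last = time.monotonic()
        time.sleep(1.0)
```

Compared with fixed-interval checkpointing, snapshots are skipped while the device looks healthy, which is where the reduction in snapshot count comes from.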
ISBN (Print): 9798350320107
NASA has committed to open-source science that enables Earth observation data transparency, inclusivity, accessibility, and reproducibility - all fundamental to the pace and quality of scientific progress. We have embraced this vision by producing standard InSAR science products that are freely available to the public through NASA Distributed Active Archive Centers (DAACs) and are generated using state-of-the-art open-source and openly developed methods. The Advanced Rapid Imaging and Analysis (ARIA) project's Sentinel-1 Geocoded Unwrapped Phase product (ARIA-S1-GUNW) is a 90-meter InSAR product that spans major land-based fault systems, the US coasts, and active volcanic regions through the complete Sentinel-1 record. The products enable the measurement of centimeter-scale surface displacement, with applications across the solid earth, hydrology, and sea-level disciplines. The ARIA-S1-GUNW also enables rapid-response mapping of surface motion after earthquakes, landslides, and subsidence. The ARIA-S1-GUNW products are freely available through the Alaska Satellite Facility (ASF) DAAC. In the last year, we have successfully grown the archive to over 1.1 million products, a six-fold increase, through NASA ACCESS by improving our processing workflow and leveraging HyP3, an AWS-based cloud processing environment. We are continuing to partner with researchers to generate more products over relevant areas of scientific interest. All the processing software and cloud infrastructure are open source to ensure reproducibility and enable other scientists to modify, improve upon, and scale their own cloud workflows for related InSAR analyses. In parallel, we have developed and supported open-source, well-documented tools to further streamline time-series analysis from the ARIA-S1-GUNW into deformation analysis workflows.
This article presents a GPU-accelerated software design of the recently proposed Slanted Stixels model, which represents the geometric and semantic information of a scene in a compact and accurate way. We reformulate the depth measurement model to reduce the computational complexity of the algorithm, relying on the confidence of the depth estimation and the identification of invalid values to handle outliers. The proposed massively parallel scheme and data layout for the irregular computation pattern, which corresponds to a Dynamic Programming paradigm, are described and carefully analyzed in performance terms. Performance is shown to scale gracefully on current-generation embedded GPUs. We assess the proposed methods in terms of semantic and geometric accuracy as well as run-time performance on three publicly available benchmark datasets. Our approach achieves real-time performance with high accuracy for 2048 x 1024 image sizes and 4 x 4 Stixel resolution on the low-power embedded GPU of an NVIDIA Tegra Xavier.
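To give a feel for the per-column dynamic programming involved, the sketch below splits one column of disparities into vertical segments, each fitted by its mean value, trading data cost against a per-segment penalty. It is a simplified CPU toy of the kind of column-wise DP that gets parallelized on the GPU; the constant-depth segment model and the penalty value are simplifications of (not equivalent to) the Slanted Stixel cost model.

```python
import numpy as np

def column_stixel_dp(disparity_col, seg_penalty=5.0):
    """Optimal 1-D segmentation of a column into constant-disparity segments."""
    n = len(disparity_col)
    prefix = np.concatenate([[0.0], np.cumsum(disparity_col)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(disparity_col ** 2)])

    def seg_cost(i, j):            # sum of squared residuals of rows i..j-1 around their mean
        s, s2, m = prefix[j] - prefix[i], prefix_sq[j] - prefix_sq[i], j - i
        return s2 - s * s / m

    best = np.full(n + 1, np.inf)
    best[0] = 0.0
    cut = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):         # last segment covers rows i..j-1
            c = best[i] + seg_cost(i, j) + seg_penalty
            if c < best[j]:
                best[j], cut[j] = c, i
    # Backtrack the optimal cut positions.
    cuts, j = [], n
    while j > 0:
        cuts.append((cut[j], j))
        j = cut[j]
    return cuts[::-1]

# Three flat regions in the column are recovered as three segments.
print(column_stixel_dp(np.array([10.0] * 20 + [30.0] * 15 + [5.0] * 25)))
```

On the GPU, many such columns are processed independently, which is what makes the irregular DP amenable to a massively parallel layout.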