In recent years, FPGA platforms have shown significant potential for accelerating artificial intelligence (AI) applications, particularly in embedded AI. Although various studies have explored adaptive AI deployment on FPGAs, there remains a gap in methodologies that fully integrate software adaptability with FPGA hardware reconfigurability. This article presents a novel end-to-end co-design methodology for deploying adaptable and scalable Convolutional Neural Networks (CNNs) on FPGA platforms. By combining CNN architecture adaptability with dynamic partial reconfiguration of the FPGA hardware, the framework dynamically modifies hardware acceleration units, which enhances computational performance and reduces latency. The proposed methodology enables automated synthesis and runtime customization of both hardware accelerators and CNN architectures, eliminating the need for iterative synthesis. The approach has been implemented and tested on a Xilinx XC7020 FPGA board for a CNN-based image classifier, achieving superior computation performance (0.68 s/image) and accuracy (97%) compared to state-of-the-art alternatives.
ISBN: 9781538632932 (print)
Compared with distributed graph computation, traditional single-node computation is ill-suited to processing large-scale graph data. The GAS (Gather, Apply, and Scatter) model is a universal vertex-cut graph computation programming model based on edge-centric programs; it supports graph algorithms that perform distributed graph computation after graph partitioning. In this paper, we introduce the three minor steps of GAS. We then analyze a more complete GAS process that accounts for the intra-node computation and inter-node communication of distributed graph computation. Based on this analysis, we evaluate the performance of a graph analysis algorithm that applies the GAS model on different numbers of nodes. The evaluation shows that the bottleneck is either computation performance or communication bandwidth, depending on the number of nodes, which suggests directions for optimizing the GAS model.
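The three GAS steps named in this abstract can be illustrated with a small single-process sketch. It is not the paper's implementation: the choice of PageRank as the example algorithm, the function name gas_pagerank, and its parameters are assumptions made for illustration, and the graph partitioning and inter-node communication that the paper analyzes are omitted entirely.

```python
from collections import defaultdict


def gas_pagerank(edges, num_iters=20, damping=0.85):
    """Toy PageRank phrased as Gather / Apply / Scatter (hypothetical helper).

    edges is a list of (src, dst) pairs. Everything runs in one process, so
    there is no partitioning or communication cost in this sketch.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    in_neighbors = defaultdict(list)
    for src, dst in edges:
        out_degree[src] += 1
        in_neighbors[dst].append(src)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    def scatter(current_rank):
        # Scatter: each vertex publishes the value it sends along every out-edge.
        return {
            v: current_rank[v] / out_degree[v] if out_degree[v] else 0.0
            for v in vertices
        }

    contribution = scatter(rank)
    for _ in range(num_iters):
        # Gather: each vertex sums the values arriving over its in-edges.
        gathered = {
            v: sum(contribution[u] for u in in_neighbors[v]) for v in vertices
        }
        # Apply: each vertex updates its own state from the gathered sum.
        rank = {v: (1.0 - damping) / n + damping * gathered[v] for v in vertices}
        # Scatter: refresh the per-edge values for the next iteration.
        contribution = scatter(rank)
    return rank


if __name__ == "__main__":
    # A directed 3-cycle: every vertex should end with the same rank.
    print(gas_pagerank([(0, 1), (1, 2), (2, 0)]))
```

In a distributed setting, the gather and scatter phases are where values cross partition boundaries, which is why the balance between computation and communication can shift with the number of nodes.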
Thread mapping is one of the techniques that allow the potential of modern multicore architectures to be exploited efficiently. The aim of this paper is to study the impact of thread mapping on the computing performance, scalability, and energy consumption of parallel dense linear algebra kernels on hierarchical shared-memory multicore systems. We consider a basic application, namely the matrix-matrix product (GEMM), and two parallel matrix decompositions (LU and WZ). Both factorizations exploit parallel BLAS (Basic Linear Algebra Subprograms) operations, GEMM among them. We compare various thread mapping strategies for these applications. Our results show that the choice of thread mapping has a measurable impact on the performance, scalability, and energy consumption of GEMM and the two matrix factorizations.
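To make the idea of a thread mapping strategy concrete, the sketch below computes thread-to-core assignments for two commonly used policies. The abstract does not name the strategies the paper compares, so the "compact" and "scatter" policies, the function name map_threads, and the assumption that cores are numbered contiguously within each socket are all illustrative choices, not the paper's setup.

```python
def map_threads(num_threads, sockets, cores_per_socket, strategy="compact"):
    """Return a list where entry t is the core id assigned to thread t.

    Hypothetical helper. Assumes core ids are contiguous within a socket:
    socket s owns cores s * cores_per_socket .. (s + 1) * cores_per_socket - 1.
    """
    total_cores = sockets * cores_per_socket
    if num_threads > total_cores:
        raise ValueError("more threads than cores in this sketch")

    mapping = []
    for t in range(num_threads):
        if strategy == "compact":
            # Fill one socket completely before moving to the next one.
            core_id = t
        elif strategy == "scatter":
            # Round-robin over sockets so threads are spread out.
            socket = t % sockets
            local_core = t // sockets
            core_id = socket * cores_per_socket + local_core
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        mapping.append(core_id)
    return mapping


if __name__ == "__main__":
    print(map_threads(4, sockets=2, cores_per_socket=4, strategy="compact"))
    print(map_threads(4, sockets=2, cores_per_socket=4, strategy="scatter"))
```

A compact placement tends to favor shared-cache locality, while a scattered one gives each thread more cache and memory bandwidth; which one wins for a given kernel is the kind of question the study measures.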
ISBN: 9780819469144 (print)
GIS spatial raster analysis has become a powerful tool for studying geographical phenomena. Unfortunately, computation-intensive raster operations are likely to create performance bottlenecks when run on CPUs. Over the last few years, GPU performance has improved much faster than CPU performance. For this reason, many researchers have applied GPUs to scientific, geometric, and database computations beyond graphics. This paper presents a general framework for the GPU-based implementation of GIS raster operations and reports experiments that compare the computation performance of GPU-based and CPU-based algorithms. The test results indicate that running spatial raster operations on the GPU can significantly improve their computation performance. Realizing GIS spatial analysis on the GPU therefore creates new opportunities by drastically lowering the cost of raster operations for the same hardware performance.
In this study, a fast and accurate method for predicting the radar cross-section (RCS) of large-scale targets with complicated shapes is proposed, based on a high-performance parallel finite-difference time-domain (FDTD) numerical method. To this end, several of the most popular parallel computation methods, including OpenMP, graphics processing units (GPUs), and the message-passing interface (MPI), are discussed first. Based on this discussion, a novel MPI-OpenMP-GPU hybrid parallel computation scheme for FDTD is developed. The corresponding load-balanced parallel configuration is discussed as well. Because this hybrid parallel scheme combines the merits of existing parallel technologies, the computation performance is remarkably improved. The results show that the computation time of the RCS simulation of a large-scale target can be reduced from 3 days to 0.8 h, a time saving of approximately 98.9%.
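The reported time saving follows directly from the two run times quoted in the abstract. The short check below reproduces the arithmetic; the function name time_saving is hypothetical, and the only inputs are the 3-day baseline and the 0.8 h accelerated run.

```python
def time_saving(baseline_hours, accelerated_hours):
    """Fraction of the baseline run time that is saved (hypothetical helper)."""
    return 1.0 - accelerated_hours / baseline_hours


baseline_hours = 3 * 24      # 3 days expressed in hours
accelerated_hours = 0.8      # hybrid-scheme run time quoted in the abstract

saving = time_saving(baseline_hours, accelerated_hours)
speedup = baseline_hours / accelerated_hours

if __name__ == "__main__":
    print(f"time saving: {saving:.1%}")   # prints "time saving: 98.9%"
    print(f"speedup: {speedup:.0f}x")     # prints "speedup: 90x"
```

Expressed as a speedup, the same figures correspond to roughly a 90-fold reduction in run time.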