ISBN (digital): 9798331531836
ISBN (print): 9798331531843
The article explores character recognition using convolutional neural networks (CNNs) optimized with the CUDA platform to enhance computational efficiency. It outlines the CNN architecture and the methods used to leverage GPU-based parallel data processing, and presents experimental results on the MNIST dataset. The study highlights that implementing CUDA drastically reduces processing time while maintaining a high level of predictive accuracy. The findings emphasize the potential of GPU acceleration for computationally intensive tasks, making it a promising approach for real-time applications in image recognition and machine learning.
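The abstract gives no implementation details, so the following is only a minimal sketch of the general approach: a small CNN for 28x28 MNIST digits defined in PyTorch, with model and data moved to the GPU so that cuDNN/CUDA kernels execute the convolutions in parallel. The architecture (two convolutional layers with 32/64 channels) is an illustrative assumption, not the authors' network.

import torch
import torch.nn as nn

# Minimal MNIST CNN; layer sizes are illustrative assumptions.
class MnistCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 14x14
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Linear(64 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(1))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MnistCNN().to(device)                        # weights live in GPU memory
batch = torch.randn(256, 1, 28, 28, device=device)   # stand-in for MNIST images
logits = model(batch)                                # conv kernels run in parallel on the GPU
print(logits.shape)                                  # torch.Size([256, 10])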
Remote sensing image segmentation is a specialized form of semantic segmentation that presents unique challenges not typically found in general semantic segmentation tasks. The key issues addressed in this study are t...
ISBN (print): 9783030967727; 9783030967710
Driven in part by the rapid growth of consortium blockchain applications, blockchain interoperability has become essential for exchanging transactional data among decentralized applications. To ensure the data integrity of transactions, state-of-the-art approaches to blockchain interoperability apply data locks, which severely decrease system efficiency. To boost interoperability performance, this paper proposes a novel approach based on multi-version concurrency control that parallelizes interoperable transactions, aiming for high transaction-processing throughput while ensuring data integrity. An experimental evaluation with the Smallbank benchmark shows that the proposed method achieves up to a 4x performance increase (in processed transactions per second, TPS) over existing methods while decreasing average latency by 58%.
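The paper's own protocol is not reproduced here; the sketch below is only a single-process toy illustration of the multi-version concurrency control idea it builds on: writers append new versions instead of taking locks, and readers see the latest version committed before their snapshot, so transactions can proceed in parallel. All names (VersionedStore, etc.) are hypothetical.

from dataclasses import dataclass, field

@dataclass
class VersionedStore:
    """Toy MVCC store: each key maps to a list of (commit_ts, value) versions."""
    versions: dict = field(default_factory=dict)
    clock: int = 0

    def begin(self):
        return self.clock  # snapshot timestamp: transaction sees commits <= this

    def read(self, key, snapshot_ts):
        # Latest version committed at or before the snapshot; no read locks needed.
        candidates = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return max(candidates)[1] if candidates else None

    def commit(self, writes):
        # Append new versions instead of overwriting, so readers never block writers.
        self.clock += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

store = VersionedStore()
store.commit({"balance": 100})
snap = store.begin()                 # reader takes a snapshot
store.commit({"balance": 42})        # concurrent writer commits a newer version
print(store.read("balance", snap))   # 100: the reader's snapshot is unaffected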
ISBN (digital): 9781665497862
ISBN (print): 9781665497862
High inference times of machine learning-based axon tracing algorithms pose a significant challenge to the practical analysis and interpretation of large-scale brain imagery. This paper explores a distributed data pipeline that employs a SLURM-based job array to run multiple machine learning algorithm predictions simultaneously. Image volumes were split into N (1-16) equal chunks, each handled by a unique compute node and stitched back together into a single 3D prediction. Preliminary results comparing the inference speed of 1- versus 16-node job arrays demonstrated a 90.95% decrease in compute time for a 32 GB input volume and an 88.41% decrease for a 4 GB input volume. The general pipeline may serve as a baseline for future improved implementations on larger input volumes and can be tuned to various application domains.
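The paper's pipeline code is not shown; as a rough sketch of the chunking pattern it describes, the script below uses the SLURM_ARRAY_TASK_ID / SLURM_ARRAY_TASK_COUNT environment variables that SLURM sets for each job-array task to select one of N equal sub-volumes along the z-axis. The file names and the tracing function are placeholders, not the paper's algorithm.

import os
import numpy as np

def trace_axons(block):
    # Hypothetical stand-in for the ML model's per-chunk inference step.
    return block > block.mean()

# SLURM sets these for each task of an `sbatch --array=0-15` submission;
# the defaults below only make the script runnable outside SLURM.
n_chunks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 16))
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))

volume = np.load("volume.npy", mmap_mode="r")    # placeholder input file
bounds = np.linspace(0, volume.shape[0], n_chunks + 1, dtype=int)
chunk = np.asarray(volume[bounds[task_id]:bounds[task_id + 1]])  # this task's z-slab

np.save(f"prediction_{task_id:02d}.npy", trace_axons(chunk))
# A final job stitches the N partial predictions back into one 3D volume.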
Image semantic segmentation is an important research direction in image processing, computer vision, and deep learning. Semantic segmentation classifies an image pixel by pixel, so that the original image is divid...
ISBN (digital): 9798350352894
ISBN (print): 9798350352900
Brain tumours are among the most life-threatening diseases, and automatic segmentation of brain tumours from medical images is crucial for clinicians to identify and quantify tumour regions with high precision. While traditional segmentation models have laid the groundwork, diffusion models have since been developed to better manage complex medical data. However, diffusion models often face challenges related to insufficient parallel computing power and inefficient GPU utilization. To address these issues, we propose the DF-SegDiff model, which includes diffusion segmentation, parallel data processing, a distributed training model, a dynamic balancing parameter, and model fusion. This approach significantly reduces training time while achieving an average Dice score of 0.87, with several samples reaching Dice values close to 0.94. By combining BRATS2020 with the Medical Segmentation Decathlon dataset, we also assembled a comprehensive dataset containing 800 training samples and 53 test samples. Evaluation of the model using Dice, IoU, and other relevant metrics demonstrates that our method outperforms current state-of-the-art techniques.
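For reference, the Dice score reported above measures the overlap between a predicted segmentation P and ground truth G as 2|P ∩ G| / (|P| + |G|); a minimal NumPy version is sketched below (the epsilon term, an assumption, guards against empty masks).

import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary masks; 1.0 is a perfect overlap."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_score(pred, target), 3))  # 0.667: 2*2 / (3 + 3)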
ISBN (print): 9789811604782; 9789811604799
During the past 10 years, there has been surging interest in developing distributed graph processing systems. This tutorial provides a comprehensive review of existing distributed graph processing systems. We first review the programming models for distributed graph processing and then summarize common optimization techniques for improving graph execution performance, including graph partitioning methods, communication mechanisms, parallel processing models, hardware-specific optimizations, and incremental graph processing. We also present an emerging hot topic, distributed Graph Neural Network (GNN) frameworks, and review recent progress on this topic.
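The tutorial's programming models are not reproduced here; the sketch below only illustrates the vertex-centric ("think like a vertex") model common to such systems, as a toy single-machine loop of bulk-synchronous supersteps computing single-source shortest paths. Real systems partition the vertices across machines and exchange these messages over the network.

import math
from collections import defaultdict

# graph: vertex -> list of (neighbor, edge_weight); a toy stand-in.
graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
value = {v: math.inf for v in graph}  # per-vertex state (distance from source)
inbox = defaultdict(list)
inbox["a"].append(0.0)                # seed the source vertex

# Bulk-synchronous supersteps: each active vertex consumes its messages,
# updates its value, and sends messages along its out-edges.
while inbox:
    outbox = defaultdict(list)
    for v, msgs in inbox.items():
        best = min(msgs)
        if best < value[v]:           # vertex program: keep the minimum distance
            value[v] = best
            for nbr, w in graph[v]:
                outbox[nbr].append(best + w)
    inbox = outbox                    # superstep barrier

print(value)  # {'a': 0.0, 'b': 1.0, 'c': 2.0}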
In the age of big data, the volume of RDF data has been exploding due to the growing demands for open data, including Linked Open Data (LOD), semantic data processing, and knowledge graphs. Large-scale RDF data may co...
Modern space applications require high computing power and high reliability from on-board processors. To meet these requirements, the German Aerospace Center (DLR) is developing the Scalable On-board Computer for Space Avionics (ScOSA) system with a distributed non-shared memory architecture. As performance is an important criterion in the selection of hardware for space missions, the European Space Agency has published an open-source benchmark suite called OBPMark. It is a set of benchmarks based on typical space applications and designed to measure system-level performance. However, there is currently no standard tool for evaluating the performance of distributed on-board computers. In this paper, we propose a parallelization strategy for running the OBPMark image processing benchmark on a distributed on-board computer. We used a split-map-reduce model to integrate the #1.1 image calibration and correction benchmark of OBPMark into the ScOSA system. We evaluated the resulting distributed benchmark on the existing ScOSA High Performance Nodes (HPNs), consisting of 5 Xilinx Zynq 7020 SoCs. The results show a significant reduction of the benchmark execution time from 9.0 to 2.8 seconds using 5 nodes; in the dual-core case with 4 nodes, the execution time was reduced to 2.5 seconds. We conclude that OBPMark is a valuable tool for evaluating the performance of distributed on-board computers with non-shared memory architectures and contributes to the standardisation of performance evaluation in the space domain.
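The ScOSA integration itself is not shown in the abstract; the snippet below only sketches the split-map-reduce pattern it describes, using Python's multiprocessing in place of ScOSA nodes: the image is split into row bands (split), each worker applies a correction function (map), and the bands are reassembled (reduce). The calibration step is a placeholder, not OBPMark's #1.1 kernel.

import numpy as np
from multiprocessing import Pool

def calibrate(band):
    # Placeholder for OBPMark-style radiometric correction (offset + gain).
    return (band - band.mean()) * 1.1

def split_map_reduce(image, n_nodes=5):
    bands = np.array_split(image, n_nodes, axis=0)   # split across "nodes"
    with Pool(n_nodes) as pool:
        corrected = pool.map(calibrate, bands)       # map: one band per worker
    return np.vstack(corrected)                      # reduce: stitch rows back

if __name__ == "__main__":
    frame = np.random.rand(1024, 1024).astype(np.float32)
    out = split_map_reduce(frame)
    print(out.shape)  # (1024, 1024)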
ISBN (digital): 9798331524937
ISBN (print): 9798331524944
Graphics Processing Units (GPUs) have become the standard for accelerating scientific applications on heterogeneous systems. However, as GPUs get faster, one potential performance bottleneck in GPU-accelerated applications is the overhead of launching many fine-grained kernels. CUDA Graph addresses these performance challenges by enabling a graph-based execution model that captures operations as nodes and dependencies as edges in a static graph, thereby consolidating several kernel launches into one graph launch. We propose a performance optimization strategy for iteratively launched kernels. By grouping kernel launches into iteration batches and then unrolling these batches into a CUDA Graph, iterative applications can benefit from CUDA Graph for a performance boost. We analyze the performance gain and overhead of this approach by designing a skeleton application. The skeleton application also serves as a generalized example of converting an iterative solver to CUDA Graph and as a basis for deriving a performance model. Using the skeleton application, we show that when unrolling iteration batches for a given platform, there is an optimal iteration batch size, independent of workload, that balances the extra overhead of graph creation with the performance gain of graph execution. Depending on workload, we show that the optimal iteration batch size gives more than a 1.4x speed-up in the skeleton application. Furthermore, we show that similar speed-ups can be gained in Hotspot and Hotspot3D from the Rodinia benchmark suite and in a Finite-Difference Time-Domain (FDTD) Maxwell solver.
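As a hedged illustration of the batching idea (not the paper's skeleton application), the PyTorch sketch below captures BATCH iterations of a simple in-place update into one torch.cuda.CUDAGraph, so a single replay() stands in for BATCH kernel launches; BATCH plays the role of the tunable iteration batch size discussed above, and the update rule is an arbitrary stand-in for a solver step.

import torch

assert torch.cuda.is_available()
x = torch.randn(1 << 20, device="cuda")
BATCH = 8  # iteration batch size unrolled into the graph (the tunable knob)

# Warm-up on a side stream is required before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        x.mul_(0.999).add_(0.001)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):              # capture BATCH iterations as one static graph
    for _ in range(BATCH):
        x.mul_(0.999).add_(0.001)      # in-place, so each replay advances the same buffer

n_iters = 800
for _ in range(n_iters // BATCH):      # one launch replays BATCH fused iterations
    g.replay()
torch.cuda.synchronize()
print(x.mean().item())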