ISBN (print): 9783319495835; 9783319495828
In this paper we evaluate a new coalesced data and kernel scheme used to reduce the execution costs of cardiac simulations that run on multi-GPU environments. The new scheme was tested on an important part of the simulator, the solution of the systems of Ordinary Differential Equations (ODEs). The results show that the proposed scheme is very effective. The execution time to solve the systems of ODEs on the multi-GPU environment was reduced by half when compared to a scheme that does not implement the proposed data and kernel coalescing. As a result, the total execution time of cardiac simulations was 25% faster.
Given the advent of cyber-physical systems (CPS), event-based control paradigms such as complex event processing (CEP) are vital enablers for adaptive analytical control mechanisms. CPS are becoming a high-profile rese...
ISBN (print): 9781509028252
Computer scientists and programmers face the difficulty of improving the scalability of their applications while using only conventional programming techniques. As a baseline hypothesis of this paper, we assume that an advanced runtime system can be used to take full advantage of the available parallel resources of a machine in order to achieve the highest parallelism possible. In this paper we present the capabilities of HPX - a distributed runtime system for parallel applications of any scale - to achieve the best possible scalability through asynchronous task execution [1]. OP2 is an active library which provides a framework for the parallel execution of unstructured grid applications on different multi-core/many-core hardware architectures [2]. OP2 generates code which uses OpenMP for loop parallelization within an application code for both single-threaded and multi-threaded machines. In this work we modify the OP2 code generator to target HPX instead of OpenMP, i.e. port the parallel simulation backend of OP2 to utilize HPX. We compare the performance results of the different parallelization methods using HPX and OpenMP for loop parallelization within the Airfoil application. The results of strong scaling and weak scaling tests for the Airfoil application on one node with up to 32 threads are presented. Using HPX for parallelization of OP2 gives an improvement in performance of 5%-21%. By modifying the OP2 code generator to use HPX's parallel algorithms, we observe scaling improvements of about 5% as compared to OpenMP. To fully exploit the potential of HPX, we adapted the OP2 API to expose a future and dataflow based programming model and applied this technique for parallelizing the same Airfoil application. We show that the dataflow oriented programming model, which automatically creates an execution tree representing the algorithmic data dependencies of our application, improves the overall scaling results by about 21% compared to OpenMP. Our results show
Finding a minimum spanning tree is a common problem in various areas of research: recognition of different objects, computer vision, analysis and construction of networks (e.g., telephone, electrical, computer, travel, etc.), chemistry and biology, and many others. There are at least three well-known algorithms for solving this problem: Boruvka, Kruskal and Prim. Processing large graphs is a quite time-consuming task for the central processor (CPU), and is in high demand at the present moment. The usage of Graphics Processing Units (GPUs) as a means to solve general-purpose problems grows every day, because GPUs have more computing power than CPUs. But the minimum spanning tree (MST) computation on a general graph is an irregular algorithm, so it maps poorly onto the GPU architecture. This article examines a hybrid implementation of this algorithm on GPU and CPU.
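For orientation, the following is a minimal sequential sketch of Kruskal's algorithm, one of the three algorithms named above. It is not the hybrid GPU/CPU implementation the article describes. The function name `kruskal_mst`, the `(weight, u, v)` edge format, and the example graph are assumptions chosen for illustration.

```python
def kruskal_mst(num_vertices, edges):
    """Return (total_weight, chosen_edges) for an undirected weighted graph.

    edges: iterable of (weight, u, v), vertices numbered 0..num_vertices-1.
    This is an illustrative CPU-only sketch, not the article's method.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total = 0
    chosen = []
    # Consider edges from lightest to heaviest.
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge joins two different components, so it is safe to add.
            parent[root_u] = root_v
            total += weight
            chosen.append((u, v, weight))
    return total, chosen


# Hypothetical 4-vertex example graph.
example_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
mst_weight, mst_edges = kruskal_mst(4, example_edges)
print(mst_weight, mst_edges)
```

The union-find lookups follow data-dependent pointer chains, which gives a small-scale view of the irregular access pattern mentioned in the abstract.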
ISBN (print): 9783319321493; 9783319321486
The Particle-In-Cell (PIC) method is effectively used in many scientific simulation codes. In order to optimize the performance of the PIC approach, data locality is required. This relies on efficient sorting algorithms. We present a bucket sort algorithm with a small memory footprint for the PIC method targeting Graphics Processing Units (GPUs). The performance of our sorting algorithm increases with the amount of storage provided and with the orderliness of the particles. For our application, where particles are presorted, it performs better and requires less memory than other sorting algorithms in the literature. The overall PIC algorithm performs at its best if the sorting is applied.
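As a rough illustration of the general idea, the sketch below sorts particles by cell index with a plain sequential bucket (counting) sort. It is a generic CPU version and not the GPU algorithm presented in the paper. The names `bucket_sort_by_cell`, `cell_ids` and `num_cells` are hypothetical.

```python
def bucket_sort_by_cell(cell_ids, num_cells):
    """Return (order, offsets) grouping particles by cell index.

    order[k] is the index of the particle that lands in slot k after sorting;
    offsets[c] is the first slot that belongs to cell c.
    Generic sequential sketch, not the paper's GPU algorithm.
    """
    # Pass 1: count how many particles fall into each cell (bucket sizes).
    counts = [0] * num_cells
    for c in cell_ids:
        counts[c] += 1

    # Exclusive prefix sum turns bucket sizes into bucket start offsets.
    offsets = [0] * num_cells
    running = 0
    for c in range(num_cells):
        offsets[c] = running
        running += counts[c]

    # Pass 2: scatter each particle index into the next free slot of its bucket.
    next_slot = offsets[:]
    order = [0] * len(cell_ids)
    for i, c in enumerate(cell_ids):
        order[next_slot[c]] = i
        next_slot[c] += 1
    return order, offsets


# Hypothetical example: five particles spread over three cells.
cells = [2, 0, 1, 2, 0]
order, offsets = bucket_sort_by_cell(cells, 3)
print(order, offsets)
```

After sorting, particles in the same cell occupy contiguous slots, which is the data locality the PIC method benefits from.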
ISBN (print): 9783319495835; 9783319495828
The increasing use of mobile social networks has lately transformed news media. Real-world events are nowadays reported in social networks much faster than in traditional channels. As a result, the autonomous detection of events from networks like Twitter has gained a lot of interest in both research and media groups. DBSCAN-like algorithms constitute a well-known clustering approach to retrospective event detection. However, scaling such algorithms to geographically large regions and temporally long periods presents two major shortcomings. First, detecting real-world events from the vast amount of tweets can no longer be performed on a single machine. Second, the tweeting activity varies a lot within these broad space-time regions, limiting the use of global parameters. Against this background, we propose to scale DBSCAN-like event detection techniques by parallelizing and distributing them through a novel density-aware MapReduce scheme. The proposed scheme partitions tweet data according to its spatial and temporal features and tailors local DBSCAN parameters to local tweet densities. We implement the scheme in Apache Spark and evaluate its performance on a dataset composed of geo-located tweets in the Iberian Peninsula during the course of several football matches. The results point to the benefits of our proposal over other state-of-the-art techniques in terms of speed-up and detection accuracy.
We present FERARI, a prototype for processing voluminous event streams over multi-cloud platforms. At its core, FERARI both exploits the potential for in-situ (intra-cloud) processing and orchestrates inter-cloud comp...
ISBN (print): 9783319321493; 9783319321486
We present a multi-threaded solver for symmetric positive definite linear systems where the coefficient matrix of the problem features a bordered-band non-zero pattern. The algorithms that implement this approach rely heavily on a compact storage format, tailored for this type of matrix, that reduces the memory requirements, produces a regular data access pattern, and allows the bulk of the computations to be cast in terms of efficient kernels from the Level-3 and Level-2 BLAS. The efficiency of our approach is illustrated by numerical experiments.
ISBN (print): 9781467385664
Complex image processing algorithms that require higher computational power with large-scale inputs can be processed efficiently using the parallel and distributed processing of the Hadoop MapReduce framework. Hadoop MapReduce is a scalable model which is capable of processing petabytes (on the order of 10^15 bytes) of data with improved fault tolerance and data parallelism. In this paper we present a MapReduce framework for performing parallel remote sensing satellite data processing using Hadoop and storing the output in HBase. The speedup and performance show that by utilizing Hadoop, we can distribute our workload across different clusters to take advantage of combined processing power on commodity hardware.
Coherent stacking is a key procedure for a class of algorithms that are used to process seismic data. The paper presents an efficient implementation of a coherent stacking algorithm on CUDA-based GPUs. We discuss a set of optimizations that allowed the implementation to reach 70% of peak hardware performance. Tests reveal a linear dependency between computing time and problem size. Terabytes of seismic data cannot be placed into the memory of a GPU card at once, and thus the processing must be organized in portions. Optimal portion sizes were found for the following generations of Nvidia GPUs: Fermi, Kepler, Maxwell.