ISBN (digital): 9783642551956
ISBN (print): 9783642551956
In this paper we describe an optimized implementation of a Lattice Boltzmann (LB) code on the BlueGene/Q system, the latest-generation massively parallel system of the BlueGene family. We consider a state-of-the-art LB code that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The regular structure of LB algorithms offers several levels of algorithmic parallelism that can be matched by a massively parallel computer architecture. However, the complex memory access patterns associated with our LB model make it non-trivial to efficiently exploit all the available parallelism. We describe our implementation strategies, based on previous experience with clusters of many-core processors and GPUs, present results, and analyze and compare performance.
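The regular, data-parallel structure mentioned in this abstract can be illustrated with the streaming (propagate) kernel of a generic LB code. The sketch below is a minimal example assuming a simple D2Q9 lattice and OpenMP threading; the paper's actual model uses many more populations and a different target architecture, and all names and sizes here are illustrative only.

```c
/* Minimal sketch of the LB streaming (propagate) kernel, assuming a
 * simple D2Q9 lattice on an LX x LY grid with a halo of one site.
 * The collide step and the paper's real (larger) population set are
 * omitted; all identifiers are hypothetical. */
#include <omp.h>

#define LX   256
#define LY   256
#define NPOP 9                     /* D2Q9 populations (assumption) */

static const int cx[NPOP] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
static const int cy[NPOP] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };

/* f[p][x][y]: population p at site (x,y); two copies for ping-pong */
static double f_old[NPOP][LX + 2][LY + 2];
static double f_new[NPOP][LX + 2][LY + 2];

/* Streaming step: each site pulls populations from its neighbours.
 * The loop nest is fully data-parallel, which is the site-level
 * parallelism that a massively parallel machine can exploit. */
void propagate(void)
{
    #pragma omp parallel for collapse(2)
    for (int x = 1; x <= LX; x++)
        for (int y = 1; y <= LY; y++)
            for (int p = 0; p < NPOP; p++)
                f_new[p][x][y] = f_old[p][x - cx[p]][y - cy[p]];
}
```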
ISBN (print): 9781450365109
We consider the loopless k-shortest path (KSP) problem. Although this problem has been studied in the sequential setting for at least the last two decades, no good parallel implementations are known. In this paper, we (i) provide a first systematic empirical comparison of various KSP algorithms and heuristic optimisations, (ii) carefully engineer parallel implementations of these sequential algorithms, and (iii) perform an extensive study of these parallel implementations on a range of graph classes and multicore architectures to determine the best algorithm and parallelisation strategy for different graph classes. We find that even though the worst-case complexity of the best undirected KSP algorithm, O(k(m + n log n)), is significantly better than that of the popular and considerably simpler directed KSP algorithm, O(kn(m + n log n)), the two algorithms are fairly competitive in terms of their empirical performance on small-diameter graphs. Furthermore, we show that a few simple optimisations help to bridge the gap between these KSP algorithms even more. However, on moderate- to large-diameter graphs, the undirected KSP algorithm is considerably faster than the directed algorithms, in both sequential and parallel settings. In terms of the parallelisation strategy, simply replacing the shortest-path subroutine by the parallel Δ-stepping algorithm can provide a good speed-up for many KSP algorithms on random graphs. In contrast, for graphs with a skewed degree distribution, a more complex strategy of parallelising the different deviations and then parallelising the shortest-path computation inside the deviations with the remaining threads provides better performance.
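The "parallelising the different deviations" strategy can be pictured with a Yen-style skeleton in which the candidate deviation paths of one iteration are computed by independent shortest-path calls. The sketch below is only an outline under assumed helpers (shortest_path, deviate, push_candidate, pop_best_candidate are hypothetical) and omits the loopless bookkeeping; it is not the paper's implementation.

```c
/* Sketch of a Yen-style k-shortest-path loop with the deviations of one
 * iteration computed in parallel (OpenMP).  graph_t, path_t and all
 * helper functions are hypothetical placeholders. */
#include <omp.h>
#include <stddef.h>

typedef struct graph_t graph_t;
typedef struct path_t  path_t;

/* assumed helpers (not defined here) */
path_t *shortest_path(const graph_t *g, int src, int dst);
path_t *deviate(const graph_t *g, const path_t *prev, int spur_idx);
void    push_candidate(path_t *p);      /* thread-safe insert             */
path_t *pop_best_candidate(void);       /* shortest remaining candidate   */
size_t  path_nodes(const path_t *p);

void k_shortest_paths(const graph_t *g, int s, int t, int k, path_t **out)
{
    out[0] = shortest_path(g, s, t);    /* P1: an ordinary SSSP call      */

    for (int i = 1; i < k; i++) {
        const path_t *prev = out[i - 1];
        int nspur = (int)path_nodes(prev) - 1;

        /* Each spur node of the previous path yields an independent
         * deviation; these shortest-path computations do not interact,
         * so they can run on separate threads.  Inside deviate() the
         * remaining threads could run a parallel SSSP (e.g. Δ-stepping). */
        #pragma omp parallel for schedule(dynamic)
        for (int j = 0; j < nspur; j++) {
            path_t *cand = deviate(g, prev, j);
            if (cand) push_candidate(cand);
        }
        out[i] = pop_best_candidate();  /* best unused candidate so far   */
    }
}
```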
ISBN (print): 9783030050573; 9783030050566
Gurevich's thesis stipulates that sequential abstract state machines (ASMs) capture the essence of sequential algorithms. On the other hand, the bulk-synchronous parallel (BSP) bridging model is a well-known model for HPC algorithm design. It provides a conceptual bridge between the physical implementation of the machine and the abstraction available to a programmer of that machine. The assumptions of the BSP model thus provide portable and scalable performance predictions on most HPC systems. We follow Gurevich's thesis and extend the sequential postulates in order to intuitively and realistically capture BSP algorithms.
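For reference, the portable performance predictions mentioned here rest on the standard BSP cost model (this is the textbook formulation, not something specific to this paper): a superstep with local work w_s, maximal h-relation h_s, bandwidth parameter g and barrier latency l costs

```latex
% Standard BSP cost of one superstep and of a whole algorithm of S supersteps
% (textbook formulation; w_s, h_s, g, l are the usual BSP parameters).
T_{\text{superstep}\,s} = w_s + h_s \cdot g + l,
\qquad
T_{\text{BSP}} = \sum_{s=1}^{S} \left( w_s + h_s \cdot g + l \right).
```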
Data mining algorithms are expensive by nature, but when dealing with today's dataset sizes, they become even slower and harder to use. Previous work has focused on parallelizing data mining algorithms on d...
The parallelization of numerical simulation algorithms, i.e., their adaptation to parallel processing architectures, is a goal to reach in order to avoid exorbitant execution times. Parallelism has been imposed at the level of processor architectures, and graphics cards are now used for general-purpose computation, also known as General-Purpose computation on Graphics Processing Units (GPGPU). The clear benefit is the excellent performance-to-price ratio. Besides hiding the low-level programming, software engineering leads to faster and more secure application development. This paper presents the real interest of using GPU processors to increase the performance of larger problems that concern electrical machine simulation. Indeed, we show that our auto-generated code applied to several models achieves speedups of the order of 10x.
ISBN (print): 9781479975051
Sorting is one of the classic problems of data processing, and many practical applications require implementations of parallel sorting algorithms. Only a few such algorithms have been implemented using MPI; in this paper, a few additional parallel sorting algorithms are implemented using MPI. A unified performance analysis of all these algorithms is presented on two different architectures. On the basis of the experimental results obtained, some guidelines are suggested for the selection of the proper algorithm.
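As an illustration of the kind of MPI sorting algorithm such a comparison would cover, the sketch below implements a plain odd-even transposition sort of per-rank blocks (a simple choice, not necessarily the best performer); the block size and the int datatype are assumptions.

```c
/* Sketch of a simple MPI parallel sort: local quicksort followed by
 * odd-even transposition merge phases between neighbouring ranks.
 * N (elements per rank) and the element type are assumptions. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define N 1024                        /* elements per rank (assumption) */

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge our block with the partner's block and keep either the lower or
 * the upper half, so the sequence becomes globally sorted across ranks. */
static void exchange_and_keep(int *mine, int partner, int keep_low)
{
    int theirs[N], merged[2 * N];
    MPI_Sendrecv(mine, N, MPI_INT, partner, 0,
                 theirs, N, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    memcpy(merged, mine, N * sizeof(int));
    memcpy(merged + N, theirs, N * sizeof(int));
    qsort(merged, 2 * N, sizeof(int), cmp_int);
    memcpy(mine, keep_low ? merged : merged + N, N * sizeof(int));
}

void parallel_sort(int *local)        /* local[] holds N ints per rank */
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    qsort(local, N, sizeof(int), cmp_int);
    for (int phase = 0; phase < size; phase++) {
        int partner = ((phase + rank) % 2 == 0) ? rank + 1 : rank - 1;
        if (partner < 0 || partner >= size) continue;
        exchange_and_keep(local, partner, rank < partner);
    }
}
```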
ISBN (print): 0818656026
Barrier algorithms are central to the performance of numerous algorithms on scalable, high-performance architectures. Numerous barrier algorithms have been suggested and studied for Non-Uniform Memory Access (NUMA) architectures, but less work has been done for Cache Only Memory Access (COMA) or attraction memory [1] architectures such as the KSR-1. In this paper, we present two new barrier algorithms that offer the best performance we have recorded on the KSR-1 distributed cache multiprocessor. We discuss the trade-offs and the performance of seven algorithms on two architectures. The new barrier algorithms adapt well to a hierarchical caching memory model and take advantage of the parallel communication offered by most multiprocessor interconnection networks. Performance results are shown for a 256-processor KSR-1 and a 20-processor Sequent Symmetry.
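A centralized sense-reversing barrier is the usual baseline in comparisons of this kind; a minimal sketch using C11 atomics is shown below. This is not one of the paper's KSR-1-tuned algorithms, which exploit the hierarchical cache instead of a single shared counter.

```c
/* Minimal centralized sense-reversing barrier using C11 atomics.
 * A common baseline in barrier studies; hierarchical/tree barriers
 * such as those in the paper reduce contention on the shared counter. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int  count;        /* threads still missing in this episode  */
    atomic_bool sense;        /* global sense, flipped once per episode */
    int         nthreads;
} barrier_t;

void barrier_init(barrier_t *b, int nthreads)
{
    atomic_init(&b->count, nthreads);
    atomic_init(&b->sense, false);
    b->nthreads = nthreads;
}

/* Each thread keeps its own local_sense (e.g. in thread-local storage)
 * and flips it on every barrier episode. */
void barrier_wait(barrier_t *b, bool *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last thread: reset the counter and release everybody */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                 /* spin until the episode's sense flips */
    }
}
```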
ISBN (print): 9783642246494
The single memory-CPU communication channel bottleneck of the von Neumann architecture is quickly stalling the growth of computer processors. A probable solution to this problem is to fuse processing and memory elements. A simple low-latency single on-chip memory and processor cannot solve the problem, as the fundamental channel bottleneck will still be there due to the logical splitting of processor and memory. This paper shows that a paradigm shift is possible by combining Arithmetic Logic Unit and Random Access Memory (ARAM) elements at the bit level. This modest bit-level ARAM is used to perform word-level ALU instructions with minor modifications. This makes the ARAM cells capable of executing instructions in parallel. It is also asynchronous and hence reduces power consumption significantly. A CMOS implementation is presented that verifies the practicality of the proposed ARAM.
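To make the bit-level idea concrete, the sketch below models in software (purely as an assumption about how such cells might be chained, not the paper's CMOS design) a row of one-bit "ALU + RAM" cells performing a word-level ripple-carry addition in place.

```c
/* Software model of a row of bit-level "ALU + RAM" cells chained into a
 * word-wide ripple-carry adder.  Conceptual illustration only; the paper
 * proposes a CMOS circuit, not C code. */
#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 16

typedef struct {
    uint8_t stored;                 /* the bit held in the cell's RAM part */
} aram_cell;

/* Each cell adds an incoming operand bit to its stored bit plus the carry
 * from its neighbour, keeps the sum in place, and forwards the carry. */
static uint8_t cell_add(aram_cell *c, uint8_t in_bit, uint8_t carry_in)
{
    uint8_t sum       = c->stored ^ in_bit ^ carry_in;
    uint8_t carry_out = (c->stored & in_bit) |
                        (carry_in & (c->stored ^ in_bit));
    c->stored = sum;                /* result stays in memory (in place)   */
    return carry_out;
}

/* Word-level "add operand to memory word" built from the bit cells. */
static void aram_add_word(aram_cell row[WORD_BITS], uint16_t operand)
{
    uint8_t carry = 0;
    for (int i = 0; i < WORD_BITS; i++)          /* LSB first */
        carry = cell_add(&row[i], (operand >> i) & 1u, carry);
}

int main(void)
{
    aram_cell row[WORD_BITS] = {0};
    aram_add_word(row, 1234);                    /* row now stores 1234 */
    aram_add_word(row, 4321);                    /* row now stores 5555 */

    uint16_t value = 0;
    for (int i = 0; i < WORD_BITS; i++)
        value |= (uint16_t)(row[i].stored & 1u) << i;
    printf("%u\n", value);                       /* prints 5555 */
    return 0;
}
```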
ISBN (print): 9781424416936
Matrix decomposition applications that involve large matrix operations can take advantage of the flexibility and adaptability of reconfigurable computing systems to improve performance. The benefits come from replication, which includes vertical replication and horizontal replication. Viewed on a space-time chart, vertical replication allows multiple computations to execute in parallel, and horizontal replication renders multiple functions on the same piece of hardware. In this paper, the reconfigurable architecture that supports replication for matrix decomposition applications on reconfigurable computing systems is described, and issues including the comparison of algorithms on the system and data movement between the internal computation cores and the external memory subsystem are addressed. A prototype of such a system is implemented as a proof of concept. It is expected to improve the performance and scalability of matrix decomposition involving large matrices.
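As an example of the kind of kernel such a system would accelerate, the sketch below shows a plain Doolittle LU factorization in C; the choice of decomposition, the absence of pivoting, and the row-major layout are simplifying assumptions for illustration, not details taken from the paper.

```c
/* Plain in-place Doolittle LU decomposition (no pivoting) of an n x n
 * row-major matrix: the kind of regular kernel whose inner loops a
 * reconfigurable system can replicate vertically (parallel lanes) and
 * horizontally (time-multiplexed functions). */
#include <stddef.h>

/* After the call, a[] holds U in its upper triangle and the unit-lower-
 * triangular L (without the implicit 1s on the diagonal) below it. */
void lu_decompose(double *a, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            double m = a[i * n + k] / a[k * n + k];   /* multiplier l_ik */
            a[i * n + k] = m;
            for (size_t j = k + 1; j < n; j++)        /* update row i    */
                a[i * n + j] -= m * a[k * n + j];
        }
    }
}
```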
ISBN (print): 9783642202810
This paper reports on methods for the parallelization of artificial neural network algorithms using multithreaded and multicore CPUs in order to speed up the training process. The developed algorithms were implemented in two common parallel programming paradigms, and their performance is assessed using four datasets with diverse amounts of patterns and with different neural network architectures. All results show a significant increase in computation speed; training time is reduced nearly linearly with the number of cores for problems with very large training datasets.
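One common shared-memory paradigm for this kind of training is data parallelism over the training patterns; a hedged sketch is shown below (a single linear layer with a mean-squared-error gradient, OpenMP 4.5 array reduction, and all names and sizes hypothetical -- not the paper's networks or datasets).

```c
/* Sketch of data-parallel batch gradient computation for a tiny
 * single-layer (linear) model using OpenMP: the training patterns are
 * split across threads and per-thread gradients are summed by an
 * OpenMP 4.5 array reduction.  Sizes and layout are illustrative only. */
#include <omp.h>

#define N_IN  64                  /* inputs per pattern (assumption)  */
#define N_PAT 100000              /* training patterns  (assumption)  */

void batch_gradient(const float x[N_PAT][N_IN], const float y[N_PAT],
                    const float w[N_IN], float grad[N_IN])
{
    for (int j = 0; j < N_IN; j++) grad[j] = 0.0f;

    /* Each thread accumulates the gradient for its share of the patterns;
     * the reduction clause sums the per-thread partial gradients. */
    #pragma omp parallel for reduction(+ : grad[:N_IN])
    for (int p = 0; p < N_PAT; p++) {
        float out = 0.0f;
        for (int j = 0; j < N_IN; j++) out += w[j] * x[p][j];
        float err = out - y[p];                   /* d(MSE)/d(out) */
        for (int j = 0; j < N_IN; j++) grad[j] += err * x[p][j];
    }
}
```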