ISBN: 9783642552243 (print)
Erasure codes such as Reed-Solomon codes can improve the availability of distributed storage in comparison with replication systems. In previous studies we investigated the implementation of these codes on multi/many-core architectures, such as Cell/B.E. and GPUs. In particular, it was shown that the bandwidth of the PCIe bus is a bottleneck for the implementation on GPUs. In this paper, we investigate how to systematically map the Reed-Solomon erasure codes onto the AMD Accelerated Processing Unit (APU), a new heterogeneous multi/many-core architecture. This architecture combines CPU and GPU in a single chip, eliminating costly transfers between them through the PCIe bus. Moreover, APU processors combine some features of Cell/B.E. processors and many-core GPUs, allowing for both vectorization and SIMT processing simultaneously. Based on the previous works, a method for the systematic mapping of computation kernels of Reed-Solomon and Cauchy Reed-Solomon algorithms onto the AMD APU architecture is proposed. This method takes into account properties of the architecture on all the levels of its parallel processing hierarchy.
Parallel disk systems are capable of fulfilling rapidly increasing demands on both large storage capacity and high I/O performance. However, it is challenging to significantly increase disk I/O bandwidth for data-intensive workloads because of two requirements: (1) reliable, instant processing of data requests under dynamic workload conditions, and (2) an optimal tradeoff between system scalability and data reliability in data-intensive systems. To increase computing performance and reduce power consumption, Graphics Processing Units (GPUs) will be used. As the architectures and data processing algorithms for GPU-based parallel disk systems are still in their infancy, this research will develop novel hardware and software architectures that include parallel GPUs, flash disks, and disk arrays for data-intensive applications. (c) 2014 Published by Elsevier B.V.
ISBN: 9783642551956 (digital)
ISBN: 9783642551956 (print)
In this paper we describe an optimized implementation of a Lattice Boltzmann (LB) code on the BlueGene/Q system, the latest-generation massively parallel system of the BlueGene family. We consider a state-of-the-art LB code that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas. The regular structure of LB algorithms offers several levels of algorithmic parallelism that can be matched by a massively parallel computer architecture. However, the complex memory access patterns associated with our LB model make it nontrivial to exploit all available parallelism efficiently. We describe our implementation strategies, based on previous experience with clusters of many-core processors and GPUs, present results, and analyze and compare performance.
ISBN: 9780735412125 (print)
The computational algorithms for device synthesis and nondestructive evaluation (NDE) are often the same. In both we have a goal: a particular field configuration that yields the design performance in synthesis, or that matches exterior measurements in NDE. The geometry of the design or the postulated interior defect is then computed. Several optimization methods are available for this. The most efficient, like conjugate gradients, are very complex to program because of the required derivative information; the least efficient, zeroth-order algorithms like the genetic algorithm, take much computational time but little programming effort. This paper reports launching a genetic algorithm kernel on thousands of Compute Unified Device Architecture (CUDA) threads, exploiting the NVIDIA graphics processing unit (GPU) architecture. The efficiency of parallelization, although below that on shared-memory supercomputer architectures, is sufficient to cut solution time down into the realm of the practicable. We carry this further into multi-physics electro-heat problems where the parameters of description are in the electrical problem and the object function is in the thermal problem. Indeed, this is where the derivative of the object function in the heat problem with respect to the parameters in the electrical problem is the most difficult to compute for gradient methods, and where the genetic algorithm is most easily implemented.
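To illustrate why a zeroth-order method takes little programming effort, here is a minimal sequential sketch of a genetic algorithm that uses only objective values and no derivatives. It is not the paper's CUDA kernel: the function name `genetic_minimize`, the one-dimensional search space, truncation selection, blend crossover, and all parameter values are assumptions chosen for brevity.

```python
import random


def genetic_minimize(objective, bounds, pop_size=40, generations=60,
                     mutation=0.1, seed=0):
    """Minimize objective(x) for a scalar x in [lo, hi] (hypothetical sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Random initial population inside the bounds.
    pop = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by objective value only; no derivative information is needed.
        pop.sort(key=objective)
        # Truncation selection: the better half survives unchanged (elitism).
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Blend crossover followed by Gaussian mutation, clamped to bounds.
            child = 0.5 * (a + b) + rng.gauss(0.0, mutation * (hi - lo))
            children.append(min(hi, max(lo, child)))
        pop = parents + children
    return min(pop, key=objective)


if __name__ == "__main__":
    best = genetic_minimize(lambda x: (x - 3.0) ** 2, (-10.0, 10.0))
    print(best)
```

Within one generation every objective evaluation is independent of the others, which is the property that would let the evaluations be spread across many GPU threads.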
ISBN: 9783319038889 (print)
The proceedings contain 76 papers. The topics discussed include: clustering and change detection in multiple streaming time series; lightweight identification of captured memory for software transactional memory; layer-based scheduling of parallel tasks for heterogeneous cluster platforms; optimistic concurrency control for energy efficiency in the wireless environment; synchronization-reducing variants of the biconjugate gradient and the quasi-minimal residual methods; exploring irregular reduction support in transactional memory; coordinate task and memory management for improving power efficiency; hardware-assisted intrusion detection by preserving reference information integrity; towards automatic generation of hardware classifiers; a practical approach for finding small independent, distance dominating sets in large-scale graphs; and heterogeneous computing vs. big data: the case of cryptanalytical applications.
ISBN: 9783319038582 (print)
The proceedings contain 76 papers. The topics discussed include: clustering and change detection in multiple streaming time series; lightweight identification of captured memory for software transactional memory; layer-based scheduling of parallel tasks for heterogeneous cluster platforms; optimistic concurrency control for energy efficiency in the wireless environment; synchronization-reducing variants of the biconjugate gradient and the quasi-minimal residual methods; exploring irregular reduction support in transactional memory; coordinate task and memory management for improving power efficiency; hardware-assisted intrusion detection by preserving reference information integrity; towards automatic generation of hardware classifiers; a practical approach for finding small independent, distance dominating sets in large-scale graphs; and heterogeneous computing vs. big data: the case of cryptanalytical applications.
Optimising the execution of Bag-of-Tasks (BoT) applications on the cloud is a hard problem due to the trade-offs between performance and monetary cost. The problem is further complicated when multiple BoT applications need to be executed. In this paper, we propose and implement a heuristic algorithm that schedules tasks of multiple applications onto different cloud virtual machines in order to maximise performance while satisfying a given budget constraint. Current approaches are limited in task scheduling since they place a limit on the number of cloud resources that can be employed by the applications. The proposed algorithm has no such limit and, in comparison with other approaches, improves performance by 10% on average. The experimental results also highlight that the algorithm yields consistent performance even with low budget constraints, which competing approaches cannot achieve.
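The abstract does not spell out the heuristic, so the following is only a generic greedy sketch of budget-constrained scheduling, not the authors' algorithm. It assumes each virtual machine is rented for a single billing period at a fixed price, that a task of size `s` takes `s / speed` time units, and that the names `schedule`, `vm_types`, and the example VM types are hypothetical.

```python
def schedule(task_sizes, vm_types, budget):
    """Greedy sketch: rent VMs within the budget, then place tasks.

    vm_types is a list of (name, speed, price) tuples (all hypothetical).
    Returns (makespan, assignment), where assignment maps a VM label to
    the list of task sizes placed on it.
    """
    # Step 1: rent VMs in order of speed per unit price while money remains.
    rented, remaining = [], budget
    by_value = sorted(vm_types, key=lambda t: t[1] / t[2], reverse=True)
    for name, speed, price in by_value:
        while price <= remaining:
            rented.append((f"{name}-{len(rented)}", speed))
            remaining -= price
    if not rented:
        raise ValueError("budget too low to rent any VM")

    # Step 2: longest task first, each on the VM that would finish it earliest.
    speed_of = dict(rented)
    finish = {label: 0.0 for label in speed_of}
    assignment = {label: [] for label in speed_of}
    for size in sorted(task_sizes, reverse=True):
        best = min(finish, key=lambda l: finish[l] + size / speed_of[l])
        finish[best] += size / speed_of[best]
        assignment[best].append(size)
    return max(finish.values()), assignment


if __name__ == "__main__":
    vms = [("small", 1.0, 1.0), ("large", 4.0, 3.0)]
    print(schedule([8, 4, 2], vms, budget=4.0))
```

Note that this sketch places no cap on the number of rented machines other than the budget itself, which mirrors the "no such limit" property described above only in spirit.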
ISBN: 9781479965694 (print)
This paper presents the development of a Coloured Petri Net (CPN) model for a concurrent application running on a heterogeneous multi/manycore node. The software runtime used (StarPU) allows the application to be expressed as a DAG (Directed Acyclic Graph) of tasks and the heterogeneous hardware to be partitioned into worker units. The CPN modelling allows rapid evaluation of the suitability of the implemented scheduling algorithms for a given problem and supports the design and implementation of new algorithms. The scheduler models were validated through runs on the real architecture.
ISBN: 9783642551956 (digital)
ISBN: 9783642551956 (print)
In this article we consider parallel numerical algorithms for solving a 3D mathematical model that describes wave propagation in a rectangular waveguide. The main goal is to formulate and analyze a minimal algorithmic template for solving this problem on the CUDA platform. This template is based on explicit finite difference schemes obtained by approximating systems of differential equations on a staggered grid. The parallelization of the discrete algorithm is based on the domain decomposition method. A theoretical complexity model is derived and the scalability of the parallel algorithm is investigated. Results of numerical simulations are presented.
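As a rough illustration of an explicit finite difference scheme on a staggered grid, the sketch below advances a one-dimensional wave written as the first-order system p_t = -c^2 v_x, v_t = -p_x. It is a sequential 1D stand-in under assumed fixed boundaries, not the paper's 3D waveguide model or its CUDA template; the names `step` and `simulate` and all parameter values are hypothetical.

```python
import math


def step(p, v, c, dt, dx):
    """Advance one explicit leapfrog step in place.

    p holds n samples at grid nodes; v holds n - 1 samples at the
    midpoints between nodes (the staggered grid).
    """
    r = dt / dx
    # Update v at midpoints from the neighbouring node values of p.
    for i in range(len(v)):
        v[i] -= r * (p[i + 1] - p[i])
    # Update interior p from the two adjacent midpoint values of v.
    for i in range(1, len(p) - 1):
        p[i] -= c * c * r * (v[i] - v[i - 1])
    # p[0] and p[-1] are left untouched: fixed (Dirichlet) boundaries.


def simulate(n=101, steps=200, c=1.0, cfl=0.5):
    """Propagate a Gaussian pulse on [0, 1]; cfl = c * dt / dx must be <= 1."""
    dx = 1.0 / (n - 1)
    dt = cfl * dx / c
    p = [math.exp(-((i * dx - 0.5) / 0.05) ** 2) for i in range(n)]
    p[0] = p[-1] = 0.0
    v = [0.0] * (n - 1)
    for _ in range(steps):
        step(p, v, c, dt, dx)
    return p


if __name__ == "__main__":
    print(max(abs(x) for x in simulate()))
```

Each update reads only immediate neighbours, so a domain decomposition would need to exchange just one layer of boundary values between adjacent subdomains per step.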
ISBN: 9783642552243 (print)
Selection algorithms find the kth smallest element from a set of elements. Although there are optimal parallel selection algorithms available for theoretical machines, these algorithms are not only difficult to implement but also inefficient in practice. Consequently, scalable applications can only use a few special cases such as minimum and maximum, where efficient implementations exist. To overcome such limitations, we propose a general parallel selection algorithm that scales even on today's largest supercomputers. Our approach is based on an efficient, unbiased median approximation method, recently introduced as median-of-3 reduction, and Hoare's sequential QuickSelect idea from 1961. The resulting algorithm scales with a time complexity of O(log^2 n) for n distributed elements while needing only O(1) space. Furthermore, we prove it to be a practical solution by explaining implementation details and showing performance results for up to 458,752 processor cores.
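For orientation, here is a minimal sequential sketch of the QuickSelect idea with an approximate-median pivot. It is not the distributed algorithm from the paper: the pivot is simply the median of three random samples, used as a stand-in for the median-of-3 reduction, whose details the abstract does not give, and the function names are hypothetical. The sketch also copies data freely, so it does not have the O(1) space property.

```python
import random


def median_of_3(a, b, c):
    """Return the median of three values."""
    return sorted((a, b, c))[1]


def quickselect(items, k):
    """Return the k-th smallest element of items, with k counted from 0."""
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    data = list(items)
    while True:
        if len(data) == 1:
            return data[0]
        # Approximate the median with three random samples (assumption).
        pivot = median_of_3(random.choice(data), random.choice(data),
                            random.choice(data))
        lower = [x for x in data if x < pivot]
        n_equal = sum(1 for x in data if x == pivot)
        if k < len(lower):
            # The answer lies among the elements below the pivot.
            data = lower
        elif k < len(lower) + n_equal:
            return pivot
        else:
            # Discard the lower and equal parts and adjust the rank.
            k -= len(lower) + n_equal
            data = [x for x in data if x > pivot]


if __name__ == "__main__":
    print(quickselect([5, 1, 4, 2, 3], 2))
```

Minimum and maximum are the special cases `k = 0` and `k = len(items) - 1`; a better pivot only changes how quickly the candidate set shrinks, not the result.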