MapReduce is a widely used programming framework for the implementation of cloud computing application in data centers. This work presents a novel configurable hardware accelerator that is used to speed up the process...
详细信息
ISBN:
(纸本)9781450326711
MapReduce is a widely used programming framework for the implementation of cloud computing application in data centers. This work presents a novel configurable hardware accelerator that is used to speed up the processing of multi-core and cloud computing applications based on the MapReduce programming framework. The proposed MapReduce configurable accelerator is augmented to multi-core processors and it performs a fast indexing and accumulation of the key/value pairs based on an efficient memory architecture using Cuckoo hashing. The MapReduce accelerator consists of the memory buffers that store the key/value pairs, and the processing units that are used to accumulate the key's value sent from the processors. In essence, this accelerator is used to alleviate the processors from executing the Reduce tasks, and thus executing only the Map tasks and emitting the intermediate key/value pairs to the hardware acceleration unit that performs the Reduce operation. The number and the size of the keys that can be stored on the accelerator are configurable and can be configured based on the application requirements. The MapReduce accelerator has been implemented and mapped to a multi-core FPGA with embedded ARM processors (Xilinx Zynq FPGA) and has been integrated with the MapReduce programming framework under Linux. The performance evaluation shows that the proposed accelerator can achieve up to 1.8x system speedup of the MapReduce applications and hence reduce significantly the execution time of multi-core and cloud computing applications. (Action: "Supporting Postdoctoral Researchers", "Education and Lifelong Learning" Program (GSRT) and co-financed by the ESF and the Greek State.)
With the development of computer technology, multi-core programming is now becoming hot issues. Based on directed acyclic graph, this paper gives definition of a number of executable operations and establishes a paral...
详细信息
ISBN:
(纸本)9783037856529
With the development of computer technology, multi-core programming is now becoming hot issues. Based on directed acyclic graph, this paper gives definition of a number of executable operations and establishes a parallel programming pattern. Using verticies to represent tasks and edges to represent communication between vertex, this parallel programming pattern let the programmers easily to identify the available concurrency and expose it for use in the algorithm design. The proposed pattern can be used for large-scale static data batch processing in multi-core environments and can bring lots of convenience when deal with complex issues.
Markov random field models provide a robust formulation of low-level vision problems. Among all these problems, stereo vision remains the most investigated field. The belief propagation (BP) method provides accurate r...
详细信息
Markov random field models provide a robust formulation of low-level vision problems. Among all these problems, stereo vision remains the most investigated field. The belief propagation (BP) method provides accurate result in stereo vision problems. However, the algorithm remains slow for practical use. This paper describes a case study on the parallelization of belief propagation for stereo matching using the "multi-core Software APIs" (MSA) on embedded MPSoC environments. MSA is a library-based middleware providing an asynchronous remote procedure call (RPC) mechanism. It supplies a function-offloading programming model to hide the underlying interprocessor communication and configuration detail from programmers. Furthermore, MSA provides a set of stream-specific APIs for supporting a streaming-function remoting mechanism on heterogeneous multi-core architectures. Our experiments shows that the BP method for stereo matching can be adapted from a single core program to a multi-core one for embedded MPSoC environments rapidly.
Porous media simulation is an important contribution in the study of many physical phenomena. The NoMISS greedy algorithm outstands from the existing sequential algorithms for constructing a pore subnetwork, in a rela...
详细信息
ISBN:
(纸本)9780769548463
Porous media simulation is an important contribution in the study of many physical phenomena. The NoMISS greedy algorithm outstands from the existing sequential algorithms for constructing a pore subnetwork, in a relatively fast way. However, despite the NoMISS time reduction, there are still problems related to the required processing time when very large networks need to be studied. In this work, a non scalable parallel version of the NoMISS algorithm is presented, and a new approach is proposed to alleviate this issue;in both versions cluster cores work simultaneously on different porous subnetwork spaces. The first approach, named as Unbounded-NoMISS, allows the cores to go forward with the initialization of the porous subnetwork space, applying a balancing policy when a core needs more data. At the end, the cores require a sequential synchronization to finish the porous network construction. The second approach, named as Bounded-NoMISS, controls the porous subnetwork initialization by considering a site-size boundary, avoiding the final strong synchronization and improving considerably the scalability. The obtained results using a 125-core cluster are presented.
The need to increase performance while conserving energy lead to the emergence of multi-core processors. These processors provide a feasible option to improve performance of software applications by increasing the num...
详细信息
ISBN:
(纸本)9781424437511
The need to increase performance while conserving energy lead to the emergence of multi-core processors. These processors provide a feasible option to improve performance of software applications by increasing the number of cores, instead of relying on increased clock speed of a single core. The uptake of multi-core processors by hardware vendors present variety of challenges to the software community. In this context, it is important that messaging libraries based on the Message Passing Interface (MPI) standard support efficient inter-core communication. Typically processing cores of today's commercial multi-core processors share the main memory. As a result, it is vital to develop devices to exploit this. MPJ Express is our implementation of the MPI-like Java bindings. The software has mainly supported communication with two devices;the first is based on Java New I/O (NIO) and the second is based on Myrinet. In this paper, we present two shared memory implementations meant for providing efficient communication of multi-core and SMP clusters. The first implementation is pure Java and uses Java threads to exploit multiple cores. Each Java thread represents an MPI level OS process and communication between these threads is achieved using shared data structures. The second implementation is based on the System V (SysV) IPC API. Our goal is to achieve better communication performance than already existing devices based on Transmission Control Protocol (TCP) and Myrinet on SMP and multi-core platforms. Another design goal is that existing parallel applications must not be modified for this purpose, thus relieving application developers from extra efforts of porting their applications to such modern clusters. We have benchmarked our implementations and report that threads-based device performs the best on an Intel quad-core Xeon cluster.
The implementation of a proof-of-concept Lattice Quantum Chromodynamics kernel on the Cell processor is described in detail, illustrating issues encountered in the porting process. The resulting code performs up to 45...
详细信息
The implementation of a proof-of-concept Lattice Quantum Chromodynamics kernel on the Cell processor is described in detail, illustrating issues encountered in the porting process. The resulting code performs up to 45 GFlop/s per socket (without inter-node parallel communications), indicating that the Cell processor is likely to be a good platform for future Lattice QCD calculations. (C) 2008 Elsevier B.V. All rights reserved.
A multifrontal code is introduced for the efficient solution of the linear system of equations arising from the analysis of structures. The factorization phase is reduced into a series of interleaved element assembly ...
详细信息
A multifrontal code is introduced for the efficient solution of the linear system of equations arising from the analysis of structures. The factorization phase is reduced into a series of interleaved element assembly and dense matrix operations for which the BLAS3 kernels are used. A similar approach is generalized for the forward and back substitution phases for the efficient solution of structures having multiple load conditions. The program performs all assembly and solution steps in parallel. Examples are presented which demonstrate the code’s performance on single and dual core processor computers.
Software reuse technology can improve the efficiency of program development greatly. A reusable Apla-Java component has been developed in the research of PAR (Partition and Recur) method and their tools. We have made ...
详细信息
ISBN:
(纸本)9780769534985
Software reuse technology can improve the efficiency of program development greatly. A reusable Apla-Java component has been developed in the research of PAR (Partition and Recur) method and their tools. We have made the most of reuse-driven software theory and the partial implementation theory for reference which ensure the accuracy of the components effectively. Apla-Java component is an important part of "Apla --> Java automatic conversion software". It can support multi-core programming for the implementation of parallel and concurrent mechanism. Some multi-core program examples have been developed based on the components. Experiments show that the approach of multicoreprogramming based on Apla-Java reusable components can greatly enhance the efficiency of development multi-core program.
暂无评论