ISBN (print): 9781605584133
The convergence of cluster and grid computing, two popular trends in today's high-performance computation, has created an imperative need for efficient utilization of the available resources. In this paper we present the concept, design and implementation of the Pleiad platform. Rooted in the idea of distributed shared memory (DSM), Pleiad is a cluster middleware that provides a shared-memory abstraction enabling transparent multithreaded execution across cluster nodes. It belongs to a new generation of cluster middleware that, beyond demonstrating proof of concept for unifying the cluster's memory resources, aims to achieve satisfactory levels of performance and scalability for a broad range of multithreaded applications. First results from the performance evaluation of Pleiad are encouraging and are presented in comparison with an efficient implementation of MPI for the Java platform.
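To make the shared-memory abstraction concrete, below is a minimal C++ sketch of the page-based DSM idea that middleware such as Pleiad builds on: a read miss fetches a page from its current owner, and a write first invalidates remote copies so every node keeps seeing one coherent heap. All names here (DsmRuntime, DsmPage, fetch_page) are hypothetical illustrations, not Pleiad's actual API, and the network operations are stubbed out.

#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr std::size_t kPageSize = 4096;

enum class PageState { Invalid, ReadOnly, ReadWrite };

struct DsmPage {
    PageState state = PageState::Invalid;
    int owner = -1;                      // node currently holding the master copy
    std::uint8_t data[kPageSize] = {};
};

class DsmRuntime {
public:
    explicit DsmRuntime(int node_id) : node_id_(node_id) {}

    // A read miss fetches the page from its owner; a write additionally
    // invalidates remote copies, so every node keeps seeing one shared heap.
    std::uint8_t* access(std::size_t page_no, bool for_write) {
        DsmPage& p = pages_[page_no];
        if (p.state == PageState::Invalid)
            fetch_page(p);                     // network transfer (stubbed)
        if (for_write && p.state != PageState::ReadWrite) {
            invalidate_remote_copies(page_no); // coherence action (stubbed)
            p.state = PageState::ReadWrite;
            p.owner = node_id_;
        }
        return p.data;
    }

private:
    void fetch_page(DsmPage& p) { p.state = PageState::ReadOnly; }
    void invalidate_remote_copies(std::size_t) {}

    int node_id_;
    std::unordered_map<std::size_t, DsmPage> pages_;
};

int main() {
    DsmRuntime node(0);
    std::uint8_t* page = node.access(42, /*for_write=*/true);
    page[0] = 1;   // looks like an ordinary memory write to the application
}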
ISBN (print): 9781479987191
In shared-memory multicore architectures, handling a write cache operation is more complicated than in single-processor systems. A cache line may be present in more than one private L1 cache, so any cache wishing to write that line must inform all the other sharers. A cache coherence protocol is therefore necessary for multicore architectures. At present, directory-based protocols are the popular choice in both industry and academia because they generate less coherence traffic than snooping protocols, at the expense of an indirection. The write policy, write-through or write-back, is crucial in the protocol design. The write-through policy reduces the available bandwidth, since it increases the write traffic in the interconnection network, and it also increases energy consumption; however, it can efficiently solve the false-sharing problem via write updates. In this paper, we introduce a new way to reduce the write traffic of a write-through coherence protocol by combining write-through coherence with a write-back policy for non-coherent lines. The baseline write-through protocol used as reference is a scalable hybrid invalidate/update protocol. Simulation results show that our enhanced protocol reduces the write traffic in the interconnection network by at least 50% and gains up to 20% in performance compared with the baseline write-through protocol.
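As a rough illustration of the hybrid policy described above, the C++ sketch below dispatches a write hit per line: lines marked shared by the directory stay write-through (propagating an update), while private, non-coherent lines are handled write-back by merely setting a dirty bit, deferring traffic to eviction. The structures and the stubbed directory call are illustrative assumptions, not the paper's actual hardware design.

#include <cstdint>
#include <iostream>

struct CacheLine {
    std::uint64_t tag = 0;
    bool valid = false;
    bool dirty = false;    // meaningful only under write-back
    bool shared = false;   // set by the directory when more than one sharer exists
};

void send_update_to_directory(const CacheLine&) {}  // network message (stubbed)

// Decide what a store that hits in the L1 does under the hybrid policy.
void handle_write_hit(CacheLine& line) {
    if (line.shared) {
        // Shared lines stay write-through: propagate the new value via the
        // directory so other sharers are updated (avoids false sharing).
        send_update_to_directory(line);
    } else {
        // Private ("non-coherent") lines are write-back: set the dirty bit
        // and defer all network traffic until eviction.
        line.dirty = true;
    }
}

int main() {
    CacheLine private_line; private_line.valid = true;
    CacheLine shared_line;  shared_line.valid = true; shared_line.shared = true;
    handle_write_hit(private_line);
    handle_write_hit(shared_line);
    std::cout << "private dirty: " << private_line.dirty
              << ", shared dirty: " << shared_line.dirty << '\n';
}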
ISBN (print): 9783642155819
We present the project of parallelising the computational algebra system GAP. Our design aims to make concurrency facilities available to GAP users while preserving as much of the existing codebase (about one million lines of code) as possible, with as few changes as possible, and without requiring users (a large percentage of whom are domain experts in their fields without necessarily having a background in parallel programming) to learn complicated parallel programming techniques. To this end, we preserve the appearance of sequentiality on a per-thread basis by containing each thread within its own data space. Parallelism is made possible through the notion of migrating objects out of one thread's data space into that of another, allowing threads to interact.
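The data-space discipline can be pictured with the short C++ sketch below: each object records an owning thread, only the owner may read it, and migration transfers ownership before another thread touches it. Object, read and migrate are illustrative stand-ins for the idea only; GAP's actual region implementation differs.

#include <cassert>
#include <future>
#include <thread>

struct Object {
    std::thread::id owner = std::this_thread::get_id();  // born in its creator's data space
    int payload = 0;
};

// Only the owning thread may touch an object, preserving the appearance
// of sequential execution inside each data space.
int read(const Object& o) {
    assert(o.owner == std::this_thread::get_id());
    return o.payload;
}

// Migration hands the object over to another thread's data space; the
// sender must not access it afterwards.
void migrate(Object& o, std::thread::id new_owner) {
    assert(o.owner == std::this_thread::get_id());
    o.owner = new_owner;
}

int main() {
    Object obj;
    obj.payload = 41;

    std::promise<void> handed_off;
    std::future<void> ready = handed_off.get_future();
    std::thread worker([&] {
        ready.wait();        // block until ownership has arrived
        int v = read(obj);   // legal now: this thread owns obj
        (void)v;
    });

    migrate(obj, worker.get_id());  // main gives the object away
    handed_off.set_value();
    worker.join();
}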
ISBN (print): 9781450363211
The still increasing number of transistors per chip offered by Moore's law, together with the Post-Dennard scaling era shifted the performance gain from frequency increase to multi-core processing. Consequently, the support of parallel execution of applications is becoming mandatory. Furthermore, the need for efficient parallel models and languages is more critical for the embedded domain, due to power consumption and memory constraints, among others. This work focuses on parallelizing an embedded speaker recognition application, which is a biometric technique for identification. While a lot of work has been done for speech recognition, fewer efforts have focused on recognizing who the speaker is. In this paper, we analyze two implementations for speaker recognition applications (SRA), namely dataflow and shared memory programming models. More precisely, we use Process Networks (PNs) as a dataflow representation, which is an intuitive way to design streaming applications. We use the language "C for Process Networks" for the dataflow implementation and OpenMP for the sharedmemory one. For two different target platforms, we compared two implementations using OpenMP (exploring data-level parallelism only and with pipelining) against a dataflow-based compiled implementation that allows for functional optimization. Despite faster communication over sharedmemory, we show that the dataflow model is superior in terms of performance (up to twice as fast).
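For the shared-memory variant, the data-level parallelism mentioned above boils down to scoring independent feature frames concurrently, as in the hedged OpenMP/C++ sketch below. The 39-dimensional frames and score_frame() are stand-ins for real acoustic features and speaker-model scoring, not code from the paper.

// build: g++ -fopenmp sra_omp.cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical per-frame scoring against one speaker model (stand-in for
// a real GMM log-likelihood evaluation).
double score_frame(const std::vector<double>& frame) {
    double s = 0.0;
    for (double x : frame) s += std::log1p(x * x);
    return s;
}

int main() {
    // Stand-in for 1000 frames of 39-dimensional acoustic features.
    std::vector<std::vector<double>> frames(1000, std::vector<double>(39, 0.5));
    double total = 0.0;

    // Frames are scored independently, so a parallel-for with a reduction
    // captures the data-level parallelism discussed above.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < static_cast<long>(frames.size()); ++i)
        total += score_frame(frames[i]);

    std::printf("total log-likelihood: %f\n", total);
}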
ISBN (print): 9781450375887
Due to the slowdown of Moore's Law, system designers have begun integrating non-cache-coherent heterogeneous computing elements in order to continue scaling performance. Programming such systems has traditionally been difficult: developers were forced to use programming models that exposed multiple memory regions, requiring them to manually maintain memory consistency. Previous works proposed distributed shared memory (DSM) as a way to achieve high programmability in such systems. However, past DSM systems were plagued by low-bandwidth networking and utilized complex memory consistency protocols, which limited their adoption. Recently, new networking technologies have begun to change the assumptions about which components are the bottlenecks in the system. Additionally, many popular shared-memory programming models utilize memory consistency semantics similar to those proposed for DSM, leading to widespread adoption in mainstream programming. In this work, we argue that it is time to revive DSM as a means for achieving good programmability and performance on non-cache-coherent systems. We explore optimizing an existing DSM protocol by relaxing memory consistency semantics and exposing new cross-node barrier primitives. We integrate the new mechanisms into an existing OpenMP runtime, allowing developers to leverage cross-node execution without changing a single line of code. When evaluated on an x86 server connected to an ARMv8 server via InfiniBand, the DSM optimizations achieve an average of 11% (up to 33%) improvement versus the baseline DSM implementation.
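One way to picture the relaxed-consistency optimization is the C++ sketch below: under release-style semantics, stores go into a local write buffer and are only flushed to their home nodes at a synchronization point, here a cross-node barrier. The DsmNode interface and the stubbed network calls are assumptions for illustration, not the paper's actual runtime or its OpenMP integration.

#include <cstdint>
#include <unordered_map>

class DsmNode {
public:
    // Ordinary stores only reach a local write buffer, so no network
    // traffic is paid on the critical path.
    void store(std::uintptr_t addr, std::uint64_t value) {
        write_buffer_[addr] = value;
    }

    // The cross-node barrier acts as the release point: flush buffered
    // writes to their home nodes, then wait for every node (both stubbed).
    void barrier() {
        for (const auto& [addr, value] : write_buffer_)
            send_to_home_node(addr, value);
        write_buffer_.clear();
        wait_for_all_nodes();
    }

private:
    void send_to_home_node(std::uintptr_t, std::uint64_t) {}
    void wait_for_all_nodes() {}
    std::unordered_map<std::uintptr_t, std::uint64_t> write_buffer_;
};

int main() {
    DsmNode node;
    node.store(0x1000, 7);  // buffered locally, invisible to other nodes
    node.barrier();         // becomes visible cluster-wide only here
}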
This work presents a novel parallel branch-and-bound algorithm that, to the best of our knowledge, solves to optimality a set of instances of the multi-objective flexible job shop scheduling problem for the first time. It uses the well-known NSGA-II algorithm to initialize its upper bound. The algorithm is implemented for shared-memory architectures and, among its main features, incorporates a grid representation of the solution space and a concurrent priority queue to store and dispatch the pending sub-problems to be solved. We report the previously unknown optimal Pareto fronts of thirteen well-known instances from the literature, which will be very useful to the scientific community for measuring the performance of their algorithms more accurately. Indeed, we carefully analyze the performance of NSGA-II on these instances, comparing its results against the optimal ones computed in this work. Extensive computational experiments show that the proposed algorithm achieves a speedup of 15.64x with an efficiency of 65.20% using 24 cores.
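The core of the shared-memory design, a pool of workers fed by a concurrent priority queue, can be sketched in C++ as below. The sketch simplifies to a single scalar bound (the real algorithm maintains a Pareto front) and omits proper termination detection; Subproblem, the seed bound and the pruning test are illustrative placeholders.

#include <atomic>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Subproblem {
    int lower_bound = 0;  // scalar stand-in; the real algorithm keeps a Pareto front
    // ... a partial schedule would live here ...
};

// Order the queue so the most promising (lowest-bound) node is popped first.
bool operator<(const Subproblem& a, const Subproblem& b) {
    return a.lower_bound > b.lower_bound;
}

std::priority_queue<Subproblem> pending;  // shared pool of pending sub-problems
std::mutex pending_mtx;                   // makes the queue "concurrent"
std::atomic<int> best_upper{100};         // incumbent bound, e.g. seeded by NSGA-II

void worker() {
    for (;;) {
        Subproblem sp;
        {
            std::lock_guard<std::mutex> lk(pending_mtx);
            if (pending.empty()) return;  // simplified: no termination detection
            sp = pending.top();
            pending.pop();
        }
        if (sp.lower_bound >= best_upper.load()) continue;  // prune dominated node
        // branch(sp): push children back into the queue, tighten best_upper...
    }
}

int main() {
    pending.push({10});
    pending.push({50});
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}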
With the emergence of accelerators like GPUs, MICs and FPGAs, the availability of domain-specific libraries (like MKL) and the ease of parallelization associated with CUDA and OpenMP-based shared-memory programming, node-based parallelization has recently become a popular choice among developers in the field of scientific computing. This is evident from the large volume of recently published work in various domains of scientific computing where shared-memory programming and accelerators have been used to accelerate applications. Although these approaches are suitable for small problem sizes, several issues must be addressed before they can be applied to larger input domains. First, the primary focus of these works has been to accelerate the core kernel; acceleration of input/output operations is seldom considered. Many operations in scientific computing operate on large matrices, both sparse and dense, that are read from and written to external files. These input/output operations become bottlenecks and significantly affect the overall application time. Second, node-based parallelization prevents a developer from distributing the computation beyond a single node without learning an additional programming paradigm such as MPI. Third, the problem size that a node can effectively handle is limited by the memory of the node and its accelerator. In this paper, an Asynchronous Multi-node Execution (AMNE) approach is presented that uses a unique combination of the shared file system and pseudo-replication to extend node-based algorithms to a distributed multi-node implementation with minimal changes to the original node-based code. We demonstrate this approach by applying it to GEMM, a popular kernel in dense linear algebra, and show that the presented methodology significantly advances the state of the art in parallelization and scientific computing.
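Applied to GEMM, the shared-file-system-plus-pseudo-replication idea can be pictured with the C++ sketch below: every node reads the full A and B from the shared file system, computes only its assigned row-block of C with the unchanged node-local kernel, and writes that slice back. The rank/count environment variables and the stubbed I/O helpers are assumptions for illustration, not the paper's interface.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Stubbed I/O helpers over the shared file system; a real run would
// read and write binary matrix files visible to every node.
std::vector<double> read_matrix(const char* /*path*/, int rows, int cols) {
    return std::vector<double>(static_cast<std::size_t>(rows) * cols, 1.0);
}
void write_block(const char* /*path*/, const std::vector<double>&,
                 int /*first_row*/, int /*rows*/, int /*cols*/) {}

int main() {
    const int n = 256;  // kept small for the sketch
    const char* r = std::getenv("NODE_RANK");
    const char* c = std::getenv("NODE_COUNT");
    const int rank  = r ? std::atoi(r) : 0;
    const int nodes = c ? std::atoi(c) : 1;

    // "Pseudo-replication": every node reads the full inputs from the
    // shared file system instead of exchanging them over MPI.
    std::vector<double> A = read_matrix("A.bin", n, n);
    std::vector<double> B = read_matrix("B.bin", n, n);

    // Each node owns one contiguous row-block of C (n divisible by the
    // node count is assumed here for brevity).
    const int rows = n / nodes;
    const int first = rank * rows;

    std::vector<double> C(static_cast<std::size_t>(rows) * n, 0.0);
    for (int i = 0; i < rows; ++i)          // unchanged node-local kernel
        for (int k = 0; k < n; ++k)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[(first + i) * n + k] * B[k * n + j];

    write_block("C.bin", C, first, rows, n);
}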