In this work, two parallel techniques based on shared-memory programming are presented. These models are especially well suited to evolutionary algorithms. To study their performance, the algorithm UEGO (Universal Evolutionary Global Optimizer) has been chosen.
ISBN: 9781479987191 (print)
In shared-memory multicore architectures, handling a cache write is more complicated than in single-processor systems. A cache line may be present in more than one private L1 cache, and any cache that wants to write this line must inform all the other sharers. A cache coherence protocol is therefore necessary in multicore architectures. Directory-based protocols are currently popular in both industry and academia because they generate less coherence traffic than snooping protocols, at the expense of an indirection. The write policy (write-through or write-back) is crucial in the protocol design. The write-through policy consumes more bandwidth because it increases write traffic in the interconnection network, and it also increases energy consumption. However, it can efficiently solve the false sharing problem via write updates. In this paper, we introduce a new way to reduce the write traffic of a write-through coherence protocol by combining write-through coherence with a write-back policy for non-coherent lines. The baseline write-through protocol used as reference is a scalable hybrid invalidate/update protocol. Simulation results show that our enhanced protocol reduces write traffic in the interconnection network by at least 50% and improves performance by up to 20% compared with the baseline write-through protocol.
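To make the traffic argument concrete, the following toy model counts network write messages under a pure write-through policy and under a hybrid policy that writes back non-coherent (private) lines. It is only a sketch, not the authors' protocol or simulator: the trace format, the `shared_lines` set, the function name `write_traffic`, and the assumption that each dirty private line costs exactly one message when it is finally flushed are all simplifications introduced here.

```python
def write_traffic(trace, shared_lines, hybrid):
    """Count network write messages for a trace of (core, line) writes.

    Assumed cost model (hypothetical, for illustration only):
      - a written-through write sends one message immediately;
      - a written-back line is marked dirty and sends one message
        when it is flushed at the end of the trace.
    """
    messages = 0
    dirty = set()  # (core, line) pairs held dirty in a private cache
    for core, line in trace:
        if hybrid and line not in shared_lines:
            # Non-coherent line: absorb the write locally (write-back).
            dirty.add((core, line))
        else:
            # Coherent line, or pure write-through: send every write.
            messages += 1
    # Each dirty line is flushed once.
    return messages + len(dirty)
```

For a trace with eight writes by core 0 to a private line and one write each by cores 0 and 1 to a shared line, this model counts 10 messages for pure write-through and 3 for the hybrid policy. These figures come from the toy cost model only and say nothing about the results reported in the abstract.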
ISBN: 9783642155819 (print)
We present the project of parallelising the computational algebra system GAP. Our design aims to make concurrency facilities available to GAP users while preserving as much of the existing codebase (about one million lines of code) as possible, with as few changes as possible. It also aims to spare users, a large percentage of whom are domain experts without necessarily having a background in parallel programming, from having to learn complicated parallel programming techniques. To this end, we preserve the appearance of sequentiality on a per-thread basis by containing each thread within its own data space. Parallelism is made possible through the notion of migrating objects out of one thread's data space into that of another, allowing threads to interact.
Despite the continuous advances of recent years in grid computing, programming paradigms are dominated by the message-passing concept. There is little support for other paradigms such as shared data or associative programming. In this paper, we analyse why previous attempts did not have a significant impact in the grid computing community. We start by assessing the landscape of grid programming solutions with a focus on shared data concepts. Next, we introduce an original idea to attack shared data programming on the grid by making use of both relaxed consistency models and user-specified type consistency in an object-oriented model. Last but not least, we present a prototype architecture together with experimental results.
ISBN: 9781605584133 (print)
The engagement of cluster and grid computing, two popular trends in today's high-performance computing, has created a pressing need for efficient utilization of the available resources. In this paper we present the concept, design and implementation of the Pleiad platform. Having its origin in the proposition of distributed shared memory (DSM), Pleiad is a cluster middleware that provides a shared-memory abstraction, which enables transparent multithreaded execution across the cluster nodes. It belongs to a new generation of cluster middleware that, aside from providing a proof of concept for unifying the cluster's memory resources, aims to achieve satisfactory levels of performance and scalability for a broad range of multithreaded applications. First results from the performance evaluation of Pleiad appear encouraging, and they are presented in comparison with an efficient implementation of MPI for the Java platform.
OpenMP is an architecture-independent language for programming in the shared-memory model. OpenMP is designed to be simple and powerful in terms of programming abstractions. Unfortunately, the architecture-independent abstractions sometimes come at the price of low parallel performance. This is especially true for applications with an unstructured data access pattern running on distributed shared-memory (DSM) systems. Here, proper data distribution and algorithmic optimizations play a vital role in performance. In this article, we investigate ways of improving the performance of an industrial-class conjugate gradient (CG) solver, implemented in OpenMP and running on two types of shared-memory systems. We have evaluated bandwidth minimization, graph partitioning, and reformulations of the original algorithm that reduce global barriers. Through a detailed analysis of barrier time and memory system performance, we found that bandwidth minimization is the most important optimization, reducing both L2 misses and remote memory accesses. On a uniform memory system, we get perfect scaling. On a NUMA system, the performance is significantly improved by the algorithmic optimizations, leaving the system-dependent global reduction operations as a bottleneck.
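The abstract does not say which bandwidth minimization method was used. As a minimal sketch, assuming "bandwidth" means the matrix bandwidth of the sparse system (the largest index distance between two coupled unknowns), the code below measures it and reduces it with a basic reverse Cuthill-McKee ordering. The adjacency-list input and the function names `bandwidth` and `reverse_cuthill_mckee` are illustrative choices made here, not taken from the article.

```python
from collections import deque


def bandwidth(adj, order):
    """Largest distance |pos(u) - pos(v)| over all edges under `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return max(
        (abs(pos[u] - pos[v]) for u in range(len(adj)) for v in adj[u]),
        default=0,
    )


def reverse_cuthill_mckee(adj):
    """Breadth-first ordering from low-degree vertices, then reversed."""
    n = len(adj)
    visited = [False] * n
    order = []
    # Start each connected component from a lowest-degree vertex.
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbors in order of increasing degree.
            for u in sorted(adj[v], key=lambda w: len(adj[w])):
                if not visited[u]:
                    visited[u] = True
                    queue.append(u)
    return order[::-1]
```

For a five-vertex path whose vertices are numbered out of sequence, `adj = [[3], [3, 4], [4], [0, 1], [1, 2]]`, the original numbering has bandwidth 3 and the reordered numbering has bandwidth 1. A smaller bandwidth keeps the vector entries touched by each matrix row close together in memory, which is consistent with the reduction in cache misses and remote accesses that the abstract reports.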
Finding a good graph coloring quickly is often a crucial phase in the development of efficient parallel algorithms for many scientific and engineering applications. In this paper we consider the problem of solving the graph coloring problem itself in parallel. We present a simple and fast parallel graph coloring heuristic that is well suited for shared-memory programming and yields an almost linear speedup on the PRAM model. We also present a second heuristic that improves on the number of colors used. The heuristics have been implemented using OpenMP. Experiments conducted on an SGI Cray Origin 2000 supercomputer using very large graphs from finite element methods and eigenvalue computations validate the theoretical run-time analysis. Copyright (C) 2000 John Wiley & Sons, Ltd.
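The abstract does not describe the heuristic itself. One plausible shape for a block-partitioned coloring scheme, assumed here for illustration and not taken from the paper, is: color vertex blocks independently without synchronizing, detect edges whose endpoints received the same color, and recolor the conflicting vertices sequentially. The sketch below simulates the parallel phase in plain Python by letting every block read the same snapshot in each step; the names `greedy_color` and `speculative_coloring` are hypothetical.

```python
def greedy_color(v, adj, colors):
    """Smallest color not used by any already-colored neighbor of v."""
    used = {colors[u] for u in adj[v] if colors[u] is not None}
    c = 0
    while c in used:
        c += 1
    return c


def speculative_coloring(adj, num_blocks):
    """Color a graph (adjacency lists, no self-loops) in three phases."""
    n = len(adj)
    if n == 0:
        return []
    colors = [None] * n
    size = -(-n // num_blocks)  # ceiling division: vertices per block
    blocks = [list(range(s, min(s + size, n))) for s in range(0, n, size)]

    # Phase 1: simulated parallel coloring. In each step every block
    # colors one vertex while seeing only the colors from earlier steps.
    for step in range(size):
        snapshot = list(colors)
        for block in blocks:
            if step < len(block):
                v = block[step]
                colors[v] = greedy_color(v, adj, snapshot)

    # Phase 2: for each conflicting edge, mark the higher-numbered endpoint.
    conflicts = sorted(
        {max(u, v) for u in range(n) for v in adj[u] if colors[u] == colors[v]}
    )

    # Phase 3: recolor the marked vertices one at a time.
    for v in conflicts:
        colors[v] = greedy_color(v, adj, colors)
    return colors
```

On the complete graph with four vertices and two blocks, phase 1 produces `[0, 1, 0, 1]`, phase 2 marks vertices 2 and 3, and phase 3 yields the proper coloring `[0, 1, 2, 3]`. In this sketch most of the work sits in the synchronization-free first phase, and the sequential repair touches only the conflicting vertices.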