ISBN: (Print) 9781450300797
This paper presents the design and analysis of parallel approximation algorithms for facility-location problems, including NC and RNC algorithms for (metric) facility location, k-center, k-median, and k-means. These problems have received considerable attention during the past decades from the approximation-algorithms community, which has primarily concentrated on improving the approximation guarantees. In this paper, we ask: is it possible to parallelize some of the beautiful results from the sequential setting? Our starting point is a small, but diverse, subset of results in approximation algorithms for facility-location problems, with a primary goal of developing techniques for devising their efficient parallel counterparts. We focus on giving algorithms with low depth, near work efficiency (compared to the sequential versions), and low cache complexity.
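As a point of reference for one of the problems in this family, the classical greedy farthest-point heuristic gives a 2-approximation for metric k-center; a minimal sequential sketch follows. It is only the sequential baseline, not the NC/RNC parallelization the paper develops, and the function name and the use of Euclidean distances are assumptions made for the example.

```python
import math

def k_center_greedy(points, k):
    """Greedy farthest-point 2-approximation for metric k-center.

    Sequential baseline only; the distance-update loop is the step a
    parallel algorithm would handle differently.
    """
    def dist(a, b):
        return math.dist(a, b)  # Euclidean metric, assumed for this example

    centers = [points[0]]                      # arbitrary first center
    d = [dist(p, centers[0]) for p in points]  # distance to nearest center
    for _ in range(1, k):
        # Pick the point farthest from the current centers.
        i = max(range(len(points)), key=lambda j: d[j])
        centers.append(points[i])
        # Update nearest-center distances (an embarrassingly parallel step).
        d = [min(d[j], dist(points[j], points[i])) for j in range(len(points))]
    return centers

# Example usage
pts = [(0, 0), (1, 0), (10, 10), (10, 11), (5, 5)]
print(k_center_greedy(pts, 2))
```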
ISBN: (Print) 9781450300797
Energy consumption by computer systems has emerged as an important concern. However, the energy consumed in executing an algorithm cannot be inferred from its performance alone; it must be modeled explicitly. This paper analyzes the energy consumption of parallel algorithms executed on shared-memory multicore processors. Specifically, we develop a methodology to evaluate how the energy consumption of a given parallel algorithm changes as the number of cores and their frequency are varied. We use this analysis to establish the optimal number of cores to minimize the energy consumed by the execution of a parallel algorithm for a specific problem size while satisfying a given performance requirement. We study the sensitivity of our analysis to changes in parameters such as the ratio of the power consumed by a computation step to the power consumed in accessing memory. The results show that the relation between the problem size and the optimal number of cores is relatively unaffected for a wide range of these parameters.
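The abstract does not spell out the energy model, but a toy version of this kind of analysis can be sketched: assume running time T(p) = W/(p·f) + D/f for work W, depth D, and p cores at frequency f, with per-core power made of a cubic dynamic term plus a static term, and scan core counts for the energy minimum that still meets a deadline. All model choices and constants below are illustrative assumptions, not the paper's.

```python
def optimal_cores(work, depth, deadline, max_cores, freq,
                  dyn_coeff=1.0, static_power=0.5):
    """Toy energy model: pick the core count minimizing energy under a deadline.

    Assumed model (illustrative only, abstract units):
      time(p)   = work / (p * freq) + depth / freq
      power(p)  = p * (dyn_coeff * freq**3 + static_power)
      energy(p) = power(p) * time(p)
    """
    best = None
    for p in range(1, max_cores + 1):
        t = work / (p * freq) + depth / freq
        if t > deadline:
            continue                      # violates the performance requirement
        e = p * (dyn_coeff * freq ** 3 + static_power) * t
        if best is None or e < best[1]:
            best = (p, e)
    return best                           # (core count, energy) or None

print(optimal_cores(work=1000, depth=10, deadline=200, max_cores=32, freq=1.0))
```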
ISBN: (Print) 9781450300797
In this paper, we present a dynamic load-balancing algorithm for parallel digital logic simulation that makes use of reinforcement learning. We first introduce two dynamic load-balancing algorithms oriented towards balancing the computational and communication load, respectively, and then utilize reinforcement learning to create an algorithm that combines the first two. In addition, the algorithm determines the value of two important parameters: the number of processors that participate in the algorithm and the load that is exchanged during its execution. We investigate the algorithms on gate-level simulations of several open-source VLSI circuits.
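The abstract leaves the learning formulation unspecified; as a loose illustration of using reinforcement learning to choose between a computation-balancing and a communication-balancing action, here is a minimal epsilon-greedy bandit sketch. The action set, reward signal, and all names are assumptions for the example.

```python
import random

class BalancerAgent:
    """Epsilon-greedy choice between two load-balancing actions.

    Illustrative only: actions, rewards, and parameters are assumptions,
    not the paper's actual reinforcement-learning formulation.
    """
    ACTIONS = ("balance_computation", "balance_communication")

    def __init__(self, epsilon=0.1, step=0.2):
        self.epsilon = epsilon
        self.step = step
        self.value = {a: 0.0 for a in self.ACTIONS}   # estimated reward per action

    def choose(self):
        if random.random() < self.epsilon:            # explore
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=self.value.get)  # exploit best estimate

    def update(self, action, reward):
        # Incremental update of the action-value estimate.
        self.value[action] += self.step * (reward - self.value[action])

# Example: reward stands in for a measured metric such as negated slowdown
# observed after a rebalancing round.
agent = BalancerAgent()
for _ in range(100):
    a = agent.choose()
    reward = -random.uniform(0.0, 1.0)
    agent.update(a, reward)
print(agent.value)
```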
ISBN: (Print) 9781450300797
In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, i.e., critical-path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators. Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.
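The structural idea can be illustrated with an ordinary divide-and-conquer mergesort skeleton: the two recursive calls are independent, so a fork-join runtime can run them in parallel, while executing them in program order is exactly the natural sequential evaluation whose cache behavior the approach relies on. The sketch below is only that structural illustration and not the paper's work-optimal, polylogarithmic-depth sorting algorithm (in particular, its merge is sequential).

```python
def merge(left, right):
    """Standard sequential merge; a polylog-depth version would need a parallel merge."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def merge_sort(a):
    """Nested-parallel divide-and-conquer skeleton.

    The two recursive calls are independent subtasks: a fork-join runtime
    could execute them in parallel, while running them in program order is
    the natural sequential evaluation order. This sketch runs sequentially.
    """
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])    # fork: independent subtask
    right = merge_sort(a[mid:])   # fork: independent subtask
    return merge(left, right)     # join

print(merge_sort([5, 2, 9, 1, 5, 6]))
```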
ISBN: (Print) 9781450300797
The proceedings contain 45 papers. The topics discussed include: buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures; scheduling to minimize power consumption using submodular functions; collaborative scoring with dishonest participants; securing every bit: authenticated broadcast in radio networks; brief announcement: on speculative replication of transactional systems; data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory; basic network creation games; on the bit communication complexity of randomized rumor spreading; algorithms and application for grids and clouds; towards optimizing energy costs of algorithms for shared memory architectures; brief announcement: on regenerator placement problems in optical networks; best-effort group service in dynamic networks; and implementing and evaluating nested parallel transactions in software transactional memory.
ISBN: (Print) 9781450300797
In dynamically multithreaded platforms that employ work stealing, there appears to be a fundamental tradeoff between providing provably good time and space bounds and supporting SP-reciprocity, the property of allowing arbitrary calling between parallel and serial code, including legacy serial binaries. Many known dynamically multithreaded platforms either fail to support SP-reciprocity or sacrifice the provable time and space bounds that an efficient work-stealing scheduler could otherwise guarantee. We describe PR-Cilk, a design of a runtime system that supports SP-reciprocity in Cilk and provides provable bounds on time and space. In order to maintain the space bound, PR-Cilk uses subtree-restricted work stealing. We show that with subtree-restricted work stealing, PR-Cilk provides the same guarantee on stack space usage as ordinary Cilk. The completion time guaranteed by PR-Cilk is slightly worse than that of ordinary Cilk. Nevertheless, if the number of times a C function calls a Cilk function is small, or if each Cilk function called by a C function is sufficiently parallel, PR-Cilk still guarantees linear speedup.
ISBN: (Print) 9781450300797
Transactional Memory (TM) is a promising technique that simplifies parallel programming for shared-memory applications. To date, most TM systems have been designed to efficiently support single-level parallelism. To achieve widespread use and maximize performance gains, TM must support the nested parallelism available in many applications and supported by several programming models. We present NesTM, a software TM (STM) system that supports closed-nested parallel transactions. NesTM is based on a high-performance, blocking STM that uses eager version management and word-granularity conflict detection. Its algorithm targets the state and runtime overheads of nested parallel transactions. We also describe several subtle correctness issues in supporting nested parallel transactions in NesTM and discuss their performance impact. Through our evaluation, we quantitatively analyze the performance of NesTM using STAMP applications and microbenchmarks based on concurrent data structures. First, we show that the performance overhead of NesTM is reasonable when single-level parallelism is used. Second, we quantify the incremental overhead of NesTM when parallelism is exploited in deeper nesting levels and draw conclusions that can be useful in designing a nesting-aware TM runtime environment. Finally, we demonstrate a use case where nested parallelism improves the performance of a transactional microbenchmark.
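To make closed nesting concrete, here is a toy, lazy write-buffer sketch in which a child transaction's writes become visible to its parent on commit and are discarded on abort. It illustrates closed-nesting semantics only: it has no conflict detection and no concurrency, whereas NesTM itself is an eager, word-granularity, blocking STM; none of the names below come from NesTM.

```python
class Txn:
    """Toy closed-nested transaction over a shared dict (illustration only)."""

    def __init__(self, store, parent=None):
        self.store = store          # shared memory, modeled as a plain dict
        self.parent = parent
        self.writes = {}            # lazy write buffer

    def read(self, key):
        # Search own buffer, then enclosing transactions, then shared memory.
        txn = self
        while txn is not None:
            if key in txn.writes:
                return txn.writes[key]
            txn = txn.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.parent is not None:
            # Closed nesting: merge into the parent, visible only inside it.
            self.parent.writes.update(self.writes)
        else:
            # Top-level commit publishes to shared memory.
            self.store.update(self.writes)

    def abort(self):
        self.writes.clear()         # child's effects vanish; parent unaffected

# Example: a nested child commits into its parent, which then commits globally.
memory = {"x": 0}
outer = Txn(memory)
outer.write("x", 1)
inner = Txn(memory, parent=outer)
inner.write("x", 2)
inner.commit()                      # outer now reads x == 2
outer.commit()                      # shared memory now holds x == 2
print(memory["x"])
```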
ISBN: (Print) 9781450300797
In this paper, we show new parallel algorithms for a set of classical string comparison problems: computation of string alignments, longest common subsequences (LCS) or edit distances, and longest increasing subsequences. These problems have a wide range of applications, in particular in computational biology and signal processing. We discuss the scalability of our new parallel algorithms in computation time, in memory, and in communication. Our new algorithms are based on an efficient parallel method for (min, +)-multiplication of distance matrices. The core result of this paper is a scalable parallel algorithm for multiplying implicit simple unit-Monge matrices of size n x n on p processors using time O(n log n / p), communication O(n log p / p), and O(log p) supersteps. This algorithm allows us to implement scalable LCS computation for two strings of length n using time O(n^2/p) and communication O(n/sqrt(p)), requiring local memory of size O(n/sqrt(p)) on each processor. Furthermore, our algorithm can be used to obtain the first generally work-scalable algorithm for computing the longest increasing subsequence (LIS). Our algorithm for LIS computation requires computation O(n log^2 n / p), communication O(n log^2 p / p), and O(log^2 p) supersteps for computing the LIS of a sequence of length n. This is within a log n factor of work-optimality for the LIS problem, which can be solved sequentially in time O(n log n) in the comparison-based model. Our LIS algorithm is also within a log p factor of achieving perfectly scalable communication and furthermore has perfectly scalable memory size requirements of O(n/p) per processor.
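For context, the sequential comparison-based baseline mentioned for LIS is the classical O(n log n) patience-sorting method; a minimal version is sketched below. The paper's parallel algorithm is built on (min, +)-multiplication of unit-Monge matrices instead; this sketch is only the sequential yardstick.

```python
from bisect import bisect_left

def lis_length(seq):
    """Length of the longest strictly increasing subsequence in O(n log n).

    Classical patience-sorting method: tails[k] holds the smallest possible
    tail of an increasing subsequence of length k + 1.
    """
    tails = []
    for x in seq:
        i = bisect_left(tails, x)      # first tail >= x
        if i == len(tails):
            tails.append(x)            # extends the longest subsequence found so far
        else:
            tails[i] = x               # a smaller tail for subsequences of length i + 1
    return len(tails)

print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))   # -> 4 (e.g. 1, 4, 5, 9)
```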