ISBN: 9781605586069 (print)
The proceedings contain 43 papers. The topics discussed include: speed scaling of processes with arbitrary speedup curves on a multiprocessor; the bell is ringing in speed-scaled multiprocessor scheduling; mapping filtering streaming applications with communication costs; scheduling to minimize staleness and stretch in real-time data warehouses; parameterized maximum and average degree approximation in topic-based publish-subscribe overlay network design; selfishness in transactional memory; at-most-once semantics in asynchronous shared memory; memory models: a case for rethinking parallel languages and hardware; the life and times of a ZooKeeper; Cassandra - a structured storage system on a P2P network; Pregel: a system for large-scale graph processing; towards transactional memory semantics for C++; on avoiding spare aborts in transactional memory; inherent limitations on disjoint-access parallel implementations of transactional memory; reducers and other Cilk++ hyperobjects; and beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures.
ISBN: 9781595939739 (print)
The proceedings contain 53 papers. The topics discussed include: a first insight into object-aware hardware transactional memory; safe open-nested transactions through ownership; leveraging non-blocking collective communication in high-performance applications; fractal communication in software data dependency graphs; many random walks are faster than one; improved distributed approximate matching; graph partitioning into isolated, high conductance clusters: theory, computation and applications to preconditioning; automatic data partitioning in software transactional memories; checkpoints and continuations instead of nested transactions; adaptive transaction scheduling for transactional memory systems; operational analysis of processor speed scaling; and kicking the tires of software transactional memory: why the going gets tough.
ISBN: 159593667X (print)
The proceedings contain 37 papers. The topics discussed include: on triangulation of simple networks; strong-diameter decompositions of minor free graphs; approximation algorithms for multiprocessor scheduling under uncertainty; scheduling DAGs on asynchronous processors; scheduling to minimize gaps and power consumption; cache-oblivious streaming B-trees; an experimental comparison of cache-oblivious and cache-conscious programs; scheduling threads for constructive cache sharing on CMPs; proximity-aware directory-based coherence for multi-core processor architectures; a parallel dynamic programming algorithm on a multi-core architecture; tight bounds for distributed selection; local MST computation with short advice; distributed approximation of capacitated dominating sets; packing to angles and sectors; and the notion of a timed register and its application to indulgent synchronization.
ISBN: 9781595936677 (print)
We discuss the high-performance parallel implementation and execution of dense linear algebra matrix operations on SMP architectures, with an eye towards multi-core processors with many cores. We argue that traditional implementations, such as those incorporated in LAPACK, cannot be easily modified to render high performance as well as scalability on these architectures. The solution we propose is to arrange the data structures and algorithms so that matrix blocks become the fundamental units of data, and operations on these blocks become the fundamental units of computation, resulting in algorithms-by-blocks as opposed to the more traditional blocked algorithms. We show that this facilitates the adoption of techniques akin to the dynamic scheduling and out-of-order execution usual in superscalar processors, which we name SuperMatrix Out-of-Order scheduling. Performance results on a 16-CPU Itanium2-based server are used to highlight opportunities and issues related to this new approach.
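The idea of treating block operations as schedulable units can be illustrated with a minimal sketch. It is not the SuperMatrix implementation: the `Task` type, the `depends` rule, the wave-based grouping, and the Cholesky-style task names are all assumptions made for illustration. Each task declares which blocks it reads and writes, and a task may run as soon as every earlier task it conflicts with has finished.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Block = Tuple[int, int]  # (block row, block column); hypothetical block index


@dataclass(frozen=True)
class Task:
    """One operation on matrix blocks (hypothetical type for this sketch)."""
    name: str
    reads: FrozenSet[Block]
    writes: FrozenSet[Block]


def depends(later: Task, earlier: Task) -> bool:
    """True if `later` must wait for `earlier` (RAW, WAW or WAR on some block)."""
    return bool(
        (later.reads & earlier.writes)       # read after write
        or (later.writes & earlier.writes)   # write after write
        or (later.writes & earlier.reads)    # write after read
    )


def schedule_in_waves(tasks: List[Task]) -> List[List[str]]:
    """Group tasks, given in program order, into waves of independent tasks.

    A task lands one wave after the latest earlier task it depends on, so
    tasks in the same wave could run in parallel. This static grouping is a
    simplification of a dynamic ready-queue scheduler.
    """
    wave_of: List[int] = []
    for i, task in enumerate(tasks):
        level = 0
        for j in range(i):
            if depends(task, tasks[j]):
                level = max(level, wave_of[j] + 1)
        wave_of.append(level)
    waves: List[List[str]] = [[] for _ in range(max(wave_of, default=-1) + 1)]
    for task, level in zip(tasks, wave_of):
        waves[level].append(task.name)
    return waves


# Hypothetical first steps of a blocked factorization on a 3x3 grid of blocks.
example = [
    Task("chol00", frozenset({(0, 0)}), frozenset({(0, 0)})),
    Task("trsm10", frozenset({(0, 0), (1, 0)}), frozenset({(1, 0)})),
    Task("trsm20", frozenset({(0, 0), (2, 0)}), frozenset({(2, 0)})),
    Task("syrk11", frozenset({(1, 0), (1, 1)}), frozenset({(1, 1)})),
]
waves = schedule_in_waves(example)
```

In this example the two `trsm` tasks touch different blocks and only share a read of block (0, 0), so they fall into the same wave, while `syrk11` waits for `trsm10`.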
ISBN: 9781595936677 (print)
As the number of cores increases on chip multiprocessors, coherence is fast becoming a central issue for multi-core performance. This is exacerbated by the fact that interconnection speeds are not scaling well with technology. This paper describes mechanisms to accelerate coherence for a multi-core architecture that has multiple private L2 caches and a scalable point-to-point interconnect between cores. These techniques exploit the differences in geometry between chip multiprocessors and traditional multiprocessor architectures. Directory-based protocols have been proposed as a scalable alternative to snoop-based protocols. In this paper, we discuss implementations of coherence for CMPs and propose and evaluate a novel directory-based coherence scheme to improve the performance of parallel programs on such processors. Proximity-aware coherence accelerates read and write misses by initiating cache-to-cache transfers from the spatially closest sharer. This has the dual benefit of eliminating unnecessary accesses to off-chip memory and minimizing the distance over which communicated data moves across the network. The proposed schemes result in speedups of up to 74.9% for our workloads.
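The selection of the spatially closest sharer can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the topology or the distance metric, so the 2D mesh coordinates, the Manhattan `hop_distance`, and the `closest_sharer` helper are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Tile = Tuple[int, int]  # (row, column) of a core on an assumed 2D mesh


def hop_distance(a: Tile, b: Tile) -> int:
    """Manhattan distance, used here as a stand-in for network hops."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def closest_sharer(requester: Tile, sharers: Iterable[Tile]) -> Optional[Tile]:
    """Return the sharer nearest to the requester, or None if there is none.

    `sharers` plays the role of the directory's sharer list for one cache line.
    """
    candidates = [s for s in sharers if s != requester]
    if not candidates:
        return None  # no on-chip copy: the miss would be served from memory
    return min(candidates, key=lambda s: hop_distance(requester, s))


# A core at (0, 0) misses on a line held by three other cores.
source = closest_sharer((0, 0), [(3, 3), (0, 1), (2, 0)])
```

Here the transfer would be initiated from the core at (0, 1), one hop away, rather than from a more distant sharer or from off-chip memory.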
ISBN: 1595934529 (print)
The proceedings contain 43 papers. The topics discussed include: publish and perish: definition and analysis of an n-person publication impact game; exponential separation of quantum and classical online space complexity; minimizing the stretch when scheduling flows of biological requests; position paper and brief announcement: the FG programming environment - good and good for you; efficient parallel algorithms for dead sensor diagnosis and multiple access channels; on the communication complexity of randomized broadcasting in random-like graphs; strip packing with precedence constraints and strip packing with release times; on space-stretch trade-offs: lower bounds; a performance analysis of local synchronization; the cache complexity of multithreaded cache oblivious algorithms; and deterministic load balancing and dictionaries in the parallel disk model.
ISBN: 9781595934529 (print)
In chip multiprocessors (CMPs), limiting the number of off-chip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this brief announcement, we highlight our ongoing study [4] comparing the performance of two schedulers designed for fine-grained multithreaded programs: Parallel Depth First (PDF) [2], which is designed for constructive sharing, and Work Stealing (WS) [3], which takes a more traditional *** of schedulers. In PDF, processing cores are allocated ready-to-execute program tasks such that higher scheduling priority is given to those tasks the sequential program would have executed earlier. As a result, PDF tends to co-schedule threads in a way that tracks the sequential execution. Hence, the aggregate working set is (provably) not much larger than the single-thread working set [1]. In WS, each processing core maintains a local work queue of ready-to-execute threads. Whenever its local queue is empty, the core steals a thread from the bottom of the first non-empty queue it finds. WS is an attractive scheduling policy because when there is plenty of parallelism, stealing is quite rare. However, WS is not designed for constructive cache sharing, because the cores tend to have disjoint working *** configurations studied. We evaluated the performance of PDF and WS across a range of simulated CMP configurations. We focused on designs that have fixed-size private L1 caches and a shared L2 cache on chip. For a fixed die size (240 mm²), we varied the number of cores from 1 to 32. For a given number of cores, we used a (default) configuration based on current CMPs and realistic projections of future CMPs, as process technologies decrease from 90nm to *** of findings. We studied a variety of benchmark programs to show the following *** several application classes, PDF enable
ISBN: 0769524451 (print)
In recent years, reconfigurable technology has emerged as a popular choice for implementing various types of cryptographic functions. Nevertheless, insufficient effort has been put into fully exploiting the tremendous amounts of parallelism intrinsic to FPGAs for this class of algorithms. In this paper, we focus on block cipher architectures and explore design decisions that leverage the multi-grained parallelism inherent in many of these algorithms. We demonstrate the usefulness of this approach with a highly parallel FPGA implementation of the AES standard, and present results detailing the area/delay tradeoffs resulting from our design decisions.
The proceedings contain 39 papers. The topics discussed include: randomized queue management for DiffServ; randomization does not reduce the average delay in parallel packet switches; dynamic circular work-stealing deque; coloring unstructured radio networks; name independent routing for growth bounded networks; windows scheduling of arbitrary length jobs on parallel machines; parallel scheduling of complex dags under uncertainty; on distributed smooth scheduling; scheduling malleable tasks with precedence constraints; an adaptive power conservation scheme for heterogeneous wireless sensor networks with node redeployment; irrigating ad hoc networks in constant time; and constant density spanners for wireless ad-hoc networks.