ISBN: (Print) 9781450315722
Efficient memory management of dynamic non-blocking data structures remains an important open question. Existing methods either sacrifice the ability to deallocate objects or notably reduce performance. In this paper, we present a novel technique, called Drop the Anchor, which significantly reduces the overhead associated with memory management while reclaiming objects even in the presence of thread failures. We demonstrate this memory management scheme on the common linked list data structure. Using extensive evaluation, we show that Drop the Anchor significantly outperforms Hazard Pointers, the widely used technique for non-blocking memory management.
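For context, the sketch below illustrates the Hazard Pointers baseline named above, not the Drop the Anchor technique itself (whose mechanism the abstract does not detail). The idea: each thread publishes the node it is about to dereference, and a retired node may be reclaimed only once no thread has it published. All class and method names are illustrative assumptions.

import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Minimal Hazard Pointers sketch: one published "hazard" slot per thread.
final class HazardPointers<T> {
    private final AtomicReferenceArray<T> hazards;

    HazardPointers(int maxThreads) {
        hazards = new AtomicReferenceArray<>(maxThreads);
    }

    // Publish the node we are about to dereference, then re-validate that
    // it is still the node reachable from 'src'; retry if it changed.
    T protect(int tid, AtomicReference<T> src) {
        T node = src.get();
        while (true) {
            hazards.set(tid, node);
            T again = src.get();
            if (node == again) return node;   // safe to dereference
            node = again;                     // list changed under us; retry
        }
    }

    // Done traversing: withdraw the protection.
    void clear(int tid) {
        hazards.set(tid, null);
    }

    // A retired node may be freed only if no thread currently protects it;
    // this per-reclamation scan is the overhead Drop the Anchor aims to cut.
    boolean canReclaim(T node) {
        for (int i = 0; i < hazards.length(); i++) {
            if (hazards.get(i) == node) return false;
        }
        return true;
    }
}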
ISBN: (Print) 9781450315722
As we increase the number of cores on a processor die, the on-chip cache hierarchies that support these cores are getting larger, deeper, and more complex. As a result, non-uniform memory access effects are now prevalent even on a single chip. To reduce execution time and energy consumption, data access locality should be exploited. This is especially important for task-based programming systems, where a scheduler decides when and where on the chip the code segments, i.e., tasks, should execute. Capturing locality for structured task parallelism has been done effectively, but the more difficult case, unstructured parallelism, remains largely unsolved: little quantitative analysis exists to demonstrate the potential of locality-aware scheduling and to guide future scheduler implementations in the most fruitful direction. This paper quantifies the potential of locality-aware scheduling for unstructured parallelism on three different many-core processors. Our simulation results of 32-core systems show that locality-aware scheduling can bring up to 2.39x speedup over a randomized schedule, and 2.05x speedup over a state-of-the-art baseline scheduling scheme. At the same time, a locality-aware schedule reduces average energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. In addition, our 1024-core simulation results project that these benefits will only increase: compared to 32-core executions, we see up to 1.83x additional locality benefits. To capture such potential in a practical setting, we also perform a detailed scheduler design space exploration to quantify the impact of different scheduling decisions. We also highlight the importance of locality-aware stealing, and demonstrate that a stealing scheme can exploit significant locality while performing load balancing. Over randomized stealing, our proposed scheme shows up to 2.0x speedup for stolen tasks.
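To make the scheduling dimension concrete, here is a hypothetical sketch, not the paper's scheduler: tasks are queued on the core that owns their data, a worker pops locally in LIFO order (data still warm in cache), and an idle worker steals FIFO from the nearest victim first, with ring distance standing in for on-chip cache distance. Every name and the distance metric are assumptions for illustration.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Locality-aware task scheduler sketch: one deque per core.
final class LocalityScheduler {
    private final List<Deque<Runnable>> queues = new ArrayList<>();

    LocalityScheduler(int cores) {
        for (int i = 0; i < cores; i++) queues.add(new ArrayDeque<>());
    }

    // Enqueue a task on the core that "owns" its data.
    void submit(Runnable task, int homeCore) {
        Deque<Runnable> q = queues.get(homeCore);
        synchronized (q) { q.addLast(task); }
    }

    // Pop locally first; otherwise steal from the nearest non-empty victim.
    Runnable next(int core) {
        Deque<Runnable> local = queues.get(core);
        synchronized (local) {
            Runnable t = local.pollLast();        // LIFO locally: data still warm
            if (t != null) return t;
        }
        for (int d = 1; d < queues.size(); d++) { // locality-aware stealing
            Deque<Runnable> victim = queues.get((core + d) % queues.size());
            synchronized (victim) {
                Runnable t = victim.pollFirst();  // steal the oldest (coldest) task
                if (t != null) return t;
            }
        }
        return null;                              // nothing runnable anywhere
    }
}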
ISBN: (Print) 9781450315722
We present novel work-optimal PRAM algorithms for Burrows-Wheeler (BW) compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log^2 n), and the depth of the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression scheme. Our algorithms for the individual stages of compression and decompression may also be of independent interest: 1. a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; 2. original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent. We then mapped such parallelism in interesting ways to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately-defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression (inverse block-sorting transform). Companion work reports empirical speedups of up to 25x for compression and up to 13x for decompression. This reflects a speedup of 70x over recent work on BW compression on GPUs.
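As a small worked illustration of reduction (i), not code from the paper: the occurrence-rank computation used when inverting the block-sorting transform is a prefix-sums instance whose associative binary operator is ordinary addition, and Java's Arrays.parallelPrefix evaluates it with O(n) work and O(log n) depth. The input string here is a made-up toy.

import java.util.Arrays;

public final class PrefixSumDemo {
    public static void main(String[] args) {
        // Toy "last column" of a block-sorting transform (illustrative only).
        byte[] lastColumn = "annb$aa".getBytes();
        // rank[i] = number of 'a's in positions 0..i: a prefix-sums problem
        // with addition as the associative binary operator.
        int[] rank = new int[lastColumn.length];
        for (int i = 0; i < rank.length; i++) {
            rank[i] = (lastColumn[i] == 'a') ? 1 : 0;
        }
        Arrays.parallelPrefix(rank, Integer::sum); // O(n) work, O(log n) depth
        System.out.println(Arrays.toString(rank)); // [1, 1, 1, 1, 1, 2, 3]
    }
}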
ISBN: (Print) 9781450312134
The proceedings contain 40 papers. The topics discussed include: time vs. space trade-offs for rendezvous in trees; allowing each node to communicate only once in a distributed system: shared whiteboard models; optimal and competitive runtime bounds for continuous, local gathering of mobile robots; online multi-robot exploration of grid graphs with rectangular obstacles; in search of parallel dimensions; delegation and nesting in best-effort hardware transactional memory; design, verification and applications of a new read-write lock algorithm; a lock-free B+tree; brief announcement: the problem based benchmark suite; brief announcement: subgraph isomorphism on a multithreaded shared memory architecture; efficient cache oblivious algorithms for randomized divide-and-conquer on the multicore model; a scalable framework for heterogeneous GPU-based clusters; and faster and simpler width-independent parallel algorithms for positive semidefinite programming.
Programs written for an architecture with n processors require a rewrite when migrated to an m-processor architecture (m > n) to benefit from additional resources. Compiler-based solutions do not match manual op...
ISBN: (Print) 9781450312134
Performance matters. But how we improve it is changing. Historically, transparent hardware improvements would mean software just ran faster. That may not necessarily be true in the future. To continue the pace of innovation, the future will need to be increasingly parallel, involving parallelism across data, threads, cores, and nodes. This talk will explore some of the dimensions of parallelism, and the opportunities and challenges they pose.
ISBN: (Print) 9781450312134
Creating components based on concurrent and parallel algorithms and data structures often requires more attention to "engineering" issues not seen with most other libraries. Components created in the "obvious" way sometimes turn out to be wrong, to perform poorly, or are unusable in most applications, because the abstractions in which they are expressed are leaky, imprecise, or incorrectly specified. While many of these issues are encountered in other aspects of concurrent programming, the impact is accentuated when they continue to leak through to APIs provided by library components. This presentation surveys some examples and lessons learned from the design, implementation, and applications of the java.util.concurrent library, including those surrounding memory models, resource management and garbage collection, thread management, optimization, and code generation.
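As one concrete example of the kind of "obvious but wrong" component this talk describes (my illustration, not an example taken from the presentation): a per-key counter built on ConcurrentHashMap. The check-then-act version below races even though every individual map call is thread-safe; the abstraction leaks because atomicity does not compose. computeIfAbsent closes the gap.

import java.util.concurrent.ConcurrentHashMap;

// Per-key counters on a concurrent map: a classic leaky-composition pitfall.
final class CounterDemo {
    private final ConcurrentHashMap<String, int[]> counts = new ConcurrentHashMap<>();

    // Broken "obvious" version: two threads can both observe null, each put
    // its own array, and increments on the overwritten array are lost.
    void incrementRacy(String key) {
        int[] c = counts.get(key);
        if (c == null) {
            c = new int[1];
            counts.put(key, c);       // may silently replace another thread's array
        }
        synchronized (c) { c[0]++; }
    }

    // Correct version: computeIfAbsent makes the check-then-act step atomic,
    // so all threads share a single counter cell per key.
    void increment(String key) {
        int[] c = counts.computeIfAbsent(key, k -> new int[1]);
        synchronized (c) { c[0]++; }
    }
}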