the proceedings contain 44 papers. the topics discussed include: predicate RCU: an RCU for scalable concurrent updates;automatic scalable atomicity via semantic locking;a framework for practical parallel fast matrix m...
ISBN:
(纸本)9781450332057
the proceedings contain 44 papers. the topics discussed include: predicate RCU: an RCU for scalable concurrent updates;automatic scalable atomicity via semantic locking;a framework for practical parallel fast matrix multiplication;PLUTO+: near-complete modeling of affine transformations for parallelism and locality;distributed memory code generation for mixed irregular/regular computations;performance implications of dynamic memory allocators on transactional memory systems;low-overhead software transactional memory with progress guarantees and strong semantics∗;barrier elision for production parallel programs;scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly;and diagnosing the causes and severity of one-sided message contention.
this paper studies the essence of heterogeneity from the perspective of language mechanism design. the proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer...
详细信息
ISBN:
(纸本)9781450332057
this paper studies the essence of heterogeneity from the perspective of language mechanism design. the proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer level of source data in larger, slower or more distributed memory and an inner level of data blocks in smaller, faster or more localized memory.
We present a new, concurrent, lock-free priority queue that relaxes the delete-min operation to allow deletion of any of the ρ+1 smallest keys instead of only a minimal one, where ρ is a parameter that can be config...
详细信息
ISBN:
(纸本)9781450332057
We present a new, concurrent, lock-free priority queue that relaxes the delete-min operation to allow deletion of any of the ρ+1 smallest keys instead of only a minimal one, where ρ is a parameter that can be configured at runtime. It is built from a logarithmic number of sorted arrays, similar to log-structured merge-trees (LSM). For keys added and removed by the same thread the behavior is identical to a non-relaxed priority queue. We compare to state-of-the-art lock-free priority queues with both relaxed and non-relaxed semantics, showing high performance and good scalability of our approach.
this paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operators. Surge exposes a code generation int...
详细信息
ISBN:
(纸本)9781450332057
this paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operators. Surge exposes a code generation interface, decoupled from the core computation, that enables programmers and autotuners to easily generate multiple implementations of the same computation on various parallel architectures such as multi-core CPUs and GPUs. By decoupling computations from architecture-specific implementation, programmers can target multiple architectures more easily, and generate a search space that facilitates optimization and customization for specific architectures. We express in Surge four real-world benchmarks from domains such as sparse linear-algebra and machine learning and from the same performance-portable specification, generate OpenMP and CUDA C++ implementations. Surge generates efficient, scalable code which achieves up to 1.32x speedup over handcrafted, well-optimized CUDA code.
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. the key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately e...
详细信息
ISBN:
(纸本)9781450368186
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. the key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the same decision procedure running in parallel. the decision procedures work on different partitions of the search space without cooperating, whence distribution follows effortlessly. Our experiments on a selection of complex multi-threaded programs show significant analysis speedups and scalability, and greater performance gains than with general-purpose parallel solvers.
Data movement has a significant impact on program performance. For multithread programs, this impact is amplified, since different threads often interfere with each other by competing for shared cache space. However, ...
详细信息
ISBN:
(纸本)9781450368186
Data movement has a significant impact on program performance. For multithread programs, this impact is amplified, since different threads often interfere with each other by competing for shared cache space. However, recent de facto locality metrics consider either sequential execution only, or derive locality for multithread programs in an inefficient way, i.e. exhaustive simulation. this paper presents PLUM, a compiler solution for timescale locality analysis for parallel programs. Experiments demonstrate that the prediction accuracy is 93.97% on average. PLUM is the first tool that analyzes data locality for parallel programs during compile time;in addition, it provides an approach for efficiently studying the representative interleaving pattern for parallel executions.
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the cla...
详细信息
ISBN:
(纸本)9781450332057
Matrix multiplication is a fundamental computation in many scientific disciplines. In this paper, we show that novel fast matrix multiplication algorithms can significantly outperform vendor implementations of the classical algorithm and Strassen's fast algorithm on modest problem sizes and shapes. Furthermore, we show that the best choice of fast algorithm depends not only on the size of the matrices but also the shape. We develop a code generation tool to automatically implement multiple sequential and shared-memory parallel variants of each fast algorithm, including our novel parallelization scheme. this allows us to rapidly benchmark over 20 fast algorithms on several problem sizes. Furthermore, we discuss a number of practical implementation issues for these algorithms on shared-memory machines that can direct further research on making fast algorithms practical. Copyright 2015acm.
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance...
详细信息
ISBN:
(纸本)9781450332057
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). this paper presents TBB-NUMA, a parallelprogramming library based on Intel threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB, but more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations. Copyright 2015acm.
Stream computing is a popular paradigm for parallel and distributed computing, where compute nodes are connected by first-in first-out data channels. Each channel can be considered as a concatenation of several data b...
详细信息
Stream computing is a popular paradigm for parallel and distributed computing, where compute nodes are connected by first-in first-out data channels. Each channel can be considered as a concatenation of several data buffers, including an output buffer for the sender and an input buffer for the receiver. the configuration of buffer sizes impacts the performance as well as the correctness of the application. In this article, we focus on application deadlocks that are caused by incorrect configuration of buffer sizes. We describe three types of deadlock in streaming applications, categorized by how they can be created. To avoid them, we first prove necessary and sufficient conditions for deadlock-free computations;then based on the theorems, we propose both compile-time and runtime solutions for deadlock avoidance.
parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity ...
详细信息
ISBN:
(纸本)9781450332057
parallel application benchmarks are indispensable for evaluating/optimizing HPC software and hardware. However, it is very challenging and costly to obtain high-fidelity benchmarks reflecting the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering most accurate performance evaluation, are expensive to compile, port, reconfigure, and often plainly inaccessible due to security or ownership concerns. this work contributes APPRIME, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication-I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPRIME benchmarks. they retain the original applications' performance characteristics, in particular the relative performance across platforms.
暂无评论