ISBN: (Print) 9781450368186
The performance of many parallel applications has failed to scale as fast as successive generations of the hardware on which they execute. To understand the causes of scalability losses, experts use performance tools to monitor and analyze application behavior. Profiles generated by performance tools can usually indicate the presence of scalability losses, while time series data are generally necessary to pinpoint their root causes. However, manual analysis of time series data can be difficult for executions with a large number of processes, long running times, and deep call chains. This paper describes an automated framework that analyzes sample-based time series data to diagnose scalability losses in parallel executions. The framework's automated diagnosis indicates the symptoms, severity, and causes of scalability losses. Two case studies illustrate the effectiveness of this framework. Compared to a tool that analyzes performance using instrumentation-based traces, our overhead for collecting sample-based time series is 1/28 in time and 1/1600 in space, while our automated analysis takes 1/25 of the time.
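As a rough illustration of the kind of quantity such a diagnosis reports, the sketch below estimates the fraction of a call path's time at a larger scale that is not explained by ideal weak scaling. The function name and the weak-scaling baseline are illustrative assumptions, not the paper's actual analysis.

def scalability_loss(cost_small, cost_large):
    """Fraction of the per-process cost at the larger scale not explained by
    ideal weak scaling, under which per-process cost would stay constant."""
    excess = max(0.0, cost_large - cost_small)
    return excess / cost_large

# Example: a call path costing 12 s per process on 64 processes but 20 s on 1024.
print(scalability_loss(12.0, 20.0))  # 0.4 -> 40% of its time is scaling loss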
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area, and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel algorithm for BDD construction targeted at shared memory multiprocessors and distributed shared memory systems. The algorithm focuses on improving memory access locality through specialized memory managers and partial breadth-first expansion, and on improving processor utilization through dynamic load balancing. The results on a shared memory system show speedups of over two on four processors and of up to four on eight processors. The measured results clearly identify the main source of bottlenecks and point out some interesting directions for further improvements.
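For context, the minimal sequential sketch below shows the basic machinery any BDD construction relies on: a unique table that hash-conses nodes and an operation cache for apply. The paper's contributions (per-processor memory managers, partial breadth-first expansion, dynamic load balancing) are not reproduced here; all names are illustrative.

TRUE, FALSE = 1, 0

class BDD:
    def __init__(self):
        self.unique = {}                       # (var, low, high) -> node id
        self.nodes = {TRUE: None, FALSE: None}
        self.next_id = 2

    def mk(self, var, low, high):
        if low == high:                        # redundant test: skip the node
            return low
        key = (var, low, high)
        if key not in self.unique:             # hash-cons so equal subgraphs share
            self.unique[key] = self.next_id
            self.nodes[self.next_id] = key
            self.next_id += 1
        return self.unique[key]

    def var(self, v):
        return self.mk(v, FALSE, TRUE)

    def apply_and(self, f, g, cache=None):
        cache = {} if cache is None else cache
        if f == FALSE or g == FALSE: return FALSE
        if f == TRUE: return g
        if g == TRUE: return f
        if (f, g) in cache: return cache[(f, g)]
        fv, fl, fh = self.nodes[f]
        gv, gl, gh = self.nodes[g]
        v = min(fv, gv)                        # expand on the top variable
        f0, f1 = (fl, fh) if fv == v else (f, f)
        g0, g1 = (gl, gh) if gv == v else (g, g)
        r = self.mk(v, self.apply_and(f0, g0, cache), self.apply_and(f1, g1, cache))
        cache[(f, g)] = r
        return r

bdd = BDD()
x, y = bdd.var(0), bdd.var(1)
print(bdd.apply_and(x, y))   # node id of x AND y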
ISBN: (Print) 9781450332057
We present a new, concurrent, lock-free priority queue that relaxes the delete-min operation to allow deletion of any of the ρ+1 smallest keys instead of only a minimal one, where ρ is a parameter that can be configured at runtime. It is built from a logarithmic number of sorted arrays, similar to log-structured merge-trees (LSM). For keys added and removed by the same thread the behavior is identical to a non-relaxed priority queue. We compare to state-of-the-art lock-free priority queues with both relaxed and non-relaxed semantics, showing high performance and good scalability of our approach.
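The sketch below models only the relaxed semantics described above: delete-min may return any of the ρ+1 smallest keys rather than the strict minimum. It is sequential and heap-based; the paper's lock-free, LSM-style structure is not reproduced, and the class and method names are illustrative.

import heapq, random

class RelaxedPQ:
    def __init__(self, rho=0):
        self.rho = rho
        self.heap = []

    def insert(self, key):
        heapq.heappush(self.heap, key)

    def delete_min_relaxed(self):
        # Pick uniformly among the rho+1 smallest keys, then remove it.
        candidates = heapq.nsmallest(self.rho + 1, self.heap)
        chosen = random.choice(candidates)
        self.heap.remove(chosen)
        heapq.heapify(self.heap)
        return chosen

pq = RelaxedPQ(rho=3)
for k in [5, 1, 9, 2, 7]:
    pq.insert(k)
print(pq.delete_min_relaxed())   # one of {1, 2, 5, 7}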
ISBN: (Print) 9781605581170
The proceedings contain 25 papers. The topics discussed include: order-sorted dependency pairs; macros for context-free grammars; inferring precise polymorphic type dependencies in logic programs; a type system for safe memory management and its proof of correctness; programming with proofs and explicit contexts; towards execution time estimation in abstract machine-based languages; similarity-based reasoning in qualified logic programming; classifying integrity checking methods with regard to inconsistency tolerance; comprehending finite maps for algorithmic debugging of higher-order functional programs; parallel execution of multi-set constraint rewrite rules; a rewriting framework for the composition of access control policies; global difference constraint propagation for finite domain solvers; and dynamic variable elimination during propagation solving.
this paper presents a method to design and auto-tune a new parallel 3-D FFT code using the non-blocking MPI all-to-all operation. We achieve high performance by optimizing computation-communication overlap. Our code p...
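The entry above is truncated, but the stated technique (overlapping local FFT work with a non-blocking MPI all-to-all) can be illustrated with the mpi4py sketch below. The two-slab pipeline, array sizes, and variable names are assumptions for illustration, not the paper's auto-tuned code.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()
n = 16                                        # elements sent to each rank, illustrative

# Two slabs to pipeline: exchange slab k+1 while transforming slab k locally.
send = [np.random.rand(p * n) for _ in range(2)]
recv = [np.empty(p * n) for _ in range(2)]

req = comm.Ialltoall(send[0], recv[0])        # start exchanging slab 0
for k in range(2):
    nxt = None
    if k + 1 < 2:
        nxt = comm.Ialltoall(send[k + 1], recv[k + 1])   # post the next exchange early
    req.Wait()                                # slab k's exchanged data has arrived
    np.fft.fft(recv[k])                       # local 1-D FFTs overlap the in-flight exchange
    req = nxt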
ISBN: (Print) 9781450368186
We present XIndex, a concurrent ordered index designed for fast queries. Similar to the recently proposed learned index, XIndex uses learned models to improve index efficiency. Unlike the learned index, XIndex is able to handle concurrent writes effectively without affecting query performance, by leveraging fine-grained synchronization and a new compaction scheme, Two-Phase Compaction. Furthermore, XIndex adapts its structure according to runtime workload characteristics to support dynamic workloads. We demonstrate the advantages of XIndex with both YCSB and TPC-C (KV), a TPC-C variant for key-value stores. XIndex achieves up to 3.2x and 4.4x performance improvements compared with Masstree and Wormhole, respectively, on a 24-core machine, and it is open-sourced.
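For readers unfamiliar with the learned-index idea XIndex builds on, the sketch below shows a single node where a crude linear model predicts a key's position in a sorted array and a bounded local search corrects the error. Concurrency, Two-Phase Compaction, and structure adaptation from the paper are not shown; all names are illustrative.

import bisect

class LearnedNode:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values
        # Fit position ~ slope * key + intercept from the endpoints (crude model).
        span = keys[-1] - keys[0] or 1
        self.slope = (len(keys) - 1) / span
        self.intercept = -self.slope * keys[0]
        # The maximum prediction error bounds the local search window.
        self.err = max(abs(i - self._predict(k)) for i, k in enumerate(keys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def get(self, key):
        p = self._predict(key)
        lo = max(0, p - self.err)
        hi = min(len(self.keys), p + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

node = LearnedNode([3, 10, 25, 40, 77], ["a", "b", "c", "d", "e"])
print(node.get(25))   # "c"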
ISBN: (Print) 9781450301190
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. In this paper, we discuss the language, compiler, and runtime features that enable Copperhead to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We also outline the runtime support by which Copperhead programs interoperate with standard Python modules. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
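The plain-Python sketch below illustrates the programming style this targets: computations written as compositions of comprehensions, zip, and reductions over arrays, which a data-parallel compiler can lower to GPU code. It deliberately uses no Copperhead decorators or API, only ordinary Python, so it shows the model rather than the library's actual interface.

def axpy(a, x, y):
    # Flat data parallelism: every element is independent, so a compiler can
    # map this comprehension onto GPU threads.
    return [a * xi + yi for xi, yi in zip(x, y)]

def dot(x, y):
    # A map followed by a reduction, another common data-parallel pattern.
    return sum(xi * yi for xi, yi in zip(x, y))

print(axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))           # 32.0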
We have modified the C language to support a programming model based on a shared address space with physically distributed memory. With this model, users can write programs in which the nodes of a massively parallel processor can access remote memory without message passing. AC provides support for distributed arrays as well as pointers to distributed data. Simple array references and pointer dereferencing are sufficient to generate low-overhead remote reads and writes. We have implemented these ideas in a compiler based on the GNU C compiler and targeted at Cray Research's T3D. Initial performance measurements show that AC generates code for remote accesses which is considerably faster than that of the native compiler for structures up to about 16 words in size and virtually equivalent for larger transfers.
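The sketch below models only the programming-model idea: a block-distributed array whose ordinary index expressions resolve to reads and writes on the owning node, with no explicit message passing in user code. AC itself is a C extension whose compiler emits low-overhead T3D remote loads and stores; this Python model and its names are purely illustrative.

class DistributedArray:
    def __init__(self, length, nodes):
        self.nodes = nodes
        self.block = (length + nodes - 1) // nodes               # block distribution
        self.memory = [[0] * self.block for _ in range(nodes)]   # per-node storage

    def owner(self, i):
        return i // self.block, i % self.block

    def __getitem__(self, i):          # a[i] becomes a remote read from the owner
        node, offset = self.owner(i)
        return self.memory[node][offset]

    def __setitem__(self, i, value):   # a[i] = v becomes a remote write
        node, offset = self.owner(i)
        self.memory[node][offset] = value

a = DistributedArray(length=16, nodes=4)
a[13] = 42                  # lands on node 3 without explicit message passing
print(a.owner(13), a[13])   # (3, 1) 42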
ISBN: (Print) 9781450311601
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-level graph analysis algorithms. It is also representative of a class of parallel computations whose memory accesses and work distribution are both irregular and data-dependent. Recent work has demonstrated the plausibility of GPU sparse graph traversal, but has tended to focus on asymptotically inefficient algorithms that perform poorly on graphs with non-trivial diameter. We present a BFS parallelization focused on fine-grained task management constructed from efficient prefix sums that achieves an asymptotically optimal O(|V|+|E|) work complexity. Our implementation delivers excellent performance on diverse graphs, achieving traversal rates in excess of 3.3 billion and 8.3 billion traversed edges per second using single- and quad-GPU configurations, respectively. This level of performance is several times faster than state-of-the-art implementations on both CPU and GPU platforms.
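The sequential sketch below illustrates one of the ideas named above: a prefix sum over the out-degrees of the current frontier yields compact write offsets for gathering the next frontier, which is what keeps the work at O(|V|+|E|) when each frontier vertex is handled by its own thread. GPU task-management details are omitted and the names are illustrative.

from itertools import accumulate

def bfs_levels(row_offsets, column_indices, source):
    n = len(row_offsets) - 1
    depth = [-1] * n
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        degrees = [row_offsets[v + 1] - row_offsets[v] for v in frontier]
        offsets = [0] + list(accumulate(degrees))   # exclusive prefix sum
        gathered = [0] * offsets[-1]
        for i, v in enumerate(frontier):            # each "thread" writes at its own offset
            for j in range(degrees[i]):
                gathered[offsets[i] + j] = column_indices[row_offsets[v] + j]
        level += 1
        frontier = []
        for u in gathered:                          # filter already-visited vertices
            if depth[u] == -1:
                depth[u] = level
                frontier.append(u)
    return depth

# CSR form of a small graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}, 3 -> {}
print(bfs_levels([0, 2, 3, 4, 4], [1, 2, 3, 3], 0))  # [0, 1, 1, 2]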
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph partitioning algorithms applied to the linearized data structure, and (3) techniques for expressing irregular communication and nonuniform computations associated with the elements of linearized data structures. We demonstrate and evaluate our formulation on a parallel, hierarchical N-body method for the evaluation of potentials and forces of nonuniform particle distributions. Our experimental results demonstrate that efficient data parallel (HPF) implementations of highly nonuniform problems are feasible with the proper language/compiler/runtime support. Our data parallel N-body code provides a much needed 'benchmark' code for evaluating and improving HPF compilers.
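The sketch below illustrates one plausible reading of step (1): flattening an irregular, pointer-based tree (as arises in hierarchical N-body methods) into parallel arrays that a data-parallel program can distribute and traverse by index. It is an illustration of the linearization idea, not the paper's HPF code; all names are assumptions.

def linearize(tree):
    """tree is a nested dict {'body': ..., 'children': [subtrees]}.
    Returns parallel arrays with one entry per node; children are referenced by index."""
    bodies, first_child, num_children = [], [], []
    queue = [tree]
    while queue:
        node = queue.pop(0)
        bodies.append(node["body"])
        first_child.append(len(bodies) + len(queue))   # index its children will receive
        num_children.append(len(node["children"]))
        queue.extend(node["children"])
    return bodies, first_child, num_children

tree = {"body": "root", "children": [
    {"body": "a", "children": []},
    {"body": "b", "children": [{"body": "c", "children": []}]},
]}
print(linearize(tree))
# (['root', 'a', 'b', 'c'], [1, 3, 3, 4], [2, 0, 1, 0])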