May-Happen-in-parallel (MHP) analysis forms the basis for many problems of program analysis and program understanding. MHP analysis can also be used by IDEs (integrateddevelopment-environments) to help programmers to ...
详细信息
ISBN:
(纸本)9781450368186
May-Happen-in-parallel (MHP) analysis forms the basis for many problems of program analysis and program understanding. MHP analysis can also be used by IDEs (integrateddevelopment-environments) to help programmers to refactor parallel-programs, identify racy programs, understand which parts of the program run in parallel, and so on. Since the code keeps changing in the IDE, re-computing the MHP information after every change can be an expensive affair. In this manuscript, we propose a novel scheme to perform incremental MHP analysis (on the fly) of programs written in task parallel languages like X10 to keep the MHP information up to date, in an IDE environment. the key insight of our proposed approach to maintain the MHP information up to date is that we need not rebuild (from scratch) every data structure related to MHP information, after each modification (addition or deletion of statements) in the source code. the idea is to reuse the old MHP information as much as possible and incrementally recompute the MHP information (of a small set of statements) which depends on the statement added/removed. We introduce two new algorithms that deal with addition and removal of parallel constructs like finish, async, atomic, and sequential constructs like loop, if, if-else and other sequential statements, on the fly. Our evaluation shows that our algorithms run much faster than the repeated invocations of the fastest known MHP analysis for X10 programs [Sankar et al. 2016].
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored per...
详细信息
ISBN:
(纸本)9781450368186
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and ineicient utilization of network and/or GPU resources. In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact withthe network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
While there has been significant work on parallel graph processing, there has been very surprisingly little work on high-performance hypergraph processing. this paper presents a collection of efficient parallel algori...
详细信息
ISBN:
(纸本)9781450368186
While there has been significant work on parallel graph processing, there has been very surprisingly little work on high-performance hypergraph processing. this paper presents a collection of efficient parallel algorithms for hypergraph processing, including algorithms for betweenness centrality, maximal independent set, k-core decomposition, hyper-trees, hyperpaths, connected components, PageRank, and single-source shortest paths. For these problems, we either provide new parallel algorithms or more efficient implementations than prior work. Furthermore, our algorithms are theoretically-efficient in terms of work and depth. To implement our algorithms, we extend the Ligra graph processing framework to support hypergraphs, and our implementations benefit from graph optimizations including switching between sparse and dense traversals based on the frontier size, edge-aware parallelization, using buckets to prioritize processing of vertices, and compression. Our experiments on a 72-core machine and show that our algorithms obtain excellent parallel speedups, and are *** faster than algorithms in existing hypergraph processing frameworks.
A finite-state machine (FSM) is a key component for many important applications, such as Huffman decoding, regular expression matching and HTML tokenization. Due to its inherent dependencies and unpredictable memory a...
详细信息
ISBN:
(纸本)9781450368186
A finite-state machine (FSM) is a key component for many important applications, such as Huffman decoding, regular expression matching and HTML tokenization. Due to its inherent dependencies and unpredictable memory access pattern, FSM computations are considered to be extremely difficult to parallelize. As such, significant research efforts have been made to accelerate FSM computations. Although they achieve promising performance results on multi-core machines, these methods are not scalable for emerging many-core architectures such as the GPUs. Based on our experiments, we point out that the bottleneck of achieving scalability on GPUs is the sequential merge inherent to these methods. However, unlike the case for simple reduction loops, parallel merge implementations for FSM computations typically require runtime checks and re-executions, which can also impede performance. Based on these observations, we develop parallel merge techniques that select efficient runtime check implementations and avoids unnecessary re-executions. Further, based on GPU architectural features, we develop optimization techniques to improve performance. We evaluate our parallel merge implementations on a set of representative algorithms. Experimental results show that our parallel merge implementations are 2.02-6.74 times more efficient than corresponding sequential merge implementations and achieve better scalability on an Nvidia V100 GPU.
the performance of many parallel applications has failed to scale as fast as successive generations of hardware on which these applications execute. To understand the cause of scalability losses, experts use performan...
详细信息
ISBN:
(纸本)9781450368186
the performance of many parallel applications has failed to scale as fast as successive generations of hardware on which these applications execute. To understand the cause of scalability losses, experts use performance tools to monitor and analyze application behavior. Profiles generated by performance tools can usually indicate the presence of scalability losses while time series data are generally necessary to pinpoint the root causes of such losses. However, manual analysis of time series data can be difficult in executions with a large number of processes, long running times, and deep call chains. this paper describes an automated framework that analyzes sample-based time series data to diagnose scalability losses in parallel executions. the framework's automated diagnosis of scalability losses indicates their symptoms, severity, and causes. Two case studies illustrate the effectiveness of this framework. When compared to a tool that analyzes performance using instrumentation-based traces, our overhead for collecting sample-based time series is 1/28 in time and 1/1600 in space while our automated analysis takes 1/25 of the time.
In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and ...
详细信息
ISBN:
(纸本)9781450368186
In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other approaches, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag withthe data. Applying Pilot only requires minor changes to applications and provides 10%-360% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.
Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, code serialization, and resource contention. Performance analysis tools for finding such scaling bottlenecks eithe...
详细信息
ISBN:
(纸本)9781450368186
Scaling a parallel program to modern supercomputers is challenging due to inter-process communication, code serialization, and resource contention. Performance analysis tools for finding such scaling bottlenecks either base on profiling or tracing. Profiling incurs lower overheads but does not capture detailed dependencies needed for root-cause analyses. Tracing collects all information at prohibitive overheads. In this work, we develop ScalAna that uses static analysis techniques to achieve the best of both worlds-it enables the analyzability of traces at a cost similar to profiling. We leverage compiler and runtime lightweight techniques to generate performance graph and perform graph analysis algorithm to detect the root cause of scaling issues. We evaluate ScalAna with real applications on the Tianhe-2 supercomputer. Results show that our approach can effectively locate the root cause of scalability bottlenecks for real applications and incur less than 6.38% overhead (1.89% on average) for up to 2,048 processes.
Over the past decade, many programming languages and systems for parallel-computing have been developed, e.g., Fork/Join and Habanero Java, parallel Haskell, parallel ML, and X10. Although these systems raise the leve...
详细信息
ISBN:
(纸本)9781450362252
Over the past decade, many programming languages and systems for parallel-computing have been developed, e.g., Fork/Join and Habanero Java, parallel Haskell, parallel ML, and X10. Although these systems raise the level of abstraction for writing parallel codes, performance continues to require labor-intensive optimizations for coarsening the granularity of parallel executions. In this paper, we present provably and practically efficient techniques for controlling granularity within the run-time system of the language. Our starting point is "oracle-guided scheduling", a result from the functional-programming community that shows that granularity can be controlled by an "oracle" that can predict the execution time of parallel codes. We give an algorithm for implementing such an oracle and prove that it has the desired theoretical properties under the nested-parallelprogramming model. We implement the oracle in C++ by extending Cilk and evaluate its practical performance. the results show that our techniques can essentially eliminate hand tuning while closely matching the performance of hand tuned codes.
Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a docu...
详细信息
ISBN:
(纸本)9781450368186
Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck. Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real-time. We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state-of-the-art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to be an order-of-magnitude larger than Sparta's.
Compilation techniques for nested-parallel applications that can adapt to hardware and dataset characteristics are vital for unlocking the power of modern hardware. this paper proposes such a technique, which builds o...
详细信息
ISBN:
(纸本)9781450362252
Compilation techniques for nested-parallel applications that can adapt to hardware and dataset characteristics are vital for unlocking the power of modern hardware. this paper proposes such a technique, which builds on flattening and is applied in the context of a functional data-parallel language. Our solution uses the degree of utilized parallelism as the driver for generating a multitude of code versions, which together cover all possible mappings of the application's regular nested parallelism to the levels of parallelism supported by the hardware. these code versions are then combined into one program by guarding them with predicates, whose threshold values are automatically tuned to hardware and dataset characteristics. Our unsupervised method-of statically clustering datasets to code versions-is different from autotuning work that typically searches for the combination of code transformations producing a single version, best suited for a specific dataset or on average for all datasets. We demonstrate-by fully integrating our technique in the repertoire of a compiler for the Futhark programming language-significant performance gains on two GPUs for three real-world applications, from the financial domain, and for six Rodinia benchmarks.
暂无评论