Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-prod...
详细信息
ISBN:
(纸本)9781450392044
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the output entries is considered to be "masked out". In this work, we investigate various novel algorithms and data structures for this rather challenging and important computation, and provide guidelines on how to design a fast Masked-SpGEMM for shared-memory architectures.
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to mak...
详细信息
ISBN:
(纸本)9798400704352
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and, (b) by leveraging *** lock-free data structures in shared memory to achieve highperformance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1 Chi on the CoMD molecular dynamics and the miniAMR adaptive mesh *** applications scaling up to 4,096 cores.
Multi-core and many-core were already major trends for the past six years, and are expected to continue for the next decades. Withthese trends of parallel computing, it becomes increasingly difficult to decide on whi...
详细信息
ISBN:
(纸本)9781450311601
Multi-core and many-core were already major trends for the past six years, and are expected to continue for the next decades. Withthese trends of parallel computing, it becomes increasingly difficult to decide on which architecture to run a given application. In this work, we use an algorithm classification to predict performance prior to algorithm implementation. For this purpose, we modify the roofline model to include class information. In this way, we enable architectural choice through performance prediction prior to the development of architecture specific code. the new model, the boat hull model, is demonstrated using a GPU as a target architecture. We show for 6 example algorithms that performance is predicted accurately without requiring code to be available.
this paper proposes an efficient parallel algorithm for an important class of dynamic programming problems that includes Viterbi, Needleman-Wunsch, Smith-Waterman, and Longest Common Subsequence. In dynamic programmin...
详细信息
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists o...
详细信息
ISBN:
(纸本)9781450319225
Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, compaction and so on. Current state of the art of GPU based scan implementation consists of three consecutive Reduce-Scan-Scan phases. this approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. the main idea is to restrict synchronization to only adjacent workgroups, and thereby eliminating global barrier synchronization completely. the new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance speedups, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized codes for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatic different optimization tradeoffs between Nvidia and AMD GPU platforms.
Efficient parallel execution of a high-level data-parallel language based on nested sequences, higher order functions and generalized iterators can be realized in the vector model using a suitable representation of ne...
详细信息
the proceedings contain 16 papers. the topics discussed include: description and evaluation of a generic design to integrate CLP and tabled execution;higher-order logic programming: an expressive language for represen...
ISBN:
(纸本)9781450341486
the proceedings contain 16 papers. the topics discussed include: description and evaluation of a generic design to integrate CLP and tabled execution;higher-order logic programming: an expressive language for representing qualitative preferences;a framework for easing the development of applications embedding answer set programming;proving inductive validity of constrained inequalities;analysis of access control policy updates through narrowing;strand spaces with choice via a process algebra semantics;reducing the overhead of assertion run-time checks via static analysis;and iterated process analysis over lattice-valued regular expressions.
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation unde...
详细信息
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. the irregular parallelism is modeled by task dependence graphs with mixed granularities. the trade-off in achieving both time and space efficiency is investigated. the main difficulty of designing efficient run-time system support is caused by the use of fast communication primitives available on modern parallel architectures. A run-time active memory management scheme and new scheduling techniques are proposed to improve memory utilization while retaining good time efficiency, and a theoretical analysis on correctness and performance is provided. this work is implemented in the context of RAPID system [5] which provides run-time support for parallelizing irregular code on distributed memory machines and the effectiveness of the proposed techniques is verified on sparse Cholesky and LU factorization with partial pivoting. the experimental results on Cray-T3D show that solvable problem sizes can be increased substantially under limited memory capacities and the loss of execution efficiency caused by the extra memory managing overhead is reasonable.
We report on two tools that extend Java with support for static type-checking of communication protocols. Our Mungo tool extends Java with typestate definitions, which allow classes to be associated with state machine...
详细信息
ISBN:
(纸本)9781450341486
We report on two tools that extend Java with support for static type-checking of communication protocols. Our Mungo tool extends Java with typestate definitions, which allow classes to be associated with state machines defining permitted sequences of method calls. A complementary tool, StMungo, takes a communication protocol specified in the Scribble protocol description language, and generates a typestate specification for each endpoint, capturing the permitted sequences of messages along that channel. Endpoint implementations can be validated by Mungo against their typestate definitions and then compiled as usual with j avac. We formalise Mungo's typestate inference system and demonstrate the Scribble, Mungo and StMungo toolchain via a typechecked SMTP client that can communicate with a real-world SMTP server.
the emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. However, using HM, wisely migrating data objects on it is critical for high perf...
详细信息
ISBN:
(纸本)9781450392044
the emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. However, using HM, wisely migrating data objects on it is critical for high performance. In this work, we introduce a load balance-aware page management system, named LB-HM. LB-HM introduces task semantics during memory profiling, rather than being application-agnostic. Evaluating with a set of memory-consuming HPC applications, we show that we show that LB-HM reduces existing load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution on Optane-based HM.
暂无评论