作者:
PHILIPPSEN, MICSI
International Computer Science Institute Berkeley CA and Dept. of Informatics University of Karlsruhe
this paper investigates the problem of aligning array data and processes in a distributed-memory implementation. We present complete algorithms for compile-time analysis, the necessary program restructuring, and subse...
详细信息
ISBN:
(纸本)9780897917001
this paper investigates the problem of aligning array data and processes in a distributed-memory implementation. We present complete algorithms for compile-time analysis, the necessary program restructuring, and subsequent code-generation, and discuss their complexity. We finally evaluate the practical usefulness by quantitative experiments. the technique presented analyzes complete programs, including branches, loops, and nested parallelism. Alignment is determined with respect to offset, stride, and general ass's relations. Pplacement of both data and processes are computed in a unifying framework based on an extended preference graph and its analysis. Dynamic redistributions are derived. the experimental results are very encouraging. the optimization algorithms implemented in our Modula-2* compiler improved the execution times of the programs by an average over 40% on a MasPar MP-1 with 16384 processors.
the proceedings contain 26 papers. the topics discussed include: LogP: towards a realistic model of parallel computation;exploiting task and data parallelism on a multicomputer;ActorSpace: an open distributed programm...
ISBN:
(纸本)0897915895
the proceedings contain 26 papers. the topics discussed include: LogP: towards a realistic model of parallel computation;exploiting task and data parallelism on a multicomputer;ActorSpace: an open distributed programming paradigm;experiences using the ParaScope editor: an interactive parallelprogramming tool;perturbation analysis of high level instrumentation for SPMD programs;integrating message-passing and shared-memory: early experience;using scheduler information to achieve optimal barrier synchronization performance;and a concurrent copying garbage collector for languages that distinguish (im)mutable data.
Collecting a program's execution profile is important for many reasons: code optimization, memory layout, program debugging and program comprehension. Path based execution profiles are more detailed than count bas...
详细信息
ISBN:
(纸本)9781581135886
Collecting a program's execution profile is important for many reasons: code optimization, memory layout, program debugging and program comprehension. Path based execution profiles are more detailed than count based execution profiles, since they present the order of execution of the various blocks in a program: modules, procedures, basic blocks etc. Recently, online string compression techniques have been employed for collecting compact representations of sequential program executions. In this paper, we show how a similar approach can be taken for shared memory parallel programs. Our compaction scheme yields one to two orders of magnitude compression compared to the uncompressed parallel program trace on some of the SPLASH benchmarks. Our compressed execution traces contain detailed information about synchronization and control/data flow which can be exploited for post-mortem analysis. In particular, information in our compact execution traces are useful for accurate data race detection (detecting unsynchronized shared variable accesses that occurred in the execution).
A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated thro...
详细信息
ISBN:
(纸本)9781605583976
A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated through the SRF (Stream Register File), a non-bypassing software-managed on-chip memory. therefore, optimizing utilization of the SRF is crucial for good performance. the key insight is that the interference graphs formed by the streams in stream applications tend to be comparability graphs or decomposable into a set of multiple comparability graphs. We present a compiler algorithm that can find optimal or near-optimal colorings in stream IGs, thereby improving SRF utilization than the First-Fit bin-packing algorithm, the best in the literature.
the accompanying poster to this short paper presents a combination of reverse mode AD and formal methods to enable efficient differentiation of (or backpropagation through) shared-memory parallel code. Compared to the...
详细信息
ISBN:
(纸本)9781450392044
the accompanying poster to this short paper presents a combination of reverse mode AD and formal methods to enable efficient differentiation of (or backpropagation through) shared-memory parallel code. Compared to the state of the art, our approach can more often avoid the need for atomic updates or private data copies during the parallel derivative computation, even in the presence of unstructured or data-dependent data access patterns. this is achieved by gathering information about the memory access patterns from the input program, which is assumed to be correctly parallelized. this information is then used to build a model of assertions in a theorem prover, which can be used to check the safety of shared memory accesses during the parallel derivative computation.
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-prod...
详细信息
ISBN:
(纸本)9781450392044
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the output entries is considered to be "masked out". In this work, we investigate various novel algorithms and data structures for this rather challenging and important computation, and provide guidelines on how to design a fast Masked-SpGEMM for shared-memory architectures.
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to mak...
详细信息
ISBN:
(纸本)9798400704352
Pure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced withthe ability to use tasks to make use of idle cores. Pure leverages shared memory in two ways: (a) by allowing cores to steal work from each other while waiting on messages to arrive, and, (b) by leveraging *** lock-free data structures in shared memory to achieve highperformance messaging and collective operations between the ranks within nodes. We use microbenchmarks to evaluate Pure's key messaging and collective features and also show application speedups up to 2.1 Chi on the CoMD molecular dynamics and the miniAMR adaptive mesh *** applications scaling up to 4,096 cores.
this paper proposes an efficient parallel algorithm for an important class of dynamic programming problems that includes Viterbi, Needleman-Wunsch, Smith-Waterman, and Longest Common Subsequence. In dynamic programmin...
详细信息
Efficient parallel execution of a high-level data-parallel language based on nested sequences, higher order functions and generalized iterators can be realized in the vector model using a suitable representation of ne...
详细信息
parallel algorithm designers need computational models that take first order system costs into account, but are also simple enough to use in practice. this paper introduces the LoPC model, which is inspired by the Log...
详细信息
parallel algorithm designers need computational models that take first order system costs into account, but are also simple enough to use in practice. this paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention for message processing resources in parallel algorithms on a multiprocessor or network of workstations. LoPC takes the L, o and P parameters directly from the LogP model and uses them to predict the cost of contention, C. this paper defines the LoPC model and derives the general form of the model for parallel applications that communicate via active messages. Model modifications for systems that implement coherent shared memory abstractions are also discussed. We carry out the analysis for two important classes of applications that have irregular communication. In the case of parallel applications with homogeneous all-to-any communication, such as sparse matrix computations, the analysis yields a simple rule of thumb and insight into contention costs. In the case of parallel client-server algorithms, the LoPC analysis provides a simple and accurate calculation of the optimal allocation of nodes between clients and servers. the LoPC estimates for these applications are shown to be accurate when compared against event driven simulation and against a sparse matrix computation on the MIT Alewife multiprocessor.
暂无评论