We present a parallel implementation of a constraint-based local search algorithm and investigate its performance results for hard combinatorial optimization problems on two different platforms up to several hundreds ...
详细信息
ISBN:
(纸本)9781450311601
We present a parallel implementation of a constraint-based local search algorithm and investigate its performance results for hard combinatorial optimization problems on two different platforms up to several hundreds of cores. On a variety of classical CSPs benchmarks, speedups are very good for a few tens of cores, and good up to a hundred cores. More challenging problems derived from real-life applications (Costas array) shows even better speedups, nearly optimal up to 256 cores.
the proceedings contains 14 papers from the conference on the Proceedings of the acmsigplansymposium on principles and practice of parallelprogramming, PPOPP. Topics discussed include: reference idempotency analysi...
详细信息
the proceedings contains 14 papers from the conference on the Proceedings of the acmsigplansymposium on principles and practice of parallelprogramming, PPOPP. Topics discussed include: reference idempotency analysis: a framework for optimizing speculative execution;pointer and escape analysis for multithread programs;language support for motion-order matrices;efficient load balancing for wide-area divide-and-conquer applications;scalable queue-based spin locks with timeout;contention ellimination by replication of sequential sections in distributed shared memory programs;and accurate data redistribution cost estimation in software distributes shared memory systems.
this paper studies the essence of heterogeneity from the perspective of language mechanism design. the proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer...
详细信息
ISBN:
(纸本)9781450332057
this paper studies the essence of heterogeneity from the perspective of language mechanism design. the proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer level of source data in larger, slower or more distributed memory and an inner level of data blocks in smaller, faster or more localized memory.
Miniapps serve as test beds for prototyping and evaluating new algorithms, data structures, and programming models before incorporating such changes into larger applications. For the miniapp to accurately predict how ...
详细信息
ISBN:
(纸本)9781450311601
Miniapps serve as test beds for prototyping and evaluating new algorithms, data structures, and programming models before incorporating such changes into larger applications. For the miniapp to accurately predict how a prototyped change would affect a larger application it is necessary that the miniapp be shown to serve as a proxy for that larger application. Although many benchmarks claim to proxy the performance for a set of large applications, little work has explored what criteria must be met for a benchmark to serve as a proxy for examining programmability. In this poster we describe criteria that can be used to establish that a miniapp serves as a performance and programmability proxy.
We present a new parallel computational model, named Log-GPS, which captures synchronization. the LogGPS model is an extension of the LogGP model, which abstracts communication on parallel platforms. Although the LogG...
详细信息
ISBN:
(纸本)9781581133462
We present a new parallel computational model, named Log-GPS, which captures synchronization. the LogGPS model is an extension of the LogGP model, which abstracts communication on parallel platforms. Although the LogGP model captures long messages with one bandwidth parameter (G), it does not capture synchronization that is needed before sending a long message by high-level communication libraries. Our model has one additional parameter, S, defined as the threshold for message length, above which synchronous messages are sent. We also present some experimental results using both models. the results include (1) a verification of the LogGPS model, (2) an example of synchronization analysis using an MPI program and (3) a comparison of the models. the results indicate that the LogGPS model is more accurate than the LogGP model, and analyzing synchronization costs is important when improving parallel program performance.
作者:
PHILIPPSEN, MICSI
International Computer Science Institute Berkeley CA and Dept. of Informatics University of Karlsruhe
this paper investigates the problem of aligning array data and processes in a distributed-memory implementation. We present complete algorithms for compile-time analysis, the necessary program restructuring, and subse...
详细信息
ISBN:
(纸本)9780897917001
this paper investigates the problem of aligning array data and processes in a distributed-memory implementation. We present complete algorithms for compile-time analysis, the necessary program restructuring, and subsequent code-generation, and discuss their complexity. We finally evaluate the practical usefulness by quantitative experiments. the technique presented analyzes complete programs, including branches, loops, and nested parallelism. Alignment is determined with respect to offset, stride, and general ass's relations. Pplacement of both data and processes are computed in a unifying framework based on an extended preference graph and its analysis. Dynamic redistributions are derived. the experimental results are very encouraging. the optimization algorithms implemented in our Modula-2* compiler improved the execution times of the programs by an average over 40% on a MasPar MP-1 with 16384 processors.
the proceedings contain 26 papers. the topics discussed include: LogP: towards a realistic model of parallel computation;exploiting task and data parallelism on a multicomputer;ActorSpace: an open distributed programm...
ISBN:
(纸本)0897915895
the proceedings contain 26 papers. the topics discussed include: LogP: towards a realistic model of parallel computation;exploiting task and data parallelism on a multicomputer;ActorSpace: an open distributed programming paradigm;experiences using the ParaScope editor: an interactive parallelprogramming tool;perturbation analysis of high level instrumentation for SPMD programs;integrating message-passing and shared-memory: early experience;using scheduler information to achieve optimal barrier synchronization performance;and a concurrent copying garbage collector for languages that distinguish (im)mutable data.
A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated thro...
详细信息
ISBN:
(纸本)9781605583976
A stream processor executes an application that has been decomposed into a sequence of kernels that operate on streams of data elements. During the execution of a kernel, all streams accessed must be communicated through the SRF (Stream Register File), a non-bypassing software-managed on-chip memory. therefore, optimizing utilization of the SRF is crucial for good performance. the key insight is that the interference graphs formed by the streams in stream applications tend to be comparability graphs or decomposable into a set of multiple comparability graphs. We present a compiler algorithm that can find optimal or near-optimal colorings in stream IGs, thereby improving SRF utilization than the First-Fit bin-packing algorithm, the best in the literature.
the accompanying poster to this short paper presents a combination of reverse mode AD and formal methods to enable efficient differentiation of (or backpropagation through) shared-memory parallel code. Compared to the...
详细信息
ISBN:
(纸本)9781450392044
the accompanying poster to this short paper presents a combination of reverse mode AD and formal methods to enable efficient differentiation of (or backpropagation through) shared-memory parallel code. Compared to the state of the art, our approach can more often avoid the need for atomic updates or private data copies during the parallel derivative computation, even in the presence of unstructured or data-dependent data access patterns. this is achieved by gathering information about the memory access patterns from the input program, which is assumed to be correctly parallelized. this information is then used to build a model of assertions in a theorem prover, which can be used to check the safety of shared memory accesses during the parallel derivative computation.
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-prod...
详细信息
ISBN:
(纸本)9781450392044
Computing the product of two sparse matrices (SpGEMM) is a fundamental operation in various combinatorial and graph algorithms as well as various bioinformatics and data analytics applications for computing inner-product similarities. For an important class of algorithms, only a subset of the output entries are needed, and the resulting operation is known as Masked SpGEMM since a subset of the output entries is considered to be "masked out". In this work, we investigate various novel algorithms and data structures for this rather challenging and important computation, and provide guidelines on how to design a fast Masked-SpGEMM for shared-memory architectures.
暂无评论