Two operations commute if they generate the same result regardless of the order in which they execute. Commutativity is an important property — commuting operations enable significant optimizations in the fields of parallel computing, optimizing compilers, parallelizing compilers and database concurrency control. Algorithms that statically decide if operations commute can be an important component of systems in these fields because they enable the automatic application of these optimizations. In this paper we define the commutativity decision problem and establish its complexity for a variety of basic instructions and control constructs. Although deciding commutativity is, in general, undecidable or computationally intractable, we believe that efficient algorithms exist that can solve many of the cases that arise in practice.
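The property the paper studies can be illustrated with a dynamic check (the paper itself concerns *static* decision procedures, so this sketch only demonstrates the definition, not the paper's algorithms; all names are illustrative):

```python
# Two operations commute if applying them in either order yields the same
# result. Here we test that empirically over a small set of states.

def commute(op_a, op_b, states):
    """True if op_a;op_b and op_b;op_a agree on every test state."""
    return all(op_a(op_b(s)) == op_b(op_a(s)) for s in states)

inc = lambda x: x + 1     # x += 1
dbl = lambda x: 2 * x     # x *= 2
add5 = lambda x: x + 5    # x += 5

states = range(-10, 11)
print(commute(inc, add5, states))  # two additions commute -> True
print(commute(inc, dbl, states))   # addition vs. doubling -> False
```

A static analyzer must establish the same fact without enumerating states, which is why the general problem is undecidable.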
Divide-and-conquer algorithms obtain the solution to a given problem by dividing it into subproblems, solving these recursively, and combining their solutions. In this paper we present a system that automatically transforms sequential divide-and-conquer algorithms written in the C programming language into parallel code, which is then executed on message-passing multicomputers. The user of the system is expected to add only a few annotations to an existing sequential program. The strategies required for transforming sequential source code into executable binaries are discussed, and the attainable performance speedups are illustrated by several examples.
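A minimal sketch of the transformation such a system performs: the two independent recursive calls of a divide-and-conquer routine are run concurrently. The futures-based runtime and the depth cutoff here are illustrative assumptions, not the paper's message-passing implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def merge(a, b):
    """Combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def mergesort(xs, pool=None, depth=2):
    if len(xs) <= 1:
        return xs
    mid = len(xs) // 2
    if pool is not None and depth > 0:
        # the "annotated" parallel case: solve the subproblems concurrently
        left = pool.submit(mergesort, xs[:mid], pool, depth - 1)
        right = mergesort(xs[mid:], pool, depth - 1)
        return merge(left.result(), right)
    # sequential fallback below the cutoff depth
    return merge(mergesort(xs[:mid]), mergesort(xs[mid:]))

with ThreadPoolExecutor() as pool:
    print(mergesort([5, 3, 8, 1, 9, 2], pool))  # [1, 2, 3, 5, 8, 9]
```

The depth cutoff limits task-creation overhead, a concern any such system must address on real multicomputers.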
In a parallelizing compiler, code transformations help to reduce data dependencies and identify parallelism in a program. In an earlier paper, we proposed the Data Dependence Identifier (DDI), a model in which a program P is represented as a graph G(P). Using G(P), we could identify the data dependencies in a program and also perform transformations such as dead code elimination and constant propagation. In this paper, we present polynomial-time algorithms for the loop invariant code motion, live range analysis, node splitting, and loop fusion transformations using DDI.
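One of the transformations named above, loop invariant code motion, can be shown in miniature (the function names are illustrative; the paper performs this on the G(P) graph, not on source text):

```python
# A computation whose operands do not change inside the loop is invariant
# and can be hoisted out, so it is computed once instead of every iteration.

def before(xs, a, b):
    out = []
    for x in xs:
        k = a * b        # invariant: recomputed on every iteration
        out.append(x + k)
    return out

def after(xs, a, b):
    k = a * b            # hoisted: computed once before the loop
    return [x + k for x in xs]

print(before([1, 2, 3], 2, 5) == after([1, 2, 3], 2, 5))  # True
```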
Distributed-memory parallel processors (DMPPs) can deliver peak performance higher than vector supercomputers while promising a better cost-performance ratio. Programming, however, is harder than on traditional vector systems, especially for problems that necessitate unstructured solution methods. A class of such applications, with large resource requirements, is the numerical solution of partial differential equations (PDEs) on nonuniformly refined three-dimensional finite element discretizations. Porting an application of this class from vector and shared-memory parallel machines to DMPPs involves some fundamental algorithm changes, such as grid decomposition, mapping, and coloring strategies. In addition, no standardized language interface is available to ease the efficient parallelization and porting among DMPPs and between vector computers and DMPPs. This article describes how PILS, an existing package for the iterative solution of large unstructured sparse linear systems of equations on vector computers, was ported to DMPPs using the parallelizing Fortran compiler Oxygen. Two DMPPs, an Intel Paragon and a Fujitsu AP1000, were used to quantitatively evaluate the performance of the generated parallel program. The results indicate how an application should be designed to be portable among supercomputers of different architectures. Several language and architecture features are essential for such a porting process and drastically ease the parallelization of similar applications.
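A toy sketch of the grid-decomposition step such a port requires: a 1-D block partition of grid indices across processes, together with the halo (ghost) cells each block must receive from its neighbours for a nearest-neighbour stencil. This is purely illustrative; PILS and Oxygen handle far more general unstructured decompositions:

```python
def block_partition(n, p):
    """Split indices 0..n-1 into p contiguous blocks of near-equal size."""
    base, rem = divmod(n, p)
    blocks, start = [], 0
    for r in range(p):
        size = base + (1 if r < rem else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

def halos(blocks, n):
    """Ghost indices each block needs for a 3-point stencil."""
    out = []
    for b in blocks:
        ghost = [i for i in (b.start - 1, b.stop) if 0 <= i < n]
        out.append(ghost)
    return out

blocks = block_partition(10, 3)   # [range(0, 4), range(4, 7), range(7, 10)]
print(halos(blocks, 10))          # [[4], [3, 7], [6]]
```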
Developing parallel software is far more complex than developing traditional sequential software. An effective approach to dealing with this complexity is domain-specific programming at a higher level of abstraction than general-purpose programming languages. In this paper, we focus on the domain of applications based on partial differential equations (PDEs) and provide a formal framework and methods for PDE compilers to generate parallel iterative codes for this domain. We also provide a PDE compiler optimization that minimizes the number of messages between parallel processors. Our framework and methods can be used to build PDE compilers that automatically generate efficient parallel software for PDE-based applications.
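One simple way to reduce message counts, shown as a hedged sketch (the paper's optimization operates inside the compiler and is surely more sophisticated), is to coalesce all values destined for the same neighbour into a single message:

```python
# Instead of sending one message per value, group sends by destination
# process so each neighbour receives one combined message per exchange.

def coalesce(sends):
    """sends: list of (destination_rank, value) pairs."""
    msgs = {}
    for dest, value in sends:
        msgs.setdefault(dest, []).append(value)
    return msgs

sends = [(1, "u[4]"), (2, "u[7]"), (1, "v[4]"), (2, "v[7]")]
print(coalesce(sends))  # {1: ['u[4]', 'v[4]'], 2: ['u[7]', 'v[7]']}
```

Four sends become two messages; on real machines this matters because per-message latency typically dominates for small payloads.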
ISBN: (Print) 9780897914130
This paper discusses the use of shared register channels as a data exchange mechanism among processors in a fine-grained MIMD system with a load/store architecture. A register channel is provided with a synchronization bit that is used to ensure that a processor succeeds in reading a channel only after a value has been written to the channel. The instructions supported by this load/store architecture allow both registers and register channels to be used as operand sources and result destinations. Conditional load, store, and move instructions are provided to allow processors to exchange values through channels in the presence of aliasing caused by array references. The compiler support required to take proper advantage of channels is briefly discussed. In contrast to a VLIW machine, a system with channels does not require strict lockstep operation of its processors. This reduces the delays caused by unpredictable events such as memory bank conflicts.
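The synchronization bit's full/empty behaviour can be modelled in software, assuming a one-slot channel where a read blocks until a write has occurred (a behavioural sketch only; the hardware channel is a register, not a lock):

```python
import threading

class RegisterChannel:
    """Software model of a register channel with a full/empty bit."""

    def __init__(self):
        self._cv = threading.Condition()
        self._full = False   # the synchronization bit
        self._value = None

    def write(self, value):
        with self._cv:
            self._value = value
            self._full = True         # mark the channel full
            self._cv.notify()

    def read(self):
        with self._cv:
            # a read succeeds only after a value has been written
            self._cv.wait_for(lambda: self._full)
            self._full = False        # consuming the value empties the channel
            return self._value

ch = RegisterChannel()
threading.Timer(0.05, ch.write, args=(42,)).start()
print(ch.read())  # blocks briefly, then prints 42
```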
ISBN: (Print) 9780769504254
Parallelizing compilers have traditionally focused mainly on parallelizing loops. This paper presents a new framework for automatically parallelizing recursive procedures that typically appear in divide-and-conquer algorithms. We present compile-time analysis to detect the independence of multiple recursive calls in a procedure. This allows exploitation of a scalable form of nested parallelism, where each parallel task can further spawn off parallel work in subsequent recursive calls. We describe a run-time system which efficiently supports this kind of nested parallelism without unnecessarily blocking tasks. We have implemented this framework in a parallelizing compiler, which is able to automatically parallelize programs like quicksort and mergesort, written in C. In cases where even the advanced symbolic analysis and array section analysis we describe are not able to prove the independence of procedure calls, we propose novel techniques for speculative run-time parallelization, which are more efficient and powerful in this context than analogous techniques proposed previously for speculatively parallelizing loops. Our experimental results on an IBM G30 SMP machine show good speedups obtained by following our approach.
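The independence property the compile-time analysis must prove can be seen in quicksort: after partitioning, the two recursive calls touch disjoint array sections and therefore may run in parallel. The thread-based sketch below is an illustration of that property, not the paper's compiler output:

```python
import threading

def quicksort(a, lo=0, hi=None, parallel_depth=2):
    """In-place quicksort; the two recursive calls operate on disjoint
    sections a[lo:i] and a[i+1:hi+1], so they are independent."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot, i = a[hi], lo                 # Lomuto partition
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    if parallel_depth > 0:
        # independent calls: run one in a spawned task, one inline
        t = threading.Thread(target=quicksort,
                             args=(a, lo, i - 1, parallel_depth - 1))
        t.start()
        quicksort(a, i + 1, hi, parallel_depth - 1)
        t.join()
    else:
        quicksort(a, lo, i - 1, 0)
        quicksort(a, i + 1, hi, 0)

xs = [9, 4, 7, 1, 8, 2]
quicksort(xs)
print(xs)  # [1, 2, 4, 7, 8, 9]
```

When the disjointness of the sections cannot be proved statically, the paper falls back on speculative run-time parallelization.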
The following topics are dealt with: Java virtual machine (JVM) optimization and prefetching; procedure optimization; low power aware architecture and optimization; and region based optimization.
ISBN: (Print) 9780897919845
The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes ranges from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than one percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except f
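The Amdahl's Law argument above can be put in numbers: with serial fraction s, the speedup on p processors is 1 / (s + (1 - s)/p). The serial fraction and processor counts below are illustrative, chosen to match the abstract's "less than one percent" claim:

```python
def speedup(s, p):
    """Amdahl's Law: speedup on p processors with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / p)

s = 0.01                          # a one-percent serial fraction
print(round(speedup(s, 2), 2))    # 1.98 -> ~99% efficiency on 2 processors
print(round(speedup(s, 256), 1))  # ~72.1 -> the serial part bites only at large p
```

This is why the measured two-processor inefficiency must come from elsewhere, here the processor-memory network.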