Physically-distributed memory multiprocessors are becoming popular, and data distribution and loop parallelization are aspects that a parallelizing compiler has to consider in order to get efficiency from the system. The cost of accessing local and remote data can differ by one or several orders of magnitude, and this can dramatically affect the performance of the system. It would be desirable to free the programmer from having to consider the low-level details of the target architecture, program explicit processes, or specify interprocess communication. In this paper, we present an approach to automatically derive static or dynamic data distribution strategies for the arrays used in a program. All the information required about data movement and parallelism is contained in a single data structure, called the Communication-parallelism Graph (CPG). The problem is modeled and solved using a general-purpose linear 0-1 integer programming solver. This allows us to find the optimal solution to the problem for one-dimensional array distributions. We also show the feasibility of using this approach in terms of compilation time and quality of the solutions generated.
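As a rough illustration of what a 0-1 formulation of distribution selection can look like, here is a minimal Python sketch. It is not the paper's CPG model or its solver: the array names, the candidate distributions, and every cost value are invented for illustration, and exhaustive enumeration stands in for a linear 0-1 integer programming solver. Each binary decision picks the single dimension along which an array is distributed, and the objective sums a hypothetical communication cost for each pair of arrays used together.

```python
from itertools import product

# Hypothetical example data (not from the paper): three 2-D arrays, each
# distributed along exactly one dimension (0 = rows, 1 = columns).
ARRAYS = ["A", "B", "C"]

# Hypothetical communication costs. COST[(x, y)][dx][dy] is the cost paid when
# array x is distributed along dimension dx and array y along dimension dy.
COST = {
    ("A", "B"): [[0, 40], [40, 0]],  # A and B prefer the same distribution
    ("B", "C"): [[30, 0], [0, 30]],  # B and C prefer opposite distributions
    ("A", "C"): [[5, 0], [0, 5]],    # weak preference for opposite ones
}


def total_cost(choice):
    """Sum the pairwise costs for one assignment {array name: dimension}."""
    return sum(table[choice[x]][choice[y]] for (x, y), table in COST.items())


def best_distribution():
    """Enumerate every 0-1 assignment and return (cost, assignment)."""
    best = None
    for dims in product((0, 1), repeat=len(ARRAYS)):
        choice = dict(zip(ARRAYS, dims))
        cost = total_cost(choice)
        if best is None or cost < best[0]:
            best = (cost, choice)
    return best


if __name__ == "__main__":
    cost, choice = best_distribution()
    print(cost, choice)
```

Enumeration is exponential in the number of arrays, so it only serves to show the shape of the decision variables and objective; a real 0-1 solver would be needed at scale.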
A self-stabilizing distributed system is a network of processors which, regardless of its initial global state, will achieve the desired state in a finite number of steps. There are two main performance issues in the design of a self-stabilizing system: the stabilization time and the memory requirements (the number of states required by each process). In this paper, we first show that the probabilistic two-state algorithm for asynchronous, unidirectional token rings stabilizes only in systems where k, the upper bound for the ratio of the speeds of any two processes, exists, but is unknown, and neither the convergence time nor token circulation delay of this algorithm can be bounded. Then we present an almost two-state self-stabilizing algorithm for unidirectional token rings. The processes move synchronously and k is known. The algorithm requires each process in the ring to have two states; one process, called the exceptional process, needs an additional integer variable of size O(n), where n is the number of nodes in the ring. The algorithm stabilizes in O(n) time and achieves an O(kn) token circulation delay.
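To illustrate what self-stabilization means for a unidirectional token ring, the following Python sketch simulates Dijkstra's classic K-state algorithm. This is a swapped-in, well-known algorithm chosen for illustration only; it is not the probabilistic two-state algorithm or the almost two-state algorithm discussed in this abstract. The function names, the random scheduler, and the parameter values are assumptions made for the example.

```python
import random


def privileged(states):
    """Return the indices of processes that currently hold a token."""
    n = len(states)
    holders = []
    for i in range(n):
        if i == 0:
            # The distinguished process compares with its predecessor n - 1.
            if states[0] == states[n - 1]:
                holders.append(0)
        elif states[i] != states[i - 1]:
            holders.append(i)
    return holders


def move(states, i, num_states):
    """Let privileged process i take one step (Dijkstra's K-state rule)."""
    if i == 0:
        states[0] = (states[0] + 1) % num_states
    else:
        states[i] = states[i - 1]


def run(n, num_states, steps, seed):
    """Start from an arbitrary state and apply randomly scheduled moves."""
    rng = random.Random(seed)
    states = [rng.randrange(num_states) for _ in range(n)]
    for _ in range(steps):
        # At least one process is always privileged, so choice() never fails.
        move(states, rng.choice(privileged(states)), num_states)
    return states


if __name__ == "__main__":
    final = run(n=6, num_states=7, steps=1000, seed=0)
    print(final, privileged(final))
```

With at least as many states as processes, every run ends with exactly one token holder, whatever the initial state. The price is a per-process state count that grows with n, which is the memory cost that two-state and almost two-state algorithms aim to avoid.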
Object dataflow is a popular approach used in parallel rendering. The data representing the 3D scene is statically distributed among processors and objects are fetched and cached only on demand. Most previous object dataflow methods were implemented on shared memory architectures and exploited spatial coherency to reduce hardware cache misses. In this paper, we propose an efficient model for object dataflow parallel volume rendering on message passing machines. The 'Active Ray Tracing' algorithm is introduced and its ray storage mechanism is used to support latency hiding by postponing computation on inactive rays. Memory usage is optimized by letting objects migrate and replicate at different processors rather than relying on the common static assignments. Our cache-only-memory approach uses a distributed-directory scheme to trace the location of objects at other nodes. A mechanism to minimize network congestion was implemented which optimizes channel utilization. Unlike previous methods, our approach can benefit from temporal coherence and effectively minimizes communication costs in successive frames. We implemented a volume ray casting instance of the algorithm on the Cray T3D and achieved higher efficiency and scalability than existing algorithms. We achieve interactive frame rates of approximately 20 Hz for a $128^3$ volume and 4 Hz for a $256^3$ volume on 128 processors.
In this paper, a network-partitioning scheme for single-node broadcasting on wormhole-routed networks is proposed. To broadcast a message, the scheme works in three phases. First, a number of data-distributing networks (DDNs), which can work independently, are constructed. Then the message is evenly divided into sub-messages, each being sent to a representative node in one DDN. Second, the sub-messages are broadcast on the DDNs concurrently. Finally, a number of data-collecting networks (DCNs), which can also work independently, are constructed. Then, concurrently on each DCN, the sub-messages are re-collected and combined into the original message. One interesting issue is the definition of independent DDNs and DCNs, in the sense of wormhole routing. We show how to apply this scheme to tori, meshes, and hypercubes. Thorough analyses and experiments based on different system parameters and configurations are conducted. The results confirm the advantage of our scheme, under various system parameters and conditions, over other existing broadcasting algorithms.
We present new algorithmic techniques for a classical research problem, runtime redistribution of an array from one block-cyclic layout to another. Our methodology for reducing communication overheads is based on a generalized circulant matrix formalism. Using this formalism, we derive direct, indirect, and hybrid communication schedules for the cyclic redistribution problem when the block size changes by an integer factor K. We have also developed formulae to estimate the timing performance of each of these schedules for a given parallel machine and redistribution problem. In our indirect communication schedule, blocks are moved from a source processor to a destination processor through intermediate 'relay' processors. This reduces the number of communication steps by an order of magnitude in comparison with previous approaches. This algorithm performs cyclic(x) to cyclic(Kx) redistribution on P processors in $[\log_2 K] + 2$ steps. Implementations of these algorithms on the Cray T3D and on the IBM SP-2 show superior performance over previous approaches. Since our algorithms are developed using MPI, they can be easily ported to different application environments. Our techniques can be used in the design of scalable redistribution libraries, in efficient implementations of the REDISTRIBUTE directive of HPF, and in developing parallel algorithms for various HPC applications.
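The redistribution problem itself can be stated in a few lines of Python. The sketch below is a sequential reference for cyclic(x) to cyclic(Kx) redistribution that moves every element straight to its destination; it does not implement the circulant-matrix schedules or the relay-based indirect schedule described above, and all function names are assumptions introduced for illustration. It may still be useful as a correctness oracle, and `send_counts` shows how much data each source processor must deliver to each destination.

```python
def owner(i, block, procs):
    """Processor that owns global element i under a cyclic(block) layout."""
    return (i // block) % procs


def local_index(i, block, procs):
    """Position of global element i inside its owner's local array."""
    return (i // (block * procs)) * block + (i % block)


def scatter(data, block, procs):
    """Lay out a global list as per-processor lists under cyclic(block)."""
    local = [[] for _ in range(procs)]
    for i, value in enumerate(data):
        local[owner(i, block, procs)].append(value)
    return local


def send_counts(n, x, K, procs):
    """counts[s][d] = number of elements processor s sends to processor d."""
    counts = [[0] * procs for _ in range(procs)]
    for i in range(n):
        counts[owner(i, x, procs)][owner(i, K * x, procs)] += 1
    return counts


def redistribute(src_local, n, x, K, procs):
    """Move n elements from a cyclic(x) layout to a cyclic(K*x) layout."""
    dst_local = [[] for _ in range(procs)]
    for i in range(n):  # increasing i keeps each destination list in order
        value = src_local[owner(i, x, procs)][local_index(i, x, procs)]
        dst_local[owner(i, K * x, procs)].append(value)
    return dst_local


if __name__ == "__main__":
    data = list(range(24))
    before = scatter(data, block=2, procs=3)
    after = redistribute(before, n=24, x=2, K=2, procs=3)
    print(before)
    print(after)
    print(send_counts(24, 2, 2, 3))
```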
The main contribution of this work is to show that a number of seemingly unrelated problems in database design, pattern recognition, robotics, and image processing can be solved simply and elegantly by formulating them as instances of a general problem: the Multiple Query (MQ) problem. An arbitrary instance of the Multiple Query problem consists of a collection $A = \{a_1, a_2, \ldots, a_n\}$ of items, a collection $Q = \{q_1, q_2, \ldots, q_m\}$ ($1 \le m \le n$) of queries, a decision problem $\phi: Q \times A \rightarrow \{\text{yes}, \text{no}\}$, and an associative and commutative function $f$ operating on subsets of $A$. For every query $q_i$, let $S_i$ be the set of items $a_j$ in $A$ for which $\phi(q_i, a_j) = \text{yes}$. The solution of $q_i$ is defined to be $f(S_i)$. In this context, the Multiple Query problem involves solving all the queries in $Q$. We begin by showing that if the collections $A$ and $Q$ are stored one item and at most one query per processor on a mesh with multiple broadcasting of size $\sqrt{n} \times \sqrt{n}$, then any algorithm that solves the MQ problem requires $\Omega(m^{1/3} n^{1/6})$ time in the worst case. Second, we show that a number of fundamental problems can be solved simply and elegantly by formulating them as instances of the MQ problem.
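The definition of the MQ problem translates directly into a sequential reference implementation. The Python sketch below follows the definition only; it says nothing about the mesh algorithm or its time bound. Modelling the function f as a binary operation folded over the selected items, the `identity` argument, and the range-sum example are all assumptions made for illustration.

```python
from functools import reduce


def solve_multiple_query(items, queries, phi, f, identity):
    """Return f(S_i) for every query q_i, where S_i = {a : phi(q_i, a)}.

    f is assumed to be an associative, commutative binary operation, so
    folding it over S_i in any order gives the same result.
    """
    results = []
    for query in queries:
        selected = [item for item in items if phi(query, item)]
        results.append(reduce(f, selected, identity))
    return results


def in_range(query, item):
    """Example decision problem: does the item fall in the closed interval?"""
    low, high = query
    return low <= item <= high


if __name__ == "__main__":
    items = [3, 8, 1, 9, 4]
    queries = [(1, 4), (5, 10)]
    # Range sums: f is addition, with 0 as the value for an empty subset.
    print(solve_multiple_query(items, queries, in_range, lambda a, b: a + b, 0))
```

This reference takes time proportional to m times n on one processor; swapping in `max` or a counting operation for f yields other instances of the same problem.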
Gang scheduling is a resource management scheme for parallel and distributed systems that combines time-sharing with space-sharing to ensure short response times for interactive tasks and high overall system throughput. In this paper, we present and analyze a queueing theoretic model for a general gang scheduling scheme that forms the basis of a multiprogramming environment currently being developed for IBM's SP2 parallel system and for clusters of workstations. Our model and analysis can be used to tune our scheduler in order to maximize its performance on each hardware platform.
Diffracting trees are an effective and highly scalable distributed-parallel technique for shared counting and load balancing. This paper presents the first steady-state combinatorial model and analysis for diffracting trees, and uses it to answer several critical algorithmic design questions. Our model is simple and sufficiently high level to overcome many implementation-specific details, and yet, as we will show, it is rich enough to accurately predict empirically observed behaviors. As a result of our analysis we were able to identify starvation problems in the original diffracting tree algorithm and modify it to create a more stable version. We are also able to identify the range in which the diffracting tree performs most efficiently, and the ranges in which its performance degrades. We believe our model and modeling approach open the way to steady-state analysis of other distributed-parallel structures such as counting networks and elimination trees.
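For readers unfamiliar with the structure, the following Python sketch shows a sequential counting tree built from toggle-bit balancers, which is the structure that diffracting trees build on. It deliberately omits the prism mechanism by which pairs of concurrent tokens are diffracted without touching the toggle bit, and it models no concurrency at all, so it illustrates shared counting only, not the steady-state behavior analyzed in the paper. The class name and heap-style node layout are assumptions made for the example.

```python
class CountingTree:
    """Binary tree of toggle balancers with a local counter at each leaf."""

    def __init__(self, depth):
        self.depth = depth
        self.width = 1 << depth  # number of leaves
        # One toggle bit per balancer, stored heap-style with the root at 1.
        self.toggle = [0] * self.width
        # How many tokens each leaf has already served.
        self.leaf_count = [0] * self.width

    def next_value(self):
        """Route one token from the root to a leaf and return its count."""
        node = 1
        leaf = 0
        for level in range(self.depth):
            bit = self.toggle[node]
            self.toggle[node] ^= 1        # the next token goes the other way
            leaf |= bit << level          # balancer at this level sets one bit
            node = 2 * node + bit
        # Leaf j hands out j, j + width, j + 2 * width, ...
        value = leaf + self.width * self.leaf_count[leaf]
        self.leaf_count[leaf] += 1
        return value


if __name__ == "__main__":
    tree = CountingTree(depth=3)
    print([tree.next_value() for _ in range(10)])
```

Each balancer splits its incoming tokens evenly between its two children, so the leaves together hand out consecutive integers while contention on any single counter is divided by the tree width.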
Methods for mitigating the degradation in performance caused by high latencies in parallel and distributed networks were described. Most of the analysis was centered on the simulation of unit-delay rings on networks of workstations (NOWs) with arbitrary delays on the links. Emulations were also derived for a wide variety of other unit-delay network architectures on a NOW with high-latency links. Lower bounds were proven that establish limits on the degree to which the high-latency links can be mitigated. These bounds demonstrate that overcoming latencies in dataflow types of computations that require access to large local databases is easier.
ISBN: 9780897918091 (print)
The effectiveness of private caches for processors was studied. Since the time for all processors to access the shared memory simultaneously is usually much longer than the time for a processor to access its own private cache, scheduling with private caches falls into the distributed memory model, where the lower bound applies. The effectiveness of private caches was shown by proving that a version of the Dynamic Equi-partition Scheduling Policy (DEQ) achieves a mean response time within five times the optimal mean response time in the cache clock time for a large class of parallel jobs well accepted in the parallel scheduling community. This shows an improvement in system performance from using private caches over purely shared memory.