this paper presents a methodology to predict the execution time of massively parallel applications before any significant implementation actions are taken. this methodology captures the problem decomposition into task...
详细信息
this paper presents a methodology to predict the execution time of massively parallel applications before any significant implementation actions are taken. this methodology captures the problem decomposition into tasks and their precedence relationship, along withthe computational and communication demands placed by the application on the underlying architecture. An example shows how the methodology may be used to study the effects of various data placement strategies, problem size, and number of processors for an LU factorization algorithm. the model predictions were validated with published experimental results on a Touchstone Delta machine.< >
this paper describes a parallel implementation of a Hidden Markov Model (HMM) for spoken language recognition on the MasPar MP-1. By exploiting the massive parallelism of explicit duration HMMs, we can develop more co...
详细信息
this paper describes a parallel implementation of a Hidden Markov Model (HMM) for spoken language recognition on the MasPar MP-1. By exploiting the massive parallelism of explicit duration HMMs, we can develop more complex models for real-time speech recognition. Implementational issues such as choice of data structures, method of communication, and utilization of parallel functions are explored. the results of our experiments show that the parallelism in HMMs can be effectively exploited by the MP-1. Training that use to take nearly a week can now be completed in about an hour. the system can recognize the phones of a test utterance in a fraction of a second.< >
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. this paper examines parallel dynamic storage allocation algorithms. three new algorithm...
详细信息
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. this paper examines parallel dynamic storage allocation algorithms. three new algorithms, multiple free list fit I, modified quick fit, and multiple free list fit II, are presented which outperform all previous algorithms we tested. the performances of the three algorithms differ when the percentage of requests for large blocks is high. Multiple free list fit I is the fastest algorithm when large blocks of many different sizes are requested. Multiple free list fit II is the fastest algorithm when large blocks of only a few different sizes are requested.< >
In this paper, we present algorithms for solving several basic geometric problems of size n in a network of p processors each with O(n/p) local memory. Our algorithms achieve the best possible (up to a constant factor...
详细信息
In this paper, we present algorithms for solving several basic geometric problems of size n in a network of p processors each with O(n/p) local memory. Our algorithms achieve the best possible (up to a constant factor) time bound in hypercubic parallel architectures such as hypercube, shuffle exchange and cube connected cycles, provided that n /spl les/ p/sup 1/+/spl epsi/ for some positive constant /spl epsi/. Our algorithms use only sorting and parallel prefixes that involve interprocessor communications, and can be easily implemented in commercially available parallel computers. Experimentation on nCUBE parallel computers show that our algorithms run efficiently.< >
Assignment of tasks of a parallel program onto processors of a distributed-memory system is critical to obtain minimal program completion time by minimizing communication overhead. the paper presents task assignment h...
详细信息
Assignment of tasks of a parallel program onto processors of a distributed-memory system is critical to obtain minimal program completion time by minimizing communication overhead. the paper presents task assignment heuristics for wormhole-routed systems. A Temporal Communication Graph is used to model task graphs and to identify spatial and temporal link contention. the interplay between degree of routing adaptivity, topology, application characteristics, and task assignment is studied by evaluating random task graphs using an event-driven simulator. the study indicates that even for systems supporting fully-adaptive routing, efficient task assignment is necessary to reduce program completion time, especially for communication-bound applications.< >
this paper deals with barrier synchronization in wormhole routed distributed-memory multiprocessors. New rendezvous and multirendezvous synchronization primitives are proposed to implement a barrier between two and mu...
详细信息
Endowing a communication network withthe ability to realize arbitrary communication patterns is an expensive proposition, both in hardware and in system software. One might instead ask whether a system can be built t...
详细信息
Endowing a communication network withthe ability to realize arbitrary communication patterns is an expensive proposition, both in hardware and in system software. One might instead ask whether a system can be built that performs well for a given application program. In this paper we look at the question of when a set of communication patterns is suitable for fast realization on a given network. In particular we look at which patterns are realizable quickly on a mesh. Contrary to common wisdom, transpose is efficiently realizable on a mesh. However, some other important patterns, such as perfect shuffle, are not.< >
Lattice basis reduction has important applications in the areas of computer algebra, cryptography and combinatorial optimization. Several efficient sequential algorithms are known. Recently, parallel algorithms have b...
详细信息
Lattice basis reduction has important applications in the areas of computer algebra, cryptography and combinatorial optimization. Several efficient sequential algorithms are known. Recently, parallel algorithms have been developed but until now a formal proof for the efficiency of parallel algorithms with n/sup 2/ processors has been omitted, where n denotes the dimension of the lattice. In this paper, a variant of the well-known basis reduction algorithms is presented that is well suited for the computation with fast floating point arithmetic and for the implementation on a mesh-connected array of n/sup 2/ processors. In addition, an error analysis and a proof of the parallel efficiency is provided.< >
We present an asynchronous distributed algorithm to determine an ear decomposition of an arbitrary, connected, bi-directional network containing n-nodes and m-links using O(m) messages and O(n) time. Using the ear dec...
详细信息
We present an asynchronous distributed algorithm to determine an ear decomposition of an arbitrary, connected, bi-directional network containing n-nodes and m-links using O(m) messages and O(n) time. Using the ear decomposition we obtain the following results: 1. the distributed ear decomposition algorithm can be used to test biconnectivity, determine biconnected components, find outpoints and bridges using O(m) messages in O(n) time. 2. the distributed ear decomposition algorithm can be used to test if a network is outerplanar using O(n) messages in O(n) time and if the network is outerplanar the embedding is also given using the same message and time complexity.< >
We investigate the problem of parallel evaluation of functional programs. We have developed a novel approach to deal with sharing in graph reduction. Share nodes are introduced to explicitly handle sharing. By using s...
详细信息
We investigate the problem of parallel evaluation of functional programs. We have developed a novel approach to deal with sharing in graph reduction. Share nodes are introduced to explicitly handle sharing. By using share nodes, we have a garbage collection method that is on-the-fly (real time), parallel, distributed, and incremental. In our parallel graph reducer, copying can be done in parallel to make an exponential growth of program tree nodes. Since each node represents a simple operation, the growth of program trees is a natural distribution of computation work. Load balancing becomes very easy and can be done automatically. A simulator is implemented to simulate a tree structured parallel computer to run our graph reducer. Examples such as parallel matrix addition and multiplication are tested.< >
暂无评论