This paper presents semantically-based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary since compile-time dependency analysis of sequential progr...
详细信息
This paper presents semantically-based explicit parallel structuring for rule-based programming systems. Explicit parallel structuring appears to be necessary since compile-time dependency analysis of sequential programs has not yielded large scale parallelism and run-time analysis for parallelism is restricted by the execution cost of the analysis. Simple language extensions specifying semantics of rules are used to define parallel execution behavior at the rule level. Type definitions for working memory elements are extended to include relationships within and among objects which define the parallelism allowed on instances of object types. The first result presented is that the algorithms implemented by commonly used benchmark rule-based programs contain scalable parallelism. The second result is that much of that parallelism can be captured by simple and modest extensions of rule-based languages which are analogies of models and constructs used for specification of parallel structures in imperative programming languages. A sketch is given for a comprehensive language system which exploits specification of semantics defining parallel structures in both object-definition and executable segments of rule-based programs.< >
There has been concern in the architectural community regarding the scalability of shared memory parallelarchitectures owing to the potential for large latencies for remote memory accesses. KSR-1 is a re cently intro...
详细信息
There has been concern in the architectural community regarding the scalability of shared memory parallelarchitectures owing to the potential for large latencies for remote memory accesses. KSR-1 is a re cently introduced commercial shared memory parallel architecture, and the scalability of KSR-1 is the focus of this research. Our key conclusions are as follows: The communication network of KSR-1 is fairly re silient in supporting simultaneous remote memory ac cesses from several processors. The multiple communi cation paths realized through this pipelining help in the efficient implementation of tournament-style barrier synchronization algorithms. The architectural features of KSR-1 such as the poststore and prefetch are useful for boosting the performance of parallel applications. The network does saturate when there are simultane ous remote memory accesses from a fully populated (32 node) ring.
The author presents a methodology to design an efficient communication reconfigurable network of processor using a circuit switching environment. He assumes that the operation is synchronous and reconfigurations occur...
详细信息
The author presents a methodology to design an efficient communication reconfigurable network of processor using a circuit switching environment. He assumes that the operation is synchronous and reconfigurations occur at pre-specified times. This network is based on two architectural concepts, the generalized folding cube and the enhanced hypercube architectures. The author demonstrates the effectiveness, versatility, and flexibility of his approach.< >
The authors present a block data decomposition algorithm for two-dimensional grid problems. Their method includes local balancing to accommodate heterogeneous processors, and they characterize the conditions that must...
详细信息
The authors present a block data decomposition algorithm for two-dimensional grid problems. Their method includes local balancing to accommodate heterogeneous processors, and they characterize the conditions that must be met for their partitioning strategy to be of value. While they concentrate on the workstation network model of parallel processing because of its high communication costs and inherent heterogeneity, their method is applicable to other parallelarchitectures.< >
We investigate synchronization activities in application executing on distributed-memory MIMD architectures. Three applications are used to quantify the performance impact of synchronization as the number of processor...
详细信息
We investigate synchronization activities in application executing on distributed-memory MIMD architectures. Three applications are used to quantify the performance impact of synchronization as the number of processors is increased. We also investigate the performance improvement possible when synchronization is supported in hardware. The results show that significant performance improvement can be achieved. The hardware support should include barrier synchronization, operate-and-broadcast, and operations over subsets of processors.< >
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different...
详细信息
Some parallel applications do not require a precise imitation of the behavior of the physically shared memory programming model. Consequently, certain parallel machine architectures have elected to emphasize different required coherency properties because of possible efficiency gains. This has led to various definitions of models of store coherency. These definitions have not been amenable to detailed analysis and, consequently, inconsistencies have resulted. In this paper a unified framework is proposed in which different models of store coherency are developed systematically by progressively relaxing the constraints that they have to satisfy. A demonstration is given of how formal reasoning can be carried out to compare different models. Some real-life systems are considered and a definition of a version of weak coherency is found to be incomplete.< >
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. This paper examines parallel dynamic storage allocation algorithms. Three new algorithm...
详细信息
A dynamic storage allocator manages space for objects whose lifetimes are not known by the system at the time for their creation. This paper examines parallel dynamic storage allocation algorithms. Three new algorithms, multiple free list fit I, modified quick fit, and multiple free list fit II, are presented which outperform all previous algorithms we tested. The performances of the three algorithms differ when the percentage of requests for large blocks is high. Multiple free list fit I is the fastest algorithm when large blocks of many different sizes are requested. Multiple free list fit II is the fastest algorithm when large blocks of only a few different sizes are requested.< >
The authors are concerned with dynamic programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson (1979), presented ...
详细信息
The authors are concerned with dynamic programming (DP) algorithms whose solution is given by a recurrence relation similar to that for the matrix parenthesization problem. Guibas, Kung and Thompson (1979), presented a systolic array algorithm for this problem that uses O(n/sup 2/) processing cells and solves the problem in O(n) time. The authors present three different mappings of this systolic algorithm on a mesh connected parallel computer. The first two mappings use commonly known techniques for mapping systolic arrays to mesh computers. Both of them are able to obtain only a fraction of maximum possible performance. The primary reason for the poor performance of these formulations is that different nodes at different levels in the multistage graph in the DP formulation require different amounts of computation. Any adaptation has to take this into consideration and evenly distribute the work among the processors. The third mapping balances the work load among processors and thus is capable of providing efficiency approximately equal to 1 (i.e., speedup approximately equal to the number of processors) for any number of processors and sufficiently large problem. They experimentally evaluate these mappings on a mesh embedded onto a 256 processor nCUBE/2.< >
A partitionable hypercube allows simultaneous execution of multiple tasks, where each task can be executed on a choice of subcubes. This paper considers the problem of static nonpreemptive scheduling of w independent ...
详细信息
A partitionable hypercube allows simultaneous execution of multiple tasks, where each task can be executed on a choice of subcubes. This paper considers the problem of static nonpreemptive scheduling of w independent tasks on a n processor partitionable hypercube system to minimize the overall finishing time of the w tasks. Each task can be executed on subcubes of different sizes, with smaller execution times on larger subcubes. A schedule determines the size of the subcube to be assigned to each task and schedules these tasks on the processors in the hypercube system. The problem of finding the optimal schedule, with minimum finishing time, is known to be NP-hard. This paper presents a fast polynomial time approximation algorithm for the problem, and derives a tight worst-case performance bound of 2 for the algorithm.< >
暂无评论