There are two approaches to processing Datalog programs in parallel. One is to decompose the rules of a program into concurrent modules, and then assign them to processors. The other is to partition data between proce...
详细信息
There are two approaches to processing Datalog programs in parallel. One is to decompose the rules of a program into concurrent modules, and then assign them to processors. The other is to partition data between processors, so that each processor evaluates the same program, but with less data. The authors propose a third approach which combines the two methods in a single framework. In this approach, rules are decomposed into segments and data is partitioned among the segments. There are a number of advantages of this approach. Most importantly, it provides good focus on processing the tuples that are relevant to queries, and allows data to be partitioned and balanced dynamically at different levels. An analytic performance study is also presented to illustrate the usefulness of the proposed approach.< >
parallel execution of database queries in a shared nothing multiprocessor environment, needs control mechanisms which guarantee a correct global execution of each query. In the authors system, the control is integrate...
详细信息
parallel execution of database queries in a shared nothing multiprocessor environment, needs control mechanisms which guarantee a correct global execution of each query. In the authors system, the control is integrated into the query code at compile time. According to the required schedule for the query, the compilation process combines a set of basic control operations into a control graph mapped on the query graph. This control which allows pipelining between data operations, achieves the dataflow control and enforces the schedule of program. Using fine grain control operations combined into control schemas makes the control flexible, as new strategies could be easily supported. In their prototype the parallel LERA language is used.< >
The paper presents the results of a statistical analysis by which the blocking behavior is investigated of interconnection structures that are major candidates for large distributed memory systems. The analysis answer...
详细信息
The paper presents the results of a statistical analysis by which the blocking behavior is investigated of interconnection structures that are major candidates for large distributed memory systems. The analysis answers important questions such as: how many logical connections can exist simultaneously, when will the network saturate, how well are the physical links utilized, and what is the cost of realization of the network. The network topologies considered are the simple 2D-mesh, the hypercube, and an innovative interconnection structure called TICNET. It is shown that the TICNET, which can be realized as a hierarchy of crossbars, is similar in behavior to the hypercube but much more cost-effective. Compared to the 2D-mesh, the TICNET has a much better blocking behavior and is still more cost-effective. The simulation results guide the designer of distributed memory architectures in selecting the most suitable interconnection network.< >
The authors present a new parallel algorithm, based on the divide-and-conquer (DC) strategy, for solving tridiagonal systems. Through a comparative study between their DC method and other well known tridiagonal solver...
详细信息
The authors present a new parallel algorithm, based on the divide-and-conquer (DC) strategy, for solving tridiagonal systems. Through a comparative study between their DC method and other well known tridiagonal solvers: cyclic odd-even reduction (CR), recursive doubling (RD), and the partition method, they show that for the binary hypercube architecture, the communication complexity of their DC method is the lowest among all, and therefore the most efficient tridiagonal solver for communication-expensive massively parallel hypercube computers.< >
Presents an efficient protocol to totally order the delivery of broadcast messages in a distributed system with N sites. The protocol exploits a symmetric communication scheme based on finite projective planes and req...
详细信息
Presents an efficient protocol to totally order the delivery of broadcast messages in a distributed system with N sites. The protocol exploits a symmetric communication scheme based on finite projective planes and requires N+2 square root N messages to create a total order to broadcast messages, that is consistent across all sites.< >
The authors present efficient parallel algorithms for balancing load in circuit-switched hypercubes, which use the e-cube algorithm to route messages between the processing elements. Performance of the hypercube syste...
详细信息
The authors present efficient parallel algorithms for balancing load in circuit-switched hypercubes, which use the e-cube algorithm to route messages between the processing elements. Performance of the hypercube system can be significantly improved by transferring load, at regular intervals of time, from the overloaded processors to the underloaded processors. A centralized algorithm, based on a minimum-cost network flow model, for balancing load in the hypercube systems was developed by Bokhari. This algorithm requires O(N/sup 2/n/sup 2/) time to balance load in an n-cube with N=2/sup n/ processing elements. The authors develop efficient network models-network models in which every arc has unit capacity-to solve the load balancing problem. They also present distributed algorithms for balancing load in the hypercube systems. They show that the distributed algorithms perform better than the centralized algorithms in most practical situations.< >
Conventional approaches for implementing barrier synchronization in large scale systems result in long synchronization times due to memory and network contention. The authors present some applications of a very simple...
详细信息
Conventional approaches for implementing barrier synchronization in large scale systems result in long synchronization times due to memory and network contention. The authors present some applications of a very simple barrier hardware that allows barrier synchronizations to be performed within a few processor cycles. The low latency of this barrier synchronizer and its ability to allow dynamic participation permits several rather unconventional applications like low-latency software combining, scans and determination of extremum values in constant time. This paper presents these applications.< >
A Datalog program consists of function-free Horn clause rules. There are, however, some situations described more easily and more naturally by the use of general terms as arguments. A term is built up from variables, ...
详细信息
A Datalog program consists of function-free Horn clause rules. There are, however, some situations described more easily and more naturally by the use of general terms as arguments. A term is built up from variables, constants, and function symbols. A set of rules where the rules may contain function symbols is called a Datalog/sup Fun/ program. The paper discusses parallel processing of decomposable Datalog/sup Fun/ programs to overcome the performance problem. A decomposable program can be evaluated in parallel such that neither a communication nor a synchronization of the processors has to be established. The author and G. Lausen (1991) proposed the concept of generalized pivoting as a sufficient condition for decomposability of arbitrary but function-free Datalog programs. The current paper extends the concept of generalized pivoting to programs which may contain function symbols.< >
This paper discusses the effect of processor failures on computation performed on two-dimensional VLSI processor arrays. Previously established properties of catastrophic fault patterns are used to study inherent limi...
详细信息
This paper discusses the effect of processor failures on computation performed on two-dimensional VLSI processor arrays. Previously established properties of catastrophic fault patterns are used to study inherent limits to reconfigurability of these regular architectures. Bounds on number of faults the system can tolerate to provide guaranteed performance are derived. These results are the generalization of the results obtained in the case of one-dimensional processor arrays.< >
Each vertex of an undirected graph possesses a piece of information which must be sent to every other vertex. The method of communication is to send bounded size packets of messages from one vertex to another. We desc...
详细信息
Each vertex of an undirected graph possesses a piece of information which must be sent to every other vertex. The method of communication is to send bounded size packets of messages from one vertex to another. We describe parallel algorithms to accomplish the desired tasks for five prominent architectures. The algorithms are optimal, or nearly so, in every case.< >
暂无评论