Database systems are increasing in importance as stage repositories for applications like CAD and AI. These applications demand efficient support for queries that are more complex than the common business queries that...
详细信息
Database systems are increasing in importance as stage repositories for applications like CAD and AI. These applications demand efficient support for queries that are more complex than the common business queries that are supported well by most systems. More queries are also of an ad-hoc nature, limiting the utility of precompilation approaches. The need for fast response to make AI and CAD applications feasible also has further motivated research into techniques for exploiting resources in a distributed system for improved performance. This paper explores query decomposition techniques for complex query processing in distributed database systems and exposes the deficiencies associated with current algorithms.< >
The authors introduce three page replacement, and page out policies, in distributed virtual memory systems. Two of the replacement policies, the least recently brought and the global recently used or brought, are adap...
详细信息
The authors introduce three page replacement, and page out policies, in distributed virtual memory systems. Two of the replacement policies, the least recently brought and the global recently used or brought, are adapted versions of the least recently used policy, which is well known in conventional virtual memory systems. Trace driven simulation was used to evaluate the performance of the replacement policies and the RR (round robin), LAN (least active neighbor), and LLN (least loaded neighbor) page out policies. The results suggest that when the cost of internode faults is considerably higher than local memory access, global and remote policies are superior to the local one. When the cost of bringing a page from the immediate neighbor is considerably low compared to the cost of accessing the local memory, the local policy performs as well as the global and the remote. Among the page out policies, round robin is the least efficient. LLN generates lower cost than LAN when the size of the local memory is relatively large. Under high memory contention, LAN shows better performance.< >
Public key cryptography and parallel algorithms are considered. Special attention is paid to algorithms using long integer modulo arithmetic. A modification of the commonly known RSA algorithm is taken as a candidate....
详细信息
Public key cryptography and parallel algorithms are considered. Special attention is paid to algorithms using long integer modulo arithmetic. A modification of the commonly known RSA algorithm is taken as a candidate. So far all implementations have been more or less sequential in the sense that no partitions of a long integer among various processing elements have been performed. The proposed approach allows the use of a dedicated processor for each group of about 30 to 50 bits of a long integer. Efficiency is primarily gained when special-purpose processors are used. In this regard this work is the basis of a VLSI approach to a multiprocessor-based cryptographic design with 15 to 100 processors involved.< >
The author introduces and explains two algorithms, OFCup and OFCdown, allowing one to calculate the global state of a decentralized distributed system by interpreting measurements that are easy to obtain to facilitate...
详细信息
The author introduces and explains two algorithms, OFCup and OFCdown, allowing one to calculate the global state of a decentralized distributed system by interpreting measurements that are easy to obtain to facilitate cooperative optimal load balancing without a central job dispatcher. The information required is exchanged using the communication protocol of a receiver-initiated load balancing policy and does not induce any additional message transmission overhead. The author presents and interprets measurements from simulation. These studies show that the performance of systems applying any of the OFCx algorithms is significantly better than a no-information policy called 'random routing' and induces only little additional waiting time compared to the M/D/n model. This is true even for high transmission times relative to the mean time between system state changes. Both algorithms are shown to perform equally well under normal conditions with better variance values of OFC-down, but the degradation of OFCdown is significantly worse than that of OFCup, if the not-accept-counter is not incremented at the time expected.< >
The authors present seven algorithms for multiprocessor scheduling of task trees. The objective function of the algorithms is to minimize parallel time (the time between the start of the first processor and the comple...
详细信息
The authors present seven algorithms for multiprocessor scheduling of task trees. The objective function of the algorithms is to minimize parallel time (the time between the start of the first processor and the completion of the last processor) in an environment where interprocessor communication costs are significant. Test results are given for implementations of (1) an optimal algorithm that produces a schedule that cannot be improved upon, (2) a greedy algorithm that has minimal overhead, and (3) a 'light load' algorithm which combines the best features of the optimal algorithm and the greedy one. The authors illustrate the trade-off between generating optimal schedules and creating scheduling programs that perform their allocation in a reasonable amount of time. They also give a new NP-complete result.< >
Describes the design of an interconnection network switch VLSI for parallel digital signal processing (DSP) systems. This VLSI features 6*400 Mb/s bidirectional communication ports and 1.44-Gb/s internal switching cap...
详细信息
Describes the design of an interconnection network switch VLSI for parallel digital signal processing (DSP) systems. This VLSI features 6*400 Mb/s bidirectional communication ports and 1.44-Gb/s internal switching capability for realizing efficient packet-switched communication among many processing elements. To achieve this performance with existing CMOS technology, the transport line transfers 8 b in parallel at 50 MHz; input data are first multiplexed up to 64 b and switched at 25 MHz. An automatic routing function is implemented on each switch using a distributed routing table scheme. A high-level behavioral description based CAD (computer-aided design) system called PARTHENON is used to design the functions and logic circuits of this VLSI. The suitability and effectiveness of PARTHENON for switch design are also shown in terms of parallel operation and finite state machine design.< >
A new task migration strategy is presented. The input program is supplied as a task graph and an initial static assignment of tasks to processors is given. The dynamic behaviour of the static allocation is then profil...
详细信息
A new task migration strategy is presented. The input program is supplied as a task graph and an initial static assignment of tasks to processors is given. The dynamic behaviour of the static allocation is then profiled, and a set of migration destinations and migration thresholds are identified. During any subsequent execution, the profiling based algorithm makes migration decisions solely on the basis of the profiled data and the local processor load. No neighboring processor load information is required. Results are presented for the algorithm, and compared against a dynamic mapping scheme with global knowledge of the multiprocessor state.< >
The distributed consensus problem assumes that all processors in the system have some initial values; the goal is to make all non-faulty processors agree on one of these values. This paper investigates the time needed...
详细信息
The distributed consensus problem assumes that all processors in the system have some initial values; the goal is to make all non-faulty processors agree on one of these values. This paper investigates the time needed to reach consensus in a partially synchronous model with omission failures. In this model, the processors have no direct knowledge about time, but the time between consecutive steps of each processor is always between two known constants c/sub 1/ and c/sub 2/; the ratio C=/sup c2///sub c1/ measures the timing uncertainty in the system. Moreover, messages are delivered within time d. This paper provides an improved protocol for the above problem. When the majority of the processors are fault-free, the protocol achieves consensus in time 3( phi +1)d+Cd, where phi is the actual number of faults in a specific execution of the protocol. This allows an increase in efficiency up to 25% over the existing protocol which requires time 4( phi +1)d+Cd.< >
With recent improvements in single CPU performance, several issues become more important in multiprocessor design. Two of these are interprocessor communication and locality. In parallelsystems with fast CPUs, locali...
详细信息
With recent improvements in single CPU performance, several issues become more important in multiprocessor design. Two of these are interprocessor communication and locality. In parallelsystems with fast CPUs, locality is vital to performance. However, traditional parallel programming models such as shared memory or message passing do not naturally lead to programs that exhibit locality. In the paper, the Seamless model for interprocessor communication is presented which is based on locality and that allows the programmer to explicitly manipulate a program's locality to optimize performance. Additionally, this model can support latency tolerance with proper hardware support. Extensions to the C programming language that support this model are also presented. Finally, a parallel program utilizing this model is provided to illustrate the paradigm.< >
WRV256, an experimental waveform-relaxation-based parallel circuit simulator for the Victor V256 distributed-memory parallel machine, was used to study performance trade-offs between Gauss-Seidel and bounded-chaotic r...
详细信息
WRV256, an experimental waveform-relaxation-based parallel circuit simulator for the Victor V256 distributed-memory parallel machine, was used to study performance trade-offs between Gauss-Seidel and bounded-chaotic relaxation algorithms. Several subcircuit scheduling alternatives within the bounded-chaotic framework were also investigated. The simulator has been exercised on a suite of circuits ranging from 16000 to over 93000 FETs. Several of the circuits were extracted directly from a 16-Mb DRAM (dynamic random-access memory) memory design. It is concluded that the bounded-chaotic algorithm offers a reasonable compromise for a relaxation algorithm used for parallel machines. It offers one way to attain both the numerical efficiency of Gauss-Seidel for larger jobs and the Gauss-Jacobi parallelism for smaller ones.< >
暂无评论