ISBN: 0818620250 (print)
An external parallel merge-sort and sort-merge join on tightly coupled processors is considered. The issue of whether significant speedup can be achieved with good CPU efficiency is addressed. A pure sort query and a five-relation join query using a sort-merge-join algorithm are examined. It is found that the external sort is readily parallelizable. In the absence of skew, a speedup linear in the number of tightly coupled processors can be obtained. However, it is shown that skew can reduce the speedup significantly. An examination is made of how important types of skew can be handled to yield close to linear speedup. The effect on the speedup and CPU efficiency of the database size, memory constraints, CPU MIPS, query selectivity, I/O striping, and skew is shown.
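The effect of skew on speedup can be illustrated with a deliberately simple model. The sketch below is not taken from the paper: it assumes each processor handles one partition, that work is proportional to partition size, and that the largest partition determines the finish time. The function names are hypothetical.

```python
def speedup_with_skew(partition_sizes):
    """Speedup over one processor, assuming work is proportional to
    partition size and the largest partition finishes last."""
    total = sum(partition_sizes)
    return total / max(partition_sizes)


def cpu_efficiency(partition_sizes):
    """Fraction of available processor time doing useful work
    under the same simplified model."""
    return speedup_with_skew(partition_sizes) / len(partition_sizes)


balanced = [250, 250, 250, 250]   # no skew: every processor gets equal work
skewed = [700, 100, 100, 100]     # one processor receives most of the data

print(speedup_with_skew(balanced), cpu_efficiency(balanced))
print(speedup_with_skew(skewed), cpu_efficiency(skewed))
```

Under these assumptions the balanced case reaches the full fourfold speedup, while the skewed case is limited by the 700-record partition regardless of how many processors are added.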
A method of distributed garbage collection using reference counting had been developed previously (see D. I. Bevan, Proc. of Parallel Architectures and Languages Europe, June 1987). This method is correct but has one severe drawback: the time overhead caused by the use of indirection cells. In the present work, the authors describe an alternative: the reference weight table method. It is shown that this method incurs less time overhead and offers general improvements.
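For context, the general idea of weighted reference counting, on which the earlier method builds, can be sketched as follows. This is an illustrative assumption about that general idea, not the reference weight table method, whose details the abstract does not give. All class and function names are hypothetical.

```python
class RemoteObject:
    """An object that tracks the total weight of all references to it."""

    def __init__(self, initial_weight):
        self.total_weight = initial_weight
        self.freed = False

    def on_decrement(self, weight):
        # Models a decrement message arriving from a deleted reference.
        self.total_weight -= weight
        if self.total_weight == 0:
            self.freed = True


class Reference:
    """A weighted reference; copying splits the weight locally."""

    def __init__(self, target, weight):
        self.target = target
        self.weight = weight

    def copy(self):
        if self.weight == 1:
            # A weight of 1 cannot be split. This is the point at which
            # an indirection cell would be introduced (not modeled here).
            raise RuntimeError("weight exhausted: indirection cell needed")
        half = self.weight // 2
        self.weight -= half
        return Reference(self.target, half)  # no message to the object

    def delete(self):
        self.target.on_decrement(self.weight)
        self.weight = 0


def new_object(initial_weight=16):
    """Create an object together with its first reference."""
    obj = RemoteObject(initial_weight)
    return obj, Reference(obj, initial_weight)
```

The invariant is that the weights of all live references sum to the object's total weight, so copying needs no communication and only deletion sends a message.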
An efficient algorithm for merging two memory-resident sorted lists is described. The algorithm is based on a novel low-cost partitioning algorithm that is used to split the two lists among an arbitrary number of processors in a way that ensures load balance during the merge. The algorithm has direct applications in memory-resident databases, as well as for handling record pointers in disk-resident databases. It may be used for parallel sorting, table access using multiple indexes, and parallel sort-merge joins. A feature of the partitioning algorithm is that it may itself be parallelized efficiently; the parallel implementation reduces partitioning time, which may become significant if the number of processors gets large. If p is the number of processors and N is the total number of elements in both runs combined, the serial and parallel versions of the partitioning algorithm require time O(p log(N/p)) and O(log p + log(N/p)), respectively. The algorithm can be implemented on a model in which concurrent memory accesses by the processors for both reading and writing are to distinct memory locations. The overall time complexity for the merging is O(N/p), 1 ≤ p ≤ N/log N. The parallel merging algorithm has been successfully implemented on a 20-node Sequent Symmetry multiprocessor.
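The load-balanced split described above can be illustrated with a well-known co-ranking binary search. This is a swapped-in technique, not the partitioning algorithm of the paper, and it makes no claim to the stated O(p log(N/p)) bound. The names co_rank, partition, and parallel_merge are hypothetical, and the workers are simulated by a sequential loop.

```python
import heapq


def co_rank(k, a, b):
    """Return (i, j) with i + j == k such that a[:i] and b[:j] together
    hold the k smallest elements of the sorted lists a and b."""
    lo = max(0, k - len(b))
    hi = min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if a[i] < b[j - 1]:
            lo = i + 1   # a[i] sorts before b[j-1]: take more from a
        else:
            hi = i
    return lo, k - lo


def partition(a, b, p):
    """Split the merge of a and b into p pieces of near-equal output size."""
    n = len(a) + len(b)
    cuts = [co_rank(n * t // p, a, b) for t in range(p + 1)]
    return [(cuts[t], cuts[t + 1]) for t in range(p)]


def parallel_merge(a, b, p):
    """Merge each piece independently; a real system would run the
    loop body on p processors writing to disjoint output ranges."""
    out = []
    for (i0, j0), (i1, j1) in partition(a, b, p):
        out.extend(heapq.merge(a[i0:i1], b[j0:j1]))
    return out
```

Because each piece produces either floor(N/p) or ceil(N/p) output elements, no worker receives more than its share even when one input list dominates a key range.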
The simulation of large communication networks provides an invaluable analysis tool for the designer of large multi-processor computer systems. As multi-processor computer systems increase in size and complexity, the need for discrete-event simulations of these systems becomes greater. Unfortunately, the execution of such simulations is very costly in computer resources. In this paper, we present a discrete-event simulation of a large multi-processor communications system (HYPERSIM). This simulation is designed to execute on a parallel processor running the Time Warp Operating System (TWOS). We demonstrate that complex simulations such as HYPERSIM can achieve significant speedup in execution time using parallel processors as compared to traditional sequential implementations. This paper presents performance data comparing the run time of HYPERSIM on a parallel processor (Caltech/JPL Mark III hypercube) to the run time on a high-performance single processor (CRAY/XMP). In addition, we demonstrate that the object-oriented programming environment that TWOS supports allows for ease and flexibility in the implementation of such simulations.
ISBN: 0897913590 (print)
Execution-driven simulation is a new technique for the performance evaluation of computer systems. It uses the actual execution of a real program to drive a simulation model of the system architecture. The major advantage of this approach is that it offers accuracy close to that of instruction-level simulation but with much less overhead. This paper describes extensions to the execution-driven approach for the purpose of modeling multiprogrammed processors and interrupts with priorities. The primary goal was to develop a method that allows these features to be simulated as accurately as possible with as little additional overhead as possible.
A vectorized algorithm for entering data into a hash table is presented. A program that enters multiple data could not be executed on vector processors by conventional vectorization techniques because of data dependences. The proposed method enables execution of multiple data entry by conventional vector processors and improves the performance by a factor of 12.7, compared with the normal sequential method, when 4099 pieces of data are entered on the Hitachi S-810. This method is applied to address calculation sorting and the distribution counting sort, whose main part was unvectorizable by previous techniques. It improves performance by a factor of 12.8 when n = 2^14 on the S-810.
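One known way to sidestep such data dependences is to let all pending keys write to their slots at once, read the slots back, and retry only the keys that were overwritten. Whether this matches the proposed method in detail is an assumption; the abstract does not describe its mechanism. The sketch below simulates the gather and scatter steps with plain Python loops, uses linear probing, and the name vector_insert is hypothetical.

```python
def vector_insert(keys, table_size):
    """Enter integer keys into an open-addressing table, handling all
    pending keys in each pass as a vector machine would."""
    if len(set(keys)) > table_size:
        raise ValueError("table too small for the given keys")
    table = [None] * table_size
    pending = list(keys)
    slots = [k % table_size for k in pending]
    while pending:
        # Gather: which target slots are empty before this pass?
        empty = [table[s] is None for s in slots]
        # Scatter: every key with an empty slot writes itself;
        # when several keys share a slot, the last write survives.
        for k, s, e in zip(pending, slots, empty):
            if e:
                table[s] = k
        # Gather again: a key is entered if its slot now holds it.
        won = [table[s] == k for k, s in zip(pending, slots)]
        # Keys that lost move to the next slot and retry.
        retry = [(k, (s + 1) % table_size)
                 for k, s, w in zip(pending, slots, won) if not w]
        pending = [k for k, _ in retry]
        slots = [s for _, s in retry]
    return table
```

Each pass consists only of element-wise operations over the pending keys, which is the property that makes this style of insertion amenable to vector hardware.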
A description is given of Siftsort, a highly parallel algorithm, which takes O(log₂² n) steps to sort n items of data on fewer than n log₂ n processors. Although a comparison-exchange algorithm, it separates out the comparisons from the exchanges, with an intermediate logical step of selectively annihilating (that is, sifting) an array representing the results of comparisons. The algorithm proceeds by successive sweeps, with each sweep consisting of three stages: comparison, sifting, and exchange. Unlike Batcher's sort, the method is robust against random errors occurring in the sorting process.
ISBN: 0818620307 (print)
Summary form only given. The design of a real-time executive that embodies some novel concepts and the rationale of choices in the design of that executive are the focus of this work. The application is a data communication environment, in which the executive resides on an M68000-based PROgrammable COMmunications (PROCOM) board. One or more PROCOM boards will augment Concurrent 3200-series computers functioning as nodal elements in ARINC's packet switching network. The PROCOM board will support the link-level component of the network protocol, adding capacity to process and switch message traffic. A major design objective was to reduce the executive overhead below the level usually present in traditional multitasking executives. The kernel uses thread dispatching and no-waited I/O to maximize real-time performance. The design details of the following real-time operating system components are outlined: (a) the kernel and its main services, (b) the event subsystems servicing the host side and the network side, and (c) the device drivers servicing the DMA transfer controllers, the serial communication controllers, and the host-PROCOM interface.
ISBN: 0818620080 (print)
The authors describe parallel algorithm performance evaluation in a programming and instrumentation environment (PIE), an environment geared toward efficient parallel programming and the prediction, implementation, measurement, and evaluation of parallel fast Fourier transform (FFT) algorithms. An example of a mature technology for evaluating parallel applications is provided, emphasizing the need for integration between modeling and measurements. Performance tradeoffs for a range of parallel fast Fourier transform algorithms are explored.
A computationally simple algorithm that measures one or two frequencies using lookup tables and simple adders is presented. The algorithm is based on a real-time processing of instantaneous frequency and envelope. The algorithm provides a maximum-likelihood estimate in the single-frequency case. Oversampling is required and the algorithm cannot estimate three or more frequencies. A complete error analysis is presented along with simulation results.
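The notion of instantaneous frequency used above can be illustrated for the single-frequency case. The sketch below is an assumption-laden illustration, not the lookup-table algorithm of the paper: it assumes complex (analytic) samples, computes the phase step between consecutive samples, and takes a plain average, which is not claimed to be the maximum-likelihood estimate. The function names are hypothetical.

```python
import cmath
import math


def instantaneous_frequency(samples, sample_rate):
    """Per-sample frequency estimates in Hz, from the phase step
    between consecutive complex samples."""
    return [
        cmath.phase(cur * prev.conjugate()) * sample_rate / (2 * math.pi)
        for prev, cur in zip(samples, samples[1:])
    ]


def estimate_single_frequency(samples, sample_rate):
    """Unweighted mean of the instantaneous frequency estimates."""
    freqs = instantaneous_frequency(samples, sample_rate)
    return sum(freqs) / len(freqs)


# Usage: a noiseless 440 Hz complex tone sampled at 8 kHz.
fs = 8000.0
tone = [cmath.exp(2j * math.pi * 440.0 * n / fs) for n in range(64)]
estimate = estimate_single_frequency(tone, fs)
```

The phase step is unambiguous only while the frequency stays below half the sample rate in magnitude, which is one reason a sufficiently high sampling rate matters for this kind of estimator.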