We present a fast parallel algorithm running in O(log(2) n) time on a CREW PRAM with O(n) processors for finding the kth longest path in a given tree of n vertices (with Theta(n(2)) intervertex distances). Our algorit...
详细信息
ISBN:
(纸本)0818678763
We present a fast parallel algorithm running in O(log(2) n) time on a CREW PRAM with O(n) processors for finding the kth longest path in a given tree of n vertices (with Theta(n(2)) intervertex distances). Our algorithm is obtained by efficient parallelization of a sequential algorithm which is a variant of both Megiddo et al's algorithm [12] and Fredrickson et al's algorithm [3] based on centroid decomposition of tree and succinct representation of the set of intervertex distances. With the same time and space bound as the best known result, our sequential algorithm maintains a shorter length of the decomposition tree.
Exact direction and distance vectors are essential for detecting hierarchical parallelism and examining legality of loop transformation for a multiple level loop nest. Much of this work has been concentrated on array ...
详细信息
ISBN:
(纸本)0818678763
Exact direction and distance vectors are essential for detecting hierarchical parallelism and examining legality of loop transformation for a multiple level loop nest. Much of this work has been concentrated on array references. Little has been done to address the problems of finding precise dependences between scalar references, except to use extended SSA form with factored use-def links. In this paper, ne present a technique for calculating precise direction and distance vectors for scalar references within nested loops without using any forms of SSA. To do this, we use conventional use-def links in combination with joint dominator and joint postdominator relationships, which are extended front dominator and postdominator respectively in standard data flow analysis. The precision of dependence information gathered hv our algorithm can not be achieved by traditional analysis of dominator or reaching definitions.
Delaunay triangulation has been much used in such applications as volume rendering, shape representation, terrain modeling and so on. The main disadvantage of Delaunay triangulation is large computation time required ...
详细信息
Delaunay triangulation has been much used in such applications as volume rendering, shape representation, terrain modeling and so on. The main disadvantage of Delaunay triangulation is large computation time required to obtain the triangulation on an input points set. This time can be reduced by using more than one processors, and several parallel algorithms for Delaunay triangulation have been proposed. In this paper, we propose an improved parallel algorithm for Delaunay triangulation, which partitions the bounding convex region of the input points set into a number of regions by using Delaunay edges and generates Delaunay triangles in each region by applying an incremental construction approach. Partitioning by Delaunay edges makes it possible to eliminate merging step required for integrating subresults. It is shown from the experiments that the proposed algorithm has good load balance and is more efficient than Cignoni et al.'s algorithm[8] and our previous algorithm[9].
Although highly paralleldistributed memory computers exist for several years, the operating systems used on them did nor fit the requirements very well. Most of them are designed for sequential, shared memory paralle...
详细信息
ISBN:
(纸本)0818678763
Although highly paralleldistributed memory computers exist for several years, the operating systems used on them did nor fit the requirements very well. Most of them are designed for sequential, shared memory parallel, or distributed computers. Examples are Unix on the IBM SP/2 [17] and Mach on the Intel Paragon. This results in poor scalability caused by inefficient communication primitives designed for wide area networks or by waste of resources due to huge kernels (e.g. 8 MB per node are reported for Mach on the Paragon, [16]), which is harmful especially in highly parallel systems with hundreds or thousands of nodes. With Cosy (Concurrent Operating System) we have shown that a well structured and carefully designed system can be small (70 Kb for the kernel, 372 total memory usage per node), efficient (33 mu s for communication), and scalable (applications run efficient on up to 1024 processors).
In the course of the development of reactive systems often real rime constraints have to be met. In such time critical applications heterogeneous multi-processor systems are used in order to fulfill these time constra...
详细信息
ISBN:
(纸本)0818678763
In the course of the development of reactive systems often real rime constraints have to be met. In such time critical applications heterogeneous multi-processor systems are used in order to fulfill these time constraints. This paper presents a hybrid partitioning method that uses a stochastic algorithm together with mixed integer linear programming. This method supports the development of time critical systems. We assume that the algorithm which has to be analyzed is given inform of a so-called task-graph. The goal of the overall method is to fix for every task the processor that will execute it and the starting time of this execution. The main issue is a high-level-synthesis-like method for constructing a problem-specific multi-processor board. The presented methods have been fully implemented and tested.
With the advances of wireless communication technology, using the wireless LAN as a platform to perform distributed network computing becomes feasible. In this paper, we studied the characteristics of the end-to-end c...
详细信息
With the advances of wireless communication technology, using the wireless LAN as a platform to perform distributed network computing becomes feasible. In this paper, we studied the characteristics of the end-to-end communication over wireless links. With the advantage of reduced bandwidth competition in each LAN segment separated by the wireless bridges, and with the overlap of wireless and wired communications, an analytical comparison showed that the group communications over wireless links can be more efficient than over a single segment wired LAN. We also conducted experiments of running distributed applications and the results showed that with the support of threads, wireless network computing can achieve the same performance as the wired networks. Furthermore, the statistical results from our survey showed that the users cannot tell the difference between wireless and wired settings in terms of the data accessing speed.
In this paper we present the results of a parallel implementation of a heart field simulation algorithm. The application of biomagnetic fields offers a wide range for using parallel algorithms. Pathological changes in...
详细信息
ISBN:
(纸本)0818678763
In this paper we present the results of a parallel implementation of a heart field simulation algorithm. The application of biomagnetic fields offers a wide range for using parallel algorithms. Pathological changes in the human body, especially in the heart muscle, can be diagnosed and localised by means of biomagnetic field parameters. The gain of this diagnose method is to fit an individual reference modell of the heart field of a patient. Based on differences between the reference modell and the real measured biomagnetic field parameters, the type and the position of defects in the heart can be located. The most time consuming components of the whole algorithm are the matrix computations, especially the matrix inversion. The matrix inversion can be implemented on a paralleldistributed memory system. In this paper we discuss the routing, the parallel matrix inversion, and the speed up for different network topologies that depends on the number of processors and different problem sizes.
On the point of that it is very difficult to keep load balancing among processors for the nonuniform loop in compile-time and it must be at the price of extra overhead to use dynamic methods, this paper has proposed a...
详细信息
ISBN:
(纸本)0818678763
On the point of that it is very difficult to keep load balancing among processors for the nonuniform loop in compile-time and it must be at the price of extra overhead to use dynamic methods, this paper has proposed an adaptive hybrid scheduling way, in which the processes of distribution of loop are divided into a few rounds and the block size in each round is determined adaptively according to the average overhead due to dynamic scheduling. Several experiment results have also exposed the effect of scheduling parameter, which could be selected by programmers according to the probability that a fetching processor may not perform an additional task fetching.
This paper presents a new rapid thread replacement mechanism which is important in multithread technology. Analysis to the memory system indicates that the memory utilization decreases with the increase of cache hit r...
详细信息
ISBN:
(纸本)0818678763
This paper presents a new rapid thread replacement mechanism which is important in multithread technology. Analysis to the memory system indicates that the memory utilization decreases with the increase of cache hit ratio. The parallelism between thread computation and thread replacement is found by analyzing their working processes. Based on these, we advance a rapid multithread replacement mechanism which overlaps the thread replacement with thread computation. More especially, with finite hardware contexts, this mechanism can play the same role of infinite contexts by tolerating the replacement overhead. By modifing the general thread switching model, we bulid the thread replacement model and evaluate this mechanism in theory and experiment methods. At last, we discuss the hardware implementation and put forward the problems to be resolved in the future.
Fast and efficient communication is one of the major design goals not only for parallel systems but also for clusters of workstations. The proposed model of the high performance communication device ATOLL (1) features...
详细信息
ISBN:
(纸本)0818678763
Fast and efficient communication is one of the major design goals not only for parallel systems but also for clusters of workstations. The proposed model of the high performance communication device ATOLL (1) features very low latency for the start of communication operations and reduces the software overhead for communication specific functions. To close the gap between off-the-shelf microprocessors and the communication system a highly sophisticated processor interface implements atomic start of communication, MMU support, and a flexible event scheduling scheme. The interconnectivity of ATOLL provided by four independent network ports combined with cut-through routing allows the configuration of a large variety of network topologies. A software transparent error correction mechanism significantly reduces the required protocol overhead. The presented simulation results promise high performance and low-latency communication.
暂无评论