A new approach to parallel sorting called parallel Sorting by OverPartitioning (PSOP) is presented. The approach limits the communication cost by moving each element between processors at most once, and leads to good ...
详细信息
In this paper we present the Segment router, a novel router design for the interconnection networks of massively parallel computers. The design decisions of the Segment router are motivated by the need to improve the ...
详细信息
ISBN:
(纸本)9780897916714
In this paper we present the Segment router, a novel router design for the interconnection networks of massively parallel computers. The design decisions of the Segment router are motivated by the need to improve the network performance when the traffic consists of messages with widely different *** key novelty of the Segment router is that it provides different queueing policies for the two classes of messages it services. Short messages are stored in a centralized, dynamically allocated queue that guarantees storage availability for the whole packet of the short message. Long messages on the other hand are stored in small FIFO buffers associated with the input channels of the router. Thus when a long message becomes blocked, it is stored in segments in the FIFO buffers of a number of routing elements in the network. Furthermore, the physical channels of the Segment router are fairly multiplexed between the two supported classes of messages without the potential for *** simulations we compare the performance of our technique to the performance of multipacket messages and networks implementing lanes. The results clearly demonstrate the performance advantages of our technique.
This paper presents a comparison of the pragmatic aspects of some parallelalgorithms for finding connected components, together with optimizations on these algorithms. The algorithms being compared are two similar al...
详细信息
We present a high-level parallel calculus for nested sequences, NSC, offered as a possible theoretical “core” of an entire class of collection-oriented parallel languages. NSC is based on while-loops as opposed to g...
详细信息
ISBN:
(纸本)9780897916714
We present a high-level parallel calculus for nested sequences, NSC, offered as a possible theoretical “core” of an entire class of collection-oriented parallel languages. NSC is based on while-loops as opposed to general recursion. A formal, machine independent definition of the parallel time complexity and the work complexity of programs in NSC is given. Our main results are: (1) We give a translation method for a particular form of recursion, called map-recursion, into NSC, that preserves the time complexity and adds an arbitrarily small overhead to the work complexity, and (2) We give a compilation method for NSC into a very simple vector parallel machine, which preserves the time complexity and again adds an arbitrarily small overhead to the work complexity.
Recently, several theoretical models of parallelarchitectures have been proposed to replace the PRAM as the model that is presented to an algorithm designer. A primary focus of the new models is to include the cost o...
详细信息
ISBN:
(纸本)9780897916714
Recently, several theoretical models of parallelarchitectures have been proposed to replace the PRAM as the model that is presented to an algorithm designer. A primary focus of the new models is to include the cost of interprocessor communication, which is increasingly important in modern parallelarchitectures. We argue that modeling the communication costs in the architecture or system is only one part of the problem. The other, and usually much more difficult, part is modeling the communication properties of the algorithm itself, which provides necessary inputs into the architectural model to determine overall complexity. In this context, we make three main points in this paper: (i) It is incomplete to describe communication without regard to its relationship with replication. We propose a description of the communication-replication relationship in terms of the working set hierarchy of an algorithm. (ii) Both inherent communication and the communication-replication relationship can be very difficult to model in irregular, dynamic computations that are crucial in many real-world applications. We present some examples that demonstrate this difficulty. (iii) We believe that substantial leverage can be obtained in this effort from the computer systems community, which can provide a hierarchy of simulation and profiling tools—from abstract to detailed—tailored to the needs of the algorithm designers. We propose an initial set of simulation tools, and we discuss possible future refinements to this set.
A malleable parallel task is one that can be executed on any number of processors, with its execution time being a function of the number of processors allotted to it. A nonmalleable parallel task is one that requires...
详细信息
ISBN:
(纸本)0898713293
A malleable parallel task is one that can be executed on any number of processors, with its execution time being a function of the number of processors allotted to it. A nonmalleable parallel task is one that requires a specific number of processors. Given n independent parallel tasks and m identical processors, we consider the problem of scheduling these tasks to minimize makespan. We show that any algorithm for scheduling nonmalleable tasks can be extended to an algorithm for scheduling malleable tasks. The approximation factor of the new algorithm is identical to that of the original algorithm. Meanwhile, only an O(mn) term is added to the running time. Thus we get the best known approximation factor of a polynomial-time algorithm for scheduling malleable parallel tasks, and the fastest running time in which that factor can be achieved. These results can be extended to specific parallelarchitectures, where not only the number of processors allotted to a task, but also their configuration is a factor. Furthermore, we provide an algorithm for scheduling malleable tasks on a linear array of processors. In this case the addresses of the processors assigned to each task must be contiguous. Our algorithm has an approximation factor of 2, under a natural assumption on the task execution times. In contrast, the best known approximation factor for nonmalleable tasks in this case is 2.5. To the best of our knowledge, this is the only case where the best known approximation factor for malleable tasks is strictly less than that for nonmalleable tasks.
List ranking and list scan are two primitive operations used in many parallelalgorithms that use list, trees, and graph data structures. But vectorizing and parallelizing list ranking is a challenge because it is hig...
详细信息
In this paper we study the overheads arising in our algorithm for distributed evaluation of Min/Max trees. The overheads are classified into search overhead, performance loss, and decrease of work load. Several mechan...
详细信息
暂无评论