Recently, the study of I/O-efficient algorithms has moved beyond fundamental problems of sorting and permuting and into wider areas such as computational geometry and graph algorithms. Withthis expansion has come a n...
详细信息
Recently, the study of I/O-efficient algorithms has moved beyond fundamental problems of sorting and permuting and into wider areas such as computational geometry and graph algorithms. Withthis expansion has come a need for new algorithmic techniques and data structures. In this paper, we present I/O-efficient analogues of well-known data structures that we show to be useful for obtaining simpler and improved algorithms for several graph problems. Our results include improved algorithms for minimum spanning trees, breadth-first search, and single-source shortest paths. the descriptions of these algorithms are greatly simplified by their use of well-defined I/O-efficient data structures with good amortized performance bounds. We expect that I/O-efficient data structures such as these will be a useful tool for the design of I/O-efficient algorithms.
In a distributed system, executing a program often requires the access of remote data files. An efficient data transmission strategy is thus important for real-time applications. Since data files may be replicated and...
详细信息
In a distributed system, executing a program often requires the access of remote data files. An efficient data transmission strategy is thus important for real-time applications. Since data files may be replicated and their locations are transparent to the executed program, it becomes system's responsibility to select a proper file server such that data can be transmitted in an effective way. In this paper, we consider the scenario that a particular program needs several data files simultaneously for its execution. Each data file may has replicas across the network. the problem is how to find a collection of file servers and the routing paths such that the data transmission time is minimum. An exhaustive approach could of course find the optimal solution. However, it pays for high computation price. We thus proposes a heuristic method. Results obtained from the proposed method are encouraging.
this paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like i...
详细信息
this paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like in invalidation and update messages in distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. the mechanism is a variation of tree-based multicast with pruning to avoid deadlocks. the new scheme does not require that the destination addresses in a given multicast message be ordered, thereby avoiding any ordering overhead. It allows messages to use any deadlock-free routing function and only requires one startup for each multicast message. the new scheme has been evaluated on several k-ary n-cube networks under synthetic loads. the results show that the proposed scheme is faster than other multicast mechanisms when the multicast traffic is composed of short messages.
this paper provides necessary and sufficient conditions for deadlock-free unicast and multicast routing withthe path-based routing model in interconnection networks which use the wormhole switching technique. the the...
详细信息
this paper provides necessary and sufficient conditions for deadlock-free unicast and multicast routing withthe path-based routing model in interconnection networks which use the wormhole switching technique. the theory is developed around three central concepts: channel waiting, False Resource Cycles, and valid destination sets. the first two concepts are suitable extensions to those developed for unicast routing by two authors of this paper;the third concept has been developed by Lin and Ni. the necessary and sufficient conditions can be applied in a straightforward manner to prove deadlock freedom and to find more adaptive routing algorithms for collective communication. the latter point is illustrated by developing two routing algorithms for multicast communication in 2-D mesh architectures. the first algorithm uses fewer resources (channels) than an algorithm proposed in the literature but achieves the same adaptivity. the second achieves full adaptivity for multicast routing.
parallelprocessing is recognized as a practical way to achieve high performance in logic simulation. Instead of using high cost parallel computers or special purpose hardware simulation engines, we explore the implem...
详细信息
parallelprocessing is recognized as a practical way to achieve high performance in logic simulation. Instead of using high cost parallel computers or special purpose hardware simulation engines, we explore the implementation of parallel logic simulation on an existing network of workstations using parallel Virtual Machine (PVM). We carry out a novel parallel implementation of an output event-driven logic simulation algorithm such that a global control processor or workstation is not needed to synchronize the advancement of simulation time to the next time step. Further advantages of our new approach include a random partitioning of the circuit on to available workstations and a pipelined execution of the different phases of the simulation algorithm. To achieve a better load balance we employ a semi-optimistic scheme for gate evaluations such that no rollback is required. the performance of our implementation has been evaluated in real time using the ISCAS combinational and sequential benchmark circuits. Speedups obtained improve withthe size of the circuit and the activity level in the circuit. Analyses of the communication overhead shows that the techniques developed here will yield even higher gains as newer networking technologies such as fast and switched Ethernet or ATM are employed to connect workstations.
We present a novel declustering scheme (ACATS) for reliability stripes in an orthogonal disk array. Our scheme is deterministic, run-time efficient, and provides frequently the best possible and always an almost best ...
详细信息
We present a novel declustering scheme (ACATS) for reliability stripes in an orthogonal disk array. Our scheme is deterministic, run-time efficient, and provides frequently the best possible and always an almost best possible distribution of failure-induced incremental rebuild workloads. Our scheme provides protection against single disk as well as single string failures within the disk array. Our approach presents a framework in which the Level 5 RAID organization logically appears as a Level 4 RAID;it facilitates the provision of distributed sparing in exactly the same manner. ACATS does not require the existence of a suitably configured block design or of a run-time efficient pseudo-random number generator;it is applicable to arbitrarily configured orthogonal disk arrays. Our scheme is faster than declustering schemes that use pseudo-random permutations and achieves better uniformity of disk loads during a disk rebuild;it is simpler than schemes based on block designs. ACATS provides a rich spectrum of declustering schemes.
Two and three dimensional k-tori are among the most used topologies in the designs of new parallel computers. Traditionally (withthe exception of the Tera parallel computer), these networks have been used as fully-po...
详细信息
Two and three dimensional k-tori are among the most used topologies in the designs of new parallel computers. Traditionally (withthe exception of the Tera parallel computer), these networks have been used as fully-populated networks, in the sense that every routing node in the topology is subjected to message injection. However, fully-populated tori and meshes exhibit a theoretical throughput which degrades as the network size increases. In contrast, multistage networks (that are partially populated) scale well withthe network size. Introducing slackness in fully-populated tori, i.e., reducing the number of processors, and studying optimal routing strategies for the resulting interconnections are the central subjects of this paper. the key concept that we study is the placement of the processors in a network together with a routing algorithm between them, where a placement is the subset of the nodes in the interconnection network that are attached to processors. Our main contribution is the construction of optimal placements for d-dimensional k-tori networks, of sizes k and k2 and the corresponding routing algorithms for the cases d = 2 and d = 3, respectively.
A method for mapping arrays into parallel memories so that to minimize serialization and network conflicts for lock-step systems will be presented. Each array is associated an arbitrary number of data access patterns ...
详细信息
A method for mapping arrays into parallel memories so that to minimize serialization and network conflicts for lock-step systems will be presented. Each array is associated an arbitrary number of data access patterns that can be identified following compiler data-dependence analysis. Conditions for conflict-free access of parallel memories and network are derived for arbitrary power-of-2 data patterns and arbitrary multistage networks. We propose an efficient heuristic to synthesize combined address transformation (NP-complete) which applies to arbitrary linear patterns, arbitrary multistage networks, and arbitrary number of power-of-2 memories. Our method can be implemented as part of the address transformation (Xor and And) or through compiler emulation. Performance of optimized storage schemes is presented for FFT, arbitrary sets of data patterns, non power-of-2 stride access in vector processors, interleaving, and static row-column storages. Our approach is profitable in all the above cases and provides a systematic method for coverting array-memory mapping and network aspects of algorithms from one network topology to another.
Using runtime information of load distributions and processor affinity, we propose an adaptive scheduling algorithm and its variations from different control mechanisms. the proposed algorithm applies different degree...
详细信息
Using runtime information of load distributions and processor affinity, we propose an adaptive scheduling algorithm and its variations from different control mechanisms. the proposed algorithm applies different degrees of aggressiveness to adjust loop scheduling granularities, aiming at improving the execution performance of parallel loops by making scheduling decisions that match the real workload distributions at runtime. To verify the effectiveness of the algorithm and its variations, we implemented them on the KSR-1 and on the Convex Exemplar. We experimentally compared the performance of our algorithm and its variations with several existing scheduling algorithms on the two machines. the kernel application programs we used for performance evaluation were carefully selected for different classes of parallel loops. Our results show that using runtime information to adaptively adjust scheduling granularity is an effective way to handle loops with a wide range of load distributions when no prior knowledge of the execution can be used. the overhead caused by collecting runtime information is insignificant in comparison withthe performance improvement. Our experiments show that the adaptive algorithm and its five variations outperformed the existing scheduling algorithms.
Multicast communication is a significant operation in multicomputer systems and can be used to support several other collective communication operations. Hardware implementation is a viable solution to develop a low l...
详细信息
Multicast communication is a significant operation in multicomputer systems and can be used to support several other collective communication operations. Hardware implementation is a viable solution to develop a low latency multicast algorithm. In this paper, we present a new multicast routing algorithm for two-dimensional meshes. the algorithm uses worm-hole routing mechanism and can send messages to any number of destinations within two start-up communication phases;hence the name, two-phase multicast (TPM) algorithm. the algorithm uses the base routing scheme used for unicast communication, thus minimizing the additional hardware support. the algorithm allows some intermediate nodes that are not in the destination set to perform multicast functions. this feature allows flexibility in multicast path selection and therefore improves the performance. A simulation study has been conducted to investigate the performance of the purposed TPM algorithm. the simulation results show that the proposed TPM algorithm performs significantly better for both contention-free and mixed-traffic conditions, when compared withthe previously proposed multicast algorithms.
暂无评论