Future scalable, high-throughput, and high-performance applications are likely to execute on platforms constructed by clustering multiple autonomous distributed servers, with resource access governed by agreements between the owners and users of these servers. Such systems raise several new resource management challenges, chief amongst which is the enforcement of agreements to ensure that, despite the distributed nature of both requests and resources, user requests receive only a predetermined share of the aggregate resource. Current solutions enforce such agreements only at a coarse granularity and in a centralized fashion, limiting their applicability. This paper presents an architecture for the distributed enforcement of resource sharing agreements. Our approach exploits a uniform application-independent representation of agreements, and combines it with efficient time-window based coordinated queuing algorithms running on multiple nodes. We have successfully implemented this general strategy in two different network layers: a Layer-7 HTTP redirector and a Layer-4 IP packet redirector, which redirect connection requests from distributed clients to a cluster of distributed servers. Our measurements of both implementations verify that our approach is general and effective.
Minimising the communication latency and achieving considerable scalability are of paramount importance when designing high-performance broadcast algorithms. Many algorithms for wormhole-switched meshes have been widely reported in the literature. However, most of these algorithms handle broadcast in a sequential manner and do not scale well with the network size. As a consequence, many parallel applications cannot be efficiently supported using existing algorithms. Motivated by these observations, this paper presents a new broadcast algorithm for all-port mesh networks. The unique feature of the proposed algorithm is its capability of handling broadcast in only one message-passing step irrespective of the network size. Results from a comparative analysis and simulation reveal that the proposed algorithm exhibits superior performance characteristics over those of the well-known Recursive Doubling, Extending Dominating Node and Network Partitioning algorithms.
Trace-driven simulation is a commonly used tool to evaluate memory-hierarchy designs. Unfortunately, trace collection is very expensive, and storage requirements for traces are very large. In this paper, we introduce HACS (Hardware Accelerated Cache Simulator), and describe the validation methods we used to demonstrate functionality. We also present some initial cache simulation results from SPECint 2000. We then propose future directions for research with HACS.
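As a rough software illustration of what trace-driven cache simulation involves, the sketch below replays a list of byte addresses through a small set-associative cache with LRU replacement and counts hits and misses. It is not HACS or any part of it: the function name `simulate_cache`, its parameters, and the sample trace are all assumptions made for this example.

```python
from collections import OrderedDict


def simulate_cache(trace, num_sets=4, ways=2, line_size=64):
    """Replay byte addresses through a set-associative LRU cache.

    Hypothetical helper for illustration only. Returns (hits, misses).
    """
    # One ordered dict per set; key order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        block = addr // line_size   # cache-line number of this address
        index = block % num_sets    # set the line maps to
        tag = block // num_sets     # identifies the line within its set
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)  # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[tag] = True
    return hits, misses


# Made-up trace, not taken from SPECint 2000.
sample_trace = [0, 8, 64, 0, 256, 512, 0]
hits, misses = simulate_cache(sample_trace)
print(f"hits={hits} misses={misses}")
```

Even in this toy form, the cost the abstract mentions is visible: every memory reference must be stored in the trace and replayed one by one, which is what motivates hardware acceleration.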
A heterogeneous cluster system consisting of different types of workstations and communication links plays an important role in parallel computing. In many applications on such a system, collective communication operations are commonly used as communication primitives. Thus, the design of efficient collective communication operations is the key to achieving high-performance parallel computing, but the heterogeneity of the system complicates the design. In this paper, we consider the design of an efficient gather operation, one of the most important collective operations. We show that an optimal gather schedule can be found in O(n^{2k-1}) time for a heterogeneous cluster system with n processors of k distinct types, and that a nearly optimal schedule can be found in O(n) time if k = 2.
In this work we investigate the feasibility of using a cluster of PCs built with mass-market networks to meet the needs of the CFD community, in particular for unstructured implicit CFD solvers that require a very irregular pattern of communications. We report the initial findings from a series of experiments with some well-known benchmarks to determine CFD application sensitivity to machine communication parameters. This is done by running these benchmarks on a cluster in which the communication network has been modified to increase the bandwidth by adding multiple channels and to reduce the latency by using a lightweight protocol such as M-Via.
Heterogeneous computing (HC) environments composed of interconnected machines with varied computational capabilities are well suited to meet the computational demands of large, diverse groups of tasks. The problem, of...
In this paper, we describe an implementation of MPI-IO on top of the Direct Access File System (DAFS) standard. The implementation is realized by porting ROMIO on top of DAFS. We identify one of the main mismatches be...
We identify the class of optimization problems expressible as independence systems that can be solved in real time using a parallel machine with polynomially bounded resources as being exactly the class of matroids for which the size of the optimal solution can be computed in parallel real time. We also extend previous results, showing that the solution obtained by a parallel algorithm is arbitrarily better than the solution reported by a sequential one, and not only for the real-time minimum-weight spanning tree (as previously known). Indeed, we show that, for all practical purposes, this property holds for any optimization problem that falls into the aforementioned class.
Overlap of computations and communications is an effective mechanism to improve the performance of parallel/distributed applications significantly. This overlap can be achieved efficiently by partitioning the data and properly scheduling the data transfers. Various asynchronous communication primitives, which are provided by most message-passing tools (e.g. PVM, MPI), can be used to implement the required overlap. Here, we present a design model, the Distributed Software Design Model (DSDM), and show how it can be applied to optimize parallel/distributed applications. We show through several examples, the Master-Slave Merge Sorting application and the astrophysical N-Body problem, how the DSDM can be used to develop efficient and optimized implementations of parallel and distributed algorithms.
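The general idea of overlapping communication with computation through data partitioning can be sketched as follows. This is not the DSDM itself and uses neither PVM nor MPI: a background thread stands in for an asynchronous receive, and the names `fetch_chunk`, `process`, and `overlapped_sum` are hypothetical helpers invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_chunk(i, size=4):
    """Hypothetical stand-in for a non-blocking receive of partition i."""
    time.sleep(0.01)  # simulated transfer latency
    return list(range(i * size, (i + 1) * size))


def process(chunk):
    """Hypothetical compute step applied to one partition."""
    return sum(x * x for x in chunk)


def overlapped_sum(num_chunks):
    """Process partitions while the next one is being transferred."""
    if num_chunks <= 0:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)  # start the first transfer
        for i in range(num_chunks):
            chunk = pending.result()           # wait for partition i
            if i + 1 < num_chunks:
                # Start transferring partition i+1 before computing on i.
                pending = pool.submit(fetch_chunk, i + 1)
            total += process(chunk)            # compute overlaps the transfer
    return total


print(overlapped_sum(3))
```

The data is split into partitions so that the transfer of partition i+1 can proceed while partition i is being processed. With a real message-passing library, the thread pool would be replaced by that library's asynchronous send and receive primitives.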