High code efficiency (operations per instruction) combined with a high degree of instruction-level parallelism can rarely be obtained by hardwired microprocessor designs for a broad application domain. The implementat...
This paper demonstrates the power of asynchronism for iterative algorithms in the context of global computing, that is, with machines scattered around the world. The question is whether asynchronism helps to reduce the communication penalty and the overall computation time of a given parallel algorithm. The asynchronous programming model is applied to a given problem, implemented in a multi-threaded environment, and tested on two kinds of workstation clusters: a homogeneous local cluster and a heterogeneous non-local one. The main features of this programming model are exhibited, and the high efficiency and interest of such algorithms are pointed out.
We describe a novel method for scheduling high-speed network switches. The targeted architecture is an input-buffered switch with a non-blocking switch fabric. The input buffers are organized as virtual output queues t...
Future scalable, high-throughput, and high-performance applications are likely to execute on platforms constructed by clustering multiple autonomous distributed servers, with resource access governed by agreements between the owners and users of these servers. Such systems raise several new resource management challenges, chief among which is the enforcement of agreements to ensure that, despite the distributed nature of both requests and resources, user requests receive only a predetermined share of the aggregate resource. Current solutions enforce such agreements only at a coarse granularity and in a centralized fashion, limiting their applicability. This paper presents an architecture for the distributed enforcement of resource sharing agreements. Our approach exploits a uniform, application-independent representation of agreements and combines it with efficient time-window based coordinated queuing algorithms running on multiple nodes. We have successfully implemented this general strategy in two different network layers, a Layer-7 HTTP redirector and a Layer-4 IP packet redirector, both of which redirect connection requests from distributed clients to a cluster of distributed servers. Our measurements of both implementations verify that our approach is general and effective.
Minimising the communication latency and achieving considerable scalability are of paramount importance when designing high-performance broadcast algorithms. Many algorithms for wormhole-switched meshes have been widely reported in the literature. However, most of these algorithms handle broadcast in a sequential manner and do not scale well with the network size. As a consequence, many parallel applications cannot be efficiently supported using existing algorithms. Motivated by these observations, this paper presents a new broadcast algorithm for all-port mesh networks. The unique feature of the proposed algorithm is its capability of handling broadcast in only one message-passing step irrespective of the network size. Results from a comparative analysis and simulation reveal that the proposed algorithm exhibits superior performance characteristics over those of the well-known Recursive Doubling, Extending Dominating Node and Network Partitioning algorithms.
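To illustrate why step count matters for the baselines named above, the following is a minimal sketch of the generic doubling idea behind Recursive Doubling, not the algorithm proposed in the paper. It assumes a flat numbering of nodes and ignores mesh topology, wormhole routing and port limits entirely; the function name `doubling_broadcast_steps` is hypothetical. Under these assumptions, every informed node forwards the message to one uninformed node per step, so the number of steps grows roughly as log2 of the node count.

```python
def doubling_broadcast_steps(num_nodes, root=0):
    """Simulate a generic doubling broadcast (illustrative assumption only).

    Returns a list of steps; each step is a list of (sender, receiver)
    pairs. Topology and routing are deliberately ignored.
    """
    informed = [root]
    steps = []
    while len(informed) < num_nodes:
        known = set(informed)
        pending = [n for n in range(num_nodes) if n not in known]
        # Each informed node sends to at most one uninformed node,
        # so the informed set at most doubles per step.
        pairs = list(zip(informed, pending))
        steps.append(pairs)
        informed.extend(receiver for _, receiver in pairs)
    return steps


if __name__ == "__main__":
    for size in (8, 16, 64):
        print(size, "nodes:", len(doubling_broadcast_steps(size)), "steps")
```

In this simplified model the step count rises with the network size, which is the scaling behaviour the abstract contrasts with a single message-passing step.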
Trace-driven simulation is a commonly used tool to evaluate memory-hierarchy designs. Unfortunately, trace collection is very expensive, and storage requirements for traces are very large. In this paper, we introduce HACS (Hardware Accelerated Cache Simulator), and describe the validation methods we used to demonstrate functionality. We also present some initial cache simulation results from SPECint 2000. We then propose future directions for research with HACS.
A heterogeneous cluster system consisting of different types of workstations and communication links plays an important role in parallel computing. In many applications on such a system, collective communication operations are commonly used as communication primitives. Thus, the design of efficient collective communication operations is the key to achieving high-performance parallel computing, but the heterogeneity of the system complicates the design. In this paper, we consider the design of an efficient gather operation, one of the most important collective operations. We show that an optimal gather schedule can be found in O(n^{2k-1}) time for a heterogeneous cluster system with n processors of k distinct types, and that a nearly optimal schedule can be found in O(n) time if k = 2.
In this work we investigate the feasibility of using a cluster of PCs built with mass-market networks to meet the needs of the CFD community, in particular for unstructured implicit CFD solvers that require a very irregular pattern of communications. We report the initial findings from a series of experiments with some well-known benchmarks to determine CFD application sensitivity to machine communication parameters. This is done by running these benchmarks on a cluster in which the communication network has been modified to increase the bandwidth by adding multiple channels and to reduce the latency by using a lightweight protocol such as M-Via.
Heterogeneous computing (HC) environments composed of interconnected machines with varied computational capabilities are well suited to meet the computational demands of large, diverse groups of tasks. The problem, of...