Declustering is a well known technique to achieve high performance for queries on paralleldatabases. We propose novel General Disk Module (GDM) based declustering algorithms, GDM Cartesian and GDM Circle, for distrib...
详细信息
Declustering is a well known technique to achieve high performance for queries on paralleldatabases. We propose novel General Disk Module (GDM) based declustering algorithms, GDM Cartesian and GDM Circle, for distributing uniformly distributed multidimensional datasets to parallel disks, for datasets of any dimension. We compare the performance of the new approaches with several existing declustering algorithms, using variable numbers of disks, and with variable shapes and dimensions of the datasets. Our results show that the new approaches significantly outperform the others for almost all configurations tested.
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization ...
详细信息
In this paper, we present the first system that implements OpenMP on a network of shared-memory multiprocessors. This system enables the programmer to rely on a single, standard, shared-memory API for parallelization within a multiprocessor and between multiprocessors. It is implemented via a translator that converts OpenMP directives to appropriate calls to a modified version of the TreadMarks software distributed memory system (SDSM). In contrast to previous SDSM systems for SMPs, the modified TreadMarks uses POSIX threads for parallelism within an SMP node. This approach greatly simplifies the changes required to the SDSM in order to exploit the intra-node hardware shared memory. We present performance results for six applications (SPLASH-2 Barnes-Hut and Water, NAS 3D-FFT, SOR, TSP and MGS) running on an SP2 with four four-processor SMP nodes. A comparison between the threaded implementation and the original implementation of TreadMarks shows that using the hardware shared memory within an SMP node significantly reduces the amount of data and the number of messages transmitted between nodes, and consequently achieves speedups up to 30% better than the original versions. We also compare SDSM against message passing. Overall, the speedups of multithreaded TreadMarks programs are within 7-30% of the MPI versions.
An often-claimed benefit of mobile agent technology is the reduction in communication costs. In particular, the area of information filtering has been proposed for the application of mobile filter agents. However, the...
详细信息
An often-claimed benefit of mobile agent technology is the reduction in communication costs. In particular, the area of information filtering has been proposed for the application of mobile filter agents. However, the effective coordination of agents, which takes into account the current network conditions, is difficult to achieve. This paper analyses the situation where data distributed among various remote data servers is examined with mobile filter agents. We present an approach for coordinating the agents' employment, which minimizes communication costs. Validation studies on the possible cost savings for various configurations show that savings of up to 90% can be achieved in the face of actual Internet conditions.
A framework to integrate negotiation capabilities-particularly components implementing a negotiation strategy-into mobile agents is introduced. The framework is based on a plug-in mechanism enabling a dynamic composit...
详细信息
A framework to integrate negotiation capabilities-particularly components implementing a negotiation strategy-into mobile agents is introduced. The framework is based on a plug-in mechanism enabling a dynamic composition of negotiating agents from pluggable modules. Strategy modules are further decomposed into parallel execution units based on the notion of an actor system.
This paper presents efficient all-to-all broadcast algorithms for arbitrary irregular networks with switch-based wormhole interconnection and unicast message passing. first, all-to-all broadcast is considered within a...
详细信息
This paper presents efficient all-to-all broadcast algorithms for arbitrary irregular networks with switch-based wormhole interconnection and unicast message passing. first, all-to-all broadcast is considered within a single switch cluster. Both combining and non-combining algorithms are compared via analytical modeling and simulation. The characteristics of optimal all-to-all broadcast operation are considered and applied to development of multi-switch algorithms. The single switch algorithms are considered on two switch clusters and a near-optimal algorithm is developed which schedules use of interconnecting links where the potential for link contention exists. Finally, the Link Scheduling concept is extended to handle arbitrary irregular networks. Operation of this algorithm is simulated on a 128-node irregular network, and shows a 27.1% improvement in performance compared to other algorithms.
We explore the creation of a metacomputer by the aggregation of independent sites. Joining a metacomputer is voluntary, and hence it has to be an endeavor that mutually benefits all parties involved. We identify propo...
详细信息
We explore the creation of a metacomputer by the aggregation of independent sites. Joining a metacomputer is voluntary, and hence it has to be an endeavor that mutually benefits all parties involved. We identify proportional-share allocation as a key component of such a mutual benefit. Proportional-share allocation is the basis for enforcing the agreement reached among the sites on how to use the metacomputer's resources. We introduce a resource manager that provides proportional-share allocation over a cluster of workstations, assuming applications to be master-slave. This manager is novel because it performs non-preemptive proportional scheduling of multiple processors. A prototype has been implemented and we report on preliminary results. Finally, we discuss how tickets (first-class entities that encapsulate allocation endowments) can be used in practice to enforce the metacomputer agreement, and also how they can ease the site selection to be performed by the application.
In order to extract high levels of performance from modern parallel architectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilizati...
详细信息
In order to extract high levels of performance from modern parallel architectures, the effective management of deep memory hierarchies is very important. While architectural advances in caches help in better utilization of the memory hierarchy, compiler-directed locality enhancement techniques are also important. In this paper we propose a locality improvement technique that uses data space (array layout) transformations in contrast to most of the previous work based on iteration space (loop) transformations. In other words, rather than changing the order of loop iterations, our technique modifies the memory layouts of multi-dimensional arrays. In comparison with previous work on data transformations it brings two novelties. first, we formulate the problem on a special graph structure called the layout graph (LG) and use integer linear programming (ILP) methods to determine optimal layouts. Second, in addition to static layout detection, our approach also enables the compiler to determine optimal dynamic layouts;that is, the layouts that can be changed across loop nest boundaries. We believe that this is the first attempt to determine optimal dynamic memory layouts. We also present preliminary experimental results on the SGI Origin 2000 distributed shared memory multiprocessor. Our results so far are encouraging and indicate that the additional compilation time taken by the solver is tolerable.
This paper presents a new algorithm called List-based Load Balancing (LLB) for compile-time task scheduling on distributed-memory machines. LLB is intended as a cluster-mapping and task-ordering step in the multi-step...
详细信息
This paper presents a new algorithm called List-based Load Balancing (LLB) for compile-time task scheduling on distributed-memory machines. LLB is intended as a cluster-mapping and task-ordering step in the multi-step class of scheduling algorithms. Unlike current multistep approaches, LLB integrates cluster-mapping and task-ordering in a single step. The benefits of this integration are twofold. first, it allows dynamic load balancing in time, because only the ready tasks are considered in the mapping process. Second, communication is also considered, as opposed to algorithms like WCM and GLB. The algorithm has a low time complexity of O(E+V(log V+log P)), where E is the number of dependences, V is the number of tasks and P is the number of processors. Experimental results show that LLB outperforms known cluster-mapping algorithms of comparable complexity, improving the schedule lengths up to 42%. Furthermore, compared with LCA, a much higher-complexity algorithm, LLB obtains comparable results for fine-grain graphs and yields improvements up to 16% for coarse-grain graphs.
Efficient performance tuning of parallel programs is often hard. We present a performance prediction and visualization tool called VPPB. Based on a monitored uni-processor execution, VPPB shows the (predicted) behavio...
详细信息
ISBN:
(纸本)0769501433
Efficient performance tuning of parallel programs is often hard. We present a performance prediction and visualization tool called VPPB. Based on a monitored uni-processor execution, VPPB shows the (predicted) behaviour of a multithreaded program using any number of processors and the program behaviour is visualized as a graph. The first version of VPPB was unable to handle I/O operations. This version has, by an improved tracing technique, added the possibility to trace activities at the kernel level as well. Thus, VPPB is now able to trace various I/O activities, e.g., manipulation of OS internal buffers, physical disk I/O, socket I/O, and RPC. VPPB allows flexible performance tuning of parallel programs developed for shared memory multiprocessors using a standardized environment;C/C++ programs that uses the thread package in Solaris 2.X.
We consider the problem of routing in an asynchronous dynamically changing ring of processors using schemes that minimize the storage space for the routing information. In general, applying static techniques to a dyna...
详细信息
We consider the problem of routing in an asynchronous dynamically changing ring of processors using schemes that minimize the storage space for the routing information. In general, applying static techniques to a dynamic network would require significant re-computation. Moreover, the known dynamic techniques applied to the ring lead to inefficient schemes. In this paper we introduce a new technique, Dynamic Interval Routing, and we show trade offs between the stretch, the adaptation cost and the size of the update messages used by routing schemes based upon it. We give three algorithms for rings of maximum size N: the first two are deterministic, one with adaptation cost zero but stretch N/2, the other with adaptation cost O(N) messages of O(log N) bits and stretch 1. The third algorithm is randomized, uses update messages of size O(k log N), has adaptation cost O(k) and expected stretch 1+1/k, for any k≥1. All schemes require O(log N) bits per node for the routing information and all messages headers are of O(log N) bits.
暂无评论