With the proliferation of workstation clusters connected by high-speed networks, providing efficient system support for concurrent applications engaging in nontrivial interaction has become an important problem. Two principal barriers to harnessing parallelism are: (1) efficient mechanisms that achieve transparent dependency maintenance while preserving semantic correctness, and (2) scheduling algorithms that match coupled processes to distributed resources while explicitly incorporating their communication costs. This paper describes a set of performance features, their properties, and their implementation in a system support environment called DUNES that achieves transparent dependency maintenance (IPC, file access, memory access, process creation/termination, process relationships) under dynamic load balancing. The two principal performance features are push/pull-based active and passive end-point caching and communication-sensitive load balancing. Collectively, they mitigate the overhead introduced by the transparent dependency maintenance mechanisms. Communication-sensitive load balancing, in addition, governs the assignment of distributed resources to application processes, taking both communication and computation costs explicitly into account. DUNES' architecture endows commodity operating systems with distributed operating system functionality while remaining transparent to their existing application base, and it preserves semantic correctness with respect to single-processor semantics. We present performance measurements of a UNIX-based implementation on Sparc and x86 architectures over high-speed LAN environments, showing that significant gains in system throughput and parallel application speed-up are achievable.
Parallel programming models should attempt to satisfy two conflicting goals. On one hand, they should hide architectural details so that algorithm designers can write simple, portable programs. On the other hand, models must expose architectural details so that designers can evaluate and optimize the performance of their algorithms. In this paper we experimentally examine the trade-offs made by a simple shared-memory model, QSM, to address this dilemma. The results indicate that analysis under the QSM model yields quite accurate results for reasonable input sizes, and that algorithms developed under QSM achieve performance close to that obtainable through more complex models, such as BSP and LogP.
Most parallel programming models for distributed-memory architectures are based on individual threads interacting via send and receive operations. We show that a more structured model, BSP, gains substantial performance improvements by exploiting the extra information implicit in its structure. In particular, each thread learns something about global state whenever it receives a message; this information can be used to modify its own behavior and improve collective use of the communication system. The programming model's semantics also provide implicit knowledge that can be exploited to increase performance. We show that these effects are useful at the application level by comparing the performance of BSP and MPI implementations of the NAS parallel benchmarks.
The main contribution of this work is to propose a number of broadcast efficient VLSI architectures for computing the sum and the prefix sums of a wk-bit, k ≥ 2, binary sequence using, as basic building blocks, linea...
In this paper we present a load-adaptive parallel algorithm and implementation for computing the 2D Discrete Wavelet Transform (DWT) on multithreading machines. In a 2D DWT computation, the problem size shrinks at every decomposition level and the lengths of the emerging computation paths also vary. The parallel algorithm proposed in this paper dynamically scales itself to the varying problem size. Experimental results are reported based on implementations of the proposed algorithm on a 20-node multithreading emulation platform, EARTH-MANNA. We show that multithreading implementations of the proposed algorithm are at least 2 times faster than the MPI-based message-passing implementations reported in the literature. We further show that the proposed algorithm and implementations scale linearly with respect to problem and machine sizes.
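The shrinking per-level workload this abstract refers to can be made concrete with a toy example. The paper does not specify a wavelet here, so the sketch below uses a one-level 2D Haar transform (all names are illustrative, not the paper's code): the next decomposition level recurses only on the LL subband, which holds a quarter of the input's elements.

```python
def haar2d_level(img):
    """One level of a 2D Haar DWT: filter rows, then columns.
    The low-low subband LL, on which the next level recurses, has a
    quarter of the input's elements -- the per-level shrinkage that a
    load-adaptive scheduler must track."""
    def rows(m):
        # Pairwise averages (low-pass) and differences (high-pass) per row.
        lo = [[(r[i] + r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in m]
        hi = [[(r[i] - r[i + 1]) / 2 for i in range(0, len(r), 2)] for r in m]
        return lo, hi

    def transpose(m):
        return [list(c) for c in zip(*m)]

    low, high = rows(img)                                     # horizontal pass
    LL, LH = (transpose(x) for x in rows(transpose(low)))     # vertical pass
    HL, HH = (transpose(x) for x in rows(transpose(high)))
    return LL, LH, HL, HH

img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
LL, LH, HL, HH = haar2d_level(img)   # each subband is 2x2 for a 4x4 input
```

Each level halves both dimensions, so a static processor assignment sized for level 0 is increasingly wasteful at deeper levels, which is the imbalance the paper's dynamic scaling addresses.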
PM-PVM is a portable implementation of PVM designed to work on SMP architectures supporting multithreading. PM-PVM portability is achieved by implementing the PVM functionality on top of a reduced set of parallel programming primitives. Within PM-PVM, PVM tasks are mapped onto threads and the message-passing functions are implemented using shared memory. Three implementation approaches to the PVM message-passing functions have been adopted. In the first, a single message copy in memory is shared by all destination tasks. The second replicates the message for every destination task but requires less synchronization. The third approach combines features of the two previous ones. Experimental results comparing the performance of PM-PVM and PVM applications running on a 4-processor SPARCstation 20 under Solaris 2.5 show that PM-PVM can produce execution times up to 54% smaller than PVM.
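The first two delivery strategies the abstract names can be sketched with tasks as threads and per-task mailboxes as queues. This is a hypothetical illustration of the trade-off, not PM-PVM's actual API:

```python
import queue

# Hypothetical sketch: two message-delivery strategies for a threaded PVM,
# with each task's mailbox modeled as a queue.
mailboxes = {tid: queue.Queue() for tid in ("t1", "t2", "t3")}

def send_shared(msg, dests):
    # Approach 1: a single message copy, shared by all destination tasks.
    # Cheap to send, but receivers must treat the buffer as read-only and
    # must synchronize on when it can be reclaimed.
    for tid in dests:
        mailboxes[tid].put(msg)

def send_replicated(msg, dests):
    # Approach 2: replicate the message for every destination.  More
    # copying, but each receiver owns its buffer, so less synchronization.
    for tid in dests:
        mailboxes[tid].put(bytes(msg))

payload = bytearray(b"hello")
send_shared(payload, ["t1", "t2"])
a, b = mailboxes["t1"].get(), mailboxes["t2"].get()   # same object twice
send_replicated(payload, ["t1", "t2"])
c, d = mailboxes["t1"].get(), mailboxes["t2"].get()   # independent copies
```

The abstract's third approach combines features of both; the details of that combination are given in the paper itself.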
We present practical parallel algorithms using prefix computations for various problems that arise in pairwise comparison of biological sequences. We consider both constant and affine gap penalty functions, full-sequence and subsequence matching, and space-saving algorithms. The best known sequential algorithms solve these problems in O(mn) time and O(m+n) space, where m and n are the lengths of the two sequences. All the algorithms presented in this paper are time-optimal with respect to the best known sequential algorithms and can use O(n/log n) processors, where n is the length of the larger sequence. While optimal parallel algorithms for many of these problems are known, we use a simple framework and demonstrate how these problems can be solved systematically using repeated parallel prefix operations. We also present a space-saving algorithm that uses O(m+n/p) space and runs in optimal time, where p is the number of processors used.
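The primitive this framework reuses is an ordinary inclusive parallel prefix (scan) over an associative operator. A minimal sketch, with the log-depth Hillis–Steele schedule written out sequentially (each round's list comprehension is what would run in parallel across processors):

```python
def inclusive_scan(xs, op):
    """Inclusive prefix over an associative operator `op`.
    Runs ceil(log2(n)) rounds; within a round, every element can be
    updated independently, which is the parallel step."""
    result = list(xs)
    step = 1
    while step < len(result):
        result = [
            result[i] if i < step else op(result[i - step], result[i])
            for i in range(len(result))
        ]
        step *= 2
    return result

sums = inclusive_scan([1, 2, 3, 4], lambda a, b: a + b)   # running sums
peaks = inclusive_scan([3, 1, 4, 1, 5], max)              # running maxima
```

Swapping in operators such as max over (score, position) pairs is the kind of reuse that lets one scan routine serve the different alignment recurrences the paper considers.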
OptSolve++ is a set of C++ class libraries for nonlinear optimization and root-finding. The primary components include TxOptSlv (optimizer and solver classes), TxFunc (functor classes used to wrap user- defined functi...
In this work we present a procedure for automatic parallel code generation in the case of algorithms described through Set of Affine Recurrence Equations (SARE); starting from the original SARE description in an N-dimensional iteration space, the algorithm is converted into a parallel code for an m-dimensional distributed memory parallel machine (m...