ISBN: (Print) 0769521320
This paper examines the performance of distributed-shared-memory (DSM) systems based on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) using queuing network models and develops theoretical results which predict processor utilization, message latency and other useful measures. It also presents simulation results which compare the performance of the SOME-Bus, the crossbar and the torus using queuing network models. The SOME-Bus is a broadcast-based, fiber-optic interconnection network which contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. The effect of collective communications due to cache coherence is examined. Results reveal that the performance of the SOME-Bus interconnection network is the least affected by large communication times, compared to the other two architectures considered here. Even in the presence of intense coherence traffic, processor utilization and message latency are much less affected than in the other architectures.
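For illustration only, the following sketch shows the kind of queuing-network estimate such an analysis produces: each node alternates between computing and waiting on a channel modeled here as an M/M/1 queue, and a fixed-point iteration couples the request rate to the resulting latency. The closed-form channel model and all parameter names are assumptions of this sketch, not the paper's actual equations.

# Illustrative queuing sketch (not the paper's actual model): each node
# alternates between computing for a mean time T and waiting for a remote
# request whose latency is modeled by an M/M/1 channel.

def utilization_and_latency(T, service_time, iters=200):
    """Fixed-point estimate of processor utilization and message latency.

    T            -- mean compute time between remote requests
    service_time -- mean channel service time (1 / mu)
    """
    mu = 1.0 / service_time
    latency = service_time                       # initial guess: no queuing delay
    for _ in range(iters):
        utilization = T / (T + latency)          # fraction of time spent computing
        lam = utilization / T                    # resulting request rate per node
        if lam >= mu:                            # channel saturated
            return 0.0, float("inf")
        latency = 1.0 / (mu - lam)               # M/M/1 response time
    return utilization, latency

if __name__ == "__main__":
    for S in (0.1, 0.5, 1.0):                    # growing communication time
        u, l = utilization_and_latency(T=1.0, service_time=S)
        print(f"service={S:.1f}  utilization={u:.3f}  latency={l:.3f}")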
ISBN: (Print) 0769521320
Most task allocation models and algorithms for distributed computing systems (DCS) require a priori knowledge of a task's execution time on the processing nodes. Since the task assignment is not known in advance, this time is quite difficult to estimate. We propose a cluster-based dynamic allocation scheme for a distributed computing system which eliminates this requirement. Further, as opposed to the single-task allocation generally proposed in most models, we consider multiple tasks. A fuzzy function is used for both module clustering and processor clustering. Dynamic invocation of clustering and assignment is considered. Experimental results show the efficacy of the proposed model.
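The abstract does not give the fuzzy function, so the sketch below uses a standard fuzzy c-means style membership as a stand-in to show how a module's soft assignment to module/processor clusters can be computed; the distances, fuzzifier and function names are illustrative assumptions.

# Minimal fuzzy-membership sketch for module clustering (illustrative only;
# the paper's fuzzy function and features are not specified in the abstract).

def fuzzy_memberships(distances, m=2.0):
    """Fuzzy c-means style membership of one module in each cluster.

    distances -- distances from the module to each cluster centre
    m         -- fuzzifier (> 1); larger m gives softer assignments
    """
    members = []
    for d_i in distances:
        if d_i == 0.0:                           # module coincides with a centre
            return [1.0 if d == 0.0 else 0.0 for d in distances]
        s = sum((d_i / d_j) ** (2.0 / (m - 1.0)) for d_j in distances)
        members.append(1.0 / s)
    return members

# Example: one module compared against three clusters.
print(fuzzy_memberships([0.5, 1.0, 2.0]))        # highest membership in cluster 0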
ISBN: (Print) 0769521320
Computational Grids have been proposed as the next-generation computing platform for solving large-scale problems in science, engineering, and commerce. There is enormous interest in applications, called Grid Workflows, in which a number of otherwise independent programs are run in a "pipeline". In practice, a number of different mechanisms can be used to couple the models, ranging from loosely coupled file-based IO to tightly coupled message passing. In this paper we propose a flexible IO architecture that provides a wide range of mechanisms for building Grid Workflows without any source code modification and without fixing the coupling mechanism at design time. Further, the architecture works with legacy applications. We evaluate the performance of our prototype system using a workflow in computational mechanics.
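As a rough illustration of run-time selectable coupling (not the paper's architecture or API), the sketch below lets the same writer code stream a stage's output either to a file or to a TCP socket, chosen by configuration rather than fixed at design time; the class and parameter names are invented for this example.

# Illustrative sketch: the coupling mechanism between two workflow stages is
# chosen at run time from configuration, while the writing code stays the same.

import socket

class WorkflowChannel:
    def __init__(self, mode, target):
        if mode == "file":                       # loosely coupled, file-based IO
            self._sink = open(target, "wb")
        elif mode == "socket":                   # tightly coupled streaming
            host, port = target
            self._sink = socket.create_connection((host, port)).makefile("wb")
        else:
            raise ValueError(f"unknown coupling mode: {mode}")

    def write(self, data: bytes):
        self._sink.write(data)

    def close(self):
        self._sink.close()

# The upstream model writes the same way whichever coupling is configured:
#   chan = WorkflowChannel("file", "stage1_output.dat")
#   chan = WorkflowChannel("socket", ("downstream-host", 9000))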
ISBN: (Print) 0769521320
Pricing of derivatives is one of the central problems in Computational Finance. Since the theory of derivative pricing is highly mathematical, numerical techniques such as the binomial lattice, finite differencing and the fast Fourier transform (FFT), among others, have been used for derivative or option pricing. Based on recent work on FFT for VLSI circuits, in the current work we develop a parallel algorithm which improves data locality and hence reduces communication overheads. Our main aim is to study the performance of this algorithm. Compared to the traditional butterfly network, the algorithm with a data-swap network performs more than 15% better for large data sizes.
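Since the abstract omits algorithmic detail, the following is only a minimal serial radix-2 FFT with an explicit bit-reversal ("data swap") stage, included to show the data reordering that a data-swap network exploits; it is not the paper's parallel algorithm.

# Minimal iterative radix-2 FFT: bit-reversal permutation followed by
# butterfly stages.  Serial and illustrative only.

import cmath

def fft(x):
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    a = list(x)
    # Bit-reversal permutation: the data "swap" step.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages.
    length = 2
    while length <= n:
        w_len = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            w = 1.0 + 0j
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w
                a[k], a[k + length // 2] = u + v, u - v
                w *= w_len
        length <<= 1
    return a

print(fft([1, 1, 1, 1, 0, 0, 0, 0])[0])          # DC term: (4+0j)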
The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power ...
ISBN: (Print) 0769521320
There is today an increasing diversity of parallel execution platforms, and solving a target problem with a single algorithm is not always efficient on every platform. We present in this paper a poly-algorithmic approach for selecting the most suitable algorithm among several candidates for a given problem size and the available resources. Our principal objective is to illustrate this approach on the well-known matrix multiplication problem, one of the most important basic numerical kernels. More precisely, we propose a poly-algorithm which combines the advantages of standard and fast algorithms and automatically chooses the suitable algorithm for computing a matrix multiplication of any dimension on a particular parallel system. We target homogeneous clusters of PCs and provide some experiments.
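A minimal sketch of the selection idea, assuming simple flop-count cost models for a standard O(n^3) multiply and a Strassen-like fast multiply; the constants and the selection rule are illustrative, not the paper's calibrated models.

# Illustrative poly-algorithm selector: pick the multiplication algorithm whose
# simple cost model predicts the lower run time for the given matrix size.

import math

def predicted_costs(n, flop_rate, fast_overhead=6.0):
    """Rough time models for standard vs. Strassen-like multiplication."""
    standard = 2.0 * n ** 3 / flop_rate
    fast = fast_overhead * n ** math.log2(7) / flop_rate   # ~n^2.807, larger constant
    return {"standard": standard, "fast": fast}

def select_algorithm(n, flop_rate=1e9):
    costs = predicted_costs(n, flop_rate)
    return min(costs, key=costs.get)

for n in (128, 512, 2048, 8192):
    print(n, select_algorithm(n))                # small sizes pick "standard"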
ISBN: (Print) 0769521320
In this paper, we design parallel Monte Carlo algorithms for the Ising spin model on a hierarchical cluster. A hierarchical cluster can be considered as a cluster of homogeneous nodes which are partitioned into multiple supernodes, such that communication across the homogeneous subclusters is represented by a supernode topological network. We consider different data layouts and provide equations for choosing the best data layout under such a network paradigm. We show that data layouts designed for a homogeneous cluster will not yield as good results as layouts designed for a hierarchical cluster. We derive theoretical results for the performance of the algorithms on a modified version of the LogP model that represents such tiered networking, and present simulation results to analyze the utility of the theoretical design and analysis. Furthermore, we consider the 3-D Ising model and design parallel algorithms with sweep spin selection for it on both homogeneous and hierarchical clusters.
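For context, the sketch below is a plain serial Metropolis sweep over a 2-D Ising lattice, i.e. the kernel whose data the parallel layouts distribute across supernodes; the lattice size, temperature and update details are illustrative assumptions.

# Serial Metropolis sweep for a 2-D Ising lattice with periodic boundaries.

import math
import random

def metropolis_sweep(spins, beta):
    """One full sweep over an L x L lattice (J = 1, zero field)."""
    L = len(spins)
    for i in range(L):
        for j in range(L):
            s = spins[i][j]
            neighbours = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j] +
                          spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
            dE = 2.0 * s * neighbours            # energy change if this spin flips
            if dE <= 0 or random.random() < math.exp(-beta * dE):
                spins[i][j] = -s

L, beta = 16, 0.44                               # near the 2-D critical point
spins = [[random.choice((-1, 1)) for _ in range(L)] for _ in range(L)]
for _ in range(100):
    metropolis_sweep(spins, beta)
print(sum(sum(row) for row in spins) / L ** 2)   # magnetisation per spin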
ISBN: (Print) 0769521320
The automatic parallelization of loops that contain complex computations is still a challenge for current parallelizing compilers. The main limitations are related to the analysis of expressions that contain subscripted subscripts, and to the analysis of conditional statements that introduce complex control flow at run-time. We use the term complex loop to designate loops with such characteristics. In this paper, we focus on the generation of parallel code for sequential complex loop nests using a generic compiler framework (proposed in an earlier paper [3]) that accomplishes kernel recognition through the analysis of the Gated Single Assignment program representation. Specifically, we present an extension of this framework that enables its use as a powerful tool for gathering source code information relevant to the parallelization of each computational kernel. A set of example codes is analyzed in detail to illustrate the potential of our approach. Experimental results using a benchmark suite of complex loop nests are also presented.
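The short example below shows, in Python purely for illustration, the kind of complex loop such frameworks target: an irregular reduction with a subscripted subscript and a conditional in the loop body.

# Example of a "complex loop": a histogram-style irregular reduction whose
# parallelization requires proving the subscripted update is a reduction.

def irregular_reduction(values, index, nbins):
    hist = [0.0] * nbins
    for k in range(len(values)):
        b = index[k]                             # subscripted subscript: hist[index[k]]
        if values[k] > 0.0:                      # conditional complicating the control flow
            hist[b] += values[k]
    return hist

# It can be parallelized by giving each worker a private copy of `hist`
# and combining the partial histograms afterwards.
print(irregular_reduction([1.0, -2.0, 3.0, 4.0], [0, 1, 1, 0], 2))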
ISBN: (Print) 0769521320
In this paper, we discuss an efficient parallel implementation of the treecode Ewald method for the fast evaluation of long-range Coulomb interactions in a periodic system for molecular dynamics simulations. The parallelization is based on an adaptive decomposition scheme using the Morton order of the particles. This decomposition scheme takes advantage of data locality and requires minimal changes to the original sequential code. The Message Passing Interface (MPI) is used for inter-processor communication, making the code portable to a variety of parallel computing platforms. We also discuss communication and performance models for our parallel algorithm; the communication time and parallel performance predicted by these models match the measured results well. Timing results obtained using a system of water molecules on the IA32 Cluster at the Ohio Supercomputer Center demonstrate the high speedup and efficiency of the parallel treecode Ewald method.
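The sketch below illustrates the Morton (Z-order) key computation that such an adaptive decomposition can be built on: sorting particles by key and cutting the sorted list into equal chunks gives each process a spatially compact subset. The bit depth, unit box and function names are assumptions of this sketch, not taken from the paper.

# Illustrative 3-D Morton (Z-order) keys for locality-preserving decomposition.

def interleave_bits(x, bits=10):
    """Spread the low `bits` bits of x so they occupy every third position."""
    out = 0
    for b in range(bits):
        out |= ((x >> b) & 1) << (3 * b)
    return out

def morton_key(px, py, pz, bits=10, box=1.0):
    """Map a position in [0, box)^3 to a 3-D Morton key."""
    scale = (1 << bits) / box
    ix, iy, iz = int(px * scale), int(py * scale), int(pz * scale)
    return (interleave_bits(ix, bits) |
            (interleave_bits(iy, bits) << 1) |
            (interleave_bits(iz, bits) << 2))

# Nearby particles end up adjacent in the sorted order.
particles = [(0.1, 0.2, 0.3), (0.11, 0.21, 0.29), (0.9, 0.9, 0.9)]
print(sorted(particles, key=lambda p: morton_key(*p)))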
ISBN: (Print) 0769521320
This paper describes a novel parallel algorithm that implements a dense matrix multiplication operation with algorithmic efficiency equivalent to that of Cannon's algorithm. It is suitable for clusters and scalable shared memory systems. The current approach differs from other parallel matrix multiplication algorithms by its explicit use of shared memory and remote memory access (RMA) communication rather than message passing. Experimental results on clusters (IBM SP, Linux-Myrinet) and shared memory systems (SGI Altix, Cray X1) demonstrate consistent performance advantages over pdgemm from the ScaLAPACK/PBBLAS suite, the leading implementation of parallel matrix multiplication in use today. In the best case on the SGI Altix, the new algorithm performs 20 times better than pdgemm for a matrix size of 1000 on 128 processors. The impact of zero-copy nonblocking RMA communication and shared memory communication on matrix multiplication performance on clusters is also investigated.
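As a hedged illustration of the get-based, owner-computes idea (not the paper's implementation), the sketch below fetches the A and B blocks needed for one C block through a stand-in rma_get; real code would use one-sided RMA or shared memory access instead of the dictionary used here.

# Blocked, owner-computes matrix multiplication with simulated one-sided gets.

import numpy as np

def rma_get(remote_store, key):
    """Stand-in for a one-sided get of a remote block (here just a copy)."""
    return remote_store[key].copy()

def owner_computes_block(remote_A, remote_B, i, j, nblocks):
    """Compute C block (i, j) by fetching the A and B blocks it needs."""
    c_block = None
    for k in range(nblocks):
        a = rma_get(remote_A, (i, k))            # could be overlapped with compute
        b = rma_get(remote_B, (k, j))
        c_block = a @ b if c_block is None else c_block + a @ b
    return c_block

# Toy example: a 2 x 2 grid of 2 x 2 blocks.
n, nb, bs = 4, 2, 2                              # matrix size, blocks per dim, block size
A, B = np.arange(16.0).reshape(n, n), np.eye(n)
def blocks(M):
    return {(i, j): M[i*bs:(i+1)*bs, j*bs:(j+1)*bs] for i in range(nb) for j in range(nb)}
C00 = owner_computes_block(blocks(A), blocks(B), 0, 0, nb)
print(np.allclose(C00, A[:bs, :bs]))             # True: B is the identity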