This paper presents the parallel implementation of a boundary element code for the solution of 2D elastostatic problems using linear elements. The original code is described in detail in a reference text in the area [Boundary Element Techniques: Theory and Applications in Engineering, 1984]. The Fortran code is reviewed and rewritten to run on shared and distributed memory systems using standard and portable libraries: OpenMP, LAPACK and ScaLAPACK. The implementation process provides guidelines for developing parallel applications of the Boundary Element Method, applicable to many science and engineering problems. Numerical experiments on an SGI Origin 2000 show the effectiveness of the proposed approach. (C) 2004 Elsevier Ltd. All rights reserved.
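As a rough illustration of the distributed-memory side of such a parallelization, the sketch below partitions the assembly of the dense BEM influence matrix by blocks of rows (collocation points) across MPI ranks. assemble_row(), the dummy coefficients, and the problem size are illustrative placeholders, and the paper itself hands the distributed system to ScaLAPACK rather than to the hand-rolled layout shown here.

```c
/* Hypothetical sketch: row-block distribution of the dense BEM influence
 * matrix across MPI ranks.  Every row depends only on the (replicated)
 * mesh geometry, so the assembly itself needs no communication. */
#include <mpi.h>
#include <stdlib.h>

/* Stand-in for the boundary integration that fills one matrix row. */
static void assemble_row(int i, int n, double *row)
{
    for (int j = 0; j < n; ++j)
        row[j] = (i == j) ? n : 1.0 / (1.0 + abs(i - j));
}

int main(int argc, char **argv)
{
    int rank, size, n_nodes = 4096;           /* illustrative problem size */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Contiguous block of rows owned by this rank. */
    int rows_per_rank = (n_nodes + size - 1) / size;
    int first = rank * rows_per_rank;
    int last  = first + rows_per_rank;
    if (first > n_nodes) first = n_nodes;
    if (last  > n_nodes) last  = n_nodes;

    double *local_H = malloc((size_t)(last - first) * n_nodes * sizeof *local_H);

    for (int i = first; i < last; ++i)
        assemble_row(i, n_nodes, &local_H[(size_t)(i - first) * n_nodes]);

    /* ... pass the distributed matrix to a parallel solver (the paper
     * uses ScaLAPACK, with its own block-cyclic layout) ... */

    free(local_H);
    MPI_Finalize();
    return 0;
}
```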
Large-scale parallelized distributed computing has been implemented in the message passing interface (MPI) environment to numerically solve eight reaction-diffusion equations representing the anatomy and treatment of breast cancer. The numerical algorithm is perturbed functional iterations (PFI), which is completely matrix-free. Fully distributed computations with multiple processors have been implemented on a large scale in the serial PFI code in the MPI environment. The technique of implementation is general and can be applied to any serial code. This has been validated by comparing the computed results from the serial code with those from the MPI version of the parallel code.
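The abstract does not give the PFI iteration itself, but the distributed, matrix-free structure it describes follows the familiar slab-decomposition pattern sketched below: each rank owns a strip of grid points and exchanges one-cell halos with its neighbours before every sweep. The explicit update, the single 1D equation, and all constants are assumptions made purely for illustration; they are not the paper's eight-equation model or its PFI scheme.

```c
/* Minimal sketch of a distributed, matrix-free update: 1D domain
 * decomposition with ghost-cell (halo) exchange between neighbours. */
#include <mpi.h>
#include <stdlib.h>

#define NLOC 1000          /* interior points per rank (assumed)   */
#define DT   1e-5          /* time step (assumed, stable)          */
#define DX   1e-2          /* grid spacing (assumed)               */
#define DIFF 1.0           /* diffusion coefficient (assumed)      */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* u[0] and u[NLOC+1] are ghost cells. */
    double *u    = calloc(NLOC + 2, sizeof *u);
    double *unew = calloc(NLOC + 2, sizeof *unew);
    for (int i = 1; i <= NLOC; ++i) u[i] = 0.1;   /* dummy initial data */

    for (int step = 0; step < 1000; ++step) {
        /* Halo exchange with both neighbours before the sweep. */
        MPI_Sendrecv(&u[1],      1, MPI_DOUBLE, left,  0,
                     &u[NLOC+1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOC],   1, MPI_DOUBLE, right, 1,
                     &u[0],      1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Matrix-free update; the reaction term u*(1-u) is a stand-in. */
        for (int i = 1; i <= NLOC; ++i)
            unew[i] = u[i] + DT * (DIFF * (u[i-1] - 2.0*u[i] + u[i+1]) / (DX*DX)
                                   + u[i] * (1.0 - u[i]));
        double *tmp = u; u = unew; unew = tmp;
    }

    free(u); free(unew);
    MPI_Finalize();
    return 0;
}
```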
Beowulf systems, and other proprietary approaches, are placing systems with four or more CPUs in the hands of many researchers and commercial users. In the near future, systems with hundreds of CPUs will become commonly available, with some programmers dealing with tens of thousands of CPUs. The debugging methods used on these systems are a combination of the traditional methods used for debugging single processes and ad-hoc methods to help the user cope with the multitude of processes. Programmers are usually familiar with a single-process debugger and would like to use it (with minimal user-visible extensions) to debug their distributed programs. We present a set of modifications to a traditional debugger that makes it capable of debugging applications running on thousands of processes. Our parallel debugger is composed of individual fully functional debuggers connected with an n-ary aggregating network. This permits us to present to users the results from each debugger at the same time in an aggregated fashion. Users get a global view of the application and can easily see if a given parameter has a value different from what they expect or from the other processes. Users can then focus on the process sets of interest and investigate the problem. One challenge when debugging thousands of processes is dealing with the amount of output coming from all the debuggers. We present methods to aggregate the overwhelming amount of output from the debuggers into a more manageable subset, which is presented to the user without losing information. Experiments show that the debugger is scalable to thousands of processors. Both the startup mechanism and the response time to user commands scale well. The conclusions presented regarding the architecture and the new parallel debugger's scalability are not specific to the serial debugger used in our example implementation. (C) 2004 Elsevier Inc. All rights reserved.
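A toy version of the output-aggregation idea is sketched below: every process contributes one line of "debugger output", and the root prints each distinct line once together with the ranks that produced it, so a single deviating process stands out. The flat MPI_Gather, the fixed message length, and the fake variable values are assumptions made for the sake of a short example; the actual tool aggregates hierarchically inside its n-ary tree network.

```c
/* Toy output aggregation: group identical per-process lines and report
 * each distinct line once with the list of ranks that emitted it. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MSGLEN 64

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pretend these are variable values reported by each debugger;
     * rank 3 deliberately differs from the others. */
    char line[MSGLEN];
    snprintf(line, MSGLEN, "x = %d", (rank == 3) ? 99 : 42);

    char *all = NULL;
    if (rank == 0) all = malloc((size_t)size * MSGLEN);
    MPI_Gather(line, MSGLEN, MPI_CHAR, all, MSGLEN, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        int *done = calloc(size, sizeof *done);
        for (int i = 0; i < size; ++i) {
            if (done[i]) continue;
            printf("%s  <- ranks:", &all[i * MSGLEN]);
            for (int j = i; j < size; ++j)
                if (!done[j] && strcmp(&all[i * MSGLEN], &all[j * MSGLEN]) == 0) {
                    printf(" %d", j);
                    done[j] = 1;
                }
            printf("\n");
        }
        free(done); free(all);
    }
    MPI_Finalize();
    return 0;
}
```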
We present solutions to statically load-balance scatter operations in parallel codes run on grids. Our load-balancing strategy is based on modifying the data distributions used in scatter operations. We study the replacement of scatter operations with parameterized scatters, allowing custom distributions of data. The paper presents: (1) a general algorithm which finds an optimal distribution of data across processors; (2) a quicker heuristic with performance guarantees, relying on hypotheses about the communications and computations; (3) a policy on the ordering of the processors. Experimental results with an MPI scientific code illustrate the benefits obtained from our load-balancing strategy. (C) 2004 Elsevier B.V. All rights reserved.
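The core of such a parameterized scatter can be expressed directly with MPI_Scatterv, as in the minimal sketch below: the per-processor counts are made proportional to assumed relative processor speeds instead of being uniform. The speed values, the item count, and the remainder policy are all illustrative; the paper's algorithm and heuristic compute the distribution (and the processor ordering) far more carefully.

```c
/* Replace a uniform MPI_Scatter with a parameterized MPI_Scatterv whose
 * counts follow the (assumed) relative speeds of the processors. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int N = 100000;                 /* total number of data items */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumed relative speed of each processor (dummy values). */
    double *speed = malloc(size * sizeof *speed);
    double total = 0.0;
    for (int p = 0; p < size; ++p) { speed[p] = 1.0 + p % 3; total += speed[p]; }

    /* Counts proportional to speed; displacements are the running sum. */
    int *counts = malloc(size * sizeof *counts);
    int *displs = malloc(size * sizeof *displs);
    int assigned = 0;
    for (int p = 0; p < size; ++p) {
        counts[p] = (int)(N * speed[p] / total);
        displs[p] = assigned;
        assigned += counts[p];
    }
    counts[size - 1] += N - assigned;     /* remainder goes to the last rank */

    double *sendbuf = NULL;
    if (rank == 0) sendbuf = calloc(N, sizeof *sendbuf);
    double *recvbuf = malloc(counts[rank] * sizeof *recvbuf);

    MPI_Scatterv(sendbuf, counts, displs, MPI_DOUBLE,
                 recvbuf, counts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* ... each rank now works on counts[rank] items ... */

    free(speed); free(counts); free(displs); free(recvbuf);
    if (rank == 0) free(sendbuf);
    MPI_Finalize();
    return 0;
}
```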
The execution of a client/server application involving database access requires a sequence of database transaction events (or, T-events), called a transaction sequence (or, T-sequence). A client/server database application may have nondeterministic behavior in that multiple executions thereof with the same input may produce different T-sequences. We present a framework for testing all possible T-sequences of a client/server database application. We first show how to define a T-sequence in order to provide sufficient information to detect race conditions between T-events. Second, we design algorithms to change the outcomes of race conditions in order to derive race variants, which are prefixes of other T-sequences. Third, we develop a prefix-based replay technique for race variants derived from T-sequences. We prove that our framework can derive all the possible T-sequences in cases where every execution of the application terminates. A formal proof and an analysis of the proposed framework are given. We describe a prototype implementation of the framework and present experimental results obtained from it.
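To make the notion of a race variant concrete, the toy sketch below reduces a T-event to the issuing client and the data item it touches, flags two events of different clients on the same item as racing, and emits, for each race, the prefix of the recorded T-sequence with the later racing event scheduled in place of the earlier one. Every structure and the race rule itself are invented for illustration and are much simpler than the framework's actual definitions and replay mechanism.

```c
/* Toy derivation of race variants (prefixes for controlled replay)
 * from one recorded T-sequence. */
#include <stdio.h>

typedef struct { int client; char item; } TEvent;

/* Toy race rule: different clients touching the same data item. */
static int races(TEvent a, TEvent b)
{
    return a.client != b.client && a.item == b.item;
}

int main(void)
{
    /* A recorded T-sequence from one execution. */
    TEvent seq[] = { {1, 'x'}, {2, 'y'}, {2, 'x'}, {1, 'y'} };
    int n = 4;

    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (races(seq[i], seq[j])) {
                /* Variant: prefix seq[0..i-1] followed by seq[j]. */
                printf("variant for race (%d,%d):", i, j);
                for (int k = 0; k < i; ++k)
                    printf(" (c%d,%c)", seq[k].client, seq[k].item);
                printf(" (c%d,%c)\n", seq[j].client, seq[j].item);
            }
    return 0;
}
```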
With the progress of research on cluster computing, many universities have begun to offer various courses covering cluster computing. A wide variety of content can be taught in these courses. Because of this variation, a difficulty that arises is the selection of appropriate course material. The selection is complicated because some content in cluster computing may also be covered by other courses in the undergraduate curriculum, and the background of students enrolled in cluster computing courses varies. These aspects of cluster computing make the development of good course material difficult. Combining experiences in teaching cluster computing at universities in the United States and Australia, this paper presents prospective topics in cluster computing and a wide variety of information sources from which instructors can choose. The course material is described in relation to the knowledge units of the joint IEEE Computer Society and Association for Computing Machinery (ACM) Computing Curricula 2001, and includes system architecture, parallel programming, algorithms, and applications. Instructors can select units in each of the topical areas and develop their own syllabi to meet course objectives. The authors share their experiences in teaching cluster computing and the topics chosen, depending on course objectives.
The present paper introduces the main steps towards the parallelization of existing boundary element codes, using standard and portable libraries for writing shared memory parallel programs: OpenMP and LAPACK. Parallel programming techniques can have a great impact on application performance, and OpenMP facilitates these improvements. Since such procedures are not widespread among BEM practitioners, the authors introduce these techniques into a well-known BEM program, described in detail by Brebbia and Dominguez [Boundary Elements: An Introductory Course. CMP, Southampton, 1992]. The code is herein reviewed and rewritten to achieve high performance on shared memory systems. The step-by-step implementation process provides guidelines to develop efficient parallel BEM codes, applicable to many science and engineering problems. Numerical experiments on an SGI Origin 2000 and a NEC SX-6 show the effectiveness of the proposed approach. (C) 2004 Elsevier Ltd. All rights reserved.
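A minimal sketch of that shared-memory pattern is given below: the collocation-point loop that fills the dense system is embarrassingly parallel and gets an OpenMP directive, after which the system is handed to LAPACK (dgesv, called here through LAPACKE from C rather than from the original Fortran). fill_row() and its coefficients are placeholders for the real element integration.

```c
/* OpenMP-parallel assembly of a dense system followed by a LAPACK solve. */
#include <omp.h>
#include <lapacke.h>
#include <stdlib.h>

/* Stand-in for the boundary integration that fills one row and its RHS. */
static void fill_row(int i, int n, double *row, double *rhs_i)
{
    for (int j = 0; j < n; ++j)
        row[j] = (i == j) ? n : 1.0 / (1.0 + abs(i - j));
    *rhs_i = 1.0;
}

int main(void)
{
    int n = 2000;                                    /* illustrative size */
    double *A = malloc((size_t)n * n * sizeof *A);   /* row-major storage */
    double *b = malloc(n * sizeof *b);
    lapack_int *ipiv = malloc(n * sizeof *ipiv);

    /* Rows are independent, so the assembly loop parallelizes directly. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        fill_row(i, n, &A[(size_t)i * n], &b[i]);

    /* LU factorization and solve via LAPACK. */
    LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);

    free(A); free(b); free(ipiv);
    return 0;
}
```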
Processor idling due to communication delays and load imbalances is among the major factors that affect the performance of parallel programs. The need to optimize performance often forces programmers to sacrifice modularity. This paper focuses on the performance benefits of message-driven execution, particularly for large parallel programs composed of multiple libraries and modules. We examine message-driven execution in the context of a parallel object-based language, but the analysis applies to other models, such as multithreading, as well. We argue that modularity and efficiency, in the form of overlapping communication latencies and processor idle times, can be achieved much more easily in message-driven execution than in the message-passing SPMD style. Message-driven libraries are easier to compose into larger programs, and they do not require one to sacrifice performance in order to break a program into multiple modules. One can overlap the idle times across multiple independent modules. We demonstrate the performance and modularity benefits of message-driven execution with simulation studies, and we show why it is not adequate to emulate message-driven execution with the message-passing SPMD style. During these studies, it became clear that the usual criteria of minimizing the completion time and reducing the critical path that are used in SPMD programs are not exactly suitable for message-driven programs. (C) 2004 Elsevier Inc. All rights reserved.
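The scheduling idea can be caricatured in a few lines of C, as below: work from two independent "modules" arrives as messages that carry their own handlers, and a single scheduler loop simply dispatches whatever is available next, so one module's wait for remote data is filled with the other module's work. This is a conceptual sketch only, not the object-based runtime studied in the paper.

```c
/* Toy message-driven scheduler: messages carry their own handlers and
 * are dispatched in arrival order, interleaving independent modules. */
#include <stdio.h>

typedef struct Msg {
    void (*handler)(int);
    int payload;
    struct Msg *next;
} Msg;

static Msg *head = NULL, *tail = NULL;

static void enqueue(Msg *m)
{
    m->next = NULL;
    if (tail) tail->next = m; else head = m;
    tail = m;
}

static void module_a(int x) { printf("A handles %d\n", x); }
static void module_b(int x) { printf("B handles %d\n", x); }

int main(void)
{
    /* Messages "arrive" interleaved, as remote replies for A and B
     * would come back at different times. */
    Msg msgs[] = { {module_a, 1}, {module_b, 10}, {module_b, 11}, {module_a, 2} };
    for (int i = 0; i < 4; ++i) enqueue(&msgs[i]);

    /* The scheduler never blocks on a particular module: it simply
     * dispatches the next available message. */
    while (head) {
        Msg *m = head;
        head = head->next;
        if (!head) tail = NULL;
        m->handler(m->payload);
    }
    return 0;
}
```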
The FETI method with the natural coarse grid is combined with the penalty method to develop an efficient solver for elliptic variational inequalities. A proof is given that a prescribed bound on the norm of the feasibility of the solution may be achieved with a value of the penalty parameter that does not depend on the discretization parameter, and that an approximate solution with the prescribed bound on the violation of the Karush-Kuhn-Tucker conditions may be found in a number of steps that does not depend on the discretization parameter. Results of numerical experiments with the parallel solution of a model problem discretized by more than eight million nodal variables are in agreement with the theory and demonstrate numerically both the optimality of the penalty and the scalability of the algorithm presented. Copyright (C) 2004 John Wiley & Sons, Ltd.
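For orientation only, the generic shape of a penalty reformulation of such a constrained problem is sketched below; the quadratic objective, the feasible set K, and the distance-based penalty term are textbook notation, not the paper's actual FETI dual formulation with its natural coarse grid.

```latex
\[
  \min_{u \in K} \tfrac12\, u^{\top} A u - b^{\top} u
  \quad\longrightarrow\quad
  \min_{u} \tfrac12\, u^{\top} A u - b^{\top} u
           + \tfrac{\rho}{2}\,\operatorname{dist}(u, K)^{2}
\]
```

The result summarized above then says that the penalty parameter needed to reach a prescribed feasibility tolerance does not grow as the mesh is refined.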
Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA not only to large messages, but also to small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation achieves a latency of 6.8 μs for small messages and a peak bandwidth of 871 million bytes/sec. Performance evaluation shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22% compared with the original design. For large data transfers, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.
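Latency and bandwidth figures of this kind are typically obtained with a ping-pong microbenchmark such as the sketch below, which times round trips between two ranks. It measures whatever MPI library it is linked against and does not itself touch the InfiniBand verbs layer; the message size and iteration count are arbitrary.

```c
/* Ping-pong microbenchmark: one-way latency and bandwidth between
 * ranks 0 and 1.  Run with at least two MPI processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, iters = 1000, nbytes = 4;   /* small-message case */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char *buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0) {
        double one_way = (t1 - t0) / (2.0 * iters);          /* seconds */
        printf("latency: %.2f us, bandwidth: %.1f MB/s\n",
               one_way * 1e6, nbytes / one_way / 1e6);
        /* rerun with a large nbytes (e.g. 1 MB) for the bandwidth figure */
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```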