This paper presents a multi-level frontal algorithm, together with its implementation and applications in parallel computation. A multi-frontal program is given which may be used for unsymmetric finite element matrix equations. The parallel program is developed on a cluster of workstations. The PVM (Parallel Virtual Machine) system is used to handle communications among the networked workstations. The method has several advantages: the finite element mesh can be numbered in an arbitrary manner, the programming organisation is simple, and core requirements and computation times are reduced. An implementation of this parallel method on workstations is discussed. Numerical examples demonstrate the speedup and efficiency of the method and compare it with a general domain decomposition method based on band matrix methods.
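The abstract reports speedup and efficiency without defining them. The sketch below uses the conventional definitions (speedup as serial time divided by parallel time, efficiency as speedup divided by processor count); the function names and the timings in the usage note are illustrative assumptions, not values from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Conventional speedup: serial runtime divided by parallel runtime."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("timings must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, n_procs: int) -> float:
    """Conventional parallel efficiency: speedup divided by processor count."""
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return speedup(t_serial, t_parallel) / n_procs
```

For example, a hypothetical run that takes 120 s on one workstation and 40 s on four gives a speedup of 3.0 and an efficiency of 0.75.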
Software parallelization is required to contend with the increasing scale and complexity of High-Energy Physics experiments. The authors have developed a programming model, Communication Capability (CoCa), which allows this parallelization at several levels of granularity and reduces software complexity.
Based on the framework of BSP, a Hierarchical Bulk Synchronous Parallel (HBSP) performance model is introduced in this paper to capture the performance optimization problem for various stages in parallel program development and to accurately predict the performance of a parallel program by considering factors causing variance at local computation and global communication. The related methodology has been applied to several real applications and the results show that HBSP is a suitable model for optimizing parallel programs.
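For orientation, the cost formula of the standard (flat) BSP model on which HBSP builds can be written as follows. This is the textbook BSP cost, not the HBSP formulation from the paper, which the abstract does not give; the symbols $w_s$, $h_s$, $g$, $l$ and $S$ follow common BSP usage and are assumptions here.

```latex
% Cost of superstep s: the largest local computation w_s, plus an
% h_s-relation (at most h_s words sent or received by any processor)
% at g time units per word, plus the barrier synchronisation cost l.
T_s = w_s + h_s \, g + l
% Cost of a program with S supersteps.
T = \sum_{s=1}^{S} T_s = \sum_{s=1}^{S} w_s + g \sum_{s=1}^{S} h_s + S \, l
```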
Performance modeling for large industrial or scientific codes is of value for program tuning or for selection of new machines when benchmarking is not yet possible. We discuss an empirical method of estimating runtime for certain large parallel programs, in which computational work is estimated by regression functions based on measurements, and the time cost of communication is modeled by program analysis and benchmarks for communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on SP-2, T3E, and SX-4. The method is an economical way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases. (C) 1999 Elsevier Science B.V. All rights reserved.
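The two ingredients named in the abstract, a regression for computational work and a separate communication cost, can be sketched as below. This is a minimal illustration under stated assumptions: a straight-line least-squares fit stands in for the paper's regression functions, and a latency-plus-bandwidth formula stands in for its benchmarked communication primitives. All function and parameter names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost of one message: fixed latency plus transfer time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def predict_runtime(work_units, n_messages, bytes_per_message,
                    slope, intercept, latency_s, bandwidth_bytes_per_s):
    """Estimated runtime: fitted compute time plus modeled communication time."""
    compute = slope * work_units + intercept
    comm = n_messages * message_time(bytes_per_message, latency_s,
                                     bandwidth_bytes_per_s)
    return compute + comm
```

In use, `fit_line` would be fed a moderate number of measured (problem size, compute time) pairs from small runs, and `predict_runtime` would then extrapolate to a larger case.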
The cost of communication in message-passing systems can only be computed based on a large number of low-level details. Consequently, the only architectural measure they naturally suggest is a first-order one, latency. We show that a second-order property, the standard deviation of the delivery times, is also of interest. Most importantly, the average performance of a large communication system depends not only on the average performance of its components, but also on the standard deviation of these performances. In other words, building a high-performance system requires components that are themselves high-performance, but their performance must also have small variance. We illustrate this effect using distributions of the BSP g parameter. Lower bounds on the time per unit transfer of communication in large systems can be derived from data measured over single links. (C) 1999 Elsevier Science B.V. All rights reserved.
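One way to see why variance matters is a small Monte Carlo experiment. The sketch below assumes that a communication phase finishes only when the slowest of `n_links` deliveries completes and that delivery times are normally distributed; both are simplifying assumptions made for illustration, not the paper's model, and the function name is hypothetical.

```python
import random


def mean_phase_time(n_links, mean, std, trials=2000, seed=0):
    """Average, over many trials, of the slowest of n_links delivery times."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    total = 0.0
    for _ in range(trials):
        # The phase lasts as long as its slowest link.
        total += max(rng.gauss(mean, std) for _ in range(n_links))
    return total / trials


if __name__ == "__main__":
    for std in (0.0, 0.05, 0.2):
        print(std, round(mean_phase_time(64, 1.0, std), 3))
```

With the mean held at 1.0, the average phase time grows as the standard deviation grows, even though the average performance of each individual link is unchanged.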
A hybrid method for performance modeling of parallel programs is considered where the runtime of large sequential segments is estimated statically and the parallel program structure is evaluated by simulation. The present paper describes a way to generate a model of a given program automatically from the source code, where the user has to provide only values for a small number of variables. This model contains the control structure of the original program and timing information for generalized basic blocks. We consider Fortran programs which are parallelized using the message passing paradigm. A prototype of a tool for automatic model generation has been developed which is able to treat examples of moderate size. (C) 1999 Elsevier Science B.V. All rights reserved.
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present two profiling techniques for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program, and quantified the profiling overhead. We present several Split-C applications showing that the profiler is useful in determining performance bottlenecks. Copyright (C) 1999 John Wiley & Sons, Ltd.
High Performance Fortran (HPF) is a data-parallel language that provides a high-level interface for programming scientific applications, while delegating to the compiler the task of generating explicitly parallel message-passing programs. This paper provides an overview of HPF compilation and runtime technology for distributed-memory architectures, and deals with a number of topics in some detail. In particular, we discuss distribution and alignment processing, the basic compilation scheme and methods for the optimization of regular computations. A separate section is devoted to the transformation and optimization of independent loops with irregular data accesses. The paper concludes with a discussion of research issues and outlines potential future development paths of the language. (C) 1999 Elsevier Science B.V. All rights reserved.
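As a small illustration of what distribution processing has to compute, the sketch below maps a 0-based array index to its owning processor under HPF's two basic distribution formats, BLOCK and CYCLIC. The helper names are hypothetical, and real compilers also handle alignment, multi-dimensional arrays and block-cyclic formats, which are omitted here.

```python
def block_owner(index: int, n_elements: int, n_procs: int) -> int:
    """Owner of a 0-based index under BLOCK: contiguous chunks of ceil(n/p)."""
    block_size = -(-n_elements // n_procs)  # ceiling division
    return index // block_size


def cyclic_owner(index: int, n_procs: int) -> int:
    """Owner of a 0-based index under CYCLIC: elements dealt out round-robin."""
    return index % n_procs
```

For a 10-element array on 4 processors, BLOCK uses chunks of 3, so indices 0 to 2 belong to processor 0 and index 9 to processor 3, while CYCLIC places index 9 on processor 1.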
Simulated annealing is an effective method for solving large combinatorial optimisation problems. Because of its iterative nature, the annealing process requires a substantial amount of computation time. A new parallel implementation based on the concurrency control theory of database systems is presented; the parallelised annealing process is serialisable. Concurrent updates to the base solution are allowed provided that they do not conflict on data. Using the travelling salesman problem as the example application, the parallel simulated annealing algorithm is implemented on a Motorola Delta 3000 shared-memory multiprocessor system with eight processors. With a moderate problem size of 400 cities, a speedup efficiency of over 90% is achieved at high annealing temperature and close to 100% at a low annealing temperature.
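For readers unfamiliar with the underlying method, here is a minimal sequential simulated annealing sketch for the travelling salesman problem. It is not the paper's parallel, serialisable scheme: the segment-reversal move, the geometric cooling schedule and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def tour_length(tour, pts):
    """Length of the closed tour visiting pts in the given order."""
    n = len(tour)
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % n]]) for i in range(n))


def anneal(pts, t_start=1.0, t_end=1e-3, cooling=0.95, moves_per_temp=50, seed=0):
    """Sequential simulated annealing with segment-reversal moves."""
    rng = random.Random(seed)
    n = len(pts)
    tour = list(range(n))
    rng.shuffle(tour)
    cur_len = tour_length(tour, pts)
    best, best_len = tour[:], cur_len
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Propose reversing the segment between two random positions.
            i, j = sorted(rng.sample(range(n), 2))
            cand = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            delta = tour_length(cand, pts) - cur_len
            # Metropolis rule: always accept improvements, and accept
            # a worse tour with probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                tour, cur_len = cand, cur_len + delta
                if cur_len < best_len:
                    best, best_len = tour[:], cur_len
        t *= cooling  # geometric cooling schedule
    return best, tour_length(best, pts)
```

In a parallel variant along the lines the abstract describes, several such moves would be proposed concurrently and applied only when they touch disjoint parts of the tour; that conflict check is not shown here.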