The compilation of high-level programming languages for parallel machines faces two challenges: maximizing data/process locality and balancing load. No solutions for the general case are known that solve both problems at once. The present paper describes a programming model that makes it possible to solve both problems for the special case of neural network learning algorithms, even for irregular networks with dynamically changing topology (constructive neural algorithms). The model is based on the observation that such algorithms predominantly execute local operations (on nodes and connections of the network), reductions, and broadcasts. The model is concretized in an object-centered procedural language called CuPit. The language is completely abstract: no aspects of the parallel implementation, such as the number of processors, data distribution, process distribution, or execution model, are visible in user programs. The compiler can derive most of the information relevant for generating efficient code from unannotated source code. Therefore, CuPit programs are efficiently portable. A compiler for CuPit has been built for the MasPar MP-1/MP-2 using compilation techniques that can also be applied to most other parallel machines. The paper briefly presents the main ideas of the techniques used and the results obtained by the various optimizations.
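The abstract does not reproduce any CuPit syntax. As a rough illustration of the programming style it describes, the following minimal Python sketch shows node-local operations over an irregular network followed by a reduction and a broadcast; all names here (Node, forward, train_step) are hypothetical and are not taken from CuPit.

# Illustrative sketch only, not CuPit syntax: local operations on the nodes
# of an irregular network, a reduction over all nodes, and a broadcast of
# the reduced value back to every node.
class Node:
    def __init__(self):
        self.activation = 0.0
        self.error = 0.0
        self.scale = 1.0
        self.incoming = []                      # list of (source Node, weight)

    def forward(self):                          # purely node-local operation
        self.activation = sum(w * src.activation for src, w in self.incoming)

def train_step(nodes, target_error):
    for n in nodes:                             # local operation on every node
        n.forward()
    total_error = sum(n.error for n in nodes)   # reduction over the network
    for n in nodes:                             # broadcast of the reduced value
        n.scale = target_error / (total_error or 1.0)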
The mpC language is an ANSI C superset supporting modular parallel programming for distributed memory machines. It allows the user to dynamically specify an application topology, and the mpC programming environment uses this information at run time to provide the most efficient execution of the program on any particular distributed memory machine. The paper describes the features of mpC and its programming environment that allow them to be used for developing libraries of parallel programs.
The complexity of characterizing both parallel hardware and software makes it very difficult to explain and predict the performance of parallel programs for real industrial CFD applications. A performance model based on a generalized Amdahl's formulation has been developed and applied to a flow solver. The present formulation allows us to explain the behavior of a typical CFD explicit multiblock solver when the program is run on a multiprocessor distributed-memory system. Using this approach, it is possible to gain insight into the performance limitations of this class of parallel solvers by considering the impact of larger and larger numbers of processors on fixed-size problems. (C) 1999 Academic Press, Inc.
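The abstract does not reproduce the generalized formulation itself; for orientation, the classical Amdahl's law it extends reads, for a fixed-size problem with parallelizable fraction f on p processors,

    S(p) = \frac{T_1}{(1 - f)\,T_1 + f\,T_1 / p},

and a generalization of the kind described would add a processor-dependent communication term, e.g.

    S(p) = \frac{T_1}{(1 - f)\,T_1 + f\,T_1 / p + T_{comm}(p)},

which is what makes the speedup saturate or degrade for larger and larger numbers of processors on fixed-size problems. The exact terms used in the paper are not given here.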
This paper presents a multi-level frontal algorithm and its implementation and applications in parallel computation. A multi-frontal program is given which may be used for unsymmetric finite element matrix equations. The parallel program is developed on a cluster of workstations. The PVM (parallel virtual machine) system is used to handle communications among networked workstations. The method has advantages such as allowing the finite element mesh to be numbered in an arbitrary manner, a simple programming organisation, and reduced core-memory requirements and computation times. An implementation of this parallel method on workstations is discussed; its speedup and efficiency are demonstrated by numerical examples and compared with a general domain decomposition method based on band matrix methods.
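For reference, the speedup and efficiency reported in such comparisons normally follow the usual definitions (the abstract does not state which variant is used): with T_1 the runtime on one workstation and T_p the runtime on p workstations,

    S(p) = \frac{T_1}{T_p}, \qquad E(p) = \frac{S(p)}{p}.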
Software parallelization is required to contend with the increasing scale and complexity of High-Energy Physics experiments. The authors have developed a programming model, Communication Capability (CoCa), which allows this parallelization at several levels of granularity and reduces software complexity.
Based on the framework of BSP, a Hierarchical Bulk Synchronous Parallel (HBSP) performance model is introduced in this paper to capture the performance optimization problem at the various stages of parallel program development and to accurately predict the performance of a parallel program by considering factors causing variance in local computation and global communication. The related methodology has been applied to several real applications and the results show that HBSP is a suitable model for optimizing parallel programs.
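The abstract does not give the HBSP cost formula. The plain BSP model it builds on charges a superstep

    T = w + h \cdot g + l,

where w is the maximum local computation time on any processor, h the maximum number of words any processor sends or receives, g the per-word communication cost, and l the barrier synchronization cost; as described, HBSP refines this kind of estimate by additionally accounting for the factors that cause variance in the local-computation and global-communication terms.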
Performance modeling for large industrial or scientific codes is of value for program tuning or for the selection of new machines when benchmarking is not yet possible. We discuss an empirical method of estimating runtime for certain large parallel programs where computational work is estimated by regression functions fitted to measurements and the time cost of communication is modeled by program analysis and benchmarks of communication primitives. The method is demonstrated with the local weather model (LM) of the German Weather Service (DWD) on SP-2, T3E, and SX-4. The method is an economic way of developing performance models because only a moderate number of measurements is required. The resulting model is sufficiently accurate even for very large test cases. (C) 1999 Elsevier Science B.V. All rights reserved.
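As a rough sketch of the kind of hybrid estimate described, assuming a simple linear work model and a latency/bandwidth communication model (the regression form, coefficients, and communication benchmarks used for the LM are not given in the abstract), the approach might look like this in Python:

import numpy as np

# Fit a regression for computational work from a few measured runs.
grid_points  = np.array([1e4, 5e4, 1e5, 2e5])    # hypothetical problem sizes
comp_seconds = np.array([0.8, 4.1, 8.3, 16.4])   # hypothetical measured compute times
slope, intercept = np.polyfit(grid_points, comp_seconds, 1)

def predict_runtime(n_points, n_procs, halo_bytes, latency, bandwidth, n_steps):
    # Regressed compute time per processor plus a benchmarked
    # latency/bandwidth model for the communication per time step.
    compute = (slope * n_points / n_procs + intercept) * n_steps
    comm = (latency + halo_bytes / bandwidth) * n_steps
    return compute + comm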
The cost of communication in message-passing systems can only be computed from a large number of low-level details. Consequently, the only architectural measure they naturally suggest is a first-order one, latency. We show that a second-order property, the standard deviation of the delivery times, is also of interest. Most importantly, the average performance of a large communication system depends not only on the average performance of its components, but also on the standard deviation of these performances. In other words, building a high-performance system requires components that themselves deliver high performance, but whose performance also has small variance. We illustrate this effect using distributions of the BSP g parameter. Lower bounds on the time per unit transfer of communication in large systems can be derived from data measured over single links. (C) 1999 Elsevier Science B.V. All rights reserved.
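The role of the variance can be made concrete with a standard order-statistics argument (this illustration is not taken from the paper): if a communication phase has to wait for n independent deliveries with mean \mu and standard deviation \sigma, then for roughly Gaussian delivery times the expected time of the slowest one grows like

    E[\max] \approx \mu + \sigma \sqrt{2 \ln n},

so a larger \sigma inflates the effective per-unit transfer time of the whole system, and hence the measured BSP g, even when the mean per-link performance is unchanged.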
A hybrid method for performance modeling of parallel programs is considered where the runtime of large sequential segments is estimated statically and the parallel program structure is evaluated by simulation. The present paper describes a way to generate a model of a given program automatically from the source code, where the user has to provide only values for a small number of variables. This model contains the control structure of the original program and timing information for generalized basic blocks. We consider Fortran programs that are parallelized using the message-passing paradigm. A prototype of a tool for automatic model generation has been developed which is able to treat examples of moderate size. (C) 1999 Elsevier Science B.V. All rights reserved.
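The abstract does not show what a generated model looks like. A minimal Python sketch of the general shape (hypothetical block names and timings, not the tool's actual output): the control structure of the original program is kept, generalized basic blocks are replaced by statically estimated times, and message-passing calls are replaced by a cost model that is evaluated rather than executed.

# Hypothetical shape of an automatically generated performance model.
BLOCK_TIME = {"init": 0.02, "flux": 0.85, "update": 0.31}    # statically estimated seconds

def send_cost(nbytes, latency=2e-5, bandwidth=80e6):
    return latency + nbytes / bandwidth                       # simulated, not executed

def model_run(n_steps, halo_bytes, work_factor):              # user supplies a few variables
    t = BLOCK_TIME["init"]
    for _ in range(n_steps):                                   # original control structure kept
        t += BLOCK_TIME["flux"] * work_factor
        t += send_cost(halo_bytes)
        t += BLOCK_TIME["update"] * work_factor
    return t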
Fine-tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks. In this paper we present two profiling techniques for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program, and quantified the profiling overhead. We present several Split-C applications showing that the profiler is useful in determining performance bottlenecks. Copyright (C) 1999 John Wiley & Sons, Ltd.
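As an illustration of collecting cumulative profiling information with small perturbation of the running program (a generic Python sketch, not the Split-C profiler), per-site counters are kept in memory and all reporting is deferred until the end of the run:

import time
from collections import defaultdict

_counts  = defaultdict(int)
_elapsed = defaultdict(float)

def profiled(site):
    # Wrap a function so each call only updates two in-memory counters.
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                _counts[site]  += 1
                _elapsed[site] += time.perf_counter() - t0
        return inner
    return wrap

def report():
    # Deferred output: printed once, after the measured run is over.
    for site in sorted(_counts):
        print(site, _counts[site], round(_elapsed[site], 6))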