We deal with the computational aspects of a numerical method for solving the electric field integral equation (EFIE) for the analysis of the interaction of electromagnetic signals with thin-wire structures. Our interest is mainly to devise an efficient parallel implementation of this numerical method that helps physicists solve the electric field integral equation for very complex and large thin-wire structures. The development of this parallel implementation has been carried out on distributed memory multiprocessors, with the use of the parallel programming library MPI and routines of PETSc (portable, extensible toolkit for scientific computation). These routines can solve sparse linear systems in parallel. Appropriate data partitions have been designed in order to optimize the performance of the parallel implementation. A parameter named relative efficiency has been defined to compare two parallel executions with different numbers of processors. This parameter allows us to better describe the superlinear performance behavior of our parallel implementation. Evaluation of the parallel implementation is given in terms of the values of the speed-up and the relative efficiency. Moreover, a discussion of the memory requirements versus the number of processors is included. It will be shown that memory hierarchy management improves substantially as the number of processors increases and that this is the reason why superlinear speed-up is obtained.
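For reference, the standard speed-up and efficiency figures are S(p) = T(1)/T(p) and E(p) = S(p)/p. The abstract does not reproduce its formula for relative efficiency; a natural definition consistent with these quantities, comparing runs on p1 and p2 processors, would be:

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{T(1)}{p\,T(p)}, \qquad
E_{\mathrm{rel}}(p_1,p_2) = \frac{E(p_2)}{E(p_1)} = \frac{p_1\,T(p_1)}{p_2\,T(p_2)}
```

Under this assumed definition, a relative efficiency greater than 1 between two processor counts signals superlinear scaling, e.g. when the per-processor working set begins to fit in cache.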
This paper reviews the challenges in the design of emerging complex systems-on-a-chip (SoC) at STMicroelectronics, from the perspective of our customers' requirements. We then present an approach to effectively integrate heterogeneous parallel components - H/W or S/W - into a homogeneous programming environment. This approach, supported by ST's MultiFlex multi-processing SoC environment, allows for the combination of a range of heterogeneous processing elements, supported by high-level programming models. Two programming models are supported: a distributed system object component (DSOC) message-passing model, and a symmetric multi-processing (SMP) model using shared memory. To illustrate the concepts discussed in this paper, we have applied the MultiFlex technology to the mapping of a high-level MPEG4 video encoder (VGA resolution at 30 frames per second) onto a mixed multi-processor and hardware platform.
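As a rough, generic illustration of the contrast between the two models (emphatically not MultiFlex's own API), the sketch below expresses a one-frame producer/consumer handshake in the shared-memory (SMP) style using POSIX threads; in a DSOC-style message-passing model, the shared variable and condition variable would be replaced by an explicit send/receive between objects.

```c
/* Minimal SMP-style sketch: two threads share one datum directly.
   In a DSOC-style model the shared variable would instead travel
   in an explicit message; neither style here uses MultiFlex's API. */
#include <pthread.h>
#include <stdio.h>

static int frame = -1;                      /* shared state */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready = PTHREAD_COND_INITIALIZER;

static void *encoder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&m);
    while (frame < 0)                       /* wait for the producer */
        pthread_cond_wait(&ready, &m);
    printf("encoding frame %d\n", frame);
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, encoder, NULL);

    pthread_mutex_lock(&m);
    frame = 0;                              /* publish via shared memory */
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}
```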
ISBN (print): 9783540241287
The Monte Carlo (MC) method is a simple but effective way to perform simulations involving complicated or multivariate functions. The Quasi-Monte Carlo (QMC) method is similar but replaces independent and identically distributed (i.i.d.) random points by low-discrepancy points. Low-discrepancy points are regularly distributed points that may be deterministic or randomized. The digital net is a kind of low-discrepancy point set that is generated by number-theoretic methods. A software library for low-discrepancy point generation has been developed. It is thread-safe and supports MPI for parallel computation. A numerical example from physics is shown.
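A minimal, generic C sketch of the MC-versus-QMC contrast (this is not the library described in the paper): both estimators integrate f(x,y) = xy over the unit square (exact value 0.25), one with pseudo-random points and one with 2-D Halton low-discrepancy points built from the radical-inverse (van der Corput) sequence.

```c
#include <stdio.h>
#include <stdlib.h>

/* radical-inverse (van der Corput) sequence in a given base */
static double radical_inverse(unsigned n, unsigned base)
{
    double q = 0.0, bk = 1.0 / base;
    while (n > 0) {
        q  += (n % base) * bk;
        n  /= base;
        bk /= base;
    }
    return q;
}

static double f(double x, double y) { return x * y; }

int main(void)
{
    const unsigned N = 100000;
    double mc = 0.0, qmc = 0.0;

    srand(12345);
    for (unsigned i = 1; i <= N; i++) {
        mc  += f(rand() / (double)RAND_MAX, rand() / (double)RAND_MAX);
        /* 2-D Halton point: coprime bases 2 and 3, one per dimension */
        qmc += f(radical_inverse(i, 2), radical_inverse(i, 3));
    }
    printf("MC : %.6f\nQMC: %.6f\n", mc / N, qmc / N);
    return 0;
}
```

The QMC estimate typically lands closer to 0.25 at the same N, which is the point of replacing i.i.d. points by low-discrepancy ones.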
The strong focus of recent high-end computing efforts on performance has resulted in a low-level parallel programming paradigm characterized by explicit control over message passing in the framework of a fragmented programming model. In such a model, object code performance is achieved at the expense of productivity, conciseness, and clarity. This paper describes the design of Chapel, the Cascade High Productivity Language, which is being developed in the DARPA-funded HPCS project Cascade led by Cray Inc. Chapel pushes the state of the art in languages for HEC system programming by focusing on productivity, in particular by combining the goal of highest possible object code performance with the programmability offered by a high-level user interface. The design of Chapel is guided by four key areas of language technology: multithreading, locality-awareness, object-orientation, and generic programming. The Cascade architecture, which is being developed in parallel with the language, provides key architectural support for its efficient implementation.
ISBN (print): 9780780384309
The OpenMP API is an emerging standard for parallel programming on shared memory multiprocessors. In order to run an OpenMP program on a cluster, one feasible scheme is to translate the OpenMP program into a software DSM program, then execute it on the cluster. Evaluating the performance of OpenMP programs and analyzing their behavior will help support the OpenMP programming model on a software DSM cluster efficiently. In this paper, we use an experimental approach to investigate how the characteristics of the software DSM cluster and the translation together with the original program behavior determine the performance of OpenMP programs on software DSM clusters.
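For concreteness, here is a typical OpenMP kernel of the kind such studies translate (a generic illustration, not taken from the paper's benchmark suite): on an SMP, the sharing of a, b, and the reduction is handled by hardware cache coherence, whereas a software-DSM translation must replace that implicit sharing with explicit consistency operations.

```c
#include <omp.h>
#include <stdio.h>

#define N 100000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) { a[i] = i * 0.5; b[i] = i * 0.25; }

    /* On hardware shared memory this loop just works; a software-DSM
       translator must turn the implicit sharing of a[], b[] and the
       reduction into DSM acquire/release and merge operations. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f\n", sum);
    return 0;
}
```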
Advances in communication for parallel programming have yielded one-sided messaging systems. The MPI bindings for Ruby have been augmented to include the remote memory access functions of MPI-2.
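The Ruby method names themselves are not listed in the abstract; the underlying MPI-2 C functions such bindings wrap look like the following minimal one-sided exchange, in which rank 0 reads rank 1's buffer with no matching call on rank 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int buf = rank;                 /* each rank exposes one int */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);          /* open the access epoch */
    if (rank == 0 && nprocs > 1) {
        int remote;
        /* one-sided read of rank 1's buffer; rank 1 stays passive */
        MPI_Get(&remote, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        MPI_Win_fence(0, win);      /* completes the MPI_Get */
        printf("rank 0 read %d from rank 1\n", remote);
    } else {
        MPI_Win_fence(0, win);      /* fences are collective */
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```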
Recently, the graphical parallel programming environment P-GRADE, based on message passing, has been extended with advanced synchronization and control mechanisms (synchronizers) based on predicates computed on consistent application global states (PS-GRADE). In the new P-GRADE Workflow system, application workflows have been introduced to enable the design of control flow between otherwise independent applications. In this paper, we propose that program execution control by synchronizers can be extended and used to coordinate parallel applications executed on the Grid within the framework of the PS-GRADE tool. We introduce explicit inter-application, or Grid-level, synchronizers. They are connected to application-level synchronizers, which are able to send state reports and receive control signals over the Grid communication network. The Open Grid Services Infrastructure, implemented using the Globus Toolkit, is proposed for these purposes.
Exploiting multilevel parallelism using processor groups is becoming increasingly important for programming high-end systems. This paper describes group-aware run-time support for shared-/global-address-space programming models. The current effort has been undertaken in the context of the Aggregate Remote Memory Copy Interface (ARMCI) [1], a portable runtime system used as a communication layer for Global Arrays [2], Co-Array Fortran (CAF) [3], GPSHMEM [4], Co-Array Python [5], and also end-user applications. The paper describes the management of shared memory, the integration of shared memory communication and remote direct memory access (RDMA) on clusters with SMP nodes, and memory registration. These are all required for efficient multi-method and multi-protocol communication on modern systems. Focus is placed on techniques for supporting process groups while maximizing communication performance and efficiently managing global memory system-wide.
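ARMCI's own group API is not quoted in the abstract, so the sketch below illustrates the underlying idea of group-restricted communication with plain MPI instead: MPI_Comm_split carves the world communicator into groups, and a collective then runs only within each group.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, world_size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split the ranks into two groups, e.g. one per sub-task of a
       multilevel computation; collectives then stay inside a group. */
    int color = world_rank % 2;
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &group_comm);

    int group_rank, group_sum;
    MPI_Comm_rank(group_comm, &group_rank);
    MPI_Allreduce(&world_rank, &group_sum, 1, MPI_INT, MPI_SUM,
                  group_comm);   /* reduces over this group only */
    printf("world %d -> group %d rank %d, group sum %d\n",
           world_rank, color, group_rank, group_sum);

    MPI_Comm_free(&group_comm);
    MPI_Finalize();
    return 0;
}
```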
This paper describes an approach for automatically generating optimized parallel code from serial Fortran programs annotated with high level directives. A preprocessor analyzes both the program and the directives and generates efficient parallel Fortran code that runs on a number of parallel architectures, such as clusters or SMPs. The unique aspect of this approach is that the directives and optimizations can be customized and extended by the expert programmers who would be using them in their applications. This approach enables the creation of parallel extensions to Fortran that are specific to individual applications or science domains.
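The paper's preprocessor targets Fortran; purely as an analogy, and with a hypothetical directive name, the C sketch below shows the idea of a high-level annotation on a serial loop and the kind of OpenMP code a preprocessor might generate from it.

```c
/* Sketch of directive-driven code generation. The paper's tool works
   on Fortran; this C analogue only illustrates the idea, and the
   "MYDIR" directive spelling below is invented for the example. */
#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    /* Annotated serial loop, as the expert programmer writes it:
       //$MYDIR parallel distribute(a, block)
       for (int i = 0; i < N; i++) a[i] = 2.0 * a[i];
       ... which the preprocessor might expand into, e.g., OpenMP: */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * a[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}
```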
It is important to systematically assess the features and performance of new interconnects for high performance clusters. This work presents the performance of two-port Myrinet networks at the GM2 and MPI layers using a complete set of microbenchmarks. We also present the communication characteristics and performance of the NAS multi-zone benchmarks and the SMG2000 application under the MPI and mixed MPI-OpenMP programming paradigms. We found that the host overhead is very small in our cluster and that Myrinet is sensitive to buffer reuse patterns. Our applications achieved better performance with pure MPI than with the mixed mode. All the applications studied use only nonblocking communications and are thus able to overlap their communications with computations. Our experiments show that two-port communication at the GM and MPI levels outperforms one-port communication in bandwidth (except for RDMA read and overlap). However, this did not translate into a considerable improvement, at least for our applications.
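The overlap pattern the last observation refers to looks like this in generic MPI (a sketch, not the paper's benchmark code): nonblocking transfers are posted, independent computation proceeds, and completion is forced only when the received data is needed.

```c
#include <mpi.h>
#include <stdio.h>

#define N 4096

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;

    /* start a ring exchange ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... overlap it with computation that does not touch recvbuf ... */
    for (int i = 0; i < N; i++) local += sendbuf[i] * sendbuf[i];

    /* ... and complete both transfers before reading recvbuf */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d: local=%f recv[0]=%f\n", rank, local, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```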