This paper describes a novel spatial-force decomposition for N-body simulations for which we observe O(sqrt(p)) communication scaling. This has enabled Blue Matter to approach the effective limits of concurrency for molecular dynamics using particle-mesh (FFT-based) methods for handling electrostatic interactions. Using this decomposition, Blue Matter running on Blue Gene/L has achieved simulation rates in excess of 1000 time steps per second and demonstrated significant speed-ups to O(1) atom per node. Blue Matter employs a communicating sequential process (CSP) style model with application communication state machines compiled to hardware interfaces. The scalability achieved has enabled methodologically rigorous biomolecular simulations on biologically interesting systems, such as membrane-bound proteins, whose time scales dwarf previous work on those systems. Major scaling improvements require exploration of alternative algorithms for treating the long-range electrostatics.
This paper describes the integration of nested data parallelism into imperative languages using the example of C. Unlike flat data parallelism, nested data parallelism directly provides means for handling irregular data structures and certain forms of control parallelism, such as divide-and-conquer algorithms, thus enabling the programmer to express such algorithms far more naturally. Existing work deals with nested data parallelism in a functional environment, which does help avoid a set of problems, but makes efficient implementations more complicated. Moreover, functional languages are not readily accepted by programmers used to languages such as Fortran and C, which currently predominate in programming parallel machines. In this paper, we introduce the imperative data-parallel language V and give an overview of its implementation.
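To make the contrast between nested and flat data parallelism concrete, here is a minimal Python sketch. It is not the V language, and the abstract does not say how V is implemented. It assumes one common representation of irregular nested data: a flat value vector plus a list of segment lengths, over which a per-row reduction becomes a single segmented operation. The names `flatten` and `segmented_sum` are hypothetical.

```python
from itertools import accumulate


def flatten(nested):
    """Turn a list of variable-length rows into (flat values, segment lengths)."""
    values = [x for row in nested for x in row]
    lengths = [len(row) for row in nested]
    return values, lengths


def segmented_sum(values, lengths):
    """Sum each segment of the flat vector; one result per original row."""
    # offsets[i] is where row i starts in the flat vector.
    offsets = [0] + list(accumulate(lengths))
    return [sum(values[offsets[i]:offsets[i + 1]]) for i in range(len(lengths))]


# An irregular structure: rows of different lengths, including an empty one.
nested = [[1, 2, 3], [], [4, 5]]
values, lengths = flatten(nested)
row_sums = segmented_sum(values, lengths)
```

The point of the flat form is that every element of `values` can be processed uniformly, while `lengths` preserves the irregular shape that a purely flat data-parallel model cannot express directly.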
Recently, there has been active development of parallel programming methods for implementing general-purpose algorithms on graphics processing units (GPUs). Using this specialized hardware can increase performance significantly, but it requires low-level programming and a detailed understanding of the underlying hardware and software platform. There is therefore a need to automate the development process. This paper presents a technique for automating GPU application development, based on a rewriting-rules approach. An example demonstrates the possibilities of our approach when migrating a sequential C# program to its parallel analog running on a GPU, as well as when optimizing parallel applications. Using our approach, we obtained a performance speedup of 25X while preserving the benefits of the *** platform.
ISBN: (print) 9781424447947
Based on biological immune theory, a new immune algorithm is presented. Compared with classical evolutionary programming and evolutionary algorithms with chaotic mutations, experimental results show that the proposed algorithm, parallel chaos immune evolutionary programming, is highly efficient and can effectively prevent premature convergence. A three-layer feed-forward neural network (NN) is designed to predict the state of charge (SOC) of Ni-MH batteries. Initially, partial least squares regression is used to select input variables. Then, five variables (battery terminal voltage, voltage derivative, voltage second derivative, discharge current, and battery temperature) are selected as the inputs of the NN. To overcome the weakness of the BP algorithm, the proposed algorithm is adopted to train the weights. Finally, under a dynamic power cycle, the SOC estimated by the NN model and the SOC measured in experiments are compared, and the results confirm that the proposed approach can provide an accurate estimation of the SOC.
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications that have irregular data access patterns when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of the runtime preprocessing routines and collective communication routines inserted to manage communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines, and use of coalescing and incremental routines. We then describe how program slicing can be used to apply IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes.
UPC is an explicit parallel extension of ANSI C that has been gaining attention from vendors and users. In this paper, we consider the low-level monitoring and experimental performance evaluation of a new implementation of the UPC compiler on the SGI Origin family of NUMA architectures. These systems offer many opportunities for the high-performance implementation of UPC. Because of their many hardware monitoring counters, they also offer the opportunity for low-level performance measurements to guide compiler implementations. Early UPC compilers face the challenge of meeting the syntax and semantics requirements of the language. As a result, such compilers tend to focus on correctness rather than on performance. In this paper, we report on the performance of selected applications and kernels under this new compiler. The measurements were designed to shed some light on the next steps that UPC compiler developers should take to harness the full performance and usability potential of UPC on these architectures.
Processor layout and data distribution are important to performance-oriented parallel computation, yet high-level language support that helps programmers address these issues is often inadequate. This paper presents a trio of abstract high-level language constructs - grids, distributions, and regions - that let programmers manipulate processor layout and data distribution. Grids abstract processor sets, regions abstract index sets, and distributions abstract mappings from index sets to processor sets; each of these is a first-class concept, supporting dynamic data reallocation and redistribution as well as dynamic manipulation of the processor set. This paper illustrates uses of these constructs in the solutions to several motivating parallel programming problems.
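The relationship between the three constructs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's language or syntax: the region is a one-dimensional inclusive index range, the grid is simply a processor count, and the distribution is a plain block mapping. The function name `block_owner` and the variables are hypothetical.

```python
def block_owner(index, lo, hi, num_procs):
    """Return the processor that owns `index` under a block distribution
    of the inclusive region [lo, hi] over `num_procs` processors."""
    n = hi - lo + 1
    block = -(-n // num_procs)  # ceiling division: indices per processor
    return (index - lo) // block


# Region: the index set 1..10. Grid: 4 processors, numbered 0..3.
region = range(1, 11)
owners_on_4 = {i: block_owner(i, 1, 10, 4) for i in region}

# "Redistribution" here is just re-evaluating the mapping on a new grid.
owners_on_2 = {i: block_owner(i, 1, 10, 2) for i in region}
```

Treating the mapping as a value that can be recomputed against a different grid is the sense in which a distribution is first-class: the index set stays the same while its assignment to processors changes.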
The software product set of the MasPar computer is examined, the key issues being the programming model, software philosophy, parallel virtuality, programming languages and their compilers, application porting and adaptation, and programming support. VLSI technology and massively parallel architecture are combined in the MP-1 to offer what would be considered supercomputer performance at a minicomputer price.
Erasure-code-based object storage systems are becoming popular choices for archive storage because of their cost-effective space-saving schemes and higher fault resilience. Both erasure code encoding and decoding involve compute-intensive array, matrix, and table-lookup operations. With today's advanced CPU design technologies such as multi-core, many-core, and streaming SIMD instruction sets, we can effectively and efficiently adapt erasure code technology to cloud storage systems and apply it to very large-scale data sets. Current erasure coding solutions are based on a single-process approach, which cannot process very large data sets efficiently and effectively. To avoid the bottleneck of a single-process encoder, we exploit the task parallelism of a multicore computing system and add parallel processing capability to the erasure coding process. We have leveraged open source erasure coding software to implement a concurrent and parallel erasure coding software package called parEC. parEC is realized through an MPI run-time parallel I/O environment, after which a data placement process distributes encoded data blocks to their destination storage devices. In this paper, we present the software architecture of parEC, conduct various performance tests on its software components, describe our early experience using it, and address its current status and future development work.
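The idea of encoding independent stripes as separate tasks can be sketched as follows. This is not parEC and does not use its API. As assumptions for illustration, it substitutes a single XOR parity block for a real erasure code and a Python thread pool for MPI processes, so it shows the task structure rather than any actual speedup. The names `xor_parity`, `encode_stripe`, and `recover` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR of equal-length blocks (a single-parity erasure code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_stripe(stripe):
    """Return the stripe's data blocks followed by one parity block."""
    return stripe + [xor_parity(stripe)]


def recover(blocks_with_gap):
    """Rebuild the single missing block (marked None) from the survivors."""
    survivors = [b for b in blocks_with_gap if b is not None]
    return xor_parity(survivors)


# Two independent stripes of three data blocks each.
stripes = [[b"abcd", b"efgh", b"ijkl"], [b"mnop", b"qrst", b"uvwx"]]

# Each stripe is an independent task; the pool stands in for MPI ranks.
with ThreadPoolExecutor(max_workers=2) as pool:
    encoded = list(pool.map(encode_stripe, stripes))

# Simulate losing one block of the first stripe and rebuilding it.
damaged = list(encoded[0])
damaged[1] = None
restored = recover(damaged)
```

Because no stripe depends on another, the encoding work divides cleanly across workers, which is the property a parallel encoder relies on. A production code would tolerate more than one lost block per stripe, which XOR parity cannot.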
The authors describe DSVM6K, an implementation of distributed shared virtual memory (DSVM) in AIX v3 on the IBM RISC (reduced instruction set computer) System/6000 workstation. The design and implementation exploit the high-speed fiber links and low-overhead link protocols on the RISC System/6000 workstation. The results demonstrate that technology like this is required for a high-performance DSVM system. DSVM6K achieves the highest performance of any DSVM system reported in the literature. Descriptions of the overall design, programming interfaces, performance measurement data, and application results are presented.