Passing messages through shared memory plays an important role in symmetric multiprocessors and on Clumps. The management of concurrent access to message queues is an important aspect of the design of shared-memory message-passing systems. Using both microbenchmarks and applications, the paper compares the performance of concurrent access algorithms for passing active messages on a Sun Enterprise 5000 server. The paper presents a new lock-free algorithm that provides many of the advantages of non-blocking algorithms while avoiding the overhead of true non-blocking behavior. The lock-free algorithm couples synchronization tightly to the data structure and demonstrates application performance superior to all others studied. The success of this algorithm implies that other practical problems might also benefit from a reexamination of the non-blocking literature.
Erasure code based object storage systems are becoming popular choices for archive storage systems due to cost-effective storage space saving schemes and higher fault-resilience capabilities. Both erasure code encoding and decoding procedures involve heavy array, matrix, and table-lookup compute-intensive operations. With today's advanced CPU design technologies such as multi-core, many-core, and streaming SIMD instruction sets, we can effectively and efficiently adapt erasure code technology in cloud storage systems and apply it to handle very large-scale data sets. Current solutions for the erasure coding process are based on a single-process approach, which is not capable of processing very large data sets efficiently and effectively. To remove the bottleneck of a single-process erasure encoding process, we exploit the task parallelism available in a multicore computing system and improve the erasure coding process with parallel processing capability. We have leveraged open source erasure coding software and implemented concurrent and parallel erasure coding software, called parEC. The proposed parEC process is realized through an MPI runtime parallel I/O environment, and a data placement process is then applied to distribute encoded data blocks to their destination storage devices. In this paper, we present the software architecture of parEC. We conduct various performance testing cases on parEC's software components. We present our early experience of using parEC, and discuss parEC's current status and future development work.
Data parallel languages like High Performance Fortran (HPF) are emerging as the architecture-independent mode of programming distributed memory parallel machines. In this paper, we present the interprocedural optimizations required for compiling applications having irregular data access patterns, when coded in such data parallel languages. We have developed an Interprocedural Partial Redundancy Elimination (IPRE) algorithm for optimized placement of runtime preprocessing routines and collective communication routines inserted for managing communication in such codes. We also present two new interprocedural optimizations: placement of scatter routines and use of coalescing and incremental routines. We then describe how program slicing can be used for further applying IPRE in more complex scenarios. We have done a preliminary implementation of the schemes presented here using the Fortran D compilation system as the necessary infrastructure. We present experimental results from two codes compiled using our system to demonstrate the efficacy of the presented schemes.
Processor layout and data distribution are important to performance-oriented parallel computation, yet high-level language support that helps programmers address these issues is often inadequate. This paper presents a trio of abstract high-level language constructs - grids, distributions, and regions - that let programmers manipulate processor layout and data distribution. Grids abstract processor sets, regions abstract index sets, and distributions abstract mappings from index sets to processor sets; each of these is a first-class concept, supporting dynamic data reallocation and redistribution as well as dynamic manipulation of the processor set. This paper illustrates uses of these constructs in the solutions to several motivating parallel programming problems.
UPC is an explicit parallel extension of ANSI C, which has been gaining attention from vendors and users. In this paper, we consider the low-level monitoring and experimental performance evaluation of a new implementation of the UPC compiler on the SGI Origin family of NUMA architectures. These systems offer many opportunities for a high-performance implementation of UPC. They also offer, due to their many hardware monitoring counters, the opportunity for low-level performance measurements to guide compiler implementations. Early UPC compilers have the challenge of meeting the syntax and semantics requirements of the language. As a result, such compilers tend to focus on correctness rather than on performance. In this paper, we report on the performance of selected applications and kernels under this new compiler. The measurements were designed to help shed some light on the next steps that should be taken by UPC compiler developers to harness the full performance and usability potential of UPC under these architectures.
The software product set of the MasPar computer is examined, the key issues being the programming model, software philosophy, parallel virtuality, programming languages and their compilers, application porting and adaptation, and programming support. VLSI technology and massively parallel architecture are combined in the MP-1 to offer what would be considered supercomputer performance at a minicomputer price.
This paper describes the integration of nested data parallelism into imperative languages using the example of C. Unlike flat data parallelism, nested data parallelism directly provides means for handling irregular data structures and certain forms of control parallelism, such as divide-and-conquer algorithms, thus enabling the programmer to express such algorithms far more naturally. Existing work deals with nested data parallelism in a functional environment, which helps avoid a set of problems but makes efficient implementations more complicated. Moreover, functional languages are not readily accepted by programmers used to languages such as Fortran and C, which are currently predominant in programming parallel machines. In this paper, we introduce the imperative data-parallel language V and give an overview of its implementation.
Recently there has been active development of parallel programming methods for implementing general-purpose algorithms on graphics processing units (GPUs). Using this specialized hardware can increase performance significantly, but requires low-level programming and an understanding of the details of the underlying hardware and software platform. Therefore there is a need for automating the development process. This paper presents a technique for automating GPU application development, based on a rewriting-rules approach. An example is given demonstrating the possibilities of our approach when migrating from a sequential C# program to its parallel analog running on a GPU, as well as for the optimization of parallel applications. Using our approach we obtained a performance speedup of 25X, while preserving the benefits of *** platform.
This paper describes a novel spatial-force decomposition for N-body simulations for which we observe O(sqrt(p)) communication scaling. This has enabled Blue Matter to approach the effective limits of concurrency for molecular dynamics using particle-mesh (FFT-based) methods for handling electrostatic interactions. Using this decomposition, Blue Matter running on Blue Gene/L has achieved simulation rates in excess of 1000 time steps per second and demonstrated significant speed-ups down to O(1) atom per node. Blue Matter employs a communicating sequential process (CSP) style model with application communication state machines compiled to hardware interfaces. The scalability achieved has enabled methodologically rigorous biomolecular simulations on biologically interesting systems, such as membrane-bound proteins, whose time scales dwarf previous work on those systems. Major scaling improvements will require exploration of alternative algorithms for treating the long-range electrostatics.
A programming tool, called parallelizer, for the static optimization of concurrent programs is considered. The tool partitions the alternative command lists of a nondeterministic iterative command into distinct elements that are concurrently executed. To improve the program's performance, the tool determines a decomposition where the granularity of the resulting processes is close to optimal for the target parallel architecture. This requires that some parameters of the target architecture are taken into account. Search techniques traditionally used in artificial intelligence are exploited to determine an optimal alternative assignment. The implementation of the parallelizer is described and an example of its application is considered.