This paper examines four different strategies, each one with its own data distribution, for implementing the parallel conjugate gradient (CG) method and how they impact communication and overall performance. Firstly, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms through a set of matrices that exhibit distinct sparse patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform.
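As a rough illustration of the 1D case mentioned in this abstract, the sketch below runs CG with the matrix rows split into contiguous blocks, one block per simulated thread. It is not the paper's UPC code: the helper names (`row_blocks`, `distributed_matvec`, `conjugate_gradient`) are hypothetical, the matrix is dense, and the "threads" run one after another in plain Python.

```python
def row_blocks(n, parts):
    """Split row indices 0..n-1 into `parts` contiguous blocks (1D row distribution)."""
    base, extra = divmod(n, parts)
    blocks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def distributed_matvec(A, x, blocks):
    """Each simulated thread computes only the rows it owns."""
    y = [0.0] * len(x)
    for rows in blocks:  # one pass per simulated thread
        for i in rows:
            y[i] = sum(a * xj for a, xj in zip(A[i], x))
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, parts=2, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A (assumed, not checked)."""
    blocks = row_blocks(len(b), parts)
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = distributed_matvec(A, p, blocks)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x
```

In a real distributed setting, the matrix-vector product and the two dot products per iteration are where communication would occur; the sequential loop over `blocks` only marks which rows each thread would own.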
The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems to get good performance. First, we describe several UPC program optimization techniques that are important to achieving good performance on NUMA multi-core computers with examples and quantitative performance results. Second, we use two numerical computing kernels, parallel matrix-matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization for UPC applications. Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.
Provision of Quality-of-Service (QoS) guarantees is an important and challenging issue in the design of Asynchronous Transfer Mode (ATM) networks. Call Admission Control (CAC) is an integral part of the challenge and is closely related to other aspects of network designs such as traffic characterization and QoS specification. Since the Usage Parameter Control (UPC) parameters are the only standardized traffic characterizations, developing efficient CAC schemes based on UPC parameters is significant for the implementation of CAC on ATM switches. In this paper, we develop a CAC algorithm called TAP (derived from TAgged Probability) as well as two other CAC algorithms using the UPC parameters. These CAC algorithms are based on our observation that the loss-probability-to-overflow-probability ratio tends to decrease as the number of sources increases. By introducing the loss-probability-to-overflow-probability ratio K, we find that this ratio sheds light on increasing resource utilization while still guaranteeing QoS. Analysis, simulation, and numerical results have shown that the proposed TAP algorithm is simple and efficient. Copyright (C) 2000 John Wiley & Sons, Ltd.
The estimation of unsaturated permeability coefficients (UPC) is of great importance for understanding the properties of the matrix. To study the relationship between the UPC and the oxidation-reduction potential (ORP) at different matrix depths in a subsurface wastewater infiltration system (SWIS), and to provide a scientific basis for regulating the SWIS and increasing pollutant removal, an experiment simulating the SWIS was designed with an inflow period (12 h) and a drying period (12 h) in each cycle and a hydraulic load of 0.10 m³·(m²·d)⁻¹. The results showed that ORP increased with increasing UPC at matrix depths of 70 and 115 cm and decreased with increasing UPC at matrix depths of 100 and 130 cm. These observations indicated that capillary action clearly affected UPC and ORP. Moreover, the presence of oxygen and low volumetric water contents could influence UPC and ORP. The UPC at matrix depths of 100 cm and below indicated that anaerobic zones could be found in an aerobic environment under alternating conditions. The UPC at different matrix depths of a satisfactory SWIS could range from 2.49 × 10⁻⁷ to 1.16 × 10⁻³ cm·s⁻¹. The treated water met reuse requirements and no clogging was found.
ISBN: (print) 9783319936987; 9783319936970
Developments in high performance computing (HPC) have transformed how computational hydrodynamic (CHD) simulations are performed. To date, the message passing interface (MPI) remains the common parallelism architecture and has been adopted widely in CHD simulations. However, its bottleneck problem remains for some large-scale simulation cases due to delays during message passing, whereby the total communication time may exceed the total simulation runtime with an increasing number of computer processors. In this study, we utilise an alternative parallelism architecture, known as PGAS-UPC, to develop our own UPC-CHD model with a 2-step explicit scheme from the Lax-Wendroff family of predictors-correctors. The model is evaluated on three incompressible, adiabatic viscous 2D flow cases having moderate flow velocities. Model validation is achieved by the reasonably good agreement between the predicted and respective analytical values. We then compare the computational performance of UPC-CHD with that of MPI in its base design on an SGI UV-2000 server with up to 100 processors. The former achieves a near 1:1 speedup, which demonstrates its efficiency potential for very large-scale CHD simulations, while the latter experiences slowdown at some point. Extension of UPC-CHD remains our main objective, which can be achieved by the following additions: (a) inclusion of other numerical schemes to accommodate other types of fluid simulations, and (b) coupling UPC-CHD with Amazon Web Services (AWS) to further exploit its parallelism efficiency as a viable alternative.
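To make the "2-step explicit scheme from the Lax-Wendroff family" more concrete, here is a minimal sketch of the classic Richtmyer two-step predictor-corrector. It is applied to 1D linear advection on a periodic grid, which is an assumption made for brevity and not the 2D viscous model of the abstract. The function name `lax_wendroff_step` and the parameter `nu` (the Courant number c·dt/dx) are hypothetical.

```python
def lax_wendroff_step(u, nu):
    """Advance u by one time step of u_t + c u_x = 0 on a periodic grid.

    nu is the Courant number c * dt / dx; the scheme is stable for |nu| <= 1.
    """
    n = len(u)
    # Predictor: values at the half time level on the cell interfaces i + 1/2.
    half = [
        0.5 * (u[i] + u[(i + 1) % n]) - 0.5 * nu * (u[(i + 1) % n] - u[i])
        for i in range(n)
    ]
    # Corrector: conservative update using the interface values.
    # half[i - 1] with i == 0 wraps to the last interface (periodic boundary).
    return [u[i] - nu * (half[i] - half[i - 1]) for i in range(n)]
```

With `nu = 1` the scheme reduces to an exact shift by one cell, which makes a convenient sanity check. In a parallel version each thread would own a slice of the grid and would need only its neighbours' boundary values at each step.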
ISBN: (print) 9783642038686
Unified Parallel C (UPC) is a Partitioned Global Address Space (PGAS) language that exhibits high performance and portability on a broad class of shared and distributed memory parallel architectures. This paper describes the design and implementation of a parallel numerical library for UPC built on top of the sequential BLAS routines. The developed library exploits the particularities of the PGAS paradigm, taking into account data locality in order to guarantee good performance. The library was experimentally validated, demonstrating scalability and efficiency.
ISBN: (print) 9780769547497
UPC is designed to improve user productivity when programming distributed-memory machines. Yet the shared-memory abstraction also makes performance analysis hard, as it introduces extra overhead with local accesses and implicit communication with remote ones. As far as we know, there are no mature software utilities for systematic analysis and tuning of shared-memory access performance in UPC programs. We develop a mechanism to track shared memory accesses and correlate them to the UPC source lines, functions, and data structures. We then apply tool-assisted analysis to a set of UPC programs. For the NAS UPC benchmark we achieve dramatic performance improvement over the unoptimized implementation as well as speedups of up to two times over the fully hand-tuned implementation. We expect our approach to be effective in tuning a wide range of UPC programs.
ISBN: (print) 9781479941162
Message Passing Interface (MPI) has been the de facto programming model for scientific parallel applications. However, data driven applications with irregular communication patterns are harder to implement using MPI. The Partitioned Global Address Space (PGAS) programming models present an alternative approach to improve programmability. PGAS languages like UPC are growing in popularity because of their ability to provide a shared-memory programming model over distributed memory machines. However, since UPC is an emerging standard, it is unlikely that entire applications will be re-written with it. Instead, unified communication runtimes have paved the way for a new class of hybrid applications that can leverage the benefits of both MPI and PGAS models. Such unified runtimes need to be designed in a high performance, scalable manner to improve the performance of emerging hybrid applications. Collective communication primitives offer a flexible, portable way to implement group communication operations and are supported in both MPI and PGAS programming models. Owing to their advantages, they are also widely used across various scientific parallel applications. Over the years, MPI libraries have relied upon aggressive software-/hardware-based and kernel-assisted optimizations to deliver low communication latency for various collective operations. However, there is much room for improvement for collective operations in state-of-the-art, open-source implementations of UPC. In this paper, we address the challenges associated with improving the performance of collective primitives in UPC. Further, we also explore design alternatives to enable collective primitives in UPC to directly leverage the designs available in the MVAPICH2 MPI library. Our experimental evaluations show that our designs improve the performance of the UPC broadcast and all-gather operations by 25X and 18X, respectively, for a 128 KB message at 2,048 processes. Our designs improve the performance of the UPC 2D
ISBN: (print) 9783642130663
Bit-reversal is widely known as an important routine and an essential part of the Fast Fourier Transform (FFT). If not carefully designed, it may easily take a large portion of an FFT application's total execution time. In this paper, we present a parallel implementation of bit-reversal for FFT using Cilk and UPC. Based on our previous work on creating a parallel bit-reversal using OpenMP in SPMD style from an unparallelized, sequential algorithm, we observe that it is possible to keep the existing parallelism by reorganizing the same program using Cilk and UPC while still achieving good performance. Experimental results were obtained by executing these parallel codes on two multi-core SMP platforms, and they are very promising.
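For readers unfamiliar with the operation, the sketch below shows a plain sequential bit-reversal permutation of the kind used to reorder FFT input. It is not the Cilk or UPC implementation described in the abstract, and the function names are hypothetical.

```python
def bit_reverse(index, bits):
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permutation(data):
    """Return a copy of `data` reordered by bit-reversed index.

    The length must be a power of two, as in a radix-2 FFT.
    """
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(data)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair only once
            out[i], out[j] = out[j], out[i]
    return out
```

Each swap touches a distinct pair of positions, so the loop iterations are independent of one another; that independence is what makes the permutation a natural candidate for the SPMD-style parallelisation the abstract refers to.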
Unified Parallel C (UPC) is a Partitioned Global Address Space (PGAS) language whose popularity has increased in recent years owing to its high programmability and reasonable performance through an efficient exploitation of data locality, especially on hierarchical architectures like multicore clusters. However, the performance issues that arise in this language due to the irregular structure of sparse matrix operations have not yet been studied. Among them, the selection of an adequate storage format for the sparse matrices can significantly improve the efficiency of the parallel codes. This paper presents an evaluation, using UPC, of the most common sparse storage formats with different implementations of the matrix-vector and matrix-matrix products, which are key kernels in many scientific applications.
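As one example of a common sparse storage format, the sketch below builds a compressed sparse row (CSR) representation and uses it for a matrix-vector product. The abstract does not list which formats were evaluated, so CSR is an assumption here, and the function names are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a dense list-of-lists matrix to CSR arrays.

    Returns (values, col_idx, row_ptr): the nonzeros of row i are
    values[row_ptr[i]:row_ptr[i + 1]], in the columns given by col_idx.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A x for a matrix A stored in CSR form."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y.append(acc)
    return y
```

The indirect access `x[col_idx[k]]` is the irregular memory pattern that makes sparse kernels sensitive to data placement: when `x` is distributed across threads, some of those reads would be remote.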