Future multiscale and multiphysics models that support research into human disease, translational medical science, and treatment can utilize the power of high-performance computing (HPC) systems. We anticipate that co...
详细信息
Future multiscale and multiphysics models that support research into human disease, translational medical science, and treatment can utilize the power of high-performance computing (HPC) systems. We anticipate that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message-passing processes [e. g., the message-passing interface (MPI)] with multithreading (e. g., OpenMP, Pthreads). The objective of this study is to compare the performance of such hybrid programming models when applied to the simulation of a realistic physiological multiscale model of the heart. Our results show that the hybridmodels perform favorably when compared to an implementation using only the MPI and, furthermore, that OpenMP in combination with the MPI provides a satisfactory compromise between performance and code complexity. Having the ability to use threads within MPI processes enables the sophisticated use of all processor cores for both computation and communication phases. Considering that HPC systems in 2012 will have two orders of magnitude more cores than what was used in this study, we believe that faster than real-time multiscale cardiac simulations can be achieved on these systems.
The path to the efficient exploitation of molecular dynamics simulators is strongly driven by the increasingly intensive use of accelerators. However, they suffer performance portability issues, making it necessary bo...
详细信息
The path to the efficient exploitation of molecular dynamics simulators is strongly driven by the increasingly intensive use of accelerators. However, they suffer performance portability issues, making it necessary both to achieve technological combinations that allow taking advantage of each programming model and device, and to define more effective load distribution strategies that consider the simulation conditions. In this work, a new load balancing algorithm is presented, together with a set of optimizations to support hybrid co-execution in a runtime system for heterogeneous computing. The new extended design enables the exploitation of custom kernels and acceleration technologies altogether, being encapsulated for the rest of the runtime and its scheduling system. With this support, Mash algorithm allows to simultaneously leverage different workload distribution strategies, benefiting from the most advantageous one per device and technology. Experiments show that these proposals achieve an efficiency close to 0.90 and an energy efficiency improvement around 1.80 over the original optimized version.
The increasing complexity of modern and future computing systems makes it challenging to develop applications that aim for maximum performance. hybrid parallel programmingmodels offer new ways to exploit the capabili...
详细信息
ISBN:
(纸本)9781728165820
The increasing complexity of modern and future computing systems makes it challenging to develop applications that aim for maximum performance. hybrid parallel programmingmodels offer new ways to exploit the capabilities of the underlying infrastructure. However, the performance gain is sometimes accompanied by increased programming complexity. We introduce an extension to PyCOMPSs, a high-level task-based parallel programming model for Python applications, to support tasks that use MPI natively as part of the task model. Without compromising application's programmability, using Native MPI tasks in PyCOMPSs offers up to 3x improvement in total performance for compute intensive applications and up to 1.9x improvement in total performance for 110 intensive applications over sequential implementation of the tasks.
Applications to process seismic data are computationally expensive and, therefore, employ scalable parallel systems to produce timely results. Here we describe our experiences of using performance analysis tools to ga...
详细信息
Applications to process seismic data are computationally expensive and, therefore, employ scalable parallel systems to produce timely results. Here we describe our experiences of using performance analysis tools to gain insight into an MPI+OpenMP code developed by Shell that performs Reverse Time Migration on a cluster to produce models of the subsurface. Tuning MPI+OpenMP programs for modern platforms is difficult, and, therefore, assistance is required from performance analysis tools. These tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from Rice University's HPCToolkit and hardware performance counters, we were able to improve the performance of Shell's prototype distributed-memory Reverse Time Migration code by roughly 30 percent.
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe...
详细信息
ISBN:
(纸本)9781605583976
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell (TM) 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
This paper describes the use of statistical tools to model the performance of mixed device (hosts and devices) programs where hosts are CPUs and devices are GPUs. The purpose of GPUs is to accelerate compute-intensive...
详细信息
ISBN:
(纸本)9781665435772
This paper describes the use of statistical tools to model the performance of mixed device (hosts and devices) programs where hosts are CPUs and devices are GPUs. The purpose of GPUs is to accelerate compute-intensive sections of a program, thereby reducing total execution time, with side-effects including reduced machine usage and energy consumption. To model major and minor factors that affect the execution time of offloaded programs, we used a compute-intensive program with several GPU kernels. We have abstracted the hybrid program as a sequence of computations that access various types of memories (device caches, device shared memory, memory of other devices and host memory). In the programming model discussed, the role of a host is reduced to scheduling and coordinating execution of kernels across devices and communicating with other hosts. It can be extended to models where hosts perform computations or are obliterated. Experiments were designed to include a range of memory sizes and types. The performance models were trained, and their predictions were verified using test data.
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe...
详细信息
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell (TM) 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
Applications to process seismic data are computationally expensive and, therefore, employ scalable parallel systems to produce timely results. Here we describe our experiences of using performance analysis tools to ga...
详细信息
Applications to process seismic data are computationally expensive and, therefore, employ scalable parallel systems to produce timely results. Here we describe our experiences of using performance analysis tools to gain insight into an MPI+OpenMP code developed by Shell that performs Reverse Time Migration on a cluster to produce models of the subsurface. Tuning MPI+OpenMP programs for modern platforms is difficult, and, therefore, assistance is required from performance analysis tools. These tools provided us with insights into the effectiveness of the domain decomposition strategy, the use of threaded parallelism, and functional unit utilization in individual cores. By applying insights obtained from Rice University's HPCToolkit and hardware performance counters, we were able to improve the performance of Shell's prototype distributed-memory Reverse Time Migration code by roughly 30 percent.
暂无评论