task-based programming models have enabled the optimized execution of the computation workloads of applications. These programmingmodels can take advantage of large-scale distributed infrastructures by allowing the p...
详细信息
task-based programming models have enabled the optimized execution of the computation workloads of applications. These programmingmodels can take advantage of large-scale distributed infrastructures by allowing the parallel and distributed execution of applications in high-level work components called tasks. Nevertheless, in the era of Big Data and Exascale, the amount of data produced by modern scientific applications has already surpassed terabytes and is rapidly increasing. Hence, I/O performance became the bottleneck to overcome in order to achieve more total performance improvement. New storage technologies offer higher bandwidth and faster solutions than traditional Parallel File Systems (PFS). Such storage devices are deployed in modern day infrastructures to boost I/O performance by offering a fast layer that absorbs the generated data. Therefore, it is necessary for any programming model targeting more performance to manage this heterogeneity and take advantage of it to improve the I/O performance of applications. Towards this goal, we propose in this article a set of programming model capabilities that we refer to as Storage-Heterogeneity Awareness. Such capabilities include: (i) abstracting the heterogeneity of storage systems, and (ii) optimizing I/O performance by supporting dedicated I/O schedulers and an automatic data flushing technique. The evaluation section of this article presents the performance results of different applications on the MareNostrum CTE-Power heterogeneous storage cluster. Our experiments demonstrate that a storage-heterogeneity aware programming model can achieve up to almost 5x I/O performance speedup and 48% total time improvement compared to the reference PFS-based usage of the execution infrastructure.
Storage systems have not kept the same technology improvement rate as computing systems. As applications produce more and more data, I/O becomes the limiting factor for increasing application performance. I/O congesti...
详细信息
Storage systems have not kept the same technology improvement rate as computing systems. As applications produce more and more data, I/O becomes the limiting factor for increasing application performance. I/O congestion caused by concurrent access to storage devices is one of the main obstacles that cause I/O performance degradation and, consequently, total performance degradation. Although task-based programming models made it possible to achieve higher levels of parallelism by enabling the execution of tasks in large-scale distributed platforms, this parallelism only benefited the compute workload of the application. Previous efforts addressing I/O performance bottlenecks either focused on optimizing fine-grained I/O access patterns using I/O libraries or avoiding system-wide I/O congestion by minimizing interference between multiple applications. In this paper, we propose enabling I/O Awareness in task-based programming models for improving the total performance of applications. An I/O aware programming model is able to create more parallelism and mitigate the causes of I/O performance degradation. On the one hand, more parallelism can be created by supporting special tasks for executing I/O workloads, called I/O tasks, that can overlap with the execution of compute tasks. On the other hand, I/O congestion can be mitigated by constraining I/O tasks scheduling. We propose two approaches for specifying such constraints: explicitly set by the users or automatically inferred and tuned during application's execution to optimize the execution of variable I/O workloads on a certain storage infrastructure. We implement our proposal using PyCOMPSs: a task-basedprogramming model for parallelizing Python applications. Our experiments on the MareNostrum 4 Supercomputer demonstrate that using I/O aware PyCOMPSs can achieve significant performance improvement in the total execution time of applications with different I/O workloads. This performance improvement can reach up to
task-based programming models such as OpenMP 5.0 and OmpSs are simple to use and powerful enough to exploit task parallelism of applications over multicore, manycore and heterogeneous systems. However, their software-...
详细信息
task-based programming models such as OpenMP 5.0 and OmpSs are simple to use and powerful enough to exploit task parallelism of applications over multicore, manycore and heterogeneous systems. However, their software-only runtimes introduce relevant overhead when targeting fine-grained tasks, resulting in performance losses. To overcome this drawback, we present a hardware runtime Picos++ that accelerates critical runtime functions such as task dependence analysis, nested task support, and heterogeneous task scheduling. As a proof-of-concept, the Picos++ hardware runtime has been integrated with a compiler infrastructure that supports parallel task-based programming models. A FPGA SoC running Linux OS has been used to implement the hardware accelerated part of Picos++, integrated with a heterogeneous system composed of 4 symmetric multiprocessor (SMP) cores and several hardware functional accelerators (HwAccs) for task execution. Results show significant improvements on energy and performance compared to state-of-the-art parallel software-only runtimes. With Picos++, applications can achieve up to 7.6x speedup and save up to 90 percent of energy, when using 4 threads and up to 4 HwAccs, and even reach a speedup of 16x over the software alternative when using 12 HwAccs and small tasks.
We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the ...
详细信息
ISBN:
(数字)9798350352917
ISBN:
(纸本)9798350352924;9798350352917
We present the design and scalable implementation of an exascale climate emulator for addressing the escalating computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data. This provides tunable spatio-temporal resolution and significantly improves the fidelity and granularity of climate emulation, achieving an ultra-high spatial resolution of 0.034 degrees (similar to 3.5 km) in space. Our emulator, trained on 318 billion hourly temperature data points from a 35-year and 31 billion daily data points from an 83-year global simulation ensemble, generates statistically consistent climate emulations. We extend linear solver software to mixed-precision arithmetic GPUs, applying different precisions within a single solver to adapt to different correlation strengths. The PaRSEC runtime system supports efficient parallel matrix operations by optimizing the dynamic balance between computation, communication, and memory requirements. Our BLAS3-rich code is optimized for systems equipped with four different families and generations of GPUs, scaling well to achieve 0.976 EFlop/s on 9,025 nodes (36,100 AMD MI250X multi-chip module (MCM) GPUs) of Frontier (nearly full system), 0.739 EFlop/s on 1,936 nodes (7,744 Grace-Hopper Superchips (GH200)) of Alps, 0.243 EFlop/s on 1,024 nodes (4,096 A100 GPUs) of Leonardo, and 0.375 EFlop/s on 3,072 nodes (18,432 V100 GPUs) of Summit.
task-based programming models are a promising approach to exploiting complex distributed and heterogeneous systems. However, integrating different communication, offloading, and storage APIs within tasks poses perform...
详细信息
ISBN:
(数字)9783031617638
ISBN:
(纸本)9783031617621;9783031617638
task-based programming models are a promising approach to exploiting complex distributed and heterogeneous systems. However, integrating different communication, offloading, and storage APIs within tasks poses performance and deadlock risks. Several task-Aware libraries, such as TAMPI, TASIO, and TACUDA, have been developed to integrate blocking and non-blocking APIs within task-based programming models efficiently. In this paper, we introduce the Asynchronous Low-level programming Interface (ALPI) to enable the interoperability and portability of task-Aware libraries across various programmingmodels and runtime systems. We have implemented ALPI in the Nanos6 and nOS-V runtimes, enhancing the integration of task-Aware libraries with the OmpSs-2 and OpenMP programmingmodels. This work is a step towards improving the composability of parallel programmingmodels by supporting task-Aware libraries across different runtime systems.
We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-ran...
详细信息
ISBN:
(纸本)9781665454445
We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations;however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
This article presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA i...
详细信息
This article presents the new features of the OmpSs@FPGA framework. OmpSs is a data-flow programming model that supports task nesting and dependencies to target asynchronous parallelism and heterogeneity. OmpSs@FPGA is the extension of the programming model addressed specifically to FPGAs. OmpSs environment is built on top of Mercurium source to source compiler and Nanos++ runtime system. To address FPGA specifics Mercurium compiler implements several FPGA related features as local variable caching, wide memory accesses or accelerator replication. In addition, part of the Nanos++ runtime has been ported to hardware. Driven by the compiler this new hardware runtime adds new features to FPGA codes, such as task creation and dependence management, providing both performance increases and ease of programming. To demonstrate these new capabilities, different high performance benchmarks have been evaluated over different FPGA platforms using the OmpSs programming model. The results demonstrate that programs that use the OmpSs programming model achieve very competitive performance with low to moderate porting effort compared to other FPGA implementations.
We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-ran...
详细信息
We extend the capability of space-time geostatistical modeling using algebraic approximations, illustrating application-expected accuracy worthy of double precision from majority low-precision computations and low-rank matrix approximations. We exploit the mathematical structure of the dense covariance matrix whose inverse action and determinant are repeatedly required in Gaussian log-likelihood optimization. Geostatistics augments first-principles modeling approaches for the prediction of environmental phenomena given the availability of measurements at a large number of locations; however, traditional Cholesky-based approaches grow cubically in complexity, gating practical extension to continental and global datasets now available. We combine the linear algebraic contributions of mixed-precision and low-rank computations within a tile-based Cholesky solver with on-demand casting of precisions and dynamic runtime support from PaRSEC to orchestrate tasks and data movement. Our adaptive approach scales on various systems and leverages the Fujitsu A64FX nodes of Fugaku to achieve up to 12X performance speedup against the highly optimized dense Cholesky implementation.
task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with a defined input and output with the dependencies between them. Then, the runtime can schedule the tasks and data migrations e...
详细信息
ISBN:
(纸本)9783030867973;9783030504250
task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with a defined input and output with the dependencies between them. Then, the runtime can schedule the tasks and data migrations efficiently over all the available cores while reducing the waiting time between tasks. This paper focus on comparing several task-based programming models between themselves using the LU factorization as benchmark. HPX, PaRSEC, Legion and YML+XMP are task-based programming models which schedule data movement and computational tasks on distributed resources allocated to the application. YML+XMP supports parallel and distributed tasks with XscalableMP, a PGAS language. We compared their performances and scalability are compared to ScaLAPACK, an highly optimized library which uses MPI to perform communications between the processes on up to 64 nodes. We performed a block-based LU factorization with the task-basedprogramming model on up to a matrix of size 49512 x 49512. HPX is performing better than PaRSEC, Legion and YML+XMP but not better than ScaLAPACK. YML+XMP has a better scalability than HPX, Legion and PaRSEC. Regent has trouble scaling from 32 nodes to 64 nodes with our algorithm.
On-node parallelism has increased significantly in high-performance computing systems. This huge amount of parallelism can be used to speed up regular paral- lel applications relatively easily because straightforward ...
详细信息
On-node parallelism has increased significantly in high-performance computing systems. This huge amount of parallelism can be used to speed up regular paral- lel applications relatively easily because straightforward approaches usually suffice to map their computation patterns and data layouts on to available on-node parallelism. However, irregular parallel applications require considerable effort to run on the mod- ern processors with massive amounts of intra-node parallelism. Parallel programmingmodels and runtime approaches have been proposed to help programmers to write those applications quickly, but it's still not easy to write efficient irregular paral- lel applications. Two key challenges in mapping irregular applications onto on-node parallelism are load balance and computation-communication overlap. In this thesis proposal, we address these challenges through new runtime approaches and new APIs that enable users to provide minimal information for application-aware scheduling. First, we introduce new algorithms to improve the scheduling of irregular task graphs containing a mix of communication and computation tasks with data-parallelism and blocking operations. We combine gang-scheduling with work-stealing for data- parallel tasks with frequent inter/intra-node communication in the task graphs so as to reduce interference and expensive context switching operations. We also propose improved victim selection policies for work-stealing to improve the load balance and overlap of ready tasks that have child tasks. Next, we propose an efficient integrated runtime system to handle load balancing of irregular applications written in hybrid parallel programmingmodels. We introduce a unified runtime system that integrates distributed and shared-memory programming, as exemplified by the combination of Charm++ and OpenMP. In this approach, all processing resources (cores) can be used flexibly across both the distributed and shared-memory levels, thereby enabling mor
暂无评论