ISBN (Print): 9781450393393
The increase in the complexity, diversity, and scale of high-performance computing environments, as well as the increasing sophistication of parallel applications and algorithms, calls for productivity-aware programming languages for high-performance computing. Among them, the Chapel programming language stands out as one of the more successful approaches based on the Partitioned Global Address Space programming model. Although Chapel is designed for productive parallel computing at scale, the question of its competitiveness with well-established conventional parallel programming environments arises. To this end, this work compares the performance of Chapel-based fractal generation on shared- and distributed-memory platforms with corresponding OpenMP and MPI+X implementations. The parallel computation of the Mandelbrot set is chosen as a test case for its high degree of parallelism and its irregular workload. Experiments are performed on a cluster composed of 192 cores using the French national testbed Grid'5000. Chapel and its default tasking layer demonstrate high performance in the shared-memory context, while Chapel competes with hybrid MPI+OpenMP in the distributed-memory environment.
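The irregular workload mentioned in the abstract comes from the escape-time algorithm: pixels inside the Mandelbrot set cost the full iteration budget, while pixels that escape quickly cost almost nothing, so static work partitioning load-imbalances badly. The following is a minimal Python sketch of that structure (function names and parameters are illustrative, not from the paper; threads stand in for Chapel tasks or OpenMP threads, and this sketch shows only the scheduling structure, not real speedup in CPython):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ITER = 100

def escape_time(c, max_iter=MAX_ITER):
    """Iterations until |z| > 2; points inside the set cost the full budget."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n
    return max_iter

def render_row(y, width):
    # Map pixels onto the region [-2, 1] x [-1.5i, 1.5i] of the complex plane.
    im = -1.5 + 3.0 * y / width
    return [escape_time(complex(-2.0 + 3.0 * x / width, im)) for x in range(width)]

def render(width=32, workers=4):
    # Submitting each row as its own task gives dynamic load balancing:
    # rows crossing the set take up to max_iter iterations per pixel,
    # while rows whose points escape quickly finish almost immediately.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: render_row(y, width), range(width)))
```

Chapel's `dynamic` iterators and OpenMP's `schedule(dynamic)` address the same imbalance, which is why this kernel is a useful stress test for a tasking runtime.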
The ubiquity of distributed agreement protocols, such as consensus, has galvanized interest in the verification of such protocols as well as of applications built on top of them. The complexity and unboundedness of such systems, however, make their verification onerous in general and particularly prohibitive for full automation. An exciting, recent breakthrough reveals that, through careful modeling, it becomes possible to reduce verification of interesting distributed agreement-based (DAB) systems, which are unbounded in the number of processes, to model checking of small, finite-state systems. It is an open question whether such reductions are also possible for DAB systems that are doubly-unbounded, in particular, DAB systems that additionally have unbounded data domains. We answer this question in the affirmative in this work, thereby broadening the class of DAB systems that can be automatically and efficiently verified. We present a novel reduction which leverages value symmetry and a new notion of data saturation to reduce verification of doubly-unbounded DAB systems to model checking of small, finite-state systems. We develop a tool, Venus, that can efficiently verify sophisticated DAB system models such as the arbitration mechanism for a consortium blockchain, a distributed register, and a simple key-value store.
In recent years, large language models (LLMs) based on the Transformer architecture have demonstrated excellent performance in code generation, but there have been fewer studies on data flow languages. This study prop...
Very relaxed concurrency memory models, like those of the Arm-A, RISC-V, and IBM Power hardware architectures, underpin much of computing but break a fundamental intuition about programs, namely that syntactic program order and the reads-from relation always both induce order in the execution. Instead, out-of-order execution is allowed except where prevented by certain pairwise dependencies, barriers, or other synchronisation. This means that there is no notion of the 'current' state of the program, making it challenging to design (and prove sound) syntax-directed, modular reasoning methods like Hoare logics, as usable resources cannot implicitly flow from one program point to the next. We present AxSL, a separation logic for the relaxed memory model of Arm-A, that captures the fine-grained reasoning underpinning the low-overhead synchronisation mechanisms used by high-performance systems code. In particular, AxSL allows transferring arbitrary resources using relaxed reads and writes when they induce inter-thread ordering. We mechanise AxSL in the Iris separation logic framework, illustrate it on key examples, and prove it sound with respect to the axiomatic memory model of Arm-A. Our approach is largely generic in the axiomatic model and in the instruction-set semantics, offering a potential way forward for compositional reasoning for other similar models, and for the combination of production concurrency models and full-scale ISAs.
ISBN (Print): 9798400716836
In distributed systems, remote Application Programming Interfaces (APIs) let architectural components such as microservices communicate with each other; interoperability and satisfactory developer experience are key stakeholder concerns. In response to changing requirements and insights from development and operations, API endpoints and the request and response messages of the exposed operations are actively designed and then modified during the entire life cycle of the system. Refactoring is a crucial practice in agile software development, widely adopted in practice at the code level. Architectural refactoring has been researched but has not been adopted nearly as widely as code-level refactoring. This paper continues our work on refactoring remote APIs, which we introduced at EuroPLoP 2023. We present a second slice of seven API refactorings pulled from our online Interface Refactoring Catalog, many of which target API design patterns: Extract Information Holder, Inline Information Holder, Extract Operation, Rename Operation, Make Request Conditional, Encapsulate Context Representation, and Introduce Version Identifier. Besides context, problem, and step-by-step solution, we also motivate the refactorings by stakeholder concerns and identify the design smells that refactoring can address. All refactorings are illustrated with implementation code snippets, excerpts from API specifications, and/or examples of messages exchanged at runtime. The paper concludes with an outlook on future work.
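One refactoring from the list, Rename Operation, can be illustrated with a small sketch: the operation is re-registered under its new name while the old name remains as a deprecated alias, so existing clients keep working during migration. This is a hypothetical illustration, not code from the Interface Refactoring Catalog; the router class and route names are invented:

```python
import warnings

class ApiRouter:
    """Toy dispatcher standing in for an API gateway's operation table."""

    def __init__(self):
        self._routes = {}

    def register(self, name, handler):
        self._routes[name] = handler

    def rename(self, old, new):
        # Rename Operation: re-register the handler under the new name,
        # then leave a deprecation shim under the old name instead of
        # deleting it, so callers of the old operation are not broken.
        handler = self._routes[new] = self._routes.pop(old)
        def shim(*args, **kwargs):
            warnings.warn(f"{old} is deprecated; use {new}", DeprecationWarning)
            return handler(*args, **kwargs)
        self._routes[old] = shim

    def call(self, name, *args, **kwargs):
        return self._routes[name](*args, **kwargs)
```

The shim is the step that distinguishes an API-level rename from a code-level one: remote clients cannot be updated atomically, so the old name must survive (and warn) until the migration window closes.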
ISBN (Print): 9798400703805
In this work-in-progress research paper, we make the case for using Rust to develop applications in the High Performance Computing (HPC) domain, which is critically dependent on native C/C++ libraries. This work explores one example of Safe HPC via the design of a Rust interface to an existing distributed C++ Actors library. This existing library has been shown to deliver high performance to C++ developers of irregular Partitioned Global Address Space (PGAS) applications. Our key contribution is a proof-of-concept framework to express parallel programs safely in Rust (and potentially other languages/systems), along with a corresponding study of the problems solved by our runtime, the implementation challenges faced, and user productivity. We also conducted an early evaluation of our approach by converting C++ actor implementations of four applications taken from the Bale kernels to Rust Actors using our framework. Our results show that the productivity benefits of our approach are significant, since our Rust-based approach helped catch bugs statically during application development, without degrading performance relative to the original C++ actor versions.
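The safety argument behind the actor model used here is that each actor owns its state exclusively and mutates it only in response to messages from a private mailbox, so no locks or shared mutable aliases are needed. A minimal Python sketch of that discipline (illustrative only; it mirrors the pattern, not the paper's Rust/C++ API):

```python
import queue
import threading

class CounterActor:
    """An actor whose state is touched only by its own mailbox-draining thread."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, msg):
        # The only way to affect the actor: enqueue a message.
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg == "stop":
                break
            self._count += msg  # exclusive ownership: no lock required

    def join(self):
        self.send("stop")
        self._thread.join()
        return self._count
```

Rust's ownership system can enforce this exclusivity statically, which is the kind of bug-catching the evaluation refers to; in Python (or C++) the discipline is only by convention.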
This special issue includes a selection of the artefacts presented at the 18th International Federated Conference on Distributed Computing Techniques (DisCoTec 2023), held at the NOVA University Lisbon (Lisbon, Portugal) on June 18-23, 2023. The federated conference included COORDINATION 2023, the 25th International Conference on Coordination Models and Languages; DAIS 2023, the 23rd International Conference on Distributed Applications and Interoperable Systems; and FORTE 2023, the 43rd International Conference on Formal Techniques for Distributed Objects, Components, and Systems. All three conferences welcomed submissions describing technological artefacts, including innovative prototypes supporting the modelling, development, analysis, simulation, or testing of systems in the broad spectrum of distributed computing subjects. The artefact evaluation chairs selected a subset of high-quality accepted artefacts to be invited for submission to this special issue. Following the revision process, nine artefacts were accepted for this special issue. The published contributions include different types of artefacts, including programming libraries and frameworks, as well as tools for the analysis, verification, and simulation of distributed systems.
ISBN (Print): 9798400701696
Multi-GPU nodes are widely used in high-performance computing and data centers. However, current programming models do not provide transparent and portable support for automatically targeting multiple GPUs within a node. In this paper, we describe a new application programming interface based on the Kokkos programming model that enables array computation on multiple GPUs in a transparent and portable way across both NVIDIA and AMD GPUs. We implement different variations of this technique to accommodate the exchange of stencils, and we provide autotuning to select the proper number of GPUs, depending on the computational cost of the operations to be computed on arrays. We evaluate our multi-GPU extension on Summit (#5 TOP500), with six NVIDIA V100 Volta GPUs per node, and on Crusher, which contains identical hardware and software to Frontier (#1 TOP500), with four AMD MI250X GPUs, each with 2 Graphics Compute Dies (GCDs), for a total of 8 GCDs per node. We also compare the performance of this solution against the use of MPI + Kokkos. Our evaluation shows that the new Kokkos solution provides good scalability for many GPUs when compared with MPI + Kokkos.
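The "exchange of stencils" the abstract mentions is the classic halo (ghost-cell) exchange: when an array is partitioned across devices, each chunk needs a copy of its neighbours' boundary elements before a stencil step. A minimal pure-Python sketch of that bookkeeping (the function names are illustrative and the lists stand in for per-GPU device arrays; the real extension moves these halos between GPUs):

```python
def partition(arr, n):
    """Split a 1-D array into n equal contiguous chunks (one per device)."""
    size = len(arr) // n
    return [arr[i * size:(i + 1) * size] for i in range(n)]

def exchange_halos(chunks, boundary=0.0):
    """Pad each chunk with one ghost cell per side, filled from its neighbour."""
    padded = []
    for i, c in enumerate(chunks):
        left = chunks[i - 1][-1] if i > 0 else boundary
        right = chunks[i + 1][0] if i < len(chunks) - 1 else boundary
        padded.append([left] + c + [right])
    return padded

def stencil_step(chunks):
    """One 3-point averaging step; each chunk computes only its own interior."""
    padded = exchange_halos(chunks)
    return [[(p[j - 1] + p[j] + p[j + 1]) / 3.0 for j in range(1, len(p) - 1)]
            for p in padded]
```

The correctness condition is that the multi-device result, flattened, equals the single-device result; the halo exchange is exactly what makes that hold at chunk boundaries.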
ISBN (Print): 9798400701016
Achieving peak throughput on modern CPUs requires maximizing the use of single-instruction, multiple-data (SIMD) or vector compute units. Single-program, multiple-data (SPMD) programming models are an effective way to use high-level programming languages to target these ISAs. Unfortunately, many SPMD frameworks have evolved to have either overly restrictive language specifications or under-specified programming models, and this has slowed the wide-scale adoption of SPMD-style programming. This paper introduces Parsimony (PARallel SIMd), an SPMD programming approach built with semantics designed to be compatible with multiple languages and to cleanly integrate into the standard optimizing compiler toolchains for those languages. We first explain the Parsimony programming model semantics and how they enable a standalone compiler IR-to-IR pass that can perform vectorization independently of other passes, improving the language and toolchain compatibility of SPMD programming. We then demonstrate an LLVM prototype of the Parsimony approach that matches the performance of ispc, a popular but more restrictive SPMD approach, and achieves 97% of the performance of hand-written AVX-512 SIMD intrinsics on over 70 benchmarks ported from the Simd Library. We finally discuss where Parsimony has exposed parts of existing language and compiler flows where slight improvements could further enable improved SPMD program vectorization.
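The core SPMD idea is that one scalar program runs on every lane of a fixed-width vector, and divergent branches are handled by computing both sides and blending per-lane results under a mask, which is what a vectorizing pass must ultimately emit as SIMD select instructions. A minimal Python sketch of that execution model (lane width and function names are illustrative, not tied to Parsimony or ispc):

```python
WIDTH = 8  # a hypothetical vector width, e.g. 8 lanes of 32-bit values

def spmd_abs(xs):
    """Per-lane program `y = x if x >= 0 else -x`, run in SPMD style.

    Instead of branching, all lanes evaluate both sides of the `if`,
    and a mask (the branch predicate per lane) selects each lane's result,
    mirroring a SIMD blend/select instruction.
    """
    assert len(xs) == WIDTH
    mask = [x >= 0 for x in xs]      # per-lane predicate for the 'then' side
    then_vals = xs                   # both branch arms are computed for all lanes
    else_vals = [-x for x in xs]
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]
```

Over-restricting what the scalar program may do (or under-specifying how masks compose) is precisely the design tension between SPMD frameworks that the abstract describes.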
ISBN (Print): 9798350311990
We explore the performance and portability of the high-level programming models Julia (LLVM-based), Python/Numba, and Kokkos on high-performance computing (HPC) nodes: AMD EPYC CPUs and MI250X graphics processing units (GPUs) on Frontier's test bed system Crusher, and Ampere's Arm-based CPUs and NVIDIA's A100 GPUs on the Wombat system at the Oak Ridge Leadership Computing Facility. We compare the default performance of a hand-rolled dense matrix multiplication algorithm on CPUs against vendor-compiled C/OpenMP implementations, and on each GPU against CUDA and HIP. Rather than focusing on kernel optimization per se, we select this naive approach to resemble exploratory work in science and as a lower bound for performance, isolating the effect of each programming model. Julia and Kokkos perform comparably with C/OpenMP on CPUs, while Julia implementations are competitive with CUDA and HIP on GPUs. Performance gaps are identified on NVIDIA A100 GPUs for Julia's single precision and for Kokkos, and for Python/Numba in all scenarios. We also comment on half-precision support, productivity, performance portability metrics, and platform readiness. We expect to contribute to the understanding of and direction for high-level, high-productivity languages in HPC as the first-generation exascale systems are deployed.
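The "hand-rolled dense matrix multiplication" used as the benchmark is the textbook triple loop; writing it out makes clear why it is a fair lower bound: every programming model receives the same naive code and must map the loop nest to threads or a GPU itself. A minimal Python version (illustrative; the paper's kernels are in Julia, Python/Numba, Kokkos, C/OpenMP, CUDA, and HIP):

```python
def matmul(a, b):
    """Naive O(n*m*k) dense matrix multiply over lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must agree"
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):          # in the compared models, these outer loops
        for j in range(m):      # are what gets parallelized or GPU-offloaded
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            c[i][j] = s
    return c
```

Because no blocking, vectorization hints, or library calls are present, any performance difference between models on this kernel reflects the model's compiler and runtime rather than hand-tuning effort, which is exactly the isolation the study aims for.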