This paper introduces the JStar parallelprogramming language, which is a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallelprogramming, and giving the compil...
详细信息
This paper introduces the JStar parallelprogramming language, which is a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallelprogramming, and giving the compiler and runtime maximum freedom to try alternative parallelisation strategies. We describe the execution semantics and runtime support of the language, several optimisations and parallelism strategies, with some benchmark results. (C) 2013 Elsevier B.V. All rights reserved.
StarSs is a task-based programming model that allows to parallelize sequential applications by means of annotating the code with compiler directives. The model further supports transparent execution of designated task...
详细信息
StarSs is a task-based programming model that allows to parallelize sequential applications by means of annotating the code with compiler directives. The model further supports transparent execution of designated tasks on heterogeneous platforms, including clusters of GPUs. This paper focuses on the methodology and tools that complements the programming model forming a consistent development environment with the objective of simplifying the live of application developers. The programming environment includes the tools TAREADOR and TEMANEJO, which have been designed specifically for StarSs. TAREADOR, a Valgrind-based tool, allows a top-down development approach by assisting the programmer in identifying tasks and their data-dependencies across all concurrency levels of an application. TEMANEJO is a graphical debugger supporting the programmer by visualizing the task dependency tree on one hand, but also allowing to manipulate task scheduling or dependencies. These tools are complemented with a set of performance analysis tools (Scalasca, Cube and Paraver) that enable to fine tune StarSs application. (C) 2013 Elsevier By. All rights reserved.
Current generation of computing platforms is embracing multi-core and many-core processors to improve the overall performance of the system, meeting at the same time the stringent energy budgets requested by the marke...
详细信息
Current generation of computing platforms is embracing multi-core and many-core processors to improve the overall performance of the system, meeting at the same time the stringent energy budgets requested by the market. parallelprogramming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche in the high performance computing (HPC) field, but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high-performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. This converging demand raises the problem about how to guarantee timing requirements in presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives. (c) 2015 Elsevier B.V. All rights reserved.
A Polymorphic Processor Array (PPA) is a two-dimensional mesh-connected array of processors, in which each processor is equipped with a switch able to interconnect its four NEWS ports. PPA is an abstract architecture ...
详细信息
A Polymorphic Processor Array (PPA) is a two-dimensional mesh-connected array of processors, in which each processor is equipped with a switch able to interconnect its four NEWS ports. PPA is an abstract architecture based upon the experience acquired in the design and in the implementation of a VLSI chip, namely the Polymorphic Torus (PT) chip, and, as a consequence, it only includes capabilities that have been proved to be supported by cost-effective hardware structures. The main claims of PPA are that 1) it models a realistic class of parallel computers, 2) it supports the definition of high level programmingmodels, 3) it supports virtual parallelism and 4) it supports low complexity algorithms in a number of application fields. In this paper we present both the PPA computation model and the PPA programming model;we show that the PPA computation model is realistic by relating it to the design of the PT chip and show that the PPA programming model is scalable by demonstrating that any algorithm having O(p) complexity on a virtual PPA of size square-root m x square-root m, has O(kp) complexity on a PPA of size square-root n x square-root n, with m = kn and k integer. We finally show some application algorithms in the area of numerical analysis and graph processing.
The Partitioned Global Address Space (PGAS) programming model is one of the most relevant proposals to improve the ability of developers to exploit distributed memory systems. However, despite its important advantages...
详细信息
The Partitioned Global Address Space (PGAS) programming model is one of the most relevant proposals to improve the ability of developers to exploit distributed memory systems. However, despite its important advantages with respect to the traditional message-passing paradigm, PGAS has not been yet widely adopted. We think that PGAS libraries are more promising than languages because they avoid the requirement to (re) write the applications using them, with the implied uncertainties related to portability and interoperability with the vast amount of APIs and libraries that exist for widespread languages. Nevertheless, the need to embed these libraries within a host language can limit their expressiveness and very useful features can be missing. This paper contributes to the advance of PGAS by enabling the simple development of arbitrarily complex task-parallel codes following a dataflow approach on top of the PGAS UPC++ library, implemented in C++. In addition, our proposal, called UPC++ DepSpawn, relies on an optimized multithreaded runtime that provides very competitive performance, as our experimental evaluation shows.
Technological directions for innovative HPC software environments are discussed in this paper. We focus on industrial user requirements of heterogeneous multidisciplinary applications, performance portability, rapid p...
详细信息
Technological directions for innovative HPC software environments are discussed in this paper. We focus on industrial user requirements of heterogeneous multidisciplinary applications, performance portability, rapid prototyping and software reuse, integration and interoperability of standard tools. The Various issues are demonstrated with reference to the PQE2000 project and its programming environment Skeleton-based Integrated Environment (SkIE), SkIE includes a coordination language, SkIECL, allowing the designers to express, in a primitive and structured way, efficient combinations of data parallelism and task parallelism. The goal is achieving fast development and good efficiency for applications in different areas. Modules developed with standard languages and tools are encapsulated into SkIECL structures to form the global application. Performance models associated to the coordination language allow powerful optimizations to be introduced both at run time and at compile time without the direct intervention of the programmer. The paper also discusses the features of the SkIE environment related to debugging, performance analysis tools, visualization and graphical user interface. A discussion of the results achieved in some applications developed using the environment concludes the paper. (C) 1999 Elsevier Science B.V. All rights reserved.
Heterogeneous systems coupling a main host processor with one or more manycore accelerators are being adopted virtually at every scale to achieve ever-increasing GOps/Watt targets. The increased hardware complexity of...
详细信息
Heterogeneous systems coupling a main host processor with one or more manycore accelerators are being adopted virtually at every scale to achieve ever-increasing GOps/Watt targets. The increased hardware complexity of such systems is paired at the application level by a growing number of applications concurrently running on the system. Techniques that enable efficient accelerator resources sharing, supporting multiple programmingmodels will thus be increasingly important for future heterogeneous SoCs. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of offloaded computation kernels from different programmingmodels. The runtime supports spatial partitioning, where clusters can be grouped into several virtual accelerator instances. Our runtime design is modular and relies on a generic component for resource (cluster) scheduling, plus specialized components which deploy generic offload requests into the target programming model semantics. We evaluate the proposed runtime system on two real heterogeneous systems, focusing on two concrete use cases: i) single-user, multi-application high-end embedded systems and ii) multi-user, multi-workload low-power microservers. In the first case, our approach achieves 93 percent efficiency in terms of available accelerator resource exploitation. In the second case, our support allows 47 percent performance improvement compared to single-programming model systems.
The next-generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data wi...
详细信息
The next-generation of partially and fully autonomous cars will be powered by embedded many-core platforms. Technologies for Advanced Driver Assistance Systems (ADAS) need to process an unprecedented amount of data within tight power budgets, making those platform the ideal candidate architecture. Integrating tens-to-hundreds of computing elements that run at lower frequencies allows obtaining impressive performance capabilities at a reduced power consumption, that meets the size, weight and power (SWaP) budget of automotive systems. Unfortunately, the inherent architectural complexity of many-core platforms makes it almost impossible to derive real-time guarantees using "traditional" state-of-the-art techniques, ultimately preventing their adoption in real industrial settings. Having impressive average performances with no guaranteed bounds on the response times of the critical computing activities is of little if no use in safety-critical applications. Project Hercules will address this issue, and provide the required technological infrastructure to exploit the tremendous potential of embedded many-cores for the next generation of automotive systems. This work gives an overview of the integrated Hercules software framework, which allows achieving an order-of-magnitude of predictable performance on top of cutting edge Commercial-Off-The-Shelf components (COTS). The proposed software stack will let both real-time and non real-time application coexist on next-generation, power-efficient embedded platforms, with preserved timing guarantees. (C) 2017 Elsevier B.V. All rights reserved.
In this paper we describe the parallelization of the multi-zone code versions of the NAS parallel Benchmarks employing multi-level OpenMP parallelism. For our study, we use the NanosCompiler that supports nesting of O...
详细信息
In this paper we describe the parallelization of the multi-zone code versions of the NAS parallel Benchmarks employing multi-level OpenMP parallelism. For our study, we use the NanosCompiler that supports nesting of OpenMP directives and provides clauses to control the grouping of threads, load balancing, and synchronization. We report the benchmark results, compare the timings with those of different hybrid parallelization paradigms (MPI+OpenMP and PLP) and discuss OpenMP implementation issues that affect the performance of multi-level parallel applications. (c) 2005 Elsevier Inc. All rights reserved.
The authors present the parallel-hybrid recursive least square (PH-RLS) algorithm for an accurate self-mixing interferometric laser vibration sensor coupled with an accelerometer under industrial conditions. Previousl...
详细信息
The authors present the parallel-hybrid recursive least square (PH-RLS) algorithm for an accurate self-mixing interferometric laser vibration sensor coupled with an accelerometer under industrial conditions. Previously, this was achieved by using a conventional RLS algorithm to cancel the parasitic vibrations where the sensor itself is not in the stationary environment. This algorithm operates in sequential mode and due to its compute and data-intensive nature, the algorithm does not work for real-time applications, hence requires parallel computing. Therefore, the existing conventional RLS C program is parallelized by using hybrid OpenACe C/MPI (Open Accelerators/Message Passing Interface) parallel programming models and tested on Barcelona Supercomputing Center CTE-Power9 Supercomputer. The computational performance of the proposed PH-RLS algorithm is compared with the existing conventional RLS code by executing on multi distributed processors and uni-core processor architecture, respectively. While comparing the performance of conventional RLS with a PH-RLS algorithm on eight nodes of CTE-Power9 supercomputer, the results show that the PH-RLS algorithm gets 5857 times of performance improvement as compared to the conventional RLS implementation on a single node system. The results show that the proposed PH-RLS also gives a scalable performance for a different range of vibration signals, making it a suitable choice for real-time self-mixing interferometer sensing systems working under industrial conditions.
暂无评论