Summary form only given, as follows. The complete presentation was not made available for publication as part of the conference proceedings. In recent years, the traditional ways of keeping hardware performance growing at the rate predicted by Moore's Law have vanished. When uni-cores were the norm, hardware design was decoupled from the software stack thanks to a well-defined Instruction Set Architecture (ISA). This simple interface allowed developing applications without worrying too much about the underlying hardware, while computer architects proposed techniques to aggressively exploit Instruction-Level Parallelism (ILP) in superscalar processors. Current multi-cores are designed as simple symmetric multiprocessors on a chip. While these designs are able to compensate for the stagnation of clock frequency, they face multiple problems in terms of power consumption, programmability, resilience, and memory. The solution is to give more responsibility to the runtime system and to let it collaborate tightly with the hardware. The runtime has to drive the design of future multi-core architectures. In this talk, we introduce an approach towards a Runtime-Aware Architecture (RAA), a massively parallel architecture designed from the runtime's perspective. RAA aims at supporting the activity of the parallel runtime system in three ways: first, to enable fine-grain tasking and support the opportunities it offers; second, to improve the performance of the memory subsystem by exposing hybrid hierarchies to the runtime system; and third, to improve performance by using vector units. During the talk, we will give a general overview of the problems RAA aims to solve and provide some examples of hardware components supporting the activity of the runtime system in the context of multi-core chips.
A case study is presented that provides computation caching (memoization) through a microservice architecture to high-performance computing (HPC) applications, particularly the ExMatEx proxy application CoEVP (Co-designed Embedded ViscoPlasticity Scale-bridging). CoEVP represents a class of multiscale physics methods in which inexpensive coarse-scale models are combined with expensive fine-scale models to simulate physical phenomena scalably across multiple time and length scales. Recently, CoEVP has employed interpolation based on previously executed fine-scale models in order to reduce the number of fine-scale evaluations needed to advance the simulation. Building on this work, we envision that distributed microservices composed to provide new capabilities to large-scale parallel applications can be an important component in simulating ever-larger systems at ever-greater fidelities. We explore three aspects of a microservice composition for interpolation-based memoization in our study. First, we present a cost assessment of CoEVP's current fine-scale modeling and interpolation approach. Second, we present an alternative interpolation strategy in which interpolation models are directly constructed on demand from previous fine-scale evaluations: a "database of points" rather than a "database of models." Third, we evaluate the characteristics of the two approaches with and without cross-process sharing of database entries. Lessons learned from the study are used to inform designs for future work in developing distributed, large-scale memoization services for HPC.
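The "database of points" idea in the abstract above can be illustrated with a minimal sketch. This is not CoEVP's implementation: it assumes a scalar input, linear interpolation between the two bracketing stored points, and a hypothetical `max_gap` threshold that decides when stored points are close enough to interpolate. The names `PointDatabase`, `fine_model`, and `max_gap` are all assumptions made for illustration.

```python
import bisect


class PointDatabase:
    """Memoizes fine-scale evaluations as raw (x, y) points (hypothetical sketch)."""

    def __init__(self, fine_model, max_gap):
        self.fine_model = fine_model  # expensive function to avoid calling
        self.max_gap = max_gap        # assumed threshold for trusting interpolation
        self.xs = []                  # stored inputs, kept sorted
        self.ys = []                  # stored outputs, aligned with xs
        self.fine_calls = 0           # number of expensive evaluations so far

    def evaluate(self, x):
        i = bisect.bisect_left(self.xs, x)
        # Exact hit: plain memoization.
        if i < len(self.xs) and self.xs[i] == x:
            return self.ys[i]
        # Build an interpolant on demand from the two bracketing points.
        if 0 < i < len(self.xs):
            x0, x1 = self.xs[i - 1], self.xs[i]
            if x1 - x0 <= self.max_gap:
                t = (x - x0) / (x1 - x0)
                return (1 - t) * self.ys[i - 1] + t * self.ys[i]
        # Otherwise fall back to the fine-scale model and store the new point.
        y = self.fine_model(x)
        self.fine_calls += 1
        self.xs.insert(i, x)
        self.ys.insert(i, y)
        return y
```

In this sketch the interpolation model is rebuilt from stored points on every query, which is the distinction the abstract draws against storing prebuilt models. A shared service would replace the in-process lists with a remote store.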
We investigate several parallel algorithmic variants of the LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic to the basic linear algebra kernels that appear during the factorization, but exerts strict control over the cache memory and a static mapping of kernels to cores. A second variant relaxes this task-constrained scenario by introducing a look-ahead of depth one to increase task-parallelism, at the cost of increasing the pressure on the cache system in terms of cache misses. Finally, the third variant orchestrates an execution where the degree of concurrency is limited only by the actual data dependencies in LUpp, potentially yielding a higher volume of conflicts due to competition for the cache memory resources. The target platform for our implementations and experiments is a specific asymmetric multicore processor (AMP) from ARM, which introduces the additional scheduling complexity of having to deal with two distinct types of cores, and an L2 cache shared per cluster of the AMP, which results in more contention in the access to this key cache level.
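For reference, the numerical kernel that all three variants parallelize can be sketched as a sequential, unblocked LU factorization with partial pivoting. This is a textbook right-looking formulation written for illustration, not any of the variants studied in the abstract; the function names `lu_partial_pivoting` and `reconstruct` are assumptions.

```python
def lu_partial_pivoting(a):
    """Return (perm, lu) such that row i of P*A is a[perm[i]] and P*A = L*U.

    L (unit lower triangular) and U are packed together in `lu`.
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]  # work on a copy
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        # Scale the column and update the trailing submatrix.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return perm, lu


def reconstruct(lu):
    """Multiply the packed L and U factors back together (for checking)."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(min(i, j) + 1):
                l_ik = 1.0 if k == i else lu[i][k]
                out[i][j] += l_ik * lu[k][j]
    return out
```

In blocked implementations the trailing update dominates the cost and its row updates are independent of one another, which is where the task-parallelism discussed above comes from. The pivot search and column scaling form the sequential critical path that a look-ahead scheme tries to overlap with the update.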
The tasking model of OpenMP 4.0 supports both nesting and the definition of dependences between sibling tasks. A natural way to parallelize many codes with tasks is to first taskify the high-level functions and then to further refine these tasks with additional subtasks. However, this top-down approach has some drawbacks, since combining nesting with dependencies usually requires additional measures to enforce the correct coordination of dependencies across nesting levels. For instance, most non-leaf tasks need to include a taskwait at the end of their code. While these measures enforce the correct order of execution, as a side effect, they also limit the discovery of parallelism. In this paper we extend the OpenMP tasking model to improve the integration of nesting and dependencies. Our proposal builds on both mechanisms, nesting and dependencies, and benefits from their individual strengths. On the one hand, it encourages a top-down approach to parallelizing codes that also enables the parallel instantiation of tasks. On the other hand, it allows the runtime to control dependencies at a fine grain that until now was only possible using a single domain of dependencies. Our proposal is realized through additions to the OpenMP task directive that ensure backward compatibility with current codes. We have implemented a new runtime with these extensions and used it to evaluate the impact on several benchmarks. Our initial findings show that our extensions improve performance in three areas. First, they expose more parallelism. Second, they uncover dependencies across nesting levels, which allows the runtime to make better scheduling decisions. And third, they allow the parallel instantiation of tasks with dependencies between them.
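The taskwait problem described above can be mimicked outside OpenMP. The following sketch is only an analogy: it uses Python's `concurrent.futures` thread pool as a stand-in for an OpenMP runtime, and the names `fill_block`, `producer`, and `run_pipeline` are hypothetical. It does not show the paper's proposed extension.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def fill_block(data, lo, hi):
    """Leaf task: writes one slice of the shared buffer."""
    for i in range(lo, hi):
        data[i] = i * i


def producer(pool, data):
    """Non-leaf task: refines its work into two subtasks."""
    mid = len(data) // 2
    children = [
        pool.submit(fill_block, data, 0, mid),
        pool.submit(fill_block, data, mid, len(data)),
    ]
    # Plays the role of `taskwait`: without it, the producer would be
    # reported complete while its children were still writing `data`.
    wait(children)
    for child in children:
        child.result()  # re-raise any exception from a subtask


def run_pipeline(n):
    data = [0] * n
    # Four workers, so the blocked parent cannot starve its two children.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The consumer depends on the producer task as a whole.
        pool.submit(producer, pool, data).result()
        return pool.submit(sum, data).result()
```

The wait keeps the result correct, but it also hides the subtasks from the consumer: work that needs only the first half of `data` still cannot start until both halves are finished. That lost overlap is the limitation the abstract attributes to combining nesting with sibling-only dependences.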
ISBN: (print) 9781467371483
The proceedings contain 26 papers. The topics discussed include: getting ready for approximate computing: trading parallelism for accuracy for DSS workloads; dataClay: the integration of persistent data, parallel programming models, and true sharing; Intel architecture and technology for future HPC system building blocks; personalized motion sensor driven gesture recognition in the FIWARE cloud platform; a simulator for analysis of opportunistic routing algorithms; multilevel task parallelism exploitation on asymmetric sets of tasks and when using third-party tools; cache affinity optimization techniques for scaling software transactional memory systems on multi-CMP architectures; high-speed security analytics powered by in-memory machine learning engine; GPU-accelerated digital halftoning by the local exhaustive search; analyzing memory access on CPU-GPGPU shared LLC architecture; and schedule dynamic multiple parallel jobs with precedence-constrained tasks on heterogeneous distributed computing systems.
ISBN: (print) 9781509036820
Linked data mining has become one of the key questions in HPC graph mining in recent years. However, the existing RDF database engines are not scalable and are less reliable in heterogeneous clouds. In this paper we describe the design and implementation of Acacia-RDF, a scalable distributed RDF graph database engine developed in the X10 programming language to address this issue. Acacia-RDF partitions RDF data sets into subgraphs following the vertex-cut paradigm. The partitioned data sets are persisted on secondary storage across X10 places. We developed a scalable SPARQL processor for Acacia-RDF which operates on top of the partitioned RDF data. Furthermore, we demonstrate the implementation of scalable graph algorithms such as triangle counting on such partitioned data sets. We present performance results gathered from Acacia-RDF with different scales of LUBM RDF benchmark data sets and compare its performance against the Neo4j graph database server. From the scalability experiments conducted on up to 16 X10 places, we observed that Acacia-RDF scales well with LUBM data sets. Acacia-RDF reported approximately 2 seconds of elapsed time on 4 places for running the first and third queries of the LUBM benchmark on the LUBM scale 40 data set. Through this work we introduce the use of the X10 language for scalable RDF graph data management.
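Triangle counting, mentioned above as one of the graph algorithms run on the partitioned data, can be sketched on a single machine as follows. This is a plain adjacency-set formulation in Python, assuming an undirected graph given as an edge list with comparable vertex identifiers. It does not reflect Acacia-RDF's X10 implementation or its vertex-cut partitioning, and the name `count_triangles` is hypothetical.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Count each triangle once, at its ordering u < v < w.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count
```

A distributed version would additionally have to handle triangles whose edges fall into different partitions, which is the part that depends on how the graph is cut.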