Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism;while the latter relies on fine-grained ...
详细信息
ISBN:
(纸本)9781728145358
Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism;while the latter relies on fine-grained synchronization among tasks and a flexible data-flow execution model to exploit dynamic, irregular, and nested parallelism. On applications that show both structured and unstructured parallelism, both worksharing and task constructs can be combined. However, it is difficult to mix both execution models without penalizing the data-flow execution model. Hence, on many applications structured parallelism is also exploited using tasks to leverage the full benefits of a pure data-flow execution model. However, task creation and management might introduce a non-negligible overhead that prevents the efficient exploitation of fine-grained structured parallelism, especially on many-core processors. In this work, we propose worksharing tasks. These are tasks that internally leverage worksharing techniques to exploit fine-grained structured loop-based parallelism. The evaluation shows promising results on several benchmarks and platforms.
The first exascale supercomputers are expected by the end of this decade and will presumably feature an increase in core count, but a decrease in the amount of memory available per core. As of now, it is still unclear...
详细信息
ISBN:
(纸本)9781479941162
The first exascale supercomputers are expected by the end of this decade and will presumably feature an increase in core count, but a decrease in the amount of memory available per core. As of now, it is still unclear if the current programming models will provide high performance on exascale systems. One programming model considered to be an alternative to MPI is the so-called partitioned global address space (PGAS) model. Within this paper we evaluate a relatively new PGAS API: the Global Address Space programming Interface (GASPI) and compare it to MPI on the basis of microbenchmarks. These benchmarks show that GASPI provides about the same level of performance for single-threaded communication, but is up to an order of magnitude faster than both Intel and IBM MPI for multi-threaded communication. Hereafter, we discuss the different features of GASPI in comparison to two main PGAS languages, namely UPC and CAF. In addition, we present a basic numerical algorithm, a dense matrix-matrix multiplication, as an example on how an implementation can make efficient use of GASPI's features, especially the asynchronous and one-sided communication mechanisms.
This work presents a novel approach to synthesize approximate circuits for the ansatze of variational quantum algorithms (VQA) and demonstrates its effectiveness in the context of solving integer linear programming (I...
详细信息
ISBN:
(纸本)9798331541378
This work presents a novel approach to synthesize approximate circuits for the ansatze of variational quantum algorithms (VQA) and demonstrates its effectiveness in the context of solving integer linear programming (ILP) problems. Synthesis is generalized to produce parametric circuits in close approximation of the original circuit and to do so offline. This removes synthesis from the (online) critical path between repeated quantum circuit executions of VQA. We hypothesize that this approach will yield novel high fidelity results beyond those discovered by the baseline without synthesis. Simulation and real device experiments complement the baseline in finding correct results in many cases where the baseline fails to find any and do so with on average 32% fewer CNOTs in circuits.
The exact likelihood function for a prototypal job search model is analyzed. The optimality condition implied by the dynamic programming framework is fully imposed. Using the optimality condition allows identification...
详细信息
The exact likelihood function for a prototypal job search model is analyzed. The optimality condition implied by the dynamic programming framework is fully imposed. Using the optimality condition allows identification of an offer arrival probability separately from an offer acceptance probability. The estimation problem is nonstandard. The geometry of the likelihood function in finite samples is considered, along with asymptotic properties of the maximum likelihood estimator.
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying t...
详细信息
ISBN:
(纸本)9781450351331
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. Using a combined data store and task scheduler redesign reduces data dependency duplication ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale.
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows were extensively studied in the past through the use of agg...
详细信息
ISBN:
(纸本)9781450340731
Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows were extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no work has been put in optimizing the aggregation of very common, non-periodic windows. Typical examples of non-periodic windows are punctuations and sessions which can implement complex business logic and are often expressed as user-defined operators on platforms such as Google Dataflow or Apache Storm. The aggregation of such non-periodic or user-defined windows either falls back to expensive, best-effort aggregate sharing methods, or is not optimized at all. In this paper we present a technique to perform efficient aggregate sharing for data stream windows, which are declared as user-defined functions (UDFs) and can contain arbitrary business logic. To this end, we first introduce the concept of User-Defined Windows (UDWs), a simple, UDF-based programming abstraction that allows users to programmatically define custom windows. We then define semantics for UDWs, based on which we design Cutty, a low-cost aggregate sharing technique. Cutty improves and outperforms the state of the art for aggregate sharing on single and multiple queries. Moreover, it enables aggregate sharing for a broad class of non-periodic UDWs. We implemented our techniques on Apache Flink, an open source stream processing system, and performed experiments demonstrating orders of magnitude of reduction in aggregation costs compared to the state of the art.
In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by...
详细信息
ISBN:
(纸本)9783030856656;9783030856649
In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components means higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-art resilience techniques based on checkpoint/restart are lagging behind performance portability efforts: users still need to capture consistent states manually, which introduces the need for fine-tuning and customization. In this paper we aim to close this gap by introducing a set of abstractions that make it easier for the application developers to reason about resilience. To this end, we extend the existing abstractions proposed by performance portability efforts towards resilience. By marking critical data structures that need to be checkpointed, one can enable an optimized runtime to automate checkpoint-restart using high performance and scalable asynchronously techniques. We illustrate the feasibility of our proposal using a prototype that combines the Kokkos runtime (HPC performance portability), with the VELOC runtime (large-scale low overhead checkpoint-restart). Our experimental results show negligible performance overhead compared with a manually tuned implementation of checkpoint-restart while requiring minimal changes in the application code.
In this paper, we present WCSim, Workflow Cloud Simulator. Firstly, we argue that this cloud simulation tool offers a high level of accessibility by allowing the description of various components, such as users, infra...
详细信息
ISBN:
(纸本)9798350305487
In this paper, we present WCSim, Workflow Cloud Simulator. Firstly, we argue that this cloud simulation tool offers a high level of accessibility by allowing the description of various components, such as users, infrastructures, and workload, of a given scenario simply by providing parameters at launch time, without requiring the extension of the simulator code. Then, we explain how we conceived the components for the simulation models and provide a detailed description of the implemented software. Additionally, we compare the results of a small scenario obtained from two other simulation tools with those provided by WCSim. Finally, we present a case study that illustrates the usage of WCSim. The paper also introduces the a abstraction to model workflows as a Direct Acyclic Graph of Bag of Tasks.
Concurrent data structures are widely used in many software stack levels, ranging from high level parallel scientific applications to low level operating systems. The key issue of these objects is their concurrent use...
详细信息
ISBN:
(纸本)9781538658154
Concurrent data structures are widely used in many software stack levels, ranging from high level parallel scientific applications to low level operating systems. The key issue of these objects is their concurrent use by several computing units (threads or process) so that the design of these structures is much more difficult compared to their sequential counterpart, because of their extremely dynamic nature requiring protocols to ensure data consistency, with a significant cost overhead. At this regard, several studies emphasize a tension between the needs of sequential correctness of the concurrent data structures and scalability of the algorithms, and in many cases it is evident the need to rethink the data structure design, using approaches based on randomization and/or redistribution techniques in order to fully exploit the computational power of the recent computing environments. The problem is grown in importance with the new generation High Performance Computing systems aimed to achieve extreme performance. It is easy to observe that such systems are based on heterogeneous architectures integrating several independent nodes in the form of clusters or MPP systems, where each node is composed by powerful computing elements (CPU core, GPUs or other acceleration devices) sharing resources in a single node. These systems therefore make massive use of communication libraries to exchange data among the nodes, as well as other tools for the management of the shared resources inside a single node. For such a reason, the development of algorithms and scientific software for dynamic data structures on these heterogeneous systems implies a suitable combination of several methodologies and tools to deal with the different kinds of parallelism corresponding to each specific device, so that to be aware of the underlying platform. The present work is aimed to introduce a scalable model to manage a special class of dynamic data structure known as heap based priority queue (o
暂无评论