ISBN (print): 9781450362252
The proceedings contain 58 papers. The topics discussed include: beyond human-level accuracy: computational challenges in deep learning; throughput-oriented GPU memory allocation; SEP-graph: finding shortest execution paths for graph processing under a hybrid framework on GPU; incremental flattening for nested data parallelism; modular transactions: bounding mixed races in space and time; processing transactions in a predefined order; data-flow/dependence profiling for structured transformations; lightweight hardware transactional memory profiling; provably and practically efficient granularity control; semantics-aware scheduling policies for synchronization determinism; and a round-efficient distributed betweenness centrality algorithm.
ISBN (print): 9781450392044
The Problem-Based Benchmark Suite (PBBS) is a set of benchmark problems designed for comparing algorithms, implementations and platforms. For each problem, the suite defines the problem in terms of the input-output relationship, and supplies a set of input instances along with input generators, a default implementation, code for checking correctness or accuracy, and a timing harness. The suite makes it possible to compare different algorithms, platforms (e.g. CPU vs GPU), and implementations using different programming languages or libraries. The purpose is to better understand how well a wide variety of problems parallelize, and what techniques/algorithms are most effective. The suite was first announced in 2012 with 14 benchmark problems. Here we describe some significant updates. In particular, we have added nine new benchmarks from a mix of problems in text processing, computational geometry and machine learning. We have further optimized the default implementations; several are the fastest available for multicore CPUs, often achieving near-perfect speedup on the 72-core machine we test them on. The suite now also supplies significantly larger default test instances, as well as a broader variety, with many derived from real-world data.
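As a rough illustration of the generate/run/check/time pattern described in this abstract, the sketch below shows what a minimal harness of that kind could look like. The names generate_input, run_sort and check_sorted are invented for this sketch and are not PBBS's actual interfaces; in the real suite the input generator, checker and timed implementation are separate components.

    // Hypothetical sketch of a PBBS-style timing harness: run a candidate
    // implementation several times on a generated input, keep the best time,
    // and verify the output against a checker. All names are illustrative.
    #include <algorithm>
    #include <chrono>
    #include <iostream>
    #include <random>
    #include <vector>

    std::vector<long> generate_input(size_t n) {           // stand-in input generator
      std::mt19937_64 rng(42);
      std::vector<long> v(n);
      for (auto& x : v) x = static_cast<long>(rng());
      return v;
    }

    void run_sort(std::vector<long>& v) {                   // candidate implementation
      std::sort(v.begin(), v.end());
    }

    bool check_sorted(const std::vector<long>& v) {         // correctness checker
      return std::is_sorted(v.begin(), v.end());
    }

    int main() {
      const size_t n = 10'000'000;
      const int rounds = 3;
      double best = 1e30;
      bool ok = true;
      for (int r = 0; r < rounds; ++r) {
        auto input = generate_input(n);                     // fresh instance each round
        auto t0 = std::chrono::steady_clock::now();
        run_sort(input);
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
        ok = ok && check_sorted(input);
      }
      std::cout << "best time: " << best << " s, correct: " << ok << "\n";
    }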
ISBN (print): 9781450392044
Performance analysis is widely used to identify performance issues of parallel applications. However, complex communications and data dependences, as well as the interactions between different kinds of performance issues, make high-efficiency performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis. To implement each such analysis, significant human effort and domain knowledge are normally required. To reduce the burden of implementing accurate performance analysis, we propose a domain-specific programming framework named PERFLOW. PERFLOW abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance-analysis sub-tasks, called passes, which can either be provided by PERFLOW's built-in analysis library or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and then leverage various graph algorithms to automate the analysis. We demonstrate the efficacy of PERFLOW through three case studies of real-world applications with up to 700K lines of code. Results show that PERFLOW significantly eases the implementation of customized analysis tasks. In addition, PERFLOW is able to perform analysis and locate performance bugs automatically and effectively.
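A minimal sketch of the "analysis as a dataflow of passes" idea, assuming an invented ProfileNode/Pass/run_pipeline vocabulary rather than PERFLOW's actual API: each pass consumes a program-abstraction graph and emits a filtered or annotated graph for the next pass.

    // Illustrative sketch only, not PERFLOW's API: passes are composable
    // functions over a (here, trivially flat) program abstraction graph.
    #include <functional>
    #include <iostream>
    #include <string>
    #include <vector>

    struct ProfileNode {                  // one vertex of a program abstraction graph
      std::string name;                   // e.g. a function or call site
      double seconds;                     // measured cost attributed to this vertex
    };

    using Graph = std::vector<ProfileNode>;
    using Pass  = std::function<Graph(const Graph&)>;

    Graph run_pipeline(Graph g, const std::vector<Pass>& passes) {
      for (const auto& p : passes) g = p(g);   // dataflow: each pass feeds the next
      return g;
    }

    int main() {
      Graph g = {{"solver", 12.0}, {"io", 0.3}, {"comm", 7.5}};

      Pass hotspot_filter = [](const Graph& in) {      // keep vertices above a threshold
        Graph out;
        for (const auto& n : in) if (n.seconds > 1.0) out.push_back(n);
        return out;
      };
      Pass report = [](const Graph& in) {              // a terminal reporting pass
        for (const auto& n : in) std::cout << n.name << ": " << n.seconds << " s\n";
        return in;
      };

      run_pipeline(g, {hotspot_filter, report});
    }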
High-performance classical simulators for quantum circuits, in particular those based on the tensor network contraction algorithm, have become an important tool for the validation of noisy quantum computing. In order to address the me...
The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method plays an important role in solving large-scale and complex engineering problems. This method needs to handle numerous matrix-vector multiplic...
This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation str...
Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate con...
Maintaining a dynamic k-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for ...
ISBN (print): 9781450392044
Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental building blocks in sparse linear solvers, graph processing frameworks and machine learning applications. The existing parallel approaches for shared-memory SpGEMM mostly use the row-row style, which offers good parallelism. However, because of the irregularity of sparsity structures, the existing row-row methods often suffer from three problems: (1) load imbalance, (2) high global space complexity and unsatisfactory data locality, and (3) difficulty in selecting a suitable sparse accumulator. In this paper we propose a tiled parallel SpGEMM algorithm named TileSpGEMM. Our algorithm sparsifies the tiled method used in dense general matrix-matrix multiplication (GEMM), and stores each non-empty tile in a sparse form. Its first advantage is that the basic working unit is a fixed-size sparse tile containing a small number of nonzeros, rather than a possibly very long row; thus the load-imbalance issue is naturally alleviated. Secondly, the temporary space needed for each tile is small and always fits in on-chip scratchpad memory, so there is no need to allocate off-chip space for a large number of intermediate products, and data locality is much better. Thirdly, because the computations are restricted to a single tile, it is easier to select a fast sparse accumulator for each tile. Our experimental results on two of the latest NVIDIA GPUs show that TileSpGEMM outperforms four state-of-the-art SpGEMM methods, cuSPARSE, bhSPARSE, NSPARSE and spECK, on 139, 138, 127 and 94, respectively, of the 142 square matrices requiring no less than one billion flops for an SpGEMM operation, and delivers up to 2.78x, 145.35x, 97.86x and 3.70x speedups.
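The sketch below illustrates the per-tile principle described in this abstract rather than the actual TileSpGEMM GPU kernel: because each working unit is a fixed-size T x T sparse tile, the intermediate products for one output tile fit in a small dense scratchpad that plays the role of the sparse accumulator, and the result is then compressed back to sparse form. The Entry/Tile types and the tile size are assumptions made for illustration.

    // Illustrative sketch (not the TileSpGEMM kernel): multiply two sparse
    // T x T tiles, accumulating into a dense scratchpad small enough for
    // on-chip memory, then compress the result back to a sparse tile.
    #include <array>
    #include <iostream>
    #include <vector>

    constexpr int T = 16;                         // fixed tile dimension (assumed)

    struct Entry { int r, c; double v; };         // one nonzero inside a tile
    using Tile = std::vector<Entry>;              // sparse storage of a T x T tile

    Tile multiply_tiles(const Tile& A, const Tile& B) {
      std::array<double, T * T> acc{};            // dense per-tile accumulator
      for (const Entry& a : A)
        for (const Entry& b : B)
          if (a.c == b.r)                         // inner dimensions must match
            acc[a.r * T + b.c] += a.v * b.v;

      Tile C;                                     // compress back to sparse form
      for (int r = 0; r < T; ++r)
        for (int c = 0; c < T; ++c)
          if (acc[r * T + c] != 0.0) C.push_back({r, c, acc[r * T + c]});
      return C;
    }

    int main() {
      Tile A = {{0, 1, 2.0}, {3, 5, 1.5}};
      Tile B = {{1, 4, 3.0}, {5, 0, 4.0}};
      for (const Entry& e : multiply_tiles(A, B))
        std::cout << "C(" << e.r << "," << e.c << ") = " << e.v << "\n";
    }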
ISBN (print): 9781450392044
Designing and implementing efficient parallel priority schedulers is an active research area. An intriguing proposed design is the Multi-Queue: given n threads and m >= n distinct priority queues, task insertions are performed uniformly at random, while, to delete, a thread picks two queues uniformly at random and removes the observed task of higher priority. This approach scales well and has probabilistic rank guarantees: roughly, the rank of each task removed, relative to the remaining tasks in all other queues, is O(m) in expectation. Yet, the performance of this pattern is below that of well-engineered schedulers, which eschew theoretical guarantees for practical efficiency. We investigate whether it is possible to design and implement a Multi-Queue-based task scheduler that is both highly efficient and has analytical guarantees. We propose a new variant called the Stealing Multi-Queue (SMQ), a cache-efficient variant of the Multi-Queue, which leverages both queue affinity (each thread has a local queue from which tasks are usually removed, but, with some probability, it also attempts to steal higher-priority tasks from the other queues) and task batching, that is, the processing of several tasks in a single insert/remove step. These ideas are well known for task scheduling without priorities; our theoretical contribution is showing that, despite the relaxations, this design can still provide rank guarantees, which in turn imply bounds on the total work performed. We provide a general SMQ implementation which can surpass state-of-the-art schedulers such as OBIM and PMOD in terms of performance on popular graph-processing benchmarks. Notably, the performance improvement comes mainly from the superior rank guarantees provided by our scheduler, confirming that analytically-reasoned approaches can still provide performance improvements for priority task scheduling. The full version of this paper is available in [24].
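The basic Multi-Queue delete rule described above (insert into a uniformly random queue; to delete, sample two queues and pop the higher-priority top) is simple enough to sketch sequentially. The code below is such an illustrative sketch, with an added fallback scan when both sampled queues happen to be empty; it is not the concurrent SMQ implementation from the paper.

    // Sequential sketch of the basic Multi-Queue rule, for illustration only.
    #include <iostream>
    #include <optional>
    #include <queue>
    #include <random>
    #include <utility>
    #include <vector>

    struct MultiQueue {
      using PQ = std::priority_queue<int, std::vector<int>, std::greater<int>>; // min-heap
      std::vector<PQ> queues;
      std::mt19937 rng{123};

      explicit MultiQueue(size_t m) : queues(m) {}

      void insert(int task) {                                 // uniformly random queue
        std::uniform_int_distribution<size_t> pick(0, queues.size() - 1);
        queues[pick(rng)].push(task);
      }

      std::optional<int> delete_min() {
        std::uniform_int_distribution<size_t> pick(0, queues.size() - 1);
        size_t i = pick(rng), j = pick(rng);                  // inspect two random queues
        if (queues[i].empty() || (!queues[j].empty() && queues[j].top() < queues[i].top()))
          std::swap(i, j);                                    // i now holds the better candidate
        if (queues[i].empty()) {                              // fallback: scan for any remaining task
          for (size_t k = 0; k < queues.size(); ++k)
            if (!queues[k].empty()) { i = k; break; }
          if (queues[i].empty()) return std::nullopt;         // all queues are empty
        }
        int task = queues[i].top();
        queues[i].pop();
        return task;
      }
    };

    int main() {
      MultiQueue mq(4);
      for (int t : {7, 3, 9, 1, 5}) mq.insert(t);
      while (auto t = mq.delete_min()) std::cout << *t << ' '; // relaxed (approximate) priority order
      std::cout << '\n';
    }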