Search Results - Inner Mongolia University Library

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Cruz, Flavio; Rocha, Ricardo; Goldstein, Seth Copen

Affiliations: Univ Porto, CRACS, Rua Campo Alegre 1021, P-4169007 Oporto, Portugal; Univ Porto, INESC TEC, Rua Campo Alegre 1021, P-4169007 Oporto, Portugal; Univ Porto, Fac Sci, Rua Campo Alegre 1021, P-4169007 Oporto, Portugal; Carnegie Mellon Univ, Pittsburgh, PA 15213, USA

ISBN: (print) 9781450340922

Declarative programming has been hailed as a promising approach to parallel programming since it makes it easier to reason about programs while hiding the implementation details of parallelism from the programmer. However, its advantage is also its disadvantage, as it leaves the programmer with no straightforward way to optimize programs for performance. In this paper, we introduce Coordinated Linear Meld (CLM), a concurrent forward-chaining linear logic programming language with a declarative way to coordinate the execution of parallel programs, allowing the programmer to specify arbitrary scheduling and data partitioning policies. Our approach allows the programmer to write graph-based declarative programs and then, optionally, to use coordination to fine-tune parallel performance. We specify the set of coordination facts, discuss their implementation in a parallel virtual machine, and show, through examples, how they can be used to optimize parallel execution. We compare the performance of CLM programs against the original uncoordinated Linear Meld and several other frameworks.

Keywords: Design; Languages; Performance; parallel programming; Linear Logic

High Performance Model Based Image Reconstruction

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Wang, Xiao; Sabne, Amit; Kisner, Sherman; Raghunathan, Anand; Bouman, Charles; Midkiff, Samuel

Affiliations: Purdue Univ, Sch Elect & Comp Engn, W Lafayette, IN 47907, USA; High Performance Imaging LLC, W Lafayette, IN, USA

ISBN: (print) 9781450340922

Computed Tomography (CT) Image Reconstruction is an important technique used in a wide range of applications, from explosive detection and medical imaging to scientific imaging. Among available reconstruction methods, Model Based Iterative Reconstruction (MBIR) produces higher quality images and allows for the use of more general CT scanner geometries than is possible with more commonly used methods. The high computational cost of MBIR, however, often makes it impractical in applications for which it would otherwise be ideal. This paper describes a new MBIR implementation that significantly reduces the computational cost of MBIR while retaining its benefits. It describes a novel organization of the scanner data into super-voxels (SV) that, combined with a super-voxel buffer (SVB), dramatically increases locality and prefetching, enables parallelism across SVs, and leads to an average speedup of 187x on 20 cores.

Keywords: Applications; Algorithms; Multicore; parallel algorithm; CT image reconstruction; MBIR

NUMA-aware Scheduling and Memory Allocation for Data-flow Task-parallel Applications

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Drebes, Andi; Pop, Antoniu; Heydemann, Karine; Drach, Nathalie; Cohen, Albert

Affiliations: Univ Manchester, Sch Comp Sci, Manchester M13 9PL, Lancs, England; UPMC Paris 06, Sorbonne Univ, CNRS, LIP6, UMR 7606, Paris, France; Inria, Ecole Normale Super, Rocquencourt, France

ISBN: (print) 9781450340922

Dynamic task parallelism is a popular programming model on shared-memory systems. Compared to data parallel loop-based concurrency, it promises enhanced scalability, load balancing and locality. These promises, however, are undermined by non-uniform memory access (NUMA) systems. We show that it is possible to preserve the uniform hardware abstraction of contemporary task-parallel programming models, for both computing and memory resources, while achieving near-optimal data locality. Our run-time algorithms for NUMA-aware task and data placement are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences and reuse. This information is readily available in the run-time systems of modern task-parallel programming frameworks, and from the operating system regarding the placement of previously allocated memory. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability through the elimination of false dependences and enables fine-grained dynamic control over the placement of application data. We demonstrate that the benefits of dynamically managing data placement outweigh the privatization cost, even when comparing with target-specific optimizations through static, NUMA-aware data interleaving. Our implementation and the experimental evaluation on a set of high-performance benchmarks executing on a 192-core system with 24 NUMA nodes show that the fraction of local memory accesses can be increased to more than 99%, resulting in a speedup of up to 5x compared to a NUMA-aware hierarchical work-stealing baseline.

Keywords: Scalability
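
The abstract above says that placement decisions use information about inter-task data dependences. A minimal sketch of that idea, not the authors' algorithm: schedule a task on the NUMA node that already holds most of its input bytes, then report the resulting fraction of local bytes. The names `pick_node`, `local_fraction`, `input_buffers` and `buffer_node` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict


def pick_node(input_buffers, buffer_node):
    """Return the NUMA node holding the most input bytes for a task.

    input_buffers: list of (buffer_id, size_in_bytes) read by the task.
    buffer_node:   dict mapping buffer_id -> node where the buffer resides.
    Ties are broken in favour of the lowest node id.
    """
    bytes_per_node = defaultdict(int)
    for buf, size in input_buffers:
        bytes_per_node[buffer_node[buf]] += size
    # sorted() makes the tie-break deterministic: max() keeps the first maximum.
    return max(sorted(bytes_per_node), key=bytes_per_node.__getitem__)


def local_fraction(input_buffers, buffer_node, node):
    """Fraction of the task's input bytes that are local to `node`."""
    total = sum(size for _, size in input_buffers)
    local = sum(size for buf, size in input_buffers if buffer_node[buf] == node)
    return local / total if total else 1.0


if __name__ == "__main__":
    inputs = [("a", 100), ("b", 300), ("c", 250)]
    where = {"a": 0, "b": 1, "c": 0}
    node = pick_node(inputs, where)
    print(node, local_fraction(inputs, where, node))
```

A real run-time system would also have to weigh load balance and the cost of moving data, which this sketch ignores.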

AUTOGEN: Automatic Discovery of Cache-Oblivious Parallel Recursive Algorithms for Solving Dynamic Programs

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Chowdhury, Rezaul; Ganapathi, Pramod; Tithi, Jesmin Jahan; Bachmeier, Charles; Kuszmaul, Bradley C.; Leiserson, Charles E.; Solar-Lezama, Armando; Tang, Yuan

Affiliations: SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794, USA; MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139, USA; Fudan Univ, Shanghai Key Lab Intelligent Informat Proc, Sch Software, Shanghai, Peoples R China

ISBN: (print) 9781450340922

We present AUTOGEN, an algorithm that, for a wide class of dynamic programming (DP) problems, automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size, and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. These algorithms are also less sensitive to fluctuations of memory and bandwidth than their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.

Keywords: AutoGen; automatic discovery; dynamic programming; recursive; divide-and-conquer; cache-efficient; parallel; cache-oblivious; energy-efficient; cache-adaptive
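
To make the contrast between an iterative DP description and a recursive divide-and-conquer algorithm concrete, here is a small sketch using Floyd-Warshall all-pairs shortest paths. It is not output of AUTOGEN and the abstract does not name this problem; the recursive version follows the well-known hand-derived quadrant recursion from the cache-oblivious literature, is run sequentially, and assumes the table size is a power of two and edge weights are non-negative. All function names are hypothetical.

```python
import random

INF = float("inf")

# Order of the eight recursive calls as (A-quadrant, B-quadrant, C-quadrant).
# Each call relaxes A[x][y] with B[x][z] + C[z][y]: a forward pass over the
# quadrants with z = 0, then a backward pass with z = 1.
ORDER = [
    ((0, 0), (0, 0), (0, 0)),
    ((0, 1), (0, 0), (0, 1)),
    ((1, 0), (1, 0), (0, 0)),
    ((1, 1), (1, 0), (0, 1)),
    ((1, 1), (1, 1), (1, 1)),
    ((1, 0), (1, 1), (1, 0)),
    ((0, 1), (0, 1), (1, 1)),
    ((0, 0), (0, 1), (1, 0)),
]


def fw_iterative(dist):
    """Classic triple-loop Floyd-Warshall on a copy of `dist`."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def _fw_rec(d, a, b, c, n):
    """Relax the n x n block at offset `a` using the blocks at `b` and `c`."""
    if n == 1:
        cand = d[b[0]][b[1]] + d[c[0]][c[1]]
        if cand < d[a[0]][a[1]]:
            d[a[0]][a[1]] = cand
        return
    h = n // 2
    for qa, qb, qc in ORDER:
        _fw_rec(
            d,
            (a[0] + qa[0] * h, a[1] + qa[1] * h),
            (b[0] + qb[0] * h, b[1] + qb[1] * h),
            (c[0] + qc[0] * h, c[1] + qc[1] * h),
            h,
        )


def fw_recursive(dist):
    """Recursive divide-and-conquer Floyd-Warshall (size must be a power of 2)."""
    n = len(dist)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two size"
    d = [row[:] for row in dist]
    _fw_rec(d, (0, 0), (0, 0), (0, 0), n)
    return d


def random_graph(n, seed=0):
    """Random directed graph with integer weights 1..9 and zero diagonal."""
    rng = random.Random(seed)
    return [
        [0 if i == j else (rng.randint(1, 9) if rng.random() < 0.4 else INF)
         for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    g = random_graph(8)
    print(fw_recursive(g) == fw_iterative(g))
```

Both versions perform the same relaxations; the recursive one only reorders them so that each call works on a small sub-block of the table, which is where the cache-oblivious locality comes from.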

Multi-Core On-The-Fly SCC Decomposition

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Bloemen, Vincent; Laarman, Alfons; van de Pol, Jaco

Affiliations: Univ Twente, Formal Methods & Tools, POB 217, NL-7500 AE Enschede, Netherlands; Vienna Univ Technol, FORSYTE, Vienna, Austria

ISBN: (print) 9781450340922

The main advantages of Tarjan's strongly connected component (SCC) algorithm are its linear time complexity and ability to return SCCs on-the-fly, while traversing or even generating the graph. Until now, most parallel SCC algorithms sacrifice both: they run in quadratic worst-case time and/or require the full graph in advance. The current paper presents a novel parallel, on-the-fly SCC algorithm. It preserves the linear-time property by letting workers explore the graph randomly while carefully communicating partially completed SCCs. We prove that this strategy is correct. For efficiently communicating partial SCCs, we develop a concurrent, iterable disjoint set structure (combining the union-find data structure with a cyclic list). We demonstrate scalability on a 64-core machine using 75 real-world graphs (from model checking and explicit data graphs), synthetic graphs (combinations of trees, cycles and linear graphs), and random graphs. Previous work did not show speedups for graphs containing a large SCC. We observe that our parallel algorithm is typically 10-30x faster compared to Tarjan's algorithm for graphs containing a large SCC. Comparable performance (with respect to the current state-of-the-art) is obtained for graphs containing many small SCCs.

Keywords: strongly connected components; SCC; algorithm; graph; digraph; parallel; multi-core; union-find; depth-first search
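
The abstract above mentions an iterable disjoint-set structure that combines union-find with a cyclic list. The following is a sequential sketch of that combination only; the paper's structure is concurrent, and none of its synchronization is modelled here. The class and method names are hypothetical.

```python
class IterableDisjointSet:
    """Union-find whose sets can also be enumerated via a cyclic list."""

    def __init__(self, n):
        self.parent = list(range(n))
        # Every element starts as a one-element cycle pointing at itself.
        self.next = list(range(n))

    def find(self, x):
        """Return the representative of x's set (with path halving)."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the sets of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[rb] = ra
        # Swapping the successors of one node from each cycle splices the
        # two cyclic lists into a single cycle.
        self.next[ra], self.next[rb] = self.next[rb], self.next[ra]
        return True

    def members(self, x):
        """List all elements of x's set by walking the cycle once."""
        out = [x]
        y = self.next[x]
        while y != x:
            out.append(y)
            y = self.next[y]
        return out


if __name__ == "__main__":
    ds = IterableDisjointSet(6)
    ds.union(0, 1)
    ds.union(2, 3)
    ds.union(1, 3)
    print(sorted(ds.members(0)))
```

In an SCC setting, the elements would be graph states and a set would be a partially completed SCC, so walking the cycle enumerates the states that make up that partial SCC.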

A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Kannan, Ramakrishnan; Ballard, Grey; Park, Haesun

Affiliations: Georgia Tech, Atlanta, GA 30332, USA; Sandia Natl Labs, Livermore, CA 94550, USA

ISBN: (print) 9781450340922

Non-negative matrix factorization (NMF) is the problem of determining two non-negative low rank factors W and H, for the given input matrix A, such that A ≈ WH. NMF is a useful tool for many applications in different domains such as topic modeling in text mining, background separation in video analysis, and community detection in social networks. Despite its popularity in the data mining community, there is a lack of efficient distributed algorithms to solve the problem for big data sets. We propose a high-performance distributed-memory parallel algorithm that computes the factorization by iteratively solving alternating non-negative least squares (NLS) subproblems for W and H. It maintains the data and factor matrices in memory (distributed across processors), uses MPI for interprocessor communication, and, in the dense case, provably minimizes communication costs (under mild assumptions). As opposed to previous implementations, our algorithm is also flexible: (1) it performs well for both dense and sparse matrices, and (2) it allows the user to choose any one of the multiple algorithms for solving the updates to low rank factors W and H within the alternating iterations. We demonstrate the scalability of our algorithm and compare it with baseline implementations, showing significant performance improvements.

Keywords: Non-negative matrix factorization
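
The alternating scheme described in the abstract above can be written out as follows. The dimensions m, n, k and the choice of the Frobenius norm are assumptions made for this illustration; the abstract itself only states A ≈ WH and alternating non-negative least squares subproblems.

```latex
% NMF objective (assumed Frobenius-norm form), with k << min(m, n):
\[
  \min_{W \ge 0,\; H \ge 0} \; \lVert A - WH \rVert_F^2 ,
  \qquad
  A \in \mathbb{R}_{\ge 0}^{m \times n},\quad
  W \in \mathbb{R}_{\ge 0}^{m \times k},\quad
  H \in \mathbb{R}_{\ge 0}^{k \times n}.
\]
% One alternating iteration: fix one factor, solve an NLS problem for the other.
\[
  H \leftarrow \arg\min_{H \ge 0} \lVert A - WH \rVert_F^2 ,
  \qquad
  W \leftarrow \arg\min_{W \ge 0} \lVert A - WH \rVert_F^2 .
\]
```

Each subproblem is convex in the factor being updated, which is why different update algorithms can be plugged into the same alternating loop.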

Scalable adaptive NUMA-aware Lock combining local locking and remote locking for efficient concurrency

21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016

Authors: Zhang, Mingzhe; Lau, Francis C.M.; Wang, Cho-Li; Cheng, Luwei; Chen, Haibo

Affiliations: Dept. Computer Science, University of Hong Kong, Hong Kong; Facebook, United States; Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University, China

ISBN: (print) 9781450340922

Scalable locking is a key building block for scalable multi-threaded software. Its performance is especially critical in multi-socket, multi-core machines with non-uniform memory access (NUMA). Previous schemes such as local locking and remote locking only perform well under a certain level of contention, and often require non-trivial tuning for a particular configuration. Moreover, on large NUMA systems, current distance-first NUMA policies perform unsatisfactorily because the nomination of lock servers is unmanaged. In this work, we propose SANL, a locking scheme that can deliver high performance under various contention levels by adaptively switching between the local and the remote lock scheme. Furthermore, we introduce a new NUMA policy for the remote lock that jointly considers node distances and server utilization when choosing lock servers. A comparison with seven representative locking schemes shows that SANL outperforms the others in most contention situations. In one group test, SANL is 3.7 times faster than RCL lock and 17 times faster than POSIX mutex. © 2016 ACM.

Keywords: Locks (fasteners)

Assessing the performance portability of modern parallel programming models using TeaLeaf

Concurrency and Computation: Practice & Experience, 2017, vol. 29, no. 15, pp. 1-15

Authors: Martineau, Matthew; McIntosh-Smith, Simon; Gaudin, Wayne

Affiliations: Univ Bristol, HPC Grp, Bristol, Avon, England; UK Atom Weap Estab, AWE, Aldermaston, England

In this work, we evaluate several emerging parallel programming models (Kokkos, RAJA, OpenACC, and OpenMP 4.0) against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini app, that solves the heat conduction equation and belongs to the Mantevo Project. We find that the best performance is achieved with architecture-specific implementations but that, in many cases, the performance portable models are able to solve the same problems to within a 5% to 30% performance penalty. While the models expose varying levels of complexity to the developer, they all achieve reasonable performance with this application. As such, if this small performance penalty is permissible for a problem domain, we believe that productivity and development complexity can be considered the major differentiators when choosing a modern parallel programming model to develop applications like TeaLeaf.

Keywords: Kokkos; OpenMP 4.0; performance portability; programming models; RAJA

Enabling semantics to improve detection of data races and misuses of lock-free data structures

Concurrency and Computation: Practice & Experience, 2017, vol. 29, no. 15

Authors: Dolz, Manuel F.; Astorga, David Del Rio; Fernandez, Javier; Torquati, Massimo; Garcia, Jose Daniel; Garcia-Carballeira, Felix; Danelutto, Marco

Affiliations: Univ Carlos III Madrid, Dept Comp Sci, Madrid 28911, Spain; Univ Pisa, Dept Comp Sci, I-56127 Pisa, Italy

The rapid progress of multi/many-core architectures has left many data-intensive parallel applications not yet fully optimized to deliver the best performance. With the advent of concurrent programming, frameworks offering structured patterns have alleviated the developers' burden of adapting such applications to multithreaded architectures. While some of these patterns are implemented using synchronization primitives, others avoid them by means of lock-free data mechanisms. However, lock-free programming is not straightforward: ensuring an appropriate use of their interfaces can be challenging, since different memory models plus instruction reordering at compiler/processor levels can interfere in the occurrence of data races. The benefits of race detectors are formidable in this sense; however, they may emit false positives if they are unaware of the underlying lock-free structure semantics. To mitigate this issue, this paper extends ThreadSanitizer, a race detection tool, with the semantics of 2 lock-free data structures: the single-producer/single-consumer and the multiple-producer/multiple-consumer queues. With it, we are able to drop false positives and detect potential semantic violations. The experimental evaluation, using different queue implementations on a set of benchmarks and real applications, demonstrates that it is possible to reduce the number of data race warnings by 60% on average and to detect wrong uses of these structures.

Keywords: data race detectors; parallel programming; semantics; wait-/lock-free data structures
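
One kind of semantic violation mentioned in the abstract above is misuse of a single-producer/single-consumer (SPSC) queue. The sketch below illustrates the rule being checked, namely that only one thread may ever push and only one may ever pop. It is not the ThreadSanitizer extension and does not model memory ordering; the class `CheckedSPSCQueue` and the exception `SPSCMisuse` are hypothetical names, and the ownership check itself is not thread-safe.

```python
import threading


class SPSCMisuse(Exception):
    """Raised when a second thread takes the producer or consumer role."""


class CheckedSPSCQueue:
    """Bounded ring buffer that records which thread owns each role."""

    def __init__(self, capacity):
        # One slot stays empty so that full and empty states differ.
        self._buf = [None] * (capacity + 1)
        self._head = 0  # next slot to pop
        self._tail = 0  # next slot to push
        self._owner = {"producer": None, "consumer": None}

    def _check(self, role):
        me = threading.get_ident()
        if self._owner[role] is None:
            self._owner[role] = me  # first caller claims the role
        elif self._owner[role] != me:
            raise SPSCMisuse(f"second {role} thread detected")

    def push(self, item):
        """Enqueue item; return False if the queue is full."""
        self._check("producer")
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:
            return False
        self._buf[self._tail] = item
        self._tail = nxt
        return True

    def pop(self):
        """Dequeue and return an item, or None if the queue is empty."""
        self._check("consumer")
        if self._head == self._tail:
            return None
        item = self._buf[self._head]
        self._head = (self._head + 1) % len(self._buf)
        return item


if __name__ == "__main__":
    q = CheckedSPSCQueue(2)
    q.push("a")

    def rogue_producer():
        try:
            q.push("b")
        except SPSCMisuse as exc:
            print("violation:", exc)

    t = threading.Thread(target=rogue_producer)
    t.start()
    t.join()
```

A race detector that knows this rule can stay silent about accesses that respect it and warn only when a second producer or consumer appears.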

Guided installation of basic linear algebra routines in a cluster with manycore components

Concurrency and Computation: Practice & Experience, 2017, vol. 29, no. 15, pp. 1-14

Authors: Cuenca, J.; Garcia, L. P.; Gimenez, D.; Herrera, F. J.

Affiliations: Univ Murcia, Dept Engn & Technol Comp, Murcia, Spain; Tech Univ Cartagena, Serv Support Technol Res, Murcia, Spain; Univ Murcia, Dept Comp & Syst, Murcia, Spain

Computational systems are nowadays composed of basic computational components that share multiprocessors and coprocessors of different types, typically several graphics processing units (GPUs) or many integrated cores (MICs), and those computational components are combined in heterogeneous clusters of nodes with different characteristics, including coprocessors of different types, with varying numbers of nodes at different speeds. The software previously developed and optimized for simpler systems needs to be redesigned and reoptimized for these new, more complex systems. The adaptation to hybrid multicore+multiGPU and multicore+multiMIC of autotuning techniques for basic linear algebra routines is analyzed. The matrix-matrix multiplication kernel, which is optimized for different computational system components through guided experimentation, is studied. The routine is installed for each node in the cluster, and the information generated from individual installations may be used for a hierarchical installation in a cluster. The basic matrix-matrix multiplication may, in turn, be used inside higher level routines, which delegate their efficient execution to the optimization of the lower level routine. Experimental results are satisfactory in different multicore+multiGPU and multicore+multiMIC systems. The guided search of execution configurations for satisfactory execution times thus proves to be a useful tool for heterogeneous systems, where the complexity of the system makes it difficult to use highly efficient routines and libraries correctly.

Keywords: autotuning; heterogeneous computing; hybrid programming; parallel linear algebra; manycore
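
The installation-time experimentation described in the abstract above can be illustrated with a toy autotuner: time a matrix-matrix multiplication under a few candidate configurations and keep the fastest. This is a sketch under strong assumptions, not the authors' method. The only tuned parameter here is a tile size on a single CPU, whereas the paper targets multicore+multiGPU and multicore+multiMIC configurations, and the names `blocked_matmul` and `guided_install` are hypothetical.

```python
import time


def blocked_matmul(a, b, tile):
    """Multiply square matrices (lists of lists) using tile x tile blocks."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def guided_install(n, candidates, repeats=3):
    """Time each candidate tile size; return (best_tile, timings).

    The minimum over `repeats` runs is kept to reduce timing noise.
    """
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            blocked_matmul(a, b, tile)
            best = min(best, time.perf_counter() - start)
        timings[tile] = best
    return min(timings, key=timings.get), timings


if __name__ == "__main__":
    best_tile, timings = guided_install(n=48, candidates=[4, 8, 16])
    print("selected tile size:", best_tile)
```

The selected value would then be stored at installation time and reused by higher-level routines that call the multiplication, which mirrors the hierarchy the abstract describes.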
