Search Results - Inner Mongolia University Library

IEEE International Conference on Cluster Computing

Authors: Javanmard, Mohammad Mahdi; Ahmad, Zafar; Zola, Jaroslaw; Pouchet, Louis-Noel; Chowdhury, Rezaul; Harrison, Robert. SUNY Stony Brook, Stony Brook, NY 11794, USA; Univ Buffalo, Buffalo, NY, USA; Colorado State Univ, Ft Collins, CO 80523, USA

ISBN: (print) 9781728166773

One of the most important properties of distributed computing systems (e.g., Apache Spark, Apache Hadoop, etc.) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster. This important feature can lead to performance gain provided the computation (or the algorithm) itself can scale out. In other words, the computation (or the algorithm) should be easily decomposable into smaller units of work to be distributed among the workers based on the hardware/software configuration of the cluster or the cloud. Additionally, on such clusters, there is an important trade-off between communication cost, parallelism, and memory requirement. Due to the scalability need as well as this trade-off, it is crucial to have a well-decomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the trade-off spectrum to execute the program efficiently on a specific cluster. We design and implement well-decomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm (GEP), such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multi-way recursive divide-&-conquer algorithms. We explain how to map implementations of those grid-based parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a 2-5x speedup of the DP benchmarks.

Keywords: Dynamic Programming; Recursive Divide-&-Conquer; distributed-memory computing; Apache Spark; I/O-efficiency; Communication-efficiency; Polyhedral Compilation
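As a rough illustration of what "well-decomposable and tunable" can mean for Floyd-Warshall, the sketch below splits the distance matrix into square tiles and applies one GEP-style update kernel to each tile. It is an iterative tiled variant in plain Python, not the authors' recursive multi-way Spark/OpenMP implementation. The names `update_tile`, `blocked_floyd_warshall`, the `tile` parameter, and the sample `GRAPH` are assumptions made for this example. The tile updates within phases 2 and 3 are independent of one another, which is what would let them be handed to separate workers.

```python
import math

INF = math.inf

def floyd_warshall(dist):
    """Reference triple-loop Floyd-Warshall; returns a new matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def update_tile(d, rows, cols, ks):
    """Relax every entry (i, j) of one tile through each pivot k in ks, in place."""
    for k in ks:
        for i in rows:
            for j in cols:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

def blocked_floyd_warshall(dist, tile):
    """Tiled Floyd-Warshall; `tile` is the tunable tile width (hypothetical knob)."""
    n = len(dist)
    d = [row[:] for row in dist]
    blocks = [range(s, min(s + tile, n)) for s in range(0, n, tile)]
    for p, kb in enumerate(blocks):
        # Phase 1: the diagonal tile depends only on itself.
        update_tile(d, kb, kb, kb)
        # Phase 2: tiles sharing a block row or block column with the pivot tile.
        for q, b in enumerate(blocks):
            if q != p:
                update_tile(d, kb, b, kb)
                update_tile(d, b, kb, kb)
        # Phase 3: all remaining tiles; each update is independent of the others.
        for qi, ib in enumerate(blocks):
            for qj, jb in enumerate(blocks):
                if qi != p and qj != p:
                    update_tile(d, ib, jb, kb)
    return d

# Small sample graph (assumed data): GRAPH[i][j] is the weight of edge i -> j.
GRAPH = [
    [0, 3, INF, 7, INF],
    [8, 0, 2, INF, INF],
    [5, INF, 0, 1, INF],
    [2, INF, INF, 0, 4],
    [INF, 1, INF, INF, 0],
]
```

Calling `blocked_floyd_warshall(GRAPH, tile)` with different `tile` values should give the same distances as `floyd_warshall(GRAPH)`. Only the granularity of the work units changes, which is the kind of knob the abstract describes tuning against communication cost, parallelism, and memory.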

CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics

33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Hoang, Loc; Dathathri, Roshan; Gill, Gurbinder; Pingali, Keshav. Univ Texas Austin, Dept Comp Sci, Austin, TX 78712, USA

ISBN: (print) 9781728112466

Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. Distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundred gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and generates high-quality graph partitions fast. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6x faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.

Keywords: Graph analytics; graph partitioning; streaming partitioners; distributed-memory computing
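The idea of a streaming edge partitioner with a user-supplied policy can be sketched in a few lines. This is only an illustration of the concept, not CuSP's actual interface, which this record does not show. The names `stream_partition`, `source_policy`, and `grid_policy` are hypothetical. Each edge is read once and assigned to a partition by whatever policy function the caller passes in.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Edge = Tuple[int, int]
# A policy maps (source, destination, number of partitions) to a partition id.
Policy = Callable[[int, int, int], int]

def source_policy(src: int, dst: int, num_parts: int) -> int:
    """Place an edge with its source vertex, so a vertex's out-edges stay together."""
    return src % num_parts

def grid_policy(src: int, dst: int, num_parts: int) -> int:
    """2D policy: lay partitions out on a square grid indexed by (source, destination)."""
    side = math.isqrt(num_parts)
    if side * side != num_parts:
        raise ValueError("grid_policy assumes num_parts is a perfect square")
    return (src % side) * side + (dst % side)

def stream_partition(edges: Iterable[Edge], num_parts: int,
                     policy: Policy) -> Dict[int, List[Edge]]:
    """Assign edges to partitions in a single pass over the edge stream."""
    parts: Dict[int, List[Edge]] = defaultdict(list)
    for src, dst in edges:
        parts[policy(src, dst, num_parts)].append((src, dst))
    return dict(parts)

# Tiny sample edge list (assumed data).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (4, 1)]
by_source = stream_partition(EDGES, 2, source_policy)
by_grid = stream_partition(EDGES, 4, grid_policy)
```

Swapping `source_policy` for `grid_policy` changes the partitioning without touching the streaming loop. The abstract attributes this kind of separation between policy and mechanism to CuSP.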

Code Generation and Optimization of Distributed-Memory Dense Linear Algebra Kernels

Procedia Computer Science, 2013, vol. 18, pp. 1282-1291

Authors: Bryan Marker; Don Batory; Robert van de Geijn. Department of Computer Science, The University of Texas at Austin

Design by Transformation (DxT) is an approach to software development that encodes domain-specific programs as graphs and expert design knowledge as graph transformations. The goal of DxT is to mechanize the generation of highly-optimized code. This paper demonstrates how DxT can be used to transform sequential specifications of an important set of Dense Linear Algebra (DLA) kernels, the level-3 Basic Linear Algebra Subprograms (BLAS3), into high-performing library routines targeting distributed-memory (cluster) architectures. Getting good BLAS3 performance for such platforms requires deep domain knowledge, so their implementations are manually coded by experts. Unfortunately, there are few such experts and developing the full variety of BLAS3 implementations takes a lot of repetitive effort. A prototype tool, DxTer, automates this tedious task. We explain how we build on previous work to represent loops and multiple loop-based algorithms in DxTer. Performance results on a BlueGene/P parallel supercomputer show that the generated code meets or beats implementations that are hand-coded by a human expert and outperforms the widely used ScaLAPACK library.

Keywords: program generation; dense linear algebra; high-performance software; distributed-memory computing

DOBSON: a Pentium-based SMP Linux PC Beowulf for distributed-memory high resolution environment modelling

Environmental Modelling & Software, 2005, vol. 20, no. 10, pp. 1299-1306

Authors: Wang, KY; Shallcross, DE; Hall, SM; Lo, YH; Chou, C; Chen, D. Natl Cent Univ, Dept Atmospher Sci, Chungli 32054, Taiwan; Univ Bristol, Sch Chem, Ctr Biogeochem, Bristol BS8 1TS, Avon, England; Univ Cambridge, Ctr Atmospher Sci, Cambridge CB2 1EW, England

With the increasing computational speed of PC processors, falling hardware prices, and the increasing accessibility of open source software, PC clusters have become an attractive option for exploring budgetary high performance computation on high resolution environment modelling. This paper documents the implementation and operation of a Beowulf-class PC cluster called DOBSON built in the Atmospheric Modelling Laboratory at National Central University in Taiwan. Based on our current configuration, the DOBSON cluster achieves speedups of 4.4 and 8.7 times with 8 and 14 CPUs, respectively, compared with 1 CPU in simulations of the Antarctic vortex. The potential impacts of running a high resolution environmental modelling system on the DOBSON cluster were examined by case simulations of Typhoon Haiyan with a nested five-domain configuration. These results show that the complicated topographical flow in the Taipei Basin can only be properly resolved when model grid resolutions less than 9 km were used. (C) 2005 Elsevier Ltd. All rights reserved.

Keywords: Beowulf; cluster computing; air pollution modelling; distributed-memory computing
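The speedup figures in the abstract can be turned into parallel efficiency (speedup divided by CPU count). This assumes the reading that 4.4x corresponds to 8 CPUs and 8.7x to 14 CPUs. The helper name `parallel_efficiency` is made up for this sketch.

```python
def parallel_efficiency(speedup: float, cpus: int) -> float:
    """Fraction of ideal linear speedup achieved: speedup / number of CPUs."""
    return speedup / cpus

# Speedups as reported in the abstract, keyed by CPU count (pairing is an assumption).
REPORTED = {8: 4.4, 14: 8.7}
EFFICIENCY = {cpus: parallel_efficiency(s, cpus) for cpus, s in REPORTED.items()}

for cpus, eff in EFFICIENCY.items():
    print(f"{cpus} CPUs: {eff:.0%} of linear speedup")
```

Under that reading, efficiency is about 55% on 8 CPUs and about 62% on 14 CPUs.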

Integrating bulk-data transfer into the Aurora distributed shared data system

Journal of Parallel and Distributed Computing, 2001, vol. 61, no. 11, pp. 1609-1632

Author: Lu, P. Univ Alberta, Dept Comp Sci, Edmonton, AB T6G 2E8, Canada

The Aurora distributed shared data system implements a shared-data abstraction on distributed-memory platforms, such as clusters, using abstract data types. Aurora programs are written in C++ and instantiate shared-data objects whose data-sharing behaviour can be optimized using a novel technique called scoped behaviour. Each object and each phase of the computation (i.e., use-context) can be independently optimized with per-object and per-context flexibility. Within the scoped behaviour framework, optimizations such as bulk-data transfer can be implemented and made available to the application programmer. Scoped behaviour carries semantic information regarding the specific data-sharing pattern through various layers of software. We describe how the optimizations are integrated from the uppermost application-programmer layers down to the lowest UDP-based layers of the Aurora system. A bulk-data transfer network protocol bypasses some bottlenecks associated with TCP/IP and achieves higher performance on an ATM network than either TreadMarks (distributed shared memory) or MPICH (message passing) for matrix multiplication and parallel sorting. (C) 2001 Academic Press.

Keywords: bulk-data transfer; distributed-memory computing; shared data; data-sharing patterns; optimizations; scoped behaviour; network of workstations; clusters
