One of the most important properties of distributedcomputing systems (e.g., Apache Spark, Apache Hadoop, etc) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster...
详细信息
ISBN:
(纸本)9781728166773
One of the most important properties of distributedcomputing systems (e.g., Apache Spark, Apache Hadoop, etc) on clusters and computation clouds is the ability to scale out by adding more compute nodes to the cluster. This important feature can lead to performance gain provided the computation (or the algorithm) itself can scale out. In other words, the computation (or the algorithm) should be easily decomposable into smaller units of work to be distributed among the workers based on the hardware/software configuration of the cluster or the cloud. Additionally, on such clusters, there is an important trade-off between communication cost, parallelism, and memory requirement. Due to the scalability need as well as this trade-off, it is crucial to have a well-decomposable, adaptive, tunable, and scalable program. Tunability enables the programmer to find an optimal point in the trade-off spectrum to execute the program efficiently on a specific cluster. We design and implement well-decomposable and tunable dynamic programming algorithms from the Gaussian Elimination Paradigm (GEP), such as Floyd-Warshall's all-pairs shortest path and Gaussian elimination without pivoting, for execution on Apache Spark. Our implementations are based on parametric multi-way recursive divide-&-conquer algorithms. We explain how to map implementations of those grid-based parallel algorithms to the Spark framework. Finally, we provide experimental results illustrating the performance, scalability, and portability of our Spark programs. We show that offloading the computation to an OpenMP environment (by running parallel recursive kernels) within Spark is at least partially responsible for a 2 - 5x speedup of the DP benchmarks.
Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. distributed-memory clusters are often used for analyzing such large graphs since the main mem...
详细信息
ISBN:
(纸本)9781728112466
Graph analytics systems must analyze graphs with billions of vertices and edges which require several terabytes of storage. distributed-memory clusters are often used for analyzing such large graphs since the main memory of a single machine is usually restricted to a few hundreds of gigabytes. This requires partitioning the graph among the machines in the cluster. Existing graph analytics systems usually come with a built-in partitioner that incorporates a particular partitioning policy, but the best partitioning policy is dependent on the algorithm, input graph, and platform. Therefore, built-in partitioners are not sufficiently flexible. Stand-alone graph partitioners are available, but they too implement only a small number of partitioning policies. This paper presents CuSP, a fast streaming edge partitioning framework which permits users to specify the desired partitioning policy at a high level of abstraction and generates high-quality graph partitions fast. For example, it can partition wdc12, the largest publicly available web-crawl graph, with 4 billion vertices and 129 billion edges, in under 2 minutes for clusters with 128 machines. Our experiments show that it can produce quality partitions 6x faster on average than the state-of-the-art stand-alone partitioner in the literature while supporting a wider range of partitioning policies.
Design by Transformation (DxT) is an approach to software development that encodes domain-specific programs as graphs and expert design knowledge as graph transformations. The goal of DxT is to mechanize the generatio...
详细信息
Design by Transformation (DxT) is an approach to software development that encodes domain-specific programs as graphs and expert design knowledge as graph transformations. The goal of DxT is to mechanize the generation of highly-optimized code. This paper demonstrates how DxT can be used to transform sequential specifications of an important set of Dense Linear Algebra (DLA) kernels, the level-3 Basic Linear Algebra Subprograms (BLAS3), into high-performing library routines targeting distributed-memory (cluster) architectures. Getting good BLAS3 performance for such platforms requires deep domain knowledge, so their implementations are manually coded by experts. Unfortunately, there are few such experts and developing the full variety of BLAS3 implementations takes a lot of repetitive effort. A prototype tool, DxTer, automates this tedious task. We explain how we build on previous work to represent loops and multiple loop-based algorithms in DxTer. Performance results on a BlueGene/P parallel supercomputer show that the generated code meets or beats implementations that are hand-coded by a human expert and outperforms the widely used ScaLAPACK library.
With the increasing computational speed of PC processors, lowering hardware prices, and the increasing accessibility of open source software, PC clusters hake become an attractive option for exploring budgetary high p...
详细信息
With the increasing computational speed of PC processors, lowering hardware prices, and the increasing accessibility of open source software, PC clusters hake become an attractive option for exploring budgetary high performance computation on high resolution environment modelling. This paper documents the implementation and operation of a Beowulf-class PC cluster called DOBSON built in the Atmospheric Modelling Laboratory at National Central University in Taiwan. Based on our current configuration, the DOBSON cluster achieves a 4.4-8.7 times of speedup with 8 and 14 CPUs. respectively, compared with a 1 CPU in simulations of the Antarctic vortex. The potential impacts of running a high resolution environmental modelling system on the DOBSON cluster were examined by case simulations of Typhoon Haiyan with a nested five-domain configuration. These results show that the complicated topographical flow in the Taipei Basin can only be properly resolved when model grid resolutions less than 9 km were used. 2005 Elsevier Ltd. All rights reserved.
The Aurora distributed shared data system implements a shared-data abstraction on distributed-memory platforms, such as clusters, using abstract data types. Aurora programs are written in C++ and instantiate shared-da...
详细信息
The Aurora distributed shared data system implements a shared-data abstraction on distributed-memory platforms, such as clusters, using abstract data types. Aurora programs are written in C++ and instantiate shared-data objects whose data-sharing behaviour can be optimized using a novel technique called scoped behaviour. Each object and each phase of the computation (i.e., use-context) can be independently optimized with per-object and per-context flexibility. Within the scoped behaviour framework, optimizations such as bulk-data transfer can be implemented and made available to the application programmer. Scoped behaviour carries semantic information regarding the specific data-sharing pattern through various layers of software. We describe how the optimizations are integrated from the uppermost application-programmer layers down to the lowest UDP-based layers of the Aurora system. A bulk-data transfer network protocol bypasses some bottlenecks associated with TCP/IP and achieves higher performance on an ATM network than either TreadMarks (distributed shared memory) or MPICH (message passing) for matrix multiplication and parallel sorting. (C) 2001 Academic Press.
暂无评论