ISBN (print): 9781509036820
The domains of parallel and distributed computing have been converging continuously, to the degree that state-of-the-art server systems incorporate characteristics from both domains: they comprise a hierarchy of enclosures, where each enclosure houses multiple processor sockets and each socket in turn contains multiple memory controllers. A global address space and cache coherency are provided by multiple layers of fast interconnection technologies, even across enclosures. The growing popularity of such systems creates an urgent need for efficient mappings of fundamental algorithms onto such hierarchical architectures. However, the growing complexity of these systems and the inconsistencies between the implementation strategies of different hardware vendors make it increasingly hard to find mapping strategies that are universally valid. In this paper, we present scalable optimization and mapping strategies in a case study of the popular Scale-Invariant Feature Transform (SIFT) computer vision algorithm. Our approaches are evaluated on a state-of-the-art hierarchical Non-Uniform Memory Access (NUMA) system with 240 physical cores and 12 terabytes of memory, apportioned across 16 NUMA nodes (sockets). SIFT is particularly interesting because the algorithm exercises a variety of common data access patterns, allowing us to discuss the scaling properties of optimization strategies from the distributed and parallel computing domains and their applicability to emerging server systems.
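As a rough illustration of the kind of NUMA-aware mapping such a system calls for (not the authors' code), the following sketch uses OpenMP first-touch placement so that each socket ends up owning the image pages its own threads later process; the buffer name, size, and thread-pinning setup are assumptions.

```cpp
// Minimal sketch: first-touch placement of an image buffer so that each NUMA
// node owns the pages its threads later process. Thread pinning
// (e.g., OMP_PROC_BIND=close, OMP_PLACES=cores) is assumed.
#include <cstddef>
#include <vector>

int main() {
    const std::size_t n = 1 << 26;           // hypothetical pixel count
    std::vector<float> image(n);

    // Parallel first touch: pages are mapped on the NUMA node of the
    // initializing thread, matching the later processing loop below.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        image[i] = 0.0f;

    // Processing loop uses the same static schedule, so each thread mostly
    // touches memory local to its socket.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        image[i] = image[i] * 0.5f + 1.0f;    // stand-in for a SIFT stage
    return 0;
}
```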
ISBN (print): 9781509021406
Many high-performance distributed memory applications rely on point-to-point messaging using the Message Passing Interface (MPI). Due to the latency of the network, among other costs, this communication can limit the scalability of an application when run on high node counts of distributed memory supercomputers. Communication costs are further increased on modern multi- and many-core architectures when using more than one MPI process per node, as each process sends and receives messages independently, inducing multiple latencies and contention for resources. In this paper, we use shared memory constructs available in the MPI 3.0 standard to implement an aggregated communication method that minimizes the number of inter-node messages and thereby reduces these costs. We compare the performance of this Minimal Aggregated SHared Memory (MASHM) messaging to the standard point-to-point implementation on large-scale supercomputers, where we see that MASHM leads to enhanced strong scalability of a weighted Jacobi relaxation. For this application, we also see that the use of shared memory parallelism through MASHM and MPI 3.0 can be more efficient than using Open Multi-Processing (OpenMP). We then present a model for the communication costs of MASHM which shows that this method achieves its goal of reducing latency costs while also reducing bandwidth costs. Finally, we present MASHM as an open source library to facilitate the integration of this efficient communication method into existing distributed memory applications.
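A minimal sketch of the MPI 3.0 shared-memory aggregation idea (not the MASHM library's API): ranks on a node write their contributions directly into a node-shared window, and only a designated leader rank would then exchange the aggregated buffer between nodes. The one-double-per-rank layout is an illustrative assumption.

```cpp
// Sketch: aggregate on-node contributions into one shared buffer via MPI-3
// shared memory so that a single "node leader" performs inter-node traffic.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    MPI_Comm node;                        // ranks that can share memory
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    // Rank 0 allocates one double per on-node rank; others map its segment.
    MPI_Win win;
    double *buf;
    MPI_Aint bytes = (nrank == 0) ? nsize * sizeof(double) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL, node,
                            &buf, &win);
    if (nrank != 0) {
        MPI_Aint sz; int disp;
        MPI_Win_shared_query(win, 0, &sz, &disp, &buf);
    }

    buf[nrank] = 1.0 + nrank;             // write local contribution directly
    MPI_Win_fence(0, win);                // make all on-node writes visible

    if (nrank == 0) {
        // Only the leader would now exchange the aggregated buffer with the
        // leaders on other nodes: one message per node pair instead of one
        // per process pair.
        std::printf("node-aggregated buffer [0..%d] ready\n", nsize - 1);
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```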
ISBN (print): 9781467390026
Sequential Consistency (SC) is the most intuitive memory model for parallel programs. However, modern architectures aggressively reorder and overlap memory accesses, causing SC violations. An SC violation is virtually always a bug. Most prior schemes either search the entire state space of a program or use a constraint solver to find SC violations. A promising recent scheme uses an active testing technique but fails to be effective for SC violations involving a larger number of threads and variables, and larger codebases. We propose Orion, the first active testing technique that can detect, expose, and classify arbitrary SC violations in any program. Orion works in two phases. In the first phase, it finds potential SC violation cycles by focusing on racing accesses. In the second phase, it exposes each SC violation cycle by enforcing the exact scheduling order. We present a detailed design of Orion in the paper. We tested different concurrent algorithms, bug kernels, SPLASH2 and PARSEC applications, and an open source program, Apache. We experimented with the TSO and PSO memory models. We detected and exposed 60 SC violations, of which 15 involve more than two processors and variables. Orion exposes SC violations quickly and with high probability. Compared to a state-of-the-art active testing technique, it has a much better SC violation detection ability.
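For context (this is not Orion's code), the classic store-buffering litmus test below shows the simplest SC violation cycle: under sequential consistency the outcome r1 == 0 && r2 == 0 is impossible, yet relaxed atomics, or hardware store buffers under TSO, can produce it.

```cpp
// Store-buffering litmus test: the forbidden-under-SC outcome r1==0 && r2==0
// can appear with relaxed atomics -- the kind of cycle an SC-violation
// detector looks for.
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main() {
    for (int trial = 0; trial < 100000; ++trial) {
        x.store(0); y.store(0);
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0) {          // impossible under SC
            std::printf("SC violation witnessed at trial %d\n", trial);
            break;
        }
    }
    return 0;
}
```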
ISBN (print): 9781509036837
All many-core systems require fine-grained shared memory parallelism; however, the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data structures, and use of shared cache. While programming models provide high-level abstractions, such as data and task parallelism, algorithmic choices still remain open on how best to implement irregular algorithms, such as sparse factorizations, while taking into account the trade-offs mentioned above. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers utilize the Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes, we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/many-core systems. We find that Kokkos is able to provide comparable performance with both parallel_for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architecture and the input matrices.
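A minimal sketch of the data-parallel Kokkos pattern the two solvers build on (unrelated to the internals of Tacho or Basker): the same parallel_for/parallel_reduce source can target x86, Xeon Phi, or Power8 by selecting the execution space at configure time; the problem size is an arbitrary placeholder.

```cpp
// Kokkos data-parallel sketch: fill a vector and reduce it. Task parallelism
// (Kokkos::TaskScheduler) would express the same work as a DAG of dependent
// tasks instead of flat index ranges.
#include <Kokkos_Core.hpp>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1000000;                  // hypothetical problem size
        Kokkos::View<double*> v("v", n);

        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
            v(i) = 1.0 / (i + 1);
        });

        double sum = 0.0;
        Kokkos::parallel_reduce("sum", n, KOKKOS_LAMBDA(const int i, double &acc) {
            acc += v(i);
        }, sum);
        (void)sum;
    }
    Kokkos::finalize();
    return 0;
}
```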
ISBN (print): 9781467391160
In Big Data computing, improving performance through in-memory computing is a hot topic. In in-memory computing, data deployment directly affects load balance and task efficiency. In the context of in-memory computing on electric power data, two unsolved problems are: (1) only memory space, not CPU frequency and core count, is considered for load balancing and performance improvement; (2) the many manual operations involved make it difficult to complete data deployment automatically. This paper presents an electric power data deployment solution for distributed in-memory computing that addresses these challenges. In the solution, a data deployment strategy is established according to the business logic and the hardware configuration of the cluster nodes. The deployment scheme is then implemented through an interface operation, and finally the cluster nodes load data according to the deployment scheme. The solution has been applied to Objectification Parallel Computing (OPC). The application results show that OPC achieves performance that meets the system's efficiency requirements, and that the data deployment operation is simple.
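The paper does not spell out its weighting formula, so the following is only an illustrative sketch of the idea: derive per-node deployment weights from core count and CPU frequency (the factors the authors note are usually ignored) and split the records proportionally; the node specifications and record count are made up.

```cpp
// Illustrative sketch: proportional data deployment by compute capacity.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Node { int cores; double ghz; };

int main() {
    // Hypothetical cluster description and record count.
    std::vector<Node> cluster = {{24, 2.4}, {16, 3.0}, {32, 2.1}};
    const long rows = 10000000;

    // Assumed weighting: raw compute capacity = cores * frequency. A real
    // deployment plan would additionally cap each share by the node's free
    // memory, the other balancing factor the paper mentions.
    std::vector<double> w;
    double total = 0.0;
    for (const Node &n : cluster) {
        w.push_back(n.cores * n.ghz);
        total += w.back();
    }
    for (std::size_t i = 0; i < cluster.size(); ++i)
        std::printf("node %zu loads about %ld records\n", i,
                    static_cast<long>(rows * w[i] / total));
    return 0;
}
```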
ISBN (print): 9781467391160
In this paper, we examine how to improve workload balancing on a computing cluster with a parallel loop self-scheduling scheme. We use hybrid MPI and OpenMP parallel programming in C. Loop iterations are block-partitioned according to the performance weighting of the compute nodes. This study implements parallel loop self-scheduling using the Xeon Phi, exploiting its characteristics to improve workload balancing between heterogeneous nodes. The parallel loop self-scheduling is composed of a static and a dynamic allocation phase: a weighting algorithm is adopted in the static part, while the well-known loop self-scheduling scheme is adopted in the dynamic part. In recent years, Intel has promoted the Xeon Phi coprocessor, an x86-compatible coprocessor with about 60 cores that can be regarded as a single compute node with non-negligible computing power. In our experiments, we use multiple compute nodes and four applications: matrix multiplication, sparse matrix multiplication, Mandelbrot set computation, and the circuit satisfiability problem. Our results show how to perform the weight allocation and how to choose a scheduling scheme to achieve the best performance with parallel loop self-scheduling.
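A minimal sketch of the two-phase partitioning described above (not the paper's implementation): a weighted static block assignment for a fraction alpha of the iterations, followed by guided self-scheduling of the remainder; the node weights, alpha, and chunking policy are assumptions.

```cpp
// Sketch: static weighted partition plus guided self-scheduling remainder.
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const int N = 100000;                          // total loop iterations
    const double alpha = 0.7;                      // fraction scheduled statically
    std::vector<double> weight = {1.0, 1.0, 4.0};  // e.g. two CPU nodes + one Xeon Phi

    double wsum = 0.0;
    for (double w : weight) wsum += w;

    // Static phase: each node receives a block proportional to its weight.
    int start = 0;
    for (std::size_t i = 0; i < weight.size(); ++i) {
        int len = static_cast<int>(alpha * N * weight[i] / wsum);
        std::printf("node %zu: static iterations [%d, %d)\n", i, start, start + len);
        start += len;
    }

    // Dynamic phase: the remainder is handed out in shrinking chunks
    // (guided self-scheduling: chunk = ceil(remaining / nodes)) to whichever
    // node requests work next.
    int remaining = N - start;
    const int nodes = static_cast<int>(weight.size());
    while (remaining > 0) {
        int chunk = (remaining + nodes - 1) / nodes;
        std::printf("dynamic chunk of %d iterations starting at %d\n",
                    chunk, N - remaining);
        remaining -= chunk;
    }
    return 0;
}
```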
Modern FPGA-based Multiprocessor Systems-on-Chip (MPSoCs) support dynamic reconfiguration of processing elements (PEs) such as processors and accelerators. The reconfiguration improves the flexibility of the system through the dynamic and partial exchange of PEs. However, the design of reconfigurable MPSoCs also leads to higher programming complexity due to the huge design space. To bridge this gap, this paper presents a runtime system consisting of a software layer called LinROS, which dynamically schedules and reconfigures PEs. LinROS uses a novel Linux device driver that automatically manages the software and hardware of the reconfigurable MPSoC at runtime. In addition, an IP core developed for LinROS facilitates easy hardware integration of PEs using High-Level Synthesis tools such as VivadoHLS. Data exchange between PEs is managed by the IP core. The entire system is evaluated on a Xilinx Zynq SoC performing image processing algorithms. The results show a negligible scheduling overhead, while the programming of reconfigurable MPSoCs is significantly simplified by the device driver. Furthermore, the hardware design of PEs is also simplified due to the High-Level Synthesis.
ISBN (print): 9781479986484
The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manycore architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manycore library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25x faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.
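As a hedged illustration of the color-propagation kernel mentioned above (not the authors' implementation), the sketch below runs repeated Kokkos sweeps on an implicit ring graph, where each vertex takes the maximum label among itself and its two neighbors until no label changes; real inputs would use CSR views instead of the ring, and the vertex count is arbitrary.

```cpp
// Jacobi-style color propagation on a ring graph with Kokkos.
#include <Kokkos_Core.hpp>
#include <cstdio>
#include <utility>

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int nv = 4096;                          // hypothetical vertex count
        Kokkos::View<int*> cur("cur", nv), next("next", nv);
        Kokkos::parallel_for("init", nv, KOKKOS_LAMBDA(const int v) {
            cur(v) = v;                               // initial label = vertex id
        });

        int changed = 1, sweeps = 0;
        while (changed > 0) {
            changed = 0;
            // Read labels from cur, write updates to next; count changes.
            Kokkos::parallel_reduce("sweep", nv, KOKKOS_LAMBDA(const int v, int &upd) {
                int best = cur(v);
                int left = cur((v + nv - 1) % nv), right = cur((v + 1) % nv);
                if (left > best) best = left;
                if (right > best) best = right;
                next(v) = best;
                if (best != cur(v)) ++upd;
            }, changed);
            std::swap(cur, next);                     // views swap cheaply by handle
            ++sweeps;
        }
        std::printf("color propagation converged after %d sweeps\n", sweeps);
    }
    Kokkos::finalize();
    return 0;
}
```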
ISBN (print): 9781509042456
A broad class of applications involve indirect or data-dependent memory accesses and are referred to as irregular applications. Recent developments in SIMD architectures - specifically, the emergence of wider SIMD lanes, the combination of SIMD parallelism with many-core MIMD parallelism, and more flexible programming APIs - are providing new opportunities as well as challenges for this class of applications. In this paper, we propose a general optimization methodology to effectively optimize different subclasses of irregular applications. Based on the observation that all applications with indirect memory accesses can be viewed as sparse matrix computations, we design an optimization methodology that includes three sub-steps: 1) locality enhancement through tiling, 2) data access pattern identification, and 3) write conflict removal at both the SIMD and MIMD levels. This method has been applied to unstructured grids, molecular dynamics, and graph applications, in addition to sparse matrix computations. The speedup achieved by our single-threaded vectorized code over serial code is up to 9.05x, whereas the overall speedup when utilizing both SIMD and MIMD (61 cores on Intel Xeon Phi) with our approach is up to 467.1x. Further optimization using matrix reordering on irregular reductions and graph algorithms achieves an incremental speedup of up to 1.69x, though at a relatively high preprocessing cost. Moreover, SpMM using our approach outperforms routines from a highly optimized commercial library by up to 2.81x.
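A minimal sketch of the SIMD-plus-MIMD treatment of one such irregular kernel (not the paper's framework): CSR sparse matrix-vector multiply with OpenMP threads across rows and a simd reduction over the gather-heavy inner loop; the tiling and conflict-removal steps are omitted and the array names are assumptions.

```cpp
// CSR SpMV: MIMD across rows (OpenMP threads), SIMD over the indexed inner loop.
#include <cstdio>
#include <vector>

void spmv_csr(int nrows, const std::vector<int> &rowptr,
              const std::vector<int> &colidx, const std::vector<double> &val,
              const std::vector<double> &x, std::vector<double> &y) {
    #pragma omp parallel for schedule(dynamic, 256)   // MIMD across rows
    for (int r = 0; r < nrows; ++r) {
        double acc = 0.0;
        #pragma omp simd reduction(+:acc)             // SIMD gather + multiply
        for (int k = rowptr[r]; k < rowptr[r + 1]; ++k)
            acc += val[k] * x[colidx[k]];
        y[r] = acc;                                    // one writer per row: no conflicts
    }
}

int main() {
    // 2x2 identity in CSR, just to exercise the kernel.
    std::vector<int> rowptr = {0, 1, 2}, colidx = {0, 1};
    std::vector<double> val = {1.0, 1.0}, x = {3.0, 4.0}, y(2);
    spmv_csr(2, rowptr, colidx, val, x, y);
    std::printf("y = [%g, %g]\n", y[0], y[1]);
    return 0;
}
```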
These keynotes discuss the following: Parallel and Interactive Computing of Big Data; Approximation Algorithms: Methodologies, Applications and Empirical Evaluation.