ISBN (print): 9798350337662
Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is thus exceedingly difficult. With the continuing trend towards more and more heterogeneous systems, it is becoming increasingly important not only to evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but also to be able to continuously re-evaluate queue algorithms in light of novel architectures and capabilities. To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to handcrafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU and GPU architectures. In hopes that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data are made available as open source software at https://***/anyq.
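AnyQ's actual programming abstractions are not reproduced in the abstract. As a rough illustration of what a portable queue interface might look like, the following is a minimal sketch assuming a bounded queue with non-blocking try-operations and a mutex-based backend standing in for one of several interchangeable implementations; the names BoundedQueue, try_enqueue, and try_dequeue are hypothetical, not AnyQ's API.

    // Sketch of a portable bounded MPMC queue interface (hypothetical names; not
    // AnyQ's API). A mutex-based backend is shown; a lock-free or GPU-resident
    // backend could implement the same interface.
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <mutex>
    #include <optional>
    #include <utility>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        // Non-blocking enqueue: fails when the queue is full.
        bool try_enqueue(T value) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.size() >= capacity_) return false;
            items_.push_back(std::move(value));
            return true;
        }

        // Non-blocking dequeue: returns an empty optional when the queue is empty.
        std::optional<T> try_dequeue() {
            std::lock_guard<std::mutex> lock(mutex_);
            if (items_.empty()) return std::nullopt;
            T value = std::move(items_.front());
            items_.pop_front();
            return value;
        }

    private:
        std::size_t capacity_;
        std::deque<T> items_;
        std::mutex mutex_;
    };

    int main() {
        BoundedQueue<int> q(4);
        q.try_enqueue(42);
        if (auto v = q.try_dequeue()) std::printf("dequeued %d\n", *v);
    }

Keeping the interface independent of the backend is what makes it plausible to benchmark the same queue algorithm on both CPU and GPU targets, as the abstract describes.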
Computer vision requires the processing of large volumes of data and needs parallel architectures and algorithms to be useful in real-time, industrial applications. The INSIGHT dataflow language was designed to allow encoding of vision algorithms at all levels of the computer vision paradigm. INSIGHT programs, which are relational in nature, can be translated into a graph structure that represents an architecture for solving a particular vision problem or a configuration of a reconfigurable computational network. We consider here INSIGHT programs that produce a parallel net architecture for solving low-, mid-, and high-level vision tasks.
ISBN (print): 9781467391160
In this paper, we examine how to improve workload balancing on a computing cluster with a parallel loop self-scheduling scheme, using hybrid MPI and OpenMP parallel programming in the C language. Loops are block-partitioned according to the performance weighting of the compute nodes. This study implements parallel loop self-scheduling on the Xeon Phi, using its characteristics to improve workload balancing between heterogeneous nodes. The parallel loop self-scheduling is composed of a static and a dynamic allocation: a weighting algorithm is adopted in the static part, while the well-known loop self-scheduling scheme is adopted in the dynamic part. In recent years, Intel has promoted its Xeon Phi coprocessor, an x86-like coprocessor with about 60 cores; it can be regarded as a single computing node whose computing power cannot be ignored. In our experiments, we use multiple computing nodes and evaluate four applications: matrix multiplication, sparse matrix multiplication, Mandelbrot set computation, and the circuit satisfiability problem. Our results show how to perform the weight allocation and how to choose a scheduling scheme to achieve the best performance with parallel loop self-scheduling.
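The abstract does not give the weighting formula, so the sketch below only illustrates the hybrid idea under assumed parameters: a static block of iterations split in proportion to per-node performance weights, with the remainder handed out in shrinking chunks in the dynamic phase (guided-self-scheduling style). The function names, the example weights, and the half-static/half-dynamic split are assumptions for illustration, not the paper's scheme.

    // Sketch of hybrid static/dynamic loop partitioning (illustrative only).
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Chunk { long begin; long end; };  // half-open [begin, end)

    // Statically assign the first `static_part` iterations to the nodes in
    // proportion to their performance weights.
    std::vector<Chunk> static_partition(long static_part, const std::vector<double>& weights) {
        double total = 0.0;
        for (double w : weights) total += w;
        std::vector<Chunk> chunks;
        long next = 0;
        for (std::size_t i = 0; i < weights.size(); ++i) {
            long share = static_cast<long>(static_part * weights[i] / total);
            if (i + 1 == weights.size()) share = static_part - next;  // absorb rounding
            chunks.push_back({next, next + share});
            next += share;
        }
        return chunks;
    }

    // Dynamic phase: hand out shrinking chunks of the remaining iterations,
    // one chunk per request from an idle node (guided-self-scheduling style).
    Chunk next_dynamic_chunk(long& cursor, long total_end, std::size_t num_nodes) {
        long remaining = total_end - cursor;
        long size = std::max<long>(1, remaining / (2 * static_cast<long>(num_nodes)));
        Chunk c{cursor, std::min(total_end, cursor + size)};
        cursor = c.end;
        return c;
    }

    int main() {
        const long N = 1000;                       // total loop iterations
        std::vector<double> weights = {1.0, 2.5};  // assumed: Phi node ~2.5x a host node
        long static_part = N / 2;                  // assumption: half static, half dynamic
        auto fixed = static_partition(static_part, weights);
        for (std::size_t i = 0; i < fixed.size(); ++i)
            std::printf("node %zu static: [%ld, %ld)\n", i, fixed[i].begin, fixed[i].end);
        long cursor = static_part;
        while (cursor < N) {
            Chunk c = next_dynamic_chunk(cursor, N, weights.size());
            std::printf("dynamic chunk: [%ld, %ld)\n", c.begin, c.end);
        }
    }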
ISBN (print): 0769521355
This paper addresses the problem of transcoding proxy placement for coordinated en-route web caching in tree networks. We model the problem by considering all the nodes in the network in a coordinated way and formulate it as an optimization problem. We implement our dynamic programming-based algorithm and evaluate our model on different performance metrics through extensive simulation experiments. The results show that our model outperforms the placement model for linear topologies.
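The paper's cost model is only summarized above. Purely to convey the flavor of a dynamic program over a tree, here is a toy sketch under simplifying assumptions that are not taken from the paper: the server sits at the root, each node's demand is served by the nearest proxy or the server on its path to the root, hops have unit cost, and each node has a proxy placement cost. A real implementation would also memoize dp on the pair (node, distance), which this toy omits.

    // Toy dynamic program for proxy placement on a tree (hypothetical model).
    #include <cstdio>
    #include <vector>

    struct Node {
        double demand = 0.0;          // requests originating at this node
        double placement_cost = 0.0;  // cost of installing a transcoding proxy here
        std::vector<int> children;
    };

    // dp(v, d): minimum cost of v's subtree when the nearest proxy or server on
    // the path above v is d hops away.
    double dp(const std::vector<Node>& tree, int v, int d) {
        const Node& n = tree[v];
        // Option 1: place a proxy at v; v's demand is served locally and every
        // child now sees a proxy one hop away.
        double place = n.placement_cost;
        for (int c : n.children) place += dp(tree, c, 1);
        // Option 2: no proxy at v; v's demand travels d hops, children see d + 1.
        double skip = n.demand * d;
        for (int c : n.children) skip += dp(tree, c, d + 1);
        return place < skip ? place : skip;
    }

    int main() {
        // Small example tree: 0 (server) -> 1 -> {2, 3}.
        std::vector<Node> tree(4);
        tree[0].children = {1};
        tree[1].demand = 1.0; tree[1].placement_cost = 5.0; tree[1].children = {2, 3};
        tree[2].demand = 8.0; tree[2].placement_cost = 3.0;
        tree[3].demand = 2.0; tree[3].placement_cost = 3.0;
        // The root hosts the origin server, so its subtrees start at distance 1.
        double cost = 0.0;
        for (int c : tree[0].children) cost += dp(tree, c, 1);
        std::printf("minimum total cost: %.1f\n", cost);
    }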
This paper introduces a new type of parallel computer based on N+1 programs (hereinafter, N+1 computer), as well as its features. A new concept of parallel computing architecture based on N+1 programs is also presente...
ISBN (print): 9781538694039
With the rapid development of the Internet and the continuous rise in the number of network users, network traffic in various regions is increasing rapidly. In high-speed, high-throughput network environments, traditional packet capture methods and processing capabilities cannot keep up, which results in severe packet loss. This paper focuses on a high-performance packet acquisition and distribution method to break through the performance bottleneck of commodity servers and network cards. It studies a packet capture method based on the DPDK platform and uses the hash value computed for RSS to improve the efficiency of packet distribution, enabling the pipeline from capture to efficient multi-core parallel processing. The method can effectively reduce packet loss and improve the packet processing rate, and it also reduces resource waste and network overhead for traffic capture and distribution. Preliminary experiments show that DPDK-based traffic processing has obvious advantages over PF-RING and Netmap in data processing speed.
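DPDK's actual API calls are not shown here; the sketch below only illustrates the distribution idea the abstract relies on, assuming a software stand-in for the NIC's RSS hash that maps each packet's 5-tuple to a fixed worker queue so all packets of a flow land on the same core. The FNV-1a mix is a placeholder for the Toeplitz hash RSS normally uses, and the struct and function names are assumptions for illustration.

    // Sketch of RSS-style flow distribution: hash each packet's 5-tuple and map
    // it to a per-core worker queue (illustrative; real RSS is computed by the
    // NIC, which delivers the hash along with the packet).
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct FiveTuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    // FNV-1a-style mix over the tuple fields; a software stand-in for the hardware hash.
    uint32_t flow_hash(const FiveTuple& t) {
        auto mix = [](uint32_t h, uint32_t v) { return (h ^ v) * 16777619u; };
        uint32_t h = 2166136261u;
        h = mix(h, t.src_ip);
        h = mix(h, t.dst_ip);
        h = mix(h, (static_cast<uint32_t>(t.src_port) << 16) | t.dst_port);
        h = mix(h, t.proto);
        return h;
    }

    int main() {
        const unsigned num_workers = 4;  // assumption: one RX queue per worker core
        std::vector<unsigned> load(num_workers, 0);
        std::vector<FiveTuple> packets = {
            {0x0A000001, 0x0A000002, 1234, 80, 6},
            {0x0A000003, 0x0A000002, 5555, 443, 6},
            {0x0A000001, 0x0A000002, 1234, 80, 6},  // same flow -> same worker
        };
        for (const auto& p : packets)
            ++load[flow_hash(p) % num_workers];     // packets of a flow stay together
        for (unsigned i = 0; i < num_workers; ++i)
            std::printf("worker %u handled %u packets\n", i, load[i]);
    }

With hardware RSS the hash arrives precomputed with each packet, so the host side only pays for the modulo and the enqueue onto the chosen worker's queue.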
Threads provides a mechanism for simulating the execution of parallel algorithms on a simplified model of a shared-memory multiprocessor. The algorithms can be expressed in a high-level block-structured language, which supports multiple threads of execution within a common body of program code. Results show an ability to achieve good speedup for small problems using algorithms derived by simple modifications of sequential algorithms. In addition, a sibling thread synchronisation feature provides the basis for the synchronous execution of threads. k-parallel algorithms, tailored to the machine size and implemented as synchronously executing iterations, can provide near-linear speedup as the problem size is increased. The techniques described in this paper seem to promise an effective synchronous execution mode for shared-memory MIMD architectures.
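The Threads language itself is not shown in the abstract. As a minimal sketch of the synchronous execution mode it describes, the following standard C++20 snippet runs a k-thread sibling team in lockstep, with a barrier closing every iteration; the team size and the per-step work are placeholders, not anything from the paper.

    // Sketch of k-parallel synchronous execution: k sibling threads advance in
    // lockstep, meeting at a barrier after every iteration (C++20 std::barrier).
    #include <barrier>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
        const int k = 4;            // team size, matched to the "machine size"
        const int iterations = 3;
        std::barrier sync(k);       // all k siblings meet here each step

        std::vector<std::thread> team;
        for (int id = 0; id < k; ++id) {
            team.emplace_back([&, id] {
                for (int step = 0; step < iterations; ++step) {
                    // ... this thread's share of the work for `step` would go here ...
                    std::printf("thread %d finished step %d\n", id, step);
                    sync.arrive_and_wait();  // no sibling starts step + 1 early
                }
            });
        }
        for (auto& t : team) t.join();
    }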
This paper presents the analysis of a parallel formulation of depth-first search. At the heart of this parallel formulation is a dynamic work-distribution scheme that divides the work between different processors. The effectiveness of the parallel formulation is strongly influenced by the work-distribution scheme and the target architecture. We introduce the concept of isoefficiency function to characterize the effectiveness of different architectures and work-distribution schemes. Many researchers considered the ring architecture to be quite suitable for parallel depth-first search. Our analytical and experimental results show that hypercube and shared-memory architectures are significantly better. The analysis of previously known work-distribution schemes motivated the design of substantially improved schemes for ring and shared-memory architectures. In particular, we present a work-distribution algorithm that guarantees close to optimal performance on a shared-memory/ω-network-with-message-combining architecture (e.g. RP3). Much of the analysis presented in this paper is applicable to other parallel algorithms in which work is dynamically shared between different processors (e.g., parallel divide-and-conquer algorithms). The concept of isoefficiency is useful in characterizing the scalability of a variety of parallel algorithms.
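For reference, the isoefficiency function mentioned above can be stated compactly; this is the standard definition rather than a derivation specific to this paper. With W the problem size (sequential work), p the number of processors, and T_o(W, p) the total overhead of the parallel system,

    E \;=\; \frac{T_s}{p\,T_p} \;=\; \frac{W}{W + T_o(W,p)} \;=\; \frac{1}{1 + T_o(W,p)/W},
    \qquad\text{so } E \text{ stays constant only if}\quad
    W \;=\; \frac{E}{1-E}\, T_o(W,p) \;=\; K\, T_o(W,p).

Solving this relation for W as a function of p yields the isoefficiency function: the slower W must grow to keep efficiency fixed, the more scalable the combination of algorithm, work-distribution scheme, and architecture.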
ISBN (digital): 9781665488020
ISBN (print): 9781665488020
Programming parallel architectures from a hierarchical point of view is becoming today's standard, as machines are structured by multiple layers of memory. To handle such architectures, we focus on the MULTI-BSP bridging model. This model extends BSP and proposes a structured way of programming multi-level architectures. In the context of parallel programming, we now need to manage new concerns such as memory coherency, deadlocks, and safe data communication. To do so, we propose a typing system for MULTI-ML, an ML-like programming language based on the MULTI-BSP model. This type system introduces data locality using type annotations and effects in order to detect incorrect uses of multi-level architectures. We thus ensure that "well-typed programs cannot go wrong" on hierarchical architectures.
With the evolution of High Performance Computing, multi-core and many-core systems are a common feature of new hardware architectures. The programming effort required by the introduction of these architectures is challenging due to the increasing number of cores. Parallel programming models based on the data flow model and the task programming paradigm aim to fix this issue. Iterative linear solvers are a key part of petroleum reservoir simulation, as they can represent up to 80% of the total computing time. In these algorithms, the standard preconditioning methods for large, sparse and unstructured matrices, such as Incomplete LU Factorization (ILU) or Algebraic Multigrid (AMG), fail to scale on shared-memory architectures with a large number of cores. Recently introduced multi-level domain decomposition (DDML) preconditioners seem to be both numerically robust and scalable on emerging architectures because of their parallel nature. This paper proposes a parallel implementation of these preconditioners using the task programming paradigm with a data flow model. The approach is validated on linear systems extracted from realistic petroleum reservoir simulations. This shows that, given an appropriate coarse operator in such preconditioners, the method has good convergence rates, while our implementation ensures good scalability on multi-core architectures. (C) 2019 Elsevier B.V. All rights reserved.
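The abstract does not detail the DDML operators, so the sketch below only illustrates the task-based execution pattern such a preconditioner builds on: independent subdomain solves and a coarse correction expressed as OpenMP tasks and combined additively. The local and coarse solves here are toy placeholders, not the authors' preconditioner, and the additive combination is an assumption chosen for simplicity.

    // Sketch of a task-parallel application of an additive two-level
    // preconditioner: independent subdomain solves run as tasks, a coarse solve
    // runs concurrently, and the results are summed (toy operators only).
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Placeholder local solve; a real DDML preconditioner would apply a factored
    // subdomain operator here.
    void subdomain_solve(const std::vector<double>& r, std::vector<double>& z,
                         std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i) z[i] = 0.5 * r[i];
    }

    // Placeholder coarse correction; a real one would use a Galerkin coarse operator.
    void coarse_solve(const std::vector<double>& r, std::vector<double>& zc) {
        double avg = 0.0;
        for (double v : r) avg += v;
        avg /= static_cast<double>(r.size());
        for (double& v : zc) v = 0.1 * avg;
    }

    int main() {
        const std::size_t n = 1 << 20, num_domains = 8;
        std::vector<double> r(n, 1.0), z(n, 0.0), zc(n, 0.0);

        #pragma omp parallel
        #pragma omp single
        {
            for (std::size_t d = 0; d < num_domains; ++d) {
                std::size_t begin = d * n / num_domains, end = (d + 1) * n / num_domains;
                #pragma omp task firstprivate(begin, end) shared(r, z)
                subdomain_solve(r, z, begin, end);   // independent fine-level tasks
            }
            #pragma omp task shared(r, zc)
            coarse_solve(r, zc);                     // coarse correction in parallel
            #pragma omp taskwait
        }
        for (std::size_t i = 0; i < n; ++i) z[i] += zc[i];  // additive combination
        std::printf("z[0] = %.2f\n", z[0]);
    }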