ISBN: (Print) 9781538666142
The auction algorithm has been widely used to solve the bipartite graph matching problem, and its parallel implementation is employed to find solutions in a reasonable computational time. Moreover, the new multicore architectures, besides their multiple cores, have a SIMD instruction set that can increase application performance when exactly the same operations are to be performed on multiple data objects. The aim of this paper is to execute the auction algorithm efficiently on these architectures. To achieve that, a vectorized version was implemented and evaluated. These versions were then run in parallel using the OpenMP library. Finally, to optimize the number of threads used during the execution, a malleable strategy is proposed and evaluated. Results show that the vectorized version outperforms the sequential one by a factor of 10, while the malleable vectorized version was able to adapt its execution to exploit the full potential of multicore architectures.
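To make the abstract concrete, here is a minimal sequential sketch of the (forward) auction algorithm for the assignment problem; the paper's vectorized and OpenMP versions parallelize the per-person bid computation, which this sketch keeps as a plain loop. All names and the `eps` value are illustrative, not taken from the paper.

```python
def auction_assignment(benefit, eps=0.01):
    """Assign each person to one object, maximizing total benefit.
    benefit[i][j] = value of object j to person i (square matrix)."""
    n = len(benefit)
    prices = [0.0] * n
    owner = [None] * n            # owner[j] = person currently holding object j
    unassigned = list(range(n))
    while unassigned:
        i = unassigned.pop()
        # Net value of every object for person i (the SIMD-friendly inner loop).
        values = [benefit[i][j] - prices[j] for j in range(n)]
        best = max(range(n), key=lambda j: values[j])
        second = max(values[j] for j in range(n) if j != best)
        # The bid raises the price by the value margin plus eps,
        # which guarantees termination.
        prices[best] += values[best] - second + eps
        if owner[best] is not None:
            unassigned.append(owner[best])   # previous owner is outbid
        owner[best] = i
    return owner, prices
```

For integer benefits and a small enough `eps`, the final assignment is optimal; the vectorizable part is the computation of `values` across all objects at once.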
Nowadays, subsequence similarity search under the Dynamic Time Warping (DTW) similarity measure is applied in a wide range of time series mining applications. Since the DTW measure has a quadratic computational comple...
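The quadratic cost the abstract refers to comes from the DTW dynamic program itself; a minimal O(n·m) sketch (standard textbook recurrence, not the paper's implementation) is:

```python
import math

def dtw(a, b):
    """Squared-distance DTW between sequences a and b via the
    classic O(n*m) dynamic program."""
    n, m = len(a), len(b)
    INF = math.inf
    # d[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]
```

Because every cell depends on three neighbors, the n·m table cannot be skipped wholesale, which is what motivates parallel and pruned implementations.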
ISBN: (Print) 9781538670972
Real-time response is critical in many areas and requires fast processing of streaming data continuously arriving from different sources. This paper presents some approaches and implementation issues for developing complex workflows in stream data processing, which is one of the main challenges in big data processing.
ISBN: (Print) 9781450359863
Graph-specific computing with the support of dedicated accelerators has greatly boosted graph processing in both efficiency and energy. Nevertheless, their data conflict management is still sequential when a certain vertex needs a large number of conflicting updates at the same time, leading to prohibitive performance degradation. This is particularly true and serious for processing natural graphs. In this paper, we have the insight that the atomic operations for the vertex updating of many graph algorithms (e.g., BFS, PageRank, and WCC) are typically incremental and simplex. This hence allows us to parallelize the conflicting vertex updates in an accumulative manner. We architect AccuGraph, a novel graph-specific accelerator that can simultaneously process atomic vertex updates for massive parallelism while ensuring correctness. A parallel accumulator is designed to remove the serialization in atomic protections for conflicting vertex updates by merging their results in parallel. Our implementation on a Xilinx FPGA with a wide variety of typical graph algorithms shows that our accelerator achieves an average throughput of 2.36 GTEPS as well as up to 3.14x performance speedup in comparison with the state-of-the-art ForeGraph (in its single-chip version).
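The key observation above is that updates such as min (BFS levels) or add (PageRank contributions) are commutative and associative, so a batch of conflicting updates to one vertex can be merged pairwise in a reduction tree instead of serialized atomics. A tiny software model of that idea (AccuGraph does this in hardware; this sketch is illustrative only):

```python
from functools import reduce

def accumulate(updates, op, identity):
    """Merge a batch of conflicting updates destined for one vertex.
    In a hardware accumulator this reduction completes in
    log2(len(updates)) stages rather than len(updates) atomic steps."""
    return reduce(op, updates, identity)

# BFS: keep the smallest proposed level for the vertex.
bfs_result = accumulate([5, 3, 7, 3], min, float("inf"))

# PageRank: sum the rank contributions arriving at the vertex.
pr_result = accumulate([0.1, 0.25, 0.05], lambda x, y: x + y, 0.0)
```

Any update operator with these algebraic properties admits the same treatment, which is why the abstract lists several algorithms (BFS, PageRank, WCC) as beneficiaries.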
ISBN: (Print) 9781728101767
The proceedings contain 11 papers. The topics discussed include: on advanced Monte Carlo methods for linear algebra on advanced accelerator architectures; event-triggered communication in parallel computing; non-collective scalable global network based on local communications; shift-collapse acceleration of generalized polarizable reactive molecular dynamics for machine learning-assisted computational synthesis of layered materials; communication-avoiding multigrid preconditioned conjugate gradient method for extreme-scale multiphase CFD simulations; dynamic load balancing of plasma and flow simulations; low thread-count Gustavson: a multithreaded algorithm for sparse matrix-matrix multiplication using perfect hashing; and a general-purpose hierarchical mesh partitioning method with node balancing strategies for large-scale numerical simulations.
ISBN: (Print) 9781538674796
Sequence alignment is the most widely used operation in bioinformatics. With the exponential growth of the biological sequence databases, searching a database to find the optimal alignment for a query sequence (which can be on the order of hundreds of millions of characters long) would require excessive processing power and memory bandwidth. Sequence alignment algorithms can potentially benefit from the processing power of massively parallel processors due to their simple arithmetic operations, coupled with the inherent fine-grained and coarse-grained parallelism that they exhibit. However, the limited memory bandwidth in conventional computing systems prevents exploiting the maximum achievable speedup. In this paper, we propose a processing-in-memory architecture as a viable solution to the excessive memory bandwidth demand of bioinformatics applications. The design is composed of a set of simple and lightweight processing elements, customized to the sequence alignment algorithm and integrated at the logic layer of an emerging 3D DRAM architecture. Experimental results show that the proposed architecture achieves up to 2.4x speedup and a 41% reduction in power consumption compared to a processor-side parallel implementation.
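The "simple arithmetic operations" mentioned above are typified by the Smith-Waterman local-alignment recurrence, whose per-cell work is just additions and a max, which maps naturally onto lightweight processing elements. A minimal score-only sketch (the scoring values are illustrative; the paper does not specify them):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local-alignment score between strings a and b."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,                      # local alignment may restart
                          H[i - 1][j - 1] + s,    # align / substitute
                          H[i - 1][j] + gap,      # gap in b
                          H[i][j - 1] + gap)      # gap in a
            best = max(best, H[i][j])
    return best
```

Each anti-diagonal of H can be computed in parallel, which is the fine-grained parallelism the abstract alludes to; the coarse-grained parallelism comes from scoring many database sequences independently.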
ISBN: (Print) 9781450365109
Emerging many-core CPU architectures with high degrees of single-instruction, multiple-data (SIMD) parallelism promise to enable increasingly ambitious simulations based on partial differential equations (PDEs) via extreme-scale computing. However, such architectures present several challenges to their efficient use. Here, we explore the efficient implementation of sparse matrix-vector (SpMV) multiplication, a critical kernel for the workhorse iterative linear solvers used in most PDE-based simulations, on recent CPU architectures from Intel, as well as the second-generation Knights Landing Intel Xeon Phi, which features many CPU cores, wide SIMD lanes, and on-package high-bandwidth memory. Traditional SpMV algorithms use the compressed sparse row (CSR) storage format, which is a hindrance to exploiting wide SIMD lanes. We study alternative matrix formats and present an efficient optimized SpMV kernel, based on a sliced ELLPACK representation, which we have implemented in the PETSc library. In addition, we demonstrate the benefit of using this representation to accelerate preconditioned iterative solvers in realistic PDE-based simulations in parallel.
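The sliced ELLPACK idea can be sketched briefly: rows are grouped into fixed-height slices and each slice is padded only to its own maximum row length, so the rows of a slice can be advanced in lockstep, one SIMD lane per row. A pure-Python model of the format conversion and kernel (illustrative, not PETSc's actual data layout):

```python
def csr_to_sell(indptr, indices, data, nrows, slice_h):
    """Convert CSR arrays to a list of (row_offset, cols, vals) slices,
    each padded to the slice-local maximum row length."""
    slices = []
    for s in range(0, nrows, slice_h):
        rows = range(s, min(s + slice_h, nrows))
        width = max(indptr[r + 1] - indptr[r] for r in rows)
        cols = [[0] * width for _ in rows]
        vals = [[0.0] * width for _ in rows]
        for k, r in enumerate(rows):
            for t, p in enumerate(range(indptr[r], indptr[r + 1])):
                cols[k][t] = indices[p]
                vals[k][t] = data[p]   # padding keeps value 0.0 (no-op in SpMV)
        slices.append((s, cols, vals))
    return slices

def sell_spmv(slices, x, nrows):
    """y = A @ x with A in sliced-ELLPACK form."""
    y = [0.0] * nrows
    for s, cols, vals in slices:
        if not cols:
            continue
        for t in range(len(cols[0])):        # one padded column at a time
            for k in range(len(cols)):       # SIMD lane per row of the slice
                y[s + k] += vals[k][t] * x[cols[k][t]]
    return y
```

The inner `k` loop touches consecutive rows with no per-row branch, which is what a compiler (or intrinsics) can turn into one SIMD instruction per padded column; CSR's variable row lengths frustrate exactly that.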
In recent years, Industry 4.0 has emerged and received substantial attention. Industry 4.0 integrates the physical and decisional aspects of manufacturing processes to improve the ability to cope with unpredictable and negative events. Production scheduling has always been a common problem faced by manufacturing systems, and stochastic scheduling is thus a hot research topic. It is usually assumed in the literature that job processing times are stochastic. Due to inadequate or unrepresentative historical data, the probability distribution of processing times may not be well estimated. This work investigates a bi-objective stochastic parallel machine scheduling problem, to minimize the capital budget and the risk level simultaneously. Only partial distributional information on the uncertain job processing times, i.e., the mean and covariance matrix, is known. In particular, the risk level is measured by the probability of there being tardy jobs. For the problem, a distributionally robust bi-objective formulation is first proposed. Then a popular approximation method is applied to the probabilistic objective function, and an ε-constraint method is further developed. A case study is conducted and analyzed, and some managerial insights are drawn. (C) 2019, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.
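When only the mean and variance of a completion time are known, a standard distribution-free way to bound a tardiness probability is the one-sided Chebyshev (Cantelli) inequality; whether this is the exact approximation the paper applies is an assumption here, offered only to illustrate what "partial distributional information" buys you.

```python
def cantelli_tardy_bound(mean, var, due_date):
    """Distribution-free upper bound on P(completion_time > due_date),
    via Cantelli's inequality: P(X - mu >= t) <= var / (var + t^2)."""
    if due_date <= mean:
        return 1.0                     # the bound is vacuous at or below the mean
    gap = due_date - mean
    return var / (var + gap * gap)
```

Such bounds hold for every distribution with the given mean and variance, which is exactly the guarantee a distributionally robust formulation needs.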
ISBN: (Print) 9781450359863
Processing-In-Memory (PIM) is an increasingly popular architecture aimed at addressing the 'memory wall' crisis by prioritizing the integration of processors within DRAM. It promotes low data access latency, high bandwidth, massive parallelism, and low power consumption. The skyline operator is a known primitive used to identify those multi-dimensional points offering optimal trade-offs within a given dataset. For a large multidimensional dataset, calculating the skyline is extensively compute- and data-intensive. Although PIM systems present opportunities to mitigate this cost, their execution model relies on all processors operating in isolation with minimal data exchange. This prohibits the direct application of known skyline optimizations, which are inherently sequential, create dependencies and large intermediate results that limit the maximum parallelism and throughput, and require an expensive merging phase. In this work, we address these challenges by introducing the first skyline algorithm for PIM architectures, called DSky. It is designed to be massively parallel and throughput-efficient by leveraging a novel work assignment strategy that emphasizes load balancing. Our experiments demonstrate that it outperforms the state-of-the-art algorithms for CPUs and GPUs in most cases. DSky achieves 2x to 14x higher throughput compared to the state-of-the-art solutions on competing CPU and GPU architectures. Furthermore, we showcase DSky's good scaling properties, which are intertwined with PIM's ability to allocate resources with minimal added cost. In addition, we showcase an order of magnitude better energy consumption compared to CPUs and GPUs.
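For readers unfamiliar with the operator itself, here is a minimal block-nested-loops definition of the skyline (assuming smaller is better in every dimension). DSky's contribution is distributing and load-balancing this work across PIM cores; this sequential sketch only pins down what is being computed.

```python
def dominates(p, q):
    """p dominates q: p is no worse in every dimension and
    strictly better in at least one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    """Points not dominated by any other point (the Pareto front)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

The O(n^2) comparison structure is what makes large datasets expensive, and the dominance tests are exactly the kind of simple, data-parallel work PIM processing elements handle well.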
ISBN: (Print) 9781538683866
Finding whether a graph is k-connected, and the identification of its k-connected components, is a fundamental problem in graph theory. For this reason, there have been several algorithms for this problem in both the sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. While BFS can be made very efficient in a sequential setting, the same cannot be said of parallel environments. A major factor in this difficulty is the inherent requirement to use a shared queue, balance work among multiple threads in every round, synchronize, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. For this reason, it can be noticed that the time spent by current parallel graph connectivity algorithms on BFS operations is usually a significant portion of their overall runtime. In this paper, we study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of relying on BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but can actually work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing steps on suitably defined small auxiliary graphs. Our experiments on finding the 2- and 3-connectivity of graphs on Nvidia K40c GPUs improve the state-of-the-art on the corresponding problems by factors of 2.2x and 2.1x, respectively.
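A sketch of the underlying point that connectivity information needs only a sparse spanning forest, not a BFS tree: a union-find pass over the edge list yields one representative per connected component, with no frontier queue or round-by-round synchronization. This is a standard technique, shown here for intuition rather than as the paper's GPU algorithm.

```python
def connected_components(n, edges):
    """Union-find over the edge list; returns a component label per
    vertex and the edges retained in the spanning forest."""
    parent = list(range(n))

    def find(x):                        # find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []                         # a sparse spanning subgraph
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv             # union the two components
            forest.append((u, v))
    return [find(x) for x in range(n)], forest
```

The retained `forest` has at most n - 1 edges, which is the kind of sparse spanning subgraph that the higher-connectivity (2-, 3-connectivity) post-processing can then work from.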