ISBN (print): 9781728149523
We present the first conditional hardness results for massively parallel algorithms for some central graph problems, including (approximating) maximum matching, vertex cover, maximal independent set, and coloring. In some cases, these hardness results match or come close to the state-of-the-art algorithms. Our hardness results are conditioned on a widely believed conjecture in massively parallel computation about the complexity of the connectivity problem. We also note that an unconditional variant of such hardness results is known to be somewhat out of reach for now, as it would lead to considerably improved circuit complexity lower bounds and would concretely imply that NC^1 is a proper subset of P. We obtain our conditional hardness results via a general method that lifts unconditional lower bounds from the well-studied LOCAL model of distributed computing to the massively parallel computation setting.
Past research has shown that public key cryptosystems are usually slower than symmetric key cryptosystems because they use an additional cryptographic key and different methods for encryption and decryption. RSA is one of the most common asymmetric key cryptography algorithms. Recent research has focused on speeding up RSA using various techniques. With the introduction of distributed computing, parallelization enables algorithms to run on multiple cores concurrently. RSA consists of two resource-intensive operations: modular exponentiation with exponents of up to 1024 bits and repeated computation of the greatest common divisor. RSA is therefore a natural target for the Montgomery reduction algorithm, which optimizes the repeated modular multiplications in exponentiation. In this paper we propose a parallel scheme for RSA using a new parallel data structure, a concurrent indexed list of character blocks. The aim of our research was to improve the speed of RSA encryption and decryption through parallelism while remaining compatible with leading industry cryptography standards. We have simulated four different approaches: parallel and sequential, each with and without Montgomery reduction. We have also integrated our parallel paradigm with a well-known C++ crypto library and achieved a speedup of up to four times over the sequential approach. Unlike previous approaches, our implementation integrates easily with external libraries and can thus be adopted by other algorithmic schemes.
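To make the role of Montgomery reduction concrete, the sketch below shows textbook Montgomery modular exponentiation in Python. It is an illustrative stand-in, not the paper's C++ implementation; the function names (redc, mont_pow) are our own.

    def redc(t, n, r, n_prime, k):
        # Montgomery reduction: returns t * r^(-1) mod n, for 0 <= t < r*n.
        m = ((t & (r - 1)) * n_prime) & (r - 1)   # t * n' mod r (r = 2^k)
        u = (t + m * n) >> k                      # exact division by r
        return u - n if u >= n else u

    def mont_pow(base, exp, n):
        # Left-to-right square-and-multiply with all multiplications done in
        # Montgomery form, so each step needs only shifts and masks instead
        # of a full division by n.
        k = n.bit_length()
        r = 1 << k                                # r > n, gcd(r, n) = 1 (n odd)
        n_prime = (-pow(n, -1, r)) % r            # n * n_prime = -1 (mod r)
        base_m = (base * r) % n                   # base in Montgomery form
        x = r % n                                 # 1 in Montgomery form
        for bit in bin(exp)[2:]:
            x = redc(x * x, n, r, n_prime, k)
            if bit == "1":
                x = redc(x * base_m, n, r, n_prime, k)
        return redc(x, n, r, n_prime, k)          # leave Montgomery form

    assert mont_pow(42, 65537, 2**1024 - 105) == pow(42, 65537, 2**1024 - 105)

The interior of the exponentiation loop is a chain of independent-looking multiplications per block of data, which is what makes block-level parallelization of the kind the paper describes attractive.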
ISBN (print): 9781450362863
Model-based approaches in Software Performance Engineering (SPE) are used in early design phases to evaluate performance. Most current model-based prediction approaches work quite well for single-core CPUs but are not suitable or precise enough for multicore environments, because they consider only a single metric (i.e., the CPU speed) as a factor affecting performance. Therefore, we investigate parallel-performance-influencing factors (PPIFs) as a preparatory step towards improving current performance prediction models, by providing reference curves for the speedup behaviour of different resource demands and scenarios. In this paper, we show initial results and their relevance for future work.
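The abstract does not spell out a PPIF model, but the kind of reference curve it mentions can be illustrated with the classic Amdahl bound; the parallelizable fraction here is a hypothetical workload parameter, not something taken from the paper.

    def amdahl_speedup(parallel_fraction, cores):
        # Upper bound on speedup when only `parallel_fraction` of the
        # resource demand benefits from additional cores.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

    for f in (0.5, 0.9, 0.99):
        curve = [round(amdahl_speedup(f, c), 2) for c in (1, 2, 4, 8, 16, 32)]
        print(f"parallel fraction {f}: {curve}")

Measured PPIF curves would refine such an idealized bound with effects like synchronization and memory contention.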
ISBN (print): 9781728147345
Gradient Boosting Decision Tree (GBDT) is a widely used machine learning algorithm whose training involves both irregular computation and random memory access, making it challenging for system optimizations. In this paper, we conduct a comprehensive performance analysis of two state-of-the-art systems, XGBoost and LightGBM. They represent two typical parallel implementations of GBDT: one is data-parallel and the other is parallel over features. Substantial thread synchronization overhead, as well as the inefficiency of random memory access, is identified. We propose HarpGBDT, a new GBDT system designed from the perspective of parallel efficiency optimization. First, we adopt a new tree growth method that selects the top K candidate tree nodes, enabling the use of more levels of parallelism without sacrificing the algorithm's accuracy. Second, we organize the training data and model data in blocks and propose a block-wise approach as a general model that enables the exploration of various parallelism options. Third, we propose a mixed mode that exploits the advantages of different modes of parallelism in different phases of training. By changing the configuration of the block size and parallel mode, HarpGBDT is able to attain better parallel efficiency. In extensive experiments on four datasets with different statistical characteristics on an Intel(R) Xeon(R) E5-2699 server, HarpGBDT on average performs 8x faster than XGBoost and 2.6x faster than LightGBM.
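HarpGBDT's code is not reproduced in this abstract; the toy sketch below only illustrates the "top K candidate nodes" growth idea it describes: instead of expanding one best leaf (leaf-wise) or every leaf (level-wise), expand the K frontier nodes with the highest split gain each iteration.

    import heapq

    def grow_top_k(frontier, gain, k):
        # Expand only the k frontier nodes with the highest split gain, a
        # middle ground between leaf-wise (k = 1) and level-wise (k = all)
        # growth that exposes more node-level parallelism per iteration.
        return heapq.nlargest(k, frontier, key=gain)

    leaves = [{"id": i, "gain": g} for i, g in enumerate([0.3, 1.2, 0.1, 0.9])]
    print([leaf["id"] for leaf in grow_top_k(leaves, lambda leaf: leaf["gain"], k=2)])
    # -> [1, 3]: the two most promising nodes can be grown in parallel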
ISBN (print): 9781450361842
Parallel radix sorting has been extensively studied in the literature for decades. However, the most efficient implementations require auxiliary memory proportional to the input size, which can be prohibitive for large inputs. The standard serial in-place radix sorting algorithm is based on swapping each key to its correct place in the array on each level of recursion. However, it is not straightforward to parallelize this algorithm due to dependencies among the swaps. This paper presents Regions Sort, a new parallel in-place radix sorting algorithm that is efficient both in theory and in practice. Our algorithm uses a graph structure to model dependencies across elements that need to be swapped, and generates independent tasks from this graph that can be executed in parallel. For sorting n integers from a range r, and a parameter K, Regions Sort requires only O(K log r log n) auxiliary memory. Our algorithm requires O(n log r) work and O((n/K + log K) log r) span, making it work-efficient and highly parallel. In addition, we present several optimizations that significantly improve the empirical performance of our algorithm. We compare the performance of Regions Sort to existing parallel in-place and out-of-place sorting algorithms on a variety of input distributions and show that Regions Sort is faster than optimized out-of-place radix sorting and comparison sorting algorithms, and is almost always faster than the fastest publicly available in-place sorting algorithm.
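The serial in-place pass that the paper parallelizes can be stated compactly. The sketch below is the standard "swap each key toward its bucket" routine (American flag sort style) for one digit level; it is the baseline the abstract refers to, not Regions Sort itself.

    def inplace_radix_pass(a, shift, bits=8):
        # Permute a in place so keys are grouped by digit (x >> shift) & mask;
        # recursing on each bucket with a smaller shift yields MSD radix sort.
        buckets = 1 << bits
        mask = buckets - 1
        counts = [0] * buckets
        for x in a:
            counts[(x >> shift) & mask] += 1
        heads, total = [0] * buckets, 0
        for d in range(buckets):
            heads[d], total = total, total + counts[d]
        tails = [heads[d] + counts[d] for d in range(buckets)]
        for d in range(buckets):
            while heads[d] < tails[d]:
                digit = (a[heads[d]] >> shift) & mask
                if digit == d:
                    heads[d] += 1              # already in its region
                else:                          # swap toward its region
                    j = heads[digit]
                    heads[digit] += 1
                    a[heads[d]], a[j] = a[j], a[heads[d]]

    data = [170, 45, 75, 90, 802, 24, 2, 66]
    inplace_radix_pass(data, shift=0)
    print(data)   # grouped by low byte: [2, 24, 802, 45, 66, 75, 90, 170]

Each swap depends on the current head of the destination bucket, which is exactly the dependency chain that Regions Sort models as a graph to extract parallel tasks.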
ISBN (print): 9781611975482
Balkanski and Singer [4] recently initiated the study of adaptivity (or parallelism) for constrained submodular function maximization, and studied the setting of a cardinality constraint. Subsequent improvements for this problem by Balkanski, Rubinstein, and Singer [6] and Ene and Nguyen [21] resulted in a near-optimal (1 - 1/e - epsilon)-approximation in O(log n / epsilon^2) rounds of adaptivity. Partly motivated by the goal of extending these results to more general constraints, we describe parallel algorithms for approximately maximizing the multilinear relaxation of a monotone submodular function subject to packing constraints. Formally, our problem is to maximize F(x) over x in [0,1]^n subject to Ax <= 1, where F is the multilinear relaxation of a monotone submodular function. Our algorithm achieves a near-optimal (1 - 1/e - epsilon)-approximation in O(log^2 m log n / epsilon^4) rounds, where n is the cardinality of the ground set and m is the number of packing constraints. For many constraints of interest, the resulting fractional solution can be rounded via known randomized rounding schemes that are oblivious to the specific submodular function. We thus derive randomized algorithms with poly-logarithmic adaptivity for a number of constraints including partition and laminar matroids, matchings, knapsack constraints, and their intersections. Our algorithm takes a continuous viewpoint and combines several ideas, ranging from the continuous greedy algorithm of [38, 13] and its adaptation to the MWU framework for packing constraints [20] to parallel algorithms for packing LPs [31, 41]. For the basic setting of cardinality constraints, this viewpoint gives rise to an alternative, simple-to-understand algorithm that matches recent results [6, 21]. Our algorithm to solve the multilinear relaxation is deterministic if it is given access to a value oracle for the multilinear extension and its gradient; this is possible in some interesting cases, such as the coverage function of an explicitly given set system.
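For reference, the multilinear relaxation F used above is the standard multilinear extension of a set function f on ground set N, written in LaTeX as:

    F(x) \;=\; \sum_{S \subseteq N} f(S) \prod_{i \in S} x_i \prod_{i \notin S} (1 - x_i)
         \;=\; \mathbb{E}\!\left[ f(R(x)) \right],

where R(x) is a random set that contains each element i independently with probability x_i. Evaluating F exactly requires summing over all subsets in general, which is why value-oracle or sampling access is the usual assumption.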
ISBN (print): 9781728143927
Optimizing the search for the best possible action depending on various states, such as the state of the environment and the system goal, has been a major area of study in computer systems. In any search algorithm, searching for the best possible solution among every known possibility can lead to the construction of the whole state search space, as in the well-known minimax algorithm. This may lead to impractical time complexities that are not suitable for real-time search operations. One practical way to reduce the computational time is alpha-beta pruning: instead of searching the whole state space, we prune the unnecessary branches, which reduces the time by a significant amount. This paper focuses on various possible implementations of the alpha-beta pruning algorithm and gives insight into which algorithm can be used for parallelism. Various studies have been conducted on how to make alpha-beta pruning faster. Parallelizing alpha-beta pruning for GPU-specific architectures such as mesh (CUDA) or a shared memory model (OpenMP) helps reduce the computational time. This paper studies the comparison between sequential and different parallel forms of alpha-beta pruning and their respective efficiency, using the game of chess as an application.
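As a baseline for the sequential form discussed above, here is a minimal alpha-beta sketch in Python; `children` and `evaluate` are hypothetical stand-ins for a real game interface such as chess move generation.

    def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
        kids = children(node)
        if depth == 0 or not kids:
            return evaluate(node)          # horizon or leaf: static evaluation
        if maximizing:
            best = float("-inf")
            for child in kids:
                best = max(best, alphabeta(child, depth - 1, alpha, beta,
                                           False, children, evaluate))
                alpha = max(alpha, best)
                if alpha >= beta:          # beta cutoff: prune remaining moves
                    break
            return best
        best = float("inf")
        for child in kids:
            best = min(best, alphabeta(child, depth - 1, alpha, beta,
                                       True, children, evaluate))
            beta = min(beta, best)
            if alpha >= beta:              # alpha cutoff
                break
        return best

The difficulty for parallel variants is visible in the code: the alpha and beta bounds produced while searching one subtree are what prune the siblings, so naively searching siblings concurrently loses exactly those cutoffs.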
ISBN (print): 9781728137988
Differential Evolution (DE) is a well-known meta-heuristic designed to solve continuous optimization problems. Its simple structure and straightforward search operators make it suitable for solving a wide range of real-world problems. Despite its success, DE's performance may be limited when tackling high-dimensional complex problems. Its algorithmic structure can therefore be reconsidered by adaptively controlling its parameters and incorporating more resilient search operators. In this study, a Q-learning-based strategy is proposed to adapt DE parameters during the search process. Moreover, an eigenvector-based crossover is introduced in order to accelerate the convergence rate on ill-conditioned landscapes. To avoid premature convergence, a simple yet efficient switching technique is proposed to choose between the normal and the eigenvector-based crossover. Due to the high computational cost that can occur when applying the eigenvector-based crossover, a parallel counterpart of the algorithm has been implemented using graphics processing units (GPUs). The proposed algorithm has been applied to find the optimal mechanical structure of a recent electric motor. Its performance has also been validated on the CEC 2011 test suite, which contains a set of real-world problems. The experimental results reveal the competitive performance of our algorithm compared to recent adaptive DE versions. In addition, the parallel version achieved a substantial speedup over the sequential version while producing the same results.
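The abstract does not give the paper's state/action design, so the following is only a generic tabular Q-learning sketch of how DE control parameters could be adapted online; the (F, CR) action set and the reward signal are illustrative assumptions.

    import random

    ACTIONS = [(0.5, 0.9), (0.8, 0.5), (1.0, 0.1)]   # candidate (F, CR) pairs

    def choose_action(Q, state, eps=0.1):
        # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
        if random.random() < eps:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

    def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
        # One-step Q-learning update; `reward` could be, e.g., the relative
        # fitness improvement of the generation run with `action`.
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)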
ISBN (print): 9781450361842
We study parallel algorithms for the classical balls-into-bins problem, in which m balls acting in parallel as separate agents are placed into n bins. Algorithms operate in synchronous rounds, in each of which balls and bins exchange messages once. The goal is to minimize the maximal load over all bins using a small number of rounds and few messages. While the case of m = n balls has been extensively studied, little is known about the heavily loaded case. In this work, we consider parallel algorithms for this somewhat neglected regime of m >> n. The naive solution of allocating each ball to a bin chosen uniformly and independently at random results in maximal load m/n + Theta(sqrt((m/n) log n)) (for m > n log n) with high probability (w.h.p.). In contrast, for the sequential setting Berenbrink et al. [5] showed that letting each ball join the least loaded of two randomly selected bins reduces the maximal load to m/n + O(log log m) w.h.p. To date, no parallel variant of such a result is known. We present a simple parallel threshold algorithm that obtains a maximal load of m/n + O(1) w.h.p. within O(log log(m/n) + log* n) rounds. The algorithm is symmetric (balls and bins all "look the same"), and balls send O(1) messages in expectation. The additive term of O(log* n) in the complexity is known to be tight for such algorithms [10]. We also prove that our analysis is tight, i.e., algorithms of the type we provide must run for Omega(min{log log(m/n), log* n}) rounds w.h.p. Finally, we give a simple asymmetric algorithm (i.e., balls are aware of a common labeling of the bins) that achieves a maximal load of m/n + O(1) in a constant number of rounds w.h.p. Again, balls send only a single message per round, and bins receive (1 + o(1)) m/n + O(log n) messages w.h.p. This goes to show that, similar to the case of m = n, asymmetry allows for highly efficient solutions.
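To make the symmetric threshold idea tangible, here is a toy simulation: each round, every unplaced ball contacts one uniformly random bin, and a bin admits balls only while its load is below the current threshold. The doubling-free threshold schedule below is a simple stand-in, not the schedule analyzed in the paper.

    import random

    def threshold_allocate(m, n, rounds):
        # Returns (maximal load, balls still unplaced after `rounds` rounds).
        loads, unplaced = [0] * n, m
        threshold = max(1, m // n)
        for _ in range(rounds):
            if unplaced == 0:
                break
            arrivals = [0] * n
            for _ in range(unplaced):                  # one message per ball
                arrivals[random.randrange(n)] += 1
            for b in range(n):
                accepted = min(arrivals[b], threshold - loads[b])
                loads[b] += accepted
                unplaced -= accepted
            threshold += 1                             # relax the threshold
        return max(loads), unplaced

    print(threshold_allocate(m=100_000, n=100, rounds=8))

Because rejected balls retry in later rounds against a slightly relaxed threshold, the load stays within a small additive term of m/n, mirroring the m/n + O(1) guarantee of the paper's algorithm.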