A fundamental quest in the theory of computing is to understand the power of randomness. It is not known whether every problem with an efficient randomized algorithm also has one that does not use randomness. One of the extensively studied problems under this theme is that of perfect matching. The perfect matching problem has a randomized parallel (NC) algorithm based on the Isolation Lemma of Mulmuley, Vazirani, and Vazirani. It is a long-standing open question whether this algorithm can be derandomized. In this article, we give an almost complete derandomization of the Isolation Lemma for perfect matchings in bipartite graphs. This gives us a deterministic parallel (quasi-NC) algorithm for the bipartite perfect matching problem. Derandomization of the Isolation Lemma means that we deterministically construct a weight assignment so that the minimum weight perfect matching is unique. We present three different ways of doing this construction with a common main idea.
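To make the notion of an isolating weight assignment concrete, here is a minimal brute-force sketch, suitable only for tiny graphs. It illustrates the randomized Isolation Lemma, not the deterministic construction described in the abstract. The function names, the left-vertex/right-vertex edge encoding, and the weight range {1, ..., 2m} are assumptions chosen for illustration.

```python
import itertools
import random


def perfect_matchings(n, edges):
    """Yield every perfect matching of a bipartite graph with n left and
    n right vertices; edges are (left, right) pairs. Brute force."""
    edge_set = set(edges)
    for perm in itertools.permutations(range(n)):
        matching = tuple((u, perm[u]) for u in range(n))
        if all(e in edge_set for e in matching):
            yield matching


def min_weight_matchings(n, edges, weight):
    """Return all perfect matchings that attain the minimum total weight."""
    best, winners = None, []
    for matching in perfect_matchings(n, edges):
        total = sum(weight[e] for e in matching)
        if best is None or total < best:
            best, winners = total, [matching]
        elif total == best:
            winners.append(matching)
    return winners


def random_weights(edges, rng):
    """Randomized assignment: each edge gets an independent uniform weight
    from {1, ..., 2m}, where m is the number of edges (assumed range)."""
    m = len(edges)
    return {e: rng.randint(1, 2 * m) for e in edges}


if __name__ == "__main__":
    n = 3
    edges = [(u, v) for u in range(n) for v in range(n)]  # K_{3,3}
    weights = random_weights(edges, random.Random(1))
    isolated = len(min_weight_matchings(n, edges, weights)) == 1
    print("minimum weight perfect matching is unique:", isolated)
```

A weight assignment is isolating exactly when `min_weight_matchings` returns a single matching. Assigning edge i the weight 2^i always isolates, but those weights are exponentially large; the difficulty the abstract addresses is achieving isolation deterministically with much smaller weights.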
Several synchronous applications are based on graph-structured data; among them, a very important one is community detection. Since the number and size of the networks modeled by graphs keep growing, some level of parallelism is needed to reduce the computational cost of such massive applications. Social networking sites allow users to manually categorize their friends into social circles (referred to as lists on Facebook and Twitter), while users, based on their interests, place themselves into groups of interest. However, community detection is a very effortful procedure, and in addition, these communities need to be updated very often, resulting in more effort. In this paper, we combine parallel processing techniques with a typical data structure, the threaded binary tree, to detect communities efficiently. Our strategy is implemented over weighted networks with irregular topologies and is based on a stepwise path detection strategy, where each step finds a link that increases the overall strength of the path being detected. To verify the functionality and parallelism benefits of our scheme, we perform experiments on five real-world data sets: Facebook (R), Twitter (R), Google+ (R), Pokec, and LiveJournal.
To meet the latency requirements of the ultra-reliable low latency communication (URLLC) mode of the Third Generation Partnership Project's Long Term Evolution (LTE) mobile communication standard, this paper proposes a novel turbo decoding algorithm that supports an arbitrarily high degree of parallel processing, facilitating significantly higher processing throughputs and substantially lower processing latencies than the state-of-the-art (SOTA) LTE turbo decoder. As in conventional turbo decoding algorithms, the proposed Arbitrarily Parallel Turbo Decoder (APTD) decomposes each frame of information bits into a sequence of windows, where the bits within different windows are processed simultaneously using forward and backward recursions in a serial manner. However, in contrast to conventional turbo decoding algorithms, the APTD does not require different windows to be composed of an identical number of bits, which allows the use of an arbitrary number of windows and hence an arbitrary degree of parallelism, when decoding information bits of an arbitrary frame length. Furthermore, conventional turbo decoding algorithms alternate between simultaneously processing the windows in the upper decoder and those in the lower decoder. By contrast, the APTD processes the odd-indexed windows in the upper decoder at the same time as the even-indexed windows in the lower decoder and alternates between this and the reversed arrangement, hence further improving the decoding throughput and latency. In addition, the APTD achieves a reduced hardware resource requirement by calculating the extrinsic information based only on the outputs of the forward recursions, rather than based on both the forward and backward recursions of conventional turbo decoding algorithms. We demonstrate that the proposed APTD achieves better latency, throughput, and computational efficiency than the SOTA LTE turbo decoder at all frame lengths, but particularly at the short frame lengths that are t
PIPS-SBB is a distributed-memory parallel solver with a scalable data distribution paradigm. It is designed to solve mixed integer programs (MIPs) with a dual-block angular structure, which is characteristic of deterministic-equivalent stochastic mixed-integer programs. In this paper, we present two different parallelizations of Branch & Bound (B&B), implementing both as extensions of PIPS-SBB, thus adding a further layer of parallelism. In the first of the proposed frameworks, PIPS-PSBB, the coordination and load-balancing of the different optimization workers is done in a decentralized fashion. This new framework is designed to ensure all available cores are processing the most promising parts of the B&B tree. The second, ug[PIPS-SBB,MPI], is a parallel implementation using the Ubiquity Generator, a universal framework for parallelizing B&B tree search that has been successfully applied to other MIP solvers. We show the effects of leveraging multiple levels of parallelism in potentially improving scaling performance beyond thousands of cores.
Particle filter techniques are common methods used to estimate the evolving state of nonlinear, non-Gaussian time-variant systems by utilizing a periodic sequence of noisy measurements. The accuracy of particle filter methods has often been shown to be superior to other state estimation techniques, such as the extended Kalman filter (EKF), for many applications. Unfortunately, the high computational cost and highly nondeterministic runtime behavior of particle filters often preclude their use in hard real-time environments, where filter response must meet the strict timing requirements of the application. Particle filter algorithms are composed of three main stages: prediction, update, and resampling. General purpose graphics processing units (GPGPUs) have been successfully employed in previous research to accelerate the computation of both the prediction and update stages by exploiting their natural fine-grain parallelism. This research focuses on accelerating the resampling stage for GPGPU execution, which has been much more difficult to parallelize due to its apparently inherent sequentiality. This paper introduces a novel GPGPU implementation of the systematic and stratified resampling algorithms that exploits the monotonically increasing nature of the prefix-sum and the evolutionary nature of the particle weighting process to allow the re-indexing portion of the algorithms to occur in a two-phase, multi-threaded manner. The resulting measured factor of performance improvement for the systematic and stratified algorithms was 15x and 32x, respectively, over the serial implementations.
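For reference, the serial systematic resampling step that the abstract sets out to parallelize can be sketched as below. This is a plain CPU illustration under stated assumptions, not the paper's two-phase GPGPU kernel: the function name, the explicit `u0` offset parameter, and the use of a binary search over the prefix-sum are choices made here for clarity.

```python
from bisect import bisect_right
from itertools import accumulate


def systematic_resample(weights, u0):
    """Return the particle index chosen for each of the n output slots.

    weights: non-negative particle weights (need not be normalized).
    u0: a single offset in [0, 1), normally drawn uniformly at random;
        passed in explicitly so the sketch is deterministic.
    """
    n = len(weights)
    total = sum(weights)
    # Prefix-sum of normalized weights; it is monotonically non-decreasing.
    cdf = list(accumulate(w / total for w in weights))
    indices = []
    for i in range(n):
        position = (u0 + i) / n  # evenly spaced sampling positions
        # First particle whose cumulative weight exceeds the position.
        j = bisect_right(cdf, position)
        indices.append(min(j, n - 1))  # guard against float round-off
    return indices


if __name__ == "__main__":
    print(systematic_resample([1, 2, 3, 4], u0=0.5))
```

Because the prefix-sum is monotone, each output slot can locate its source particle independently of the others, which is the property that makes a multi-threaded formulation plausible. Stratified resampling differs only in drawing a fresh offset for every slot instead of sharing one `u0`.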
In this paper we describe a general framework for parallel optimization based on the island model of evolutionary algorithms. The framework runs a number of optimization methods in parallel with periodic communication. In this way, it essentially creates a parallel ensemble of optimization methods. At the same time, the system contains a planner that decides which of the available optimization methods should be used to solve the given optimization problem and changes the distribution of such methods during the run of the optimization. Thus, the system effectively solves the problem of online parallel portfolio selection. The proposed system is evaluated on a number of common benchmarks with various problem encodings as well as on two real-life problems: optimization in recommender systems and the training of neural networks for the control of electric vehicle charging.
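The basic island model with periodic communication can be sketched as follows. This is a toy illustration only: it does not model the planner or the portfolio selection described in the abstract, the islands run one after another rather than in parallel, and the OneMax objective, single-bit-flip hill climber, ring migration, and all names and parameters are assumptions made for the example.

```python
import random


def onemax(bits):
    """Toy objective: number of ones in the bit string."""
    return sum(bits)


def mutate(bits, rng):
    """Return a copy of bits with one randomly chosen bit flipped."""
    child = list(bits)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def island_model(num_islands=4, length=20, epochs=5, steps_per_epoch=10, seed=0):
    """Run one hill climber per island with ring migration after each epoch.

    Returns the final island solutions and the best fitness recorded
    before the first epoch and after every epoch.
    """
    rng = random.Random(seed)
    islands = [[rng.randint(0, 1) for _ in range(length)]
               for _ in range(num_islands)]
    history = [max(onemax(b) for b in islands)]
    for _ in range(epochs):
        # Local search phase: each island improves its own solution.
        for k in range(num_islands):
            for _ in range(steps_per_epoch):
                child = mutate(islands[k], rng)
                if onemax(child) >= onemax(islands[k]):
                    islands[k] = child
        # Communication phase: island k adopts its ring neighbour's
        # solution when that solution is strictly better.
        snapshot = [list(b) for b in islands]
        for k in range(num_islands):
            neighbour = snapshot[k - 1]
            if onemax(neighbour) > onemax(islands[k]):
                islands[k] = list(neighbour)
        history.append(max(onemax(b) for b in islands))
    return islands, history


if __name__ == "__main__":
    _, history = island_model()
    print("best fitness per epoch:", history)
```

In a heterogeneous setting such as the one the abstract describes, each island would run a different optimization method; here every island runs the same hill climber to keep the sketch short.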
The main contribution of this paper is to show optimal parallel algorithms to compute the sum, the prefix-sums, and the summed area table on two memory machine models, the Discrete Memory Machine (DMM) and the Unified...
One of the important problems in parallel computing is the mapping of the parallel algorithm to the parallel computing platform. Hereby, for each parallel node the corresponding code for the parallel nodes must be imp...
ISBN: (print) 9781450357999
In this paper, we design parallel write-efficient geometric algorithms that perform asymptotically fewer writes than standard algorithms for the same problem. This is motivated by emerging non-volatile memory technologies with read performance being close to that of random access memory but writes being significantly more expensive in terms of energy and latency. We design algorithms for planar Delaunay triangulation, k-d trees, and static and dynamic augmented trees. Our algorithms are designed in the recently introduced Asymmetric Nested-parallel Model, which captures the parallel setting in which there is a small symmetric memory where reads and writes are unit cost as well as a large asymmetric memory where writes are omega times more expensive than reads. In designing these algorithms, we introduce several techniques for obtaining write-efficiency, including DAG tracing, prefix doubling, and alpha-labeling, which we believe will be useful for designing other parallel write-efficient algorithms.
In this paper we present an approach for real-time simulation and Hardware-in-the-Loop (HIL) testing of Modular Multilevel Converters (MMCs) that relies on switching models while supporting system level analysis. Using the Latency Based Linear Multistep Compound (LB-LMC) approach, we achieve a 50 ns simulation time step for systems composed of several MMCs and for converters of various complexity. To facilitate system level testing, we introduce the use of a serial communication-based (Aurora) interface for HIL testing of MMCs and analyze the effect that communication latency has on the accuracy of the HIL test. The simulation and HIL results are validated against an MMC laboratory prototype.