We study a generalization of the famous κ-center problem where each object is an affine subspace of dimension Δ, and give either the first or significantly improved algorithms and hardness results for many combinati...
详细信息
ISBN:
(纸本)9781627484855
We study a generalization of the famous κ-center problem where each object is an affine subspace of dimension Δ, and give either the first or significantly improved algorithms and hardness results for many combinations of parameters. This generalization from points (Δ = 0) is motivated by the analysis of incomplete data, a pervasive challenge in statistics: incomplete data objects in R~d can be modeled as affine subspaces. We give three algorithmic results for different values of k, under the assumption that all subspaces are axis-parallel, the main case of interest because of the correspondence to missing entries in data tables. 1) k = 1: Two polynomial time approximation schemes which runs in poly(A, 1/?)nd. 2) k = 2: O(Δ~(1/4))-approximation algorithm which runs in poly(n, d, Δ) 3) General k: Polynomial time approximation scheme which runs in 2~(O (Δk log k (1+1/?~2)))nd We also prove nearly matching hardness results; in both the general (not necessarily axis-parallel) case (for k ≥ 2) and in the axis-parallel case (for k ≥ 3), the running time of an approximation algorithm with any approximation ratio cannot be polynomial in even one of k and Δ, unless P = NP. Furthermore, assuming that the 3-SAT problem cannot be solved subexponentially, the dependence on both k and Δ must be exponential in the general case (in the axis-parallel case, only the dependence on k drops to 2~(Ω(k))). The simplicity of the first and the third algorithm suggests that they might be actually used in statistical applications. The second algorithm, which demonstrates a theoretical gap between the axis-parallel and general case for k = 2, displays a strong connection between geometric clustering and classical coloring problems on graphs and hypergraphs, via a new Helly-type theorem.
Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]...
详细信息
Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a V
We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups over the only published optimal (linear-time) serial implementation of a ...
详细信息
ISBN:
(纸本)9781450312134
We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups over the only published optimal (linear-time) serial implementation of a triconnected components algorithm running on a modern CPU. This is accomplished on the PRAM-inspired XMT many-core architecture. To our knowledge, no other parallel implementation of a triconnected components algorithm has been published for any platform. Copyright is held by the author/owner(s).
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache ...
详细信息
ISBN:
(纸本)9781450312134
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries in the optimal O(sortP (N) + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and sortP (N) is the parallel I/O complexity of sorting N elements using P processors. Copyright 2012 acm.
This paper presents the design, analysis, and implementation of parallel and sequential I/O-efficient algorithms for set cover, tying together the line of work on parallel set cover and the line of work on efficient s...
详细信息
ISBN:
(纸本)9781450312134
This paper presents the design, analysis, and implementation of parallel and sequential I/O-efficient algorithms for set cover, tying together the line of work on parallel set cover and the line of work on efficient set cover algorithms for large, disk-resident instances. Our contributions are twofold: First, we design and analyze a parallel cache-oblivious set-cover algorithm that offers essentially the same approximation guarantees as the standard greedy algorithm, which has the optimal approximation. Our algorithm is the first efficient external-memory or cache-oblivious algorithm for when neither the sets nor the elements fit in memory, leading to I/O cost (cache complexity) equivalent to sorting in the Cache Oblivious or parallel Cache Oblivious models. The algorithm also implies low cache misses on parallel hierarchical memories (again, equivalent to sorting). Second, building on this theory, we engineer variants of the theoretical algorithm optimized for different hardware setups. We provide experimental evaluation showing substantial speedups over existing algorithms without compromising the solution's quality. Copyright 2012 acm.
This paper presents parallelalgorithms for embedding an arbitrary n-point metric space into a distribution of dominating trees with O(log n) expected stretch. Such embedding has proved useful in the design of many ap...
详细信息
ISBN:
(纸本)9781450312134
This paper presents parallelalgorithms for embedding an arbitrary n-point metric space into a distribution of dominating trees with O(log n) expected stretch. Such embedding has proved useful in the design of many approximation algorithms in the sequential setting. We give a parallel algorithm that runs in O(n2 log n) work and O(log2 n) depth - these bounds are independent of Δ = maxx,y d(x,y)/minx≠y d(x;y), the ratio of the largest to smallest distance. Moreover, when Δ is exponentially bounded (Δ ≤/2O(n)), our algorithm can be improved to O(n2) work and O(log2 n) depth. Using these results, we give an RNC O(log κ)-approximation algorithm for κ-median and an RNC O(log n)-approximation for buy-at-bulk network design. The κ-median algorithm is the first RNC algorithm with non-trivial guarantees for arbitrary values of κ, and the buy-at-bulk result is the first parallel algorithm for the problem. Copyright 2012 acm.
In this paper we study a scheduling problem with moldable and non-moldable parallel tasks on m processors. A non-moldable parallel task is one that runs in parallel on a specific given number of processors. The goal i...
详细信息
ISBN:
(纸本)9781450312134
In this paper we study a scheduling problem with moldable and non-moldable parallel tasks on m processors. A non-moldable parallel task is one that runs in parallel on a specific given number of processors. The goal is to find a non-preemptive schedule on the m processors which minimizes the makespan, or the latest task completion time. The previous best result is the list scheduling algorithm with an absolute approximation ratio of 2. On the other hand, there does not exist an approximation algorithm for scheduling non-moldable parallel tasks with ratio smaller than 1.5, unless P = NP. In this paper we show that a schedule with length (1.5+Ε)OPT can be computed for the scheduling problem in time O(n log n) + f(1/Ε). Furthermore we present an (1.5+Ε) approximation algorithm for scheduling moldable parallel tasks. Copyright 2012 acm.
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallelalgorithms for matrix multiplication with perfect st...
详细信息
ISBN:
(纸本)9781450312134
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallelalgorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallelalgorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. Copyright is held by the author/owner(s).
Lock-free data structures provide a progress guarantee and are known for facilitating scalability, avoiding deadlocks and livelocks, and providing guaranteed system responsiveness. In this paper we present a design fo...
详细信息
ISBN:
(纸本)9781450312134
Lock-free data structures provide a progress guarantee and are known for facilitating scalability, avoiding deadlocks and livelocks, and providing guaranteed system responsiveness. In this paper we present a design for a lock-free balanced tree, specifically, a B+tree. The B +tree data structure has an important practical applications, and is used in various storage-system products. As far as we know this is the first design of a lock-free, dynamic, and balanced tree, that employs standard compare-and-swap operations. Copyright 2012 acm.
暂无评论