ISBN (print): 9798400703836
We introduce a new notion of information complexity for multi-pass streaming problems and use it to resolve several important questions in data streams: (1) In the coin problem, one sees a stream of n i.i.d. uniformly random bits and would like to compute the majority with constant advantage. We show that any constant-pass algorithm must use Ω(log n) bits of memory, significantly extending an earlier Ω(log n)-bit lower bound for single-pass algorithms of Braverman-Garg-Woodruff (FOCS, 2020). This also gives the first Ω(log n)-bit lower bound for the problem of approximating a counter up to a constant factor in worst-case turnstile streams for more than one pass. (2) In the needle problem, one either sees a stream of n i.i.d. uniform samples from a domain [t], or there is a randomly chosen "needle" α ∈ [t] for which each item independently equals α with probability p and is otherwise uniformly random in [t]. The problem of distinguishing these two cases is central to understanding the space complexity of the frequency moment estimation problem in random-order streams. We show tight multi-pass space bounds for this problem for every p < 1/(√n log³ n), resolving an open question of Lovett and Zhang (FOCS, 2023); even for one pass our bounds are new. To show optimality, we improve both the lower and upper bounds of existing results. Our information complexity framework significantly extends the toolkit for proving multi-pass streaming lower bounds, and we give a number of additional streaming applications of our lower bound techniques, including multi-pass lower bounds for ℓ_p-norm estimation, ℓ_p point query and heavy hitters, and compressed sensing problems.
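To make the counter-approximation task concrete, here is a minimal sketch of the classical Morris counter, which approximates an insertion-only count up to a constant factor in roughly O(log log n) bits. This is an illustration of the problem the lower bound above concerns, not the paper's construction; the lower bound applies to the harder turnstile (insertion/deletion) setting.

import random

class MorrisCounter:
    def __init__(self):
        self.x = 0  # stores roughly log2 of the true count

    def increment(self):
        # increase the exponent with probability 2^(-x)
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        # 2^x - 1 is an unbiased estimator of the number of increments
        return 2.0 ** self.x - 1

c = MorrisCounter()
for _ in range(100000):
    c.increment()
print(c.estimate())  # a single counter is accurate up to constant factors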
In this paper, we explore a novel online learning setting, where the online learners are presented with "doubly-streaming" data. Namely, the data instances constantly streaming in are described by feature spaces that over-time evolve, with new features emerging and old features fading away. The main challenge of this problem lies in the fact that the newly emerging features are described by very few samples, resulting in weak learners that tend to make error predictions. A seemingly plausible idea to overcome the challenge is to establish a relationship between the old and new feature spaces, so that an online learner can leverage the knowledge learned from the old features to better the learning performance on the new features. Unfortunately, this idea does not scale up to high-dimensional feature spaces that entail very complex feature interplay. Specifically. a tradeoff between onlineness, which biases shallow learners, and expressiveness, which requires deep models, is inevitable. Motivated by this, we propose a novel paradigm, named Online Learning Deep models from Data of Double Streams ((OLDS)-S-3), where a shared latent subspace is discovered to summarize information from the old and new feature spaces, building an intermediate feature mapping relationship. A key trait of (OLDS)-S-3 is to treat the model capacityas a learnable semantics, aiming to yield optimal model depth and parameters jointly in accordance with the complexity and non-linearity of the input data streams in an online fashion. To ablate its efficacy and applicability, two variants of (OLDS)-S-3 are proposed namely, OLD-Linear that learns the relationship by a linear function;and OLD-FD learns that two consecutive feature spaces pre-and-post evolution with fixed deep depth. Besides, instead of re-starting the entire learning process from scratch, (OLDS)-S-3 learns multiple newly emerging feature spaces in a lifelong manner, retaining the knowledge from the learned and vanished feature space t
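The core idea of the OLD-Linear variant can be sketched as follows: during an overlap period in which both feature spaces are observed, learn a linear map from the new features to the old ones by online gradient descent, so that a learner trained on the old space can keep scoring instances after the old features vanish. This is a hypothetical minimal version under our own assumptions about shapes, data, and update rule, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)
d_old, d_new, lr = 5, 8, 0.01
M = np.zeros((d_old, d_new))  # linear reconstructive mapping

# overlap phase: instances carry both representations (x_old, x_new)
for _ in range(2000):
    x_new = rng.normal(size=d_new)
    x_old = rng.normal(size=d_old)  # stand-in for the true old features
    err = M @ x_new - x_old
    M -= lr * np.outer(err, x_new)  # one SGD step on the squared loss

# after the old space vanishes: reconstruct old features from new ones
x_new = rng.normal(size=d_new)
x_old_hat = M @ x_new  # fed to the learner trained on the old space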
Finding the connected components of a graph is a fundamental problem with uses throughout computer science and engineering. The task of computing connected components becomes more difficult when graphs are very large, or when they are dynamic, meaning the edge set changes over time subject to a stream of edge insertions and deletions. A natural approach to computing connected components on a large, dynamic graph stream is to buy enough RAM to store the entire graph. However, the requirement that the graph fit in RAM is an inherent limitation of this approach and is prohibitive for very large graphs. Thus, there is an unmet need for systems that can process dense dynamic graphs, especially when those graphs are larger than available RAM. We present a new high-performance streaming graph-processing system for computing the connected components of a graph. This system, which we call GraphZeppelin, uses new linear sketching data structures (CubeSketch) to solve the streaming connected components problem, and as a result requires space asymptotically smaller than the space required for a lossless representation of the graph. GraphZeppelin is optimized for massive dense graphs: it can process millions of edge updates (both insertions and deletions) per second, even when the underlying graph is far too large to fit in available RAM. As a result, GraphZeppelin vastly increases the scale of graphs that can be processed.
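The primitive behind sketch-based connected components (in the style of Ahn-Guha-McGregor) is a tiny recovery sketch per vertex, updated by XOR so that insertions and deletions are handled identically, and mergeable so that internal edges of a component cancel and only cut edges survive, driving Boruvka-style merges. The sketch below is a simplified 1-sparse illustration, not GraphZeppelin's CubeSketch; real ℓ0-sketches add a checksum and O(log n) subsampled levels to handle cuts of any size.

class OneSparseSketch:
    def __init__(self):
        self.parity = 0  # number of sketched edges, mod 2
        self.xor = 0     # XOR of encoded edge ids

    def update(self, eid):
        # the same XOR update serves insertion and deletion
        self.parity ^= 1
        self.xor ^= eid

    def merge(self, other):
        # union of two vertex sets: internal edges occur twice and cancel
        self.parity ^= other.parity
        self.xor ^= other.xor

    def recover(self):
        # valid when exactly one cut edge remains
        return self.xor if self.parity == 1 else None

def edge_id(u, v, n):
    u, v = min(u, v), max(u, v)
    return u * n + v  # injective encoding of an undirected edge

n = 4
s = [OneSparseSketch() for _ in range(n)]
for (u, v) in [(0, 1), (1, 2), (2, 3)]:
    for w in (u, v):
        s[w].update(edge_id(u, v, n))
for w in (2, 3):                      # delete edge (2, 3) from the stream
    s[w].update(edge_id(2, 3, n))
comp = OneSparseSketch()
comp.merge(s[0]); comp.merge(s[1])    # component {0, 1}
print(divmod(comp.recover(), n))      # -> (1, 2), the unique cut edge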
A constraint satisfaction problem (CSP), Max-CSP(F), is specified by a finite set of constraints F ⊆ {f : [q]^k → {0, 1}} for positive integers q and k. An instance of the problem on n variables is given by m applications of constraints from F to subsequences of the n variables, and the goal is to find an assignment to the variables that satisfies the maximum number of constraints. In the (γ, β)-approximation version of the problem, for parameters 0 ≤ β < γ ≤ 1, the goal is to distinguish instances where at least a γ fraction of the constraints can be satisfied from instances where at most a β fraction of the constraints can be satisfied. In this work, we consider the approximability of this problem in the context of sketching algorithms and give a dichotomy result. Specifically, for every family F and every β < γ, we show that either a linear sketching algorithm solves the problem in polylogarithmic space, or the problem is not solvable by any sketching algorithm in o(√n) space. In particular, we give non-trivial approximation algorithms using polylogarithmic space for infinitely many constraint satisfaction problems. We also extend previously known lower bounds for general streaming algorithms to a wide variety of problems, in particular the case q = k = 2, where we get a dichotomy, and the case when the satisfying assignments of the constraints of F support a distribution on [q]^k with uniform marginals. Prior to this work, other than sporadic examples, the only systematic classes of CSPs that were analyzed considered the setting of Boolean variables q = 2, binary constraints k = 2, and singleton families |F| = 1, and only considered the setting where constraints are placed on literals rather than variables. Our positive results show wide applicability of the bias-based algorithms used previously by [47] and [41], which we extend to include richer norm estimation algorithms, by giving a systematic way to di…
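To illustrate what a bias-based algorithm measures, consider Max-DICUT: each variable's bias is the number of constraints in which it appears positively (as an arc tail) minus the number in which it appears negatively (as an arc head), and the (γ, β)-gap is decided from the total normalized bias alone. The one-pass computation below keeps exact per-variable counters for clarity; the sketching algorithms referenced above instead estimate this quantity in polylogarithmic space with an ℓ1-norm sketch.

from collections import defaultdict

def total_normalized_bias(arc_stream, m):
    # arc_stream: the arcs (u, v) of a Max-DICUT instance, one pass
    bias = defaultdict(int)
    for (u, v) in arc_stream:
        bias[u] += 1   # u appears as the tail (positively)
        bias[v] -= 1   # v appears as the head (negatively)
    return sum(abs(b) for b in bias.values()) / (2 * m)

arcs = [(0, 1), (0, 2), (3, 1), (0, 3)]
print(total_normalized_bias(arcs, len(arcs)))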
Frequent Directions, as a deterministic matrix sketching technique, has been proposed for tackling low-rank approximation problems. This method has a high degree of accuracy and practicality, but incurs substantial computational cost for large-scale data. Several recent works on randomized versions of Frequent Directions greatly improve the computational efficiency, but unfortunately sacrifice some precision. To remedy this issue, this paper aims to find a more accurate projection subspace to further improve the efficiency and effectiveness of the existing Frequent Directions techniques. Specifically, by utilizing the power of Block Krylov Iteration and random projection techniques, this paper presents a fast and accurate Frequent Directions algorithm named r-BKIFD. Rigorous theoretical analysis shows that the proposed r-BKIFD has an error bound comparable to that of the original Frequent Directions, and that the approximation error can be made arbitrarily small when the number of iterations is chosen appropriately. Extensive experimental results on both synthetic and real data further demonstrate the superiority of r-BKIFD over several popular Frequent Directions algorithms in terms of both computational efficiency and accuracy.
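For context, here is a minimal implementation of the original deterministic Frequent Directions sketch that r-BKIFD accelerates; the randomized Block Krylov component of r-BKIFD is not reproduced here.

import numpy as np

def frequent_directions(A, ell):
    n, d = A.shape
    B = np.zeros((ell, d))
    for row in A:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # sketch is full: shrink all directions, freeing rows
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[-1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row  # place the incoming row in a free slot
    return B

A = np.random.default_rng(0).normal(size=(500, 30))
B = frequent_directions(A, ell=10)
# FD guarantee: ||A^T A - B^T B||_2 <= ||A - A_k||_F^2 / (ell - k), k < ell
print(np.linalg.norm(A.T @ A - B.T @ B, 2))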
The number of triangles (hereafter denoted by Δ) is an important metric for analyzing massive graphs. It is also used to compute the clustering coefficient of networks. This paper proposes a new algorithm called PES (Priority Edge Sampling) to estimate the number of triangles in the streaming model, where the memory window must be minimized. PES combines edge sampling and reservoir sampling, and it consistently outperforms the state-of-the-art streaming algorithms. The results are verified extensively on 48 large real-world networks from different domains and with different structures. The performance ratio can be as large as 11. More importantly, the ratio grows almost exponentially with data size. This is especially important in the era of big data: while existing algorithms are tolerable for smaller datasets, our method is indispensable when sampling very large data. In addition to empirical comparisons, we also prove that the estimator is unbiased and derive its variance.
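The following is not the paper's PES, but a sketch of the standard building block it combines: a fixed-size reservoir of edges, with each triangle closed by an arriving edge reweighted by the inverse probability that both of its wedge edges are currently sampled (in the style of TRIEST). The reservoir size M is the memory budget.

import random
from collections import defaultdict

def estimate_triangles(edge_stream, M):
    sample = []                      # reservoir of at most M edges
    adj = defaultdict(set)           # adjacency of the sampled subgraph
    t, estimate = 0, 0.0
    for (u, v) in edge_stream:
        t += 1
        # 1 / Pr[both wedge edges are in the reservoir just before time t]
        if t - 1 <= M:
            w = 1.0
        else:
            w = ((t - 1) / M) * ((t - 2) / (M - 1))
        for c in adj[u] & adj[v]:    # triangles closed by (u, v)
            estimate += w
        # reservoir step: keep the new edge with probability min(1, M/t)
        if t <= M:
            sample.append((u, v))
            adj[u].add(v); adj[v].add(u)
        elif random.random() < M / t:
            i = random.randrange(M)
            x, y = sample[i]          # evict a uniformly random edge
            adj[x].discard(y); adj[y].discard(x)
            sample[i] = (u, v)
            adj[u].add(v); adj[v].add(u)
    return estimate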
ISBN (print): 9798400701276
We study the Set Cover problem in the one-pass edge-arrival streaming model. In this model, the input stream consists of a sequence of tuples (S, u), indicating that element u is contained in set S. This setting captures the streaming Dominating Set problem and is more general and harder to solve than the set-arrival setting of Set Cover, where entire sets arrive with all their elements one-by-one. We prove the following results (n is the size of the universe, m is the number of sets): (1) A work by [Khanna, Konrad, ITCS'22] on streaming Dominating Set implies a one-pass Õ(√n)-approximation algorithm with space Õ(m) for edge-arrival Set Cover in adversarially ordered streams. We show that this space bound is best possible up to poly-log factors, in that every α-approximation algorithm, for α = Ω(√n), requires space Ω̃(mn²/α⁴) in adversarially ordered streams, even if the algorithm is only required to output an α-approximation of the size of an optimal cover. (2) As our main result, we give a one-pass Õ(√n)-approximation algorithm with space Õ(m/√n) for edge-arrival Set Cover in random-order streams. This result, together with the lower bound mentioned above, establishes a strong separation between the adversarial and random order settings. (3) Finally, in adversarially ordered streams, we show that non-trivial algorithms with space o(m) can be achieved at the expense of approximation factors larger than Õ(√n). This is in contrast to the set-arrival setting, where space Õ(n) is enough for a Θ(√n)-approximation, and space Ω(n) is needed for an o(n/log n)-approximation. We give an α-approximation algorithm for one-pass edge-arrival Set Cover with space Õ(mn/α²), for every α = Ω̃(√n).
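For reference, the offline baseline these guarantees are measured against is the classical greedy algorithm below, which achieves an O(ln n)-approximation but needs the whole instance in memory; the streaming algorithms above must instead process (S, u) tuples in one pass under a space budget.

def greedy_set_cover(universe, sets):
    # sets: dict mapping a set id to the set of elements it covers
    uncovered = set(universe)
    cover = []
    while uncovered:
        # pick the set covering the most still-uncovered elements
        best = max(sets, key=lambda s: len(sets[s] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("instance is not coverable")
        cover.append(best)
        uncovered -= sets[best]
    return cover

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}}
print(greedy_set_cover({1, 2, 3, 4, 5, 6}, sets))  # -> ['A', 'C']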
ISBN (print): 9798350318944
We revisit Nisan's classical pseudorandom generator (PRG) for space-bounded computation (STOC 1990) and its applications in streaming algorithms. We describe a new generator, HashPRG, that can be thought of as a symmetric version of Nisan's generator over larger alphabets. Our generator allows a trade-off between seed length and the time needed to compute a given block of the generator's output. HashPRG can be used to obtain derandomizations with much better update time, without sacrificing space, for a large number of data stream algorithms. For example: Andoni's F_p estimation algorithm for constant p > 2 (ICASSP, 2017) assumes a random oracle but achieves optimal space and constant update time; using HashPRG's time-space trade-off we eliminate the random oracle assumption while preserving the other properties. Previously no time-optimal derandomization was known. Using similar techniques, we give an algorithm for a relaxed version of ℓ_p sampling in a turnstile stream. Both of our algorithms use Õ(d^(1-2/p)) bits of space and have O(1) update time. For 0 < p < 2, the (1 ± ε)-approximate F_p estimation algorithm of Kane et al. (STOC, 2011) uses an optimal O(ε⁻² log d) bits of space but has an update time of O(log²(1/ε) log log(1/ε)). Using HashPRG, we show that if 1/√d ≤ ε ≤ 1/d^c for an arbitrarily small constant c > 0, then we can obtain a (1 ± ε)-approximate F_p estimation algorithm that uses the optimal O(ε⁻² log d) bits of space and has an update time of O(log d) in the Word RAM model, which is more than a quadratic improvement in the update time. We obtain similar improvements for entropy estimation, and for CountSketch with the fine-grained error analysis of Minton and Price (SODA, 2014): for derandomization, they suggested a direct application of Nisan's generator, yielding a logarithmic multiplicative space overhead; with HashPRG we obtain an efficient derandomization yielding the same asym…
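A minimal sketch of Nisan's recursive construction, the generator that HashPRG refines: a seed consists of one block x and k hash functions h_1, ..., h_k, and level i doubles the output via G_i(x) = G_{i-1}(x) || G_{i-1}(h_i(x)). The random affine maps modulo a prime below are stand-ins for pairwise-independent hash functions, and the parameters are illustrative only.

import random

P = (1 << 61) - 1  # Mersenne prime; blocks are integers in [0, P)

def make_hash():
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: (a * x + b) % P

def nisan(x, hashes):
    # expands the single seed block x into 2^len(hashes) blocks
    if not hashes:
        return [x]
    left = nisan(x, hashes[:-1])
    right = nisan(hashes[-1](x), hashes[:-1])
    return left + right

k = 4
hashes = [make_hash() for _ in range(k)]
out = nisan(random.randrange(P), hashes)
print(len(out), out[:3])  # 16 pseudorandom blocks from a short seed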
In the k-mismatch problem we are given a pattern of length n and a text, and must find all locations where the Hamming distance between the pattern and the text is at most k. A series of recent breakthroughs has resulted in an ultra-efficient streaming algorithm for this problem that requires only O(k log(n/k)) space and O(log(n/k)(√(k log k) + log³ n)) time per letter (Clifford, Kociumaka, Porat, SODA 2019). In this work, we consider a strictly harder problem called dictionary matching with k mismatches. In this problem, we are given a dictionary of d patterns, where the length of each pattern is at most n, and must find all substrings of the text that are within Hamming distance k from one of the patterns. We develop a streaming algorithm for this problem with O(kd log^k d polylog n) space and O(k log^k d polylog n + |output|) time per position of the text. The algorithm is randomised and outputs correct answers with high probability. On the lower bound side, we show that any streaming algorithm for dictionary matching with k mismatches requires Ω(kd) bits of space.
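The problem statement in executable form: a naive baseline that, for each text position, reports every dictionary pattern within Hamming distance k of the window ending there. Unlike the streaming algorithm above, this baseline stores the full text window; it is only meant to pin down what is being computed.

def dictionary_matching_k_mismatch(text, patterns, k):
    out = []
    for i in range(len(text)):
        for p in patterns:
            if i + 1 >= len(p):
                window = text[i + 1 - len(p): i + 1]
                # Hamming distance between the window and the pattern
                mism = sum(a != b for a, b in zip(window, p))
                if mism <= k:
                    out.append((i, p, mism))
    return out

print(dictionary_matching_k_mismatch("abracadabra", ["abr", "cad"], 1))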
ISBN (print): 9781450399135
We present a new approach for finding matchings in dense graphs by building on Szemerédi's celebrated Regularity Lemma. This allows us to obtain non-trivial, albeit slight, improvements over longstanding bounds for matchings in streaming and dynamic graphs. In particular, we establish the following results for n-vertex graphs: (i) A deterministic single-pass streaming algorithm that finds a (1 - o(1))-approximate matching in o(n²) bits of space. This constitutes the first single-pass algorithm for this problem in sublinear space that improves over the 1/2-approximation of the greedy algorithm. (ii) A randomized fully dynamic algorithm that with high probability maintains a (1 - o(1))-approximate matching in o(n) worst-case update time per edge insertion or deletion. The algorithm works even against an adaptive adversary. This is the first o(n) update-time dynamic algorithm with an approximation guarantee arbitrarily close to one. Given the use of the regularity lemma, the improvement obtained by our algorithms over trivial bounds is only by some (log* n)^Θ(1) factor. Nevertheless, in each case they show that the "right" answer to the problem is not what is dictated by the previous bounds. Finally, in the streaming model, we also present a randomized (1 - o(1))-approximation algorithm whose space can be upper bounded by the density of certain Ruzsa-Szemerédi (RS) graphs. While RS graphs have by now been used extensively to prove streaming lower bounds, ours is the first work to use them as an upper bound tool for designing improved streaming algorithms.
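For concreteness, the 1/2-approximation baseline that result (i) improves on is the greedy single-pass streaming matching: keep an arriving edge whenever both endpoints are still free, using O(n log n) bits of space.

def greedy_streaming_matching(edge_stream):
    matched = set()
    matching = []
    for (u, v) in edge_stream:
        # keep the edge only if it extends the current matching
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.add(u); matched.add(v)
    return matching

print(greedy_streaming_matching([(1, 2), (2, 3), (3, 4)]))  # [(1, 2), (3, 4)]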