ISBN: (Print) 9798400700507
Count-Min Sketch (CMS) and HeavyKeeper (HK) are two realizations of a compact frequency estimator (CFE). These are a class of probabilistic data structures that maintain a compact summary of (typically) high-volume streaming data, and provide approximately correct estimates of the number of times any particular element has appeared. CFEs are often the base structure in systems looking for the highest-frequency elements (i.e., top-K elements, heavy hitters, elephant flows). Traditionally, probabilistic guarantees on the accuracy of frequency estimates are proved under the implicit assumption that stream elements do not depend upon the internal randomness of the structure. Said another way, they are proved in the presence of data streams that are created by non-adaptive adversaries. Yet in many practical use-cases, this assumption is not well-matched with reality; especially in applications where malicious actors are incentivized to manipulate the data stream. We show that the CMS and HK structures can be forced to make significant estimation errors, by concrete attacks that exploit adaptivity. We analyze these attacks analytically and experimentally, with tight agreement between the two. Sadly, these negative results seem unavoidable for (at least) sketch-based CFEs with parameters that are reasonable in practice. On the positive side, we give a new CFE (Count-Keeper) that can be seen as a composition of the CMS and HK structures. Count-Keeper estimates are typically more accurate (by at least a factor of two) than CMS for "honest" streams; our attacks against CMS and HK are less effective (and more resource intensive) when used against Count-Keeper; and Count-Keeper has a native ability to flag estimates that are suspicious, which neither CMS nor HK (nor any other CFE, to our knowledge) admits.
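The CMS mechanism the abstract refers to can be illustrated with a minimal sketch (an illustrative toy, not the paper's implementation; the width/depth parameters here are arbitrary): d hash rows of w counters, where every update increments one counter per row and an estimate takes the minimum counter across rows, so collisions can only inflate estimates, never deflate them.

```python
import random

class CountMinSketch:
    """Minimal Count-Min Sketch: d rows of w counters; estimate = min over rows."""

    def __init__(self, width=1024, depth=4, seed=0):
        self.w, self.d = width, depth
        rng = random.Random(seed)
        # One independent hash seed per row.
        self.seeds = [rng.randrange(1 << 32) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.seeds[row], item)) % self.w

    def update(self, item, count=1):
        for r in range(self.d):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # Never underestimates; overestimates only when other items collide
        # with this item in every row.
        return min(self.rows[r][self._index(r, item)] for r in range(self.d))
```

This non-deflation property is exactly what an adaptive adversary exploits: by probing estimates, it can find items that collide with a victim in all rows and inflate the victim's count.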
Next generation sequencing technology has been extensively used in biological and medical research. With abundant genomic data generated, new challenges also arise: the expanding capacity of genomic data pushes the boundaries of current search and analysis methods. Two of the main challenges for genomic big data are how to efficiently search to identify datasets of interest, and how to deeply analyze the large volume of data to discover new knowledge. In this dissertation, we present four research achievements that aim to tackle these two challenges in genomic data. The AllSome Sequence Bloom Tree data structure and associated search algorithms are first introduced to help find datasets of interest, filter out futile ones, and narrow down the data size. To meet the demand of further deep analysis, several scalable algorithms for sequence analysis are introduced. Based on them, a genetic variant analysis toolkit is developed, comprising three methods (ISVDA, VarGeno, and VarMatch) that address different aspects of small genetic variant study. ISVDA is an iterative small variant discovery algorithm that can detect small genetic variants that were previously hard to detect. VarGeno is a fast and accurate single nucleotide polymorphism genotyping tool. VarMatch is introduced to find high-confidence variants among multiple variant detection results. It can also be used to evaluate variant calling results.
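Sequence Bloom Trees such as AllSome summarize each dataset's k-mer content with Bloom filters, so that a query sequence can be matched against a dataset without scanning the raw reads. A minimal Bloom filter over k-mers (an illustrative sketch, not the dissertation's code; sizes, hash scheme, and k are chosen arbitrarily here) captures the core idea:

```python
class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array.
    Membership tests have no false negatives, only false positives."""

    def __init__(self, m=1 << 16, k=3, seed=0):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)
        self.seeds = list(range(seed, seed + k))

    def _positions(self, item):
        return [hash((s, item)) % self.m for s in self.seeds]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all((self.bits[p >> 3] >> (p & 7)) & 1 for p in self._positions(item))

def kmers(seq, k=5):
    """Yield all overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))
```

A tree of such filters (each internal node summarizing its subtree's datasets) lets a search prune whole subtrees whose filters cannot contain the query's k-mers.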
ISBN: (Print) 9781450359566
We introduce Graphene, a method and protocol for interactive set reconciliation among peers in blockchains and related distributed systems. Through the novel combination of a Bloom filter and an Invertible Bloom Lookup Table (IBLT), Graphene uses a fraction of the network bandwidth used by deployed work for one- and two-way synchronization. We show that, for this specific problem, Graphene is more efficient at reconciling n items than using a Bloom filter at the information theoretic bound. We contribute a fast and implementation-independent algorithm for parameterizing an IBLT so that it is optimally small in size and meets a desired decode rate with arbitrarily high probability. We characterize our performance improvements through analysis, detailed simulation, and deployment results for Bitcoin Cash, a prominent cryptocurrency. Our implementations of Graphene, IBLTs, and our IBLT optimization algorithm are all open-source code.
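The IBLT at the heart of Graphene can be sketched in miniature (an illustrative toy, not the paper's open-source implementation; cell layout and parameters are ours): each key is XOR-ed into k cells, deletions subtract, and decoding repeatedly "peels" pure cells (count ±1) to recover the symmetric difference between two sets.

```python
class IBLT:
    """Minimal Invertible Bloom Lookup Table over nonzero integer keys."""

    def __init__(self, m=64, k=3, seed=0):
        self.m, self.k = m, k
        self.count = [0] * m     # net insertions minus deletions per cell
        self.key_sum = [0] * m   # XOR of keys touching each cell
        self.seeds = list(range(seed, seed + k))

    def _cells(self, key):
        return [hash((s, key)) % self.m for s in self.seeds]

    def _apply(self, key, delta):
        for c in self._cells(key):
            self.count[c] += delta
            self.key_sum[c] ^= key

    def insert(self, key):
        self._apply(key, +1)

    def delete(self, key):
        self._apply(key, -1)

    def decode(self):
        """Peel pure cells to recover (inserted-only keys, deleted-only keys)."""
        added, removed = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(self.m):
                if self.count[c] in (1, -1):
                    key = self.key_sum[c]
                    # Sanity check that the cell is really pure: the recovered
                    # key must hash back to this cell.
                    if c not in self._cells(key):
                        continue
                    if self.count[c] == 1:
                        added.add(key)
                        self.delete(key)   # peel it out
                    else:
                        removed.add(key)
                        self.insert(key)   # cancel the deletion
                    progress = True
        return added, removed
```

In a reconciliation setting, one peer inserts its set, the other deletes its own set from the received table, and decoding yields the two one-sided differences; the paper's parameterization algorithm is about choosing m and k so this peeling succeeds with the desired probability at minimal size.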