We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles in R-d, for any d > 0. Our learning model is: samples drawn according to the ...
详细信息
We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles in R-d, for any d > 0. Our learning model is: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random access memory space. Our algorithm makes 2l + 2 passes, for any l > 0, and requires memory at most (O) over tilde(epsilon(-2/l) k(2)d(4) + (4k)(d)), where E is the tolerable error of the algorithm. This exhibits a strong memory-pass tradeoff in terms of E: a few more passes significantly lower its memory requirements, thus trading one of the two most important resources in streaming computation for the other. Chang and Karman first considered this problem for of d = 1, 2. Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but computation with such large inputs requires very restricted models of computation. (C) 2009 Elsevier B.V. All rights reserved.
We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles in R-d, for any d > 0. Our learning model is: samples drawn according to the ...
详细信息
ISBN:
(纸本)9783540752240
We present a multiple pass streaming algorithm for learning the density function of a mixture of k uniform distributions over rectangles in R-d, for any d > 0. Our learning model is: samples drawn according to the mixture are placed in arbitrary order in a data stream that may only be accessed sequentially by an algorithm with a very limited random access memory space. Our algorithm makes 2l + 2 passes, for any l > 0, and requires memory at most (O) over tilde(epsilon(-2/l) k(2)d(4) + (4k)(d)), where E is the tolerable error of the algorithm. This exhibits a strong memory-pass tradeoff in terms of E: a few more passes significantly lower its memory requirements, thus trading one of the two most important resources in streaming computation for the other. Chang and Karman first considered this problem for of d = 1, 2. Our learning algorithm is especially appropriate for situations where massive data sets of samples are available, but computation with such large inputs requires very restricted models of computation. (C) 2009 Elsevier B.V. All rights reserved.
A basic assumption of Health Level Seven (HL7) protocol is 'No limitation of message length'. However, most existing commercial HL7 interface engines do limit message length because they use the string array m...
详细信息
A basic assumption of Health Level Seven (HL7) protocol is 'No limitation of message length'. However, most existing commercial HL7 interface engines do limit message length because they use the string array method, which is run in the main memory for the HL7 message parsing process. Specifically, messages with image and multi-media data create a tong string array and thus cause the computer system to raise critical and fatal problem. Consequently, HL7 messages cannot handle the image and multi-media data necessary in modern medical records. This study aims to solve this problem with the 'streaming algorithm' method. This new method for HL7 message parsing applies the character-stream object which process character by character between the main memory and. hard disk device with the consequence that the processing load on main memory could be alleviated. The main functions of this new engine are generating, parsing, validating, browsing, sending, and receiving HL7 messages. Also, the engine can parse and generate XML-formatted HL7 messages. This new HL7 engine successfully exchanged HL7 messages with 10 megabyte size images and discharge summary information between two university hospitals. Published by Elsevier Ireland Ltd.
Detecting spreaders can help an intrusion detection system identify potential attackers. The existing work can only detect aggressive spreaders that scan a large number of distinct addresses in a short period of time....
详细信息
ISBN:
(纸本)9781424423248
Detecting spreaders can help an intrusion detection system identify potential attackers. The existing work can only detect aggressive spreaders that scan a large number of distinct addresses in a short period of time. However, stealthy spreaders may perform scanning deliberately at a low rate. We observe that these spreaders can easily evade the detection because their small traffic footprint will be covered by the large amount of background normal traffic that frequently flushes any spreader information out of the intrusion detection system's memory. We propose a new streaming scheme to detect stealthy spreaders that are invisible to the current systems. The new scheme stores information about normal traffic within a limited portion of the allocated memory, so that it will not interfere with spreaders' information stored elsewhere in the memory. The proposed scheme is light weight;it can detect invisible spreaders in highspeed networks while residing in SRAM. Through experiments using real Internet traffic traces, we demonstrate that our new scheme detects invisible spreaders efficiently while keeping both false-positives (normal sources misclassified as spreaders) and false-negatives (spreaders misclassified as normal sources) to low level.
Information on network host, connectivity patterns are important for network monitoring and traffic engineering. In this paper, all efficient streaming algorithm is proposed to estimate cardinality distributions inclu...
详细信息
ISBN:
(纸本)9781605580050
Information on network host, connectivity patterns are important for network monitoring and traffic engineering. In this paper, all efficient streaming algorithm is proposed to estimate cardinality distributions including connectivity distributions, e.g. percent of hosts with any given number of distinct communicating peers or flows.
Information on network host connectivity patterns are important for network monitoring and traffic engineering. In this paper, an efficient streaming algorithm is proposed to estimate cardinality distributions includi...
详细信息
ISBN:
(纸本)9781605580050
Information on network host connectivity patterns are important for network monitoring and traffic engineering. In this paper, an efficient streaming algorithm is proposed to estimate cardinality distributions including connectivity distributions, e.g. percent of hosts with any given number of distinct communicating peers or flows.
Over the past 15 years, our ability to collect massive data sets has increased dramatically. Concomitantly, our need to process, compress, store, analyze, and summarize these data sets has grown as well. Scientific, e...
详细信息
Over the past 15 years, our ability to collect massive data sets has increased dramatically. Concomitantly, our need to process, compress, store, analyze, and summarize these data sets has grown as well. Scientific, engineering, medical, and industrial applications require that we carry out these tasks efficiently and reasonably accurately. Data streams are one type of modern massive data sets, characterized by their size and by their distributed and dynamic properties. We give an expository discussion of data stream models and the algorithmic challenges that these models pose for computational statistical analysis, then present an overview of three streaming algorithms and a discussion of the computational challenges with each.
We present a I-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the ...
详细信息
We present a I-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature. (C) 2003 Elsevier B.V. All rights reserved.
We present a I-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the ...
详细信息
ISBN:
(纸本)3540438645
We present a I-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature. (C) 2003 Elsevier B.V. All rights reserved.
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomiz...
详细信息
ISBN:
(纸本)9781581136746
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k--Median problem which produces a constant factor approximation in one pass using storage space O(k poly log n). This is a significant improvement of the previous best algorithm which yielded a 2O(1/ε) approximation using O(nε) space. Next we give a streaming algorithm for the k--Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant factor approximations by increasing the allowed fraction of outliers slightly.
暂无评论