Wireless sensor networks generate a vast amount of data. This data, however, must be sparingly extracted to conserve energy, usually the most precious resource in battery-powered sensors. When approximation is acceptable, a model-driven approach to query processing is effective in saving energy by avoiding contacting nodes whose values can be predicted or are unlikely to be in the result set. To optimize queries such as top-k, however, reasoning directly with models of joint probability distributions can be prohibitively expensive. Instead of using models explicitly, we propose to use samples of past sensor readings. Not only are such samples simple to maintain, but they are also computationally efficient to use in query optimization. With these samples, we can formulate the problem of optimizing approximate top-k queries under an energy constraint as a linear program. We demonstrate the power and flexibility of our sampling-based approach by developing a series of top-k query planning algorithms with linear programming, which are capable of efficiently producing plans with better performance and novel features. We show that our approach is both theoretically sound and practically effective on simulated and real-world datasets.
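The idea of scoring a query plan against samples of past readings can be illustrated with a minimal Python sketch. It is not the paper's linear program: a simple greedy heuristic stands in for the LP, and the function names, the per-node `costs`, and the `budget` parameter are all assumptions made for illustration.

```python
from typing import List, Sequence, Set


def top_k_nodes(reading: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest values in one sample of sensor readings."""
    order = sorted(range(len(reading)), key=lambda i: reading[i], reverse=True)
    return set(order[:k])


def plan_accuracy(samples: List[Sequence[float]], plan: Set[int], k: int) -> float:
    """Average fraction of the true top-k nodes that the plan would contact."""
    hits = sum(len(top_k_nodes(s, k) & plan) for s in samples)
    return hits / (k * len(samples))


def greedy_plan(samples: List[Sequence[float]], costs: Sequence[float],
                budget: float, k: int) -> Set[int]:
    """Greedy stand-in for the LP: prefer nodes that appear in the sampled
    top-k often per unit of (hypothetical) energy cost, within the budget."""
    counts = [0] * len(costs)
    for s in samples:
        for i in top_k_nodes(s, k):
            counts[i] += 1
    plan: Set[int] = set()
    spent = 0.0
    ranked = sorted(range(len(costs)), key=lambda i: counts[i] / costs[i], reverse=True)
    for i in ranked:
        if counts[i] > 0 and spent + costs[i] <= budget:
            plan.add(i)
            spent += costs[i]
    return plan


# Toy data: three past samples of readings from four nodes.
samples = [[5, 1, 9, 2], [6, 0, 8, 3], [1, 7, 9, 2]]
plan = greedy_plan(samples, costs=[1, 1, 1, 1], budget=2, k=2)
accuracy = plan_accuracy(samples, plan, k=2)
```

The point of the sketch is only that plan quality is estimated by counting over stored samples rather than by reasoning with an explicit joint distribution.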
The adoption of XML to represent any kind of data and documents, even complex and huge ones, is becoming a matter of fact. However, interfacing algorithms and applications with XML parsers requires adapting them: event-based SAX parsers need algorithms that react to events generated by the parser. Moreover, parsing and loading XML documents performs poorly compared with reading flat files; several research efforts therefore address this problem by improving the parsing phase, e.g., by adopting condensed or binary representations of XML documents. This paper deals with the other side of the coin, i.e., the problem of coupling algorithms with XML parsers in a way that does not require changing the active (polling-based) nature of many algorithms and that provides acceptable performance during execution; this problem becomes even more important when we consider Java algorithms, which are usually less efficient than C or C++ algorithms. This paper presents a study of the problem of loosely coupling Java algorithms with XML parsers. The coupling is loose because the algorithm should be unaware of the particular interface provided by parsers. We consider several coupling techniques, and we compare them by analyzing their performance. The evaluation leads us to identify the coupling techniques that perform better, depending on the specific algorithm's needs and application scenario.
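One possible coupling technique of this kind, sketched here in Python rather than Java purely for illustration, runs the event-based parser in a producer thread and lets the polling algorithm pull events from a bounded queue. Whether the paper evaluates this exact technique is not stated in the abstract; the names `pull_events` and `_QueueHandler` are hypothetical, and only the standard `xml.sax`, `queue`, and `threading` modules are used.

```python
import queue
import threading
import xml.sax
from typing import Iterator, Tuple

_END = ("end-document", "")  # sentinel marking the end of the event stream


class _QueueHandler(xml.sax.ContentHandler):
    """Receives SAX callbacks (push) and stores them as events in a queue."""

    def __init__(self, events: "queue.Queue") -> None:
        super().__init__()
        self._events = events

    def startElement(self, name, attrs):
        self._events.put(("start", name))

    def endElement(self, name):
        self._events.put(("end", name))

    def characters(self, content):
        if content.strip():
            self._events.put(("text", content.strip()))


def pull_events(xml_text: str, capacity: int = 64) -> Iterator[Tuple[str, str]]:
    """Expose a push parser as an iterator the algorithm can poll."""
    events: "queue.Queue" = queue.Queue(maxsize=capacity)

    def produce() -> None:
        try:
            xml.sax.parseString(xml_text.encode("utf-8"), _QueueHandler(events))
        finally:
            events.put(_END)  # always signal completion, even on a parse error

    threading.Thread(target=produce, daemon=True).start()
    while True:
        event = events.get()
        if event is _END:
            return
        yield event


events = list(pull_events("<a><b>hi</b></a>"))
```

The consuming algorithm sees only an iterator of `(kind, value)` pairs, so it stays unaware of the parser's callback interface. The bounded queue limits memory use at the cost of thread synchronization overhead, which is the kind of trade-off a performance comparison would need to measure.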
Speed to market is critical to companies that are driven by sales in a competitive market. The earlier a potential customer can be approached in the decision-making process of a purchase, the higher the chances of converting that prospect into a customer. Traditional methods to identify sales leads, such as company surveys and direct marketing, are manual, expensive and not scalable. Over the past decade the World Wide Web has grown into an information mesh, with most important facts being reported through Web sites. Several newspapers, press releases, trade journals, business magazines and other related sources are online. These sources could be used to identify prospective buyers automatically. In this paper, we present a system called ETAP (Electronic Trigger Alert Program) that extracts trigger events from Web data that help in identifying prospective buyers. Trigger events are events of corporate relevance that indicate the propensity of companies to purchase new products associated with these events. Examples of trigger events are change in management, revenue growth and mergers & acquisitions. The unstructured nature of the information makes the extraction of trigger events difficult. We pose the problem of trigger event extraction as a classification problem and develop methods for learning trigger event classifiers using existing classification methods. We present methods to automatically generate the training data required to learn the classifiers. We also propose a method of feature abstraction that uses named entity recognition to solve the problem of data sparsity. We score and rank the trigger events extracted by ETAP for easy browsing. Our experiments show the effectiveness of the method and thus establish the feasibility of automatic sales lead generation using Web data.
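The feature abstraction step can be pictured with a small Python sketch: entity mentions are replaced by their type before features are extracted, so sentences about different companies share features. The toy gazetteer below is a hypothetical stand-in for a real named entity recognizer, and none of these names come from ETAP itself.

```python
import re
from typing import Dict, List

# Hypothetical lookup table standing in for a named entity recognizer.
TOY_GAZETTEER: Dict[str, str] = {
    "acme corp": "ORG",
    "globex": "ORG",
    "jane doe": "PERSON",
}


def abstract_entities(sentence: str,
                      gazetteer: Dict[str, str] = TOY_GAZETTEER) -> List[str]:
    """Replace known entity mentions with their type, then tokenize."""
    text = sentence.lower()
    # Match longer names first so multi-word entities are replaced whole.
    for name in sorted(gazetteer, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", gazetteer[name], text)
    return re.findall(r"[A-Za-z]+", text)


first = abstract_entities("Acme Corp names Jane Doe as CEO")
second = abstract_entities("Globex names Jane Doe as CEO")
```

Both sentences reduce to the same token sequence, which is how abstraction counters data sparsity: a classifier can learn the pattern "ORG names PERSON as ceo" from far fewer examples than it would need per company.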
ISBN: 1595931805 (print)
We consider the problem of multi-task learning, that is, learning multiple related functions. Our approach is based on a hierarchical Bayesian framework that exploits the equivalence between parametric linear models and nonparametric Gaussian processes (GPs). The resulting models can be learned easily via an EM algorithm. Empirical studies on multi-label text categorization suggest that the presented models allow accurate solutions of these multi-task problems.
ISBN: 1595931805 (print)
Predictive State Representations (PSRs) have shown a great deal of promise as an alternative to Markov models. However, learning a PSR from a single stream of data generated from an environment remains a challenge. In this work, we present a formalism of PSRs and the domains they model. This formalization suggests an algorithm for learning PSRs that will (almost surely) converge to a globally optimal model given sufficient training data.
Data dependencies play an important role in the performance of learning algorithms. In this paper we analyze the concept of data dependencies in the context of artificial systems. When a problem and its solution are viewed as points in a system configuration, variations in the problem configurations can be used to study the variations in the solution configurations and vice versa. These variations could be used to infer solutions to unknown instances of problems based on the solutions to known instances, thus reducing the problem of learning to that of identifying the relations among problems and their solutions. We use this concept in constructing a formal framework for a learning mechanism based on the relations among data attributes. As part of the framework we provide metrics (quality and quantity) for data samples and establish a knowledge conservation theorem. We explain how these concepts can be used in practice by considering an example problem, and we discuss the limitations.
ISBN: 1595931805 (print)
Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focussed on the transductive setting, where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting, however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state-of-the-art performance on a variety of classification tasks.
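As a hedged illustration of what such an explicit formula can look like, assume the data-dependent norm adds a quadratic penalty $\mathbf{f}^{\top} M \mathbf{f}$ on the function values at the $n$ labeled and unlabeled points, with $M$ a symmetric positive semidefinite matrix (for example, one derived from a graph Laplacian). The symbols $M$, $K$, and $\mathbf{k}_x$ are notation introduced here, not taken from the abstract.

```latex
% Warped norm: original RKHS norm plus a data-dependent penalty.
\|f\|_{\tilde{\mathcal{H}}}^{2} = \|f\|_{\mathcal{H}}^{2} + \mathbf{f}^{\top} M \mathbf{f},
\qquad \mathbf{f} = \bigl(f(x_1), \dots, f(x_n)\bigr)^{\top}

% Resulting kernel, with K the n-by-n Gram matrix K_{ij} = k(x_i, x_j)
% and k_x the vector of kernel evaluations against the point cloud.
\tilde{k}(x, z) = k(x, z) - \mathbf{k}_x^{\top} \,(I + M K)^{-1} M \,\mathbf{k}_z,
\qquad \mathbf{k}_x = \bigl(k(x_1, x), \dots, k(x_n, x)\bigr)^{\top}
```

Because $\tilde{k}$ is defined for arbitrary $x$ and $z$, a standard supervised learner using it can predict on points outside the point cloud, which is what separates the semi-supervised setting from the transductive one.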
ISBN: 1595931805 (print)
A logistic regression classification algorithm is developed for problems in which the feature vectors may be missing data (features). Single or multiple imputation for the missing data is avoided by performing analytic integration with an estimated conditional density function (conditioned on the non-missing data). Conditional density functions are estimated using a Gaussian mixture model (GMM), with parameter estimation performed using both expectation maximization (EM) and Variational Bayesian EM (VB-EM). Using widely available real data, we demonstrate the general advantage of the VB-EM GMM estimation for handling incomplete data, vis-à-vis the EM algorithm. Moreover, it is demonstrated that the approach proposed here is generally superior to standard imputation procedures.
ISBN: 1595931805 (print)
Unsupervised learning methods often involve summarizing the data using a small number of parameters. In certain domains, only a small subset of the available data is relevant for the problem. One-Class Classification or One-Class Clustering attempts to find a useful subset by locating a dense region in the data. In particular, a recently proposed algorithm called One-Class Information Ball (OC-IB) shows the advantage of modeling a small set of highly coherent points as opposed to pruning outliers. We present several modifications to OC-IB and integrate it with a global search that results in several improvements such as deterministic results, optimality guarantees, control over cluster size and extension to other cost functions. Empirical studies yield significantly better results on various real and artificial data.
ISBN: 1595931805 (print)
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming Dp and Da are two sets of examples drawn from two mismatched distributions, where Da is fully labeled and Dp is partially labeled, our objective is to complete the labels of Dp. We introduce an auxiliary variable μ for each example in Da to reflect its mismatch with Dp. Under an appropriate constraint the μ's are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in Dp. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets.