Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears i...
详细信息
ISBN:
(纸本)9781605582054
Consider the following problem: given sets of unlabeled observations, each set with known label proportions, predict the labels of another set of observations, also with known label proportions. This problem appears in areas like e-commerce, spam filtering and improper content detection. We present consistent estimators which can reconstruct the correct labels with high probability in a uniform convergence sense. Experiments show that our method works well in practice. Copyright 2008 by the author(s)/owner(s).
We describe and explore a new perspective on the sample complexity of active learning. In many situations where it was generally believed that active learning does not help, we show that active learning does help in t...
详细信息
We describe and explore a new perspective on the sample complexity of active learning. In many situations where it was generally believed that active learning does not help, we show that active learning does help in the limit, often with exponential improvements in sample complexity. This contrasts with the traditional analysis of active learning problems such as non-homogeneous linear separators or depth-limited decision trees, in which Ω(1/∈) lower bounds are common. Such lower bounds should be interpreted carefully;indeed, we prove that it is always possible to learn an ∈-good classifier with a number of samples asymptotically smaller than this. These new insights arise from a subtle variation on the traditional definition of sample complexity, not previously recognized in the active learning literature.
A Detecting Peak's Number (DPN) technique is proposed for multimodal optimization. In DPN technique, we want to know the peak's number of locally multimodal domain of every individual, firstly we use the idea ...
详细信息
A Detecting Peak's Number (DPN) technique is proposed for multimodal optimization. In DPN technique, we want to know the peak's number of locally multimodal domain of every individual, firstly we use the idea of orthogonal intersection for getting the exploration direction in every locally multimodal domain, and then we attempt to detect peak's number in every one-dimension direction as the result of detecting of locally multimodal domain. At last we design an evolution algorithm (DPNA) based on the characters of DNP technique, which contain four characters: niching, variable population, variable radius and life time, and then give a series of experiment results which show the effectiveness of algorithm, as the DPNA is not only adapting to obtaining multiple optima or suboptima, but also effective for problem of ill-scaled and locally multimodal domain described in [11].
Concept discovery and modeling are fundamental problems in machinelearning research. Real world concepts are usually high-dimensional and have complicated distributions along their dimensions. Gaussian Mixture Models...
详细信息
ISBN:
(纸本)9789746152969
Concept discovery and modeling are fundamental problems in machinelearning research. Real world concepts are usually high-dimensional and have complicated distributions along their dimensions. Gaussian Mixture Models(GMM) have proved useful in modeling such complicated distributions. We propose a data-driven concept modeling and discovery framework using GMM, with on-line updating mechanism for fast computation suitable for real world applications. Experiments show the efficacy and efficiency of the proposed algorithm.
We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a...
详细信息
ISBN:
(纸本)160560352X
We study boosting in the filtering setting, where the booster draws examples from an oracle instead of using a fixed training set and so may train efficiently on very large datasets. Our algorithm, which is based on a logistic regression technique proposed by Collins, Schapire, & Singer, requires fewer assumptions to achieve bounds equivalent to or better than previous work. Moreover, we give the first proof that the algorithm of Collins et al. is a strong PAC learner, albeit within the filtering setting. Our proofs demonstrate the algorithm's strong theoretical properties for both classification and conditional probability estimation, and we validate these results through extensive experiments. Empirically, our algorithm proves more robust to noise and overfitting than batch boosters in conditional probability estimation and proves competitive in classification.
Quite a bit is known about minimizing different kinds of regret in experts problems, and how these regret types relate to types of equilibria in the multiagent setting of repeated matrix games. Much less is known abou...
详细信息
ISBN:
(纸本)9781605582054
Quite a bit is known about minimizing different kinds of regret in experts problems, and how these regret types relate to types of equilibria in the multiagent setting of repeated matrix games. Much less is known about the possible kinds of regret in online convex programming problems (OCPs), or about equilibria in the analogous multiagent setting of repeated convex games. This gap is unfortunate, since convex games are much more expressive than matrix games, and since many important machinelearning problems can be expressed as OCPs. In this paper, we work to close this gap: we analyze a spectrum of regret types which lie between external and swap regret, along with their corresponding equilibria, which lie between coarse correlated and correlated equilibrium. We also analyze algorithms for minimizing these regret types. As examples of our framework, we derive algorithms for learning correlated equilibria in polyhedral convex games and extensive-form correlated equilibria in extensive-form games. The former is exponentially more efficient than previous algorithms, and the latter is the first of its type. Copyright 2008 by the author(s)/owner(s).
We generalize Shimizu et al's (2006) ICA-based approach for discovering linear non-Gaussian acyclic (LiNGAM) Structural Equation Models (SEMs) from causally sufficient, continuous-valued observational data. By rel...
详细信息
ISBN:
(纸本)0974903949
We generalize Shimizu et al's (2006) ICA-based approach for discovering linear non-Gaussian acyclic (LiNGAM) Structural Equation Models (SEMs) from causally sufficient, continuous-valued observational data. By relaxing the assumption that the generating SEM's graph is acyclic, we solve the more general problem of linear non-Gaussian (LiNG) SEM discovery. LiNG discovery algorithms output the distribution equivalence class of SEMs which, in the large sample limit, represents the population distribution. We apply a LiNG discovery algorithm to simulated data. Finally, we give sufficient conditions under which only one of the SEMs in the output class is "stable".
This paper introduces Adaptive Feature Thresholding (AFT) which is a novel method of person-dependent off-line signature verification. AFT enhances how a simple image feature of a signature is converted to a binary fe...
详细信息
This paper introduces Adaptive Feature Thresholding (AFT) which is a novel method of person-dependent off-line signature verification. AFT enhances how a simple image feature of a signature is converted to a binary feature vector by significantly improving its representation in relation to the training signatures. The similarity between signatures is then easily computed from their corresponding binary feature vectors. AFT was tested on the CEDAR and GPDS benchmark datasets, with classification using either a manual or an automatic variant. On the CEDAR dataset we achieved a classification accuracy of 92% for manual and 90% for automatic, while on the GPDS dataset we achieved over 87% and 85% respectively. For both datasets AFT is less complex and requires fewer images features than the existing state of the art methods, while achieving competitive results.
In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Lin...
详细信息
ISBN:
(纸本)9781605581934
In this work, we address the problem of joint modeling of text and citations in the topic modeling framework. We present two different models called the Pairwise-Link-LDA and the Link-PLSA-LDA models. The Pairwise-Link-LDA model combines the ideas of LDA [4] and Mixed Membership Block Stochastic Models [1] and allows modeling arbitrary link structure. However, the model is computationally expensive, since it involves modeling the presence or absence of a citation (link) between every pair of documents. The second model solves this problem by assuming that the link structure is a bipartite graph. As the name indicates, Link-PLSA-LDA model combines the LDA and PLSA models into a single graphical model. Our experiments on a subset of Citeseer data show that both these models are able to predict unseen data better than the baseline model of Erosheva and Lafferty [8], by capturing the notion of topical similarity between the contents of the cited and citing documents. Our experiments on two different data sets on the link prediction task show that the Link-PLSA-LDA model performs the best on the citation prediction task, while also remaining highly scalable. In addition, we also present some interesting visualizations generated by each of the models. Copyright 2008 ACM.
暂无评论