Transductive semi-supervised learning methods aim to automatically label large datasets by leveraging information provided by a few manually labeled data points and the intrinsic structure of the dataset. Many such methods based on a graph signal representation of a dataset have been proposed, in which the nodes correspond to the data points, the edges connect similar points, and the graph signal is the mapping between the nodes and the labels. Most of the existing methods use deterministic signal models and try to recover the graph signal using a regularized or constrained convex optimization approach, where the regularization/constraint term enforces some sort of smoothness of the graph signal. This thesis takes a different route and investigates a probabilistic graphical modeling approach in which the graph signal is considered a Markov random field defined over the underlying network structure. The measurement process, modeling the initial manually obtained labels, and smoothness assumptions are imposed by a probability distribution defined over the Markov network corresponding to the data graph. Various approximate inference methods, such as loopy belief propagation and mean field methods, are studied by means of numerical experiments involving both synthetic and real-world datasets.
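To make the mean field idea concrete, the following is a minimal sketch (not the thesis's exact model): binary labels on a small path graph, an Ising-style pairwise coupling for smoothness, and a strong local field at the manually labeled nodes. All parameter values are illustrative.

```python
import math

# Mean-field inference for binary labels {-1, +1} on a small graph MRF.
# Edges couple neighbouring labels; two observed nodes act as evidence.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]   # a path graph on 5 nodes
observed = {0: +1, 4: +1}                  # manually labeled nodes
beta = 1.0                                 # smoothness coupling strength
obs_strength = 4.0                         # confidence in observed labels

neighbors = {i: [] for i in range(5)}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# m[i] = E[label_i] under the fully factorized variational distribution
m = [0.0] * 5
for _ in range(50):                        # fixed-point iterations
    for i in range(5):
        field = beta * sum(m[j] for j in neighbors[i])
        field += obs_strength * observed.get(i, 0)
        m[i] = math.tanh(field)

labels = [1 if mi > 0 else -1 for mi in m]
print(labels)  # the observed +1 labels propagate to the interior nodes
```

The same fixed-point loop generalizes to multi-class labels by replacing tanh with a softmax over per-class fields.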
ISBN (Print): 9781479934003
Classification of multilabel documents is essential to information retrieval and text mining. Most existing approaches to multilabel text classification pay no attention to the relationship between class labels and input documents, and rely on labeled data for classification at all times. In fact, unlabeled data is readily available, whereas generating labeled data is expensive and error prone because it requires human annotation. In this paper, we propose a novel multilabel document classification approach based on a semi-supervised mixture model of Watson distributions on the document manifold, which explicitly considers the manifold structure of the document space to exploit both labeled and unlabeled data efficiently for classification. Our proposed approach models all labels within a dataset simultaneously, which lends itself well to considering the relationships between these labels. The experimental results show that the proposed method outperforms state-of-the-art methods for multilabel text classification.
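For intuition, here is a sketch of the scoring step with Watson distributions, which are axially symmetric densities on the unit sphere with log-density proportional to kappa * (mu . x)^2. Documents are L2-normalized term vectors; the class directions and concentrations below are invented for illustration, not fitted as in the paper.

```python
import math

def normalize(v):
    """L2-normalize a vector so it lies on the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def watson_log_density(x, mu, kappa):
    # log p(x) = kappa * (mu . x)^2 + const; the normalizing constant
    # cancels when comparing labels, so it is omitted here.
    dot = sum(a * b for a, b in zip(mu, x))
    return kappa * dot * dot

# Hypothetical class parameters (mean direction, concentration)
classes = {
    "sports":   (normalize([1.0, 0.2, 0.0]), 10.0),
    "politics": (normalize([0.1, 1.0, 0.3]), 10.0),
}

doc = normalize([0.9, 0.3, 0.1])   # a new document's term vector
scores = {c: watson_log_density(doc, mu, k) for c, (mu, k) in classes.items()}
best = max(scores, key=scores.get)
print(best)
```

In the semi-supervised mixture, the same per-class scores become posterior responsibilities that are re-estimated from labeled and unlabeled documents jointly.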
Multi-label classification consists of learning a function capable of mapping an object to a set of relevant labels. It has applications such as the association of genes with biological functions, semantic classification of scenes, and text categorization. Traditional (i.e., single-label) classification is therefore a particular case of multi-label classification in which each object is associated with exactly one label. A successful approach to constructing classifiers is to obtain a probabilistic model of the relation between object attributes and labels. This model can then be used to classify objects, finding the most likely prediction by computing the marginal probability or the most probable explanation (MPE) of the labels given the attributes. Depending on the family of probabilistic models chosen, such inferences may be intractable when the number of labels is large. Sum-Product Networks (SPNs) are deep probabilistic models that allow tractable marginal inference. Nevertheless, as with many other probabilistic models, performing MPE inference in SPNs is NP-hard. Although SPNs have already been used successfully for traditional (i.e., single-label) classification tasks, there has been no in-depth investigation of their use for multi-label classification. In this work we investigate the use of SPNs for multi-label classification. We compare several algorithms for learning SPNs combined with different proposed approaches for classification. We show that SPN-based multi-label classifiers are competitive against state-of-the-art classifiers, such as Random k-Labelsets with Support Vector Machines and MPE inference on CutNets, on a collection of benchmark datasets.
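The tractability of marginal inference in SPNs can be seen in a toy example: marginalizing a variable amounts to setting both of its leaf indicators to 1, after which a single bottom-up pass computes the marginal exactly. The two-variable network below is a made-up illustration, not one learned by the paper's algorithms.

```python
# A toy sum-product network over two binary variables X1, X2.
def leaf(var, val):
    return ("leaf", var, val)

def evaluate(node, evidence):
    kind = node[0]
    if kind == "leaf":
        _, var, val = node
        if var not in evidence:          # marginalized: indicator = 1
            return 1.0
        return 1.0 if evidence[var] == val else 0.0
    if kind == "prod":
        r = 1.0
        for child in node[1]:
            r *= evaluate(child, evidence)
        return r
    # sum node: weighted mixture of children
    return sum(w * evaluate(c, evidence) for w, c in node[1])

# Mixture of two independent components: 0.6 * P1(X1)P1(X2) + 0.4 * P2(X1)P2(X2)
spn = ("sum", [
    (0.6, ("prod", [("sum", [(0.8, leaf("X1", 1)), (0.2, leaf("X1", 0))]),
                    ("sum", [(0.3, leaf("X2", 1)), (0.7, leaf("X2", 0))])])),
    (0.4, ("prod", [("sum", [(0.1, leaf("X1", 1)), (0.9, leaf("X1", 0))]),
                    ("sum", [(0.5, leaf("X2", 1)), (0.5, leaf("X2", 0))])])),
])

p_x1 = evaluate(spn, {"X1": 1})          # P(X1=1), X2 marginalized out
print(round(p_x1, 3))                    # 0.6*0.8 + 0.4*0.1 = 0.52
```

MPE inference, by contrast, would replace sum nodes with max nodes, and recovering an exact maximizer is what makes the multi-label case hard in general.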
ISBN (Print): 9798400710940
It is now widely acknowledged that machine learning models, trained on data without due care, often exhibit discriminatory behavior. Traditional fairness research has mainly focused on supervised learning tasks, particularly classification. While fairness in unsupervised learning has received some attention, the literature has primarily addressed fair representation learning of continuous embeddings. This paper, however, takes a different approach by investigating fairness in unsupervised learning using graphical models with discrete latent variables. We develop a fair stochastic variational inference method for discrete latent variables. Our approach uses a fairness penalty on the variational distribution that reflects the principles of intersectionality, a comprehensive perspective on fairness from the fields of law, social sciences, and humanities. Intersectional fairness brings the challenge of data sparsity in minibatches, which we address via a stochastic approximation approach. We first show the utility of our method in improving equity and fairness for clustering using naive Bayes and Gaussian mixture models on benchmark datasets. To demonstrate the generality of our approach and its potential for real-world impact, we then develop a specialized graphical model for criminal justice risk assessments, and use our fairness approach to prevent the inferences from encoding unfair societal biases.
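One way to picture a fairness penalty on a variational distribution: q[i][k] is the variational probability that point i takes discrete latent value k, and the penalty grows as a subgroup's assignment rates deviate from the population's. The squared-deviation form and all numbers below are illustrative stand-ins, not the paper's exact penalty.

```python
# Sketch: penalize per-group deviations in expected latent assignments.
def assignment_rates(q, members):
    """Average variational assignment probabilities over a set of points."""
    K = len(q[0])
    rates = [0.0] * K
    for i in members:
        for k in range(K):
            rates[k] += q[i][k]
    return [r / len(members) for r in rates]

def fairness_penalty(q, groups):
    overall = assignment_rates(q, list(range(len(q))))
    penalty = 0.0
    for members in groups.values():       # each (intersectional) subgroup
        rates = assignment_rates(q, members)
        penalty += sum((r - o) ** 2 for r, o in zip(rates, overall))
    return penalty

# Two points per group; the "unfair" q sorts groups into separate clusters.
q_unfair = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
q_fair   = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
groups = {"a": [0, 1], "b": [2, 3]}

print(fairness_penalty(q_unfair, groups), fairness_penalty(q_fair, groups))
```

In stochastic variational inference this term would be estimated per minibatch, which is where the paper's sparsity problem (few members of a small intersectional group per batch) arises.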
Machine learning methods often face a tradeoff between the accuracy of discriminative models and the lower sample complexity of their generative counterparts. This inspires a need for hybrid methods. In this paper we present the graphical ensemble classifier (GEC), a novel combination of logistic regression and naive Bayes. By partitioning the feature space based on known independence structure, GEC is able to handle datasets with a diverse set of features and achieve higher accuracy than a purely discriminative model from less training data. In addition to describing the theoretical basis of our model, we demonstrate its practical effectiveness on artificial data and on the 20-newsgroups, MNIST, and MediFor datasets.
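The hybrid idea can be sketched in a few lines: split the features into two blocks assumed independent given the class, score one block discriminatively (logistic regression) and the other generatively (naive Bayes), and add the log-odds. The weights and likelihoods below are made up for illustration; GEC would fit both parts from data.

```python
import math

def logistic_log_odds(x, w, b):
    # For logistic regression, log P(y=1|x) - log P(y=0|x) = w.x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def naive_bayes_log_odds(x, like1, like0, prior1=0.5):
    # Log-odds from Bernoulli class-conditional likelihoods per feature
    lo = math.log(prior1 / (1 - prior1))
    for xi, p1, p0 in zip(x, like1, like0):
        lo += math.log(p1 if xi else 1 - p1) - math.log(p0 if xi else 1 - p0)
    return lo

def gec_predict(x_disc, x_gen):
    """Combine the two blocks' log-odds; valid if the blocks are
    conditionally independent given the class."""
    total = logistic_log_odds(x_disc, w=[1.5, -0.5], b=0.0)
    total += naive_bayes_log_odds(x_gen, like1=[0.9, 0.7], like0=[0.2, 0.4])
    return 1 if total > 0 else 0

pred = gec_predict(x_disc=[1.0, 0.2], x_gen=[1, 1])
print(pred)
```

Adding log-odds is exactly a product of the two models' likelihood ratios, which is where the known independence structure enters.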
Programmers strive to design programs that are flexible, updateable, and maintainable. However, factors such as lack of time, high costs, and workload lead to the creation of software with inadequacies known as anti-patterns. To identify and refactor software anti-patterns, many research studies have been conducted using machine learning. Even though some previous works were very accurate in identifying anti-patterns, a method that takes into account the relationships between different structures is still needed, as is a practical method trained according to the characteristics of each program. Such a method should be able to identify anti-patterns and perform the necessary refactorings. This paper proposes a framework based on probabilistic graphical models for identifying and refactoring anti-patterns. A graphical model is created by extracting class properties from the source code. As a final step, a Bayesian network is trained, which determines whether anti-patterns are present based on the characteristics of neighboring classes. To evaluate the proposed approach, the model is trained on six different anti-patterns and six different Java programs. The proposed model identified these anti-patterns with a mean accuracy of 85.16% and a mean recall of 79%. Additionally, the model has been used to introduce several refactoring methods, which are shown to ultimately create a system with less coupling and higher cohesion.
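A minimal sketch of the inference step: a toy Bayesian network in which two hypothetical binary class metrics ("large", "coupled") bear on whether a class is a God Class. All probabilities are invented; the paper learns them from each program's extracted class properties.

```python
# Toy Bayesian network: god_class -> large, god_class -> coupled,
# with the two metrics conditionally independent given god_class.
prior = {True: 0.1, False: 0.9}            # P(god_class)
p_large   = {True: 0.9, False: 0.3}        # P(large=1 | god_class)
p_coupled = {True: 0.8, False: 0.2}        # P(coupled=1 | god_class)

def posterior_god_class(large, coupled):
    """P(god_class=True | metrics) via Bayes' rule."""
    def joint(g):
        p = p_large[g] if large else 1 - p_large[g]
        p *= p_coupled[g] if coupled else 1 - p_coupled[g]
        return p * prior[g]
    num = joint(True)
    return num / (num + joint(False))

p = posterior_god_class(large=True, coupled=True)
print(round(p, 3))   # both warning signs present flips the low prior
```

The paper's network additionally conditions on the characteristics of neighboring classes, so the evidence set per class is larger than this two-metric toy.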
Several algorithms have been proposed for discovering the graphical structure of Bayesian networks. Most of these algorithms are restricted to observational data, and some enable us to incorporate knowledge as constraints on what can and cannot be discovered by an algorithm. A common type of such knowledge involves the temporal order of the variables in the data: for example, knowledge that event B occurs after observing A, and hence the constraint that B cannot cause A. This paper investigates real-world case studies that incorporate interesting properties of objective temporal variable order, and the impact these temporal constraints have on the learnt graph. The results show that most of the learnt graphs are subject to major modifications after incorporating incomplete objective temporal information. Because temporal information is widely viewed as a form of knowledge that is subjective, rather than as a form of data that tends to be objective, it is generally disregarded and reduced to an optional piece of information that only a few structure learning algorithms may consider. The paper argues that objective temporal information should form part of observational data, to reduce the risk of disregarding such information when available and to encourage its reusability across related studies.
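The constraint mechanism itself is simple to sketch: assign variables to temporal tiers and forbid, before search begins, any candidate edge that points backwards in time. The tier assignments below are illustrative; they play the role of the "B occurs after A" knowledge in the abstract.

```python
# Encode objective temporal order as an edge whitelist for structure search.
tiers = {"A": 0, "B": 1, "C": 1, "D": 2}   # A observed first, D last

def allowed_edges(tiers):
    """All directed edges that do not point backwards in time.
    Edges within the same tier remain unconstrained (either direction)."""
    vs = list(tiers)
    return {(x, y) for x in vs for y in vs
            if x != y and tiers[x] <= tiers[y]}

candidates = allowed_edges(tiers)
print(("A", "B") in candidates)   # forward in time: allowed
print(("B", "A") in candidates)   # backwards in time: forbidden
```

Any score- or constraint-based learner that accepts a blacklist/whitelist can consume this set, which is how partial (incomplete) temporal information still prunes the search space.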
Data reconciliation is a widely utilised technique in the process industries for obtaining consistent estimates of process variables from measurements corrupted by random and gross errors, taking process models as constraints. Existing formulations of data reconciliation assume the process models to be error free. In practice, however, process models can suffer from inaccuracies, leading to uncertainties in the states. This paper introduces a new method for data reconciliation developed in the framework of Bayesian networks that accounts for these state uncertainties. The solution is obtained by using a Bayesian network model translated from the process model and applying statistical inference techniques to estimate the reconciled values of the states. A novel method for constructing an acyclic Bayesian network for process networks with recycle streams is proposed. The method is also extended to data reconciliation of partially measured systems. The proposed data reconciliation schemes are demonstrated on two case studies.
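For context, the classical error-free-model baseline that this paper relaxes has a closed form. With one linear constraint a.x = 0 (e.g. a mass balance x1 + x2 = x3) and diagonal measurement variances, the weighted-least-squares reconciliation is x_hat = y - V a (a'Va)^-1 (a'y). The flows and variances below are illustrative.

```python
# Classical data reconciliation for three flows with x1 + x2 = x3.
a = [1.0, 1.0, -1.0]           # constraint coefficients: x1 + x2 - x3 = 0
y = [10.2, 5.1, 14.8]          # measured flows (violate the balance by 0.5)
v = [0.04, 0.01, 0.09]         # measurement variances (diagonal V)

residual = sum(ai * yi for ai, yi in zip(a, y))    # a'y: balance violation
denom = sum(ai * ai * vi for ai, vi in zip(a, v))  # a'Va (scalar here)
x_hat = [yi - vi * ai * residual / denom
         for yi, vi, ai in zip(y, v, a)]           # closed-form adjustment

print([round(x, 4) for x in x_hat])
print(round(x_hat[0] + x_hat[1] - x_hat[2], 9))    # balance now holds
```

Note the adjustment each measurement receives is proportional to its variance: less trusted sensors absorb more of the imbalance. The paper's Bayesian network formulation generalizes this by letting the constraint itself (the process model) be uncertain.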
Background: De Bruijn graphs are key data structures for the analysis of next-generation sequencing data. They efficiently represent the overlap between reads and hence also the underlying genome sequence. However, sequencing errors and repeated subsequences render the identification of the true underlying sequence difficult. A key step in this process is the inference of the multiplicities of nodes and arcs in the graph. These multiplicities correspond to the number of times each k-mer (resp. (k+1)-mer) implied by a node (resp. arc) is present in the genomic sequence. Determining multiplicities thus reveals the repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs in the de Bruijn graph are reflected in their coverage; however, coverage variability and coverage biases render their determination ambiguous. Current methods to determine node/arc multiplicities base their decisions solely on the information in nodes and arcs individually, under-utilising the information present in the sequencing data. Results: To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field (CRF) model that efficiently combines the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. Conclusions: We demonstrate that the CRF model yields significant improvements in accuracy and a more robust expectation-maximisation parameter estimation. True k-mers can be distinguished from erroneous k-mers with a higher F1 score than with existing methods. A C++11 implementation is available under the GNU AGPL v3.0 license.
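The gain from combining per-node coverage evidence with neighbour information can be sketched on two adjacent nodes: each node contributes a Poisson-style log-likelihood of its coverage given "lambda per genomic copy", and a pairwise factor rewards agreement between neighbours. Exhaustive search over a tiny state space stands in for proper CRF inference, and every number below is illustrative (the paper estimates such parameters via EM).

```python
import math

lam = 20.0                       # expected coverage per genomic copy
coverage = [22.0, 41.0]          # observed coverages of two adjacent nodes
agree_bonus = 1.0                # pairwise log-reward when |m1 - m2| <= 1

def log_poisson(x, mean):
    """Log Poisson likelihood of coverage x under the given mean."""
    return x * math.log(mean) - mean - math.lgamma(x + 1)

best, best_score = None, -math.inf
for m1 in range(1, 5):           # candidate multiplicities per node
    for m2 in range(1, 5):
        score = log_poisson(coverage[0], lam * m1)   # unary factor, node 1
        score += log_poisson(coverage[1], lam * m2)  # unary factor, node 2
        if abs(m1 - m2) <= 1:                        # pairwise factor
            score += agree_bonus
        if score > best_score:
            best, best_score = (m1, m2), score

print(best)   # coverage 22 -> one copy, 41 -> two copies
```

In the real model this joint assignment runs over the whole graph, which is why ambiguous nodes (e.g. coverage near 1.5 * lambda) get resolved by their neighbours rather than in isolation.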
We present a novel approach to inverse problems in imaging based on a hierarchical Bayesian-MAP (HB-MAP) formulation. In this paper we focus specifically on the difficult and basic inverse problem of multi-sensor (tomographic) imaging, wherein the source object of interest is viewed from multiple directions by independent sensors. Given the measurements recorded by these sensors, the problem is to reconstruct the image (of the object) with a high degree of fidelity. We employ a probabilistic graphical modeling extension of the compound Gaussian distribution as a global image prior within a hierarchical Bayesian inference procedure. Since the prior employed by our HB-MAP algorithm is general enough to subsume a wide class of priors, including those typically employed in compressive sensing (CS) algorithms, the HB-MAP algorithm offers a vehicle for extending the capabilities of current CS algorithms to truly global priors. After rigorously deriving the regression algorithm for solving our inverse problem from first principles, we demonstrate the performance of the HB-MAP algorithm on Monte Carlo trials and on real empirical data (natural scenes). In all cases we find that our algorithm outperforms previous approaches in the literature, including filtered back-projection and a variety of state-of-the-art CS algorithms. We conclude with directions for future research emanating from this work.
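To ground the MAP step: for a linear measurement model y = Ax + n with Gaussian noise and a plain Gaussian image prior, the MAP estimate reduces to ridge regression, x_hat = (A'A + tau I)^-1 A'y. The compound Gaussian prior in the paper replaces this simple prior; the toy 2-pixel, 3-measurement system below is purely illustrative.

```python
# MAP reconstruction for a tiny linear inverse problem (Gaussian prior).
def map_estimate(A, y, tau):
    """Solve (A'A + tau I) x = A'y in closed form for 2 unknowns."""
    n = len(A)
    ata = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(2)]
           for i in range(2)]
    aty = [sum(A[k][i] * y[k] for k in range(n)) for i in range(2)]
    m = [[ata[0][0] + tau, ata[0][1]],
         [ata[1][0], ata[1][1] + tau]]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]     # Cramer's rule
    return [(m[1][1] * aty[0] - m[0][1] * aty[1]) / det,
            (m[0][0] * aty[1] - m[1][0] * aty[0]) / det]

# Three "projections" of a 2-pixel object x = [2, 1]: each pixel alone,
# then their sum (a crude stand-in for a tomographic ray).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 1.0, 3.0]                      # noiseless measurements
x_hat = map_estimate(A, y, tau=1e-6)
print([round(x, 3) for x in x_hat])
```

The hierarchical part of HB-MAP alternates estimates like this with updates of the prior's latent scale variables, which is what lets a global, non-Gaussian prior be used while each inner step stays tractable.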