In this paper, we propose a new expectation-maximization (EM) algorithm which speeds up the training of feedforward networks with local activation functions such as the Radial Basis Function (RBF) network. The core o...
Unsupervised ensemble learning refers to methods devised for a particular task that combine data provided by decision learners, taking into account their reliability, which is usually inferred from the data. Here, the variant calling step of next-generation sequencing technologies is formulated as an unsupervised ensemble classification problem. A variant calling algorithm based on the expectation-maximization algorithm is further proposed that estimates the maximum-a-posteriori decision among a number of classes larger than the number of different labels provided by the learners. Experimental results with real human DNA sequencing data show that the proposed algorithm is competitive with state-of-the-art variant callers such as GATK, HTSLIB, and Platypus. (c) 2022 The Author(s). Published by Elsevier *** is an open access article under the CC BY-NC-ND license ( http://***/licenses/by-nc-nd/4.0/ )
Background: The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. The mixture model method (MMM hereafter) is a nonparametric statistical method often used for microarray processing applications, but it is known to over-fit the data if the number of replicates is small. In addition, the results of the MMM may not be repeatable when dealing with a small number of replicates. In this paper, we propose a new version of MMM that ensures the repeatability of the results across different runs and reduces the sensitivity of the results to the parameters. Results: The proposed technique is applied to two different data sets: a Leukaemia data set and a data set that examines the effects of a low phosphate diet on regular and Hyp mice. In each study, the proposed algorithm successfully selects genes closely related to the disease state, as verified by biological information. Conclusion: The results indicate 100% repeatability in all runs and exhibit very little sensitivity to the choice of parameters. In addition, the evaluation of the applied method on the Leukaemia data set shows a 12% improvement compared to the MMM in detecting the 50 biologically identified expressed genes reported by Thomas et al. The results attest to the successful performance of the proposed algorithm in quantitative pathogenesis of diseases and comparative evaluation of treatment methods.
Low-rank matrix factorization (LRMF) has gained much popularity owing to its successful applications in both computer vision and data mining. By assuming noise to come from a Gaussian, Laplace or mixture of Gaussian distributions, significant efforts have been made on optimizing the (weighted) L1- or L2-norm loss between an observed matrix and its bilinear factorization. However, the type of noise distribution is generally unknown in real applications, and inappropriate assumptions will inevitably degrade the behavior of LRMF. Moreover, real data are often corrupted by skewed rather than symmetric noise. To tackle this problem, this paper presents a novel LRMF model called AQ-LRMF that models noise with a mixture of asymmetric Laplace distributions. An efficient algorithm based on the expectation-maximization (EM) algorithm is also offered to estimate the parameters involved in AQ-LRMF. The AQ-LRMF model has the advantage that it can approximate noise well whether the real noise is symmetric or skewed. The core idea of AQ-LRMF lies in solving a weighted L1 problem with weights learned from data. The experiments conducted on synthetic and real data sets show that AQ-LRMF outperforms several state-of-the-art techniques. Furthermore, AQ-LRMF is also superior to the other algorithms in capturing local structural information contained in real images. (C) 2020 Elsevier Ltd. All rights reserved.
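The link between asymmetric Laplace noise and a weighted L1 problem can be illustrated with a minimal sketch. It assumes the common quantile parameterization of the asymmetric Laplace distribution; the abstract does not state which parameterization AQ-LRMF uses, and the names `check_loss`, `ald_pdf`, `mu`, `scale` and `tau` are hypothetical, chosen only for this illustration.

```python
import math

def check_loss(u, tau):
    """Asymmetric L1 loss: tau*|u| for u >= 0, (1 - tau)*|u| for u < 0."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def ald_pdf(x, mu=0.0, scale=1.0, tau=0.5):
    """Asymmetric Laplace density (quantile parameterization, an assumption here).

    The negative log-density is the check loss plus a constant, so maximizing
    the likelihood amounts to minimizing an L1 loss whose weight depends on
    the sign of the residual.
    """
    return tau * (1.0 - tau) / scale * math.exp(-check_loss((x - mu) / scale, tau))
```

With `tau = 0.5` the loss is symmetric (half the absolute value); any other value in (0, 1) penalizes positive and negative residuals differently, which is how skewed noise enters the model.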
Background: ChIP-chip data are routinely used to identify transcription factor binding targets. However, the presence of false positives and false negatives in ChIP-chip data complicates and hinders analyses, especially when the binding targets for a specific transcription factor are compared across conditions or species. Results: We propose an expectation-maximization-based approach to infer the underlying true counts of "positives" and "negatives" from the observed counts. Based on this approach, we study the effect of false positives and false negatives on inferences related to transcription regulation. Conclusion: Our results indicate that if there is a significant degree of association among the binding targets across conditions/species (log odds ratio > 4), moderate values of false positive and false negative rates (0.005 and 0.4 respectively) would not change our inference qualitatively (i.e. the presence or absence of conservation) based on the observed experimental data despite a significant change in the observed counts. However, if the underlying association is marginal, with odds ratios close to 1, moderate to large values of false positive and false negative rates (0.01 and 0.2 respectively) could mask the underlying association.
Background: The main limitations of most existing clustering methods used in genomic data analysis include heuristic or random algorithm initialization, the potential of finding poor local optima, the lack of cluster number detection, an inability to incorporate prior/expert knowledge, black-box and non-adaptive designs, in addition to the curse of dimensionality and the discernment of uninformative, uninteresting cluster structure associated with confounding variables. Results: In an effort to partially address these limitations, we develop the VIsual Statistical Data Analyzer (VISDA) for cluster modeling, visualization, and discovery in genomic data. VISDA performs progressive, coarse-to-fine (divisive) hierarchical clustering and visualization, supported by hierarchical mixture modeling, supervised/unsupervised informative gene selection, supervised/unsupervised data visualization, and user/prior knowledge guidance, to discover hidden clusters within complex, high-dimensional genomic data. The hierarchical visualization and clustering scheme of VISDA uses multiple local visualization subspaces (one at each node of the hierarchy) and consequent subspace data modeling to reveal both global and local cluster structures in a "divide and conquer" scenario. Multiple projection methods, each sensitive to a distinct type of clustering tendency, are used for data visualization, which increases the likelihood that cluster structures of interest are revealed. Initialization of the full dimensional model is based on first learning models with user/prior knowledge guidance on data projected into the low-dimensional visualization spaces. Model order selection for the high dimensional data is accomplished by Bayesian theoretic criteria and user justification applied via the hierarchy of low-dimensional visualization subspaces. Based on its complementary building blocks and flexible functionality, VISDA is generally applicable for gene clustering, sample clustering, and phenoty
Due to the complex structures and the multi-functionality of modern products, there are usually two or more performance characteristics which can reflect a product's degradation states. The degradation processes corresponding to these performance characteristics are generally dependent, which brings challenges to the degradation data analysis. In this paper, a gamma-process-based degradation model is developed for bivariate dependent degradation data, where the dependency between the two degradation processes is naturally captured by a common random effect. The expectation-maximization algorithm is employed to estimate the model parameters. Then, a real-time prediction method for a product's remaining useful life is proposed using the Bayesian method. Finally, both a simulation study and a case study are provided for illustration, and their results demonstrate that the proposed model and the corresponding inference methods work well.
Point set registration is a fundamental problem in many domains. This paper proposes a novel pair-wise registration algorithm based on rigid transformation consensus. It starts by building a point correspondence set, which contains both inliers and outliers. Because of non-overlapping regions, it associates each point correspondence with a latent variable and formulates pair-wise registration as a maximum likelihood estimation problem, which is optimized by the expectation-maximization algorithm. Since all inliers follow the consensus of one similar rigid transformation, each correspondence is assigned a posterior probability that indicates whether it is an inlier or an outlier. To obtain the desired result, the algorithm alternates between establishing point correspondences and maximum likelihood estimation. Given an initial rigid transformation, the proposed algorithm is able to obtain the desired result for pair-wise registration. Experiments on publicly available data sets illustrate its superior accuracy and efficiency over previous algorithms.
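The posterior assigned to each correspondence can be sketched as an E-step of a two-component mixture. This is only an illustration under stated assumptions, not the paper's model: inlier residuals are assumed to follow an isotropic 3-D Gaussian, outliers a constant density, and the names `inlier_posteriors`, `sigma`, `inlier_prior` and `outlier_density` are hypothetical.

```python
import math

def inlier_posteriors(residuals, sigma=0.05, inlier_prior=0.5, outlier_density=1.0):
    """E-step sketch: posterior probability that each correspondence is an inlier.

    residuals: distances ||R p + t - q|| under the current rigid transform
    (3-D points assumed). Inliers use an isotropic Gaussian, outliers a
    constant density; both choices are assumptions made for illustration.
    """
    norm = (2.0 * math.pi * sigma ** 2) ** -1.5  # 3-D Gaussian normalizer
    posteriors = []
    for r in residuals:
        p_in = inlier_prior * norm * math.exp(-r * r / (2.0 * sigma ** 2))
        p_out = (1.0 - inlier_prior) * outlier_density
        posteriors.append(p_in / (p_in + p_out))
    return posteriors
```

In an M-step, these posteriors would typically serve as weights when re-estimating the rigid transformation, so that correspondences with large residuals contribute little.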
Background: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. Results: We study several centroid-based algorithms for clustering sequences based on word counts. Study of their performance shows that using the k-means algorithm, with or without data whitening, is computationally efficient. A higher clustering accuracy can be achieved using the soft expectation-maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from GitHub: https://***/luscinius/afcluster. Conclusions: We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.
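Word-count clustering with soft assignment can be sketched as follows. This is a minimal illustration, not the afcluster implementation: the word length, the squared Euclidean distance, the exponential weighting and the names `word_frequencies`, `soft_assign` and `beta` are all assumptions made here.

```python
from collections import Counter
from itertools import product
import math

def word_frequencies(seq, k=2, alphabet="ACGT"):
    """Return the normalized k-mer (word) frequency vector of a sequence."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]

def soft_assign(vec, centroids, beta=50.0):
    """Soft membership: weight of each cluster is exp(-beta * squared distance)."""
    d2 = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    d_min = min(d2)  # subtract the minimum for numerical stability
    weights = [math.exp(-beta * (d - d_min)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

# Usage: a mostly AT-rich read is attributed to both clusters with probabilities.
at_rich = word_frequencies("ATATATATAT")
gc_rich = word_frequencies("GCGCGCGCGC")
probs = soft_assign(word_frequencies("ATATATGCAT"), [at_rich, gc_rich])
```

A hard k-means assignment would keep only the closest centroid; the soft variant keeps the full probability vector, which matches the abstract's description of attributing each sequence to each cluster with a specific probability.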
We develop a robust and fully unsupervised algorithm for the detection of action potentials from extracellularly recorded data. Using the continuous wavelet transform combined with probabilistic mixture models and Bayesian probability theory, the detection of action potentials is posed as a model selection problem. Our technique provides robust performance over a wide range of simulated conditions, and compares favorably to selected supervised and unsupervised detection techniques.