Learning large-scale networks of interactions from microarray data is an important and challenging problem in bioinformatics. A widely used approach is to assume that the available data constitute a random sample from a multivariate distribution belonging to a Gaussian graphical model. As a consequence, the prime objects of inference are full-order partial correlations, which are partial correlations between two variables given all the remaining ones. In the context of microarray data, the number of variables exceeds the sample size, and this precludes the application of traditional structure learning procedures because a sampling version of full-order partial correlations does not exist. In this paper we consider limited-order partial correlations, that is, partial correlations computed on marginal distributions of manageable size, and provide a set of rules that allow one to assess the usefulness of these quantities for deriving the independence structure of the underlying Gaussian graphical model. Furthermore, we introduce a novel structure learning procedure based on a quantity, obtained from limited-order partial correlations, that we call the non-rejection rate. The applicability and usefulness of the procedure are demonstrated on both simulated and real data.
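To make the notion of a limited-order partial correlation concrete, the following is a minimal sketch of the simplest case, a first-order partial correlation (conditioning on a single variable), computed with the standard textbook formula. The function names `pearson` and `partial_corr_first_order` and the toy data are illustrative assumptions; this is not the paper's procedure and does not compute the non-rejection rate.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr_first_order(x, y, z):
    """Partial correlation of x and y given the single variable z.

    Standard formula: (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    Assumes neither x nor y is perfectly correlated with z.
    """
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data (hypothetical): x and y are both driven by z plus unrelated noise.
z = [-2, -1, 0, 1, 2]
x = [-1, -3, 0, 3, 1]
y = [-1, -1, -2, 1, 3]
marginal = pearson(x, y)                     # clearly positive
partial = partial_corr_first_order(x, y, z)  # essentially zero
```

In this toy example the marginal correlation between `x` and `y` is positive, while the partial correlation given `z` vanishes, which is the kind of evidence for a missing edge that a graphical-model analysis looks for.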
Microarray data contain expression values for a large number of genes but only a small number of samples. Selecting genes from microarray data is therefore an extremely challenging and important step in analyzing the biological behavior of features. In this context, differential evolution (DE) with a dynamic scaling factor is combined with a multi-layer perceptron (MLP) to select genes from the pathway information of microarray data. First, DE is employed to select a small set of relevant genes. Then an MLP is used to build a classifier over the selected genes. A suitable and efficient vector representation is designed for DE. The fitness function is defined in three separate ways: as the T-score, as the classification accuracy, and as a weighted sum of the two. Simulation results are analyzed in terms of sensitivity, specificity, accuracy and F-score. Moreover, statistical and biological analyses are also conducted.
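The four evaluation measures named above can be computed from a binary confusion matrix. The sketch below is a hedged illustration only: the function name `classification_metrics` and the example counts are hypothetical, and the F-score is assumed to be the usual F1 (harmonic mean of precision and sensitivity), which the abstract does not specify.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, accuracy and F1 from confusion counts.

    Assumes every denominator is non-zero (at least one positive sample,
    one negative sample, and one predicted positive).
    """
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
```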
Data mining allows users to discover novelty in huge amounts of data. Frequent pattern methods have proved to be efficient, but the extracted patterns are often too numerous and thus difficult for end users to analyze. In this paper, we focus on sequential pattern mining and propose a new visualization system to help end users analyze the extracted knowledge and to highlight novelty according to databases of referenced biological documents. Our system is based on three visualization techniques: clouds, solar systems, and treemaps. We show that these techniques are very helpful for identifying associations and hierarchical relationships between patterns among related documents. Sequential patterns extracted from gene data using our system were successfully evaluated by two biology laboratories working on Alzheimer's disease and cancer. (C) 2011 Elsevier Inc. All rights reserved.
Microarray data analysis is a major line of research in bioinformatics. A significant trend in bioinformatics is identifying genes or gene groups that differentiate diseased tissues. Classification is necessary to make microarray data useful for applications in medicine and in related research such as disease diagnosis. Classification models have been developed using statistical methods such as logistic and multi-normal regression for data mining. However, real-world classification problems, such as those in the medical domain, are complex and high-dimensional, and general statistical methods are inadequate for them. This study proposes simplified swarm optimization (SSO), an efficient methodology for discovering breast cancer classification rules. The data set was derived from the Stanford microarray database. The proposed approach enables simultaneous feature selection and pattern recognition. Experimental results indicate that SSO outperforms general data mining methods such as decision trees, neural networks, and support vector machines. The proposed approach has potential applications in hospital decision-making and in research such as predictive medicine.
This paper focuses on the stability-based approach for estimating the number of clusters K in microarray data. The cluster stability approach amounts to performing clustering successively over random subsets of the available data and evaluating an index which expresses the similarity of the successive partitions obtained. We present a method for automatically estimating K by starting from the distribution of the similarity index. We investigate how the choice of the hierarchical clustering (HC) method and of the similarity index influences the estimation accuracy. The paper introduces a new similarity index based on a partition distance. The performance of the new index and that of other well-known indices are experimentally evaluated by comparing the "true" data partition with the partition obtained at each level of an HC tree. A case study is conducted with a publicly available Leukemia dataset.
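As an illustration of what a partition similarity index looks like, here is a minimal sketch of the Rand index, one well-known index of this kind. It is not the new partition-distance index introduced in the paper, whose definition is not given in the abstract; the function name `rand_index` and the example labelings are assumptions for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index between two partitions of the same items.

    Counts the fraction of item pairs on which the two partitions agree:
    both place the pair in the same cluster, or both separate it.
    Requires at least two items.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total

# Identical partitions up to a relabeling of the clusters.
identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Partitions that cut the items in "orthogonal" ways.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In a stability analysis, an index like this would be evaluated on pairs of partitions obtained from different random subsets, restricted to the items the subsets share, and the distribution of the resulting values would be compared across candidate values of K.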
An important issue in the analysis of gene expression microarray data is the extraction of valuable genetic interactions from high-dimensional data sets containing gene expression levels collected for a small sample of assays. Past and ongoing research efforts have focused on biomarker selection for phenotype classification. Usually, many genes convey useless information for classifying the outcome and should be removed from the analysis; on the other hand, some of them may be highly correlated, which reveals the presence of redundant expression information. In this paper we propose a method for the selection of highly predictive genes having a low redundancy in their expression levels. The predictive accuracy of the selection is assessed by means of Classification and Regression Trees (CART) models, which enable assessment of the performance of the selected genes for classifying the outcome variable and also uncover complex genetic interactions. The method is illustrated throughout the paper using a public domain colon cancer gene expression data set. (C) 2013 Elsevier Ltd. All rights reserved.
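The idea of keeping predictive genes while discarding redundant ones can be sketched with a simple greedy correlation filter. This is a generic stand-in, not the method proposed in the paper: the function names, the relevance scores, the threshold of 0.9, and the toy expression profiles are all hypothetical.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_low_redundancy(expression, relevance, max_abs_corr=0.9):
    """Greedily pick genes by decreasing relevance, skipping redundant ones.

    expression: dict mapping gene name -> list of expression levels.
    relevance:  dict mapping gene name -> predictive score (higher is better).
    A gene is skipped if its absolute correlation with any already selected
    gene exceeds max_abs_corr.
    """
    selected = []
    for gene in sorted(relevance, key=relevance.get, reverse=True):
        redundant = any(
            abs(pearson(expression[gene], expression[kept])) > max_abs_corr
            for kept in selected
        )
        if not redundant:
            selected.append(gene)
    return selected

# Hypothetical toy data: g2 is an exact multiple of g1, so it is redundant.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [1, -1, 1, -1],
}
relevance = {"g1": 0.9, "g2": 0.8, "g3": 0.5}
chosen = select_low_redundancy(expression, relevance)
```

Here `g2` is dropped because it duplicates the information in `g1`, while the less relevant but weakly correlated `g3` is kept.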
Extensive studies have shown that mining microarray data sets is important in bioinformatics research and biomedical applications. In this paper, we explore a novel type of gene-sample-time microarray data set that records the expression levels of various genes under a set of samples during a series of time points. In particular, we propose the mining of coherent gene clusters from such data sets. Each cluster contains a subset of genes and a subset of samples such that the genes are coherent on the samples along the time series. The coherent gene clusters may identify the samples corresponding to some phenotypes (e.g., diseases), and suggest the candidate genes correlated to the phenotypes. We present two efficient algorithms, namely the Sample-Gene Search and the Gene-Sample Search, to mine the complete set of coherent gene clusters. We empirically evaluate the performance of our approaches on both a real microarray data set and synthetic data sets. The test results show that our approaches are both efficient and effective in finding meaningful coherent gene clusters.
Interesting biological information, such as gene expression data (microarrays), can be extracted from publicly available genomic data. As a starting point for narrowing down the large number of possible wet lab experiments, global high-throughput data and available knowledge should be used to infer biological knowledge and generate biological hypotheses. Here, based on microarray data, we propose the use of clustering and classification methods that have become very popular and are implemented in freely available software in order to predict the participation in virulence mechanisms of different proteins coded by genes of the pathogen Streptococcus pyogenes. Confidence in the predictions is based on classification errors for known genes and on repeated prediction by more than three methods. Special emphasis is placed on the nonlinear kernel classification methods used. We propose a list of interesting candidates that could be virulence factors or that participate in the virulence process of S. pyogenes. Biological validation should start from this list of candidates, as they show similar behavior to known virulence factors.
Accurate classification of microarray data is very important for medical decision making. Past studies have shown that class-conditional independent component analysis (CC-ICA) is capable of improving the performance of the naive Bayes classifier in microarray data analysis. However, when a microarray dataset has a small number of samples for some classes, the application of CC-ICA may become infeasible. This paper extends CC-ICA and proposes a partition-conditional independent component analysis (PC-ICA) method for naive Bayes classification of microarray data. Compared to ICA and CC-ICA, PC-ICA represents an in-between concept for feature extraction. Our experimental results on two microarray datasets show that PC-ICA is more effective than ICA in improving the performance of naive Bayes classification of microarray data. (C) 2010 Elsevier Ltd. All rights reserved.
Developments in microarray technology create opportunities in bioinformatics and make it possible to diagnose cancer at the level of gene expression. Many adverse factors, such as a small number of samples with high-dimensional characteristics and class imbalance in the data, pose challenges to traditional machine learning methods. Numerous researchers have worked on these problems and obtained significant achievements. This paper describes the data sets used in such studies, summarizes the approaches for cancer diagnosis based on microarray data, and provides an outlook on future research directions.