Gene expression data are a key factor in the success of medical diagnosis, and two-stage classification methods have therefore been developed for processing microarray data. The first stage of such a method selects a pre-specified number of genes that are likely to be the most relevant to the occurrence of a disease and passes these genes to the second stage for classification. In this paper, we use four gene selection mechanisms and two classification tools to compose eight two-stage classification methods, and test these eight methods on eight microarray data sets to analyze their performance. The first interesting finding is that the genes chosen by different categories of gene selection mechanisms have less than half in common, yet yield classification accuracies that do not differ significantly. A subset-gene-ranking mechanism can improve classification accuracy, but its computational effort is much heavier. Whether the classification tool employed at the second stage should be accompanied by a dimension reduction technique depends on the characteristics of the data set. (c) 2006 Elsevier Ltd. All rights reserved.
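The two-stage structure described in this abstract can be illustrated with a minimal sketch. The abstract does not name the four gene selection mechanisms or the two classification tools, so the choices here are assumptions made purely for illustration: a signal-to-noise ranking score for stage one and a nearest-centroid classifier for stage two, with binary labels 0 and 1. All function names are hypothetical.

```python
from statistics import mean, pstdev


def rank_genes(X, y, k):
    """Stage 1 (assumed scoring rule): keep the k genes whose class means
    differ most relative to the within-class spread."""
    n_genes = len(X[0])
    scores = []
    for g in range(n_genes):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        spread = (pstdev(a) + pstdev(b)) or 1e-12  # guard against zero spread
        scores.append(abs(mean(a) - mean(b)) / spread)
    order = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return order[:k]


def fit_centroids(X, y, genes):
    """Stage 2 (assumed classifier): one centroid per class, computed on
    the selected genes only."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[g] for r in rows) for g in genes]
    return centroids


def predict(centroids, genes, sample):
    """Assign the sample to the class with the closest centroid
    (squared Euclidean distance over the selected genes)."""
    def sq_dist(centroid):
        return sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))
```

A typical call sequence would be `genes = rank_genes(X, y, k)`, then `centroids = fit_centroids(X, y, genes)`, then `predict(centroids, genes, new_sample)`, where `X` is a list of samples and each sample is a list of expression values.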
A biclustering algorithm, based on a greedy technique and enriched with a local search strategy to escape poor local minima, is proposed. The algorithm starts with an initial random solution and searches for a locally optimal solution by successive transformations that improve a gain function. The gain function combines the mean squared residue, the row variance, and the size of the bicluster. Different strategies to escape local minima are introduced and compared. Experimental results on several microarray data sets show that the method is able to find significant biclusters, also from a biological point of view. (c) 2007 Elsevier Inc. All rights reserved.
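The three ingredients of the gain function named in this abstract can be computed as in the sketch below. It assumes the usual definitions of mean squared residue and row variance from the biclustering literature; the abstract does not state how the three quantities are weighted or combined, so the gain function itself is not reproduced. The function name and the list-of-lists data layout are hypothetical.

```python
def bicluster_stats(matrix, rows, cols):
    """Return (mean squared residue, row variance, size) of the bicluster
    defined by the given row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    size = n_rows * n_cols

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows

    # Residue of a cell: deviation from the additive row + column model.
    msr = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows) for j in range(n_cols)
    ) / size

    # Row variance: average squared deviation of each cell from its row mean.
    row_var = sum(
        (sub[i][j] - row_means[i]) ** 2
        for i in range(n_rows) for j in range(n_cols)
    ) / size

    return msr, row_var, size
```

A low mean squared residue indicates that rows and columns fluctuate coherently, while a high row variance rules out trivial, nearly constant biclusters, which is presumably why a gain function would reward the second and penalize the first.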
Missing values in microarray data can significantly affect subsequent analysis, so it is important to estimate them accurately. In this paper, a sequential local least squares imputation (SLLSimpute) method is proposed to solve this problem. It estimates missing values sequentially, starting from the gene containing the fewest missing values, and partially reuses these estimated values in later estimates. In addition, an automatic parameter selection algorithm, which can generate an appropriate number of neighboring genes for each target gene, is presented for parameter estimation. Experimental results confirmed that the SLLSimpute method exhibited better estimation ability than other currently used imputation methods. (C) 2008 Elsevier Ltd. All rights reserved.
In microarray data analysis, each gene expression sample has thousands of genes, and reducing such high dimensionality is useful for both visualization and further clustering of samples. Traditional principal component analysis (PCA) is a commonly used method, but it has problems. Nonnegative matrix factorization (NMF) is a new dimension reduction method. In this paper we compare NMF and PCA for dimension reduction. The reduced data are used for visualization and for clustering analysis via k-means on 11 real gene expression datasets. Before the clustering analysis, we apply NMF and PCA to reduce the data for visualization. The results on one leukemia dataset show that NMF can discover natural clusters and clearly detect one mislabeled sample, while PCA cannot. For clustering analysis via k-means, NMF typically outperforms PCA. Our results demonstrate the superiority of NMF over PCA in reducing microarray data. (C) 2007 Elsevier Inc. All rights reserved.
ISBN: (print) 9781424417391
Feature selection has been used widely for a variety of data, yielding higher speeds and reduced computational cost for the classification process. However, it is in microarray datasets that its advantages become more evident and are more needed. In this paper we present a novel approach to feature selection based on the concept of discernibility, which we introduce to describe how well separated the classes of a dataset are. We develop and test two independent feature selection methods that follow this approach. The results of our experiments on four microarray datasets show that discernibility-based feature selection reduces the dimensionality of the datasets involved without compromising the performance of the classifiers.
For cancer prediction using large-scale gene expression data, it often helps to incorporate gene interactions in the model. However, it is not straightforward to simultaneously select important genes while modeling gene interactions. Some heuristic approaches have been proposed in the literature. In this paper, we study a unified modeling approach based on l1-penalized likelihood estimation that can simultaneously select important genes and model gene interactions. We illustrate its competitive performance through simulation studies and applications to public microarray data. (c) 2012 Elsevier Ltd. All rights reserved.
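For orientation, a penalized likelihood of this kind with pairwise interaction terms can be written generically as below. This is only a sketch of the general technique: the logistic model, the symbols, and the single shared tuning parameter \lambda are assumptions and are not taken from the paper.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \eta_i - \log\big(1 + e^{\eta_i}\big) \Big]
  + \lambda \Big( \sum_{j} |\beta_j| + \sum_{j<k} |\beta_{jk}| \Big),
\qquad
\eta_i = \beta_0 + \sum_{j} \beta_j x_{ij} + \sum_{j<k} \beta_{jk}\, x_{ij} x_{ik}
```

Here x_{ij} would be the expression of gene j in sample i and y_i the binary class label. The absolute-value penalty drives many coefficients exactly to zero, which is what allows selection of genes and of interactions within a single fit.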
Several biclustering algorithms have been proposed in different fields of microarray data analysis. We present a new approach that improves their performance by using ensemble methods. Ensemble biclustering is formalized as a binary triclustering problem, and we propose a simple and efficient algorithm to solve it. To illustrate the value of our ensemble approach, numerical experiments are performed on both artificial and real datasets with two biclustering algorithms commonly used in bioinformatics. (C) 2012 Elsevier Ltd. All rights reserved.
ISBN: (print) 9783540785675
In microarray data, clustering is the fundamental task for separating genes into biologically functional groups or for classifying tissues and phenotypes. Recently, with innovative gene expression microarray technologies, the expression levels of thousands of genes (features) can be measured simultaneously in a single experiment. The large number of genes, many of them noisy, makes cluster analysis highly complex. This challenge has raised the demand for feature selection, an effective dimensionality reduction technique that removes noisy features. In this paper we propose a novel filter method for feature selection. The suggested method, called ClosestFS, is based on a distance measure. For each feature, the distance is evaluated by computing its impact on the histogram for the whole data. Our experimental results show that the quality of the clustering results (evaluated by several widely used measures) of the K-means algorithm using ClosestFS as a pre-processing step is significantly better than that of pure K-means.
Lymph node metastasis is an important prognostic factor in oral squamous cell carcinoma. However, the lack of significant biomarkers for lymph node metastasis can cause patients to be inappropriately treated and lead to a poor prognosis. Therefore, there is a need to identify gene sets that are associated with lymph node metastasis. In this study, we used three expression datasets obtained from a public database and selected candidate gene sets related to lymph node metastasis from two datasets and a combined dataset. We evaluated the selected gene sets using out-of-bag (OOB) error rates in a validation dataset. The gene set detected from the combined dataset classified the lymph node status more accurately in the validation dataset, and clear expression patterns classifying the lymph node status based on chromosomal location were observed. The combined dataset holds promise as a source of a more accurate candidate gene set for the diagnosis of lymph node metastasis, and the selected gene set could be used for biological validation in further studies. (C) 2011 Elsevier Ltd. All rights reserved.
The main aim of this work is the study of clustering dependent data by means of copula functions. Copulas are popular multivariate tools whose importance within clustering methods has not yet been investigated in detail. We propose a new algorithm (CoClust in brief) that clusters dependent data according to the multivariate structure of the generating process without any assumption on the margins. Moreover, the approach requires neither choosing a starting classification nor setting the number of clusters a priori; in fact, CoClust selects them by using a criterion based on the log-likelihood of a copula fit. We test our proposal on simulated data for different dependence scenarios and compare it with a model-based clustering technique. Finally, we show applications of CoClust to real microarray data of breast-cancer patients.