ISBN (print): 9781467395984
In this paper, we propose a novel dimension reduction algorithm that fuses Centroid-based feature selection with partial least squares (PLS) based feature extraction. The paper focuses on mining the potential information hidden in multiclass microarray data and interpreting the results that this information provides. First, a centroid concept is introduced to define the objective function of feature selection. To obtain a sparse solution, logistic regression with L1 regularization is incorporated into the objective function. Centroid-based feature selection is then proposed to solve the optimization problem and, by using the One-Versus-All (OVA) technique, is extended to multiclass problems. Second, we perform feature importance analysis on microarray data with Centroid-based feature selection to determine the informative feature subset (biomarkers). Finally, PLS-based feature extraction is conducted on the selected feature subset to extract the features that best reflect the nature of the classification. The proposed algorithm is compared with three state-of-the-art algorithms on eight multiclass microarray datasets. The experimental results demonstrate that the proposed algorithm performs effectively and is competitive. Furthermore, mining the potential information of the microarray datasets improves the interpretability of the results.
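The One-Versus-All decomposition and the centroid idea mentioned in this abstract can be illustrated with a minimal sketch. It does not reproduce the paper's objective function, which the abstract does not state. The helper names `ova_labels` and `class_centroids` and the toy expression values are assumptions made for illustration only.

```python
from collections import defaultdict

def ova_labels(labels):
    """Turn multiclass labels into one binary (+1 / -1) problem per class."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def class_centroids(samples, labels):
    """Mean expression vector of each class (one value per gene)."""
    groups = defaultdict(list)
    for x, y in zip(samples, labels):
        groups[y].append(x)
    return {
        c: [sum(col) / len(rows) for col in zip(*rows)]
        for c, rows in groups.items()
    }

# Toy data (hypothetical): 4 samples x 3 genes, three classes.
X = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [0.0, 5.0, 1.0], [4.0, 4.0, 4.0]]
y = ["A", "A", "B", "C"]

binary_problems = ova_labels(y)    # one binary label vector per class
centroids = class_centroids(X, y)  # one centroid per class
```

Each binary label vector could then be passed to an L1-regularized logistic regression, one model per class, which is the usual way an OVA scheme extends a binary selector to several classes.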
The classification of tissue samples based on gene expression data is an important problem in the medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data is said to be undersampled). To cope with the performance and accuracy problems associated with high dimensionality, it is commonplace to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear Discriminant Analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable to undersampled data because the scatter matrices in the underlying representation are singular. This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the Generalized Singular Value Decomposition to handle undersampled data, and the features that it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously, and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
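The singularity problem that motivates ULDA can be checked on a toy example. The sketch below is not ULDA itself and does not use the Generalized Singular Value Decomposition. It only builds the within-class scatter matrix that classical LDA would need to invert and shows that its rank falls below the number of features when samples are scarce. The function names and the toy data are assumptions made for illustration.

```python
from fractions import Fraction

def within_class_scatter(samples, labels):
    """Sum over classes of (x - class mean)(x - class mean)^T, in exact arithmetic."""
    d = len(samples[0])
    scatter = [[Fraction(0)] * d for _ in range(d)]
    for c in set(labels):
        rows = [x for x, y in zip(samples, labels) if y == c]
        mean = [sum(Fraction(v) for v in col) / len(rows) for col in zip(*rows)]
        for x in rows:
            diff = [Fraction(v) - m for v, m in zip(x, mean)]
            for i in range(d):
                for j in range(d):
                    scatter[i][j] += diff[i] * diff[j]
    return scatter

def rank(matrix):
    """Matrix rank by Gaussian elimination on exact fractions."""
    m = [row[:] for row in matrix]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

# Toy undersampled data (hypothetical): 4 samples, 5 genes, 2 classes.
X = [[1, 2, 0, 3, 1], [2, 0, 1, 1, 4], [0, 1, 3, 2, 2], [5, 1, 1, 0, 3]]
y = [0, 0, 1, 1]

S_w = within_class_scatter(X, y)
scatter_rank = rank(S_w)  # at most n_samples - n_classes, so below 5 here
```

Because the rank of the 5 x 5 scatter matrix cannot exceed the number of samples minus the number of classes, the matrix has no inverse, which is the situation the abstract describes for undersampled data.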
ISBN (print): 9781538682159
DNA microarrays enable the detection of genetic changes attributable to cancer by simultaneously analyzing the expression of thousands of genes. However, identifying the most relevant genes among the thousands of gene expressions available in each biological sample poses a great challenge for cancer classification. Although researchers have applied wrapper approaches based on Binary Particle Swarm Optimization (BPSO) to select the most relevant genes prior to cancer classification, these approaches did not achieve good classification accuracy because of premature convergence caused by the local stagnation problem. This paper proposes an improved BPSO (iBPSO) to tackle these issues. The proposed iBPSO-based wrapper is examined using Naive Bayes (NB), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM) classifiers with stratified 5-fold cross-validation. The proposed iBPSO demonstrated its efficacy in terms of classification accuracy and the number of selected genes in comparison to standard BPSO on six benchmark cancer microarray datasets. The proposed iBPSO also effectively escapes from local minima stagnation.
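For orientation, the baseline that the abstract compares against can be sketched as follows. This is a minimal version of standard BPSO with a sigmoid transfer function, not the iBPSO proposed in the paper, whose modifications the abstract does not describe. The parameter values, the velocity clamp, and `toy_fitness` are assumptions; in a real wrapper the fitness would be the cross-validated accuracy of a classifier trained on the selected genes.

```python
import math
import random

def bpso(fitness, n_bits, n_particles=10, n_iters=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard binary PSO that maximizes `fitness` over 0/1 masks."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_particles)]
    vel = [[0.0] * n_bits for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    history = [gbest_fit]
    for _ in range(n_iters):
        for i in range(n_particles):
            for j in range(n_bits):
                vel[i][j] = (w * vel[i][j]
                             + c1 * rng.random() * (pbest[i][j] - pos[i][j])
                             + c2 * rng.random() * (gbest[j] - pos[i][j]))
                vel[i][j] = max(-6.0, min(6.0, vel[i][j]))  # clamp the velocity
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))   # sigmoid transfer
                pos[i][j] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history

def toy_fitness(mask):
    """Hypothetical stand-in for classifier accuracy: genes 0 and 3 are informative."""
    hits = mask[0] + mask[3]
    return hits - 0.1 * sum(mask)  # small penalty for each selected gene

best_mask, best_fit, history = bpso(toy_fitness, n_bits=8)
```

The penalty term in `toy_fitness` mirrors the two criteria named in the abstract, accuracy and the number of selected genes, although how the paper weighs them is not stated.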
ISBN (print): 9781424429011
Biclustering is a key step in analyzing gene expression data: it identifies patterns in which a subset of genes is correlated across a subset of conditions. This paper proposes a new distance-based possibilistic biclustering algorithm (DPBC), in which the average distances between rows and between columns of the bicluster are minimized while the size of the bicluster is maximized, by computing the zeros of the derivative of an appropriate objective function. The proposed algorithm uses the possibilistic clustering paradigm, as does the existing possibilistic biclustering algorithm PBC. Whereas PBC is based on the residue, our approach is applicable to any accepted definition of distance between pairs of rows or columns. An experimental study on a human dataset and several artificial datasets with different noise levels shows that the DPBC algorithm can offer substantial improvements over previously proposed algorithms.
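The two quantities that DPBC is said to minimize can be computed directly for a candidate bicluster. The sketch below uses Euclidean distance, which is only one of the distance definitions the abstract allows, and it leaves out the possibilistic memberships and the objective function, which the abstract does not specify. The function names and the toy matrix are assumptions.

```python
import math
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all pairs of vectors (0.0 for fewer than two)."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def bicluster_distances(matrix, rows, cols):
    """Average row-to-row and column-to-column distance inside a bicluster."""
    sub = [[matrix[r][c] for c in cols] for r in rows]
    row_dist = mean_pairwise_distance(sub)
    col_dist = mean_pairwise_distance(list(zip(*sub)))
    return row_dist, col_dist

# Toy expression matrix (hypothetical): genes in rows, conditions in columns.
M = [[1.0, 1.0, 9.0],
     [1.0, 1.0, 0.0],
     [7.0, 2.0, 5.0]]

tight = bicluster_distances(M, rows=[0, 1], cols=[0, 1])        # coherent block
loose = bicluster_distances(M, rows=[0, 1, 2], cols=[0, 1, 2])  # whole matrix
```

The coherent 2 x 2 block has zero average distance in both directions, while the whole matrix does not. This is the trade-off between tightness and size that the abstract describes.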
Recent research on large-scale microarray analysis has explored the use of Relevance Networks to find networks of genes that are associated with each other in gene expression data. In this work, we compare Relevance Net...
Authors: Rosenfeld, S; Wang, T; Kim, Y; Milner, J
NCI Biometry Res Grp Div Canc Prevent Dept Hlth & Human Serv NIH Rockville MD 20892 USA
USDA Phytonutrients Lab Beltsville MD 20705 USA
NCI Div Canc Prevent Nutr Sci Res Grp Dept Hlth & Human Serv NIH Rockville MD 20892 USA
A computational model for the simulation of cDNA microarray experiments has been created. The simulation allows one to foresee the statistical properties of replicated experiments without actually performing them. We introduce a new concept, the so-called bio-weight, which reconciles the conflicting meanings of biological and statistical significance in microarray experiments. It is shown that, for a small sample size, the bio-weight is a more powerful criterion for the presence of a signal in microarray data than the standard approach based on the t test. Joint simulation of microarray and quantitative PCR data shows that the genes recovered by using the bio-weight have a better chance of being confirmed by PCR than those obtained by the t test technique. We also employ extreme value considerations to derive plausible cutoff levels for hypothesis testing.
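The baseline criterion in this comparison, the t test, reduces to a simple statistic per gene. The sketch below computes Welch's unequal-variance form, which is an assumption because the abstract does not say which variant was used. The bio-weight itself is not shown, since the abstract does not define it. The replicate values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic (undefined if both groups have zero variance)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical log-expression values for one gene, three replicates per condition.
treated = [2.0, 4.0, 6.0]
control = [1.0, 2.0, 3.0]

t_stat = welch_t(treated, control)
```

With only three replicates per group, the variance estimates in the denominator are noisy, which is one reason a t statistic can be an unreliable signal detector at small sample sizes.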
ISBN (print): 1424406234
Microarray data analysis is notoriously challenging because it involves a huge number of genes compared to only a limited number of samples. Gene selection, which aims to detect the most significantly differentially expressed genes under different categories of conditions, is both computationally and biologically interesting, and has become a central research focus in studies that use gene expression microarray technology. Despite many existing efforts, gene selection methods that can effectively identify biologically significant biomarkers while remaining computationally efficient are still needed. In this paper, a model-free greedy (MFG) gene selection method is proposed, which implements several intuitive heuristics but does not assume any statistical distribution of the expression data. The experimental results on three real microarray datasets show that the MFG method, combined with a Support Vector Machine (SVM) classifier or a k-Nearest Neighbor (KNN) classifier, is efficient and robust in identifying discriminatory genes.
ISBN (print): 9783319111278
Given the promoter sequence of a microRNA, we attempt to predict its expression using a regression model learnt from the expression levels of other microRNAs obtained through a microarray experiment. To our knowledge, this is the first study that evaluates the predictability of microRNA expression from sequence. The promising results encourage the use of the system as a supporting means for microarray missing data imputation or completing old experiments with new explorations.
ISBN (print): 9781479903566
Many methods have been proposed to identify informative subsets of genes in microarray studies in order to focus the research. For instance, the recently proposed binarization of consensus partition matrices (Bi-CoPaM) method has, amongst its various features, the ability to generate tight clusters of genes while leaving many genes unassigned to any cluster. We propose exploiting this particular feature by applying Bi-CoPaM to genome-wide microarray data from multiple datasets to generate more clusters than required. These clusters are then tightened so that most of their genes are left unassigned to any cluster and most of the clusters are left totally empty. The tightened clusters that remain non-empty include those genes that are consistently co-expressed in multiple datasets when examined by various clustering methods. An example of this is demonstrated in this paper for cyclic and acyclic genes, as well as for genes that are highly expressed and genes that are not. Thus, the results of our proposed approach cannot be reproduced by other methods for identifying gene periodicity or by other clustering methods.