Computational verb (CV) theory is a relatively new research field in mathematics and has been applied to many different fields. In pattern recognition, the CV-based rule induction algorithm can generate simple rules with CVs and adverbs in linguistically interpretable forms. In this paper, we present an interpretable rule extraction framework based on CV rule (CVR) theory for the classification of microarray data. In contrast to existing rule-based methods, the CV method makes it possible to express the relationships between genes explicitly through mathematical templates and hence improves the understanding of the results. Stay is a typical verb used in CV theory to describe the trend of changes. In our algorithm, Stay is applied to generate a CVR from a gene pair, named SCVR. The corresponding evolving and similarity functions for calculating the difference between SCVR rules are also presented to illustrate this process. Like other rule-based methods, SCVR can perform significant gene selection and cancer classification concurrently. To evaluate the performance of the proposed approach, we conduct experiments on several binary-class and multiclass microarray datasets. The experiments confirm that the proposed method can outperform many rule-based classifiers with the fusion of five rules.
An important goal of microarray studies is the detection of genes that show significant changes in expression when two classes of biological samples are being compared. We present an ANOVA-style mixed model with parameters for array normalization, overall level of gene expression, and change of expression between the classes. For the latter we assume a mixing distribution with a probability mass concentrated at zero, representing genes with no changes, and a normal distribution representing the level of change for the other genes. We estimate the parameters by optimizing the marginal likelihood. To make this practical, Laplace approximations and a backfitting algorithm are used. The performance of the model is studied by simulation and by application to publicly available data sets.
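The mixing distribution described in the abstract above can be written compactly. The notation here is an assumption for illustration and is not taken from the paper: $\beta_g$ is the change of expression for gene $g$, $\pi_0$ is the proportion of genes with no change, $\delta_0$ is a point mass at zero, and $\mu$ and $\tau^2$ are the mean and variance of the normal component.

```latex
\beta_g \sim \pi_0\,\delta_0 + (1 - \pi_0)\,\mathcal{N}(\mu, \tau^2),
\qquad 0 \le \pi_0 \le 1
```

Under this reading, a gene is drawn from the "no change" component with probability $\pi_0$ and from the normal component otherwise; whether the paper fixes $\mu = 0$ is not stated in the abstract.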
Background: A common task in microarray data analysis is to identify informative genes that are differentially expressed between two different states. Owing to the high-dimensional nature of microarray data, identification of significant genes has been essential in analyzing the data. However, the performance of many gene selection techniques is highly dependent on the experimental conditions, such as the presence of measurement error or a limited number of sample replicates. Results: We have proposed new filter-based gene selection techniques by applying a simple modification to significance analysis of microarrays (SAM). To assess the effectiveness of the proposed methods, we considered a series of synthetic datasets with different noise levels and sample sizes along with two real datasets. The following findings were made. First, our proposed methods outperform conventional methods for all simulation set-ups. In particular, our methods are much better when the given data are noisy and the sample size is small. They showed relatively robust performance regardless of noise level and sample size, whereas the performance of SAM became significantly worse as the noise level increased or the sample size decreased. When sufficient sample replicates were available, SAM and our methods showed similar performance. Finally, our proposed methods are competitive with traditional methods in classification tasks for microarrays. Conclusions: The results of the simulation study and real data analysis have demonstrated that our proposed methods are effective for detecting significant genes and for classification tasks, especially when the given data are noisy or have few sample replicates. By employing weighting schemes, we can obtain robust and reliable results for microarray data analysis.
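For context on the baseline named in the abstract above, here is a minimal sketch of the standard SAM relative-difference statistic for one gene. It shows only the unmodified statistic; the paper's modification and weighting schemes are not described in the abstract and are not reproduced. The function name `sam_statistic` is hypothetical, and the fudge factor `s0` is passed in directly as an assumption (SAM itself chooses it from the distribution of per-gene standard errors).

```python
import math


def sam_statistic(group_a, group_b, s0=0.0):
    """Standard SAM relative difference d = (mean_a - mean_b) / (s + s0).

    group_a, group_b: expression values of one gene in the two states.
    s0: small positive fudge factor (assumed given here) that keeps genes
        with tiny variance from producing very large statistics.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Pooled within-group sum of squares.
    ss_a = sum((x - mean_a) ** 2 for x in group_a)
    ss_b = sum((x - mean_b) ** 2 for x in group_b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    # Gene-specific standard error of the mean difference.
    s = math.sqrt((1.0 / n_a + 1.0 / n_b) * pooled_var)
    return (mean_a - mean_b) / (s + s0)
```

With `s0 = 0` this reduces to the ordinary two-sample t-statistic; a larger `s0` shrinks the statistic, which matters most when few replicates make the variance estimate unstable.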
Feature selection is often required as a preliminary step for many pattern recognition problems. However, most of the existing algorithms only work in a centralized fashion, i.e. using the whole dataset at once. In this research a new method for distributing the feature selection process is proposed. It distributes the data by features, i.e. according to a vertical distribution, and then performs a merging procedure which updates the feature subset according to improvements in the classification accuracy. The effectiveness of our proposal is tested on microarray data, which poses a difficult challenge for researchers because of the large number of gene expression values and the small sample size. The results on eight microarray datasets show that the execution time is considerably shortened whereas the performance is maintained or even improved compared to the standard algorithms applied to the non-partitioned datasets. (C) 2015 Elsevier B.V. All rights reserved.
ISBN: (print) 9781538626528
Microarray expression data contain observations from thousands of genes across hundreds of samples. To extract meaningful information from these large datasets, the dimensionality reduction technique known as non-negative matrix factorization, or NMF, is introduced. This tool transforms the data and makes it more amenable to clustering. NMF was applied to a yeast microarray dataset. Three main clusters were discovered, corresponding to three distinct metabolic cycles. The data were also clustered using the k-means algorithm, and the clustering result was highly similar to that obtained by NMF.
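The abstract above does not say which NMF algorithm was used, so the following is only a sketch under stated assumptions: it uses the widely known multiplicative update rules for the squared-error objective, factors a genes-by-samples matrix `v` into non-negative `w` and `h`, and assigns each sample to the factor with the largest coefficient. All function names are hypothetical, and the pure-Python matrix helpers are for illustration on small inputs only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (genes x samples) as W (genes x rank) times H (rank x samples)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    # Strictly positive random start so that no entry is stuck at zero initially.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


def cluster_labels(h):
    """Assign each sample (column of H) to the factor with the largest coefficient."""
    return [max(range(len(h)), key=lambda k: h[k][j]) for j in range(len(h[0]))]
```

Calling `nmf(v, 3)` followed by `cluster_labels(h)` would give a three-cluster assignment comparable in spirit to the one described; results depend on the random start, so several seeds are usually compared in practice.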
Background and Objective: The limited number of samples and high-dimensional features in microarray data make selecting a small number of features for disease diagnosis a challenging problem. Traditional feature selection methods based on evolutionary algorithms struggle to find the optimal set of features in a limited time when dealing with the high-dimensional feature selection problem. New solutions are proposed to solve the above problems. Methods: In this paper, we propose a hybrid feature selection method (C-IFBPFE) for biomarker identification in microarray data, which combines clustering and improved binary particle swarm optimization while incorporating an embedded feature elimination strategy. Firstly, an adaptive redundant feature judgment method based on correlation clustering is proposed for feature screening to reduce the search space in the subsequent stage. Secondly, we propose an improved flipping probability-based binary particle swarm optimization (IFBPSO), which is better suited to the binary particle swarm optimization problem. Finally, we also design a new feature elimination (FE) strategy embedded in the binary particle swarm optimization algorithm. This strategy gradually removes poorer features during iterations to reduce the number of features and improve accuracy. Results: We compared C-IFBPFE with other published hybrid feature selection methods on eight public datasets and analyzed the impact of each improvement. The proposed method outperforms other current state-of-the-art feature selection methods in terms of accuracy, number of features, sensitivity, and specificity. The ablation study validates the efficacy of each component; in particular, the proposed feature elimination strategy significantly improves the performance of the algorithm. Conclusions: The hybrid feature selection method proposed in this paper helps address the issue of high-dimensional microarray data with few samples. It can select a small subset of
The general purpose of clustering analysis of microarray data is to organize the data into meaningful groups based on their closeness. Although various algorithms have been proposed for the clustering of microarray data, the main difficulty remains the determination of the optimal number of clusters. To complicate the problem further, meaningful groups and closeness cannot be well defined because of the fuzzy nature of the data. This paper proposes a dynamic validity index to overcome this problem. In addition to the dynamic aspects, the proposed index also accounts for both the intra- and the inter-cluster distances. An algorithm based on the proposed dynamic validity index and the traditional K-means method was developed. To make the proposed dynamic validity index more flexible, a modulating parameter gamma is introduced. This parameter can be used to handle noisy data and to balance the importance of compactness and separation of the clusters. To illustrate the effectiveness of the approach, a numerical example using human serum data from the literature was solved, and the sensitivity and robustness of the approach were examined. (c) 2004 Elsevier Inc. All rights reserved.
Purpose - Building a categorization response model from gene expression patterns has become one of the most promising uses of microarray technology. In this study, the aim is to propose a grid computing-based meta-evolutionary mining approach as a categorization response model for gene selection and cancer classification. Design/methodology/approach - The proposed approach is based on the grid computing infrastructure for establishing the best attribute set selected from large microarray data. The novel discriminant analysis is based on the vector distant of median method as the evaluation function of the meta-evolutionary mining approach. In this study, the proposed approach focuses on finding the best attribute set for constructing a categorization response model with the highest categorization accuracy. Findings - Several benchmark cancer microarray data sets were used to evaluate the proposed approach, and the results are compared with other approaches in the literature. Experimental results from four benchmark problems indicate that the proposed approach works effectively and efficiently, and that its results are superior or comparable to those of other existing methods in the literature. Originality/value - The novel discriminant analysis is based on the vector distant of median method as the evaluation function of the meta-evolutionary mining approach to discover the best feature subset automatically from the microarray tumor database. In this study, the proposed approach focuses on finding the best attribute set for constructing a categorization response model with the highest categorization accuracy.
Microarray data suffer from missing values for various reasons, including insufficient resolution, image noise, and experimental errors. Because missing values can hinder downstream analysis steps that require complete data as input, it is crucial to be able to estimate the missing values. In this study, we propose a Global Learning with Local Preservation method (GL2P) for imputation of missing values in microarray data. GL2P consists of two components: a local similarity measurement module and a global weighted imputation module. The former uses a local structure preservation scheme to exploit as much information as possible from the observable data, and the latter is responsible for estimating the missing values of a target gene by considering all of its neighbors rather than a subset of them. Furthermore, GL2P imputes the missing values in ascending order according to the rate of missing data for each target gene to fully utilize previously estimated values. To validate the proposed method, we conducted extensive experiments on six benchmark microarray datasets. We compared GL2P with eight state-of-the-art imputation methods in terms of four performance metrics. The experimental results indicate that GL2P outperforms its competitors in terms of imputation accuracy and better preserves the structure of differentially expressed genes. In addition, GL2P is less sensitive to the number of neighbors than other local learning-based imputation methods. (C) 2016 Elsevier Ltd. All rights reserved.
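Two ideas from the abstract above, imputing genes in ascending order of missing rate and weighting all other genes instead of a fixed subset, can be illustrated with a much simpler scheme. This is not GL2P: the inverse mean-squared-distance weights, the `eps` constant, and the function name `impute_missing` are all assumptions made for the sketch, and the paper's local structure preservation scheme is not reproduced.

```python
def impute_missing(matrix, eps=1e-6):
    """Fill None entries in a genes x samples matrix; returns a new matrix.

    Assumed weighting: every other gene contributes with weight
    1 / (mean squared distance over co-observed samples + eps).
    """
    data = [list(row) for row in matrix]  # work on a copy
    n_cols = len(data[0])
    # Genes with the fewest missing values go first, so their estimates
    # are available as donor values for sparser genes later on.
    order = sorted(range(len(data)), key=lambda g: sum(v is None for v in data[g]))
    for g in order:
        missing_cols = [j for j in range(n_cols) if data[g][j] is None]
        if not missing_cols:
            continue
        # Similarity of every other gene to the target, from shared observations.
        weights = {}
        for other in range(len(data)):
            if other == g:
                continue
            shared = [j for j in range(n_cols)
                      if data[g][j] is not None and data[other][j] is not None]
            if not shared:
                continue
            msd = sum((data[g][j] - data[other][j]) ** 2 for j in shared) / len(shared)
            weights[other] = 1.0 / (msd + eps)
        for j in missing_cols:
            donors = [(wt, data[o][j]) for o, wt in weights.items()
                      if data[o][j] is not None]
            if donors:  # leave the gap if no gene has a value in this sample
                total = sum(wt for wt, _ in donors)
                data[g][j] = sum(wt * val for wt, val in donors) / total
    return data
```

Because every gene with a value in the target sample contributes, there is no neighbor count to tune in this sketch; dissimilar genes simply receive very small weights.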
This paper introduces a novel method for gene selection based on a modification of the analytic hierarchy process (AHP). The modified AHP (MAHP) is able to deal with quantitative factors that are statistics of five individual gene ranking methods: two-sample t-test, entropy test, receiver operating characteristic curve, Wilcoxon test, and signal-to-noise ratio. The most prominent discriminant genes serve as inputs to a range of classifiers including linear discriminant analysis, k-nearest neighbors, probabilistic neural network, support vector machine, and multilayer perceptron. Gene subsets selected by MAHP are compared with those of four competing approaches: information gain, symmetrical uncertainty, Bhattacharyya distance and ReliefF. Four benchmark microarray datasets (diffuse large B-cell lymphoma, leukemia cancer, prostate and colon) are used for the experiments. As the number of samples in microarray datasets is limited, the leave-one-out cross-validation strategy is applied rather than traditional cross-validation. Experimental results demonstrate the significant dominance of the proposed MAHP over the competing methods in terms of both accuracy and stability. With the benefit of inexpensive computational cost, MAHP is useful for cancer diagnosis using DNA gene expression profiles in real clinical practice. (C) 2015 Elsevier B.V. All rights reserved.
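The leave-one-out strategy mentioned in the abstract above can be sketched as follows. The 1-nearest-neighbor classifier is a stand-in chosen for brevity, not one of the paper's configurations, and both function names are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training sample closest to query (squared Euclidean)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)))
    return train_y[best]


def leave_one_out_accuracy(samples, labels):
    """Hold out each sample once, train on the rest, and report the fraction predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        # Training set is everything except sample i.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_neighbor_predict(train_x, train_y, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)
```

With n samples this trains n times on n - 1 samples each, which is affordable precisely because microarray datasets have few samples. If gene selection is part of the pipeline, it would normally be repeated inside each fold so that the held-out sample does not influence which genes are chosen.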