The objective of this paper was to perform a comparative analysis of computational intelligence algorithms for identifying breast cancer in its early stages. Two types of data representations were considered: microarray based and medical imaging based. In contrast to previous research, this study also considered the imbalanced nature of these data. It was observed that the SMO algorithm performed better for the majority of the test data, especially for microarray based data, when accuracy was used as the performance measure. Considering the imbalanced characteristic of the data, the Naive Bayes algorithm was seen to perform well in terms of true positive rate (TPR). Regarding the influence of SMOTE, a well-known technique for imbalanced data classification, it was observed that there was a notable performance improvement for J48, while the performance of SMO remained comparable for the majority of the datasets. Overall, the results indicated SMO as the most promising candidate for the microarray and image datasets considered in this research. (C) 2012 Elsevier Ltd. All rights reserved.
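The abstract above reports both accuracy and true positive rate because the two can disagree sharply on imbalanced data. The following minimal sketch illustrates that point; the function name `accuracy_and_tpr` and the toy labels are assumptions made for illustration and are not taken from the paper.

```python
def accuracy_and_tpr(y_true, y_pred, positive=1):
    """Return (accuracy, true positive rate) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = correct / len(y_true)
    # TPR is undefined when the test set holds no positive samples.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, tpr


# Hypothetical imbalanced test set: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5
# A classifier that always predicts the majority (negative) class.
y_majority = [0] * 100

print(accuracy_and_tpr(y_true, y_majority))  # (0.95, 0.0)
```

The always-negative classifier reaches 95% accuracy while detecting none of the positive cases, which is why a measure such as TPR is worth reporting alongside accuracy when classes are imbalanced.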
By converting the expression values of each sample into the corresponding rank values, the rank-based approach enables the direct integration of multiple microarray data produced by different laboratories and/or different techniques. In this study, we verify through statistical and experimental methods that informative genes can be extracted from multiple microarray data integrated by the rank-based approach (briefly, integrated rank-based microarray data). First, after showing that a nonparametric technique can be used effectively as a scoring metric for rank-based microarray data, we prove that the scoring results from integrated rank-based microarray data are statistically significant. Next, through experimental comparisons, we show that the informative genes from integrated rank-based microarray data are statistically more significant than those of single-microarray data. In addition, by comparing the lists of informative genes extracted from experimental data, we show that the rank-based data integration method extracts more significant genes than the z-score-based normalization technique or the rank products technique. Public cancer microarray data were used for our experiments and the marker genes list from the CGAP database was used to compare the extracted genes. The GO database and the GSEA method were also used to analyze the functionalities of the extracted genes.
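The per-sample rank conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the helper `to_ranks`, the average-rank handling of ties, and the toy values for the two hypothetical laboratories are not taken from the paper.

```python
def to_ranks(values):
    """Replace each value by its 1-based rank; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# Hypothetical samples measured on different scales by two laboratories,
# with genes listed in the same order in both.
lab_a = {"s1": [120.0, 15.0, 980.0, 15.0]}
lab_b = {"s2": [2.1, 0.3, 7.9, 1.0]}

# After ranking within each sample, the samples share one scale and can be pooled.
integrated = {
    name: to_ranks(expr)
    for data in (lab_a, lab_b)
    for name, expr in data.items()
}
print(integrated)  # {'s1': [3.0, 1.5, 4.0, 1.5], 's2': [3.0, 1.0, 4.0, 2.0]}
```

Because ranks depend only on the ordering of genes within a sample, platform-specific scales and offsets drop out, which is what makes direct pooling of the two sources possible.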
Handling big data is one of the major issues in the field of statistical data analysis. In such investigations, cluster analysis plays a vital role in dealing with large-scale data. There are many clustering techniques, each taking a different approach to cluster analysis, but it is difficult to predict which approach suits a particular dataset. To deal with this problem, a grading approach is introduced over many clustering techniques to identify a stable technique. Because the grading approach depends on the characteristics of the dataset as well as on the validity indices, a two-stage grading approach is implemented. In this study, the grading approach is applied to five clustering techniques: hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experimentation is conducted over five microarray datasets with seven validity indices. The finding of the grading approach that a clustering technique is significant is also confirmed by the Nemenyi post-hoc hypothesis test. (C) 2017 Elsevier B.V. All rights reserved.
The rising interest in integrative approaches has shifted gene selection from purely data-centric methods to methods that incorporate additional biological knowledge. Integrative gene selection is viewed as a promising approach in microarray data classification that takes into consideration the complex relationships among genes. However, in most of the existing methods, the selection of genes is still based on expression values alone, and biological knowledge is integrated only at the end of the analysis to verify experimental results or to gain biological insights. Thus, this paper proposes an integrative gene selection method based on a filter method and association analysis for selecting genes that are not only differentially expressed but also informative for classification. Association analysis is employed to integrate microarray data with multiple types of biological knowledge simultaneously, and to identify groups of genes that frequently co-occur in target samples. The method has been tested on four cancer-related datasets, and two types of biological knowledge are incorporated, namely Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results show that the recommended GO based models, KEGG based models, and GO-KEGG based models outperformed the expression-only models by attaining better classification accuracies with fewer genes. The performance of the integrative models verified the efficiency and scalability of association analysis in mining microarray data.
The analysis of microarray data is a widespread functional genomics approach that allows for the monitoring of the expression of thousands of genes at once. The analysis of the great amount of data generated in a microarray experiment requires powerful statistical techniques. One of the first tasks of the analysis of microarray data is to cluster data into biologically meaningful groups according to their expression patterns. In this article, we discuss classical as well as recent clustering techniques for microarray data. We pay particular attention to both theoretical and practical issues and give some general indications that might be useful to practitioners.
Background: Microarray devices permit a genome-scale evaluation of gene function. This technology has catalyzed biomedical research and development in recent years. As many important diseases can be traced down to the gene level, a long-standing research problem is to identify specific gene expression patterns linked to metabolic characteristics that contribute to disease development and progression. The microarray approach offers an expedited solution to this problem. However, recognizing disease-related gene expression patterns embedded in the microarray data remains a challenging issue. In selecting a small set of biologically significant genes for classifier design, the high data dimensionality inherent in this problem creates a substantial amount of uncertainty. Results: Here we present a model for probability analysis of selected genes in order to determine their importance. Our contribution is that we show how to derive the P value of each selected gene in multiple gene selection trials based on different combinations of data samples and how to conduct a reliability analysis accordingly. The importance of a gene is indicated by its associated P value in that a smaller value implies higher information content from information theory. On the microarray data concerning the subtype classification of small round blue cell tumors, we demonstrate that the method is capable of finding the smallest set of genes (19 genes) with optimal classification performance, compared with results reported in the literature. Conclusion: In classifier design based on microarray data, the probability value derived from gene selection based on multiple combinations of data samples enables an effective mechanism for reducing the tendency of fitting local data particularities.
Dimension reduction is a crucial technique in machine learning and data mining, which is widely used in areas of medicine, bioinformatics and genetics. In this paper, we propose a two-stage local dimension reduction approach for classification on microarray data. In the first stage, a new L1-regularized feature selection method is defined to remove irrelevant and redundant features and to select the important features (biomarkers). In the next stage, PLS-based feature extraction is implemented on the selected features to extract synthesis features that best reflect discriminating characteristics for classification. The suitability of the proposal is demonstrated in an empirical study done with ten widely used microarray datasets, and the results show its effectiveness and competitiveness compared with four state-of-the-art methods. The experimental results on the St Jude dataset show that our method can be effectively applied to microarray data analysis for subtype prediction and the discovery of gene coexpression. (C) 2016 Elsevier Ltd. All rights reserved.
Background: As the magnitude of the experiment increases, it is common to combine various types of microarrays such as paired and non-paired microarrays from different laboratories or hospitals. Thus, it is important to analyze microarray data together to derive a combined conclusion after accounting for heterogeneity among data sets. One of the main objectives of the microarray experiment is to identify differentially expressed genes among the different experimental groups. We propose the linear mixed effect model for the integrated analysis of the heterogeneous microarray data sets. Results: The proposed linear mixed effect model was illustrated using the data from 133 microarrays collected at three different hospitals. Through simulation studies, we compared the proposed linear mixed effect model approach with the meta-analysis and the ANOVA model approaches. The linear mixed effect model approach was shown to provide higher power than the other approaches. Conclusions: The linear mixed effect model has the advantage over the ANOVA model of allowing for various types of covariance structures. Further, it can easily handle correlated microarray data such as paired microarray data and repeated microarray data from the same subject.
Microarray data can provide valuable results for a variety of gene expression profile problems and contribute to advances in clinical medicine. The application of microarray data to cancer-type classification has recently gained in popularity. Microarray data are characterized by a large number of features (genes), that is, high dimensionality, and often fall into the multi-class category. These facts make testing and training of general classification methods difficult. Reducing the number of genes and achieving lower classification error rates are the main issues to be solved. The classification of microarray data samples can be regarded as a feature selection and classifier design problem. The goal of feature selection is to select those subsets of differentially expressed genes that are potentially relevant for distinguishing the sample classes. Classical genetic algorithms (GAs) may suffer from premature convergence and thus lead to poor experimental results. In this paper, a combat genetic algorithm (CGA) is used to implement the feature selection, and a K-nearest neighbor classifier with leave-one-out cross-validation serves as the CGA fitness function for the classification problem. The proposed method was applied to 10 microarray data sets that were obtained from the literature. The experimental results show that the proposed method not only effectively reduced the number of gene expression levels but also achieved lower classification error rates.
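The fitness evaluation described above, a K-nearest neighbor classifier scored by leave-one-out cross-validation on a candidate gene subset, might look like the sketch below. It covers only the fitness step, not the combat genetic algorithm itself; the function name `loocv_knn_error`, the squared Euclidean distance, the default `k=1`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def loocv_knn_error(samples, labels, gene_mask, k=1):
    """Leave-one-out error rate of a k-NN classifier restricted to the selected genes."""
    selected = [g for g, keep in enumerate(gene_mask) if keep]
    if not selected:
        return 1.0  # assumption: an empty gene subset gets the worst possible score
    errors = 0
    for i, x in enumerate(samples):
        # Squared Euclidean distance from the held-out sample to every other sample.
        dists = sorted(
            (sum((x[g] - y[g]) ** 2 for g in selected), j)
            for j, y in enumerate(samples)
            if j != i
        )
        votes = Counter(labels[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] != labels[i]:
            errors += 1
    return errors / len(samples)


# Hypothetical toy data: 4 samples x 3 genes. Gene 0 separates the classes,
# while gene 2 varies independently of the class label.
samples = [[1.0, 0.5, 9.0], [1.2, 0.4, 1.0], [5.0, 0.5, 8.5], [5.3, 0.6, 1.5]]
labels = ["A", "A", "B", "B"]

print(loocv_knn_error(samples, labels, [1, 0, 0]))  # 0.0
print(loocv_knn_error(samples, labels, [0, 0, 1]))  # 1.0
```

A genetic algorithm would call such a function once per candidate mask and prefer masks with a lower error rate, typically also rewarding masks that select fewer genes.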
In the last few years, model-based clustering techniques have become widely used in the context of microarray data analysis. In this empirical context, a potential purpose for statistical approaches is the identification of clusters of genes that are co-expressed under subsets of experimental conditions. We discuss a hierarchical mixture model to combine advantages of allowing for dependence within gene clusters and for simultaneous clustering of genes and experimental conditions. Thanks to the adopted hierarchical structure, we may distinguish gene clusters from mixture components, where the latter may represent intra-cluster gene-specific extra-Gaussian departures. To cluster experimental conditions, instead, we suggest a suitable parameterization of component-specific means by using a binary row stochastic matrix representing condition membership. The performance of the proposed approach is discussed on both simulated and real datasets.