Background: Microarray gene expression datasets usually contain a large number of genes, which complicates downstream operations such as classification, clustering and other kinds of analysis. During classification, identifying salient genes is a demanding task that requires careful selection. Methods: Classifying multi-class datasets is more difficult than binary classification. With multiple class labels, the datasets are more likely to be imbalanced: the number of samples per class can vary widely, so the classifier may become biased when unsuitable samples are chosen for training. Little research addresses all three of these scenarios together in microarray datasets. Results and Discussion: The paper fills this gap with the following contributions: i) it selects salient genes for classification using the MultiSURF algorithm; ii) it identifies the right instances from imbalanced datasets using the Retained Tomek Link algorithm; and iii) it performs gene selection for multi-class classification using Dynamic Length Particle Swarm Optimization (DPSO). Conclusion: The proposed method is implemented on multi-class imbalanced microarray datasets, and its final classification performance is encouraging and better than that of the compared methods.
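The instance-selection idea can be illustrated with ordinary Tomek links. This is a minimal sketch and not the paper's Retained Tomek Link variant, whose details the abstract does not give; the function name `tomek_links` and the toy data are hypothetical. Two samples form a Tomek link when they are each other's nearest neighbour and carry different class labels.

```python
import math

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form standard Tomek links."""
    def nearest(i):
        # Index of the closest other sample (Euclidean distance).
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    # A link needs mutual nearest neighbours with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# Hypothetical toy data: samples 2 and 3 sit close together across the class boundary.
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (5.0, 5.0)]
y = [0, 0, 0, 1, 1]
links = tomek_links(X, y)
```

Undersampling schemes built on Tomek links typically drop the majority-class member of each link; which member is kept or removed in the paper's variant is not stated in the abstract.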
Recent advances in bioinformatics have led to high-dimensional microarray datasets. Such datasets remain challenging for machine learning researchers because they combine a small sample size with an extremely large number of features. Feature selection is therefore a central problem in learning from this kind of data. In this paper, a novel feature selection method called CCFS, based on a global search that uses the main concepts of the divide-and-conquer technique, is proposed. The CCFS algorithm randomly divides the dataset vertically (by features) and applies the fundamental concepts of cooperative coevolution, using a filter criterion in the fitness function, to search the solution space with a binary gravitational search algorithm. To assess the effectiveness of the proposed method, experiments are carried out on seven binary high-dimensional microarray datasets. The results are compared with nine state-of-the-art feature selection algorithms, including Interact (INT) and Maximum Relevancy Minimum Redundancy (MRMR). A non-parametric statistical test on the average results shows that the proposed method differs significantly from the others in terms of accuracy, sensitivity, specificity and number of selected features. (C) 2018 Elsevier Ltd. All rights reserved.
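The random vertical division can be sketched as follows. This is only an illustration of splitting feature indices into disjoint groups at random; the function name, group count and seed are assumptions, and the cooperative coevolution and binary gravitational search steps of CCFS are not shown.

```python
import random

def split_features(n_features, n_groups, seed=0):
    """Randomly partition feature indices into n_groups disjoint groups."""
    idx = list(range(n_features))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices round-robin so group sizes differ by at most one.
    return [sorted(idx[g::n_groups]) for g in range(n_groups)]

groups = split_features(10, 3)
```

Each group could then be searched by its own subpopulation, with the groups' partial solutions combined when a candidate feature subset is evaluated.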
Feature subset selection methods are currently very important, especially in application areas where datasets with tens or hundreds of thousands of variables (genes) are available. They help select a small number of variables out of the thousands of genes in microarray datasets for more accurate and balanced classification. Efficient gene selection eases the computational burden of the subsequent classification task and can yield a subset of genes without loss of classification performance. In classifying microarray data, the main objective of gene selection is to find the genes that retain the maximum amount of relevant information about the class while minimizing classification errors. In this paper, we explain the importance of feature subset selection methods in the machine learning and data mining fields. The analysis of microarray expression data was used to check whether global biological differences underlie common pathological features in different types of cancer datasets and to identify genes that might anticipate the clinical behavior of the disease. Gene expression data contain large amounts of raw data that need to be analyzed to obtain useful information for specific biological and medical applications. One way of finding relevant (and removing redundant) genes is to use a Bayesian network based on the Markov blanket [1]. We present and compare the performance of different approaches to feature (gene) subset selection based on Wrapper and Markov Blanket models on five microarray cancer datasets. The first approach relies on Memetic Algorithms (MAs) for feature selection. The second approach uses MRMR (Minimum Redundancy Maximum Relevance) for feature subset selection, hybridized with genetic search optimization techniques, and afterwards compares the Markov blanket model's performance with the most common classical classifica
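The MRMR criterion mentioned above can be sketched as a greedy search over discrete features. This is a minimal illustration of the plain "relevance minus mean redundancy" form, not the genetic-search hybrid the abstract describes; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_info(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())

def mrmr(features, target, k):
    """Greedily pick k feature indices maximizing relevance minus mean redundancy."""
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        def score(f):
            relevance = mutual_info(features[f], target)
            redundancy = (sum(mutual_info(features[f], features[s])
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: feature 1 duplicates feature 0; feature 2 adds new information.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
target = [0, 0, 1, 1, 2, 2, 3, 3]
chosen = mrmr([a, list(a), b], target, k=2)
```

In this toy case the duplicate feature is skipped in the second round because its redundancy with the already selected feature cancels its relevance.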
ISBN: (print) 9783319093390; 9783319093383
In this paper, a new ensemble system based on genetic programming (GP), named GPES, is proposed. A decision tree is used as the base classifier, and the trees are fused by GP with three voting methods: min, max and average. In this way, each GP individual acts as an ensemble system. When the GP evolution process ends, the final ensemble committee is selected from the last generation by a forward search algorithm. GPES is evaluated on microarray datasets, and the results show that it is competitive with other ensemble systems.
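The three fusion operators can be sketched on per-class probability outputs. This is a minimal illustration under the assumption that each base classifier returns one probability per class; the function name `fuse` and the numbers are hypothetical, and the way GP composes these operators into a tree is not shown.

```python
def fuse(prob_rows, rule):
    """Combine per-classifier class probabilities and return the winning class index."""
    rules = {"min": min, "max": max, "average": lambda v: sum(v) / len(v)}
    combine = rules[rule]
    # zip(*rows) groups the scores that the classifiers assign to each class.
    scores = [combine(col) for col in zip(*prob_rows)]
    return scores.index(max(scores))

# Hypothetical outputs of three decision trees for one sample with three classes.
probs = [[0.5, 0.4, 0.1],
         [0.5, 0.4, 0.1],
         [0.1, 0.2, 0.7]]
winners = {rule: fuse(probs, rule) for rule in ("min", "max", "average")}
```

On this toy input the three rules disagree, which is why letting the search choose how to combine them can matter.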
Traditional classification algorithms struggle with the high dimensionality of medical data, resulting in reduced performance in tasks like disease diagnosis. Feature selection (FS) has emerged as a crucial preprocessing step to mitigate these challenges by extracting relevant features and improving classification accuracy. This paper proposes a hybrid FS method, FJMIBCOA, which integrates Fuzzy Joint Mutual Information (FJMI) as a filter measure and the Binary Cheetah Optimizer Algorithm (BCOA) as a wrapper method. Unlike existing hybrid FS methods, the proposed method employs FJMI to address uncertainty in feature relationships, providing several advantages such as handling both discrete and continuous features, accommodating linear and non-linear relationships, robustness to noise and effective use of intra- and inter-class information. It also employs BCOA as a wrapper method, which requires few parameters, minimizes computational overhead and enhances classification robustness, making it an efficient and adaptable solution for FS in complex medical datasets. The proposed method is validated on 23 medical datasets and 14 high-dimensional microarray datasets, demonstrating excellent performance in terms of fitness value, accuracy and feature size. FJMIBCOA surpasses existing methods in medical datasets by achieving higher accuracy in 78.26% of datasets while reducing the feature size by 84.79%. Similarly, in microarray datasets, it improves accuracy in 78.58% of datasets with an impressive 95.08% reduction in feature size. Furthermore, FJMIBCOA achieves superior accuracy in 60% of datasets while selecting fewer features in 78.57% of datasets as compared to previous studies. Statistical testing indicates that FJMIBCOA outperforms other methods significantly. The proposed method enhances diagnosis accuracy and minimizes medical testing requirements, making it suitable for real-world, high-dimensional datasets and decision-making in medical data analysis. The findings
Identifying disease-related genes is an ongoing research problem in biomedical analysis. Many studies have recently presented strategies for predicting disease-related genes. However, only a handful of them can identify or select relevant genes with a low computational burden. To tackle this issue, we introduce a new filter-wrapper-based gene selection (GS) method based on metaheuristic algorithms (MHAs) in conjunction with the k-nearest neighbors (k-NN) classifier. Specifically, we hybridize two MHAs, the bat algorithm (BA) and the JAYA algorithm (JA), embedded with perturbation as a new perturbation-based exploration strategy (PES), to obtain the JAYA-bat algorithm (JBA). JBA outperforms 10 state-of-the-art GS methods on 12 high-dimensional microarray datasets (ranging from 2000 to 22,283 features or genes). Relevant genes are first selected via a filter-based method, mutual information (MI), and then further optimized by JBA to select near-optimal genes in a timely fashion. Compared with 11 well-known original MHAs, including BA and JA, the proposed JBA achieves significantly better results, with improvement rates of 12.36%, 12.45%, 97.88%, 9.84%, 12.45%, and 12.17% in terms of fitness, accuracy, gene selection ratio, precision, recall, and F1-score, respectively. The results of Wilcoxon's signed-rank test at a significance level of α = 0.05 furth
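For orientation, the standard JAYA update that JA contributes can be sketched as below. This is the textbook continuous-valued rule only: a candidate moves toward the best solution and away from the worst. The hybridization with BA, the perturbation strategy (PES) and the binary encoding used for gene selection are not described in enough detail in the abstract to reproduce, and the function name and `rand` parameter are assumptions.

```python
import random

def jaya_step(x, best, worst, rand=random.random):
    """One standard JAYA update of candidate x given the best and worst solutions."""
    new = []
    for xi, bi, wi in zip(x, best, worst):
        r1, r2 = rand(), rand()
        # Attracted toward the best solution, repelled from the worst.
        new.append(xi + r1 * (bi - abs(xi)) - r2 * (wi - abs(xi)))
    return new

# With both random factors fixed at 0.5 the step is easy to follow by hand.
moved = jaya_step([1.0], best=[2.0], worst=[0.0], rand=lambda: 0.5)
```

In JAYA the new candidate is normally kept only if it improves the fitness, so the rule needs no algorithm-specific tuning parameters.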
Malignancies and diseases of various genetic origins can be diagnosed and classified with microarray data. The large number of genes and the small number of samples in microarray data pose many obstacles. This paper describes a combined strategy for gene expression data in a variety of diseases, consisting of two steps: identifying the most effective genes via soft ensembling, and classifying the data with a novel deep neural network. The feature selection approach combines three wrapper strategies to select genes and rank them according to the k-nearest neighbour algorithm, resulting in a highly generalisable model with low error levels. Using soft ensembling, the most effective subsets of genes were identified from three microarray datasets of diffuse large cell lymphoma, leukaemia, and prostate cancer. A stacked deep neural network was used to classify all three datasets, achieving average accuracies of 97.51%, 99.6%, and 96.34%, respectively. In addition, two previously unreported datasets from small, round blue cell tumors (SRBCTs) and multiple sclerosis-related brain tissue lesions were examined to show the generalisability of the method.
Purpose: The prolyl 3-hydroxylase family member 4 gene (P3H4) is involved in the development of human cancers. The association of P3H4 with bladder cancer (BC) prognosis is unclear. This study aimed to analyze the association of P3H4 with BC prognosis. Methods: RNA-Seq data were downloaded from The Cancer Genome Atlas project, and BC microarray datasets (GSE13507, GSE31684, and GSE32548) were downloaded from the Gene Expression Omnibus database. We analyzed the differences in P3H4 expression levels between BC tumors and non-tumor tissues and between samples with different clinical information. The association of P3H4 and P3H4-related genes with BC prognosis and the possibility of using P3H4 expression as a prognostic biomarker in BC patients were also analyzed. RevMan was used to perform the meta-analysis. Results: P3H4 was upregulated in BC tissues compared with the adjacent non-tumor tissues (p = 4.06e-08). Univariate Cox regression analysis and meta-analysis showed that a high P3H4 expression level contributed to a poor BC prognosis (hazard ratio, HR = 1.348, 95% CI 1.140-1.594, p = 4.89e-04; meta-analysis: HR = 1.45, 95% CI 1.10-1.91; p = 9.00e-03). Among the genes related to P3H4, the PLOD1 gene was closely associated with P3H4 expression (r = 0.620, p = 2.49e-44). Also, a meta-analysis showed that PLOD1 expression was associated with a poor prognosis in BC patients (HR = 1.77, 95% CI 1.31-2.38; p = 2.00e-04). Conclusions: The P3H4 and PLOD1 genes might be used as reliable prognostic biomarkers for BC.
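A reported hazard ratio, its 95% confidence interval and its p-value are linked on the log scale, which allows a quick consistency check. The sketch below assumes a Wald-type interval (symmetric in log HR) and a two-sided normal test; the function name is hypothetical, and the inputs are the univariate Cox figures quoted above.

```python
import math

def p_from_hr_ci(hr, lo, hi, z_crit=1.959964):
    """Recover the z statistic and two-sided p-value from an HR and its 95% CI."""
    # The CI spans 2 * z_crit standard errors on the log scale.
    se = (math.log(hi) - math.log(lo)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided tail probability of a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

z, p = p_from_hr_ci(1.348, 1.140, 1.594)
```

The recovered p-value is of the same order as the reported p = 4.89e-04; small differences are expected because the published HR and interval are rounded.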
A common situation in classification tasks is to deal with unbalanced datasets, an issue that appears when the majority class(es) has a large number of samples compared to the minority class(es). This problem is even more significant when the datasets have a large number of features but only a few samples, as is the case with microarray datasets. Traditionally, an approach to alleviate this problem has been the application of sampling methods to obtain more balanced classes, increasing the number of samples in the minority class (replicating samples or generating new synthetic samples), or decreasing the number of samples in the majority class. In this study, we have compared different balancing methods, including a novel method that applies sampling in both the minority and majority classes. The interest in applying feature selection in combination with balancing methods has also been explored. In view of the results, a recommendation of sampling method, feature selection, and classifier is proposed to improve the classification results according to the type of dataset.
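Sampling in both directions can be sketched with plain random resampling. This is not the study's novel method, whose details the abstract does not give; it only shows the generic idea of shrinking larger classes and replicating samples of smaller ones toward a common size. The function name, the `target` parameter and the toy data are assumptions.

```python
import random

def balance(samples, labels, target, seed=0):
    """Resample every class to `target` samples (undersample or replicate)."""
    rng = random.Random(seed)
    by_class = {}
    for s, l in zip(samples, labels):
        by_class.setdefault(l, []).append(s)
    out_s, out_l = [], []
    for l, items in by_class.items():
        if len(items) >= target:
            chosen = rng.sample(items, target)  # undersample without replacement
        else:
            # Oversample by replicating randomly chosen existing samples.
            chosen = items + rng.choices(items, k=target - len(items))
        out_s += chosen
        out_l += [l] * target
    return out_s, out_l

# Hypothetical toy data: 10 majority samples and 2 minority samples.
new_samples, new_labels = balance(list(range(12)), [0] * 10 + [1] * 2, target=6)
```

Replication is the simplest form of oversampling; generating synthetic samples, as the abstract also mentions, would replace the `rng.choices` step.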
Microarray expression datasets contain a huge number of genes, but only a few of them provide information about cancer. Feature selection approaches have been developed to deal with this problem. Filter-based methods, in particular, select the relevant genes and remove the irrelevant ones using different evaluation metrics. In this study, we examine nine univariate filter methods. Three categories of filter methods were investigated using eight microarray datasets, including binary and multi-class samples. Support vector machine and Naive Bayes classifiers were used to assess classification accuracy. Different comparison methods were used to help researchers visualize the performance of each studied filter. Specifically, statistical tests were applied to the classification accuracy, and the similarity of the filter methods' feature rankings was studied using a rank correlation measure.
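Ranking similarity between two filters can be sketched with Spearman's rank correlation. The abstract does not name the measure used, so Spearman is an assumption here, as are the function name and the example rankings; the formula below holds when neither ranking contains ties.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rankings of the same genes."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical ranks that two filters assign to the same five genes.
filter_a = [1, 2, 3, 4, 5]
filter_b = [2, 1, 3, 5, 4]
rho = spearman(filter_a, filter_b)
```

A value near 1 means the two filters order the genes almost identically, while a value near -1 means they order them in reverse.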