The rising interest in integrative approaches has shifted gene selection from purely data-centric methods to methods that incorporate additional biological knowledge. Integrative gene selection is viewed as a promising approach to microarray data classification because it takes into consideration the complex relationships among genes. However, in most existing methods, genes are still selected on the basis of expression values alone, and biological knowledge is integrated only at the end of the analysis to verify experimental results or to gain biological insights. This paper therefore proposes an integrative gene selection method, based on a filter method and association analysis, for selecting genes that are not only differentially expressed but also informative for classification. Association analysis is employed to integrate microarray data with multiple types of biological knowledge simultaneously and to identify groups of genes that frequently co-occur in target samples. The method was tested on four cancer-related datasets, with two types of biological knowledge incorporated, namely Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results show that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracy with fewer genes. The performance of the integrative models verified the efficiency and scalability of association analysis in mining microarray data.
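As a rough illustration of the idea of finding gene groups that frequently co-occur in target samples, the sketch below counts gene groups whose support meets a minimum threshold. It is a brute-force frequent-itemset count, not the paper's algorithm. The gene names, the encoding of each sample as a set of flagged genes, and the `min_support` value are all assumptions made for this example.

```python
from itertools import combinations
from collections import Counter

def frequent_gene_groups(samples, group_size, min_support):
    """Return gene groups of a given size whose support meets min_support.

    samples: list of sets, each holding the genes flagged in one sample
             (a hypothetical encoding, assumed for this sketch).
    """
    counts = Counter()
    for genes in samples:
        # Sort so every group has a single canonical tuple form.
        for group in combinations(sorted(genes), group_size):
            counts[group] += 1
    n = len(samples)
    # Support = fraction of samples in which the whole group appears.
    return {g: c / n for g, c in counts.items() if c / n >= min_support}

# Toy target-class samples with made-up gene symbols.
target_samples = [
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G1", "G2", "G4"},
    {"G3", "G4"},
]
groups = frequent_gene_groups(target_samples, group_size=2, min_support=0.5)
```

Here only the pair (G1, G2) appears in at least half of the samples. A practical implementation would likely use an Apriori-style or FP-growth search rather than enumerating every combination.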
A vast number of microarray datasets have been produced as a way to identify differentially expressed genes and gene expression signatures. A better understanding of these biological processes can help in the diagnosis and prognosis of diseases, as well as in understanding the therapeutic response to drugs. However, most of the available datasets contain only a small number of samples, leading to low statistical, predictive, and generalization power. One way to overcome this problem is to merge several microarray datasets into a single dataset, which is typically a challenging task. Statistical methods or supervised machine learning algorithms are usually used to determine gene expression signatures. Nevertheless, statistical methods require an arbitrary threshold to be defined, and supervised machine learning methods can be ineffective when applied to high-dimensional datasets such as microarrays. We propose a methodology to identify gene expression signatures by merging microarray datasets. This methodology uses statistical methods to obtain several sets of differentially expressed genes and uses supervised machine learning algorithms to select the gene expression signature. The methodology was validated in two distinct research applications: one using heart failure microarray datasets and the other using autism spectrum disorder microarray datasets. For the first, we obtained a gene expression signature composed of 117 genes, with a classification accuracy of approximately 98%. For the second, we obtained a gene expression signature composed of 79 genes, with a classification accuracy of approximately 82%. The methodology was implemented in the R language and is available, under the MIT licence, at https://***/bioinformatics-ua/MicroGES.
Cost-effective solutions are essential when dealing with very high-dimensional datasets such as gene expression datasets. Processing these voluminous and complex datasets with machine learning and data mining techniques is a significant challenge in terms of time and resource consumption. A notable obstacle in dataset analysis is the prevalence of extraneous features or attributes. This is particularly evident in numerous medical datasets, which are often burdened with unnecessary attributes that make it harder for classification or prediction algorithms to obtain precise results. Metaheuristic optimization algorithms, however, are remarkably proficient at isolating pertinent feature vectors, which markedly improves the efficiency and cost-effectiveness of data processing. We propose a novel feature selection method using a Genetic Algorithm (GA) that enhances initial population diversity by clustering features during initialization. The paper also introduces a modified crossover technique for generating offspring and employs an adaptive threshold-based Roulette Wheel for parent selection, ensuring effective feature selection. We evaluate the proposed feature selection method on 17 UCI datasets, 3 of which have a very high number of features, and the results are better than those of many state-of-the-art methods in terms of both classification accuracy and reduction in the number of features. We also apply our method to 5 microarray-based gene expression datasets used for cancer prediction, in order to assess the scalability and robustness of the method as a feature selector in real-life scenarios. The authors provide a link to the source code of the proposed method.
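To make the parent-selection step more concrete, the sketch below shows plain fitness-proportionate (roulette wheel) selection. It does not reproduce the adaptive threshold described in the abstract, whose details are not given here. The function name, the fitness values, and the seed are assumptions for illustration only.

```python
import random

def roulette_wheel_select(fitness, rng):
    """Pick one index with probability proportional to its fitness.

    fitness: list of non-negative scores, one per individual.
    rng: a random.Random instance, passed in so runs are repeatable.
    """
    if any(value < 0 for value in fitness):
        raise ValueError("fitness values must be non-negative")
    total = sum(fitness)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    pick = rng.uniform(0, total)
    cumulative = 0.0
    chosen = None
    for index, value in enumerate(fitness):
        if value == 0:
            continue  # zero-fitness individuals own no slice of the wheel
        cumulative += value
        chosen = index
        if pick <= cumulative:
            break
    # If round-off left pick above the final sum, the last positive index is used.
    return chosen

rng = random.Random(0)
fitness = [0.9, 0.0, 0.6, 0.3]  # made-up fitness values for four candidates
parents = [roulette_wheel_select(fitness, rng) for _ in range(1000)]
```

Over many draws, candidates are selected roughly in proportion to their fitness, and the zero-fitness candidate is never chosen.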
Gene selection is crucial for cancer classification using microarray data. To improve cancer classification accuracy, we developed a new wrapper method for gene selection called ieGENES. First, we proposed a parsimonious kernel machine regularization (PKMR) model that uses ridge regularization in kernel-machine-driven classification to tackle multicollinearity and obtain stable estimates in high-dimensional settings. The ieGENES algorithm was then developed to optimally identify relevant genes while iteratively eliminating redundant ones based on leave-one-out cross-validation accuracy. In particular, we developed a new methodology to optimally update model parameters upon gene removal. The ieGENES algorithm was evaluated on six cancer microarray datasets and compared with existing methods. Classification accuracy and the number of differentially expressed genes (DEGs) identified were assessed. In terms of gene selection accuracy, ieGENES outperformed multiple wrapper methods on 5 out of 6 datasets (Colon, Leukemia, Hepato, Glioma, and Breast Cancers), with statistically significant improvements (p<0.001). For the Colon dataset, ieGENES achieved 96.21% accuracy with 167 DEGs. The proposed ieGENES technique demonstrated superior performance in identifying DEGs for cancer diagnosis compared with existing techniques. It offers a promising tool for identifying biologically relevant genes in microarray data analysis and for biomarker discovery in cancer research.
Cancer, in particular breast cancer, is considered one of the most common causes of death worldwide according to the World Health Organization. For this reason, extensive research efforts have been made in the area of accurate and early diagnosis of cancer in order to increase the likelihood of cure. Among the available tools for diagnosing cancer, microarray technology has proven to be effective. Microarray technology analyzes the expression levels of thousands of genes simultaneously. Although the huge number of features, or genes, in microarray data may seem advantageous, many of these features are irrelevant or redundant, which degrades classification accuracy. To overcome this challenge, feature selection is a mandatory preprocessing step before classification. This paper reviews the main feature selection and classification techniques introduced in the literature for cancer (particularly breast cancer), with the aim of improving microarray-based classification.
DNA microarray datasets, also known as "omics" data, are important for the diagnosis of numerous diseases, including cancer and tumors. In the analysis of these data, feature selection techniques and classification algorithms are the workhorses for choosing candidate genes that serve as cancer biomarkers. However, microarray datasets present a challenge: they contain more features than samples, which affects the performance of the algorithms used in the analysis. Extracting precise information requires a method that is both robust and performant. This paper emphasizes the importance of accurate and stable gene selection for knowledge discovery from high-dimensional data. A novel hybrid framework is proposed, comprising three distinct stages: Clustering, Parallel Filtering, and Hybrid-Parallel Optimization. In each stage, a combination of techniques and algorithms is used to improve the results in terms of stability and/or accuracy. The proposal is evaluated and tested under different scenarios, using thirteen gene expression datasets and two classifiers: Artificial Neural Network (ANN) and Naïve Bayes (NB). Comparison with related work demonstrates the efficacy of this approach, which enhances classification accuracy and stability while reducing the number of selected genes.
Gene selection is a pivotal process in machine-learning-driven medical diagnostics, where the goal is to identify a subset of genes from microarray expression profiles that can enhance the predictive accuracy of classifiers for disease diagnosis. The two key objectives of gene selection are to reduce the dimensionality of the data and to improve the accuracy of disease diagnosis, which typically makes it a multi-objective optimization problem. In recent years, multi-objective evolutionary algorithms (MOEAs) have gained wide attention in feature selection research, and several related algorithms have been proposed. However, most algorithms tend to get stuck in local optima when searching for solutions in a high-dimensional space. To solve the gene selection problem effectively, this study introduces a recursive multi-objective differential evolution algorithm with an elite recursive strategy (RMODE-E) and a recursive multi-objective differential evolution algorithm with a Pareto front recursive strategy (RMODE-P). RMODE-E combines the features selected by the top E elite individuals, whereas RMODE-P combines the features selected by the Pareto front set; in both cases, the combined features serve as the search space for subsequent recursive rounds of searching. The proposed feature subspace combination strategy not only reduces the recursive search space but also improves the capacity to find globally optimal feature subsets. Extensive experiments were conducted to compare the proposed algorithms with eight state-of-the-art evolutionary algorithms and validate their effectiveness. Experimental results demonstrate that RMODE-P has better global search capability, achieving higher best and mean classification accuracy and a smaller minimal gene subset size.
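The Pareto-front combination step can be sketched as follows. This is only an illustration of extracting non-dominated solutions and taking the union of their selected genes. It omits the differential evolution search entirely, and the two objectives used here (classification error and subset size, both minimized), the function names, and the toy population are assumptions for the example.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b on every objective
    and strictly better on at least one (both objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better

def pareto_front(population):
    """population: list of (error_rate, selected_gene_set) pairs."""
    objectives = [(err, len(genes)) for err, genes in population]
    front = []
    for i, obj in enumerate(objectives):
        dominated = any(
            dominates(other, obj)
            for j, other in enumerate(objectives) if j != i
        )
        if not dominated:
            front.append(population[i])
    return front

def combined_subspace(front):
    """Union of the genes selected by every solution on the front."""
    space = set()
    for _, genes in front:
        space |= genes
    return space

# Toy population with made-up error rates and gene names.
population = [
    (0.10, {"g1", "g2", "g3"}),
    (0.15, {"g2"}),
    (0.10, {"g1", "g4"}),
    (0.20, {"g5", "g6"}),
]
front = pareto_front(population)
next_space = combined_subspace(front)
```

In this toy case, two solutions are non-dominated, and the next round would search only among g1, g2, and g4 rather than all six genes.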
Alzheimer's disease (AD) classification, which is crucial for identifying AD-associated genes, relies heavily on effective feature selection (FS) to tackle the curse of dimensionality. Traditional filter, wrapper, and embedded techniques have drawbacks, including ignoring feature interdependence, sensitivity to classifier choices, and high computational costs. Hybrid approaches that combine these methods seek to harness their collective strengths but face challenges, particularly in selecting the optimal number of features from each method. This selection is typically manual or requires time-intensive k-fold cross-validation (KFCV), which significantly increases computational demands and requires extensive parameter optimization across families, adding to the complexity and resource requirements of model development. To overcome these challenges, this work proposes a framework for optimal FS and classification of AD using a combination of filter and embedded techniques, enhanced with hyperparameter tuning. First, gene expression data (GED) from the AD Neuroimaging Initiative (ADNI) is preprocessed. Then, Chi-square filter selection is applied to decrease correlated features. Next, Logistic Regression with ElasticNet penalty (LREN) is employed to further refine the feature set. Finally, Bayesian Optimization (BO) is introduced to
Microarray data is of great significance for cancer identification at the gene level. In a microarray dataset, only a small number of characteristic genes contribute significantly to cancer classification and identification. Extracting a small number of characteristic genes from a large volume of microarray data is a classic NP-hard problem. This paper proposes a practical hybrid approach to feature selection for microarray gene expression data that combines the F-score algorithm with an improved artificial fish swarm algorithm with population variation (FSA-PV). First, the F-score algorithm eliminates a large number of useless and redundant features in the dataset. Then, FSA-PV is used to escape local optima while retaining the best features of the subset as far as possible; an adaptive step and visual range adjust the search space and the movement range of the algorithm in different environments, improving both local and global optimization ability. In addition, a naive Bayesian classifier is used to test the classification accuracy of the subsets. Eight classical datasets are used in the experiments to verify the performance of the proposed mechanism. The results reveal that the classification accuracy of FSA-PV is significantly superior to that of other algorithms on the Breast dataset, and that its classification accuracy exceeds 90% in 8 cases. This further indicates the robustness and feasibility of FSA-PV in the gene selection process.
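The F-score pre-filter can be illustrated with a small sketch. It assumes the commonly used two-class F-score (between-class separation of the means divided by the sum of within-class sample variances); the abstract does not state which variant the paper uses. The gene names and expression values are made up.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """Two-class F-score of one gene.

    pos, neg: expression values of the gene in the positive and negative
    class (each needs at least two values for a sample variance).
    """
    overall = mean(pos + neg)
    numerator = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    denominator = variance(pos) + variance(neg)  # sample variances (n - 1)
    if denominator == 0:
        # Constant within both classes: perfectly separated or uninformative.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

# Made-up expression values: (positive class, negative class) per gene.
genes = {
    "informative": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "noisy": ([1.0, 5.0, 3.0], [2.0, 4.0, 3.0]),
}
scores = {name: f_score(pos, neg) for name, (pos, neg) in genes.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Genes with a low score would be dropped before the swarm-based search runs on the remaining candidates; the cut-off itself is a tuning choice not specified here.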
Cancer classification based on microarray data plays a very important role in cancer diagnosis and detection. Because microarray data contains a huge number of genes and a small number of samples, and is also nonlinear and noisy, the data dimensionality needs to be reduced. An effective way of doing so would help biologists and medical research scientists. This paper proposes a new bio-inspired algorithm for gene selection in cancer classification called the Binary Grey Wolf Optimization Algorithm (BGWOA), which hybridizes Minimum Redundancy-Maximum Relevance (MRMR) with a novel Binary Grey Wolf algorithm. The BGWOA is composed of two stages. The first stage is an MRMR pre-filter that obtains a set of relevant genes, reducing the dimensionality of the datasets. The second stage is a new Binary Grey Wolf algorithm that uses direct similarity and centroid, concepts from geometry, to update the positions of the grey wolves in order to exploit and explore the search space. We also used a fitness function that depends on the SVM with LOOCV classifier and the rate of unselected genes to evaluate candidate solutions. The primary goal of the second stage is to identify the best subset of relevant genes among those obtained in the first stage. Eight microarray datasets were used to evaluate the proposed method and compare it with other existing algorithms. The experimental results show higher classification accuracy with fewer genes than many recently published algorithms. Specifically, the proposed method achieves 100% classification accuracy on five reference datasets with between 12 and 25 genes. These results indicate that the approach is promising.
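A fitness function of the kind described, combining leave-one-out accuracy with the rate of unselected genes, might look like the sketch below. Several assumptions are made: the two terms are combined as a weighted sum, the `weight` value is hypothetical, and a 1-nearest-neighbour classifier stands in for the SVM so that the example stays self-contained. None of these details comes from the abstract.

```python
def loocv_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour stand-in classifier
    (the paper uses an SVM; 1-NN is swapped in here for brevity)."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # no genes selected, nothing to classify with
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for k in range(len(X)):
            if k == i:
                continue  # hold out sample i
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in cols)
            if dist < best_dist:
                best_label, best_dist = y[k], dist
        correct += best_label == y[i]
    return correct / len(X)

def fitness(X, y, mask, weight=0.9):
    """Weighted sum of accuracy and the fraction of genes left unselected.
    The weighted-sum form and weight=0.9 are assumptions of this sketch."""
    unselected_rate = mask.count(0) / len(mask)
    return weight * loocv_accuracy(X, y, mask) + (1 - weight) * unselected_rate

# Toy data: gene 0 separates the classes, gene 1 is misleading noise.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.8], [3.1, 0.9]]
y = [0, 0, 1, 1]
score_one_gene = fitness(X, y, [1, 0])
score_all_genes = fitness(X, y, [1, 1])
```

On this toy data, selecting only the informative gene scores higher than selecting both, because accuracy rises and the unselected-gene term rewards the smaller subset.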