Although they have become a widely used experimental technique for identifying differentially expressed (DE) genes, DNA microarrays are notorious for generating noisy data. A common strategy for mitigating the effects of noise is to perform many experimental replicates. This approach is often costly and sometimes impossible given limited resources; thus, analytical methods that increase accuracy at no additional cost are needed. One inexpensive source of microarray replicates comes from prior work: to date, data from hundreds of thousands of microarray experiments are in the public domain. Although these data assay a wide range of conditions, they cannot be used directly to inform any particular experiment and are thus ignored by most DE gene methods. We present the SVD Augmented Gene expression Analysis Tool (SAGAT), a mathematically principled, data-driven approach for identifying DE genes. SAGAT increases the power of a microarray experiment by using observed coexpression relationships from publicly available microarray datasets to reduce uncertainty in individual genes' expression measurements. We tested the method on three well-replicated human microarray datasets and demonstrate that use of SAGAT increased effective sample sizes by as many as 2.72 arrays. We applied SAGAT to unpublished data from a microarray study investigating transcriptional responses to insulin resistance, resulting in a 50% increase in the number of significant genes detected. We evaluated 11 (58%) of these genes experimentally using qPCR, confirming the directions of expression change for all 11 and statistical significance for three. Use of SAGAT revealed coherent biological changes in three pathways: inflammation, differentiation, and fatty acid synthesis, furthering our molecular understanding of a type 2 diabetes risk factor. We envision SAGAT as a means to maximize the potential for biological discovery from subtle transcriptional responses, and we provide it as a freely available s
Background: Microarray gene expression datasets usually contain a large number of genes, which complicates further operations such as classification, clustering and other kinds of analysis. During classification, identifying salient genes is a demanding task that requires careful selection. Methods: Classifying multi-class datasets is more difficult than binary classification. When there are multiple class labels, the datasets are more likely to be imbalanced. The number of samples belonging to each class can vary widely, and hence the classification process may become biased when unsuitable samples are chosen for training. Little research addresses all three of these scenarios together in microarray datasets. Results and Discussion: The paper fills this gap with the following contributions: i) it selects salient genes for classification using the multiSURF algorithm; ii) it identifies the right instances from imbalanced datasets using the Retained Tomek Link algorithm; and iii) it performs gene selection for multi-class classification using Dynamic Length Particle Swarm Optimization (DPSO). Conclusion: The proposed method is implemented on multi-class imbalanced microarray datasets, and the final classification performance is encouraging and better than that of the other compared methods.
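To make the instance-selection step more concrete, the following is a minimal sketch of how classical Tomek links can be found: a Tomek link is a pair of samples from different classes that are each other's nearest neighbour. This is the standard definition, not the "Retained Tomek Link" variant named in the abstract, whose details are not given here. The function names and the toy data are illustrative assumptions.

```python
from math import dist


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return sorted index pairs (i, j), i < j, that form classical Tomek links."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return sorted(
        (i, j)
        for i, j in enumerate(nn)
        # mutual nearest neighbours with different class labels
        if i < j and nn[j] == i and labels[i] != labels[j]
    )


# Hypothetical toy data: three tight pairs, only the middle one crosses classes.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0), (5.0, 5.2)]
labels = ["A", "A", "A", "B", "B", "B"]
links = tomek_links(points, labels)
print(links)
```

In undersampling schemes built on this idea, the majority-class member of each link is typically the candidate for removal; which member is kept or dropped in the paper's variant is not stated in the abstract.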
Recent advances in bioinformatics have led to high-dimensional microarray datasets. These datasets remain challenging for machine learning researchers because they combine a small sample size with an extremely large number of features. Feature selection is therefore a central problem in learning from such data. In this paper, a novel feature selection method called CCFS is proposed, based on a global search that uses the main concepts of the divide-and-conquer technique. The proposed CCFS algorithm randomly divides the dataset vertically (on features) and applies the fundamental concepts of cooperative coevolution, using a filter criterion in the fitness function, to search the solution space via a binary gravitational search algorithm. To determine the effectiveness of the proposed method, experiments are carried out on seven binary high-dimensional microarray datasets. The obtained results are compared with nine state-of-the-art feature selection algorithms, including Interact (INT) and Maximum Relevancy Minimum Redundancy (MRMR). The average results are analyzed with a non-parametric statistical test, which reveals that the proposed method differs meaningfully from the others in terms of accuracy, sensitivity, specificity and number of selected features. (C) 2018 Elsevier Ltd. All rights reserved.
Well-defined relationships between oligonucleotide properties and hybridization signal intensities (HSI) can aid chip design, data normalization and true biological knowledge discovery. We clarify these relationships using the data from two microarray experiments containing over three million probes from 48 high-density chips. We find that melting temperature (Tm) has the most significant effect on HSI, while length, for the long oligonucleotides studied, has very little effect. Analysis of positional effect using a linear model provides evidence that the protruding ends of probes contribute more than tethered ends to HSI, which is further validated by specifically designed match fragment sliding and extension experiments. The impact of sequence similarity (SeqS) on HSI is not significant in comparison with other oligonucleotide properties. Using regression and regression tree analysis, we prioritize these oligonucleotide properties based on their effects on HSI. The implications of our discoveries for the design of unbiased oligonucleotides are discussed. We propose that designing isothermal probes by varying their length is a viable strategy to reduce sequence bias, though imposing selection constraints on other oligonucleotide properties is also essential.
ISBN: (print) 9783319093390; 9783319093383
In this paper, a new ensemble system based on genetic programming (GP), named GPES, is proposed. Decision trees are used as base classifiers, and their outputs are fused by GP with three voting methods: min, max and average. In this way, each individual of GP acts as an ensemble system. When the evolution process of GP ends, the final ensemble committee is selected from the last generation by a forward search algorithm. GPES is evaluated on microarray datasets, and the results show that this ensemble system is competitive with other ensemble systems.
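As a rough illustration of the three voting methods, the sketch below fuses the class-probability vectors of several base classifiers with a single min, max or average rule. It assumes each base classifier outputs one probability per class; in GPES these operators would be composed inside an evolved GP tree, which this sketch does not attempt to reproduce. The names `fuse` and `predict` and the numbers are hypothetical.

```python
def fuse(prob_rows, rule):
    """Combine per-classifier class-probability vectors with min, max, or average."""
    rules = {
        "min": min,
        "max": max,
        "average": lambda values: sum(values) / len(values),
    }
    combine = rules[rule]
    # zip(*prob_rows) groups the scores that all classifiers give to one class
    return [combine(list(column)) for column in zip(*prob_rows)]


def predict(prob_rows, rule):
    """Index of the class with the highest fused score."""
    fused = fuse(prob_rows, rule)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical outputs of three base classifiers for one sample, three classes.
votes = [
    [0.6, 0.3, 0.1],
    [0.0, 0.5, 0.5],
    [0.7, 0.2, 0.1],
]
for rule in ("min", "max", "average"):
    print(rule, predict(votes, rule))
```

On this toy input the min rule favours the class no classifier rules out entirely, while max and average follow the most confident classifiers, which shows why the choice of operator can change the ensemble's decision.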
Mining frequent closed patterns from microarray datasets has attracted increasing attention. However, most previous studies required users to specify a minimum support threshold. In practice, it is not easy for users to set an appropriate minimum support threshold and to discover the interesting patterns among a huge number of frequent closed patterns. In this paper, we propose an alternative mining task that mines the top-k frequent closed patterns of length no less than min from microarray datasets, where k is the desired number of frequent closed patterns to be mined. An efficient algorithm, TBtop, is developed that adopts a top-down breadth-first search strategy. Our performance study shows that the strategy is effective in pruning the search space. In most cases, TBtop outperforms the CARPENTER algorithm.
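The mining task itself can be illustrated with a brute-force sketch. It relies on the fact that every closed pattern is the intersection of the rows that contain it, so intersecting every subset of rows enumerates all closed patterns. This is not the TBtop algorithm (no top-down breadth-first search and no pruning) and is exponential in the number of rows; the function name, the `min_len` parameter and the toy rows are assumptions for illustration.

```python
from itertools import combinations


def top_k_closed_patterns(rows, k, min_len):
    """Brute-force top-k frequent closed patterns with at least min_len items."""
    closed = {}
    for size in range(1, len(rows) + 1):
        for subset in combinations(rows, size):
            # The items shared by a set of rows always form a closed pattern.
            pattern = frozenset.intersection(*subset)
            if len(pattern) >= min_len:
                # Support = number of rows that contain the whole pattern.
                closed[pattern] = sum(pattern <= row for row in rows)
    # Highest support first; ties broken alphabetically for determinism.
    ranked = sorted(closed.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    return [(sorted(pattern), support) for pattern, support in ranked[:k]]


# Hypothetical toy dataset: each row is the set of items (genes) present in a sample.
rows = [frozenset("abcd"), frozenset("abc"), frozenset("abd"), frozenset("cd")]
result = top_k_closed_patterns(rows, k=2, min_len=2)
print(result)
```

Note that no support threshold appears anywhere: the user supplies only k and the minimum length, which is the point of the top-k formulation.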
Currently, feature subset selection methods are very important, especially in areas of application for which datasets with tens or hundreds of thousands of variables (genes) are available. Feature subset selection methods help us select a small number of variables out of thousands of genes in microarray datasets for a more accurate and balanced classification. Efficient gene selection can ease the computational burden of the subsequent classification task and can yield a subset of the gene set without loss of classification performance. In classifying microarray data, the main objective of gene selection is to search for the genes that keep the maximum amount of relevant information about the class while minimizing classification errors. In this paper, we explain the importance of feature subset selection methods in the machine learning and data mining fields. The analysis of microarray expression was used to check whether global biological differences underlie common pathological features in different types of cancer datasets and to identify genes that might anticipate the clinical behavior of this disease. Gene expression data contain large amounts of raw data that must be analyzed to obtain useful information for specific biological and medical applications. One way of finding relevant (and removing redundant) genes is by using the Bayesian network based on the Markov blanket [1]. We present and compare the performance of different approaches to feature (gene) subset selection based on Wrapper and Markov Blanket models for the five microarray cancer datasets. The first approach depends on the Memetic algorithms (MAs) used for the feature selection method. The second approach uses MRMR (Minimum Redundant Maximum Relevant) for feature subset selection hybridized with genetic search optimization techniques and afterwards compares the Markov blanket model's performance with the most common classical classifica
Traditional classification algorithms struggle with the high dimensionality of medical data, resulting in reduced performance in tasks like disease diagnosis. Feature selection (FS) has emerged as a crucial preprocessing step to mitigate these challenges by extracting relevant features and improving classification accuracy. This paper proposes a hybrid FS method, FJMIBCOA, which integrates Fuzzy Joint Mutual Information (FJMI) as a filter measure and Binary Cheetah Optimizer Algorithm (BCOA) as a wrapper method. Unlike existing hybrid FS methods, the proposed method employs FJMI to address uncertainty in feature relationships, providing several advantages such as handling both discrete and continuous features, accommodating linear and non-linear relationships, noise robustness and effectively utilizing intra- and inter-class information. It also employs BCOA as a wrapper method, requiring a few parameters, minimizing computational overhead and enhancing classification robustness, making it an efficient and adaptable solution for FS in complex medical datasets. The proposed method is validated on 23 medical datasets and 14 high-dimensional microarray datasets, demonstrating excellent performance in terms of fitness value, accuracy and feature size. FJMIBCOA surpasses existing methods in medical datasets by achieving higher accuracy in 78.26% of datasets while reducing the feature size by 84.79%. Similarly, in microarray datasets, it improves accuracy in 78.58% of datasets with an impressive 95.08% reduction in feature size. Furthermore, FJMIBCOA achieves superior accuracy in 60% of datasets while selecting fewer features in 78.57% of datasets as compared to previous studies. Statistical testing indicates that FJMIBCOA outperforms other methods significantly. The proposed method enhances diagnosis accuracy and minimizes medical testing requirements, making it suitable for real-world, high-dimensional datasets and decision-making in medical data analysis. The findings
Recently, there has been great interest in developing feature selection methods for high-dimensional microarray datasets. In this paper, an innovative method based on the Maximum Relevancy and Minimum Redundancy (MRMR) approach using Hesitant Fuzzy Sets (HFSs) is proposed to deal with feature subset selection; the method is called MRMR-HFS. MRMR-HFS is a novel filter-based feature selection algorithm that selects features by an ensemble of ranking algorithms (as the measure of feature-class relevancy that must be maximized) and similarity measures (as the measure of feature-feature redundancy that must be minimized). The ranking algorithms and similarity measures are combined using the fundamental concepts of information energies of HFSs. The proposed method is inspired by Correlation-based Feature Selection (CFS) within the sequential forward search, in order to present a robust feature selection tool for high-dimensional problems. To evaluate the effectiveness of MRMR-HFS, several experiments are carried out on nine well-known high-dimensional microarray datasets. The obtained results are compared with those of other similar state-of-the-art algorithms, including Correlation-based Feature Selection (CFS), Fast Correlation-based Filter (FCBF), Interact (INT), and Maximum Relevancy Minimum Redundancy (MRMR). The comparison, carried out via several non-parametric statistical tests, confirms that MRMR-HFS is effective for feature subset selection in high-dimensional datasets in terms of accuracy, sensitivity, specificity, G-mean, and number of selected features. (C) 2016 Elsevier B.V. All rights reserved.
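The relevancy-versus-redundancy trade-off can be sketched with plain greedy mRMR on discrete features, scoring each candidate by its mutual information with the class minus its mean mutual information with the features already chosen. This is the baseline MRMR idea only, not MRMR-HFS: the hesitant fuzzy set machinery and the ensemble of rankers are not modelled. The feature names and toy values are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def mrmr(features, target, n_select):
    """Greedy mRMR: relevance to the target minus mean redundancy with chosen features."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(n_select, len(features)):
        # Ties are broken by dictionary insertion order.
        best = max((f for f in features if f not in selected), key=score)
        selected.append(best)
    return selected


# Hypothetical toy data: g2 duplicates g1, g3 is weaker but carries different information.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 1, 1, 1, 1, 1],
    "g2": [0, 0, 0, 1, 1, 1, 1, 1],
    "g3": [0, 0, 1, 0, 1, 1, 0, 1],
}
selected = mrmr(features, target, n_select=2)
print(selected)
```

Ranking by relevance alone would pick the duplicate pair g1 and g2; the redundancy penalty is what lets the less relevant but complementary g3 in.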
Identifying disease-related genes is an ongoing research issue in biomedical analysis. Many studies have recently presented various strategies for predicting disease-related genes. However, only a handful of them are capable of identifying or selecting relevant genes with a low computational burden. To tackle this issue, we introduce a new filter-wrapper-based gene selection (GS) method based on metaheuristic algorithms (MHAs) in conjunction with the k-nearest neighbors (k-NN) classifier. Specifically, we hybridize two MHAs, the bat algorithm (BA) and the JAYA algorithm (JA), embedded with perturbation as a new perturbation-based exploration strategy (PES), to obtain the JAYA-bat algorithm (JBA). The fact that JBA outperforms 10 state-of-the-art GS methods on 12 high-dimensional microarray datasets (ranging from 2000 to 22,283 features or genes) is impressive. It is also noteworthy that relevant genes are first selected via a filter-based method called mutual information (MI), and then further optimized by JBA to select the near-optimal genes in a timely fashion. Compared with 11 well-known original MHAs, including BA and JA, the proposed JBA achieves significantly better results, with improvement rates of 12.36%, 12.45%, 97.88%, 9.84%, 12.45%, and 12.17% in terms of fitness, accuracy, gene selection ratio, precision, recall, and F1-score, respectively. The results of Wilcoxon's signed-rank test at a significance level of α = 0.05 furth