ISBN: 9781538630662 (print)
Data mining is the process of discovering patterns and retrieving knowledge from large datasets. Many learning and data mining algorithms rely on distance metrics. Cluster analysis is one such learning approach that has been applied to biological data, for example microarray expression data. In this study, we assessed the validity of five distance metrics (Euclidean, Manhattan, Minkowski, Cosine, and Mahalanobis) with the partitioning around medoids (PAM) algorithm on microarray datasets. The microarray datasets were pre-processed prior to analysis, and the algorithm was evaluated using Dunn's validity index. Our results showed that when the selected microarray datasets were clustered with partitioning around medoids based on the Manhattan, Minkowski, Cosine, and Euclidean distances for different k partitions, all of these distances exhibited unsatisfactory performance; however, the partitioning around medoids algorithm generated an optimal cluster solution when used with the Mahalanobis distance.
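As a rough illustration of the quantities named in this abstract, the sketch below implements four of the five distances and the classic form of Dunn's index (smallest between-cluster distance divided by largest within-cluster diameter) in plain Python. It is not the authors' code: the function names and the toy data are assumptions made for this example, the Mahalanobis distance is left out because it needs an inverse covariance matrix, and the PAM step itself is not shown.

```python
import math
from itertools import combinations


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def manhattan(a, b):
    return minkowski(a, b, 1)


def euclidean(a, b):
    return minkowski(a, b, 2)


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dunn_index(clusters, dist):
    """Smallest between-cluster distance over largest cluster diameter.

    Assumes at least two clusters and at least one cluster with two or
    more distinct points; otherwise min()/max() or the division fails.
    """
    min_separation = min(
        dist(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p in c1
        for q in c2
    )
    max_diameter = max(
        dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_separation / max_diameter


# Hypothetical toy data: two clusters of two-dimensional profiles.
toy_clusters = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(3.0, 0.0), (3.0, 1.0)],
]

if __name__ == "__main__":
    for name, dist in [("euclidean", euclidean), ("manhattan", manhattan)]:
        print(name, dunn_index(toy_clusters, dist))
```

A larger Dunn value indicates compact, well-separated clusters, so the same partition can be scored under each metric and the scores compared.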
Most of the available feature selection techniques in the literature are classifier-bound, meaning that a group of features is tied to the performance of a specific classifier, as in wrapper and hybrid approaches. Our o...
ISBN: 9783319561486; 9783319561479 (print)
Microarray technologies produce very large amounts of data that need to be classified for interpretation. Large data coupled with small sample sizes make it challenging for researchers to extract useful information, and therefore a lot of effort goes into the design and testing of feature selection tools; the literature abounds with descriptions of numerous methods. In this paper we select five representative review papers in the field of feature selection for microarray data in order to understand their underlying classification of methods. Finally, on this basis, we propose an extended taxonomy for categorizing feature selection techniques and use it to classify the main methods presented in the selected reviews.
ISBN: 9781538607169 (print)
DNA microarray data are high-dimensional data that enable researchers to analyze the expression of many genes in a single reaction quickly and efficiently. Characteristics such as small sample size, class imbalance, and data complexity make such data difficult to classify. Feature selection is a process that automatically selects the features in a dataset that are most relevant to predictive modeling. This research aims at investigating, implementing, and analyzing a feature selection method using the Artificial Bee Colony (ABC) approach. The result is compared with other evolutionary algorithms, namely the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Feature selection using ABC gives better classification results with k-Nearest Neighbor (k-NN) and Decision Tree (DT), but selects a slightly higher fraction of features than the GA and PSO algorithms.
The human gene network is much more complex than just pairwise interactions among genes. Zhang et al. [6] extracted microarray data from the International Genomics Consortium (IGC) and presented the detection of three-way gene interactions using Fisher’s z-transformation test. Three-way gene interactions come closer than pairwise correlations to representing complex gene structures, and they are more tractable than interactions among four or more genes. In this paper, we simulate different models in which Fisher’s test might not be as effective. Zhang et al.’s approach utilized Pearson’s correlation coefficients and detected linear interactions only. Since gene interactions can show any kind of behavior, their evaluation approach might fail in many cases. Therefore, we use the dataset that Zhang et al. provided to detect three-way gene interactions using non-parametric tests such as Kolmogorov-Smirnov and Cross-Match.
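For readers unfamiliar with Fisher's z-transformation, the following sketch shows the standard test for whether two Pearson correlations differ. One plausible way to apply it to a three-way interaction is to split the samples by the level of a third gene and compare the correlation of the other two genes in each group; whether this matches Zhang et al.'s exact procedure is an assumption here, and all function names are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def fisher_z(r):
    """Fisher's z-transformation, atanh(r); requires |r| < 1."""
    return math.atanh(r)


def compare_correlations(r1, n1, r2, n2):
    """Return (z statistic, two-sided p-value) for H0: equal correlations.

    n1 and n2 are the sample sizes behind r1 and r2 (each must exceed 3).
    """
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (fisher_z(r1) - fisher_z(r2)) / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


if __name__ == "__main__":
    # Hypothetical correlations of genes A and B when gene C is high vs. low.
    print(compare_correlations(0.8, 50, -0.2, 50))
```

Because the test is built on Pearson's coefficient, it responds only to linear dependence, which is the limitation the abstract raises.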
Microarray data suffer from missing values for various reasons, including insufficient resolution, image noise, and experimental errors. Because missing values can hinder downstream analysis steps that require complete data as input, it is crucial to be able to estimate the missing values. In this study, we propose a Global Learning with Local Preservation method (GL2P) for imputation of missing values in microarray data. GL2P consists of two components: a local similarity measurement module and a global weighted imputation module. The former uses a local structure preservation scheme to exploit as much information as possible from the observable data, and the latter is responsible for estimating the missing values of a target gene by considering all of its neighbors rather than a subset of them. Furthermore, GL2P imputes the missing values in ascending order according to the rate of missing data for each target gene to fully utilize previously estimated values. To validate the proposed method, we conducted extensive experiments on six benchmark microarray datasets. We compared GL2P with eight state-of-the-art imputation methods in terms of four performance metrics. The experimental results indicate that GL2P outperforms its competitors in terms of imputation accuracy and better preserves the structure of differentially expressed genes. In addition, GL2P is less sensitive to the number of neighbors than other local learning-based imputation methods. (C) 2016 Elsevier Ltd. All rights reserved.
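The sketch below is not GL2P; the abstract does not give enough detail to reproduce its similarity measure or weighting. It only illustrates two ideas the abstract describes: every other gene contributes to an estimate with a similarity-based weight, and genes are processed in ascending order of missing rate so that earlier estimates can be reused. The inverse-distance weight, the `eps` parameter, and the function name are assumptions made for this example.

```python
import math


def impute_weighted(matrix, eps=1e-9):
    """Fill None entries in a genes-by-samples list of lists.

    Genes with the fewest missing values are handled first, and values
    estimated for them are available when later genes are imputed.
    The input matrix is left unchanged; a filled copy is returned.
    """
    filled = [row[:] for row in matrix]
    order = sorted(
        range(len(matrix)),
        key=lambda i: sum(v is None for v in matrix[i]),
    )
    for i in order:
        for j, value in enumerate(matrix[i]):
            if value is not None:
                continue
            num = 0.0
            den = 0.0
            for k, other in enumerate(filled):
                if k == i or other[j] is None:
                    continue
                # Compare the target's originally observed values with the
                # other gene's (possibly already imputed) values.
                shared = [
                    (a, b)
                    for a, b in zip(matrix[i], other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue
                d = math.sqrt(
                    sum((a - b) ** 2 for a, b in shared) / len(shared)
                )
                w = 1.0 / (d + eps)  # assumed inverse-distance weight
                num += w * other[j]
                den += w
            if den > 0.0:
                filled[i][j] = num / den
    return filled


if __name__ == "__main__":
    toy = [
        [1.0, 2.0, 3.0],
        [1.0, 2.0, None],
        [1.1, 2.1, 3.1],
    ]
    print(impute_weighted(toy))
```

An entry stays `None` when no other gene has a usable value in that column, which keeps the sketch from inventing data.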
Identification of relevant genes from microarray data is an apparent need in many applications. For such identification, different ranking techniques with different evaluation criteria are used, and these usually assign different ranks to the same gene. As a result, different techniques identify different gene subsets, which may not be the set of significant genes. To overcome such problems, this study suggests pipelining the ranking techniques. In each stage of the pipeline, a few of the lower-ranked features are eliminated, and at the end a relatively good subset of features is preserved. However, the order in which the ranking techniques are used in the pipeline is important to ensure that the significant genes are preserved in the final subset. For this experimental study, twenty-four unique pipeline models are generated from four gene ranking strategies. These pipelines are tested with seven different microarray databases to find a suitable pipeline for the task. Further, the gene subset obtained is tested with four classifiers, and four performance metrics are evaluated. No single pipeline dominates the others in performance; therefore a grading system is applied to the results of these pipelines to find a consistent model. The grading system's finding that one pipeline model is significant is also confirmed by the Nemenyi post-hoc hypothesis test. The performance of this pipeline model is compared with the four ranking techniques: although it is not always superior, it yields better results most of the time and can be suggested as a consistent model. However, it requires more computational time than single ranking techniques. (C) 2016 Elsevier B.V. All rights reserved.
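The count of twenty-four pipelines follows from ordering four ranking strategies in every possible way (4! = 24). The sketch below enumerates those orderings and runs a staged elimination. The four scoring functions are simple stand-ins chosen for this example, since the abstract does not name the actual ranking strategies, and the data layout and `drop_per_stage` parameter are likewise assumptions.

```python
from itertools import permutations


def variance(data, f):
    values = data[f]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def value_range(data, f):
    return max(data[f]) - min(data[f])


def mean_abs(data, f):
    return sum(abs(v) for v in data[f]) / len(data[f])


def peak(data, f):
    return max(abs(v) for v in data[f])


def run_pipeline(scorers, data, features, drop_per_stage):
    """Apply scorers in order; after each stage drop the lowest-scoring features."""
    remaining = list(features)
    for scorer in scorers:
        ranked = sorted(remaining, key=lambda f: scorer(data, f), reverse=True)
        keep = max(1, len(ranked) - drop_per_stage)
        remaining = ranked[:keep]
    return remaining


# Hypothetical expression values for six genes across three samples.
toy_data = {
    "g1": [0.0, 0.0, 0.0],
    "g2": [1.0, -1.0, 1.0],
    "g3": [5.0, -5.0, 5.0],
    "g4": [0.1, 0.2, 0.1],
    "g5": [2.0, 2.0, 2.0],
    "g6": [3.0, -3.0, 0.0],
}

# Every ordering of the four stand-in scorers is one pipeline model.
pipelines = list(permutations([variance, value_range, mean_abs, peak]))

if __name__ == "__main__":
    print(len(pipelines))
    for order in pipelines[:3]:
        names = [s.__name__ for s in order]
        print(names, run_pipeline(order, toy_data, toy_data.keys(), 1))
```

Because each stage removes features before the next scorer sees them, different orderings can retain different final subsets, which is why the study compares all of them.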
This paper proposes an approach for gene selection in microarray data. The proposed approach begins with a primary filter step using the Fisher criterion, which reduces the initial set of genes and hence the search space and time complexity. Then, a wrapper approach based on cellular learning automata (CLA) optimized with the ant colony method (ACO) is used to find the set of features that improves the classification accuracy. CLA is applied because of its capability to learn and model complicated relationships. The features selected in the last phase are evaluated using the ROC curve, and the smallest feature subset that remains most effective is determined. The classifiers evaluated in the proposed framework are K-nearest neighbor, support vector machine, and naive Bayes. The proposed approach is evaluated on four microarray datasets. The evaluations confirm that the proposed approach can find the smallest subset of genes while approaching the maximum accuracy. (C) 2016 Elsevier Inc. All rights reserved.
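To make the filter step concrete, here is a minimal sketch of one common two-class form of the Fisher criterion, (mean_a - mean_b)^2 / (var_a + var_b), used to keep the top-scoring genes. The abstract does not state which variant the authors use, so this form, the function names, and the toy data are assumptions; the CLA/ACO wrapper stage is not shown.

```python
def fisher_score(values_a, values_b):
    """Two-class Fisher score for one gene.

    Uses population variances; raises ZeroDivisionError when both classes
    have zero variance for the gene.
    """
    mean_a = sum(values_a) / len(values_a)
    mean_b = sum(values_b) / len(values_b)
    var_a = sum((v - mean_a) ** 2 for v in values_a) / len(values_a)
    var_b = sum((v - mean_b) ** 2 for v in values_b) / len(values_b)
    return (mean_a - mean_b) ** 2 / (var_a + var_b)


def top_genes(expr_a, expr_b, k):
    """Return the k gene names with the highest Fisher score."""
    scores = {g: fisher_score(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical expression values per gene for two classes of samples.
class_a = {"geneA": [1.0, 2.0, 3.0], "geneB": [1.0, 5.0, 9.0]}
class_b = {"geneA": [7.0, 8.0, 9.0], "geneB": [2.0, 5.0, 8.0]}

if __name__ == "__main__":
    print(top_genes(class_a, class_b, 1))
```

Genes whose class means are far apart relative to their within-class spread score highest, so a filter of this kind shrinks the search space before a more expensive wrapper search.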
For classification problems based on microarray data, the data typically contain a large number of irrelevant and redundant features. In this paper, a new gene selection method is proposed to choose the best subset of features for microarray data, with the irrelevant and redundant features removed. We formulate the selection problem as an L1-regularized optimization problem based on a newly defined linear discriminant analysis criterion. Instead of calculating the mean of the samples, a kernel-based approach is used to estimate the class centroid, which defines both the between-class separability and the within-class compactness for the criterion. Theoretical analysis indicates that the global optimal solution of the L1-regularized criterion can be reached under a general condition, from which an efficient algorithm is derived that solves the feature selection problem in time linear in the number of features and the number of samples. The experimental results on ten publicly available microarray datasets demonstrate that the proposed method performs effectively and competitively compared with state-of-the-art methods. (C) 2016 Elsevier Ltd. All rights reserved.
ISBN: 9788132225386; 9788132225379 (print)
Machine learning is a burgeoning technology used to extract knowledge from an ocean of data. It has strong ties with optimization and artificial intelligence and delivers theory, methodologies, and application domains to the fields of statistics and computer science. Machine learning tasks are broadly classified into two groups, namely supervised learning and unsupervised learning. The analysis of unsupervised data requires thorough computation using different clustering algorithms. Microarray gene expression data are considered here in order to cluster regulating genes separately from non-regulating genes. In our work, an optimization technique (Cat Swarm Optimization) is used to minimize the number of clusters by evaluating the Euclidean distance among the centroids. A comparative study is carried out by clustering the regulating genes before and after optimization. Principal component analysis (PCA) is incorporated for dimensionality reduction of the vast dataset to ensure good-quality cluster analysis.