Background: The most common method of identifying groups of functionally related genes in microarray data is to apply a clustering algorithm. However, it is impossible to determine which clustering algorithm is most appropriate to apply, and it is difficult to verify the results of any algorithm due to the lack of a gold-standard. Appropriate data visualization tools can aid this analysis process, but existing visualization methods do not specifically address this issue. Results: We present several visualization techniques that incorporate meaningful statistics that are noise-robust for the purpose of analyzing the results of clustering algorithms on microarray data. This includes a rank-based visualization method that is more robust to noise, a difference display method to aid assessments of cluster quality and detection of outliers, and a projection of high dimensional data into a three dimensional space in order to examine relationships between clusters. Our methods are interactive and are dynamically linked together for comprehensive analysis. Further, our approach applies to both protein and gene expression microarrays, and our architecture is scalable for use on both desktop/laptop screens and large-scale display devices. This methodology is implemented in GeneVAnD (Genomic Visual ANalysis of Datasets) and is available at http://***/GeneVAnD. Conclusion: Incorporating relevant statistical information into data visualizations is key for analysis of large biological datasets, particularly because of high levels of noise and the lack of a gold-standard for comparisons. We developed several new visualization techniques and demonstrated their effectiveness for evaluating cluster quality and relationships between clusters.
Motivation: In cluster analysis, the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets. Methods: We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between the clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramér's V² from a k × k table. Cramér's V² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility. Results: All four clustering routines show increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30 and then gradually increased thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50. These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure thereby
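As an illustration of the stability score named in this abstract, the following is a minimal sketch of Cramér's V² computed from the contingency table of two cluster labelings of the same items. The function name `cramers_v_squared` and the use of NumPy are assumptions made for this example; the abstract does not specify the authors' implementation or subsampling protocol.

```python
import numpy as np

def cramers_v_squared(labels_a, labels_b):
    """Cramér's V^2 between two labelings of the same items (hypothetical helper)."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    # Map arbitrary label values to 0..r-1 and 0..c-1.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Build the r x c contingency table of co-occurring labels.
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(table, (a_idx, b_idx), 1)
    n = table.sum()
    # Expected counts under independence of the two labelings.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    dof = min(table.shape) - 1
    # V^2 = chi^2 / (n * (min(r, c) - 1)); defined as 0 when a labeling has one cluster.
    return float(chi2 / (n * dof)) if dof > 0 else 0.0
```

Because the statistic depends only on the contingency table, it is unaffected by how the clusters are numbered in either labeling, which matters when comparing independent clustering runs.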
Background: Clustering is a key step in the analysis of gene expression data, and in fact, many classical clustering algorithms are used, or more innovative ones have been designed and validated for the task. Despite the widespread use of artificial intelligence techniques in bioinformatics and, more generally, data analysis, there are very few clustering algorithms based on the genetic paradigm, yet that paradigm has great potential in finding good heuristic solutions to a difficult optimization problem such as clustering. Results: GenClust is a new genetic algorithm for clustering gene expression data. It has two key features: (a) a novel coding of the search space that is simple, compact and easy to update; (b) it can be used naturally in conjunction with data-driven internal validation methods. We have experimented with the FOM methodology, specifically conceived for validating clusters of gene expression data. The validity of GenClust has been assessed experimentally on real data sets, both with the use of validation measures and in comparison with other algorithms, i.e., Average Link, Cast, Click and K-means. Conclusion: Experiments show that none of the algorithms we have used is markedly superior to the others across data sets and validation measures; i.e., in many cases the observed differences between the worst and best performing algorithm may be statistically insignificant and they could be considered equivalent. However, there are cases in which an algorithm may be better than others and therefore worthwhile. In particular, experiments for GenClust show that, although simple in its data representation, it converges very rapidly to a local optimum and that its ability to identify meaningful clusters is comparable, and sometimes superior, to that of more sophisticated algorithms. In addition, it is well suited for use in conjunction with data-driven internal validation measures and, in particular, the FOM methodology.
Background: Understanding the evolutionary relationships among species based on their genetic information is one of the primary objectives in phylogenetic analysis. Reconstructing phylogenies for large data sets is still a challenging task in bioinformatics. Results: We propose a new distance-based clustering method, the shortest triplet clustering algorithm (STC), to reconstruct phylogenies. The main idea is the introduction of a natural definition of so-called k-representative sets. Based on k-representative sets, shortest triplets are reconstructed and serve as building blocks for the STC algorithm to agglomerate sequences for tree reconstruction in O(n^2) time for n sequences. Simulations show that STC gives better topological accuracy than other tested methods that also build a first starting tree. STC thus appears to be a very good method for starting tree reconstruction. However, all tested methods give similar results if balanced nearest neighbor interchange (BNNI) is applied as a post-processing step. BNNI leads to an improvement in all instances. The program is available at http://***/software/stc/. Conclusion: The results demonstrate that the new approach efficiently reconstructs phylogenies for large data sets. We found that BNNI boosts the topological accuracy of all methods including STC; therefore, one should use BNNI as a post-processing step to get better topological accuracy.
We present a detailed numerical study of effective interactions between micrometer-sized silica spheres, induced by highly charged zirconia nanoparticles. It is demonstrated that the effective interactions are consistent with a recently discovered mechanism for colloidal stabilization. In accordance with the experimental observations, small nanoparticle concentrations induce an effective repulsion that counteracts the intrinsic van der Waals attraction between the colloids and thus stabilizes the suspension. At higher nanoparticle concentrations an attractive potential is recovered, resulting in reentrant gelation. Monte Carlo simulations of this highly size-asymmetric mixture are made possible by means of a geometric cluster Monte Carlo algorithm. A comparison is made to results obtained from the Ornstein-Zernike equations with the hypernetted-chain closure.
We present a cluster algorithm for resistively shunted Josephson junctions or similar physical systems, which dramatically improves sampling efficiency, and apply it to the superconductor-to-metal transition in a single junction. Measuring the temperature dependence of the zero bias resistance, we confirm that the critical point does not depend on the strength of the Josephson coupling. However, we find that the correlation exponents vary continuously along the phase boundary, indicating that the Schmid-Bulgadaev transition is a line of fixed points.
ISBN: 0780372689 (print)
This paper constructs an RBF neural network (RBFNN) model for the identification of nonlinear systems. First, an optimal selection clustering algorithm is proposed. This algorithm determines the optimal number of hidden-layer nodes of the RBFNN from the input samples and, at the same time, yields the initial parameter values of the RBFs. The RBF parameters are then estimated by a gradient algorithm with momentum terms, and the RBFNN weights are identified by a recursive least-squares algorithm; the two algorithms are iterated alternately. This hybrid scheme not only raises the identification precision of the RBFNN but also improves the generalization of the network. Applications demonstrate the validity of the scheme.
Background: In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering, SOM, etc., genes are partitioned into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective methods for clustering must be developed to process this growing amount of biological data. Results: In this paper, we propose a new clustering algorithm, the Incremental Genetic K-means Algorithm (IGKA). IGKA is an extension of our previously proposed clustering algorithm, the Fast Genetic K-means Algorithm (FGKA). IGKA outperforms FGKA when the mutation probability is small. The main idea of IGKA is to calculate the objective value, Total Within-Cluster Variation (TWCV), and the cluster centroids incrementally whenever the mutation probability is small. IGKA inherits the salient feature of FGKA of always converging to the global optimum. A C program is freely available at http://***/proj/FGKA/***. Conclusions: Our experiments indicate that, while the IGKA algorithm has a convergence pattern similar to FGKA, it has better time performance when the mutation probability decreases to some point. Finally, we used IGKA to cluster a yeast dataset and found that it increased the enrichment of genes of similar function within the cluster.
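The idea of updating TWCV and centroids incrementally can be sketched as follows. This is not the IGKA implementation, which the abstract does not detail; it is a minimal illustration that assumes TWCV is the sum of squared Euclidean distances from each point to its cluster centroid, and that uses the identity TWCV = Σ‖x‖² − Σ_k ‖S_k‖²/n_k, where S_k and n_k are the per-cluster coordinate sum and size. The class and method names are hypothetical.

```python
import numpy as np

class IncrementalTWCV:
    """Keeps per-cluster sums and counts so a reassignment costs O(d), not O(n*d)."""

    def __init__(self, data, labels, k):
        self.data = np.asarray(data, dtype=float)
        self.labels = np.array(labels, dtype=int)
        self.sums = np.zeros((k, self.data.shape[1]))   # S_k: coordinate sums
        self.counts = np.zeros(k, dtype=int)            # n_k: cluster sizes
        np.add.at(self.sums, self.labels, self.data)
        np.add.at(self.counts, self.labels, 1)
        self.total_sq = float((self.data ** 2).sum())   # constant term: sum of ||x||^2

    def move(self, i, new_cluster):
        """Reassign point i (e.g. after a mutation), touching only two clusters."""
        old = self.labels[i]
        if old == new_cluster:
            return
        x = self.data[i]
        self.sums[old] -= x
        self.counts[old] -= 1
        self.sums[new_cluster] += x
        self.counts[new_cluster] += 1
        self.labels[i] = new_cluster

    def centroids(self):
        """Centroids of non-empty clusters; empty clusters are returned as NaN."""
        out = np.full(self.sums.shape, np.nan)
        nonempty = self.counts > 0
        out[nonempty] = self.sums[nonempty] / self.counts[nonempty][:, None]
        return out

    def twcv(self):
        nonempty = self.counts > 0
        sq = (self.sums[nonempty] ** 2).sum(axis=1) / self.counts[nonempty]
        return self.total_sq - float(sq.sum())

def twcv_from_scratch(data, labels, k):
    """Direct definition, used here only to check the incremental version."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in range(k):
        members = data[labels == c]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)
```

When only a few points change cluster per generation, as happens with a small mutation probability, each change updates two sums and two counts instead of triggering a full pass over the data.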
We have performed a high-precision Monte Carlo study of the dynamic critical behavior of the Swendsen-Wang algorithm for the three-dimensional Ising model at the critical point. For the dynamic critical exponents associated with the integrated autocorrelation times of the "energy-like" observables, we find z(int,N) = z(int,ε) = z(int,ε') = 0.459 ± 0.005 ± 0.025, where the first error bar represents statistical error (68% confidence interval) and the second error bar represents possible systematic error due to corrections to scaling (68% subjective confidence interval). For the "susceptibility-like" observables, we find z(int,M²) = z(int,S₂) = 0.443 ± 0.005 ± 0.030. For the dynamic critical exponent associated with the exponential autocorrelation time, we find z(exp) ≈ 0.481. Our data are consistent with the Coddington-Baillie conjecture z(SW) = β/ν ≈ 0.5183, especially if it is interpreted as referring to z(exp). © 2004 Elsevier B.V. All rights reserved.
Microarray analysis using clustering algorithms can suffer from lack of inter-method consistency in assigning related gene-expression profiles to clusters. Obtaining a consensus set of clusters from a number of clustering methods should improve confidence in gene-expression analysis. Here we introduce consensus clustering, which provides such an advantage. When coupled with a statistically based gene functional analysis, our method allowed the identification of novel genes regulated by NF-κB and the unfolded protein response in certain B-cell lymphomas.