Background: Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there have been indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families. Results: The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or to binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams, which revealed four structural sub-domains within the single-domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appears to play an important role in the molecular structure and/or function. Co
This correspondence describes extensions to the k-modes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in [4], [12] which allows the use of the k-modes paradigm to obtain clusters with strong intrasimilarity and to efficiently cluster large categorical data sets. The main aim of this paper is to rigorously derive the updating formula of the k-modes clustering algorithm with the new dissimilarity measure and to establish the convergence of the algorithm under the optimization framework.
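For orientation, the baseline that the abstract above builds on can be sketched as follows. This is a minimal illustration of the conventional k-modes loop with the plain simple matching dissimilarity, not the modified measure or the updating formula derived in the paper, which the abstract does not spell out. The function names and the rule of keeping the old mode for an empty cluster are assumptions made for this sketch.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(objects):
    """Per-attribute most frequent value of a non-empty list of objects."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*objects))


def k_modes(data, initial_modes, max_iter=100):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each object joins the cluster of its nearest mode.
        labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        # Update step: recompute each mode from its members.
        # Assumption of this sketch: an empty cluster keeps its old mode.
        new_modes = []
        for j, old in enumerate(modes):
            members = [x for x, lab in zip(data, labels) if lab == j]
            new_modes.append(column_modes(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes
```

Calling `k_modes(data, [data[0], data[3]])` on a list of equal-length tuples returns one cluster label per object together with the final modes. The result depends on the initial modes passed in, as with any k-modes variant.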
In clustering algorithms, choosing a subset of representative examples from the data set is very important. Such "exemplars" can be found by randomly choosing an initial subset of data objects and then iteratively refining it, but this works well only if that initial choice is close to a good solution. In this paper, the average density of an object is defined based on the frequency of attribute values. Furthermore, a novel initialization method for categorical data is proposed, in which both the distance between objects and the density of each object are considered. We also apply the proposed initialization method to the k-modes algorithm and the fuzzy k-modes algorithm. Experimental results illustrate that the proposed initialization method is superior to random initialization and can be applied to large data sets because of its linear time complexity with respect to the number of data objects. (C) 2009 Elsevier Ltd. All rights reserved.
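One plausible reading of the density idea in the abstract above is sketched below. The abstract does not give the formulas, so everything here is an assumption: the density of an object is taken to be the mean relative frequency of its attribute values, the first initial mode is the densest object, and each further mode maximizes (distance to the nearest already chosen mode) times (density). The names `average_density`, `mismatches` and `select_initial_modes` are hypothetical.

```python
from collections import Counter


def average_density(data):
    """Mean relative frequency of each object's attribute values (assumed form)."""
    n = len(data)
    counts = [Counter(col) for col in zip(*data)]  # one pass over the data
    m = len(counts)
    return [sum(counts[a][x[a]] for a in range(m)) / (m * n) for x in data]


def mismatches(a, b):
    """Simple matching distance between two categorical objects."""
    return sum(u != v for u, v in zip(a, b))


def select_initial_modes(data, k):
    """Greedy selection combining distance and density (assumed rule)."""
    dens = average_density(data)
    # First mode: the object with the highest average density.
    chosen = [max(range(len(data)), key=lambda i: dens[i])]
    while len(chosen) < k:
        def score(i):
            # Far from every chosen mode and in a dense region scores high.
            return min(mismatches(data[i], data[c]) for c in chosen) * dens[i]
        chosen.append(max(range(len(data)), key=score))
    return [data[i] for i in chosen]
```

The density pass touches every attribute value once, so it is linear in the number of objects for a fixed number of attributes. The greedy selection adds a factor of k on top of that.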
ISBN: 9781424421138 (print)
DNA splice site adjacent sequences are remarkably conserved, and mining their underlying biological knowledge has become a key issue in the field of DNA sequence analysis. In this paper, we analyze the features of human DNA splice site adjacent sequences. First, we propose a clustering method for DNA splice site sequences based on Genetic k-modes; second, we analyze the frequencies of single bases, di-bases and tri-bases in the experimental data set and in each cluster; last, we propose a Markov-model-based frequent pattern discovery algorithm and use it to mine the frequent patterns of the experimental data set and of each cluster.
DNA splice site adjacent sequences are remarkably conserved, and mining their underlying biological knowledge has become a key issue in the field of DNA sequence analysis. In this paper, we analyze the features of human DNA splice site adjacent sequences. First, we propose a clustering method for DNA splice site sequences based on Genetic k-modes; second, we analyze the frequencies of single bases, di-bases and tri-bases in the experimental data set and in each cluster; last, we propose a Markov-model-based frequent pattern discovery algorithm and use it to mine the frequent patterns of the experimental data set and of each cluster.
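The base, di-base and tri-base frequency analysis mentioned in the abstract above amounts to counting overlapping substrings of length 1, 2 and 3. A minimal sketch follows, assuming overlapping windows and frequencies normalized over all windows of the given length. The function name `kmer_frequencies` is hypothetical, and the Markov-model-based pattern discovery step is not reproduced because the abstract does not describe it.

```python
from collections import Counter


def kmer_frequencies(sequences, k):
    """Relative frequency of every overlapping length-k substring."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k over the sequence, one base at a time.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no sequence is long enough for a window of width k
    return {kmer: c / total for kmer, c in counts.items()}
```

Running the function with k = 1, 2 and 3 on the whole data set and then on the sequences of each cluster gives the per-cluster profiles that can be compared against the overall profile.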
In this paper, the conventional k-modes-type algorithms for clustering categorical data are extended by representing the clusters of categorical data with k-populations instead of the hard-type centroids used in the conventional algorithms. Use of a population-based centroid representation makes it possible to preserve the uncertainty inherent in data sets as long as possible before actual decisions are made. The k-populations algorithm was found to give markedly better clustering results through various experiments. (c) 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
In this paper the conventional fuzzy k-modes algorithm for clustering categorical data is extended by representing the clusters of categorical data with fuzzy centroids instead of the hard-type centroids used in the original algorithm. Use of fuzzy centroids makes it possible to fully exploit the power of fuzzy sets in representing the uncertainty in the classification of categorical data. To test the proposed approach, the proposed algorithm and two conventional algorithms (the k-modes and fuzzy k-modes algorithms) were used to cluster three categorical data sets. The proposed method was found to give markedly better clustering results. (C) 2004 Elsevier B.V. All rights reserved.
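To picture the difference between a hard mode and a fuzzy centroid, the sketch below represents each attribute of a centroid as a weight distribution over category values instead of a single value. The exact update and distance used in the paper are not given in the abstract, so the forms here are assumptions: weights are membership degrees raised to a fuzziness exponent and normalized, and the dissimilarity of an object is the total weight placed on values it does not have. The names `fuzzy_centroid`, `dissimilarity` and `fuzziness` are hypothetical.

```python
from collections import defaultdict


def fuzzy_centroid(data, memberships, fuzziness=1.5):
    """Per-attribute weight distribution over category values (assumed form).

    memberships[i] is the degree to which data[i] belongs to this cluster;
    at least one membership must be positive.
    """
    n_attrs = len(data[0])
    centroid = [defaultdict(float) for _ in range(n_attrs)]
    for x, u in zip(data, memberships):
        w = u ** fuzziness
        for a, value in enumerate(x):
            centroid[a][value] += w
    total = sum(u ** fuzziness for u in memberships)
    # Normalize so the weights of each attribute sum to 1.
    return [{v: w / total for v, w in attr.items()} for attr in centroid]


def dissimilarity(x, centroid):
    """Total centroid weight on values that differ from the object's values."""
    return sum(1.0 - attr.get(value, 0.0) for value, attr in zip(x, centroid))
```

A hard mode is the special case in which each attribute puts all of its weight on one value, so the dissimilarity reduces to the usual count of mismatches.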