In this paper, the conventional k-modes-type algorithms for clustering categorical data are extended by representing the clusters of categorical data with k-populations instead of the hard-type centroids used in the conventional algorithms. Use of a population-based centroid representation makes it possible to preserve the uncertainty inherent in data sets as long as possible before actual decisions are made. The k-populations algorithm was found to give markedly better clustering results through various experiments. (c) 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.
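As a rough illustration of the idea in the abstract above, a population-based centroid can be read as a per-attribute frequency distribution over category values rather than a single mode per attribute. The sketch below assumes that reading; the function names `population` and `dissimilarity`, the sample data, and the "one minus relative frequency" dissimilarity are illustrative choices, not taken from the paper.

```python
from collections import Counter

def population(cluster):
    """Per-attribute relative frequencies of category values in a cluster.

    Assumption: this stands in for a "population" centroid; the paper's
    exact definition may differ.
    """
    n = len(cluster)
    n_attrs = len(cluster[0])
    return [
        {value: count / n
         for value, count in Counter(row[j] for row in cluster).items()}
        for j in range(n_attrs)
    ]

def dissimilarity(obj, pop):
    """Sum over attributes of (1 - relative frequency of the object's value)."""
    return sum(1.0 - pop[j].get(value, 0.0) for j, value in enumerate(obj))

# Made-up categorical records with two attributes.
cluster = [("red", "small"), ("red", "large"), ("blue", "small"), ("red", "small")]
pop = population(cluster)

typical = dissimilarity(("red", "small"), pop)    # values common in the cluster
atypical = dissimilarity(("green", "large"), pop)  # one unseen value, one rare value
```

A hard mode would keep only ("red", "small") for this cluster; the frequency tables also retain that "blue" and "large" occur, which is the kind of uncertainty the abstract says is preserved.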
Background: Much progress has been made in understanding the 3D structure of proteins using methods such as NMR and X-ray crystallography. The resulting 3D structures are extremely informative, but do not always reveal which sites and residues within the structure are of special importance. Recently, there have been indications that multiple-residue, sub-domain structural relationships within the larger 3D consensus structure of a protein can be inferred from the analysis of the multiple sequence alignment data of a protein family. These intra-dependent clusters of associated sites are used to indicate hierarchical inter-residue relationships within the 3D structure. To reveal the patterns of associations among individual amino acids or sub-domain components within the structure, we apply a k-modes attribute (aligned site) clustering algorithm to the ubiquitin and transthyretin families in order to discover associations among groups of sites within the multiple sequence alignment. We then observe what these associations imply within the 3D structure of these two protein families. Results: The k-modes site clustering algorithm we developed maximizes the intra-group interdependencies based on a normalized mutual information measure. The clusters formed correspond to sub-structural components or binding and interface locations. Applying this data-directed method to the ubiquitin and transthyretin protein family multiple sequence alignments as a test bed, we located numerous interesting associations of interdependent sites. These clusters were then arranged into cluster tree diagrams, which revealed four structural sub-domains within the single-domain structure of ubiquitin and a single large sub-domain within transthyretin associated with the interface among transthyretin monomers. In addition, several clusters of mutually interdependent sites were discovered for each protein family, each of which appears to play an important role in the molecular structure and/or function. Co
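The interdependence measure mentioned in the abstract above can be sketched for two aligned sites (alignment columns). The abstract does not give the normalization, so this sketch assumes mutual information divided by joint entropy; the function names and the toy columns are hypothetical.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def normalized_mutual_information(col_x, col_y):
    """I(X;Y) / H(X,Y) for two alignment columns of equal length.

    Assumption: normalization by joint entropy; the paper may use another.
    """
    h_x = entropy(col_x)
    h_y = entropy(col_y)
    h_xy = entropy(list(zip(col_x, col_y)))
    if h_xy == 0.0:
        return 0.0  # both columns are constant, so there is nothing to share
    return (h_x + h_y - h_xy) / h_xy

# Toy columns: each string holds the residue at one site across four sequences.
site_a = list("AAGG")
site_b = list("LLVV")  # varies in lockstep with site_a
site_c = list("LVLV")  # varies independently of site_a

dependent = normalized_mutual_information(site_a, site_b)
independent = normalized_mutual_information(site_a, site_c)
```

Sites whose residues change together score near 1 and independent sites score near 0, so grouping sites to maximize this measure within each group favours co-varying positions.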
ISBN: (print) 9789811055652; 9789811055645
Clustering is one of the most important unsupervised learning tasks; its aim is to partition a data set into homogeneous groups called clusters. Many real-world data sets contain categorical values, but many clustering algorithms work only on numeric values, which limits their use in data mining. The k-modes algorithm is very effective at partitioning categorical data sets, but it stops at a locally optimal solution that depends on the initial cluster centres. The proposed algorithm uses a genetic algorithm (GA) to optimize k-modes clustering. The rationale is that treating noise as cluster centres yields a high cost, so such candidates are not carried into the next iteration and the search does not get stuck in suboptimal solutions. The proposed algorithm is evaluated on several real-life data sets in terms of accuracy; the results indicate that it is efficient and gives encouraging results, especially for large data sets.
ISBN: (print) 9781424421138
DNA sequences adjacent to splice sites are remarkably conserved, and mining their underlying biological knowledge has become a key issue in DNA sequence analysis. In this paper, we analyze the features of sequences adjacent to human DNA splice sites. First, we propose a method for clustering DNA splice site sequences based on Genetic k-modes; second, we analyze the frequencies of bases, di-bases, and tri-bases in the experimental data set and in each cluster; last, we propose a Markov-model-based frequent pattern discovery algorithm and use it to mine the frequent patterns of the experimental data set and of each cluster.
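The frequency analysis of bases, di-bases, and tri-bases described above amounts to counting overlapping substrings of length 1, 2, and 3. A minimal sketch follows; the helper name `kmer_frequencies` and the two sequences are made up for illustration and are not data from the paper.

```python
from collections import Counter

def kmer_frequencies(sequences, k):
    """Relative frequency of every overlapping length-k substring."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}

# Hypothetical sequences, not real splice site data.
seqs = ["AGGTAAGT", "CAGGTGAG"]

base_freq = kmer_frequencies(seqs, 1)     # single bases
dibase_freq = kmer_frequencies(seqs, 2)   # di-bases
tribase_freq = kmer_frequencies(seqs, 3)  # tri-bases
```

Running the same function on the members of each cluster separately gives per-cluster tables that can be compared with the tables for the whole data set.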
ISBN: (print) 9781728101200
To provide the data needed for public opinion monitoring and trend warning, this paper studies text processing and clustering algorithms for hot topics on Weibo. The data obtained from Weibo are categorical data with two attributes. To suit this feature and meet the requirements of public opinion trend warning, Hamming distance is used to compute text similarity. By improving the traditional k-means algorithm, a new k-mode algorithm for clustering text on hot topics is obtained. Simulation and result analysis indicate that the text processing method is accurate and suitable for early warning of microblog public opinion.
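The combination described above, Hamming distance in place of Euclidean distance and modes in place of means, is the generic k-modes scheme. The sketch below shows that generic scheme only; it is not the paper's improved algorithm, and the records, attribute values, and function names are invented for illustration (three attributes are used here to keep the toy example free of distance ties).

```python
from collections import Counter

def hamming(a, b):
    """Number of attribute positions at which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(rows):
    """Most frequent value of each attribute across the given rows."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(records, initial_modes, max_iter=20):
    """Alternate nearest-mode assignment and mode update until labels settle."""
    modes = list(initial_modes)
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda c: hamming(r, modes[c]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mode
                modes[c] = mode_of(members)
    return labels, modes

# Made-up categorical records (topic, sentiment, posting time).
records = [
    ("sports", "positive", "day"),
    ("sports", "positive", "night"),
    ("sports", "negative", "day"),
    ("finance", "negative", "night"),
    ("finance", "negative", "day"),
    ("finance", "positive", "night"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[3]])
```

The result depends on `initial_modes`, which is the sensitivity to initial centres that k-modes shares with k-means.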
The deep fusion of Internet technology and education is constantly pushing forward the reform of university education. Traditional educational ideas, concepts, and models cannot keep pace with the times, and hybrid teaching has become a new way of education in colleges and universities. To improve the teaching effect of physical education classes, the study used a blended teaching model and designed a teaching evaluation and performance prediction model under the blended teaching model based on an improved cluster analysis method and attention mechanism. The experimental results indicated that under the blended teaching model, students' performance increased by 12.89 points, and the level of skill mastery and proficiency increased by 26.52% and 28.55%, respectively, with grades more inclined to a high-score distribution. "Excellent" grade clustering increased by 77.71%, and "Good" grade clustering increased by 19.01%. The minimum error sum of squares of the improved clustering algorithm was 58.18% and 36.25% lower than that of the other two algorithms, and the clustering results were more relevant. The two-way attention mechanism algorithm produced more accurate predictions and performed best on all four evaluation metrics, with a prediction accuracy of 98.23%, an accuracy of 98.42%, and an F1 value of 91.78%. This hybrid teaching model is more in line with the characteristics of the physical education teaching discipline, successfully cultivates students' independent learning ability, stimulates students' love for physical education courses, and achieves better teaching results.
DNA sequences adjacent to splice sites are remarkably conserved, and mining their underlying biological knowledge has become a key issue in DNA sequence analysis. In this paper, we analyze the features of sequences adjacent to human DNA splice sites. First, we propose a method for clustering DNA splice site sequences based on Genetic k-modes; second, we analyze the frequencies of bases, di-bases, and tri-bases in the experimental data set and in each cluster; last, we propose a Markov-model-based frequent pattern discovery algorithm and use it to mine the frequent patterns of the experimental data set and of each cluster.