Text clustering based on the Vector Space Model suffers from high-dimensional, sparse representations and cannot handle synonymy and polysemy. Meanwhile, the k-means clustering algorithm depends on the initial cluster centers and requires the number of clusters to be fixed in advance. To address these problems, this paper proposes a text clustering algorithm based on Latent Semantic Analysis and optimization. The algorithm overcomes the problems of the Vector Space Model and also avoids the shortcomings of the k-means algorithm. Compared with a text clustering algorithm based on Latent Semantic Analysis alone and with one based on the Vector Space Model and optimization, the proposed algorithm is shown to improve the quality of text clustering, raising both precision and recall.
ISBN: (print) 9781450300643
Various optimization methods are used together with standard clustering algorithms to make the clustering process simpler and quicker. In this paper we propose a new hybrid clustering technique, k-Evolutionary Particle Swarm Optimization (kEPSO), based on the concept of Particle Swarm Optimization (PSO). The proposed algorithm uses the k-means algorithm as its first step and the Evolutionary Particle Swarm Optimization (EPSO) algorithm as its second step to perform clustering. The experiments were performed on clustering benchmark data, and the method was compared with the standard k-means and EPSO algorithms. The results show that this method produced compact results and ran faster than the other clustering algorithms. The algorithm was then used to cluster web pages: the pages were first cleaned of unnecessary data and then labeled so that they could be categorized.
ISBN: (print) 9783642134975
The Internet is becoming a platform for spreading public opinion. It is important to grasp Internet public opinion (IPO) in time and to understand its trends correctly. Text mining plays a fundamental role in a number of information management and retrieval tasks. This paper studies Internet public opinion hotspot detection using text mining approaches. First, we create an algorithm to obtain a vector space model for all text documents. Second, this algorithm is combined with the k-means clustering algorithm to develop an unsupervised text mining approach. We use the proposed approach to group Internet public opinion into clusters, with the center of each cluster representing a hotspot of public opinion within the current time span. The experimental results demonstrate the efficiency and effectiveness of the algorithm.
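The first step described above, building a vector space model for the documents, can be sketched as follows. This is a minimal illustration that assumes TF-IDF weighting and whitespace tokenization; the abstract does not say which weighting the authors used, and the function name `tfidf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of non-empty tokenized documents."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Weight = relative term frequency * log inverse document frequency
        # (an assumed weighting scheme, not taken from the paper).
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t])
            for t in vocab
        ])
    return vocab, vectors

# Hypothetical toy corpus standing in for collected opinion posts.
docs = [
    "price rise housing market".split(),
    "housing price policy".split(),
    "football match final score".split(),
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one row with one weight per vocabulary term, which is the form a k-means implementation would then cluster.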
ISBN: (print) 9781424435746
For coordinated multi-point transmission (CoMP) systems, the optimal remote radio unit (RRU) location is analyzed theoretically, and an RRU location design scheme for energy efficiency in practical scenarios is given. An average minimum access distance criterion is given for RRU location optimization. By minimizing the average distance between users and RRUs, the optimal RRU distribution can be obtained when users are located uniformly in the cell. Because user distribution will not be completely uniform in a practical environment, the k-means clustering algorithm is used to obtain the optimized RRU deployment for a practical user distribution. Simulation results show that the uplink transmission power can be greatly reduced with the optimized RRU location design under both uniform and non-uniform user distributions.
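The use of k-means for a non-uniform user distribution can be sketched as below. This is an illustration under stated assumptions, not the paper's scheme: the names `nearest`, `kmeans`, `users` and `rru_sites` and all coordinates are hypothetical, and plain k-means minimizes the mean squared distance, which is only a stand-in for the average access distance criterion mentioned in the abstract.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: (point[0] - centers[j][0]) ** 2 + (point[1] - centers[j][1]) ** 2,
    )

def kmeans(points, centers, iters=100):
    """Plain Lloyd's k-means on 2-D points, starting from the given centers."""
    centers = list(centers)
    labels = []
    for _ in range(iters):
        # Assignment step: each user attaches to its nearest site.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each site to the mean position of its users.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                new_centers.append(old)  # keep the site of an empty cluster
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels

# Hypothetical user positions forming two hot spots in the cell.
users = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
rru_sites, assignment = kmeans(users, [(3.0, 3.0), (8.0, 8.0)])
```

The initial sites are passed in explicitly because the result of k-means depends on where it starts, so a different starting guess may end in a different deployment.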
ISBN: (print) 9781450300643
In this paper, ant colony optimization (ACO) is used within the k-means algorithm to improve image segmentation. The learning mechanism of the algorithm is formulated using the ACO meta-heuristic. Because the pheromone dominates the ants' exploration of problem solutions, preliminary experiments on pheromone updating are reported. Two methods for defining and updating pheromone values are proposed and tested: one that uses spatial coordinate distances and one that does not. ACO improves the k-means algorithm by making it less dependent on the initial parameters.
Most business decisions are based on cost and benefit considerations. Data mining techniques that allow businesses to incorporate financial considerations will be more meaningful to decision makers. The decision-theoretic framework has been helpful in providing a better understanding of classification models. This study describes a semi-supervised decision-theoretic rough set model. The model is based on an extension of the decision-theoretic model proposed by Yao. The proposal is used to model financial cost/benefit scenarios for a promotional campaign in a real-world retail store.
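The cost-based reasoning behind a decision-theoretic rough set model can be sketched as a minimum expected loss rule over three actions. This is only the basic three-way decision idea under stated assumptions; the semi-supervised extension described in the study is not reproduced, and the function name `three_way_decision`, the action names and all cost figures are hypothetical.

```python
def three_way_decision(p, loss):
    """Choose the action with the smallest expected loss.

    p is the estimated probability that a customer belongs to the target
    class; loss[action] = (loss if the customer does belong,
                           loss if the customer does not belong).
    """
    expected = {
        action: in_class * p + out_class * (1 - p)
        for action, (in_class, out_class) in loss.items()
    }
    return min(expected, key=expected.get)

# Hypothetical costs for a promotion: 'accept' = send the offer,
# 'defer' = gather more information, 'reject' = do not send.
loss = {"accept": (0.0, 4.0), "defer": (1.0, 1.0), "reject": (6.0, 0.0)}

for p in (0.9, 0.5, 0.1):
    print(p, three_way_decision(p, loss))
```

With these illustrative costs, customers with a high estimated probability are accepted, those with a low probability are rejected, and the middle range is deferred, which corresponds to the positive, negative and boundary regions of a rough set.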
Authors: Cao, Fuyuan; Liang, Jiye; Jiang, Guang
Shanxi Univ, Sch Comp & Informat Technol, Taiyuan 030006, Shanxi, Peoples R China
Minist Educ, Key Lab Computat Intelligence & Chinese Informat, Taiyuan 030006, Peoples R China
Chinese Acad Sci, Key Lab Intelligent Informat Proc, Inst Comp Technol, Beijing 100190, Peoples R China
As a simple clustering method, the traditional k-means algorithm has been widely discussed and applied in pattern recognition and machine learning. However, the k-means algorithm cannot guarantee a unique clustering result because the initial cluster centers are chosen randomly. In this paper, the cohesion degree of the neighborhood of an object and the coupling degree between neighborhoods of objects are defined based on the neighborhood-based rough set model. Furthermore, a new initialization method is proposed, and its time complexity is analyzed. We study the influence of the three norms on clustering and compare the clustering results of k-means under the three different initialization methods. The experimental results illustrate the effectiveness of the proposed method. (C) 2009 Elsevier Ltd. All rights reserved.
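To illustrate why a deterministic initialization removes the run-to-run variation of randomly seeded k-means, here is a sketch of a generic farthest-first (maximin) heuristic. It is a swapped-in technique for illustration only and is not the neighborhood-based rough set method of the paper, whose cohesion and coupling definitions are not given in the abstract; the function name and the sample points are hypothetical.

```python
def farthest_first_centers(points, k):
    """Pick k initial centers: the first point, then repeatedly the point
    whose distance to its closest already-chosen center is largest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    centers = [points[0]]
    while len(centers) < k:
        candidate = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(candidate)
    return centers

# Hypothetical 2-D objects forming three loose groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (5, 20)]
initial_centers = farthest_first_centers(points, 3)
```

Because the choice depends only on the data and its order, the same centers are produced on every run, so a k-means run started from them is repeatable.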
Surface water contamination from agricultural and urban runoff and from wastewater discharges by industrial and municipal activities is of major concern to people worldwide. Classical models can be insufficient to visualise the results because the water quality variables used to describe dynamic pollution sources are complex, multivariable, and nonlinearly related. Artificial intelligence techniques that can analyse multivariate water quality data with a sophisticated visualisation capacity offer an alternative to current models. In this study, the Kohonen self-organising feature map (SOM) neural network was first applied to analyse the complex nonlinear relationships among multivariable surface water quality variables, using the component planes of the variables to determine the complex behaviour of water quality parameters. The dependencies between water quality variables were extracted and interpreted using the pattern analysis visualised in the component planes. For further investigation, the k-means clustering algorithm was used to determine the optimal number of clusters by partitioning the maps and applying the Davies-Bouldin clustering index, leading to seven groups or clusters corresponding to water quality variables. The results reveal that the concentrations of Na, K, Cl, NH4-N, NO2-N, o-PO4, component planes of organic matter (pV), and dissolved oxygen (DO) were significantly affected by seasonal changes, and that the SOM technique is an efficient tool for analysing and determining the complex behaviour of multidimensional surface water quality data. These results suggest that this technique could also be applied to other environmentally sensitive areas such as air and groundwater pollution.
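The Davies-Bouldin index used above to choose the number of clusters can be sketched as follows. This follows the standard definition with Euclidean distances and assumes at least two clusters with distinct centroids; the function name `davies_bouldin` and the sample points are hypothetical and do not come from the study's data.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labeled partition (lower is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    ids = sorted(clusters)
    # Centroid of each cluster, coordinate by coordinate.
    centroids = {
        i: tuple(sum(c) / len(clusters[i]) for c in zip(*clusters[i])) for i in ids
    }
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        i: sum(math.dist(p, centroids[i]) for p in clusters[i]) / len(clusters[i])
        for i in ids
    }
    total = 0.0
    for i in ids:
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in ids if j != i
        )
    return total / len(ids)

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = davies_bouldin(points, [0, 1, 0, 1])   # stretched, overlapping clusters
```

To pick the number of clusters, one would run k-means for several candidate values of k and keep the partition with the lowest index.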
Document clustering, or unsupervised document classification, is an automated process of grouping documents with similar content. A typical technique uses a similarity function to compare documents. In the literature, many similarity functions, such as the dot product or cosine measures, are proposed for the comparison operator. In this thesis, we evaluate the effects a similarity function may have on clustering. We start by representing a document and a query as vectors in a high-dimensional space whose dimensions correspond to keywords. We then use an appropriate distance measure in k-means to compute the similarity between the document vector and the query vector and to form clusters. Based on these clusters, we decide on the best distance metric for the document set used. Next, we compute time complexities for different similarity functions on the same model and document set, based on the number of iterations and the number of clusters.
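The effect of the similarity function can be seen on a small example comparing the dot product with the cosine measure. This is a sketch with hypothetical three-keyword vectors; it does not reproduce the thesis's experiments, and cosine assumes non-zero vectors.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]  # same keyword profile as the query
long_doc = [5.0, 0.0, 5.0]   # longer document, only partly on topic

print(dot(query, short_doc), dot(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

The dot product favors the longer document because of its larger term counts, while the cosine measure normalizes length and favors the document with the same keyword profile, so the choice of function can change which cluster a document joins.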
Information about local protein sequence motifs is very important to the analysis of biologically significant conserved regions of protein sequences. These conserved regions can potentially determine the diverse conformations and activities of proteins. In this work, recurring sequence motifs of proteins are explored with an improved k-means clustering algorithm on a new dataset. The structural similarity of the recurring sequence clusters that produce sequence motifs is studied in order to evaluate the relationship between sequence motifs and their structures. To the best of our knowledge, the dataset used in our research is the most up-to-date among similar studies of sequence motifs. A new greedy initialization method for the k-means algorithm is proposed to improve traditional k-means clustering techniques. The new initialization method tries to choose suitable initial points that are well separated and have the potential to form high-quality clusters. Our experiments indicate that the improved k-means algorithm satisfactorily increases the percentage of sequence segments belonging to clusters with high structural similarity. Careful comparison of the sequence motifs obtained by the improved and traditional algorithms also suggests that the improved k-means clustering algorithm may discover some relatively weak and subtle sequence motifs that are undetectable by traditional k-means algorithms. Many biochemical tests reported in the literature show that these sequence motifs are biologically meaningful. Experimental results also indicate that the improved k-means algorithm generates more detailed sequence motifs representing common structures than previous research. Furthermore, these motifs are universally conserved sequence patterns across protein families, overcoming some weak points of other popular sequence motifs. The satisfactory result of the experiment suggests that this new k-means algorithm may be applied to other areas of bioinformatics resea