Like traditional K-means, the main drawback of spherical K-means is its high sensitivity to the initialization of centroids. This issue can cause the algorithm to converge to poor local optima, resulting in clusters that do not accurately reflect the true structure of the data. In this paper, we propose two new text clustering algorithms that are less sensitive to initialization and that significantly improve clustering performance. The first algorithm employs simulated annealing to avoid getting trapped in poor local optima. The second algorithm, a relaxed version of simulated annealing, also uses randomization to escape poor local optima but requires significantly fewer computations than the first algorithm. The two algorithms are extensively evaluated across more than thirty text datasets. Experimental results demonstrate that the proposed approaches significantly outperform well-established text clustering algorithms in terms of clustering quality. Furthermore, the second algorithm is as efficient as standard spherical K-means regarding clustering speed, as both have the same time complexity. Finally, an important advantage of the proposed algorithms is that they can be applied to other domains involving directional data, such as recommender systems, social network analysis, image analysis, and more.
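To make the initialization sensitivity concrete, here is a minimal sketch of standard spherical K-means in pure Python. It is the baseline the abstract refers to, not either of the proposed algorithms; the function names, the `seed` parameter, and the random choice of data points as initial centroids are assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, k, seed=0, max_iter=100):
    """Cluster unit-normalized vectors by cosine similarity.

    Returns (labels, objective), where objective is the sum of cosine
    similarities between each point and its assigned centroid.
    """
    rng = random.Random(seed)
    data = [normalize(p) for p in points]
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = [list(c) for c in rng.sample(data, k)]
    labels = [-1] * len(data)
    for _ in range(max_iter):
        # Assignment step: pick the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(p, centroids[j])) for p in data]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the normalized sum of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    objective = sum(dot(p, centroids[lab]) for p, lab in zip(data, labels))
    return labels, objective
```

Because the result depends on `seed`, a common workaround is to run the function with several seeds and keep the run with the highest objective. The abstract's algorithms aim to reduce the need for such restarts.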
Text clustering is a cornerstone task in natural language processing with a broad spectrum of applications. Given the advances in large language models (LLMs), employing such models to enhance general text clustering has shown promising potential for boosting clustering effectiveness. However, current LLM-driven approaches often act as black boxes when analyzing the text clustering process, leading to poor interpretability. Additionally, these approaches incur significant API usage costs and lack effective techniques for exploring cluster details. To address these challenges, we propose an LLM-powered visual analytics approach, called textLens, to enhance text clustering. First, we present a framework that integrates LLMs to guide topic extraction, anomaly filtering, and modification assessment. Second, we introduce a visual analytics system designed to support the proposed framework, which facilitates interactive exploration of clusters, analysis of cluster-level thematic extraction, and iterative refinement of clustering results. Finally, we evaluate the approach by applying two datasets in four case studies and by conducting a user study that compares clustering outcomes with previous methods, demonstrating the effectiveness and scalability of our approach.
The exponential growth of unstructured text data generated by internet users has created an urgent need for efficient organization methods to uncover valuable insights. Text clustering, a widely used data mining approach, often relies on single-objective optimization, which can struggle to deliver optimal results for datasets with diverse clustering criteria. To address these challenges, we propose the Multi-objective Firefly Differential Jaya (MFDJ) algorithm, a novel nature-inspired optimization method designed to enhance text clustering. MFDJ integrates the strengths of NSGA-II, a well-established multi-objective optimization framework, with three complementary algorithms: the Firefly algorithm for swarm intelligence-based optimization, Differential Evolution for robust exploration through mutation, and the Jaya algorithm for parameter-free improvement leveraging both the best and worst solutions. This synergy significantly enhances the algorithm's ability to balance exploration and exploitation, yielding superior clustering performance. We evaluated MFDJ on eight benchmark text datasets, where it consistently outperformed state-of-the-art methods, including NSGA-II and MOMDE. On average, MFDJ achieved a 67.89% improvement in F-measure over NSGA-II and a 5.87% improvement over MOMDE, while also exhibiting better convergence properties on the majority of datasets. These results underscore the capability of MFDJ to generate high-quality clusters, making it a versatile tool for tackling complex text clustering and broader optimization challenges.
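As an illustration of the "best and worst solutions" idea mentioned above, the following is a minimal sketch of the standard single-objective Jaya update rule. It does not reproduce MFDJ or its combination with NSGA-II, the Firefly algorithm, or Differential Evolution; the function name `jaya_minimize`, its parameters, and the bound clipping are assumptions for this example.

```python
import random


def jaya_minimize(f, bounds, pop_size=10, iterations=50, seed=0):
    """Minimize f over box bounds with the basic Jaya update rule.

    Each candidate moves toward the current best solution and away from the
    current worst one; no algorithm-specific tuning parameters are needed.
    Returns (best_solution, best_fitness, history_of_best_fitness).
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fitness = [f(x) for x in pop]
    history = [min(fitness)]
    for _ in range(iterations):
        best = pop[fitness.index(min(fitness))]
        worst = pop[fitness.index(max(fitness))]
        for i, x in enumerate(pop):
            cand = []
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Jaya rule: x' = x + r1*(best - |x|) - r2*(worst - |x|)
                v = x[d] + r1 * (best[d] - abs(x[d])) - r2 * (worst[d] - abs(x[d]))
                cand.append(min(max(v, lo), hi))  # clip to the search bounds
            fc = f(cand)
            # Greedy acceptance: keep the candidate only if it improves.
            if fc < fitness[i]:
                pop[i], fitness[i] = cand, fc
        history.append(min(fitness))
    best_i = fitness.index(min(fitness))
    return pop[best_i], fitness[best_i], history
```

In a clustering setting, `f` would typically score a candidate set of centroids; here any function of a list of floats works, for example `lambda x: sum(v * v for v in x)`.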
Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. In this paper, we propose a new supervised feature selection method, named CHIR, which is based on the χ² statistic and new statistical data that can measure the positive term-category dependency. We also propose a new text clustering algorithm, named Text Clustering with Feature Selection (TCFS). TCFS can incorporate CHIR to identify relevant features (i.e., terms) iteratively, so that the clustering becomes a learning process. We compared TCFS and the K-means clustering algorithm in combination with different feature selection methods on various real data sets. Our experimental results show that TCFS with CHIR has better clustering accuracy in terms of the F-measure and purity.
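For readers unfamiliar with the χ² statistic that CHIR builds on, here is a minimal sketch of the standard term-category χ² computed from a 2x2 contingency table. It is not the CHIR measure itself, whose definition the abstract does not give; the function names and the sign check used to flag positive dependency are assumptions for illustration.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term t and a category k.

    a: documents in k that contain t
    b: documents outside k that contain t
    c: documents in k that do not contain t
    d: documents outside k that do not contain t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or category never varies
    return n * (a * d - c * b) ** 2 / denom


def is_positive_dependency(a, b, c, d):
    """Assumed illustration: the term occurs in the category more often
    than independence would predict when a*d > c*b."""
    return a * d > c * b
```

A term that appears only inside the category, such as `chi_square(10, 0, 0, 10)`, scores high, while a term spread evenly across categories, such as `chi_square(5, 5, 5, 5)`, scores zero. The plain χ² value cannot tell presence from absence as the signal, which is why a separate check on the direction of the dependency is useful.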
Document representation is the basis of clustering systems. In addition, non-contiguous phrases appear more and more frequently in text in the Web 2.0 age, and these phrases can affect the result of text clustering. To improve the quality of text clustering, this paper proposes a feature cluster-based vector space model (FC-VSM), which uses the co-occurrence matrix of text feature clusters to represent documents and identifies non-contiguous phrases in the text preprocessing stage. Our method reduces the dimension of the feature space compared with the traditional VSM-based model. It identifies non-contiguous phrases, uses a distributed representation of features, and builds feature clusters. Despite their simplicity, our methods are surprisingly effective and significantly improve clustering accuracy, as shown in the experimental results.
Based on an effective clustering algorithm, Affinity Propagation (AP), we present in this paper a novel semisupervised text clustering algorithm, called Seeds Affinity Propagation (SAP). There are two main contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method to improve the semisupervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely, the k-means algorithm and the original AP algorithm. Furthermore, we have analyzed the individual impact of the two proposed contributions. Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and that the proposed semisupervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time (20 times faster) with respect to k-means, and provides enhanced robustness compared with all other methods.
ISBN: 9781509015856 (print)
Text theme is the key to text clustering, and co-occurring words can strongly express the theme of a document. Building on an in-depth study of text theme mining and word co-occurrence, this paper proposes a text clustering algorithm based on text semantic representation and the graph structure of word co-occurrence. First, the algorithm constructs a graph structure for each text according to the co-occurrence of feature words; in other words, it uses graph structures to represent all texts. Then, it uses the maximum common subgraph between two texts to calculate their similarity and combines this with the K-means clustering algorithm to cluster the documents. An experimental comparison with a hierarchical clustering algorithm shows that the K-means clustering algorithm based on word co-occurrence graph structures greatly reduces the high dimensionality of text vectors and the algorithm complexity, significantly improves the efficiency and accuracy of text clustering, and produces good-quality clusters.
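The graph-based similarity described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes each word is a uniquely labeled node, in which case the maximum common subgraph reduces to the shared nodes and shared edges. The sliding `window` used to define co-occurrence and the normalization by the larger graph are also assumptions.

```python
def cooccurrence_graph(tokens, window=2):
    """Build an undirected word co-occurrence graph for one text.

    Two distinct words are connected if they appear within `window`
    positions of each other. Returns (nodes, edges).
    """
    edges = set()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                edges.add(frozenset((tokens[i], tokens[j])))
    return set(tokens), edges


def graph_similarity(g1, g2):
    """Similarity from the common subgraph of two labeled graphs.

    With unique node labels, the maximum common subgraph is the set of
    shared nodes plus shared edges; its size is divided by the size of
    the larger graph.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    common = len(nodes1 & nodes2) + len(edges1 & edges2)
    size = max(len(nodes1) + len(edges1), len(nodes2) + len(edges2))
    return common / size if size else 0.0
```

For example, with `window=1` the token lists `["a", "b", "c"]` and `["a", "b", "d"]` share two nodes and one edge out of five graph elements each, giving a similarity of 0.6. Such pairwise similarities could then feed a K-means-style clustering step.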
ISBN: 9783038351153 (print)
One key step in text mining is the categorization of texts, i.e., putting texts with the same or similar content into one group so as to distinguish them from texts with different content. However, traditional word-frequency-based statistical approaches, such as the vector space model (VSM), fail to reflect the complicated meaning in texts. This paper introduces domain ontology and constructs a new conceptual vector space model in the pre-processing stage of text clustering, substituting a concept-text matrix for the initial lexicon-text matrix in latent semantic analysis. In the clustering analysis stage, this model adopts semantic similarity, partially overcoming the difficulty of accurately and effectively evaluating text similarity when only the frequency of words and/or phrases in the text is taken into account. Experimental results indicate that this method helps improve the result of text clustering.
Text clustering is a critical step in text data analysis and has been extensively studied by the text mining community. Most existing text clustering algorithms are based on the bag-of-words model, which suffers from high dimensionality and sparsity and ignores the structural and sequence information of text. Deep learning-based models such as convolutional neural networks and recurrent neural networks regard texts as sequences but lack supervised signals and explainable results. In this paper, we propose a deep feature-based text clustering (DFTC) framework that incorporates pretrained text encoders into text clustering tasks. This model, which is based on sequence representations, breaks the dependency on supervision. The experimental results show that our model outperforms classic text clustering algorithms and the state-of-the-art pretrained language model, i.e., BERT, on almost all the considered datasets. In addition, explaining the clustering results is important for understanding the principles of the deep learning approach. Our proposed clustering framework includes an explanation module that can help users understand the meaning and quality of the clustering results.
This article suggests a method of text clustering that does not depend on any user-set parameters. Text documents and the connections between them are represented as graph nodes and edges, so that a graph community detection method can be applied to the text clustering problem. The method was tested on collections of news articles and proved effective: manual and automatic clusterings of the text documents in the collections were the same or very close. (C) 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://***/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the 8th Annual International Conference on Biologically Inspired Cognitive Architectures.