Search Results - Inner Mongolia University Library

Mining fuzzy frequent itemsets for hierarchical document clustering

INFORMATION PROCESSING & MANAGEMENT, 2010, Vol. 46, No. 2, pp. 193-211

Authors: Chen, Chun-Ling; Tseng, Frank S. C.; Liang, Tyne. Natl Chiao Tung Univ, Dept Comp Sci, Hsinchu 300, Taiwan; Natl Kaohsiung First Univ Sci & Technol, Dept Informat Management, Yenchao 824, Kaoshiung, Taiwan

As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents in a variety of applications. However, most document clustering methods still struggle with high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F²IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, the key terms are extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms to be regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate the performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only fully retains the merits of FIHC, but also improves on its clustering accuracy. Crown Copyright (C) 2009 Published by Elsevier Ltd. All rights reserved.

Keywords: Fuzzy association rule mining; Text mining; Hierarchical document clustering; Frequent itemsets
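
The notion of a fuzzy frequent itemset mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' algorithm: the three fuzzy regions ("low", "mid", "high"), their breakpoints, and the helper names `memberships`, `fuzzy_support`, and `frequent_1_itemsets` are all assumptions made for illustration. The sketch uses the common convention that the support of an itemset is the average, over documents, of the minimum membership degree among its (term, region) items.

```python
def memberships(tf):
    """Hypothetical fuzzy regions for a term frequency (breakpoints are assumed)."""
    if tf <= 0:
        # A term that does not occur in the document belongs to no region.
        return {"low": 0.0, "mid": 0.0, "high": 0.0}
    return {
        "low": max(0.0, min(1.0, (3 - tf) / 2)),   # 1 at tf <= 1, 0 at tf >= 3
        "mid": max(0.0, 1.0 - abs(tf - 3) / 2),    # peaks at tf == 3
        "high": max(0.0, min(1.0, (tf - 3) / 2)),  # 0 at tf <= 3, 1 at tf >= 5
    }


def fuzzy_support(docs, itemset):
    """Average over documents of the minimum membership among the items."""
    total = 0.0
    for doc in docs:
        total += min(memberships(doc.get(term, 0))[region]
                     for term, region in itemset)
    return total / len(docs)


def frequent_1_itemsets(docs, min_support):
    """Return every single (term, region) item whose fuzzy support is large enough."""
    terms = sorted({term for doc in docs for term in doc})
    result = {}
    for term in terms:
        for region in ("low", "mid", "high"):
            support = fuzzy_support(docs, [(term, region)])
            if support >= min_support:
                result[(term, region)] = support
    return result


# Toy documents as term-frequency dictionaries (made-up data).
docs = [
    {"fuzzy": 4, "mining": 5},
    {"fuzzy": 5, "cluster": 1},
    {"mining": 3, "cluster": 5},
]
print(frequent_1_itemsets(docs, min_support=0.4))
print(fuzzy_support(docs, [("fuzzy", "high"), ("mining", "high")]))
```

In a scheme of this kind, the surviving itemsets would serve as candidate cluster labels; how the paper assigns documents to those candidates is not described in the abstract and is not modeled here.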

Fast Rank-2 Nonnegative Matrix Factorization for hierarchical document clustering

19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)

Authors: Kuang, Da; Park, Haesun. Georgia Inst Technol, Sch Computat Sci & Engn, Atlanta, GA 30332 USA

ISBN: (print) 9781450321747

Nonnegative matrix factorization (NMF) has been successfully used as a clustering method, especially for flat partitioning of documents. In this paper, we propose an efficient hierarchical document clustering method based on a new algorithm for rank-2 NMF. When the two-block coordinate descent framework of nonnegative least squares (NNLS) is applied to computing rank-2 NMF, each subproblem requires a solution for nonnegative least squares with only two columns in the matrix. We design the algorithm for rank-2 NMF by exploiting the fact that an exhaustive search for the optimal active set can be performed extremely fast when solving these NNLS problems. In addition, we design a measure based on the results of rank-2 NMF for determining which leaf node should be further split. On a number of text data sets, our proposed method produces high-quality tree structures in significantly less time compared to other methods such as hierarchical K-means, standard NMF, and latent Dirichlet allocation.

Keywords: Active-set algorithm; Hierarchical document clustering; Non-negative matrix factorization; Rank-2 NMF
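
The two-column NNLS subproblem described in the abstract above is small enough that every possible active set can simply be tried. The following is a minimal sketch of that idea, not the authors' optimized implementation: the function name `nnls_two_columns` and the plain-list representation are assumptions, and both columns are assumed to be nonzero. It minimizes ||g1*b1 + g2*b2 - y||^2 subject to g1, g2 >= 0 by comparing the feasible least-squares solutions for each choice of which variables are fixed at zero.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_sq(b1, b2, y, g):
    """Squared residual of g[0]*b1 + g[1]*b2 against y."""
    return sum((g[0] * p + g[1] * q - t) ** 2 for p, q, t in zip(b1, b2, y))


def nnls_two_columns(b1, b2, y):
    """Solve min ||g1*b1 + g2*b2 - y||^2 with g1, g2 >= 0 by exhaustive search.

    Assumes b1 and b2 are nonzero vectors of the same length as y.
    """
    a11, a12, a22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)
    c1, c2 = dot(b1, y), dot(b2, y)

    # Candidates with at least one variable fixed at zero.
    candidates = [
        (0.0, 0.0),
        (max(0.0, c1 / a11), 0.0),
        (0.0, max(0.0, c2 / a22)),
    ]

    # Candidate with both variables free: solve the 2x2 normal equations,
    # and keep the solution only if it is feasible (nonnegative).
    det = a11 * a22 - a12 * a12
    if det > 1e-12:
        g1 = (a22 * c1 - a12 * c2) / det
        g2 = (a11 * c2 - a12 * c1) / det
        if g1 >= 0 and g2 >= 0:
            candidates.append((g1, g2))

    # The optimum is the feasible candidate with the smallest residual.
    return min(candidates, key=lambda g: residual_sq(b1, b2, y, g))


# Made-up example where the unconstrained solution is infeasible.
print(nnls_two_columns([1, 0, 1], [0, 1, 1], [1, -1, 0]))
```

Because the optimal solution must solve the unconstrained least-squares problem on its own support, it always appears among these candidates, which is why the exhaustive comparison is exact for two columns.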

A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

JOURNAL OF CLOUD COMPUTING-ADVANCES SYSTEMS AND APPLICATIONS, 2020, Vol. 9, No. 1, pp. 2-2

Authors: Kotouza, Maria Th; Psomopoulos, Fotis E.; Mitkas, Pericles A. Aristotle Univ Thessaloniki, Dept Elect & Comp Engn, Thessaloniki 54124, Greece; Ctr Res & Technol Hellas, Inst Appl Biosci, Thessaloniki 57001, Greece; Karolinska Inst, Dept Mol Med & Surg, Stockholm, Sweden

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, making browsing and navigation easier and quicker, and returning only relevant information for users' queries by leveraging the structural relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights into the leaf clusters' connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.

Keywords: Hierarchical document clustering; Topic modeling; Docker; Performance testing

Modeling Network with Topic Model and Triangle Motif

12th IEEE Int Conf Ubiquitous Intelligence & Comp/12th IEEE Int Conf Autonom & Trusted Comp/15th IEEE Int Conf Scalable Comp & Commun & Associated Workshops/IEEE Int Conf Cloud & Big Data Comp/IEEE Int Conf Internet People

Authors: Bian, Xuewen; Zhang, Kun. Nanjing Sci & Technol Univ, Sch Sci & Engn Comp, Nanjing, Jiangsu, Peoples R China

ISBN: (print) 9781467372114

This paper describes a hierarchical model based on the triangle motif and a topic model, considering both network data and node attributes. The node attribute we study here is text, so we choose document networks as our research subject. We represent the document network with triangle motifs, which scale well to large amounts of data. With this representation, the complexity of our approach grows linearly in the number of documents and is more closely related to the maximum degree of the network. We extend hLDA by incorporating network data and remodeling hLDA accordingly. Using a non-parametric Bayesian model, our approach does not need pre-specification of the branching factor at each non-terminal node. The model is suitable for large-scale networks of academic abstracts, web documents and related news.

Keywords: nCRP; Hierarchical document clustering; Document topic models; Triangle motif
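
The remark in the abstract above that the branching factor need not be fixed in advance follows from the nested Chinese restaurant process (nCRP) prior used by hLDA. The sketch below shows only that prior, not the paper's network-aware model: the node layout, the function names `new_node` and `sample_path`, and the concentration value `gamma` are assumptions for illustration. At each level, a document follows an existing child with probability proportional to the number of earlier documents that chose it, or opens a new child with probability proportional to `gamma`.

```python
import random


def new_node():
    """A tree node: how many documents passed through it, and its children."""
    return {"n": 0, "children": []}


def sample_path(root, depth, gamma, rng):
    """Sample one root-to-leaf path of the given depth from an nCRP prior."""
    node, path = root, []
    for _ in range(depth):
        # Existing children are weighted by popularity; the last slot,
        # weighted by gamma, stands for creating a brand-new child.
        weights = [child["n"] for child in node["children"]] + [gamma]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(node["children"]):
            node["children"].append(new_node())
        node = node["children"][k]
        node["n"] += 1
        path.append(k)
    return path


rng = random.Random(0)
root = new_node()
paths = [sample_path(root, depth=3, gamma=1.0, rng=rng) for _ in range(50)]
print(len(root["children"]), "top-level branches after 50 documents")
```

The number of children at each node is determined by the sampled paths rather than by a parameter, which is the property the abstract refers to.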
