Hierarchical Dirichlet Multinomial Allocation Model for Multi-Source Document Clustering

Authors: Huang, Ruizhang; Xu, Weijia; Qin, Yongbin; Chen, Yanping

Affiliations: Guizhou Univ, Coll Comp Sci & Technol, Guiyang 550025, Peoples R China; Guizhou Intelligent Human Comp Interact Engn Tech, Guiyang 550025, Peoples R China; Guizhou Univ, Guizhou Prov Key Lab Publ Big Data, Guiyang 550025, Peoples R China

Publication: IEEE Access

Year, volume: 2020, vol. 8

Pages: 109917-109927

Funding: National Natural Science Foundation of China [U1836205]; Major Research Program of National Natural Science Foundation of China; Major Special Science and Technology Projects of Guizhou Province [3002]; Key Projects of Science and Technology of Guizhou Province [ 1Z055]

Subjects: Clustering algorithms; Data models; Resource management; Partitioning algorithms; Clustering methods; Social networking (online); Licenses; Document clustering; Multi-source document clustering; Dirichlet distribution; Gibbs sampling

Abstract: Mining document structure from multiple data sources in terms of their underlying topics has become an important task in document clustering. Traditional document clustering approaches cannot be applied directly to the multi-source document clustering problem, which poses three typical difficulties: 1) the topics of different data sources are related but not identical; 2) each data source usually has its own focus on topics; 3) the numbers of clusters in the data sources are not necessarily the same and are not known beforehand. In this paper, building on our previous research, we design a novel multi-source document clustering model, the hierarchical Dirichlet multinomial allocation (HDMA) model, to address all of these problems. The HDMA model is built on a two-step hierarchical topic generation process. Topics are learned so that they share general characteristics across data sources while preserving the local characteristics of each data source. Each data source is assigned an exclusive topic partition to learn its source-level topic emphasis. A Gibbs sampling algorithm is then used to learn the number of clusters for each data source and the parameters of the HDMA model at the same time. Experimental results demonstrate that the HDMA model is effective.
