Improving Online Clustering of Chinese Technology Web News With Bag-of-Near-Synonyms

Authors: Zhang, Zhe; Chen, Le; Yin, Fengjing; Zhang, Xin; Guo, Lixiang

Affiliations: Natl Univ Def Technol, Coll Syst Engn, Sci & Technol Informat Syst Engn Lab, Changsha 410073, Peoples R China; Acad Mil Sci, Inst Syst Engn, Beijing 100091, Peoples R China

Publication: IEEE Access

Year/Volume: 2020, Vol. 8

Pages: 94245-94257

Subjects: Computational modeling; Semantics; Vocabulary; Clustering algorithms; Facebook; Licenses; Artificial intelligence; Information retrieval; single-pass algorithm; document representation; word embeddings; agglomerative clustering

Abstract: In the Internet era, online clustering of technology web news can help discover scientific breakthroughs and grasp technology trends. To do that automatically, the news documents to be clustered must be represented appropriately with numerical vectors. However, traditional representations such as Term Frequency-Inverse Document Frequency (TF-IDF) cannot distinguish near-synonyms and may cause dimension disaster. To overcome these problems, this article proposes the Bag-of-Near-Synonyms (BoNS) model based on the idea to construct near-synonym sets using word embeddings and agglomerative clustering, and then to represent a document with a Set Frequency-Inverse Document Frequency (SF-IDF) vector in which each dimension corresponds to a near-synonym set rather than a single word. To speed up computation, we further propose the hashed version of SF-IDF and name it hSF-IDF, which employs a hash function to map each near-synonym set to a unique number as the key and hence reduces the computation of SF to linear time. In addition, we apply hSF-IDF to online clustering of Chinese technology web news and propose an improved batch-based method. Extensive experiments have been conducted on a real-world dataset. The results show that our model outperforms some strong baselines including TF-IDF, average pooling of word or character embeddings, Latent Dirichlet Allocation (LDA), and bag-of-concepts in terms of both accuracy and efficiency.
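
The hashed set-frequency idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes SF-IDF mirrors the usual TF-IDF weighting (relative set frequency times log(N / df)), and it assumes the near-synonym sets have already been built. The function names, the toy sets, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def build_set_index(near_synonym_sets):
    """Map each word to the integer id of its near-synonym set (a hash lookup)."""
    index = {}
    for set_id, words in enumerate(near_synonym_sets):
        for word in words:
            index[word] = set_id
    return index


def set_frequencies(tokens, index):
    """Count near-synonym set occurrences in a single pass over the tokens."""
    return Counter(index[t] for t in tokens if t in index)


def sf_idf_vectors(docs, near_synonym_sets):
    """Return one sparse {set_id: weight} vector per document.

    Assumed weighting (not taken from the paper):
    weight = (set count / total set count in doc) * log(N / df).
    """
    index = build_set_index(near_synonym_sets)
    counts = [set_frequencies(doc, index) for doc in docs]
    n_docs = len(docs)

    # Document frequency: number of documents in which each set appears.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {sid: (n / total) * math.log(n_docs / df[sid]) for sid, n in c.items()}
        )
    return vectors


# Hypothetical toy data: two near-synonym sets and three tokenized documents.
sets = [{"phone", "smartphone"}, {"chip", "cpu"}]
docs = [["phone", "smartphone", "chip"], ["cpu", "chip"], ["phone"]]
vectors = sf_idf_vectors(docs, sets)
```

Because every token is resolved with one dictionary lookup, counting set frequencies takes time linear in the document length, which is the property the abstract attributes to hSF-IDF. Each vector has at most one dimension per near-synonym set, so near-synonyms such as "phone" and "smartphone" share a dimension.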
