A traditional classification approach based on keyword matching represents each text document as a set of keywords, without considering the semantic information, thereby reducing the accuracy of classification. To solve this problem, a classification approach based on Wikipedia matching was proposed, which represents each document as a concept vector in the Wikipedia semantic space so as to understand the text semantics, and has been demonstrated to improve the accuracy of classification. However, the immense Wikipedia semantic space greatly reduces the generation efficiency of a concept vector, negatively affecting the availability of the approach in an online environment. In this paper, we propose an efficient Wikipedia semantic matching approach to document classification. First, we define several heuristic selection rules to quickly pick out related concepts for a document from the Wikipedia semantic space, making it no longer necessary to match all the concepts in the semantic space and thus greatly improving the generation efficiency of the concept vector. Second, based on the semantic representation of each text document, we compute the similarity between documents so as to accurately classify them. Finally, evaluation experiments demonstrate the effectiveness of our approach: it improves the classification efficiency of Wikipedia matching without compromising classification accuracy. (C) 2017 Elsevier Inc. All rights reserved.
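The concept-vector matching described above can be sketched minimally as follows. The inverted index, the index contents, and the use of cosine as the similarity measure are all illustrative assumptions, not the paper's actual heuristic rules:

```python
from collections import Counter
import math

def concept_vector(tokens, concept_index):
    # concept_index maps a word to the Wikipedia concepts it triggers
    # (a hypothetical inverted index; the paper's heuristic selection
    # rules would prune this lookup instead of scanning the whole space).
    vec = Counter()
    for tok in tokens:
        for concept in concept_index.get(tok, ()):
            vec[concept] += 1
    return vec

def cosine_similarity(u, v):
    # Cosine of two sparse vectors stored as {key: weight} mappings.
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Two documents mapped to the same concepts then score similarity 1.0 even when their surface keywords differ, which is the advantage over keyword matching.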
Document classification requires extracting high-level features from low-level word vectors. Typically, feature extraction by deep neural networks makes use of all words in a document, which does not scale well to long documents. In this paper, we propose to tackle the long document classification task by incorporating a recurrent attention learning framework, which can produce discriminative features with significantly fewer words. Specifically, the core work is to train a recurrent neural network (RNN)-based controller that focuses its attention on the discriminative parts. Then, the glimpsed feature is extracted by a typical short-text-level convolutional neural network (CNN) from the focused group of words. The controller locates its attention according to the context information, which consists of the coarse representation of the original document and the memorized glimpsed features. By glimpsing a few groups, the document can be classified by aggregating these glimpsed features and the coarse representation. On our collected 11-class, 10 000-word arXiv paper data set, the proposed method outperforms two subsampled deep CNN baseline models by a large margin while observing far fewer words.
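The glimpse loop above can be caricatured in a toy sketch. Here bag-of-words overlap stands in for both the learned RNN controller and the CNN feature extractor, so this only illustrates the control flow (coarse context, attend, memorize, aggregate), not the paper's model:

```python
from collections import Counter

def glimpse_classify(words, keyword_sets, n_glimpses=3, window=20):
    """Toy recurrent-attention loop (all names and heuristics hypothetical).

    keyword_sets: {label: set of indicative words}. The 'glimpse feature'
    is a plain word bag; the paper extracts it with a short-text CNN.
    """
    # Coarse representation: every k-th word of the document.
    stride = max(1, len(words) // window)
    context = Counter(words[::stride])
    for _ in range(n_glimpses):
        starts = range(0, max(1, len(words) - window + 1), window)
        # "Controller": attend to the window sharing the most vocabulary
        # with the current context (stands in for the learned RNN policy).
        best = max(starts,
                   key=lambda s: sum((Counter(words[s:s+window]) & context).values()))
        context += Counter(words[best:best+window])  # memorize glimpsed feature
    # Aggregate: pick the label whose keywords the attended context hits most.
    return max(keyword_sets, key=lambda lab: sum(context[w] for w in keyword_sets[lab]))
```

The point of the design is that only `n_glimpses * window` words (plus the coarse sample) are ever featurized, regardless of document length.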
Document classification is a challenging task due to the data being high-dimensional and sparse. Many transfer learning methods have been investigated for improving classification performance by effectively transferring knowledge from a source domain to a target domain that is similar to, but different from, the source domain. However, most existing methods cannot handle the case in which the training data of the target domain has no labels. In this study, we propose a transductive transfer learning system that utilizes solutions evolved by genetic programming (GP) on a source domain to automatically pseudolabel the training data in the target domain in order to train classifiers. Unlike many other transfer learning techniques, the proposed system pseudolabels target-domain training data to retrain classifiers using all target-domain features. The proposed method is examined on nine transfer learning tasks, and the results show that the proposed transductive GP system has better prediction accuracy on the test data in the target domain than existing transfer learning approaches, including subspace-alignment domain adaptation methods, feature-level domain adaptation methods, and a recent pseudolabeling-strategy-based method.
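The transductive pseudolabeling step reduces to a short, classifier-agnostic skeleton. The callables here are placeholders: in the paper the source model is a GP-evolved classifier, and the retraining step fits new classifiers on all target-domain features:

```python
def transductive_pseudolabel(source_predict, target_train_X, fit):
    """Label unlabeled target-domain training data with a source-domain
    model, then retrain a fresh classifier on the target domain.

    source_predict: callable x -> label (stands in for a GP solution).
    fit: callable (X, y) -> trained model (hypothetical trainer).
    """
    pseudo_y = [source_predict(x) for x in target_train_X]
    return fit(target_train_X, pseudo_y)
```

Because retraining happens entirely in the target domain, the new classifier can exploit target-only features that the source model never saw.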
Recently, statistical topic modeling approaches, e.g., Latent Dirichlet Allocation (LDA), have been widely applied in the field of document classification. However, standard LDA is a completely unsupervised algorithm, so there is growing interest in incorporating prior information into the topic modeling procedure. Some effective approaches have been developed to model different kinds of prior information, for example, observed labels, hidden labels, the correlation among labels, and label frequencies; however, these methods often require heavy computation because of model complexity. In this paper, we propose a new supervised topic model for document classification problems, Twin Labeled LDA (TL-LDA), which runs two parallel topic modeling processes: one incorporates the prior label information through hierarchical Dirichlet distributions, while the other models the grouping tags, which carry prior knowledge about label correlation. The two processes are independent of each other, so TL-LDA can be trained efficiently by multi-threaded parallel computing. Quantitative experimental results compared with state-of-the-art approaches demonstrate that our model achieves the best scores on both rank-based and binary prediction metrics for single-label classification, and the best scores on three metrics, i.e., One Error, Micro-F1, and Macro-F1, for multi-label classification, on both non-power-law and power-law datasets. The results show that, benefiting from fully modeling prior knowledge, our model has outstanding performance and generalizability on document classification. Further comparisons with recent works also indicate that the proposed model is competitive with state-of-the-art approaches.
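For reference, the Micro-F1 and Macro-F1 metrics cited above are standard multi-label measures; a minimal implementation over label sets (independent of the TL-LDA model itself):

```python
def f1(p, r):
    # Harmonic mean of precision and recall; 0 when both are 0.
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred, labels):
    # y_true / y_pred are lists of label sets, one set per document.
    tp = {l: 0 for l in labels}
    fp = {l: 0 for l in labels}
    fn = {l: 0 for l in labels}
    for t, p in zip(y_true, y_pred):
        for l in labels:
            if l in p and l in t:
                tp[l] += 1
            elif l in p:
                fp[l] += 1
            elif l in t:
                fn[l] += 1
    # Micro-F1 pools counts over all labels; Macro-F1 averages per-label F1.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(TP / (TP + FP) if TP + FP else 0.0,
               TP / (TP + FN) if TP + FN else 0.0)
    per_label = []
    for l in labels:
        prec = tp[l] / (tp[l] + fp[l]) if tp[l] + fp[l] else 0.0
        rec = tp[l] / (tp[l] + fn[l]) if tp[l] + fn[l] else 0.0
        per_label.append(f1(prec, rec))
    return micro, sum(per_label) / len(per_label)
```

Micro-F1 favors performance on frequent labels, while Macro-F1 weights rare labels equally, which is why power-law datasets make the distinction matter.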
Topic modeling is an unsupervised learning task that discovers the hidden topics in a collection of documents. In turn, the discovered topics can be used for summarizing, organizing, and understanding the documents in the collection. Most of the existing techniques for topic modeling are derivatives of Latent Dirichlet Allocation, which uses a bag-of-words assumption for the documents. However, bag-of-words models completely dismiss the relationships between the words. For this reason, this article presents a two-stage algorithm for topic modeling that leverages word embeddings and word co-occurrence. In the first stage, we determine the topic-word distributions by soft-clustering a random set of embedded n-grams from the documents. In the second stage, we determine the document-topic distributions by sampling the topics of each document from the topic-word distributions. This approach leverages the distributional properties of word embeddings instead of using the bag-of-words assumption. Experimental results on various data sets from an Australian compensation organization show the remarkable comparative effectiveness of the proposed algorithm in a task of document classification.
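The second stage can be sketched as follows, with two simplifications clearly not in the paper: the topic-word distributions are hand-made rather than produced by soft-clustering embedded n-grams, and tokens take their MAP topic instead of being sampled:

```python
from collections import Counter

def document_topic_distribution(tokens, topic_word):
    # topic_word: {topic: {word: prob}} -- standing in for the
    # stage-one output of the embedding-based soft clustering.
    counts = Counter()
    for tok in tokens:
        scores = {t: dist.get(tok, 0.0) for t, dist in topic_word.items()}
        if max(scores.values()) > 0:
            # MAP assignment for brevity; the paper samples topics instead.
            counts[max(scores, key=scores.get)] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}
```

The resulting per-document topic proportions are what a downstream classifier would consume as features.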
This work focuses on multiple instance learning (MIL) with sparse positive bags (which we name sparse MIL). A structural representation is presented to encode both instances and bags. This representation leads to a non-i.i.d. MIL algorithm, miStruct, which uses a structural similarity to compare bags. Furthermore, MIL with this representation is shown to be equivalent to a document classification problem. Document classification also suffers from the fact that only a few paragraphs/words are useful in revealing the category of a document. By using the TF-IDF representation, which has excellent empirical performance in document classification, the miDoc method is proposed. The proposed methods achieve significantly higher accuracies and AUC (area under the ROC curve) than the state of the art on a large number of sparse MIL problems, and the document classification analogy explains their efficacy on sparse MIL problems.
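The TF-IDF representation invoked above is standard; a minimal version (one common weighting scheme among several, so the exact formula is an assumption):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists. Returns one sparse TF-IDF vector per doc.
    n = len(docs)
    df = Counter()                      # document frequency per word
    for d in docs:
        df.update(set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        # term frequency * inverse document frequency
        vecs.append({w: (c / len(d)) * math.log(n / df[w]) for w, c in tf.items()})
    return vecs
```

Words appearing in every bag/document get weight 0, which is exactly why TF-IDF suits the sparse setting: only the few discriminative words survive with large weights.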
Many companies are facing growing data archives, leading to an increasing focus on the automated classification of documents in corporate processes. Due to data protection guidelines, development with clear data is often difficult. One way to overcome this difficulty is to desensitize documents using document redaction. The following study, therefore, examines the impact of redaction on the document classification performance of a deep CNN model by analyzing how the classification performance deteriorates when the model is trained on unredacted documents and evaluated on redacted data (unredacted model) or trained on redacted data and applied to unredacted documents (redacted model). For the former condition, a loss in accuracy of 2.56%P was found, and a loss of 2.08%P for the latter. We were also able to show that the loss in performance differed greatly between document classes and was influenced by their proportion of redacted area (unredacted model: r=0.31; redacted model: r=0.87). For the model trained on redacted data and evaluated on unredacted data, we also determined that the decrease in classification accuracy was affected by the intra-class variability of the redacted area (r=0.74). From these results, recommendations for dealing with redacted data in document classification systems are derived.
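The r values reported above are correlation coefficients; assuming they are Pearson correlations (the abstract does not say), the computation is:

```python
import math

def pearson_r(xs, ys):
    # Sample Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

Applied per document class, xs would be the proportion of redacted area and ys the per-class accuracy loss.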
In this paper, a novel discriminative learning method is proposed to estimate generative models for multi-class pattern classification tasks, where a discriminative objective function is formulated with separation margins according to a discriminative learning criterion, such as large margin estimation (LME). Furthermore, the so-called approximation-maximization (AM) method is proposed to optimize the discriminative objective function w.r.t. the parameters of generative models. The AM approach provides a good framework to deal with latent variables in generative models, and it is flexible enough to discriminatively learn many rather complicated generative models. In this paper, we are interested in a group of generative models derived from multinomial distributions. Under some minor relaxation conditions, it is shown that the AM-based discriminative learning methods for these generative models result in linear programming (LP) problems that can be solved effectively and efficiently even for rather large-scale models. As a case study, we have studied learning multinomial mixture models (MMMs) for text document classification based on the large margin criterion. The proposed methods have been evaluated on the standard RCV1 text corpus. Experimental results show that large margin MMMs significantly outperform conventional MMMs as well as purely discriminative models such as support vector machines (SVMs), with over 25% relative classification error reduction observed on three independent RCV1 test sets.
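Classification with an MMM scores a document's word counts under each class's mixture and takes the argmax; a sketch of that per-class log-likelihood (the smoothing constant is an assumption, and the multinomial coefficient is dropped since it is constant across classes):

```python
import math

def mmm_log_likelihood(word_counts, components):
    # components: list of (weight, {word: prob}) multinomial mixture parts
    # for one class. Returns log P(doc | class) via log-sum-exp, up to the
    # class-independent multinomial coefficient.
    logs = []
    for weight, probs in components:
        lp = math.log(weight) + sum(
            c * math.log(probs.get(w, 1e-12))  # floor for unseen words (assumed)
            for w, c in word_counts.items())
        logs.append(lp)
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(l - m) for l in logs))
```

The paper's contribution is in *how* the `(weight, probs)` parameters are estimated (large-margin AM yielding an LP), not in this scoring rule, which is the standard generative decision.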
We propose a class-based mixture of topic models for classifying documents using both labeled and unlabeled examples (i.e., in a semi-supervised fashion). Most topic models incorporate documents' class labels by generating them after generating the words. In these models, the training class labels have a small effect on the estimated topics, as they are effectively treated as just another word among a huge set of word features. In this paper, we propose to increase the influence of class labels on topic models by generating the words in each document conditioned on the class label. We show that our specific generative process improves classification performance with a small loss in test-set log-likelihood. Within our framework, we provide a principled mechanism to control the contributions of the class labels and the word space to the likelihood function. Experiments show our approach achieves better classification accuracy than some standard semi-supervised and supervised topic models.
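The label-versus-words balancing idea can be illustrated with a scalar weight on the word-likelihood term. This is only a stand-in for the paper's mechanism: the parameter `lam`, the flat per-class word model (no topic mixture), and the smoothing floor are all assumptions:

```python
import math

def weighted_class_score(word_counts, class_prior, word_probs, lam):
    # Score of one class for a document: log prior plus a down-weighted
    # word log-likelihood. lam in (0, 1] controls how much the word space
    # contributes relative to the class label (hypothetical knob).
    return math.log(class_prior) + lam * sum(
        c * math.log(word_probs.get(w, 1e-12)) for w, c in word_counts.items())
```

With `lam` near 1 the (huge) word term dominates and labels act like "just another word"; shrinking `lam` lets the class prior reassert itself, which is the trade-off the abstract describes.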
We present a simple and yet effective approach for document classification that incorporates rationales elicited from annotators into the training of any off-the-shelf classifier. We empirically show on several document classification datasets that our classifier-agnostic approach, which makes no assumptions about the underlying classifier, can effectively incorporate rationales into the training of multinomial naïve Bayes, logistic regression, and support vector machines. In addition to being classifier-agnostic, we show that our method has comparable performance to previous classifier-specific approaches developed for incorporating rationales and feature annotations. Additionally, we propose and evaluate an active learning method tailored specifically for the learning-with-rationales framework.
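One way such a classifier-agnostic scheme can work is as a feature transform applied before any classifier sees the data; a sketch in that spirit, with the specific weights being illustrative guesses rather than the paper's values:

```python
def rationale_weighted(word_counts, rationale_words, keep=1.0, shrink=0.1):
    # Keep features inside the annotator's rationale at full strength and
    # shrink the rest (keep/shrink are hypothetical hyperparameters). The
    # output is an ordinary feature vector, so any off-the-shelf
    # classifier -- naive Bayes, logistic regression, SVM -- can consume it.
    return {w: c * (keep if w in rationale_words else shrink)
            for w, c in word_counts.items()}
```

Because the transform lives entirely in the feature space, no classifier internals need to change, which is what "classifier-agnostic" buys.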