Search results - Inner Mongolia University Library

New Bagging Based Ensemble Learning Algorithm Distinguishing Short and Long Texts for Document Classification

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2025, Vol. 24, No. 4, pp. 1-26

Authors: Wang, Youwei; Feng, Lizhou. Affiliations: Cent Univ Finance & Econ, Beijing, Peoples R China; Tianjin Univ Finance & Econ, Tianjin, Peoples R China

To improve the classification accuracy of ensemble learning, a new bootstrap aggregating (Bagging) ensemble learning algorithm distinguishing short and long texts for document classification is proposed. First, the performances of different typical deep learning methods on processing long and short texts are compared, and the optimal base classifiers for long and short texts are selected, respectively. Second, the random sampling method in traditional Bagging classification algorithms is improved, and a threshold group based random sampling method that can balance the numbers of long and short text subsets is proposed. Moreover, to improve the model inference speed and classification accuracy, the training of long and short text subsets is realized by combining the knowledge distillation theory. Finally, the sample classification probabilities on different categories are considered, and the category similarity information is combined with the traditional weighted voting classifier ensemble method to avoid the problem that the sampling process may decrease the accuracy. Experimental results on multiple datasets show that the algorithm can effectively improve the accuracy of document classification and has obvious advantages over typical deep learning algorithms and ensemble learning algorithms.
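The abstract does not give the details of its similarity-aware voting scheme, so the following is only a minimal sketch of the traditional weighted (soft) voting step that it builds on. The function name `weighted_soft_vote` and the example weights are hypothetical, and the category-similarity term described in the paper is not modeled here.

```python
def weighted_soft_vote(prob_rows, weights):
    """Combine per-classifier class probabilities with a weighted average.

    prob_rows: one probability list per base classifier (same class order).
    weights:   one non-negative weight per base classifier (assumed given).
    Returns (index of winning class, combined probability list).
    """
    n_classes = len(prob_rows[0])
    total = sum(weights)
    combined = [
        sum(w * row[c] for row, w in zip(prob_rows, weights)) / total
        for c in range(n_classes)
    ]
    winner = max(range(n_classes), key=combined.__getitem__)
    return winner, combined


# Three hypothetical base classifiers voting on a two-class problem.
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
label_equal, _ = weighted_soft_vote(probs, [1.0, 1.0, 1.0])
label_skewed, _ = weighted_soft_vote(probs, [0.8, 0.1, 0.1])
```

With equal weights the two classifiers favouring the second class win; giving most of the weight to the first classifier flips the decision, which is the behaviour a weighting scheme is meant to control.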

Keywords: Ensemble learning; weak classifier; document classification; deep learning; random sampling

An application of textual document classification for Arabic governmental correspondence

KUWAIT JOURNAL OF SCIENCE, 2025, Vol. 52, No. 1

Authors: Alzamel, Khaled; Alajmi, Manayer. Affiliation: Kuwait Univ, Dept Comp Engn, Shadadiya, Kuwait

Automated classification of Arabic documents is increasingly in demand, especially when dealing with an ever-growing amount of linguistic data. Natural language processing (NLP) has recently become one of the most significant fields in artificial intelligence (AI) thanks to recent advances in transformer-based models. Transformers facilitate the use of reusable models by using pre-trained models (PTMs). This study aims to fine-tune monolingual (AraBERT (Antoun et al., 2020)), bilingual (GigaBERT (Lan et al., 2020)), and multilingual (XLM-RoBERTa (Conneau et al., 2020)) transformer-based encoder models to classify official Arabic correspondence into pre-defined classes and compare their predictive performance in terms of accuracy, using a new balanced dataset. The new balanced dataset has 22,741 Arabic texts in six categories labeled with the most common ministries' names. The results in this study show that GigaBERT achieved the highest accuracy rate of 98%. The implemented models may contribute to the domain of information systems (ISs) to facilitate the classification process in ministries without human intervention.

Keywords: BERT; Contrastive learning; document classification; Transfer learning

Automatic PDF Document Classification with Machine Learning

25th International Conference on Intelligent Data Engineering and Automated Learning

Authors: Llacer Luna, Socrates; Garigliotti, Dario; Martinez Plumed, Fernando; Ferri Ramirez, Cesar. Affiliations: Univ Politecn Valencia, Valencia, Spain; Univ Bergen, Bergen, Norway

ISBN: (print) 9783031777301; 9783031777318

Universitat Politècnica de València (UPV) faces challenges in managing its Alfresco document repository, which contains 600,000 PDF files, of which only 100,000 are correctly categorised. Manual classification is laborious and error-prone, hindering information retrieval and advanced search capabilities. This project presents an automated pipeline that integrates optical character recognition (OCR) and machine learning to efficiently classify documents. Our approach distinguishes between scanned and digital documents, accurately extracts text and categorises it into 51 predefined categories using models such as BERT and RF. By improving document organisation and accessibility, this work optimises UPV's document management and paves the way for advanced search technologies and real-time classification systems.

Keywords: document classification; OCR; Machine Learning; Alfresco Repository

A Novel Automatic Text Document Classification Using Learning based Text Classification (LbTC) Approach

Procedia Computer Science, 2025, Vol. 258, pp. 4279-4290

Authors: Avinash N J Krishnaraj Rao Rama Moorthy H Raviprakash B Raghunadan K R Vasudeva Venkatadri M. Affiliations: Department of Electronics and Communication Engineering, Mangalore Institute of Technology and Engineering, Moodabidre 574225, Karnataka, India; Department of Information Science and Engineering, NMAM Institute of Technology, Nitte (Deemed to be University), Karkala 574110, Karnataka, India; Department of Computer Applications, Nitte Institute of Professional Education, Nitte (Deemed to be University), Mangalore 575007, Karnataka, India; Department of Computer Science and Engineering, NMAM Institute of Technology, Nitte (Deemed to be University), Karkala 574110, Karnataka, India; Associate Dean, MPSTME Shirpur Campus, NMIMS (Deemed-to-be-University), Shirpur, Maharashtra, India

Text document classification has many practical uses, and categorizing and parsing natural-language documents is crucial in practice. Applications such as bogus news identification, query tagging, sentiment classification, and spam filtering depend on this kind of analysis. However, because of their ambiguity, open-ended nature, and vastness, text documents pose a difficult classification challenge. Machine learning (ML) methods became valuable alongside the growth of artificial intelligence (AI) due to their data-driven learning techniques. These methods are seen as highly effective at handling and analyzing vast data sets in depth. Topic segmentation, text categorization, entity identification, machine translation, and text summarization are just a few of the problems that can be addressed with ML approaches. In this study, we introduce an automatic Text classification system (ATCF), a system that uses shallow and deep neural networks to classify text documents. To implement our system, we propose an approach called Learning based Text classification (LbTC). We compare the performance of several models. Based on the training provided to the ML models, the suggested framework assists in categorizing any type of document. It is suitable for practical applications where the categorization of documents is essential.

Keywords: Machine learning; shallow learning; deep learning; text classification; document classification

Document classification for mining host pathogen protein-protein interactions

ARTIFICIAL INTELLIGENCE IN MEDICINE, 2010, Vol. 49, No. 3, pp. 155-160

Authors: Yin, Lanlan; Xu, Guixian; Torii, Manabu; Niu, Zhendong; Maisog, Jose M.; Wu, Cathy; Hu, Zhangzhi; Liu, Hongfang. Affiliations: Georgetown Univ, Dept Biostat Bioinformat & Biomath, Washington, DC, USA; Beijing Inst Technol, Sch Comp Sci & Technol, Beijing 100081, Peoples R China; Minzu Univ China, Sch Informat Engn, Beijing, Peoples R China; Georgetown Univ Med Ctr, Imaging Sci & Informat Syst Ctr, Washington, DC 20007, USA; Med Numer Inc, Germantown, MD, USA; Georgetown Univ Med Ctr, Dept Oncol, Washington, DC 20007, USA

Objective: Scientific findings regarding human pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host pathogen protein-protein interactions (HP-PPIs), from the literature. Methods: In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to HP-PPIs. An annotated corpus consisting of 1360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of three feature selection methods: information gain (IG), χ² test, and specific mutual information (SI). The performance was measured using normalized discounted cumulative gain (NDCG) and positive predictive value (PPV), and all measures were obtained through 10-fold cross validation. Results: NDCG measures for classification systems using all features or a subset of features selected using IG and χ² test range from 0.83 to 0.89, while classification systems built on features selected using SI had relatively lower NDCG measures. The classification system achieved a PPV of 50.7% for the top 10% ranked documents, compared to a baseline PPV of 10.0%. Conclusions: Our results indicate that document classification systems can be constructed to efficiently retrieve HP-PPI related documents. Feature selection was effective in reducing the dimensionality of features to build a compact system.
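To make the two evaluation measures named in this abstract concrete, here is a minimal sketch assuming binary relevance labels and the common log2-discounted form of NDCG. The abstract does not state which NDCG variant the authors used, and the function names `ppv_at_top` and `ndcg` are hypothetical.

```python
import math


def ppv_at_top(scores, relevant, fraction):
    """Positive predictive value among the top `fraction` of ranked documents."""
    ranked = sorted(zip(scores, relevant), key=lambda p: p[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(r for _, r in ranked[:k]) / k


def ndcg(scores, relevant):
    """NDCG over the full ranking, with a 1/log2(rank + 1) discount."""
    ranked = [r for _, r in sorted(zip(scores, relevant),
                                   key=lambda p: p[0], reverse=True)]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(relevant, reverse=True)))
    return dcg / ideal if ideal else 0.0


# Ten hypothetical documents: classifier scores and gold labels (1 = relevant).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
relevant = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
top_ppv = ppv_at_top(scores, relevant, 0.2)
ranking_quality = ndcg(scores, relevant)
```

PPV at a cutoff only looks at the head of the ranking, while NDCG rewards placing every relevant document as high as possible, which is why the two are often reported together.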

Keywords: document classification; Host pathogen protein-protein interaction; Feature selection; Literature mining

DOCUMENT CLASSIFICATION USING INFORMATION THEORY AND A FAST BACK-PROPAGATION NEURAL NETWORK

INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2010, Vol. 16, No. 1, pp. 25-37

Authors: Li, Howard; Paull, Liam; Biletskiy, Yevgen; Yang, Simon X. Affiliations: Univ New Brunswick, Dept Elect & Comp Engn, Fredericton, NB E3B 5A3, Canada; Univ Guelph, Sch Engn, Guelph, ON N1G 2W1, Canada

In this paper, a fast back-propagation neural network is developed to build document classifiers and the information gain method is used for feature selection. According to the rank of the information gain of all the words contained in the documents, those words that contain more information to classify the documents are selected as the input features of the artificial neural network (ANN) classifiers. The neural network developed assumes a three-layer structure with a fast back-propagation learning algorithm. Because of the information contained in the vectors selected, the learning efficiency of the developed ANN is very high. For the output of the ANN, Shannon entropy is used to tune the threshold of the binary classifiers. The classifiers are tested using the Reuters corpus. Two performance measures are used to evaluate the performance of the classifiers and generally the results of this study are better than those claimed in literature.
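As a rough illustration of ranking words by information gain, the sketch below scores each word by how much knowing its presence or absence reduces the Shannon entropy of the class labels. It assumes documents are represented as sets of words; the toy corpus and the helper names (`entropy`, `information_gain`, `top_k_words`) are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Entropy reduction from splitting the corpus on presence of `word`."""
    with_word = [y for d, y in zip(docs, labels) if word in d]
    without = [y for d, y in zip(docs, labels) if word not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (with_word, without) if part)
    return entropy(labels) - remainder


def top_k_words(docs, labels, k):
    """The k vocabulary words with the highest information gain."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab,
                  key=lambda w: information_gain(docs, labels, w),
                  reverse=True)[:k]


# Toy corpus: each document is a set of words with one class label.
docs = [{"wheat", "price"}, {"wheat", "export"},
        {"oil", "price"}, {"oil", "barrel"}]
labels = ["grain", "grain", "crude", "crude"]
selected = top_k_words(docs, labels, 2)
```

In this toy corpus "wheat" and "oil" separate the two classes perfectly and are selected, while "price" appears equally in both classes and carries no information.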

Keywords: document classification; information gain; Shannon entropy; artificial neural networks

Document classification using convolutional neural networks with small window sizes and latent semantic analysis

WEB INTELLIGENCE, 2020, Vol. 18, No. 3, pp. 239-248

Authors: Gultepe, Eren; Kamkarhaghighi, Mehran; Makrehchi, Masoud. Affiliations: Southern Illinois Univ Edwardsville, Dept Comp Sci, Edwardsville, IL 62026, USA; Ontario Tech Univ, Dept Elect Comp & Software Engn, Oshawa, ON L1G 0C5, Canada

A parsimonious convolutional neural network (CNN) for text document classification that replicates the ease of use and high classification performance of linear methods is presented. This new CNN architecture can leverage locally trained latent semantic analysis (LSA) word vectors. The architecture is based on parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words. To test the efficacy of the new CNN architecture, three balanced text datasets that are known to perform exceedingly well with linear classifiers were evaluated. Also, three additional imbalanced datasets were evaluated to gauge the robustness of the LSA vectors and small window sizes. The new CNN architecture consisting of 1 to 4-grams, coupled with LSA word vectors, exceeded the accuracy of all linear classifiers on balanced datasets with an average improvement of 0.73%. In four out of the total six datasets, the LSA word vectors provided a maximum classification performance on par with or better than word2vec vectors in CNNs. Furthermore, in four out of the six datasets, the new CNN architecture provided the highest classification performance. Thus, the new CNN architecture and LSA word vectors could be used as a baseline method for text classification tasks.

Keywords: Convolutional neural networks; document classification; latent semantic analysis; word embedding; word vectors

Document Classification Algorithm Based on NPE and PSO

1st International Conference on E-Business and Information System Security

Authors: Wang, Ziqiang; Sun, Xia. Affiliation: Henan Univ Technol, Sch Informat Sci & Engn, Zhengzhou 450000, Peoples R China

ISBN: (print) 9781424429097

With many potential applications in document management and web searching, document classification has recently gained more attention. To efficiently resolve this problem, an efficient document classification algorithm based on neighborhood preserving embedding (NPE) and particle swarm optimization (PSO) is proposed in this paper. The document features are first extracted by the NPE algorithm, then the PSO classifier is used to classify the documents into semantically different classes. Experimental results show that the proposed algorithm achieves much better performance than other related classification algorithms.

Keywords: document classification; neighborhood preserving embedding (NPE); particle swarm optimization (PSO)

Document Classification Algorithm Based on IB and LS-SVM

3rd International Symposium on Intelligent Information Technology Application

Authors: Wang, Ziqiang; Sun, Xia. Affiliation: Henan Univ Technol, Sch Informat Sci & Engn, Zhengzhou 450000, Peoples R China

ISBN: (print) 9780769538594

Document classification has received extensive attention in the past few decades due to its wide applications in many fields. To efficiently deal with this problem, a novel document classification algorithm based on information bottleneck (IB) and the least square version of SVM (LS-SVM) is proposed in this paper. Extensive experimental results on a real-world document corpus show that the proposed algorithm achieves much better performance than the SVM algorithm.

Keywords: document classification; feature selection; information bottleneck (IB); least square SVM (LS-SVM)

Document Classification via Stable Graph Patterns and Conceptual AMR Graphs

1st International Joint Conference on Conceptual Knowledge Structures (CONCEPTS)

Authors: Parakal, Eric George; Dudyrev, Egor; Kuznetsov, Sergei O.; Napoli, Amedeo. Affiliations: Natl Res Univ Higher Sch Econ, Pokrovsky Blvd 11, Moscow 109028, Russia; Univ Lorraine, LORIA, CNRS, F-54000 Nancy, France

ISBN: (print) 9783031678677; 9783031678684

This paper proposes an approach and an associated system based on pattern structures, aimed at the classification of documents represented as graphs. The representation of documents relies on Abstract Meaning Representation (AMR) document graphs. Given a set of AMR document graphs, the system learns characteristic graph patterns that can be reused by an aggregate rule classifier to predict the class of a document. The selection of the most stable graph patterns is based on the gSOFIA algorithm and the Delta-stability measure. In the experiments, two document datasets are considered for validating the approach. The first includes documents belonging to 10 different newsgroups and the second contains sports news articles belonging to 5 topical areas. The results, in terms of macro-averaged F1 scores, are quite satisfactory and show that the approach is well-founded and useful.

Keywords: Pattern Structures; document classification; Natural Language Processing; Abstract Meaning Representation
