To improve the classification accuracy of ensemble learning, a new bootstrap aggregating (Bagging) ensemble learning algorithm that distinguishes short and long texts for document classification is proposed. First, the performance of several typical deep learning methods on long and short texts is compared, and the best base classifier is selected for each. Second, the random sampling method of traditional Bagging classification algorithms is improved, and a threshold-group-based random sampling method that balances the numbers of long and short text subsets is proposed. Moreover, to improve model inference speed and classification accuracy, the long and short text subsets are trained with knowledge distillation. Finally, the sample classification probabilities for the different categories are considered, and category similarity information is combined with the traditional weighted voting classifier ensemble method to avoid the accuracy loss that the sampling process may cause. Experimental results on multiple datasets show that the algorithm effectively improves the accuracy of document classification and has clear advantages over typical deep learning algorithms and ensemble learning algorithms.
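The sampling and voting steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names, the token-count threshold rule, and the plain weighted average of probabilities are all assumptions made for illustration, and the category similarity term mentioned in the abstract is not modeled.

```python
import random
from typing import Dict, List, Tuple


def balanced_bootstrap(
    texts: List[str], threshold: int, n_per_group: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Draw equally sized bootstrap samples of short and long texts.

    Assumption: a text is "short" when it has at most `threshold`
    whitespace-separated tokens; the paper's grouping rule may differ.
    """
    rng = random.Random(seed)
    short = [t for t in texts if len(t.split()) <= threshold]
    long_ = [t for t in texts if len(t.split()) > threshold]
    if not short or not long_:
        raise ValueError("need at least one short and one long text")
    # Sample with replacement, as in ordinary Bagging.
    short_sample = [rng.choice(short) for _ in range(n_per_group)]
    long_sample = [rng.choice(long_) for _ in range(n_per_group)]
    return short_sample, long_sample


def weighted_soft_vote(
    probas: List[Dict[str, float]], weights: List[float]
) -> str:
    """Return the label with the highest weighted average probability."""
    if not probas or len(probas) != len(weights):
        raise ValueError("need one weight per classifier")
    total = sum(weights)
    scores: Dict[str, float] = {}
    for proba, weight in zip(probas, weights):
        for label, p in proba.items():
            scores[label] = scores.get(label, 0.0) + (weight / total) * p
    return max(scores, key=scores.get)
```

Drawing the same number of samples from each length group keeps either group from dominating the training subsets, which is the balancing effect the abstract attributes to its sampling method.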
Automated classification of Arabic documents is increasingly in demand, especially when dealing with an ever-growing amount of linguistic data. Natural language processing (NLP) has recently become one of the most significant fields in artificial intelligence (AI) thanks to recent advances in transformer-based models. Transformers facilitate model reuse through pre-trained models (PTMs). This study aims to fine-tune monolingual (AraBERT (Antoun et al., 2020)), bilingual (GigaBERT (Lan et al., 2020)), and multilingual (XLM-RoBERTa (Conneau et al., 2020)) transformer-based encoder models to classify official Arabic correspondence into predefined classes and to compare their predictive performance in terms of accuracy, using a new balanced dataset. The dataset contains 22,741 Arabic texts divided into six categories labeled with the names of the most common ministries. The results show that GigaBERT achieved the highest accuracy, 98%. The implemented models may contribute to the domain of information systems (ISs) by facilitating the classification process in ministries without human intervention.
ISBN: 9783031777301; 9783031777318 (print)
Universitat Politècnica de València (UPV) faces challenges in managing its Alfresco document repository, which contains 600,000 PDF files, of which only 100,000 are correctly categorised. Manual classification is laborious and error-prone, hindering information retrieval and advanced search capabilities. This project presents an automated pipeline that integrates optical character recognition (OCR) and machine learning to efficiently classify documents. Our approach distinguishes between scanned and digital documents, accurately extracts text and categorises it into 51 predefined categories using models such as BERT and RF. By improving document organisation and accessibility, this work optimises UPV's document management and paves the way for advanced search technologies and real-time classification systems.
Text document classification has many practical uses. In practice, it is crucial to categorize and parse natural-language documents. Applications such as fake news identification, query tagging, sentiment classification, and spam filtering depend on it. However, because of their ambiguity, open-ended nature, and sheer volume, text documents pose a difficult classification challenge. Machine learning (ML) methods became valuable alongside the growth of artificial intelligence (AI) due to their data-driven learning techniques. These methods are seen as highly effective at handling and analyzing vast data sets in depth. Topic segmentation, text categorization, entity identification, machine translation, and text summarization are just a few of the problems that may be addressed with ML approaches. In this study, we introduce an automatic text classification system (ATCF) that uses shallow and deep neural networks to classify text documents. To implement the system, we propose an approach called Learning-based Text Classification (LbTC). We compare the performance of several models. Based on the training provided to the ML models, the proposed framework assists in categorizing any type of document. It is suited to practical applications where document categorization is essential.
Objective: Scientific findings regarding human pathogens and their host responses are buried in the growing volume of biomedical literature, and there is an urgent need to mine information pertaining to pathogenesis-related proteins, especially host-pathogen protein-protein interactions (HP-PPIs), from the literature. Methods: In this paper, we report our exploration of developing an automated system to identify MEDLINE abstracts referring to HP-PPIs. An annotated corpus consisting of 1360 MEDLINE abstracts was generated. With this corpus, we developed and evaluated document classification systems using support vector machines (SVMs). We also investigated the effects of three feature selection methods: information gain (IC), the chi-square test, and specific mutual information (SI). Performance was measured using normalized discounted cumulative gain (NDCG) and positive predictive value (PPV), and all measures were obtained through 10-fold cross-validation. Results: NDCG measures for classification systems using all features or a subset of features selected using IC and the chi-square test range from 0.83 to 0.89, while classification systems built on features selected using SI had relatively lower NDCG measures. The classification system achieved a PPV of 50.7% for the top 10% ranked documents, compared with a baseline PPV of 10.0%. Conclusions: Our results indicate that document classification systems can be constructed to efficiently retrieve HP-PPI related documents. Feature selection was effective in reducing the dimensionality of features to build a compact system. (C) 2010 Elsevier B.V. All rights reserved.
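The two evaluation measures named in this abstract can be sketched for a ranked list of binary relevance labels. This is a minimal illustration assuming the common formulation of NDCG with binary gains and a log2 rank discount; the abstract does not state which variant the authors used, and the function names are hypothetical.

```python
import math
from typing import List


def dcg(relevances: List[int]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(ranked_labels: List[int]) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0


def ppv_at_top(ranked_labels: List[int], fraction: float) -> float:
    """Share of truly relevant documents among the top-ranked fraction."""
    k = max(1, int(len(ranked_labels) * fraction))
    return sum(ranked_labels[:k]) / k


# Labels are 1 (relevant) or 0, ordered by classifier score, best first.
ranking = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(ndcg(ranking), ppv_at_top(ranking, 0.2))
```

Read this way, the reported PPV for the top 10% of ranked documents is the fraction of those documents that are truly relevant, and the baseline is the same fraction over the whole collection.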
In this paper, a fast back-propagation neural network is developed to build document classifiers, and the information gain method is used for feature selection. The words contained in the documents are ranked by information gain, and those that carry the most information for classifying the documents are selected as the input features of the artificial neural network (ANN) classifiers. The neural network has a three-layer structure and uses a fast back-propagation learning algorithm. Because of the information contained in the selected vectors, the learning efficiency of the developed ANN is very high. For the output of the ANN, Shannon entropy is used to tune the threshold of the binary classifiers. The classifiers are tested on the Reuters corpus. Two performance measures are used to evaluate the classifiers, and the results of this study are generally better than those reported in the literature.
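Ranking words by information gain can be illustrated with a short sketch. It assumes the standard definition (class entropy minus the class entropy conditioned on whether a term is present) and represents each document as a set of tokens; the function names and the toy data are illustrative, not taken from the paper.

```python
import math
from collections import Counter
from typing import List, Set


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs: List[Set[str]], labels: List[str], term: str) -> float:
    """Reduction in class entropy from knowing whether `term` occurs."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (
        len(absent) / n
    ) * entropy(absent)
    return entropy(labels) - conditional


def top_terms(docs: List[Set[str]], labels: List[str], k: int) -> List[str]:
    """Return the k terms with the highest information gain (ties by name)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


docs = [{"oil", "price"}, {"oil", "export"}, {"match", "goal"}, {"goal", "price"}]
labels = ["energy", "energy", "sport", "sport"]
print(top_terms(docs, labels, 2))
```

In this toy corpus "oil" and "goal" each separate the two classes perfectly, while "price" appears equally in both and carries no information, so it would not be chosen as an input feature.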
A parsimonious convolutional neural network (CNN) for text document classification that replicates the ease of use and high classification performance of linear methods is presented. This new CNN architecture can leverage locally trained latent semantic analysis (LSA) word vectors. The architecture is based on parallel 1D convolutional layers with small window sizes, ranging from 1 to 5 words. To test the efficacy of the new CNN architecture, three balanced text datasets that are known to perform exceedingly well with linear classifiers were evaluated. Also, three additional imbalanced datasets were evaluated to gauge the robustness of the LSA vectors and small window sizes. The new CNN architecture consisting of 1 to 4-grams, coupled with LSA word vectors, exceeded the accuracy of all linear classifiers on balanced datasets with an average improvement of 0.73%. In four out of the total six datasets, the LSA word vectors provided a maximum classification performance on par with or better than word2vec vectors in CNNs. Furthermore, in four out of the six datasets, the new CNN architecture provided the highest classification performance. Thus, the new CNN architecture and LSA word vectors could be used as a baseline method for text classification tasks.
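The idea of parallel 1D convolutions with small word windows can be shown in a dependency-free sketch. Each filter spans a fixed number of consecutive word vectors, and one feature per filter is kept by taking the maximum activation over all positions. The ReLU activation, the max-over-time pooling, and the hand-set toy weights are assumptions for illustration; the abstract does not specify these details.

```python
from typing import List

Vector = List[float]


def conv1d_max(embeddings: List[Vector], kernel: List[Vector]) -> float:
    """Slide one filter of width len(kernel) over the word vectors and
    return the largest ReLU activation (max-over-time pooling)."""
    width = len(kernel)
    if len(embeddings) < width:
        raise ValueError("sequence is shorter than the filter window")
    best = 0.0
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = sum(
            w * x
            for k_row, e_row in zip(kernel, window)
            for w, x in zip(k_row, e_row)
        )
        best = max(best, activation)  # starting at 0.0 applies the ReLU
    return best


def parallel_features(
    embeddings: List[Vector], kernels: List[List[Vector]]
) -> List[float]:
    """Apply filters of different window sizes side by side."""
    return [conv1d_max(embeddings, kernel) for kernel in kernels]


# Three words with 2-dimensional toy vectors (stand-ins for LSA vectors).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unigram_filter = [[1.0, 0.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
print(parallel_features(sentence, [unigram_filter, bigram_filter]))
```

In a trained network the filter weights are learned and there are many filters per window size; the pooled features from all branches are concatenated and passed to the classification layer.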
ISBN: 9781424429097 (print)
With many potential applications in document management and web searching, document classification has recently gained more attention. To address this problem efficiently, a document classification algorithm based on neighborhood preserving embedding (NPE) and particle swarm optimization (PSO) is proposed in this paper. The document features are first extracted by the NPE algorithm, and the PSO classifier is then used to classify the documents into semantically different classes. Experimental results show that the proposed algorithm achieves much better performance than other related classification algorithms.
ISBN: 9780769538594 (print)
Document classification has received extensive attention in the past few decades due to its wide applications in many fields. To deal with this problem efficiently, a novel document classification algorithm based on information bottleneck (IB) and the least squares version of SVM (LS-SVM) is proposed in this paper. Extensive experimental results on a real-world document corpus show that the proposed algorithm achieves much better performance than the SVM algorithm.
ISBN: 9783031678677; 9783031678684 (print)
This paper proposes an approach and an associated system based on pattern structures, aimed at the classification of documents represented as graphs. The representation of documents relies on Abstract Meaning Representation (AMR) document graphs. Given a set of AMR document graphs, the system learns characteristic graph patterns that can be reused by an aggregate rule classifier to predict the class of a document. The selection of the most stable graph patterns is based on the gSOFIA algorithm and the Delta-stability measure. In the experiments, two document datasets are considered for validating the approach. The first includes documents belonging to 10 different newsgroups, and the second contains sports news articles belonging to 5 topical areas. The results, in terms of macro-averaged F1 scores, are quite satisfactory and show that the approach is well-founded and useful.
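For reference, the macro-averaged F1 score used to report these results can be computed as in the sketch below. It assumes the usual definition, in which F1 is computed separately for each class and the per-class values are averaged with equal weight; the function name and the toy labels are illustrative.

```python
from typing import List


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["hockey", "hockey", "tennis", "tennis"]
y_pred = ["hockey", "tennis", "tennis", "tennis"]
print(macro_f1(y_true, y_pred))
```

Because every class counts equally, macro-averaging does not let a large class mask poor performance on a small one, which matters when topical areas differ in size.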