Long document classification (LDC) has recently become a focus of interest in natural language processing (NLP) with the exponential increase of publications. Building on pretrained language models, many LDC methods have been proposed and have achieved considerable progress. However, most existing methods model long documents as plain sequences of text and omit the document structure, which limits their ability to effectively represent long texts that carry structure information. To mitigate this limitation, we propose a novel hierarchical graph convolutional network (HGCN) for structured LDC in this article, in which a section graph network models the macrostructure of a document and a word graph network with a decoupled graph convolutional block extracts its fine-grained features. In addition, an interaction strategy integrates these two networks into a whole by propagating features between them. To verify the effectiveness of the proposed model, four structured long document datasets are constructed, and extensive experiments conducted on these datasets and another unstructured dataset show that the proposed method outperforms state-of-the-art related classification methods.
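As a rough illustration of the kind of building block the word graph network relies on, the sketch below implements one decoupled graph convolution in plain NumPy: propagation over the normalized adjacency matrix is separated from the feature transformation. The decoupling scheme, the function names, and the toy graph are all assumptions for illustration; this is not the authors' HGCN implementation.

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalize an adjacency matrix with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}, as in standard GCNs."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def decoupled_gcn_block(A: np.ndarray, X: np.ndarray,
                        W: np.ndarray, k: int = 2) -> np.ndarray:
    """Decoupled graph convolution: run k weight-free propagation steps
    first, then apply a single feature transformation. This reading of
    the paper's 'decoupled' block is hypothetical; the authors' exact
    design may differ."""
    A_hat = normalize_adjacency(A)
    H = X
    for _ in range(k):           # propagation only, no weights
        H = A_hat @ H
    return np.maximum(H @ W, 0)  # transformation + ReLU

# Toy word graph: 4 nodes, 3-dim features, 2-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(decoupled_gcn_block(A, X, W).shape)  # (4, 2)
```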
ISBN (Print): 9798350381993; 9798350382006
Text classification is a key task in the Natural Language Processing (NLP) field that aims at assigning predefined categories to textual documents. Performing text classification requires features that effectively represent the content and the meaning of textual documents. Selecting a suitable method for term weighting is of central importance and can improve the quality of the classification method. In this paper, we propose a new text classification solution that performs Category-based Feature Augmentation (CFA) on the document representation. First, a term-category feature matrix is derived from a modified version of the supervised Term-Frequency Inverse-Category-Frequency (TF-ICF) weighting model. This is done by embedding the TF-ICF matrix in a one-layer feed-forward neural network. The latter is trained using the gradient descent algorithm, iteratively updating the term-category matrix until convergence. The model produces category-based feature vector representations that are used to augment the document representations and perform the classification task. Experimental results on four benchmark datasets show that our lean model approach improves text classification accuracy and is significantly more efficient than its deep model alternatives.
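For readers unfamiliar with the base weighting scheme, the sketch below computes classic TF-ICF weights, tf(t, c) · log(|C| / cf(t)). It is a minimal illustration only: the paper embeds a *modified* TF-ICF matrix into a one-layer network, and the function name and toy corpus here are made up.

```python
import math
from collections import Counter, defaultdict

def tf_icf(docs_by_category: dict) -> dict:
    """Base TF-ICF weights: tf(t, c) * log(|C| / cf(t)), where cf(t)
    is the number of categories whose documents contain term t.
    Input: {category: [list of tokenized documents]}."""
    tf = defaultdict(Counter)          # term frequency per category
    cf = Counter()                     # category frequency per term
    for c, docs in docs_by_category.items():
        for doc in docs:
            tf[c].update(doc)
        for term in tf[c]:
            cf[term] += 1
    n_cats = len(docs_by_category)
    return {c: {t: freq * math.log(n_cats / cf[t])
                for t, freq in tf[c].items()}
            for c in docs_by_category}

weights = tf_icf({
    "sports":  [["match", "goal", "team"]],
    "finance": [["market", "stock", "team"]],
})
print(weights["sports"]["goal"])   # high: 'goal' occurs in one category
print(weights["sports"]["team"])   # 0.0: 'team' occurs in both categories
```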
Text classification is a central task in Natural Language Processing (NLP) that aims at categorizing text documents into predefined classes or categories. It requires appropriate features to describe the contents and meaning of text documents and to map them to their target categories. Existing text feature representations rely on a weighted representation of the document terms. Hence, choosing a suitable method for term weighting is of major importance and can help increase the effectiveness of the classification task. In this study, we provide a novel text classification framework for Category-based Feature Engineering titled CFE. It consists of a supervised weighting scheme, defined based on a variant of the TF-ICF (Term Frequency-Inverse Category Frequency) model, embedded into three new lean classification approaches: (i) IterativeAdditive (flat), (ii) GradientDescentANN (1-layered), and (iii) FeedForwardANN (2-layered). The IterativeAdditive approach augments each document representation with a set of synthetic features inferred from TF-ICF category representations. It builds a term-category TF-ICF matrix using an iterative and additive algorithm that produces category vector representations and updates them until reaching convergence. GradientDescentANN replaces this iterative additive process by computing the term-category matrix with a gradient descent ANN model; training the ANN with the gradient descent algorithm updates the term-category matrix until convergence. FeedForwardANN uses a feed-forward ANN model to transform document representations into the category vector space. The transformed document vectors are then compared with the target category vectors and are associated with the most similar categories. We have implemented CFE, including its three classification approaches, and we have conducted a large battery of tests to evaluate their performance. Experimental results on five benchmark datasets show that our le
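The final assignment step the three approaches share, comparing a transformed document vector against category vectors and picking the most similar category, can be sketched as below. The cosine similarity, the function names, and the toy vectors are illustrative assumptions; the transformation into category space itself is omitted.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def classify(doc_vec: np.ndarray, category_vecs: dict) -> str:
    """Assign a document to the category whose vector is most similar,
    mirroring the comparison step described in the abstract."""
    return max(category_vecs, key=lambda c: cosine(doc_vec, category_vecs[c]))

cats = {"sports": np.array([0.9, 0.1]), "finance": np.array([0.1, 0.9])}
print(classify(np.array([0.8, 0.3]), cats))  # sports
```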
Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if they are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution to the challenges of preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
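As an illustration of what the cleaning stage might involve, the sketch below applies normalization rules commonly used for Arabic text (stripping diacritics and elongation, unifying alef/teh-marbuta/yeh variants, collapsing whitespace). These particular rules are assumptions; the paper does not enumerate its cleaning operations.

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")   # Arabic diacritic marks
TATWEEL = "\u0640"                           # elongation character

def clean_arabic(text: str) -> str:
    """Common normalization rules for social-media Arabic text; a
    plausible sketch of one cleaning step, not the paper's pipeline."""
    text = TASHKEEL.sub("", text)                           # strip diacritics
    text = text.replace(TATWEEL, "")                        # strip elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # آ أ إ -> ا
    text = text.replace("\u0629", "\u0647")                 # ة -> ه
    text = text.replace("\u0649", "\u064A")                 # ى -> ي
    return re.sub(r"\s+", " ", text).strip()                # collapse spaces

print(clean_arabic("أهلاً   وسهلاً"))
```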
The millions of tweets submitted daily overwhelm users, who find it difficult to identify content of interest, revealing the need for event detection algorithms on Twitter. Such algorithms are proposed in this paper covering both short-term periods (identifying what is currently happening) and long-term periods (reviewing the most salient recently submitted events). For both scenarios, we propose fuzzy-represented, temporally evolving tweet-based information-theoretic metrics to model Twitter dynamics. The Riemannian distance over words' signatures is also exploited to minimize temporal effects due to submission delays. Events are detected through a multi-assignment graph partitioning algorithm that 1) optimally retains maximum coherence within a cluster while 2) allowing a word to belong to several clusters (events). Experimental results on real-life data demonstrate that our approach outperforms other methods.
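The sketch below illustrates only the multi-assignment idea, that a word may join several clusters, using a thresholded similarity rule as a simplified stand-in; it is not the paper's coherence-optimizing partitioning algorithm, and the similarity matrix, seeds, and threshold are invented for the example.

```python
import numpy as np

def multi_assign(sim: np.ndarray, seeds: list, threshold: float = 0.5) -> list:
    """Simplified multi-assignment clustering stand-in: a word joins
    *every* cluster whose seed words it resembles on average, so
    clusters (events) may overlap."""
    clusters = [set(s) for s in seeds]
    for w in range(sim.shape[0]):
        for cluster, seed in zip(clusters, seeds):
            if sim[w, seed].mean() >= threshold:
                cluster.add(w)
    return clusters

# 4 words; words 0-1 and 2-3 are similar, and word 1 also resembles 2-3.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.6, 0.6],
                [0.1, 0.6, 1.0, 0.9],
                [0.1, 0.6, 0.9, 1.0]])
print(multi_assign(sim, seeds=[[0], [3]]))  # word 1 lands in both clusters
```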
Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content would waste bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification as well as state-of-the-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country-code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best performing classifiers, we obtained the lowest F1-measure for English (94) and the highest for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second, downloading pages in the "wrong" language wastes bandwidth. In both settings the best classifiers achieve high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
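A minimal sketch of one of the evaluated feature families, character n-grams over the raw URL string, paired with an arbitrary off-the-shelf classifier (naive Bayes via scikit-learn). The toy URLs and the classifier choice are assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training URLs; the paper's data came from the Open Directory
# Project and a commercial search engine.
urls = ["example.com/news/world-politics",
        "beispiel.de/nachrichten/welt",
        "exemple.fr/actualites/monde",
        "ejemplo.es/noticias/mundo"]
langs = ["en", "de", "fr", "es"]

# Character n-grams of length 2-4 over the raw URL, one of the feature
# families the paper evaluates, fed to a simple text classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    MultinomialNB(),
)
clf.fit(urls, langs)
print(clf.predict(["pagina.es/deportes/noticias"]))  # likely ['es']
```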
We present a tool that is specifically designed to support a writer in revising a draft version of a document. In addition to showing which paragraphs and sentences are difficult to read and understand, we assist the reader in understanding why this is the case. This requires features that are expressive predictors of readability and are also semantically understandable. In the first part of the paper, we therefore discuss a semiautomatic feature selection approach that is used to choose appropriate measures from a collection of 141 candidate readability features. In the second part, we present the visual analysis tool VisRA, which allows the user to analyze the feature values across the text and within single sentences. Users can choose between different visual representations accounting for differences in the size of the documents and in the availability of information about the physical and logical layout of the documents. We put special emphasis on providing as much transparency as possible to ensure that the user can purposefully improve the readability of a sentence. Several case studies are presented that show the wide range of applicability of our tool. Furthermore, an in-depth evaluation assesses the quality of the measure and investigates how well users do in revising a text with the help of the tool.
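The 141 candidate features are not enumerated in the abstract; as a flavor of what such a pool might contain, the sketch below computes two generic sentence-level readability features. Both features and the function name are illustrative assumptions.

```python
def sentence_features(sentence: str) -> dict:
    """Two generic readability-feature candidates: sentence length in
    words and mean word length in characters. Illustrative only; not
    drawn from the paper's actual feature pool."""
    words = sentence.split()
    return {
        "n_words": float(len(words)),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

print(sentence_features("Short words are easy."))
print(sentence_features("Multisyllabic terminology complicates comprehension."))
```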
ISBN (Print): 9780769547749; 9781467322621
The objective of this work is to show the importance of good line segmentation for obtaining better results in the word segmentation of historical documents. We have used the approach developed by Manmatha and Rothfeder [1] to segment words in old handwritten documents. In their work, the lines of the documents are extracted using projections. In this work, we have developed an approach to segment lines more efficiently. The new line segmentation algorithm handles skewed, touching, and noisy lines, so it significantly improves word segmentation. Experiments using Spanish documents from the Marriages Database of the Barcelona Cathedral show that this approach reduces the error rate by more than 20%.
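A minimal sketch of projection-profile line segmentation, the baseline idea the new algorithm improves on: sum the ink per row and cut where the profile drops. The threshold, function name, and toy image are assumptions, and this naive version does not handle the skewed or touching lines the proposed method targets.

```python
import numpy as np

def segment_lines(binary_img: np.ndarray, min_ink: int = 1) -> list:
    """Projection-profile line segmentation: count ink pixels per row
    and start/end a line where the count crosses min_ink. Returns a
    list of (start_row, end_row) pairs."""
    profile = binary_img.sum(axis=1)           # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                          # line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))           # line ends
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

img = np.zeros((10, 20), dtype=int)
img[1:3, 2:18] = 1                             # first text line
img[6:9, 1:19] = 1                             # second text line
print(segment_lines(img))                      # [(1, 3), (6, 9)]
```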
ISBN (Print): 9780769544922
In this paper, we present a distributed way to automatically map users' requirements to reference process models. In a prior paper [9], we presented a tool called Process Model Requirements Gap Analyzer (ProcGap), which combines natural language processing, information retrieval, and semantic reasoning to automatically match and map textual requirements to domain-specific process models. Although the tool proved beneficial to users in reusing prior knowledge by making process models easy to use, it has one main drawback: it takes a long time to compare a very large requirements document, one with a few thousand requirements, against a process model hierarchy with a few thousand capabilities. In this paper, we present how we solved this problem using Apache Hadoop, which allows ProcGap to distribute the matching task across several machines, increasing the tool's performance and usability. We present a performance comparison between running ProcGap on a single machine and our distributed version.
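To illustrate how such matching can be spread across machines, here is a Hadoop Streaming-style mapper sketch in Python: each mapper instance reads a shard of requirements from stdin and emits the best-matching capability per requirement. The Jaccard score is a stand-in for ProcGap's NLP/IR/semantic matching, and the inline capability list, IDs, and tab-separated input format are assumptions (ProcGap itself is not shown; in practice the capability list would be shipped via Hadoop's distributed cache).

```python
#!/usr/bin/env python3
"""Mapper sketch for a Hadoop Streaming job: reads lines of the form
'<requirement_id>\t<requirement text>' and emits
'<requirement_id>\t<best capability id>\t<score>'."""
import sys

# Stand-in capability hierarchy; really loaded from the distributed cache.
CAPABILITIES = {"C1": "manage purchase orders",
                "C2": "track shipment status"}

def score(requirement: str, capability: str) -> float:
    """Jaccard token overlap: a toy stand-in for ProcGap's matcher."""
    r, c = set(requirement.lower().split()), set(capability.lower().split())
    return len(r & c) / max(len(r | c), 1)

for line in sys.stdin:
    req_id, _, text = line.rstrip("\n").partition("\t")
    best = max(CAPABILITIES, key=lambda cid: score(text, CAPABILITIES[cid]))
    print(f"{req_id}\t{best}\t{score(text, CAPABILITIES[best]):.2f}")
```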
The Universal Business Language (UBL) is an initiative to develop common business document schemas for interoperability. However, businesses operate in different industry, geopolitical, and regulatory contexts and have different rules and requirements for the information they exchange. Consequently, several trading communities are tailoring UBL schemas to their needs, which requires translation between these schema customizations. In this article, the authors describe how to enhance UBL with semantics-based translation mechanisms to maintain interoperability between documents conforming to different schema versions.