Long document classification (LDC) has recently become a focus of interest in natural language processing (NLP) with the exponential increase of publications. Building on pretrained language models, many LDC methods have been proposed and have achieved considerable progress. However, most existing methods model long documents as plain sequences of text and omit the document structure, which limits their ability to effectively represent long texts that carry structure information. To mitigate this limitation, we propose a novel hierarchical graph convolutional network (HGCN) for structured LDC in this article, in which a section graph network models the macrostructure of a document and a word graph network with a decoupled graph convolutional block extracts its fine-grained features. In addition, an interaction strategy integrates these two networks into a whole by propagating features between them. To verify the effectiveness of the proposed model, four structured long document datasets are constructed, and extensive experiments conducted on these datasets and another unstructured dataset show that the proposed method outperforms state-of-the-art related classification methods.
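As a rough illustration of the kind of building block the word graph network relies on, the sketch below implements one decoupled graph convolution in plain NumPy: propagation over the normalized adjacency matrix is separated from the feature transformation. The decoupling scheme, the function names, and the toy graph are all assumptions for illustration; this is not the authors' HGCN implementation.

```python
import numpy as np

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Symmetrically normalize an adjacency matrix with self-loops:
    A_hat = D^{-1/2} (A + I) D^{-1/2}, as in standard GCNs."""
    A_tilde = A + np.eye(A.shape[0])
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def decoupled_gcn_block(A: np.ndarray, X: np.ndarray,
                        W: np.ndarray, k: int = 2) -> np.ndarray:
    """Decoupled graph convolution: run k weight-free propagation steps
    first, then apply a single feature transformation. This reading of
    the paper's 'decoupled' block is hypothetical; the authors' exact
    design may differ."""
    A_hat = normalize_adjacency(A)
    H = X
    for _ in range(k):           # propagation only, no weights
        H = A_hat @ H
    return np.maximum(H @ W, 0)  # transformation + ReLU

# Toy word graph: 4 nodes, 3-dim features, 2-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.rand(4, 3)
W = np.random.rand(3, 2)
print(decoupled_gcn_block(A, X, W).shape)  # (4, 2)
```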
ISBN (Print): 9798350381993; 9798350382006
Text classification is a key task in the Natural Language Processing (NLP) field that aims at assigning predefined categories to textual documents. Performing text classification requires features that effectively represent the content and the meaning of textual documents. Selecting a suitable method for term weighting is of central importance and can improve the quality of the classification method. In this paper, we propose a new text classification solution that performs Category-based Feature Augmentation (CFA) on the document representation. First, a term-category feature matrix is derived from a modified version of the supervised Term-Frequency Inverse-Category-Frequency (TF-ICF) weighting model. This is done by embedding the TF-ICF matrix in a one-layer feed-forward neural network. The latter is trained using the gradient descent algorithm, iteratively updating the term-category matrix until convergence. The model produces category-based feature vector representations that are used to augment the document representations and perform the classification task. Experimental results on four benchmark datasets show that our lean model approach improves text classification accuracy and is significantly more efficient than its deep model alternatives.
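For readers unfamiliar with the base weighting scheme, the sketch below computes classic TF-ICF weights, tf(t, c) · log(|C| / cf(t)). It is a minimal illustration only: the paper embeds a *modified* TF-ICF matrix into a one-layer network, and the function name and toy corpus here are made up.

```python
import math
from collections import Counter, defaultdict

def tf_icf(docs_by_category: dict) -> dict:
    """Base TF-ICF weights: tf(t, c) * log(|C| / cf(t)), where cf(t)
    is the number of categories whose documents contain term t.
    Input: {category: [list of tokenized documents]}."""
    tf = defaultdict(Counter)          # term frequency per category
    cf = Counter()                     # category frequency per term
    for c, docs in docs_by_category.items():
        for doc in docs:
            tf[c].update(doc)
        for term in tf[c]:
            cf[term] += 1
    n_cats = len(docs_by_category)
    return {c: {t: freq * math.log(n_cats / cf[t])
                for t, freq in tf[c].items()}
            for c in docs_by_category}

weights = tf_icf({
    "sports":  [["match", "goal", "team"]],
    "finance": [["market", "stock", "team"]],
})
print(weights["sports"]["goal"])   # high: 'goal' occurs in one category
print(weights["sports"]["team"])   # 0.0: 'team' occurs in both categories
```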
Text classification is a central task in Natural Language Processing (NLP) that aims at categorizing text documents into predefined classes or categories. It requires appropriate features to describe the contents and meaning of text documents and to map them to their target categories. Existing text feature representations rely on a weighted representation of the document terms. Hence, choosing a suitable method for term weighting is of major importance and can help increase the effectiveness of the classification task. In this study, we provide a novel text classification framework for Category-based Feature Engineering titled CFE. It consists of a supervised weighting scheme, defined based on a variant of the TF-ICF (Term Frequency-Inverse Category Frequency) model, embedded into three new lean classification approaches: (i) IterativeAdditive (flat), (ii) GradientDescentANN (1-layered), and (iii) FeedForwardANN (2-layered). The IterativeAdditive approach augments each document representation with a set of synthetic features inferred from TF-ICF category representations. It builds a term-category TF-ICF matrix using an iterative and additive algorithm that produces category vector representations and updates them until reaching convergence. GradientDescentANN replaces this iterative additive process by computing the term-category matrix with a gradient descent ANN model; training the ANN with the gradient descent algorithm updates the term-category matrix until convergence. FeedForwardANN uses a feed-forward ANN model to transform document representations into the category vector space. The transformed document vectors are then compared with the target category vectors and are associated with the most similar categories. We have implemented CFE, including its three classification approaches, and we have conducted a large battery of tests to evaluate their performance. Experimental results on five benchmark datasets show that our le
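The final assignment step the three approaches share, comparing a transformed document vector against category vectors and picking the most similar category, can be sketched as below. The cosine similarity, the function names, and the toy vectors are illustrative assumptions; the transformation into category space itself is omitted.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def classify(doc_vec: np.ndarray, category_vecs: dict) -> str:
    """Assign a document to the category whose vector is most similar,
    mirroring the comparison step described in the abstract."""
    return max(category_vecs, key=lambda c: cosine(doc_vec, category_vecs[c]))

cats = {"sports": np.array([0.9, 0.1]), "finance": np.array([0.1, 0.9])}
print(classify(np.array([0.8, 0.3]), cats))  # sports
```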
Currently, social media plays an important role in daily life and routine. Millions of people use social media for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if they are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from social media Arabic text. It provides an integrated solution to the challenges of preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the implementation of the proposed approach yields a useful and full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
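As an illustration of what the cleaning stage might involve, the sketch below applies normalization rules commonly used for Arabic text (stripping diacritics and elongation, unifying alef/teh-marbuta/yeh variants, collapsing whitespace). These particular rules are assumptions; the paper does not enumerate its cleaning operations.

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")   # Arabic diacritic marks
TATWEEL = "\u0640"                           # elongation character

def clean_arabic(text: str) -> str:
    """Common normalization rules for social-media Arabic text; a
    plausible sketch of one cleaning step, not the paper's pipeline."""
    text = TASHKEEL.sub("", text)                           # strip diacritics
    text = text.replace(TATWEEL, "")                        # strip elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # آ أ إ -> ا
    text = text.replace("\u0629", "\u0647")                 # ة -> ه
    text = text.replace("\u0649", "\u064A")                 # ى -> ي
    return re.sub(r"\s+", " ", text).strip()                # collapse spaces

print(clean_arabic("أهلاً   وسهلاً"))
```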
The millions of tweets submitted daily overwhelm users, who find it difficult to identify content of interest, revealing the need for event detection algorithms on Twitter. Such algorithms are proposed in this paper covering both short-term periods (identifying what is currently happening) and long-term periods (reviewing the most salient recently submitted events). For both scenarios, we propose fuzzy-represented, temporally evolving tweet-based information-theoretic metrics to model Twitter dynamics. The Riemannian distance over words' signatures is also exploited to minimize temporal effects due to submission delays. Events are detected through a multi-assignment graph partitioning algorithm that 1) optimally retains maximum coherence within a cluster while 2) allowing a word to belong to several clusters (events). Experimental results on real-life data demonstrate that our approach outperforms other methods.
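The sketch below illustrates only the multi-assignment idea, that a word may join several clusters, using a thresholded similarity rule as a simplified stand-in; it is not the paper's coherence-optimizing partitioning algorithm, and the similarity matrix, seeds, and threshold are invented for the example.

```python
import numpy as np

def multi_assign(sim: np.ndarray, seeds: list, threshold: float = 0.5) -> list:
    """Simplified multi-assignment clustering stand-in: a word joins
    *every* cluster whose seed words it resembles on average, so
    clusters (events) may overlap."""
    clusters = [set(s) for s in seeds]
    for w in range(sim.shape[0]):
        for cluster, seed in zip(clusters, seeds):
            if sim[w, seed].mean() >= threshold:
                cluster.add(w)
    return clusters

# 4 words; words 0-1 and 2-3 are similar, and word 1 also resembles 2-3.
sim = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.6, 0.6],
                [0.1, 0.6, 1.0, 0.9],
                [0.1, 0.6, 0.9, 1.0]])
print(multi_assign(sim, seeds=[[0], [3]]))  # word 1 lands in both clusters
```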
Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content would waste bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification as well as state-of-the-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country-code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best performing classifiers, we obtained the lowest F1-measure for English (94) and the highest for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second, downloading pages in the "wrong" language wastes bandwidth. In both settings the best classifiers achieve high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
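A minimal sketch of one of the evaluated feature families, character n-grams over the raw URL string, paired with an arbitrary off-the-shelf classifier (naive Bayes via scikit-learn). The toy URLs and the classifier choice are assumptions, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training URLs; the paper's data came from the Open Directory
# Project and a commercial search engine.
urls = ["example.com/news/world-politics",
        "beispiel.de/nachrichten/welt",
        "exemple.fr/actualites/monde",
        "ejemplo.es/noticias/mundo"]
langs = ["en", "de", "fr", "es"]

# Character n-grams of length 2-4 over the raw URL, one of the feature
# families the paper evaluates, fed to a simple text classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    MultinomialNB(),
)
clf.fit(urls, langs)
print(clf.predict(["pagina.es/deportes/noticias"]))  # likely ['es']
```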
We present a tool that is specifically designed to support a writer in revising a draft version of a document. In addition to showing which paragraphs and sentences are difficult to read and understand, we assist the reader in understanding why this is the case. This requires features that are expressive predictors of readability and are also semantically understandable. In the first part of the paper, we therefore discuss a semiautomatic feature selection approach that is used to choose appropriate measures from a collection of 141 candidate readability features. In the second part, we present the visual analysis tool VisRA, which allows the user to analyze the feature values across the text and within single sentences. Users can choose between different visual representations accounting for differences in the size of the documents and in the availability of information about the physical and logical layout of the documents. We put special emphasis on providing as much transparency as possible to ensure that the user can purposefully improve the readability of a sentence. Several case studies are presented that show the wide range of applicability of our tool. Furthermore, an in-depth evaluation assesses the quality of the measure and investigates how well users do in revising a text with the help of the tool.
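The 141 candidate features are not enumerated in the abstract; as a flavor of what such a pool might contain, the sketch below computes two generic sentence-level readability features. Both features and the function name are illustrative assumptions.

```python
def sentence_features(sentence: str) -> dict:
    """Two generic readability-feature candidates: sentence length in
    words and mean word length in characters. Illustrative only; not
    drawn from the paper's actual feature pool."""
    words = sentence.split()
    return {
        "n_words": float(len(words)),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
    }

print(sentence_features("Short words are easy."))
print(sentence_features("Multisyllabic terminology complicates comprehension."))
```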
ISBN (Print): 9780769547749; 9781467322621
The objective of this work is to show the importance of good line segmentation for obtaining better results in the word segmentation of historical documents. We have used the approach developed by Manmatha and Rothfeder [1] to segment words in old handwritten documents. In their work, the lines of the documents are extracted using projections. In this work, we have developed an approach to segment lines more efficiently. The new line segmentation algorithm handles skewed, touching, and noisy lines, so it significantly improves word segmentation. Experiments using Spanish documents from the Marriages Database of the Barcelona Cathedral show that this approach reduces the error rate by more than 20%.
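A minimal sketch of projection-profile line segmentation, the baseline idea the new algorithm improves on: sum the ink per row and cut where the profile drops. The threshold, function name, and toy image are assumptions, and this naive version does not handle the skewed or touching lines the proposed method targets.

```python
import numpy as np

def segment_lines(binary_img: np.ndarray, min_ink: int = 1) -> list:
    """Projection-profile line segmentation: count ink pixels per row
    and start/end a line where the count crosses min_ink. Returns a
    list of (start_row, end_row) pairs."""
    profile = binary_img.sum(axis=1)           # ink pixels per row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                          # line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))           # line ends
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

img = np.zeros((10, 20), dtype=int)
img[1:3, 2:18] = 1                             # first text line
img[6:9, 1:19] = 1                             # second text line
print(segment_lines(img))                      # [(1, 3), (6, 9)]
```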
ISBN (Print): 9780769544922
In this paper, we present a distributed way to automatically map users' requirements to reference process models. In a prior paper [9], we presented a tool called Process Model Requirements Gap Analyzer (ProcGap), which combines natural language processing, information retrieval, and semantic reasoning to automatically match and map textual requirements to domain-specific process models. Although the tool proved beneficial to users in reusing prior knowledge by making process models easy to use, it has one main drawback: it takes a long time to compare a very large requirements document, one with a few thousand requirements, against a process model hierarchy with a few thousand capabilities. In this paper, we present how we solved this problem using Apache Hadoop, which allows ProcGap to distribute the matching task across several machines, increasing the tool's performance and usability. We present a performance comparison between running ProcGap on a single machine and our distributed version.
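To illustrate how such matching can be spread across machines, here is a Hadoop Streaming-style mapper sketch in Python: each mapper instance reads a shard of requirements from stdin and emits the best-matching capability per requirement. The Jaccard score is a stand-in for ProcGap's NLP/IR/semantic matching, and the inline capability list, IDs, and tab-separated input format are assumptions (ProcGap itself is not shown; in practice the capability list would be shipped via Hadoop's distributed cache).

```python
#!/usr/bin/env python3
"""Mapper sketch for a Hadoop Streaming job: reads lines of the form
'<requirement_id>\t<requirement text>' and emits
'<requirement_id>\t<best capability id>\t<score>'."""
import sys

# Stand-in capability hierarchy; really loaded from the distributed cache.
CAPABILITIES = {"C1": "manage purchase orders",
                "C2": "track shipment status"}

def score(requirement: str, capability: str) -> float:
    """Jaccard token overlap: a toy stand-in for ProcGap's matcher."""
    r, c = set(requirement.lower().split()), set(capability.lower().split())
    return len(r & c) / max(len(r | c), 1)

for line in sys.stdin:
    req_id, _, text = line.rstrip("\n").partition("\t")
    best = max(CAPABILITIES, key=lambda cid: score(text, CAPABILITIES[cid]))
    print(f"{req_id}\t{best}\t{score(text, CAPABILITIES[best]):.2f}")
```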
The Universal Business Language (UBL) is an initiative to develop common business document schemas for interoperability. However, businesses operate in different industry, geopolitical, and regulatory contexts and have different rules and requirements for the information they exchange. Consequently, several trading communities are tailoring UBL schemas to their needs, which requires translation between these schema customizations. In this article, the authors describe how to enhance UBL with semantics-based translation mechanisms to maintain interoperability between documents conforming to different schema versions.