The problem of projecting multidimensional data into lower dimensions has been pursued by many researchers due to its potential application to data analyses of various kinds. This paper presents a novel multidimensional projection technique based on least square approximations. The approximations compute the coordinates of a set of projected points based on the coordinates of a reduced number of control points with defined geometry. We name the technique Least Square Projections (LSP). From an initial projection of the control points, LSP defines the positioning of their neighboring points through a numerical solution that aims at preserving a similarity relationship between the points given by a metric in mD. To perform the projection, only a small number of distance calculations are necessary, and no repositioning of the points is required to obtain a final solution with satisfactory precision. The results show the capability of the technique to form groups of points by degree of similarity in 2D. We illustrate that capability by applying the technique to mapping collections of textual documents from varied sources, a strategic yet difficult application. LSP is faster and more accurate than other existing high-quality methods, particularly in the setting where it was most extensively tested, namely mapping text sets.
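To make the least-squares formulation concrete, here is a minimal sketch of the kind of linear system the abstract describes: each point is constrained to sit near the centroid of its mD neighbors, and extra rows pin the pre-projected control points. This is an illustrative reading, not the authors' implementation; the k-nearest-neighbor graph, the uniform 1/k weights, and the function name are assumptions.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr
from sklearn.neighbors import NearestNeighbors

def lsp_project(X, control_idx, control_2d, k=10):
    """Sketch of a Least Square Projection: place each point at the
    centroid of its k nearest neighbors (found in the mD space),
    anchored by control points already projected to 2D."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)                # idx[i, 0] is i itself

    # Laplacian-like rows: x_i - mean(neighbors of i) = 0
    A = lil_matrix((n + len(control_idx), n))
    b = np.zeros((A.shape[0], 2))
    for i in range(n):
        A[i, i] = 1.0
        for j in idx[i, 1:]:
            A[i, j] = -1.0 / k
    # extra rows pin each control point to its 2D position
    for r, (c, p) in enumerate(zip(control_idx, control_2d)):
        A[len(X) + r, c] = 1.0
        b[len(X) + r] = p

    A = A.tocsr()
    # solve each output coordinate independently, in the least-squares sense
    return np.column_stack([lsqr(A, b[:, d])[0] for d in range(2)])
```

Note that, consistent with the abstract, only the k-neighbor distances per point are computed and the system is solved once, with no iterative repositioning.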
ISBN (print): 9789889867140
Most biologists keep data in separate databases, and these databases are not necessarily well structured. Plant identification keys are one example: they are data-rich descriptions containing plant identification terminology and may be used to identify various plant species. The way the data are kept often requires species identification to be done using rules that are applied sequentially; done manually, this is very time consuming. Information extraction (IE) is the process of selecting information such as names, terms, or phrases from natural language text documents. This information is then structured into a specified template for retrieval. We apply this method to the plant identification keys kept by biologists. Before the keys can be extracted from the descriptions, the data have to go through a number of processes. In this paper, we illustrate the pre-processing and processing methods with an example from a database, with emphasis on the approximate string matching algorithm used to extract the most relevant keys from the descriptions.
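The abstract does not name its approximate string matching algorithm, but the idea can be sketched with a standard similarity-ratio matcher from Python's standard library; the function name, cutoff, and example terms below are illustrative only.

```python
import difflib

def best_key_matches(term, keys, cutoff=0.8):
    """Return identification keys whose wording approximately matches
    an extracted term, ranked by similarity ratio (1.0 = identical)."""
    scored = [(difflib.SequenceMatcher(None, term.lower(), k.lower()).ratio(), k)
              for k in keys]
    return sorted(((s, k) for s, k in scored if s >= cutoff), reverse=True)

# tolerates the misspellings common in hand-entered descriptions:
print(best_key_matches("leaves opposit", ["leaves opposite", "leaves alternate"]))
```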
ISBN (print): 9798350381993; 9798350382006
Text classification is a key task in Natural Language Processing (NLP) that aims at assigning predefined categories to textual documents. Performing text classification requires features that effectively represent the content and the meaning of textual documents. Selecting a suitable method for term weighting is of central importance and can improve the quality of the classification method. In this paper, we propose a new text classification solution that performs Category-based Feature Augmentation (CFA) on the document representation. First, a term-category feature matrix is derived from a modified version of the supervised Term Frequency-Inverse Category Frequency (TF-ICF) weighting model. This is done by embedding the TF-ICF matrix in a one-layer feed-forward neural network. The latter is trained using the gradient descent algorithm, which iteratively updates the term-category matrix until reaching convergence. The model produces category-based feature vector representations that are used to augment the document representations and perform the classification task. Experimental results on four benchmark datasets show that our lean model approach improves text classification accuracy and is significantly more efficient than its deep model alternatives.
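A minimal NumPy sketch of the pipeline the abstract outlines: a TF-ICF-style matrix initializes the term-category weights, a one-layer softmax model refines them by gradient descent, and the learned category scores augment each document vector. The exact TF-ICF variant, loss, and learning rate are assumptions, not the paper's specification.

```python
import numpy as np

def tf_icf(tf, y, n_cat):
    """TF-ICF-style initialization: per-category term frequency scaled
    by the inverse number of categories in which the term appears."""
    W = np.zeros((tf.shape[1], n_cat))
    for c in range(n_cat):
        W[:, c] = tf[y == c].sum(axis=0)
    cf = (W > 0).sum(axis=1)                       # categories per term
    return W * np.log(n_cat / np.maximum(cf, 1))[:, None]

def refine(W, X, y, lr=0.1, epochs=200):
    """One-layer model softmax(X @ W), trained with plain gradient
    descent on cross-entropy; W is the term-category matrix."""
    Y = np.eye(W.shape[1])[y]                      # one-hot targets
    for _ in range(epochs):
        Z = X @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (P - Y) / len(X)           # cross-entropy gradient
    return W

# category-based feature augmentation: append category scores to the
# original term features, then classify the augmented vectors
# X_aug = np.hstack([X, X @ W])
```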
ISBN (print): 9780769547749; 9781467322621
The objective of this work is to show the importance of good line segmentation for obtaining better results in the segmentation of words in historical documents. We have used the approach developed by Manmatha and Rothfeder [1] to segment words in old handwritten documents. In their work, the lines of the documents are extracted using projections. In this work, we have developed an approach to segment lines more effectively. The new line segmentation algorithm handles skewed, touching, and noisy lines, so it significantly improves word segmentation. Experiments using Spanish documents from the Marriages Database of the Barcelona Cathedral show that this approach reduces the error rate by more than 20%.
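The projection-based baseline the paper improves on can be sketched in a few lines: sum the ink in each pixel row and cut at the low-ink valleys between text lines. The binarization threshold and valley cutoff below are illustrative; the paper's contribution is precisely to go beyond this scheme for skewed, touching, and noisy lines.

```python
import numpy as np

def segment_lines(img, valley=0.05):
    """Baseline projection-profile line segmentation for a grayscale
    page image: rows with little ink separate consecutive text lines."""
    ink = (img < 128).sum(axis=1).astype(float)   # dark pixels per row
    on = ink > valley * ink.max()                 # rows that carry text
    lines, start = [], None
    for r, flag in enumerate(on):
        if flag and start is None:
            start = r
        elif not flag and start is not None:
            lines.append((start, r))              # (top, bottom) of one line
            start = None
    if start is not None:
        lines.append((start, len(on)))
    return lines
```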
ISBN (print): 9780769544922
In this paper, we present a distributed way to automatically map users' requirements to reference process models. In a prior paper [9], we presented a tool called Process Model Requirements Gap Analyzer (ProcGap), which combines natural language processing, information retrieval, and semantic reasoning to automatically match and map textual requirements to domain-specific process models. Although the tool proved beneficial to users in reusing prior knowledge by making it easy to use process models, it has one main drawback: it takes a long time to compare a very large requirements document, one with a few thousand requirements, against a process model hierarchy with a few thousand capabilities. In this paper, we present how we solved this problem using Apache Hadoop, which allows ProcGap to distribute the matching task across several machines, increasing the tool's performance and usability. We present a performance comparison between running ProcGap on a single machine and running our distributed version.
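ProcGap's code is not given here, but the distribution pattern the abstract describes fits Hadoop Streaming naturally: each mapper scores the requirements in its input split against the full capability list, which is shipped to every node. The sketch below assumes a tab-separated input of requirement IDs and text, a `capabilities.txt` side file, and a placeholder similarity function standing in for ProcGap's NLP and semantic matching.

```python
#!/usr/bin/env python
# mapper.py -- Hadoop Streaming sketch; run with e.g.
#   hadoop jar hadoop-streaming.jar -files capabilities.txt \
#     -mapper mapper.py -input requirements.tsv -output matches
import sys
from difflib import SequenceMatcher

capabilities = [line.strip() for line in open("capabilities.txt")]

def score(req, cap):
    # placeholder for ProcGap's NLP/IR/semantic matching
    return SequenceMatcher(None, req.lower(), cap.lower()).ratio()

for line in sys.stdin:
    req_id, req_text = line.rstrip("\n").split("\t", 1)
    best = max(capabilities, key=lambda c: score(req_text, c))
    print(f"{req_id}\t{best}\t{score(req_text, best):.3f}")
```

Because each requirement is matched independently, the job parallelizes with no reducer needed, which is what makes the comparison of thousands of requirements against thousands of capabilities tractable.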
Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content would be a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied for text classification, as well as state-of-the-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country-code top-level domain and classification by IP address of the hosting Web server. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best performing classifiers, we obtained the lowest F1-measure for English (94) and the highest for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second, downloading pages in the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers achieve high accuracy, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
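One of the feature families the article evaluates, character n-grams over the URL string fed to a standard text classifier, is easy to reproduce with scikit-learn. The n-gram range, classifier choice, and toy URLs below are illustrative, not the article's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# character 2- to 4-grams over the raw URL; no page download needed
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)

urls = ["http://example.de/nachrichten/wirtschaft",
        "http://example.fr/actualites/economie"]
labels = ["de", "fr"]
clf.fit(urls, labels)                              # toy training set
print(clf.predict(["http://example.fr/nachrichten"]))
```

A real experiment would train on tens of thousands of labeled URLs under cross-validation, as the article does; the point here is only that the classifier never touches the page content.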
Text classification is a central task in Natural Language Processing (NLP) that aims at categorizing text documents into predefined classes or categories. It requires appropriate features to describe the contents and meaning of text documents and to map them to their target categories. Existing text feature representations rely on a weighted representation of the document terms. Hence, choosing a suitable method for term weighting is of major importance and can help increase the effectiveness of the classification task. In this study, we provide a novel text classification framework for Category-based Feature Engineering titled CFE. It consists of a supervised weighting scheme, defined based on a variant of the TF-ICF (Term Frequency-Inverse Category Frequency) model, embedded into three new lean classification approaches: (i) IterativeAdditive (flat), (ii) GradientDescentANN (1-layered), and (iii) FeedForwardANN (2-layered). The IterativeAdditive approach augments each document representation with a set of synthetic features inferred from TF-ICF category representations. It builds a term-category TF-ICF matrix using an iterative and additive algorithm that produces category vector representations and updates them until reaching convergence. GradientDescentANN replaces this iterative additive process by computing the term-category matrix with a gradient descent ANN model; training the ANN with gradient descent updates the term-category matrix until convergence. FeedForwardANN uses a feed-forward ANN model to transform document representations into the category vector space. The transformed document vectors are then compared with the target category vectors and associated with the most similar categories. We have implemented CFE, including its three classification approaches, and conducted a large battery of tests to evaluate their performance. Experimental results on five benchmark datasets show that our lean classification approaches improve text classification accuracy while being significantly more efficient than their deep model alternatives.
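The abstract does not spell out the IterativeAdditive update rule, so the following is only one plausible reading: start from TF-ICF category vectors, repeatedly fold the documents currently closest to each category into its vector, and stop when the assignments stabilize. Every detail here (the cosine-style similarity, the additive mean update, the row normalization) is an assumption.

```python
import numpy as np

def iterative_additive(X, W, iters=50):
    """Sketch of an iterative-additive refinement of category vectors.
    X: (n_docs, n_terms) document matrix; W: (n_terms, n_cat) TF-ICF
    matrix whose columns seed the category representations."""
    C = W.T.copy()                                    # (n_cat, n_terms)
    prev = None
    for _ in range(iters):
        assign = (X @ C.T).argmax(axis=1)             # nearest category
        if prev is not None and (assign == prev).all():
            break                                     # assignments converged
        for c in range(C.shape[0]):
            docs = X[assign == c]
            if len(docs):
                C[c] += docs.mean(axis=0)             # additive update
        C /= np.linalg.norm(C, axis=1, keepdims=True) # keep scales comparable
        prev = assign
    return C
```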
Currently, social media plays an important role in daily life. Millions of people use social media for different purposes, and large amounts of data flow through online networks every second; these data contain valuable information that can be extracted if they are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach to extract information from Arabic social media text. It provides an integrated solution to the challenges of preprocessing Arabic social media text in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study reveals that the proposed approach yields a useful, full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable information and processed results such as topic classification and sentiment analysis.
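The cleaning stage of such a pipeline typically combines steps like the ones below: stripping URLs, mentions, and hashtags, removing diacritics, normalizing common Arabic letter variants, and collapsing elongated characters. These rules are standard practice, offered as a sketch rather than the authors' exact rule set.

```python
import re

def clean_arabic(text):
    """Illustrative cleaning stage for Arabic social media text."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)        # URLs, @user, #tag
    text = re.sub(r"[\u064B-\u0652\u0670]", "", text)        # diacritics (tashkeel)
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> ا
    text = re.sub(r"\u0649", "\u064A", text)                 # alef maqsura -> ya
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # squeeze repeats
    return re.sub(r"\s+", " ", text).strip()
```

Storing the cleaned output alongside the raw text, as the paper's structured levels suggest, keeps the original available for steps (such as sentiment cues in elongation) that the cleaning would otherwise discard.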