Search results - Inner Mongolia University Library

Using Combined List Hierarchy and Headings of HTML Documents for Learning Domain-Specific Ontology

INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, vol. 11, no. 4, pp. 233-239

Authors: Raza, Muhammad Ahsan; Raza, Binish; Jabeen, Taiba; Raza, Sehrish; Abbas, Munnawar. Affiliations: Bahauddin Zakariya Univ, Dept Informat Technol, Multan, Pakistan; Univ Malaya, Fac Comp Sci & Informat Technol, Kuala Lumpur, Malaysia; Allama Iqbal Open Univ, Fac Educ, Multan, Pakistan; Women Univ, Inst Comp Sci & Informat Technol, Multan, Pakistan; Inst Southern Punjab, Dept Comp Sci, Multan, Pakistan

HTML pages contain unstructured and diverse information. However, these documents lack semantics and are not machine understandable. Semantic webs aim to add formal semantics to web data, whereas ontology provides formal semantics to a domain and is thus considered a foundation of semantic webs. Domain ontologies can be constructed manually, but this process is tedious and inefficient. Thus, this study presents an ontology learning (OL) model to create domain ontologies automatically from a set of HTML pages. The key insight of this research is that it combines the list structure and headings of HTML pages to recognize the ontology vocabulary. The approach also incorporates synonym relationships with ontology and allows the semantic interpretation of ontology concepts. We implement the proposed OL approach to build a sports ontology from a collection of sports domain HTML documents. The new sports ontology is tested using the FaCT++ reasoner; results show no inconsistency in the ontology. Furthermore, experts evaluate the successful mapping of HTML lists and headings to the ontology vocabulary. The proposed OL approach performs effectively and achieves 92.7% and 95.4% precision values for list and heading mapping, respectively.

Keywords: Ontology learning; semantic web; sports ontology; HTML documents; knowledge extraction; ontology engineering
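
The abstract above describes combining list structure and headings to recognize ontology vocabulary. As a rough illustration only, and not the authors' OL model, the sketch below walks an HTML fragment with Python's standard `html.parser` and emits candidate (parent term, child term) pairs: a top-level list item is attached to the most recent heading, and a nested item to its enclosing item. The class name `OutlineParser`, the function `extract_pairs`, and the pairing rule are assumptions made for this example; it also assumes every `<li>` is explicitly closed.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class OutlineParser(HTMLParser):
    """Collect (parent, child) term pairs from headings and nested lists."""

    def __init__(self):
        super().__init__()
        self.pairs = []        # candidate (parent term, child term) pairs
        self.heading = None    # text of the most recent heading
        self.stack = []        # terms of the currently open <li> elements
        self.open_kind = None  # "heading", "li", or None when not collecting
        self.buffer = []       # text fragments of the term being collected

    def _close_text(self):
        """Finish the term being collected and record where it belongs."""
        if self.open_kind is None:
            return
        text = " ".join("".join(self.buffer).split())
        if self.open_kind == "heading":
            self.heading = text
        else:
            # A list item hangs under its enclosing item, else under the heading.
            parent = self.stack[-1] if self.stack else self.heading
            if text:
                self.pairs.append((parent, text))
            self.stack.append(text)
        self.open_kind = None

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._close_text()
            self.open_kind, self.buffer = "heading", []
        elif tag == "li":
            self._close_text()
            self.open_kind, self.buffer = "li", []
        elif tag in ("ul", "ol"):
            # A nested list ends the parent item's own text.
            self._close_text()

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._close_text()
        elif tag == "li":
            self._close_text()
            if self.stack:
                self.stack.pop()

    def handle_data(self, data):
        if self.open_kind is not None:
            self.buffer.append(data)


def extract_pairs(html):
    """Return candidate (parent, child) pairs found in an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

For a fragment such as `<h2>Sports</h2><ul><li>Ball games<ul><li>Football</li></ul></li></ul>`, `extract_pairs` pairs "Ball games" with "Sports" and "Football" with "Ball games". Turning such pairs into ontology classes, and adding the synonym relationships the abstract mentions, is outside this sketch.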

On extracting data from tables that are encoded using HTML

KNOWLEDGE-BASED SYSTEMS, 2020, vol. 190, pp. 105157-105157

Authors: Roldan, Juan C.; Jimenez, Patricia; Corchuelo, Rafael. Affiliation: Univ Seville, ETSI Informat, Avda Reina Mercedes S-N, E-41012 Seville, Spain

Tables are a common means to display data in human-friendly formats. Many authors have worked on proposals to extract those data back since this has many interesting applications. In this article, we summarise and compare many of the proposals to extract data from tables that are encoded using HTML and have been published between 2000 and 2018. We first present a vocabulary that homogenises the terminology used in this field; next, we use it to summarise the proposals; finally, we compare them side by side. Our analysis highlights several challenges to which no proposal provides a conclusive solution and a few more that have not been addressed sufficiently; simply put, no proposal provides a complete solution to the problem, which seems to suggest that this research field will remain active in the near future. We have also realised that there is no consensus regarding the datasets and the methods used to evaluate the proposals, which hampers comparing the experimental results. (C) 2019 Elsevier B.V. All rights reserved.

Keywords: HTML documents; Web tables; Table mining; Data extraction

VB-PTC: Visual Block Multi-Record Text Extraction Based on Sensor Network Page Type Conversion

IEEE ACCESS, 2020, vol. 8, pp. 167900-167913

Authors: Gong, Jibing; Zhang, Hekai; Du, Weixia; Li, Huanhuan; Wen, Hongnian. Affiliations: Yanshan Univ, Sch Informat Sci & Engn, Qinhuangdao 066004, Hebei, Peoples R China; Yanshan Univ, Key Lab Comp Virtual Technol & Syst Integrat Hebe, Qinhuangdao 066004, Hebei, Peoples R China; Yanshan Univ, Key Lab Software Engn Hebei Prov, Qinhuangdao 066004, Hebei, Peoples R China; Shijiazhuang Inst Railway Technol, Sch Informat Sci & Engn, Shijiazhuang 050041, Peoples R China

Usually, in addition to the main content, web pages contain additional information in the form of noise, such as navigation elements, sidebars and advertisements. This kind of noise has nothing to do with the main content, and it affects data mining and information retrieval tasks, so that the sensor will be damaged by wrong data and interference noise. Because of the diversity of web page structure, it is a challenge to detect relevant information and noise in order to improve the true reliability of sensor networks. In this paper, we propose a visual block construction method based on page type conversion (VB-PTC). This method uses a combination of site-level noise reduction based on hashtree and page-level noise reduction based on linked clusters to eliminate noise in web articles, and it successfully converts multi-record complex pages to multi-record simple pages, effectively simplifying the rules of visual block construction. For multi-record content extraction, we use different extraction methods according to the characteristics of different fields, combining regular expressions, natural language processing and symbol density detection, which greatly improves the accuracy of multi-record content extraction. VB-PTC can be effectively used for information retrieval, content extraction and page rendering tasks.

Keywords: Web pages; Feature extraction; Visualization; Data mining; Noise reduction; Navigation; Data collection; DOM trees; HTML documents; noise elimination; web data extraction; web mining

Dec :: Tech Reports :: Nsl-Tn-12

2016

[Auto Generated]
1. Scribe vs. HTML
2. Making Scribe produce HTML
3. Making A Structured HTML Document
3.1. Coexistence With Other Device Types
3.2. Convert @Section to @MakeSection
3.3. Convert @Chapter to @MakeChapter
3.4. Add @MakeDocument at the Beginning
3.5. Include @generate Commands for Each Chapter
3.6. Forced Line Breaks -- Convert @* to @br
3.7. Add an @htmlfinish Command at the End
4. Imbedded Graphics
5. The First Chapter Contents
6. Labels, Cross References,

Keywords: chapter; chapter contents; command; document; documents; file; HTML; HTML documents; HTML file; HTML library; hypertext; network systems; nsl-tn-12; producing; producing HTML; reports; scribe; scribe document; structured HTML; systems laboratory; tag; wide web

Supporting Early Contextualization of Textual Content in Digital Documents on the Web

13th IAPR International Conference on Document Analysis and Recognition (ICDAR)

Authors: Eldesouky, Bahaa; Bakry, Menna; Maus, Heiko; Dengel, Andreas. Affiliations: German Res Ctr Artificial Intelligence DFKI, Knowledge Management Dept, Kaiserslautern, Germany; German Univ Cairo, New Cairo, Egypt

ISBN: (print) 9781479918058

The World Wide Web is arguably the most important source of digital documents nowadays. These documents mainly consist of unstructured and semi-structured data comprising a wealth of information at the disposal of the DAR (Document Analysis and Recognition) community. Contextualization plays an important role in understanding the content of those documents. In this paper, we present an approach to early contextualization of textual data in HTML documents. It combines automatic as well as semi-automatic annotation of named entities with user interaction to support contextualization of the content of digital documents as early as in the authoring stage of their life cycle. We also present the results of an online experimental evaluation involving 120 human test subjects. They show that our approach successfully managed to produce semantically annotated versions of unstructured textual content, which contain reliable contextual information, thus facilitating the task of later document analysis stages.

Keywords: Internet; hypermedia markup languages; natural language processing; DAR; HTML documents; World Wide Web; digital documents; document analysis and recognition; reliable contextual information; semiautomatic annotation; semistructured data; textual content; textual data; Blogs; Electronic publishing; Information services; Reliability; Semantics; Text analysis; Internets; text processing; Content analysis

The Effect of Hybrid Crossover Technique on Enhancing Recall and Precision in Information Retrieval

World Congress on Engineering (WCE 2013)

Authors: Al-Dallal, Ammar. Affiliation: Ahlia Univ, Dept Comp Engn, Manama, Bahrain

ISBN: (print) 9789881925299

Several techniques have been proposed to retrieve the HTML documents most relevant to a user query. Among these techniques is the genetic algorithm, which iteratively creates several generations using selection, crossover and mutation before producing the final result. In this paper, a new hybrid crossover technique is proposed to enhance the quality of the retrieved results. This technique is applied to HTML documents and evaluated using recall, precision and recall-precision measures. Its performance is compared to three well-known crossover techniques. The results show high improvement in the quality of the retrieved documents in terms of these measures.

Keywords: genetic algorithm; HTML documents; hybrid crossover; information retrieval
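
The abstract above evaluates retrieval with recall and precision and relies on a crossover operator. The sketch below shows the standard set-based definitions of the two measures together with a plain one-point crossover. It is not the hybrid crossover proposed in the paper, whose details the abstract does not give; the function names and the list-based chromosome encoding are assumptions for illustration.

```python
import random


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length of at least 2")
    cut = rng.randrange(1, len(parent_a))  # cut strictly inside the chromosome
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b
```

For example, retrieving four documents of which two are among three relevant ones gives a precision of 2/4 and a recall of 2/3. Passing a seeded `random.Random` instance as `rng` makes the crossover reproducible.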

Links and copyright law

COMPUTER LAW & SECURITY REVIEW, 2011, vol. 27, no. 3, pp. 258-266

Authors: Honkasalo, Pessi. Affiliation: Univ Surrey, Sch Law, Guildford GU2 7XH, Surrey, England

For at least 15 years, there have been question marks over the legal permissibility of connecting one web resource to another by means of links. The purpose of this paper is to assess where we stand in terms of the legal state on the threshold of the new decade. The substantive argument in this paper is that, fundamentally, there are only two sorts of links. 'Normal' links facilitate access to subject matter that has been made available to the public and are visible to users as 'activatable' references. 'Embedding' links, by contrast, automatically incorporate online material and cause it to become a part of the embedding document. On the grounds of the cumulative judicial custom in the member states of the European Union, this paper proposes that normal links as such should invariably be deemed not to create a state of interference with copyright law. Embedding links, however, may constitute an infringement of the exclusive right of alteration, communication or reproduction enjoyed by the copyright holder, depending on the facts and circumstances. (C) 2011 Pessi Honkasalo. Published by Elsevier Ltd. All rights reserved.

Keywords: Hypertext links; Copyright; HTML documents; Intellectual property; Shetland Times case

Web Document Text and Images Extraction using DOM Analysis and Natural Language Processing

9th ACM Symposium on Document Engineering

Authors: Joshi, Parag Mulendra; Liu, Sam. Affiliation: Hewlett Packard Labs, Palo Alto, CA 94304, USA

ISBN: (print) 9781605585758

The Web has emerged as the most important source of information in the world. This has resulted in a need for automated software components to analyze web pages and harvest useful information from them. However, in typical web pages the informative content is surrounded by a very high degree of noise in the form of advertisements, navigation bars, links to other content, etc. Often the noisy content is interspersed with the main content, leaving no clean boundaries between them. This noisy content makes the problem of information harvesting from web pages much harder. Therefore, it is essential to be able to identify the main content of a web page and automatically isolate it from noisy content for any further analysis. Most existing approaches rely on prior knowledge of website-specific templates and hand-crafted rules specific to websites for extraction of relevant content. We propose a generic approach that does not require prior knowledge of website templates. While HTML DOM analysis and visual layout analysis approaches have sometimes been used, we believe that for higher accuracy in content extraction, the analyzing software needs to mimic a human user and understand content in natural language, similar to the way humans intuitively do, in order to eliminate noisy content. In this paper, we describe a combination of HTML DOM analysis and Natural Language Processing (NLP) techniques for automated extraction of the main article with associated images from web pages.

Keywords: Web page text extraction; Image extraction; Natural language processing; HTML documents; DOM trees

Automating content extraction of HTML documents

WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2005, vol. 8, no. 2, pp. 179-224

Authors: Gupta, S; Kaiser, GE; Grimm, P; Chiang, MF; Starren, J. Affiliations: Columbia Univ, Dept Comp Sci, New York, NY 10027, USA; Columbia Univ, Dept Elect Engn, New York, NY 10027, USA; Columbia Univ, Dept Ophthalmol, New York, NY 10032, USA; Columbia Univ, Dept Biomed Informat, New York, NY 10032, USA

Web pages often contain clutter (such as unnecessary images and extraneous links) around the body of an article that distracts a user from actual content. Extraction of "useful and relevant" content from web pages has many applications, including cell phone and PDA browsing, speech rendering for the visually impaired, and text summarization. Most approaches to making content more readable involve changing font size or removing HTML and data components such as images, which takes away from a webpage's inherent look and feel. Unlike "Content Reformatting," which aims to reproduce the entire webpage in a more convenient form, our solution directly addresses "Content Extraction." We have developed a framework that employs an easily extensible set of techniques. It incorporates advantages of previous work on content extraction. Our key insight is to work with DOM trees, a W3C specified interface that allows programs to dynamically access document structure, rather than with raw HTML markup. We have implemented our approach in a publicly available Web proxy to extract content from HTML web pages. This proxy can be used both centrally, administered for groups of users, as well as by individuals for personal browsers. We have also, after receiving feedback from users about the proxy, created a revised version with improved performance and accessibility in mind.

Keywords: DOM trees; content extraction; reformatting; HTML documents; accessibility; speech rendering; text summarization
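
The abstract above argues for working on a tree of the document rather than on raw markup. The sketch below is a loose illustration of that idea using only Python's standard library: it builds a small element tree, drops `script` and `style` elements, ignores void elements such as images, and removes block elements whose text is mostly link text. The link-ratio rule, the 0.5 threshold, and all names here are assumptions for this example; they are not taken from the paper's framework or its proxy.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}
BLOCK = {"div", "ul", "ol", "nav", "aside", "table", "p", "section"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Build a minimal element tree; children are Node objects or strings."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return  # images and other void elements carry no text
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy markup.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def text_of(node):
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node.children)


def link_chars(node):
    """Count characters of text that sit inside <a> elements."""
    if isinstance(node, str):
        return 0
    if node.tag == "a":
        return len(text_of(node).strip())
    return sum(link_chars(child) for child in node.children)


def prune(node, max_link_ratio=0.5):
    """Remove scripts, styles, and link-heavy blocks below node, in place."""
    kept = []
    for child in node.children:
        if not isinstance(child, str):
            if child.tag in SKIP:
                continue
            total = len(text_of(child).strip())
            if child.tag in BLOCK and total and link_chars(child) / total > max_link_ratio:
                continue
            prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def extract_content(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return " ".join(text_of(builder.root).split())
```

Calling `extract_content` on a page whose first `div` holds only navigation links and whose second holds an article paragraph keeps the paragraph and drops the navigation block. A production filter would need many more rules than this single heuristic.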

USING COOLLISTS TO INDEX HTML DOCUMENTS IN THE WEB

COMPUTER NETWORKS AND ISDN SYSTEMS, 1995, vol. 28, no. 1-2, pp. 147-154

Authors: LIM, JG. Affiliation: E-CIM Center, Corporate Technical Operations, Samsung Electronics, Suwon, Korea

This paper suggests a partial solution (limited to HTML documents) to the Web-indexing problem using Coollists. Roughly, a Coollist is equivalent to a Hotlist in Mosaic except that it automatically records all the visited HTML document titles by default. Thus, in theory, by maintaining a merged list of everybody's Coollists, a complete index of all the HTML files in the Web should be created eventually. In practice, even if transferring everybody's Coollists to a single site were feasible, the growth and change rate of the Web raise the question of whether the archie metaphor of "every index server maintains all the know-wheres" could be applied to the rest of the Web. The new metaphor we are suggesting is a library metaphor. Let each organization maintain the merged Coollists of its individuals. If some organization has surplus computing resources, let it maintain the merged list of other merged lists. This way, individuals are likely to find documents of interest within their own organization. But organizations have characteristics, just as libraries have specialities. Therefore, individuals will find other interesting documents at their "neighboring" sites. Bigger libraries carry more books. Likewise, there will be sites that merge many merged lists together, which will be useful for blind keyword searching of the titles. For our current implementation of a Coollist, we take advantage of the CERN proxy-cache server to collect the indices of all the visited HTML files. People at three of the 19 plants within the company tried the merged list of Coollists and found it almost indispensable. People who used to save almost every URL they visited and those who wanted a comprehensive list of URLs found it particularly useful. In the paper, we describe the results of our experiment in detail and also point out how our approach might solve the scalability problem of other Web indexing solutions.

Keywords: SEARCHING; HTML DOCUMENTS; INFORMATION CLUSTERING; INFORMATION RETRIEVAL; DIGITAL LIBRARY
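
The library metaphor in the abstract above, where individual Coollists are merged per organization and merged lists are merged again at larger sites, can be sketched as follows. This is an illustrative data-structure sketch only; the paper's implementation collects entries through a CERN proxy-cache server, which is not modelled here. Representing a Coollist as a dictionary from URL to title, and the function names, are assumptions.

```python
def merge_coollists(*coollists):
    """Merge URL-to-title mappings; the first title seen for a URL is kept."""
    merged = {}
    for coollist in coollists:
        for url, title in coollist.items():
            merged.setdefault(url, title)
    return merged


def search_titles(index, keyword):
    """Return the URLs whose recorded title contains the keyword."""
    needle = keyword.lower()
    return sorted(url for url, title in index.items() if needle in title.lower())


# Hypothetical Coollists of two individuals in one organization.
alice = {"http://example.org/a": "HTML Primer",
         "http://example.org/b": "Mosaic Notes"}
bob = {"http://example.org/b": "Mosaic Notes",
       "http://example.org/c": "Indexing HTML files"}

# Organization-level merge, then a merge of merged lists at a larger site.
organization = merge_coollists(alice, bob)
other_organization = {"http://example.org/d": "Proxy caching"}
larger_site = merge_coollists(organization, other_organization)
```

Because a merged list has the same shape as an individual Coollist, the same function serves every level of the hierarchy, and `search_titles` gives the blind keyword search over titles that the abstract mentions.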
