Web query interfaces (WQIs) play a very important role in retrieving Deep web content. WQIs allow users to query domain-specific databases to obtain information of interest from diverse domains such as car rentals, hotels, and airfare. As the number of WQIs on the web is increasing drastically, some research efforts focus on building a single (unified) WQI that allows users to query and integrate information available in different web databases related to a specific domain. A very important task in this WQI integration process is the extraction, modeling and understanding of the semantic content of WQIs. However, this task is challenging because of the great heterogeneity in the design of WQIs. This paper presents a novel tree-based approach for the modeling and understanding of WQIs. A tree schema called the Visual Reduced Tree (VR-Tree) is built from the tree produced by a web browser's render engine by applying a set of well-defined functions, guided by a set of heuristic rules, to identify the WQI's main components and their relationships. The proposed strategy was evaluated by running a collection of experiments over the Tel-8 and ICQ datasets from the UIUC repository. The results show that the automatic modeling of WQIs is possible with a high degree of precision compared with previous approaches, and that the VR-Tree schema proposed in this work simplifies the modeling task by considering only the visual and spatial properties of WQI components.
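The abstract above does not list the functions or heuristic rules used to build the VR-Tree, so the following is only a minimal sketch of the general idea of reducing a render tree to its visible form components. The `Node` class, the `FIELD_TAGS` set, and the two pruning rules (drop invisible subtrees, collapse containers with a single relevant child) are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical set of tags treated as WQI components (an assumption, not from the paper).
FIELD_TAGS = {"input", "select", "textarea", "label", "button"}


@dataclass
class Node:
    """A simplified render-tree node: a tag, a visibility flag, and children."""
    tag: str
    visible: bool = True
    children: List["Node"] = field(default_factory=list)


def reduce_tree(node: Node) -> Optional[Node]:
    """Return a reduced copy of the subtree, or None if nothing relevant remains."""
    if not node.visible:
        return None  # invisible subtrees carry no visual information
    kept = [r for r in (reduce_tree(c) for c in node.children) if r is not None]
    if node.tag in FIELD_TAGS:
        return Node(node.tag, True, kept)
    if not kept:
        return None  # container without relevant descendants
    if len(kept) == 1:
        return kept[0]  # collapse a container that wraps a single relevant subtree
    return Node(node.tag, True, kept)
```

For example, a `form` whose only relevant content is a `div` holding a `label` and an `input` reduces to that `div` with its two children, while hidden inputs and empty wrappers disappear.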
The amount of information contained in databases available on the web has grown explosively in recent years. This information, known as the Deep web, is heterogeneous and dynamically generated by querying these back-end (relational) databases through web query interfaces (WQIs), which are a special type of HTML form. Accessing Deep web information is a great challenge because this information is usually not indexed by general-purpose search engines. Therefore, it is necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep web. Since WQIs are the only means of accessing the Deep web, the automatic identification of WQIs plays an important role. It helps traditional search engines increase their coverage and access interesting information not available on the indexable web. The accurate identification of Deep web data sources is a key issue in the information retrieval process. In this paper we propose a new strategy for the automatic discovery of WQIs. This novel proposal makes an adequate selection of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The proposed strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of web query interfaces was analyzed with the objective of identifying only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, containing advanced and simple query interfaces, that we created using a generic crawler supported by human experts. The experimental results show that the proposed strategy outperforms previously reported works.
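To make the idea of selecting HTML elements from a form more concrete, here is a minimal sketch that counts form elements with Python's standard `html.parser` module. The feature names and the rule in `looks_searchable` are hypothetical placeholders: the abstract does not state which elements or rules the authors use, and the paper relies on trained classifiers rather than a single hand-written rule.

```python
from collections import Counter
from html.parser import HTMLParser


class FormFeatureCounter(HTMLParser):
    """Counts form-related elements found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            # An <input> without a type attribute defaults to a text field.
            input_type = (dict(attrs).get("type") or "text").lower()
            self.counts["input_" + input_type] += 1
        elif tag in ("select", "textarea", "button"):
            self.counts[tag] += 1


def extract_features(form_html):
    """Return a dict mapping feature names to their counts."""
    parser = FormFeatureCounter()
    parser.feed(form_html)
    parser.close()
    return dict(parser.counts)


def looks_searchable(features):
    """Illustrative rule only (an assumption): forms with a password field
    are treated as login forms; otherwise one text box or select suffices."""
    if features.get("input_password", 0) > 0:
        return False
    return features.get("input_text", 0) + features.get("select", 0) >= 1
```

In a learning setting, the dictionaries returned by `extract_features` would serve as feature vectors for a classifier, with the hand-written rule replaced by the trained model.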
With the constant increase in the volume of information available on the web, it is more difficult to find specific information related to a given domain. Users face the problem of information overload: a query about a specialized subject (local information; e-commerce: hotels, airlines, car rental; science: biology, mathematics, medicine; etc.) on a web search engine returns many web pages or results that in most cases are outside the domain of interest. This is one reason why vertical search tools have become a necessity for users who seek domain-specific information from different databases available on the web through input sources called web query interfaces (ICWs). This paper describes an approach for the automatic integration of ICWs, a crucial task in constructing vertical search tools. The proposed methodology is validated by implementing a vertical search prototype called VSearch that allows users to transparently query multiple web databases in a specific domain through a unified ICW. The proposed approach for automatic ICW integration is based on: i) a hierarchical model called AEV for modeling the visual content of ICWs; ii) semantic clustering for the identification of relationships between fields in ICWs; and iii) a field homogenization and unification process over AEV schemes for the construction of a unified ICW. The VSearch prototype was implemented and evaluated using a case study. The experimental results demonstrate high precision in the integration phase and an effective methodology for creating a functional vertical search tool for a given domain.
ISBN: (print) 9783642253294
The amount of information contained in databases on the web has grown explosively in recent years. This information, known as the Deep web, is dynamically obtained from specific queries to these databases through web query interfaces (WQIs). Finding and accessing databases on the web is a great challenge because web sites are very dynamic and the existing information is heterogeneous. Therefore, it is necessary to create efficient mechanisms to access, extract and integrate the information contained in databases on the web. Since WQIs are the only means of accessing databases on the web, the automatic identification of WQIs plays an important role in helping traditional search engines increase their coverage and access interesting information not available on the indexable web. In this paper we present a strategy for the automatic identification of WQIs that uses supervised learning and makes an adequate selection and extraction of HTML elements in the WQIs to form the training set. We present two experimental tests over a corpus of HTML forms that includes positive and negative examples. Our proposed strategy achieves better accuracy than previous works reported in the literature.
The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available on the web come from databases or other data collections and are presented to users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigation to pages of results combining cross-linked data from many sources. For the analysis of visitor navigation behaviour in such web sites, we propose the web usage miner (WUM), which discovers navigation patterns subject to advanced statistical and structural constraints. Since our objective is the discovery of interesting navigation patterns, we do not focus on accesses to individual pages. Instead, we construct conceptual hierarchies that reflect the query capabilities used in the production of these pages. Our experiments with a real web site that integrates data from multiple databases, the German Schulweb, demonstrate the appropriateness of WUM in discovering navigation patterns and show how those discoveries can help in assessing and improving the quality of the site.
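The step of abstracting from individual dynamic pages to the query capabilities behind them can be illustrated with a small sketch. This is not WUM's mining algorithm, which the abstract describes only as constraint-based pattern discovery. The mapping in `to_concept` (path plus the names of the query parameters used) and the simple transition count are assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def to_concept(url):
    """Map a page URL to a coarse concept: its path plus the sorted names
    of the query parameters used, ignoring the parameter values."""
    parsed = urlparse(url)
    params = sorted(parse_qs(parsed.query).keys())
    return parsed.path + ("?" + "&".join(params) if params else "")


def count_transitions(sessions):
    """Count consecutive concept-to-concept moves over all visitor sessions."""
    counts = Counter()
    for session in sessions:
        concepts = [to_concept(url) for url in session]
        counts.update(zip(concepts, concepts[1:]))
    return counts
```

Under this mapping, two visitors who search by city and type with different values and then open different result pages contribute to the same transition, which is the kind of generalization a page-level analysis would miss.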