Search Results - Inner Mongolia University Library

Relevance-Based Retrieval on Hidden-Web Text Databases without Ranking Support

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, Vol. 23, No. 10, pp. 1555-1568

Authors: Hristidis, Vagelis; Hu, Yuheng; Ipeirotis, Panagiotis G. Affiliations: Florida Int Univ Sch Comp & Informat Sci Miami FL 33199 USA; Arizona State Univ Sch Comp Informat & Decis Syst Engn Tempe AZ 85281 USA; NYU Dept Informat Operat & Management Sci New York NY 10012 USA

Many online or local data sources provide powerful querying mechanisms but limited ranking capabilities. For instance, PubMed allows users to submit highly expressive Boolean keyword queries, but ranks the query results by date only. However, a user would typically prefer a ranking by relevance, measured by an information retrieval (IR) ranking function. A naive approach would be to submit a disjunctive query with all query keywords, retrieve all the returned matching documents, and then rerank them. Unfortunately, such an operation would be very expensive due to the large number of results returned by disjunctive queries. In this paper, we present algorithms that return the top results for a query, ranked according to an IR-style ranking function, while operating on top of a source with a Boolean query interface with no ranking capabilities (or a ranking capability of no interest to the end user). The algorithms generate a series of conjunctive queries that return only documents that are candidates for being highly ranked according to a relevance metric. Our approach can also be applied to other settings where the ranking is monotonic on a set of factors (query keywords in IR) and the source query interface is a Boolean expression of these factors. Our comprehensive experimental evaluation on the PubMed database and a TREC data set shows that we achieve an order-of-magnitude improvement compared to the current baseline approaches.

Keywords: hidden-web databases; keyword search; top-k ranking
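
The contrast drawn in the abstract above, between the naive disjunctive baseline and a strategy built from conjunctive queries, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy corpus `DOCS`, the `boolean_search` stand-in for a Boolean-only source, the tf-idf-style `score` function, and the stopping rule in `conjunctive_first_top_k`. The last of these is a simplified heuristic that is not guaranteed to return the exact top-k in general, and it is not the algorithm from the paper.

```python
import math
from itertools import combinations

# Hypothetical toy corpus standing in for a Boolean-only source.
DOCS = {
    1: "gene therapy cancer trial",
    2: "cancer screening guidelines",
    3: "gene expression in yeast",
    4: "gene therapy for cancer patients cancer",
    5: "weather report",
}


def boolean_search(all_of):
    """Conjunctive query: ids of documents containing every term in all_of."""
    return {d for d, text in DOCS.items()
            if all(t in text.split() for t in all_of)}


def score(doc_id, keywords):
    """Assumed relevance score: sum of tf * idf over the query keywords."""
    words = DOCS[doc_id].split()
    total = 0.0
    for kw in keywords:
        df = sum(1 for text in DOCS.values() if kw in text.split())
        if df:
            total += words.count(kw) * math.log(1 + len(DOCS) / df)
    return total


def rank(doc_ids, keywords, k):
    """Sort by descending score, breaking ties by document id."""
    return sorted(doc_ids, key=lambda d: (-score(d, keywords), d))[:k]


def naive_top_k(keywords, k):
    """Baseline: disjunctive query, retrieve every match, then rerank."""
    candidates = set()
    for kw in keywords:
        candidates |= boolean_search([kw])
    return rank(candidates, keywords, k), len(candidates)


def conjunctive_first_top_k(keywords, k):
    """Heuristic: query keyword subsets from largest to smallest and stop
    as soon as at least k documents have been retrieved."""
    seen = set()
    for size in range(len(keywords), 0, -1):
        for subset in combinations(keywords, size):
            seen |= boolean_search(subset)
        if len(seen) >= k:
            break
    return rank(seen, keywords, k), len(seen)


query = ["gene", "therapy", "cancer"]
print(naive_top_k(query, 2))
print(conjunctive_first_top_k(query, 2))
```

On this toy corpus both functions return documents 4 and 1, but the baseline retrieves four documents while the conjunctive-first heuristic retrieves two. The second element of each returned tuple is the number of documents fetched from the source, which is the cost the abstract is concerned with.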


Automatic discovery of web Query Interfaces using machine learning techniques

JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2013, Vol. 40, No. 1, pp. 85-108

Authors: Marin-Castro, Heidy M.; Sosa-Sosa, Victor J.; Martinez-Trinidad, Jose F.; Lopez-Arevalo, Ivan. Affiliations: Natl Polytech Inst Ctr Res & Adv Studies Informat Technol Lab Victoria City Tamaulipas Mexico; Natl Inst Astrophys Opt & Elect Tonantzintla Puebla San Andres Chol Mexico

The amount of information contained in databases available on the web has grown explosively in recent years. This information, known as the Deep web, is heterogeneous and dynamically generated by querying these back-end (relational) databases through web Query Interfaces (WQIs), which are a special type of HTML form. Accessing Deep web information is a great challenge because this information is usually not indexed by general-purpose search engines. Therefore, it is necessary to create efficient mechanisms to access, extract and integrate the information contained in the Deep web. Since WQIs are the only means of accessing the Deep web, the automatic identification of WQIs plays an important role. It helps traditional search engines increase their coverage of, and access to, interesting information not available on the indexable web. The accurate identification of Deep web data sources is a key issue in the information retrieval process. In this paper, we propose a new strategy for the automatic discovery of WQIs. This novel proposal makes an adequate selection of HTML elements extracted from HTML forms, which are used in a set of heuristic rules that help to identify WQIs. The proposed strategy uses machine learning algorithms to classify HTML forms as searchable (WQI) or non-searchable (non-WQI), together with a prototype selection algorithm that removes irrelevant or redundant data from the training set. The internal content of web Query Interfaces was analyzed with the objective of identifying only those HTML elements that appear frequently and provide relevant information for WQI identification. For testing, we use three groups of datasets: two available at the UIUC repository and a new dataset, created using a generic crawler supported by human experts, that includes advanced and simple query interfaces. The experimental results show that the proposed strategy outperforms other previously reported works.

Keywords: Deep web; hidden-web databases; web Query Interfaces; supervised classification
