Many databases have become web-accessible through form-based search interfaces (i.e., HTML forms) that allow users to specify complex and precise queries against the underlying databases. In general, such a web search interface can be considered as containing an interface schema with multiple attributes and rich semantic/meta-information; however, the schema is not formally defined in HTML. Many web applications, such as web database integration and deep web crawling, require the construction of such schemas. In this paper, we first propose a schema model for representing complex search interfaces, and then present a layout-expression-based approach to automatically extract the logical attributes from search interfaces. We also recast the identification of different types of semantic information as a classification problem, and design several Bayesian classifiers to help derive semantic information from the extracted attributes. A system, WISE-iExtractor, has been implemented to automatically construct the schema of any web search interface. Our experimental results on real search interfaces indicate that this system is highly effective.
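The Bayesian-classifier step in the abstract above can be illustrated with a minimal naive Bayes sketch. The attribute labels, semantic types, and training pairs below are invented examples, not WISE-iExtractor's actual features or categories; this only shows the general shape of classifying an interface attribute from its label tokens.

```python
from collections import Counter, defaultdict
import math

# Hypothetical training data: interface attribute labels paired with a
# semantic type. The paper trains several Bayesian classifiers; these
# labels and classes are illustrative stand-ins only.
TRAINING = [
    ("departure date", "date"),
    ("return date", "date"),
    ("arrival date", "date"),
    ("author name", "text"),
    ("book title", "text"),
    ("number of passengers", "number"),
    ("max price", "number"),
]

def train(samples):
    """Estimate class priors and per-class token counts."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for label_text, cls in samples:
        class_counts[cls] += 1
        for tok in label_text.split():
            token_counts[cls][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab

def classify(label_text, class_counts, token_counts, vocab):
    """Pick the most probable semantic type, with add-one smoothing."""
    total = sum(class_counts.values())
    best_cls, best_lp = None, float("-inf")
    for cls, c in class_counts.items():
        lp = math.log(c / total)
        denom = sum(token_counts[cls].values()) + len(vocab)
        for tok in label_text.split():
            lp += math.log((token_counts[cls][tok] + 1) / denom)
        if lp > best_lp:
            best_cls, best_lp = cls, lp
    return best_cls

counts, tokens, vocab = train(TRAINING)
print(classify("departure date", counts, tokens, vocab))  # → date
```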
A query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or "top" k pages for the query. This top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. For example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. A user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. Processing top-k queries efficiently is challenging for a number of reasons. One critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. In this article, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. We present a sequential algorithm for processing such queries, but observe that any sequential top-k query processing strategy is bound to require unnecessarily long query processing times, since web accesses exhibit high and variable latency. Fortunately, web sources can be probed in parallel, and each source can typically process concurrent requests, although sources may impose some restrictions on the type and number of probes that they are willing to accept. We adapt our sequential query processing technique and introduce an efficient algorithm that maximizes source-access parallelism to minimize query response time, while satisfying source-access constraints. We evaluate our techniques experimentally using both synthetic and real web-accessible sources.
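The parallel-probing idea can be sketched in miniature. The sources, their simulated latencies and concurrency caps, the deterministic stand-in scores, and the simple summing scoring function below are all illustrative assumptions, not the article's actual algorithm; the sketch only shows probing every (source, object) pair concurrently while honoring per-source access limits.

```python
import concurrent.futures
import threading
import time

# Hypothetical attribute sources: (simulated probe latency in seconds,
# maximum number of concurrent probes the source accepts).
SOURCES = {
    "distance": (0.05, 2),
    "price":    (0.05, 2),
    "rating":   (0.05, 1),
}
LIMITS = {name: threading.Semaphore(slots)
          for name, (_, slots) in SOURCES.items()}

def probe(source, obj):
    """Simulate one web-accessible probe; returns a score in [0, 1)."""
    latency, _ = SOURCES[source]
    with LIMITS[source]:  # honor this source's concurrency constraint
        time.sleep(latency)
        # Deterministic stand-in for the remote attribute score.
        return (sum(map(ord, source + obj)) % 100) / 100.0

def topk_parallel(objects, k=2):
    """Probe all (source, object) pairs in parallel, then rank objects
    by the sum of their attribute scores (a simple linear combination)."""
    scores = {obj: 0.0 for obj in objects}
    with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
        futs = {pool.submit(probe, s, o): o
                for s in SOURCES for o in objects}
        for fut in concurrent.futures.as_completed(futs):
            scores[futs[fut]] += fut.result()
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(topk_parallel(["r1", "r2", "r3", "r4"]))  # → ['r4', 'r3']
```

With four candidate restaurants and three sources, all twelve probes are in flight at once (subject to each semaphore), so the wall-clock time is governed by the most constrained source rather than by the sum of all probe latencies.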
The author presents a critical response to Ed Folsom's article "Database as Genre: The Epic Transformation of Archives," which discussed the "Walt Whitman Archive," an online archive of the poet Walt Whitman's work. The author argues that digital databases remain dependent on print conventions such as the book, and discusses the connection between Whitman's writing, the archive, and other digital databases.
Urban scenic forests are open to the outside, fragmented, and fragile, and are therefore at high risk of invasion by alien species. An important measure for minimizing the damage caused by invasive species is to prevent potential invaders from entering suitable habitats. Zijin Mountain in Nanjing City is selected as the case study area in this paper. Research materials concerning biological invasion are reviewed, and three web databases in China and abroad are used to identify potential invasive alien species for the study area. First, nine invasive species that threaten the safety of the forest ecosystem are picked out from the web databases. Three invasive alien species, Bursaphelenchus xylophilus, Matsucoccus matsumura and Hyphantria cunea, are then selected from the nine by means of agricultural climate similarity analysis. Next, a DEM and a high-resolution QuickBird satellite image of the study area are collected to study the spatial distribution of the potential invasive species on the desktop GIS platform ArcGIS. The biological and geographical factors affecting the spatial distribution of each alien species are determined and digitized as separate map layers. Finally, the layers are overlaid and a spatial suitability map is produced to specify the locations of the potential invasive species. The methods used in this paper overcome the weaknesses of traditional suitability research, which can analyze the suitability of only a single species and cannot specify the locations of potential invaders. Besides supplying a theoretical basis for decision-making on controlling invasive species, these methods are of great practical significance for environmental protection in regions of high historic and cultural value.
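The overlay step described above can be sketched in miniature: each factor is rasterized to a 0/1 suitability layer on a common grid, and cells suitable in every layer mark potential invasion sites. The three layers and the toy 4x4 grid below are invented illustrations, not data from the study.

```python
# Hypothetical 0/1 suitability layers on a shared 4x4 grid
# (1 = suitable for the invader in that factor, 0 = not).
host    = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]  # host plants present
elev_ok = [[1, 0, 1, 1], [1, 1, 1, 0], [1, 1, 1, 1], [0, 1, 1, 1]]  # elevation band from DEM
roads   = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]  # near dispersal corridors

def overlay(*layers):
    """Cell-by-cell AND of all layers; returns the coordinates of
    cells that are suitable in every factor layer."""
    rows, cols = len(layers[0]), len(layers[0][0])
    return [(r, c) for r in range(rows) for c in range(cols)
            if all(layer[r][c] for layer in layers)]

print(overlay(host, elev_ok, roads))  # → [(1, 0), (1, 1), (2, 2), (3, 2)]
```

In a real GIS workflow the same intersection is done with raster algebra over the digitized factor layers; the surviving cells form the spatial suitability map.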
The incompatibilities among the complex data formats and the various schemas used by the biological databases that house these data are becoming a bottleneck in biological research. For example, biological data formats vary from simple words (e.g. gene names) and numbers (e.g. molecular weight) to sequence strings (e.g. nucleic acid sequences), and to even more complex formats such as taxonomy trees. Some information is embedded in narrative text, such as expert comments and publications. Other information is expressed as graphs or images (e.g. pathway networks). The confederation of heterogeneous web databases has become a crucial issue in today's biological research. In other words, interoperability has to be achieved among biological web databases and their heterogeneity has to be resolved. This paper presents a biological ontology, BAO, and discusses its advantages in supporting the semantic integration of biological web databases.
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web "crawlers." Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
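The probing idea (classifying a database from match counts alone, without retrieving any documents) can be sketched as follows. The categories, probe queries, and hit counts below are fabricated stand-ins for classifier-generated probes and a real search interface, not QProber's actual probes or decision rule.

```python
# Hypothetical category probes, as a document classifier might derive them.
PROBES = {
    "Health":    ["cancer treatment", "immune system"],
    "Sports":    ["baseball playoffs", "marathon training"],
    "Computers": ["query optimizer", "operating system"],
}

# Pretend match counts that a database's search interface would report.
FAKE_MATCH_COUNTS = {
    "cancer treatment": 4120, "immune system": 2870,
    "baseball playoffs": 12, "marathon training": 55,
    "query optimizer": 3, "operating system": 140,
}

def probe_database(query):
    """Stand-in for submitting one probe and reading only the hit count."""
    return FAKE_MATCH_COUNTS.get(query, 0)

def classify_database(probes):
    """Sum match counts per category and pick the best-supported one."""
    coverage = {cat: sum(probe_database(q) for q in queries)
                for cat, queries in probes.items()}
    return max(coverage, key=coverage.get)

print(classify_database(PROBES))  # → Health
```

The key property the sketch preserves is that classification needs only the number of matches per probe, so the cost per database is a handful of cheap queries rather than a crawl.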