The World-Wide Web can be viewed as a collection of semi-structured multimedia documents in the form of Web pages connected through hyperlinks. Unlike most Web search engines, which primarily focus on information retrieval functionality, WebDB aims to support comprehensive database-like query functionality, including selection, aggregation, sorting, summary, grouping, and projection. WebDB allows users to access (1) document-level information, such as title, URL, length, keywords types and last modified date; (2) intra-document structures, such as tables, forms and images; and (3) inter-document linkage information, such as destination URLs and anchors. With these three types of information, comprehensive queries for complex Web-based applications, such as Web mining and Web site management, can be answered. WebDB is based on object-relational concepts: object-oriented modeling and a relational query language. In this paper, we present the data model, language and implementation of WebDB. We also present a novel visual query/browsing interface for the semi-structured Web and Web documents. Our system provides high usability compared with other existing systems. (C) 2002 Elsevier Science Ltd. All rights reserved.
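The kind of database-like query described above (selection, grouping, aggregation and sorting over document-level and linkage information) can be illustrated with a small sketch. This is not WebDB's own language or schema, which the abstract does not show: the `document` and `link` tables, their columns, and the example URLs are all assumptions, and plain SQL through Python's `sqlite3` module stands in for the WebDB query language.

```python
import sqlite3

# Hypothetical schema: one table for document-level information and one
# for inter-document links. These names are illustrative, not WebDB's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT, length INTEGER);
CREATE TABLE link (source_url TEXT, destination_url TEXT, anchor TEXT);
""")
conn.executemany(
    "INSERT INTO document VALUES (?, ?, ?)",
    [
        ("http://example.org/a", "Home", 5000),
        ("http://example.org/b", "About", 800),
        ("http://example.org/c", "News", 2400),
    ],
)
conn.executemany(
    "INSERT INTO link VALUES (?, ?, ?)",
    [
        ("http://example.org/a", "http://example.org/b", "about"),
        ("http://example.org/a", "http://example.org/c", "news"),
        ("http://example.org/c", "http://example.org/a", "home"),
    ],
)

# Selection (length > 1000), grouping and aggregation (outgoing links per
# page), sorting (most links first) and projection (url, title, count).
rows = conn.execute("""
SELECT d.url, d.title, COUNT(l.destination_url) AS out_links
FROM document d
LEFT JOIN link l ON l.source_url = d.url
WHERE d.length > 1000
GROUP BY d.url, d.title
ORDER BY out_links DESC, d.url
""").fetchall()

for row in rows:
    print(row)
```

A query of this shape is the sort of question a Web site management task might ask, for example which long pages link out the most.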
ISBN: 9781450336406 (print)
Informational and analytical support of the authorities is one of the most relevant subjects in the development of decision support systems. This article describes the current process of information and analytical support of the authorities, using St. Petersburg, Russia, as an example and including the analysis of social networks. It also analyzes the existing approaches to information and analytical support and the problems that can be solved by using semi-structured data.
ISBN: 9781479977529 (print)
An online integration system enables incremental computation shortly after increment data arrives at the central site. Processing increments serially ensures that all data containers are in their updated states for the computation of the next increment. In general, a data container may appear as several arguments in a data integration expression. Serial processing of increments at such a data container fails to achieve its best performance because of the expensive I/O costs of materialization updates. This paper proposes an online integration system with dynamic scheduling to enable concurrent processing of data increments. The online integration system allows a series of transformations of a data integration expression into a single increment expression over the increments of multiple data containers, and generates a data integration plan. The dynamic scheduling system employs a monitoring system and priority scheduling, which can dynamically change the data integration plans according to the behavior of the increment data.
ISBN: 9783030731960; 9783030731977 (print)
Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and semi-structured data classification plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information. Thus, combining the structure and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised semi-structured data classification approach that utilizes both the structural and content information. In this approach, generalized tag sequences are extracted from the structural information, and nGrams are extracted from the content information. Then the tag sequences and nGrams are combined into features called TSGram according to their link relation, and each semi-structured document is represented as a vector of TSGram features. Based on the TSGram features, a classification model is devised to improve the performance of semi-structured data classification. Because TSGram features retain the association between the structural and content information, they are helpful in improving the classification performance. Our experimental results on two real datasets show that the proposed approach is effective.
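To make the idea of pairing structure with content concrete, here is a minimal sketch of one way such features could be built. It is an illustration under stated assumptions, not the paper's method: the tag path from the root stands in for a "generalized tag sequence", word bigrams stand in for nGrams, and the `path|ngram` feature format and the function names `ngrams` and `tsgram_features` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def ngrams(text, n=2):
    """Return word n-grams of a text, lowercased and split on whitespace."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tsgram_features(xml_string, n=2):
    """Count features that pair a tag path with an n-gram found under it.

    Assumption: the tag path approximates the tag sequence, and an n-gram
    is linked to the path of the element whose text contains it.
    Text that follows a child element (ElementTree "tail" text) is ignored.
    """
    root = ET.fromstring(xml_string)
    features = Counter()

    def walk(node, path):
        path = path + [node.tag]
        text = (node.text or "").strip()
        for gram in ngrams(text, n):
            features["/".join(path) + "|" + gram] += 1
        for child in node:
            walk(child, path)

    walk(root, [])
    return features


doc = (
    "<page><title>semi structured data</title>"
    "<body>structured data classification</body></page>"
)
features = tsgram_features(doc)
for name, count in sorted(features.items()):
    print(name, count)
```

Note that the bigram "structured data" produces two distinct features here, one under `page/title` and one under `page/body`, which is the association between structure and content that a plain bag of nGrams would lose. The resulting counts could be fed to any standard vectorizer and classifier.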
ISBN: 9781467369619 (print)
The aim is to develop a new multi-level model of storage for fuzzy semi-structured data. The proposed model differs from known models in its use of extended polybasic intuitionistic sets to describe fuzzy data and in its representation of fuzzy attributes on three typing levels. The main result is a formal description of fuzzy semi-structured data storage. To verify the result, an example is shown of applying the developed data storage model in an intelligent diagnostic decision-making system for railway transport.
ISBN: 9781450383431 (print)
Developers often prefer flexibility over upfront schema design, making semi-structured data formats such as JSON increasingly popular. Large amounts of JSON data are therefore stored and analyzed by relational database systems. In existing systems, however, JSON's lack of a fixed schema results in slow analytics. In this paper, we present JSON tiles, which, without losing the flexibility of JSON, enables relational systems to perform analytics on JSON data at native speed. JSON tiles automatically detects the most important keys and extracts them transparently - often achieving scan performance similar to columnar storage. At the same time, JSON tiles is capable of handling heterogeneous and changing data. Furthermore, we automatically collect statistics that enable the query optimizer to find good execution plans. Our experimental evaluation compares against state-of-the-art systems and research proposals and shows that our approach is both robust and efficient.
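The core idea of detecting the most important keys in a chunk of documents and extracting them into columns can be sketched as follows. This is a simplified illustration, not the JSON tiles implementation: the function name `build_tile`, the `min_fraction` threshold, and the choice to look only at top-level keys are assumptions made for the example.

```python
import json
from collections import Counter


def build_tile(docs, min_fraction=0.6):
    """Split a chunk of JSON objects into columns plus per-document leftovers.

    Assumption: a key is "important" when it occurs in at least
    `min_fraction` of the documents in the chunk. Only top-level keys are
    considered, and a missing value is stored as None, so a missing key and
    an explicit JSON null are not distinguished in this sketch.
    """
    key_counts = Counter(key for doc in docs for key in doc)
    frequent = sorted(
        key for key, count in key_counts.items()
        if count / len(docs) >= min_fraction
    )
    # Frequent keys become column arrays that can be scanned directly.
    columns = {key: [doc.get(key) for doc in docs] for key in frequent}
    # Everything else stays with its document, so no data is lost.
    remainder = [
        {key: value for key, value in doc.items() if key not in columns}
        for doc in docs
    ]
    return columns, remainder


raw = [
    '{"id": 1, "user": "a", "lang": "en"}',
    '{"id": 2, "user": "b"}',
    '{"id": 3, "geo": [1, 2]}',
]
docs = [json.loads(line) for line in raw]
columns, remainder = build_tile(docs)
print(columns)
print(remainder)
```

A scan over `columns["id"]` touches only one array, which is the intuition behind the columnar-like scan performance, while the rare keys `lang` and `geo` remain available in `remainder`.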
ISBN: 9781728184500 (print)
Model transformations (MTs) involve code abstraction and program description; that is, MTs operate on a more diverse set of artifacts than program transformations. MTs allow programmers to link different structures, as Category Theory does for mathematics, by recognizing similar features and properties. In this paper, we show that Category Theory can be used to describe MTs. Specifically, we propose a categorical framework for transforming a semi-structured data model, an OEM database, into a structured data model expressed in UML. This categorical approach allows us to establish a bridge between such models and the categories of simple and directed graphs, which makes it possible to apply the features of such categories to manage databases.
ISBN: 9781509001545 (print)
The problem of data classification goes back to the definition of taxonomies covering knowledge areas. With the advent of the Web, the amount of available data increased by several orders of magnitude, making manual data classification impossible. This work presents an approach based on prototype theory to automatically classify semi-structured data, represented by frames, without any previous knowledge about structured classes. Our approach uses a variation of the K-Means algorithm that organizes a set of frames into classes, structured as a strict hierarchy.
ISBN: 9781479950690 (print)
The state of the art in duplicate detection in semi-structured data obtains significant improvement by exploiting schema-related knowledge. Such schema-bound duplicate detection approaches, however, have severe limitations when dealing with multi-sourced, heterogeneous, high-velocity data streams. In this paper, we propose a novel context-aware duplicate detection system that is workload- and complexity-aware and is adaptable to the underlying computing platform. The system operates in a schema-oblivious manner and relies upon an information-theory-based heuristic and a data shaping technique for efficient and scalable duplicate detection in multi-sourced, heterogeneous data sets. Experiments with real-world data sets show a speedup of up to 8X over state-of-the-art schemes, while maintaining up to 92 percent accuracy. In addition, our data shaping technique for GPGPU processing speeds up the duplicate detection throughput by up to two orders of magnitude.
ISBN: 9781450362405 (print)
JSON (JavaScript Object Notation) and its derivatives are essential in the modern computing infrastructure. However, existing software often fails to process such types of data in a scalable way, mainly for two reasons: (i) the processing often requires building a memory-consuming parse tree; (ii) there exist inherent dependences in processing the data stream, preventing any data-level parallelization. Facing these challenges, developers often have to construct ad-hoc pre-parsers to split the data stream in order to reduce the memory consumption and increase the data parallelism. However, this strategy requires more programming effort. Moreover, the pre-parsing itself is non-trivial to parallelize, thus introducing a new serial bottleneck. To solve the dilemma, this work introduces a scalable yet fully automatic solution - a compilation system, namely JPStream, that compiles standard JSONPath queries into parallel executables with bounded memory footprints. First, JPStream adopts a stream processing design that combines the querying and parsing into one pass, without generating any in-memory parse tree. To achieve this, JPStream uses a novel joint compilation technique that compiles the queries and the JSON syntax together into a single automaton. Furthermore, JPStream leverages the "enumerability" of the automaton to break the dependences and reasons about the transition rules to prune infeasible cases. It also features a module that learns data constraints from the input data to enhance the pruning. Evaluation on real-world JSON datasets with standard JSONPath queries shows that JPStream can reduce the memory consumption significantly, by up to 95%, while achieving near-linear speedup on multicore and manycore processors.
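The one-pass design, in which querying happens while parsing and no parse tree is built, can be sketched in a few lines. This is only an illustration of that single idea and not JPStream itself: it does not compile anything into an automaton and is not parallel. The names `tokens` and `stream_query` are hypothetical, the query is a plain list of object keys rather than JSONPath syntax, only scalar values are matched, arrays are treated as transparent (every element is visited), and the input is assumed to be well-formed JSON.

```python
import json


def tokens(chars):
    """Yield (kind, value) tokens from an iterator of characters."""
    it = iter(chars)
    ch = next(it, "")
    while ch:
        if ch in "{}[]:,":
            yield ch, None
            ch = next(it, "")
        elif ch == '"':
            buf = ['"']
            ch = next(it, "")
            while ch and ch != '"':
                buf.append(ch)
                if ch == "\\":
                    # Keep the escaped character so an escaped quote
                    # does not end the string.
                    buf.append(next(it, ""))
                ch = next(it, "")
            buf.append('"')
            yield "scalar", json.loads("".join(buf))
            ch = next(it, "")
        elif ch.isspace():
            ch = next(it, "")
        else:
            # Number, true, false or null: read up to the next delimiter.
            buf = []
            while ch and ch not in "{}[]:," and not ch.isspace():
                buf.append(ch)
                ch = next(it, "")
            yield "scalar", json.loads("".join(buf))


def stream_query(chars, target):
    """Yield scalar values whose object-key path equals `target`."""
    stack = []  # one frame per open container: [kind, current_key]
    for kind, value in tokens(chars):
        if kind in ("{", "["):
            stack.append([kind, None])
        elif kind in ("}", "]"):
            stack.pop()
        elif kind == ",":
            if stack and stack[-1][0] == "{":
                stack[-1][1] = None  # the next string is a key
        elif kind == "scalar":
            if stack and stack[-1][0] == "{" and stack[-1][1] is None:
                stack[-1][1] = value  # the scalar is an object key
            else:
                path = [key for k, key in stack if k == "{"]
                if path == target:
                    yield value


data = (
    '{"user": {"name": "Ann", "tags": ["x", "y"]}, '
    '"items": [{"id": 1}, {"id": 2, "note": "a \\"quoted\\" word"}]}'
)
print(list(stream_query(iter(data), ["items", "id"])))
print(list(stream_query(iter(data), ["user", "name"])))
```

Memory use in this sketch is bounded by the nesting depth plus the largest single scalar, because the only state kept is the stack of open containers. The scan is still strictly serial, which is exactly the dependence that the abstract says JPStream breaks.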