Retrieval of arbitrary-shaped surrounding data objects has many potential applications in spatial databases including nearby arbitrary-shaped object-of-interests retrieval surrounding a user. In this paper, we propose...
详细信息
Retrieval of arbitrary-shaped surrounding data objects has many potential applications in spatial databases including nearby arbitrary-shaped object-of-interests retrieval surrounding a user. In this paper, we propose directional zone concept to determine directional similarity among spatial data objects. Then, we propose a novel query, called direction-based spatial skyline (DSS), which retrieves non-dominated arbitrary-shaped surrounding data objects in spatial databases for a user. The proposed DSS query is rotationally invariant as well as fair. We develop efficient algorithms for processing DSS queries in spatial databases by designing novel data pruning techniques using R-Tree data indexing scheme. Finally, we demonstrate the effectiveness and efficiency of our approach by conducting extensive experiments with real datasets.
As a major type of continuous spatial queries, the moving spatial keyword queries have been studied extensively. Most existing studies focus on retrieving single objects, each of which is close to the query object and...
详细信息
As a major type of continuous spatial queries, the moving spatial keyword queries have been studied extensively. Most existing studies focus on retrieving single objects, each of which is close to the query object and relevant to the query keywords. Nevertheless, a single object may not satisfy all the needs of a user, e.g., a user who is driving may want to withdraw money, wash her car, and buy some medicine, which could only be satisfied by multiple objects. We thereby formulate a new type of queries named the moving collective spatial keyword query (MCSKQ). This type of queries continuously reports a set of objects that collectively cover the query keywords as the query moves. Meanwhile, the returned objects must also be close to the query object and close to each other. Computing the exact result set is an NP-hard problem. To reduce the queryprocessing costs, we propose algorithms, based on safe region techniques, to maintain the exact result set while the query object is moving. We further propose two approximate algorithms to obtain even higher query efficiency with precision bounds. All the proposed algorithms are also applicable to MCSKQ with weighted objects and MCSKQ in the domain of road networks. We verify the effectiveness and efficiency of the proposed algorithms both theoretically and empirically, and the results confirm the superiority of the proposed algorithms over the baseline algorithms.
Understanding the influence of a product is crucially important for making informed business decisions. This paper introduces a new type of skyline queries, called uncertain reverse skyline, for measuring the influenc...
详细信息
ISBN:
(纸本)9781450352826
Understanding the influence of a product is crucially important for making informed business decisions. This paper introduces a new type of skyline queries, called uncertain reverse skyline, for measuring the influence of a probabilistic product in uncertain data settings. More specifically, given a dataset of probabilistic products and a set of customers C, an uncertain reverse skyline of a probabilistic product q retrieves all customers c epsilon C which include q as one of their preferred products. We present efficient pruning ideas and techniques for processing the uncertain reverse skyline query of a probabilistic product using R-Tree data index. We also present an efficient parallel approach to compute the uncertain reverse skyline and influence score of a probabilistic product. Our approach significantly outperforms the baseline approach derived from the existing literature. The efficiency of our approach is demonstrated by conducting experiments with both real and synthetic datasets.
Rank (i.e., top-k) join queries play a key role in modern analytics tasks. However, despite their importance and unlike centralized settings, they have been completely overlooked in cloud NoSQL settings. We attempt to...
详细信息
Rank (i.e., top-k) join queries play a key role in modern analytics tasks. However, despite their importance and unlike centralized settings, they have been completely overlooked in cloud NoSQL settings. We attempt to fill this gap: We contribute a suite of solutions and study their performance comprehensively. Baseline solutions are offered using SQL-like languages (like Hive and Pig), based on MapReduce jobs. We first provide solutions that are based on specialized indices, which may themselves be accessed using either MapReduce or coordinator-based strategies. The first index-based solution is based on inverted indices, which are accessed with MapReduce jobs. The second index-based solution adapts a popular centralized rank-join algorithm. We further contribute a novel statistical structure comprising histograms and Bloom filters, which forms the basis for the third index-based solution. We provide (i) MapReduce algorithms showing how to build these indices and statistical structures, (ii) algorithms to allow for online updates to these indices, and (iii) query processing algorithms utilizing them. We implemented all algorithms in Hadoop (HDFS) and HBase and tested them on TPC-H datasets of various scales, utilizing different queries on tables of various sizes and different score-attribute distributions. We ported our implementations to Amazon EC2 and "in-house" lab clusters of various scales. We provide performance results for three metrics: query execution time, network bandwidth consumption, and dollar-cost for query execution.
Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, pattern recognition, etc. A fundamental and critical query primitive is to efficiently search simi...
详细信息
Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, pattern recognition, etc. A fundamental and critical query primitive is to efficiently search similar structures in a large collection of graphs. This paper studies the graph similarity queries with edit distance constraints. Existing solutions to the problem utilize fixed-size overlapping substructures to generate candidates, and thus become susceptible to large vertex degrees or large distance thresholds. In this paper, we present a partition-based approach to tackle the problem. By dividing data graphs into variable-size non-overlapping partitions, the edit distance constraint is converted to a graph containment constraint for candidate generation. We develop efficient query processing algorithms based on the new paradigm. A candidate pruning technique and an improved graph edit distance algorithm are also developed to further boost the performance. In addition, a cost-aware graph partitioning technique is devised to optimize the index. Extensive experiments demonstrate our approach significantly outperforms existing approaches.
Uncertain data is quite common nowadays in a variety of modern database applications. At the same time, the join operation is one of the most important but expensive operations in SQL. However, join queries on uncerta...
详细信息
ISBN:
(纸本)9781424489589
Uncertain data is quite common nowadays in a variety of modern database applications. At the same time, the join operation is one of the most important but expensive operations in SQL. However, join queries on uncertain data have not been adequately addressed thus far. In this paper, we study the SQL join operation on uncertain attributes. We observe and formalize two kinds of join operations on such data, namely v-join and d-join. They are each useful for different applications. Using probability theory, we then devise efficient query processing algorithms for these join operations. Specifically, we use probability bounds that are based on the moments of random variables to either early accept or early reject a candidate v-join result tuple. We also devise an indexing mechanism and an algorithm called Two-End Zigzag Join to further save I/O costs. For d-join, we first observe that it can be reduced to a special form of similarity join in a multidimensional space. We then design an efficient algorithm called condensed d-join and an optimal condensation scheme based on dynamic programming. Finally, we perform a comprehensive empirical study using both real datasets and synthetic datasets.
作者:
Chen, YHSu, SYWUNIV FLORIDA
DEPT COMP & INFORMAT SCI & ENGN DATABASE SYST RES & DEV CTR GAINESVILLE FL 32611 USA
Object-oriented database management systems (OODBMSs) provide rich facilities for the modeling and processing of structural as well as behavioral properties of complex application objects. However, due to their inhere...
详细信息
Object-oriented database management systems (OODBMSs) provide rich facilities for the modeling and processing of structural as well as behavioral properties of complex application objects. However, due to their inherent generality and continuously evolving functionalities, efficient implementations are important for these OODBMSs to support the present and future applications, particularly when the databases are very large. In this paper, we present several parallel, multi-wavefront algorithms based on two processing approaches, i.e., identification and elimination approaches, to verify association patterns specified in queries. Both approaches allow more processors to operate concurrently on a query than the traditional tree-structured queryprocessing approach, thus introducing a higher degree of parallelism in queryprocessing. A heuristic method is presented for partitioning an object-oriented database (OODB). The main consideration for partitioning the database is load balancing. This method also tries to reduce the communication time by reducing the length of the path that wavefronts need to be propagated. Multiple wavefront algorithms based on the two approaches for tree-structured queries have been implemented on an nCUBE 2 parallel computer. The implementation of the query processor allows multiple queries to be executed simultaneously. This implementation provides an environment for evaluating the algorithms and the heuristic method for partitioning the database. The evaluation results are presented in this paper.
This paper presents a queryprocessing algorithm, formulated and developed in support of the prototype architecture of the Distributed Access View Integrated Database (DAVID) which is a heterogeneous distributed datab...
详细信息
This paper presents a queryprocessing algorithm, formulated and developed in support of the prototype architecture of the Distributed Access View Integrated Database (DAVID) which is a heterogeneous distributed database management system. The objective of the proposed queryprocessing algorithm is to produce an inexpensive strategy for a given query. The inexpensive query strategy is obtained primarily by computing the most profitable semi-joins and by determining the best sequence of join operations per processing site. The latter is obtained by applying a zero-one integer linear program that uses a non-parametric statistical estimation technique to compute the sizes of the temporary clusters. A cluster is a subset of the cartesian product of a list of atomic and non-atomic domains and is the structure that can represent in a uniform way data stored in relational, hierarchical and network databases. Following some background information on the development of the DAVID prototype, this paper introduces the schema architecture. The schema architecture describes the mechanism by which the component heterogeneous database schemata are mapped into the uniform global schema. This is followed by the formulation of the queryprocessing algorithm, its implementation and an illustration of its use in the context of NASA's Astrophysics Data System.
暂无评论