Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor, which is a costly and error-prone process. In this paper, we propose concepts to adapt to and to exploit the performance enhancements of modern processors automatically. Our core idea is to create processor-specific code variants and to learn a well-performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. Our experimental results show that the performance of code variants may diverge by up to two orders of magnitude. In order to achieve peak performance, we generate custom code for each processor. We show that our approach finds an efficient custom code variant for multi-core CPUs, GPUs, and MICs.
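The idea of keeping several code variants and learning which one runs best on a given processor can be illustrated with a minimal sketch. It is not the paper's code generator: the two summation variants, the names `VARIANTS` and `pick_variant`, and the injectable `measure` function are all hypothetical, and the selection is a plain "measure and keep the cheapest" loop.

```python
import time

def sum_loop(values):
    # Variant 1: explicit element-by-element loop.
    total = 0
    for v in values:
        total += v
    return total

def sum_builtin(values):
    # Variant 2: delegate to the built-in, which may be faster on some platforms.
    return sum(values)

# Hypothetical registry of interchangeable variants of the same operator.
VARIANTS = {"loop": sum_loop, "builtin": sum_builtin}

def wall_clock(fn, data):
    # Default cost measure: elapsed wall-clock time of one run.
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

def pick_variant(variants, sample, measure=wall_clock, repeats=3):
    """Return the name of the variant with the lowest measured cost on `sample`."""
    best_name, best_cost = None, float("inf")
    for name, fn in variants.items():
        # Take the best of a few runs to dampen measurement noise.
        cost = min(measure(fn, sample) for _ in range(repeats))
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name

if __name__ == "__main__":
    print(pick_variant(VARIANTS, list(range(100_000))))
```

Passing `measure` as a parameter is a design choice of this sketch: it lets the same selection loop run against real timings on the target processor or against a cost model.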
Data Lab is an open-access science platform developed and operated by the Community and Science Data Center (CSDC) at NSF's National Optical-Infrared Astronomy Research Laboratory (NOIRLab). It serves public photometric survey datasets and provides interactive and programmatic data access as well as SQL/ADQL query capabilities via TAP. Users also receive generous storage allocations with VOSpace and MyDB, co-located with our data holdings. A host of services such as cross-matching, image cutouts via SIA, file services for survey data, and a Jupyter notebook interface for analysis close to the data complement the mission statement. Launched in 2017 at the National Optical Astronomy Observatory, Data Lab supports a base of over 1,300 registered users, processes on average 15,000 queries daily, serves over 50 TB of photometric catalogs, and provides access to over 2 PB of survey image products at NOIRLab's Science Data Archive. Future development will include support for massive spectroscopic datasets and for processing of alert streams generated by, e.g., ZTF and LSST. Users will also be able to create and administer ad hoc user groups for shared data access and scientific analysis, and will enjoy containerized services and notebook spaces. (C) 2020 The Author(s). Published by Elsevier B.V.
This paper addresses the problem of evaluating ranked top-k queries with expensive predicates. As major DBMSs now all support expensive user-defined predicates for Boolean queries, we believe such support for ranked queries will be even more important: First, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. Second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. Third, ranked queries often accompany Boolean ranking conditions, which may turn predicates into expensive ones, as the index structure on the predicate built on the base table may no longer be effective in retrieving the filtered objects in order. Fourth, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. These predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and instead require per-object probes to evaluate. To enable probe minimization, we formulate the problem as a cost-based optimization over potential probe schedules. In particular, we decouple probe scheduling into object and predicate scheduling problems and develop an analytical object scheduling optimization and a dynamic predicate scheduling optimization, which together form a cost-effective probe schedule.
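Probe minimization can be illustrated with a minimal sketch in the spirit of the abstract, not the paper's actual scheduling optimization. It assumes that scores lie in [0, 1], that the overall score is the minimum over all predicates (one possible monotone scoring function), that one cheap predicate is already evaluated for every object, and that the expensive predicates are probed in a fixed order. The names `top_k_lazy`, `cheap_score`, and `expensive_preds` are hypothetical.

```python
import heapq

def top_k_lazy(objects, cheap_score, expensive_preds, k):
    """Return (top-k list of (object, score), number of expensive probes).

    objects:         iterable of comparable ids (e.g. strings)
    cheap_score:     dict mapping id -> already-known score in [0, 1]
    expensive_preds: list of callables id -> score in [0, 1]
    The overall score is assumed to be min() over all predicate scores.
    """
    probes = 0
    # Each entry is (-upper_bound, id, index of next unprobed predicate).
    # Unprobed predicates are optimistically assumed to score 1.0, so the
    # upper bound is the minimum of the scores known so far.
    heap = [(-cheap_score[o], o, 0) for o in objects]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_bound, obj, nxt = heapq.heappop(heap)
        if nxt == len(expensive_preds):
            # Fully evaluated and still on top: no other object can beat it.
            results.append((obj, -neg_bound))
            continue
        # Only the object with the highest upper bound is ever probed.
        probes += 1
        score = expensive_preds[nxt](obj)
        heapq.heappush(heap, (-min(-neg_bound, score), obj, nxt + 1))
    return results, probes

if __name__ == "__main__":
    cheap = {"a": 0.9, "b": 0.8, "c": 0.3}
    costly = {"a": 0.7, "b": 0.95, "c": 1.0}
    print(top_k_lazy(cheap.keys(), cheap, [costly.get], k=1))
```

In the usage example, object `c` is never probed: its upper bound of 0.3 can never exceed the final score of `b`, so evaluating its expensive predicate would be wasted work.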
This paper describes a graphical user-interface for database-oriented knowledge discovery systems, DBLEARN, which has been developed for extracting knowledge rules from relational databases. The interface, designed using a query-by-example approach, provides a graphical means of specifying knowledge-discovery tasks. The interface supplies a graphical browsing facility to help users to perceive the nature of the target database structure. In order to guide users' task specification, a cooperative, menu-based guidance facility has been integrated into the interface. The interface also supplies a graphical interactive adjusting facility for helping users to refine the task specification to improve the quality of learned knowledge rules. Copyright (C) 1996 Elsevier Science Ltd
We focus on a new, potentially important application of source coding directed toward storage and retrieval, termed fusion coding of correlated sources. The task at hand is to efficiently store multiple correlated sources in a database so that, at any point in time in the future, data from a selective subset of sources specified by the user can be efficiently retrieved. Only statistical information about future queries is available in advance. A typical application scenario would be in storage of correlated data generated by dense sensor networks, where information from specific regions is requested in the future. We propose a fusion coder (FC) for lossy storage and retrieval, wherein different queries are handled by allowing for selective (compressed) bit retrieval. We derive the properties of an optimal FC and present an iterative algorithm for its design. Since iterative design is initialization-dependent, we present initialization heuristics that help avoid poor local optima. An analysis of design complexity reveals complexity growth with query-set size. We first tackle this problem by exploiting optimality properties of FCs. We also consider quantization of the query-space with decision trees in order to adapt to new queries, unseen during FC design. Experiments conducted on real and synthetic data-sets demonstrate that the proposed FC is able to achieve significantly better tradeoffs than joint compression by vector quantization (VQ), with retrieval speedups reaching 3x and distortion gains of up to 3.5 dB possible.
Upcoming high-cadence wide-field optical telescopes will image hundreds of thousands of sources per minute. Besides inspecting the near real-time data streams for transient and variability events, the accumulated data archive is a rich laboratory for making complementary scientific discoveries. The goal of this work is to optimise column-oriented database techniques to enable the construction of a full-source and light-curve database for large-scale surveys that is accessible by the astronomical community. We adopted LOFAR's Transients Pipeline as the baseline and modified it to enable the processing of optical images that have much higher source densities. The pipeline adds new source lists to the archive database, while cross-matching them with the known catalogued sources in order to build a full light-curve archive. We investigated several techniques of indexing and partitioning the largest tables, allowing for faster positional source look-ups in the cross-matching algorithms. We monitored all query run times in long-term pipeline runs where we processed a subset of IPHAS data that have image source density peaks over 170,000 per field of view (500,000 deg^-2). Our analysis demonstrates that horizontal table partitions with declination widths of one degree control the query run times. Usage of an index strategy where the partitions are densely sorted according to source declination yields another improvement. Most queries run in sublinear time and a few (< 20%) run in linear time, because of dependencies on input source-list and result-set size. We observed that for this logical database partitioning schema the limiting cadence the pipeline achieved with processing IPHAS data is 25 s. (C) 2018 Elsevier B.V. All rights reserved.
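The effect of one-degree declination partitions with declination-sorted rows can be sketched with an in-memory stand-in for the partitioned tables. This is an illustration under stated assumptions, not the pipeline's database schema: the catalogue is a list of `(source_id, ra_deg, dec_deg)` tuples, and the names `build_zones`, `cross_match`, and `angular_sep_deg` are hypothetical.

```python
import bisect
import math
from collections import defaultdict

def build_zones(catalog, zone_height=1.0):
    """Partition (source_id, ra_deg, dec_deg) rows into declination zones."""
    zones = defaultdict(list)
    for sid, ra, dec in catalog:
        zones[math.floor(dec / zone_height)].append((dec, ra, sid))
    for rows in zones.values():
        rows.sort()  # densely sorted by declination within each partition
    return zones

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(zones, ra, dec, radius_deg, zone_height=1.0):
    """Return ids of catalogued sources within radius_deg of (ra, dec)."""
    lo_zone = math.floor((dec - radius_deg) / zone_height)
    hi_zone = math.floor((dec + radius_deg) / zone_height)
    matches = []
    for zone in range(lo_zone, hi_zone + 1):  # only partitions that can match
        rows = zones.get(zone, [])
        # Binary search to the first row inside the declination window.
        start = bisect.bisect_left(rows, (dec - radius_deg,))
        for row_dec, row_ra, sid in rows[start:]:
            if row_dec > dec + radius_deg:
                break  # rows are sorted, nothing further can match
            if angular_sep_deg(ra, dec, row_ra, row_dec) <= radius_deg:
                matches.append(sid)
    return matches
```

A look-up touches at most the few partitions overlapping the search window and, within each, only the rows inside the declination band, which is why partition width and sort order dominate the run time in this kind of scheme.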
Efficient algorithms for processing large volumes of data are very important both for relational and new object-oriented database systems. Many query-processing operations can be implemented using sort- or hash-based algorithms, e.g., intersection, join, and duplicate elimination. In the early relational database systems, only sort-based algorithms were employed. In the last decade, hash-based algorithms have gained acceptance and popularity, and are often considered generally superior to sort-based algorithms such as merge-join. In this article, we compare the concepts behind sort- and hash-based query-processing algorithms and conclude that 1) many dualities exist between the two types of algorithms, 2) their costs differ mostly by percentages rather than factors, 3) several special cases exist that favor one or the other choice, and 4) there is a strong reason why both hash- and sort-based algorithms should be available in a query-processing system. Our conclusions are supported by experiments performed using the Volcano query execution engine.
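The duality between the two families can be made concrete with set intersection, one of the operations the abstract names. The sketch below is a minimal in-memory illustration with hypothetical function names; it ignores the memory and I/O considerations that drive the cost comparison in a real query engine.

```python
def intersect_sorted(left, right):
    """Sort-based intersection: sort both inputs, then merge."""
    a, b = sorted(left), sorted(right)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            # Equal keys: emit once, skipping duplicates already emitted.
            if not out or out[-1] != a[i]:
                out.append(a[i])
            i += 1
            j += 1
    return out

def intersect_hashed(left, right):
    """Hash-based intersection: build a hash set on one input, probe with the other."""
    build = set(left)
    seen = set()
    out = []
    for item in right:
        if item in build and item not in seen:
            seen.add(item)
            out.append(item)
    return out

if __name__ == "__main__":
    left, right = [3, 1, 2, 3, 5], [5, 3, 3, 7]
    print(intersect_sorted(left, right))   # output is in sorted order
    print(intersect_hashed(left, right))   # output follows probe order
```

Both functions return the same set of values. The sort-based version also delivers them in sorted order, which is one of the special cases that can favor it when a later operator needs sorted input.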
As a growing number of applications represent data as semantic graphs like RDF (Resource Description Framework) and the many entity-attribute-value formats, query languages for such data are being required to support operations beyond graph pattern matching and inference queries. Specifically, the ability to express aggregate queries is an important feature which is either lacking or is implemented with little attention to the peculiarities of the data model. In this paper, we study the meaning and implementation of grouping and aggregate queries over RDF graphs. We first define grouping and aggregate operators algebraically and then show how the SPARQL query language can be extended to express grouping and aggregate queries.
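Grouping and aggregation over a triple store can be sketched as follows. This is a minimal illustration, not the operators defined in the paper: triples are plain Python tuples, the `ex:` names are made-up sample data, and `group_aggregate` is a hypothetical helper. It does reflect one peculiarity of the data model, namely that a subject may carry zero or several values for the same property.

```python
from collections import defaultdict

# Hypothetical sample graph as (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:carol", "ex:worksFor", "ex:initech"),
    ("ex:alice", "ex:salary", 50),
    ("ex:bob", "ex:salary", 70),
    ("ex:carol", "ex:salary", 60),
]

def match(triples, predicate):
    """Return (subject, object) pairs for all triples with the given predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]

def group_aggregate(triples, group_pred, value_pred, agg):
    """Group subjects by the object of group_pred and aggregate their value_pred objects."""
    values = defaultdict(list)
    for s, o in match(triples, value_pred):
        values[s].append(o)  # a subject may have several values
    groups = defaultdict(list)
    for s, g in match(triples, group_pred):
        groups[g].extend(values.get(s, []))  # or none at all
    # Groups whose members have no values are dropped rather than aggregated.
    return {g: agg(vs) for g, vs in groups.items() if vs}

if __name__ == "__main__":
    print(group_aggregate(TRIPLES, "ex:worksFor", "ex:salary", sum))
```

Dropping empty groups is a choice made for this sketch; an operator definition has to decide explicitly how such groups and multi-valued properties are treated.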
This correspondence proposes two ways to improve the sort-merge based band join algorithm. The proposed techniques address issues that have not been previously discussed: choosing the right relation as the inner relation to achieve better performance, and optimally allocating and adjusting buffers to make the algorithms robust to data skew and estimation errors.
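For orientation, a baseline band join built on sorting and merging can be sketched as below. It assumes the usual band-join condition r - c1 <= s <= r + c2 on numeric join keys and works entirely in memory, so it shows neither of the two proposed improvements (inner-relation choice and buffer allocation). The name `band_join` and its parameters are hypothetical.

```python
def band_join(outer, inner, c1, c2):
    """Return all pairs (r, s) with r - c1 <= s <= r + c2.

    Both inputs are sorted first; a lower pointer into the inner relation
    then only ever moves forward as the outer key grows.
    """
    outer, inner = sorted(outer), sorted(inner)
    out = []
    lo = 0
    for r in outer:
        # Advance past inner keys that fall below the band of r. Because the
        # outer keys are ascending, these can never match a later r either.
        while lo < len(inner) and inner[lo] < r - c1:
            lo += 1
        # Scan the inner keys inside the band of r.
        j = lo
        while j < len(inner) and inner[j] <= r + c2:
            out.append((r, inner[j]))
            j += 1
    return out

if __name__ == "__main__":
    print(band_join([10, 20], [8, 9, 12, 19, 25], c1=1, c2=2))
```

The inner scan restarts at `lo` for every outer key, so inner tuples inside overlapping bands are read repeatedly. In a disk-based setting those re-reads are what make the choice of inner relation and the buffer sizes matter.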
ISBN: 9781595939470 (print)
The underlying processes that enable database query execution are fundamental to understanding database management systems. However, these processes are complex and can be difficult to explain and illustrate. To address this problem, we have developed a Java-based query simulation system that enables students to visualize the steps involved in processing DML queries. We performed a field experiment to evaluate the system, and the results suggest that the system improves student comprehension of the query execution process.