This paper focuses on parallel query optimization. We consider the operator ordering problem and introduce a new class of execution strategies called Linear-oriented Bushy Trees (LBT). Compared to the related approach of General Bushy Trees (GBT), a significant complexity reduction of the operator ordering problem can be derived theoretically and demonstrated experimentally (e.g., compared with GBTs, LBTs allow optimization-time improvements of up to 49%) without losing quality. Finally, we demonstrate that existing commercial parallel query optimizers need only minor extensions and modifications in order to handle LBTs. (C) 2000 Elsevier Science B.V. All rights reserved.
The LOGFLOW parallel Prolog system is similar to recent parallel database systems in its dataflow execution model and its ability to run on shared-nothing architectures. In this paper the abstract execution and abstract machine models of LOGFLOW are examined from a database point of view. Transformations of relational operators into the Logicflow Graph representation of Prolog programs are explained. Thus, LOGFLOW can operate as a relational database machine. (C) 2000 Published by Elsevier Science B.V. All rights reserved.
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opportunities and challenges for developing a highly scalable and efficient distributed computation system that is easy to program and supports complex system optimization to maximize performance and reliability. In this paper, we describe a distributed computation system, Structured Computations Optimized for Parallel Execution (SCOPE), targeted for this type of massive data analysis. SCOPE combines benefits from both traditional parallel databases and MapReduce execution engines to allow easy programmability and deliver massive scalability and high performance through advanced optimization. Similar to parallel databases, the system has a SQL-like declarative scripting language with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. An optimizer is responsible for converting scripts into efficient execution plans for the distributed computation engine. A physical execution plan consists of a directed acyclic graph of vertices. Execution of the plan is orchestrated by a job manager that schedules execution on available machines and provides fault tolerance and recovery, much like MapReduce systems. SCOPE is being used daily for a variety of data analysis and data mining applications over tens of thousands of machines at Microsoft, powering Bing and other online services.
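The plan-as-a-DAG-of-vertices design described above can be illustrated with a small scheduler sketch. The Python below is only an illustration of that idea, assuming made-up names (Vertex, JobManager); it is not SCOPE's actual engine or API. A vertex runs as soon as all of its inputs have completed, which is essentially what the job manager does at cluster scale, with machine placement, fault tolerance, and recovery on top.

```python
# Illustrative sketch (not SCOPE's API): a physical plan as a DAG of vertices
# executed in dependency order by a toy job manager.
from collections import defaultdict, deque

class Vertex:
    def __init__(self, name, work):
        self.name = name
        self.work = work                      # callable: list of inputs -> output

class JobManager:
    def __init__(self):
        self.edges = defaultdict(list)        # producer name -> consumer names
        self.indeg = defaultdict(int)         # consumer name -> #unfinished inputs
        self.vertices = {}

    def add(self, v, inputs=()):
        self.vertices[v.name] = v
        for src in inputs:
            self.edges[src].append(v.name)
            self.indeg[v.name] += 1
        return v.name

    def run(self):
        results = {}
        ready = deque(n for n in self.vertices if self.indeg[n] == 0)
        while ready:                          # run a vertex once all its inputs are done
            name = ready.popleft()
            producers = [p for p in self.vertices if name in self.edges[p]]
            results[name] = self.vertices[name].work([results[p] for p in producers])
            for consumer in self.edges[name]:
                self.indeg[consumer] -= 1
                if self.indeg[consumer] == 0:
                    ready.append(consumer)
        return results

# A three-vertex plan mimicking extract -> filter -> aggregate.
jm = JobManager()
e = jm.add(Vertex("extract", lambda _: [("bing", 3), ("maps", 1), ("bing", 2)]))
f = jm.add(Vertex("filter",  lambda ins: [r for r in ins[0] if r[1] > 1]), inputs=[e])
a = jm.add(Vertex("agg",     lambda ins: sum(n for _, n in ins[0])), inputs=[f])
print(jm.run()["agg"])                        # prints 5
```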
A description and discussion of the SciDB database management system focuses on lessons learned, application areas, performance comparisons against other solutions, and additional approaches to managing data and complex analytics.
Mining association rules from large databases is very costly. We propose to develop parallel algorithms for this task on shared-memory multiprocessors (SMP). All parallel algorithms proposed for other paradigms follow the conventional level-wise approach: they need as many iterations as the length of the maximum large itemset. To make matters worse, they impose a synchronization in every iteration, which would cause serious I/O contention on a shared-memory parallel system. An adaptive asynchronous parallel mining algorithm, APM, has been proposed for SMP: all processors generate candidates dynamically and count itemset supports independently, without synchronization. Two optimization techniques have been proposed to reduce database scanning and the number of candidates. APM has been implemented on a Sun Enterprise 4000 shared-memory multiprocessor with 12 nodes. The experiments show that the optimizations are very effective and that APM has a substantial lead in performance over the other proposed level-wise algorithms.
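To make the asynchronous idea concrete, here is a hedged Python sketch of per-partition support counting, in which worker threads count candidates over their own database slices without a per-iteration barrier. It only illustrates independent, partition-local counting; it is not the APM algorithm itself (APM additionally generates candidates dynamically and applies the two optimizations mentioned above), and all names below are illustrative.

```python
# Illustrative sketch: threads count candidate-itemset supports on their own
# database partitions independently; partial counts are merged at the end.
from collections import Counter
from itertools import combinations
from threading import Thread

def count_partition(partition, candidates, out):
    """Count, for one database partition, the transactions containing each candidate."""
    local = Counter()
    for tx in partition:
        txset = set(tx)
        for cand in candidates:
            if set(cand) <= txset:
                local[cand] += 1
    out.append(local)                       # list.append is atomic in CPython

transactions = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]]
partitions = [transactions[:2], transactions[2:]]          # one slice per "processor"
candidates = [tuple(c) for c in combinations("abcd", 2)]   # 2-itemset candidates

partial_counts = []
workers = [Thread(target=count_partition, args=(p, candidates, partial_counts))
           for p in partitions]
for w in workers:
    w.start()
for w in workers:
    w.join()

support = sum(partial_counts, Counter())    # merge partition-local counts
print(support.most_common(3))
```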
We have produced a suite of software tools which has proved to be highly valuable in the porting of large, complex relational database applications from conventional to parallel systems. These tools are being continually developed to extend their functionality and genericity over a range of platforms, and they provide a basis for the design, implementation, optimisation and management of any parallel database system. A considerable amount of the development has been performed using a Meiko Relational DataCache, a multi-SPARC, multi-transputer ORACLE system. In this paper we present a summary of our activities, beginning with a brief description of the Meiko system (as currently configured), followed by a more detailed account of some of the tools, and finishing with some general comments on future work.
In order to re-adjust the parallel execution of SQL queries in the case of metric estimation or discretization errors, we propose an incremental parallelization method which carries out scheduling and mapping simultaneously, in co-operation with two incremental memory allocation heuristics (ParAd: parallelism degree adjustment, and MaCRelax: mapping clues relaxation), in a dynamic multi-user context. The two incremental memory allocation heuristics are integrated in the mapping method, which attempts to avoid time-consuming multi-bucket join executions that generate numerous additional I/Os. A performance evaluation of the ParAd heuristic shows (i) significant join response time savings (from 16.11% to 35.62%) and (ii), with many complex queries, an even more significant gain in response time (from 29% to 54%). (C) 2002 Elsevier Science B.V. All rights reserved.
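A hedged sketch of what a parallelism-degree adjustment rule can look like, in the spirit of ParAd: raise the intra-operator degree of parallelism of a hash join until every node's fragment of the build relation fits in its memory budget, so the join does not degrade into a multi-bucket execution with extra I/O. The function name and the exact rule below are illustrative assumptions, not the paper's algorithm.

```python
# Illustrative heuristic (not ParAd itself): pick the smallest degree of
# parallelism whose per-node build fragment fits in the node memory budget.
import math

def adjust_parallelism_degree(build_size_mb, mem_per_node_mb,
                              max_nodes, min_degree=1):
    """Return (degree, fits_in_memory) for a parallel hash join."""
    needed = math.ceil(build_size_mb / mem_per_node_mb)
    degree = max(min_degree, needed)
    if degree > max_nodes:
        # Not enough nodes: accept a multi-bucket join on all available nodes.
        return max_nodes, False
    return degree, True

degree, in_memory = adjust_parallelism_degree(
    build_size_mb=4800, mem_per_node_mb=512, max_nodes=16)
print(degree, in_memory)   # 10 True: ten 512 MB fragments hold the 4800 MB build side
```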
We propose a new declustering scheme for allocating uniform multidimensional data among parallel disks. The scheme, aimed at reducing disk access time for range queries, is based on Golden Ratio Sequences for two dimensions and Kronecker Sequences for higher dimensions. Using exhaustive simulation, we show that, in two dimensions, the worst-case (additive) deviation of the scheme from the optimal response time for any range query is one when the number of disks M is at most 22; its worst-case deviation is two when M ≤ 94; and its worst-case deviation is four when M ≤ 550. In two dimensions, we prove that whenever M is a Fibonacci number, the average performance of the scheme is within 14 percent of the (generally unachievable) strictly optimal scheme, and its worst-case response time is within a multiplicative factor of three of the optimal response time for any query, and within a factor of 1.5 of the optimal for large queries. We also present comprehensive simulation results, on two-dimensional as well as higher-dimensional data, that compare our scheme with some recently proposed schemes in the literature and demonstrate its advantages.
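The assignment idea can be sketched as follows. This is an illustration of a golden-ratio-style declustering only: a permutation of the M disks is derived from the fractional parts of multiples of 1/φ, and each row of tiles is shifted by that permutation. It is not claimed to reproduce the exact mapping analyzed in the paper.

```python
# Illustrative golden-ratio-style declustering of a 2D tile grid over M disks.
import math

def golden_ratio_permutation(m):
    """Rank the fractional parts of i/phi, i = 0..m-1, to get a disk permutation."""
    inv_phi = 2 / (1 + math.sqrt(5))              # 1/phi ~= 0.618
    fracs = [(i * inv_phi) % 1.0 for i in range(m)]
    order = sorted(range(m), key=lambda i: fracs[i])
    rank = [0] * m
    for pos, i in enumerate(order):
        rank[i] = pos
    return rank

def disk_of(x, y, m, perm):
    """Disk for tile (x, y) on an m-disk array (illustrative assignment)."""
    return (perm[x % m] + y) % m

m = 5
perm = golden_ratio_permutation(m)
for y in range(4):                                # print a small 8x4 tile grid
    print([disk_of(x, y, m, perm) for x in range(8)])
```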
Cogset is a generic and efficient engine for reliable storage and parallel processing of distributed data sets. It supports a number of high-level programming interfaces, including a MapReduce interface compatible with Hadoop. In this paper, we present Cogset's architecture and evaluate its performance as a MapReduce engine, comparing it with Hadoop. Our results show that Cogset generally outperforms Hadoop by a significant margin. We investigate the underlying causes of this difference in performance and demonstrate some relatively minor modifications that markedly improve Hadoop's performance, closing some of the gap. Copyright (c) 2012 John Wiley & Sons, Ltd.
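For readers unfamiliar with the programming model behind the Hadoop-compatible interface, the following Python sketch shows the generic map/reduce contract on a word-count example. It is an in-process illustration of the model only, not Cogset's or Hadoop's actual API.

```python
# Generic map/reduce illustration: map each record to (key, value) pairs,
# group by key, then reduce each group.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for record in records:                                      # map phase
        for key, value in map_fn(record):
            groups[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}    # reduce phase

# Word count, the canonical MapReduce example.
lines = ["parallel data processing", "parallel databases"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda word, ones: sum(ones))
print(counts)    # {'parallel': 2, 'data': 1, 'processing': 1, 'databases': 1}
```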
Parallelizing I/O operations via effective declustering of data is becoming essential to scale up the performance of parallel databases and high-performance systems. Declustering has been shown to be an NP-complete problem in some contexts, and some heuristic methods have been proposed to solve it. However, most methods are not effective in several cases, such as queries with different access frequencies or data items with different sizes. In this paper, we propose a hypergraph model to formulate the declustering problem. Several interesting theoretical results are obtained by analyzing the proposed model, and the approach allows modeling a wide range of declustering problems. Furthermore, the hypergraph declustering model is used as the basis for developing new heuristic methods, including a greedy method and a hybrid declustering method. Experiments show that the proposed methods achieve better performance than several existing declustering methods.
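As an illustration of the kind of greedy heuristic such a model supports, the sketch below places each block on the disk that least increases a frequency-weighted estimate of query response time. The cost function and placement order are illustrative assumptions, not the paper's method.

```python
# Illustrative greedy declustering: queries are (frequency, block set) pairs,
# a query's cost is its most loaded disk, and blocks are placed greedily.
def greedy_decluster(blocks, queries, num_disks):
    """Return a block -> disk placement for the given query workload."""
    placement = {}
    load = [[0] * num_disks for _ in queries]    # per-query blocks on each disk

    def cost(block, disk):
        total = 0.0
        for qi, (freq, qblocks) in enumerate(queries):
            if block in qblocks:
                # Response time of query qi if the block lands on this disk.
                total += freq * max(load[qi][disk] + 1, max(load[qi]))
        return total

    # Place the most frequently queried blocks first (illustrative ordering).
    weight = {b: sum(f for f, qb in queries if b in qb) for b in blocks}
    for block in sorted(blocks, key=lambda b: -weight[b]):
        disk = min(range(num_disks), key=lambda d: cost(block, d))
        placement[block] = disk
        for qi, (_, qblocks) in enumerate(queries):
            if block in qblocks:
                load[qi][disk] += 1
    return placement

queries = [(0.6, {"b0", "b1", "b2"}), (0.4, {"b2", "b3"})]
print(greedy_decluster(["b0", "b1", "b2", "b3"], queries, num_disks=2))
```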