ISBN: (print) 9780769549569; 9781467362184
Workflows are widely used in data-intensive applications because they facilitate the composition of individual executables or scripts, giving domain experts easy-to-use parallelization. With the considerable popularity of the MapReduce framework, some researchers have started to develop MapReduce-enabled workflows instead of general file-based ones. Meanwhile, parallel database systems, which have been commercially available for large-scale data processing for nearly two decades, have also received wide attention as a platform for workflows. This paper studies three real-world text processing workflows and develops them on top of several large-scale data processing approaches: Hadoop (an open-source MapReduce implementation), ParaLite (a workflow-oriented parallel database system), and Hive (a hybrid of MapReduce and a parallel DBMS). We discuss the strengths and weaknesses of each approach in terms of both programmability and performance for each workflow. Our experiences and experimental results reveal some interesting trade-offs: (1) high-level query languages (SQL in ParaLite and HiveQL in Hive) are helpful for expressing data selection, aggregation, and calculation by typical executables; (2) to reuse existing NLP tools, it is often important to be able to track the association between a document and the annotation the tool attaches to it, for which the expressiveness of SQL is particularly useful; (3) the systems perform similarly on the overall workflows because running the executables takes most of the time, but some small differences reveal trade-offs that each system entails for workflows.
ISBN: (print) 9781467318556; 9781467318570
The parallel database system (PDS) offers high performance and high availability and is suitable for storing and processing massive data. However, data loading performance is a bottleneck in PDS. To improve it, this paper proposes an optimized load algorithm based on a cloud platform that speeds up the load process. The paper describes the algorithm, illustrates its implementation through an example, and finally discusses its correctness.
ISBN: (print) 9789532330953
The development of parallel database management systems is an urgent problem owing to the rapid growth of information volume. Today the basic principles of DBMS performance improvement include the use of multiprocessor systems [8]. At the same time, acceleration can be achieved by using new hardware architectures, such as hybrid clusters with manycore coprocessors. The adoption of such architectures is limited by the high cost of the hardware and its configuration. Therefore, developing models that can determine several characteristics and compare the runtimes of different database queries, without either using real hardware or accounting for exact execution details, is a highly topical problem. This paper describes the development of a mathematical model that explores the effectiveness of a new manycore accelerator with the Intel Xeon Phi Knights Landing hardware architecture for parallel database processing.