
SparkDQ: Efficient generic big data quality management on distributed data-parallel computation


Authors: Gu, Rong; Qi, Yang; Wu, Tongyu; Wang, Zhaokang; Xu, Xiaolong; Yuan, Chunfeng; Huang, Yihua

Author affiliations: Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China; Nanjing Univ, Dept Comp Sci & Technol, Nanjing, Peoples R China; Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Peoples R China; Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing, Peoples R China

Publication: Journal of Parallel and Distributed Computing

Year and volume: 2021, Vol. 156

Pages: 132-147


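Subject classification: 08 [Engineering]; 0812 [Engineering: Computer Science and Technology (engineering or science degrees may be awarded)]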

Funding: National Key R&D Program of China [2019YFC1711000]; National Natural Science Foundation of China [62072230, U1811461, 61702254]; Collaborative Innovation Center of Novel Software Technology and Industrialization; Alibaba Innovative Research Project

Keywords: Parallel data quality algorithms; Distributed system; Data quality management system; Multi-tasks scheduling; Big data

Abstract: In the big data era, large amounts of data are being generated and accumulated across industries. However, data quality issues often hinder users when they try to extract value from big data, so these issues are gaining more and more attention from data quality management analysts. Cutting-edge solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, making it difficult for them to cope with complicated situations on big data. These problems inspire us to build SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. Users can quickly build custom data quality computing tasks for various needs by utilizing these interfaces. In addition, SparkDQ implements a set of optimized algorithms that run in parallel and target various data quality goals. We also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-thread algorithms. Compared with the cutting-edge distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is only around half that of Apache Griffin on average. SparkDQ achieves near-linear data and node scalability. © 2021 Elsevier Inc. All rights reserved.
