SparkDQ: Efficient generic big data quality management on distributed data-parallel computation

Authors: Gu, Rong; Qi, Yang; Wu, Tongyu; Wang, Zhaokang; Xu, Xiaolong; Yuan, Chunfeng; Huang, Yihua

Affiliations: Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China; Nanjing Univ, Dept Comp Sci & Technol, Nanjing, Peoples R China; Nanjing Univ Aeronaut & Astronaut, Coll Comp Sci & Technol, Nanjing, Peoples R China; Nanjing Univ Informat Sci & Technol, Sch Comp & Software, Nanjing, Peoples R China

Publication: Journal of Parallel and Distributed Computing

Year and volume: 2021, Vol. 156

Pages: 132-147

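Subject classification: 08 [Engineering]; 0812 [Engineering: Computer Science and Technology (degrees in engineering or science may be conferred)]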

Funding: National Key R&D Program of China [2019YFC1711000]; National Natural Science Foundation of China [62072230, U1811461, 61702254]; Collaborative Innovation Center of Novel Software Technology and Industrialization; Alibaba Innovative Research Project

Keywords: Parallel data quality algorithms; Distributed system; Data quality management system; Multi-task scheduling; Big data

Abstract: In the big data era, large amounts of data are being generated and accumulated across industries. However, data quality issues often hinder users who try to extract value from big data, and these issues are drawing growing attention from data quality management analysts. Existing solutions such as data ETL, data cleaning, and data quality monitoring systems have many deficiencies in capability and efficiency, which makes it difficult for them to cope with complicated big data scenarios. These problems motivate SparkDQ, a generic distributed data quality management model and framework that provides a series of data quality detection and repair interfaces. Users can quickly build custom data quality computing tasks for various needs with these interfaces. In addition, SparkDQ implements a set of optimized parallel algorithms that target various data quality goals. The authors also propose several system-level optimizations, including job-level optimization with multi-task execution scheduling and data-level optimization with data state caching. The experimental evaluation shows that the proposed distributed algorithms in SparkDQ run up to 12 times faster than the corresponding stand-alone serial and multi-threaded algorithms. Compared with the cutting-edge distributed data quality solution Apache Griffin, SparkDQ has more features, and its execution time is only around half that of Apache Griffin on average. SparkDQ achieves near-linear data and node scalability. (C) 2021 Elsevier Inc. All rights reserved.
