Authors:
Chen, Junchao; Thou, Jian-tao; Hao, Xinyu
Inner Mongolia Univ
Natl & Local Joint Engn Res Ctr Intelligent Infor; Inner Mongolia Engn Lab Cloud Comp & Serv Softwar; Inner Mongolia Key Lab Social Comp & Data ProcCo; Hohhot, Peoples R China
Details
ISBN:
(Print) 9798350376975; 9798350376968
The accuracy of data analysis depends on data quality, and resolving data inconsistency is a key challenge in improving it. A constant conditional functional dependency (CCFD) ensures data consistency by enforcing bindings between semantically related values, thereby providing quality assurance for data analysis and decision-making. However, as data scale grows, particularly in the number of tuples and attributes, existing single-machine CCFD discovery algorithms suffer from low computational efficiency and long running times. This paper proposes a time-efficient distributed CCFD discovery algorithm (DCCFD). Optimized data preprocessing and index mapping improve the data organization structure, laying the foundation for discovering CCFDs in a distributed setting. The Spark parallel computing framework is used to partition the dataset, which accelerates parallel loading and processing of the data. In addition, the algorithm maintains both accuracy and processing speed by efficiently generating frequent itemsets and verifying CCFDs in parallel. Experiments on multiple real datasets, most notably the complex Airline dataset, show that DCCFD not only discovers CCFDs accurately but also reduces the average running time by 75.64% compared with the preCFDMiner algorithm.
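To make the notion of a CCFD concrete, the following is a minimal single-machine sketch in plain Python. It is not the DCCFD algorithm and does not use Spark. It treats a constant pattern on a set of attributes as an itemset, keeps only patterns whose support reaches a threshold, and reports a rule whenever all matching tuples agree on some other attribute. The function name `discover_constant_cfds`, the parameters `min_support` and `max_lhs`, and the toy table are all hypothetical, chosen for illustration only. The sketch does not prune non-minimal rules.

```python
from collections import defaultdict
from itertools import combinations


def discover_constant_cfds(rows, min_support=2, max_lhs=2):
    """Naively enumerate rules (X = x) -> (A = a) that hold on `rows`.

    Illustrative brute-force sketch only: no minimality pruning,
    no distribution, and not the method described in the abstract.
    """
    attrs = sorted(rows[0].keys())
    rules = []
    for size in range(1, max_lhs + 1):
        for lhs_attrs in combinations(attrs, size):
            # Group tuples by their constant pattern on the LHS attributes.
            groups = defaultdict(list)
            for row in rows:
                groups[tuple(row[a] for a in lhs_attrs)].append(row)
            for pattern, matched in groups.items():
                if len(matched) < min_support:
                    continue  # pattern is not a frequent itemset
                for rhs in attrs:
                    if rhs in lhs_attrs:
                        continue
                    # The rule holds if every matching tuple has the same RHS value.
                    values = {row[rhs] for row in matched}
                    if len(values) == 1:
                        lhs = tuple(zip(lhs_attrs, pattern))
                        rules.append((lhs, (rhs, values.pop())))
    return rules


# Hypothetical toy table, not taken from the paper's datasets.
rows = [
    {"country": "UK", "area_code": "131", "city": "Edinburgh"},
    {"country": "UK", "area_code": "131", "city": "Edinburgh"},
    {"country": "UK", "area_code": "020", "city": "London"},
    {"country": "US", "area_code": "131", "city": "Springfield"},
]
rules = discover_constant_cfds(rows, min_support=2, max_lhs=2)
for lhs, rhs in rules:
    print(lhs, "->", rhs)
```

On this toy table, area code 131 alone does not determine the city, because it appears under two countries, but the pair (country = UK, area_code = 131) does. Grouping tuples by pattern and checking each group independently is the kind of step that lends itself to partitioned, parallel execution.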