ISBN (print): 9783031640759; 9783031640766
This paper introduces a fast Spark-based solution for privacy-preserving frequent pattern mining on big data. The Spark Resilient Distributed Dataset (RDD) framework is used to implement the Mask approach, which applies probabilistic distortion to maintain data privacy while mining frequent patterns. The masking technique shows very promising results in terms of both privacy and utility. However, its sequential nature limits its application to small or medium-sized data. The proposed Spark-based technique introduces two-level parallelization, i.e., at the data level and at the algorithmic level, which yields faster analytical results in a bounded amount of time when dealing with large volumes of data. This makes the approach feasible given the current growth in data size. A number of experiments have been conducted to compare the performance of the proposed scheme with benchmark parallel approaches in terms of privacy, utility, and time complexity over real and simulated data sets. The proposed scheme is observed to preserve the privacy of sensitive data while maintaining utility within a bounded amount of time. Experiments show that the proposed Spark-based scheme, S-Mask, achieves a 16-fold speedup on average over different benchmark data sets and maintains a desired ratio between privacy and utility of the data.
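The probabilistic distortion idea can be illustrated with a minimal single-machine sketch. It assumes the commonly described form of the scheme, in which each bit of a boolean transaction is kept with probability p and flipped with probability 1 - p, and the true support of a single item is then estimated from the distorted data. The function names `distort` and `estimate_support` are hypothetical, and this is not the paper's Spark implementation.

```python
import random

def distort(rows, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [[bit if rng.random() < p else 1 - bit for bit in row] for row in rows]

def estimate_support(distorted_rows, item, p):
    """Estimate the true fraction of 1s in column `item` from distorted rows.

    If s is the true support, the expected distorted support is
    s * p + (1 - s) * (1 - p); solving for s gives the estimator below.
    Assumes p != 0.5, otherwise the distortion destroys all information.
    """
    observed = sum(row[item] for row in distorted_rows) / len(distorted_rows)
    return (observed - (1 - p)) / (2 * p - 1)

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic single-item data with a true support of about 0.3 (illustrative only).
    rows = [[1 if rng.random() < 0.3 else 0] for _ in range(20000)]
    masked = distort(rows, p=0.9, rng=rng)
    print(estimate_support(masked, item=0, p=0.9))
```

Because each row is distorted independently, the distortion step is a natural candidate for a data-parallel map over partitions, which is the kind of data-level parallelism the abstract refers to.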
This paper introduces a new theoretical scheme for solving the frequent itemset hiding problem. We propose an algorithmic approach consisting of a novel constraint-based hiding model, which integrates hiding into one-pass mining, along with a solution methodology that relies on linear programming. The patterns induced by the constraint-based mining algorithm are used to build a minimal linear program whose solution dictates the construction of a database extension that delivers the sought-for hiding. This extension is appended to the original database and the two are released as a whole for mining, so that the resulting extended database hides the sensitive knowledge we want to protect. The proposed approach outperforms, in both space complexity and accuracy, the existing approaches proposed so far in this domain, and we demonstrate this superiority through a series of experiments against other existing approaches. Our proposal sheds new light on algorithmic techniques that can be readily applied to model hiding problems, providing solutions that computationally outperform all existing modeling approaches for hiding.
ISBN (print): 9781450331968
In this paper, we address the problem of mining sequential patterns under multiple constraints. Unlike classical algorithms, our approach handles various types of constraints, which are not only numeric but also symbolic and syntactic. These multiple constraints enable us to express a broad range of knowledge in order to focus on interesting patterns. We illustrate our approach with the detection of relationships between genes and rare diseases in biomedical texts, for the documentation of rare diseases.
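As an illustration of what a numeric constraint on a sequential pattern can look like, the sketch below checks whether a pattern occurs in a token sequence with a bounded gap between consecutive matched items, and counts its support over a set of sequences. The gap constraint, the function names `occurs` and `support`, and the toy sentences are assumptions made for illustration; the abstract does not specify the paper's constraint language or algorithm.

```python
def occurs(pattern, sequence, max_gap):
    """True if `pattern` is a subsequence of `sequence` such that at most
    `max_gap` items are skipped between two consecutive matched items."""
    def search(p_idx, start, limit):
        if p_idx == len(pattern):
            return True
        end = len(sequence) if limit is None else min(len(sequence), limit)
        for pos in range(start, end):
            if sequence[pos] == pattern[p_idx]:
                # The next item must match within the following max_gap + 1 positions.
                if search(p_idx + 1, pos + 1, pos + 2 + max_gap):
                    return True
        return False
    return search(0, 0, None)

def support(pattern, sequences, max_gap):
    """Number of sequences that contain the pattern under the gap constraint."""
    return sum(occurs(pattern, seq, max_gap) for seq in sequences)

if __name__ == "__main__":
    sentences = [
        ["gene", "is", "linked", "to", "disease"],
        ["gene", "causes", "disease"],
        ["disease", "and", "gene"],
    ]
    print(support(["gene", "disease"], sentences, max_gap=1))
```

A backtracking search is used rather than a greedy left-to-right match, because with a gap bound the first occurrence of an item is not always the one that leads to a valid match.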
We are designing new data mining techniques on boolean contexts to identify a priori interesting bi-sets, i.e., sets of objects (or transactions) and associated sets of attributes (or items). This improves the state of the art in many application domains where transactional/boolean data are to be mined (e.g., basket analysis, WWW usage mining, gene expression data analysis). The so-called (formal) concepts are important special cases of a priori interesting bi-sets that associate closed sets on both dimensions thanks to the Galois operators. Concept mining in boolean data is tractable provided that at least one of the dimensions (number of objects or attributes) is small enough and the data is not too dense. The task is extremely hard otherwise. Furthermore, it is important to enable user-defined constraints on the desired bi-sets and to use them during the extraction to increase both the efficiency and the a priori interestingness of the extracted patterns. This leads us to the design of a new algorithm, called D-Miner, for mining concepts under constraints. We provide an experimental validation on benchmark data sets. Moreover, we introduce an original data mining technique for microarray data analysis. In addition to recording boolean expression properties of genes, we add biological information about transcription factors. In such a context, D-Miner can be used for concept mining under constraints and outperforms the other studied algorithms. We also show that data enrichment is useful for evaluating the biological relevance of the extracted concepts.
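The notion of a formal concept and the two Galois operators can be made concrete with a small brute-force sketch. It is not D-Miner: it enumerates every attribute subset, so it is exponential in the number of attributes, and the size constraint is applied as a filter after closure rather than pushed into the extraction. The names `extent`, `intent`, `concepts`, and `min_objects` are hypothetical.

```python
from itertools import combinations

def extent(context, attrs):
    """Galois operator: objects that have every attribute in `attrs`."""
    return frozenset(o for o, has in context.items() if attrs <= has)

def intent(context, objs, all_attrs):
    """Galois operator: attributes shared by every object in `objs`."""
    shared = set(all_attrs)
    for o in objs:
        shared &= context[o]
    return frozenset(shared)

def concepts(context, min_objects=0):
    """Enumerate formal concepts (closed object set, closed attribute set)
    by closing every attribute subset; keep those with enough objects."""
    all_attrs = frozenset().union(*context.values())
    found = set()
    for k in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), k):
            objs = extent(context, frozenset(subset))
            if len(objs) >= min_objects:
                found.add((objs, intent(context, objs, all_attrs)))
    return found

if __name__ == "__main__":
    # Toy boolean context: object -> set of attributes it has (illustrative only).
    ctx = {"o1": {"a", "b"}, "o2": {"a", "c"}, "o3": {"a", "b", "c"}}
    for objs, attrs in sorted(concepts(ctx, min_objects=2), key=lambda c: sorted(c[1])):
        print(sorted(objs), sorted(attrs))
```

The exponential enumeration is exactly why the abstract notes that concept mining is tractable only when one dimension is small and the data is not too dense, and why using constraints during extraction matters.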
ISBN (print): 9787302139225
The mechanism of gene regulation is of great interest to biologists, especially in the field of genomics. One of the mechanisms controlling gene expression is provided by transcription factors, which are proteins that can either repress or stimulate the transcription of a gene. In this paper, we propose a new data mining algorithm, based on boolean contexts, to extract a priori relevant frequent closed gensets, i.e., sets of tissues and associated sets of genes and transcription factors that are useful for the biologist. The key feature of our algorithm is a better compromise between the size of the search space and the knowledge conveyed by the discovered patterns in bioinformatics. To this end, the proposed algorithm, called MC(2)G for Mining Constraint Closed Gensets, uses the Frequent Pattern Tree (FP-Tree) structure, which is an extended prefix-tree structure, to prune the search space. Moreover, MC(2)G allows statistical and syntactic constraints to be defined on the desired frequent closed gensets and uses them during the extraction process. Experimental comparisons with other algorithms are carried out on real-world datasets. http://***/stamp/***?arnumber=4281879
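For readers unfamiliar with the FP-Tree, the following minimal sketch builds the standard structure: infrequent items are dropped, the remaining items of each transaction are ordered by descending frequency, and the ordered transactions are inserted into a prefix tree with per-node counts. It is not MC(2)G, it omits the header-table links used by FP-growth style miners, and the names `Node` and `build_fp_tree` as well as the toy data are hypothetical.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, how many transactions pass through it, and its children."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First pass: count in how many transactions each item appears.
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, c in freq.items() if c >= min_support}
    root = Node(None)
    # Second pass: insert each filtered, frequency-ordered transaction.
    for t in transactions:
        # Descending frequency, ties broken by item name for determinism.
        items = sorted((i for i in set(t) if i in keep), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

if __name__ == "__main__":
    # Toy data: each transaction lists genes and transcription factors (illustrative only).
    data = [["g1", "g2", "tf1"], ["g1", "g2"], ["g1", "tf1"], ["g3"]]
    tree = build_fp_tree(data, min_support=2)
    print({item: child.count for item, child in tree.children.items()})
```

Ordering items by frequency makes transactions share long prefixes, which is what lets the tree compress the data and reduce the space a miner has to explore.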