Dear editor, This letter presents an unsupervised feature selection method based on machine *** selection is an important component of artificial intelligence and machine learning, which can effectively solve the curse of dimensionality *** most of the labeled data is expensive to obtain.
Local differential privacy (LDP) approaches to collecting sensitive information for frequent itemset mining (FIM) can reliably guarantee *** current approaches to FIM under LDP add "padding and sampling" steps to obtain frequent itemsets and their frequencies because each user transaction represents a set of *** current state-of-the-art approach, namely set-value itemset mining (SVSM), must balance variance and bias to achieve accurate ***, an unbiased FIM approach with lower variance is highly *** narrow this gap, we propose an Item-Level LDP frequency oracle approach, named the Integrated-with-Hadamard-Transform-Based Frequency Oracle (IHFO). For the first time, Hadamard encoding is introduced to a set of values to encode all items into a fixed vector, and perturbation can be subsequently applied to the *** FIM approach, called optimized united itemset mining (O-UISM), is proposed to combine the padding-and-sampling-based frequency oracle (PSFO) and the IHFO into a framework for acquiring accurate frequent itemsets with their ***, we theoretically and experimentally demonstrate that O-UISM significantly outperforms the extant approaches in finding frequent itemsets and estimating their frequencies under the same privacy guarantee.
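To make the idea of a Hadamard-based frequency oracle concrete, the following is a minimal sketch of a generic single-item Hadamard randomized-response oracle. It is not the IHFO or O-UISM described in the abstract, whose details are not given here. The function names, the parameter `m` (a power of two at least as large as the item domain), and the single-item-per-user setting are all assumptions made for illustration.

```python
import math
import random


def hadamard_entry(row, col):
    """Entry of the Sylvester Hadamard matrix: (-1)^popcount(row & col)."""
    return -1 if bin(row & col).count("1") % 2 else 1


def perturb(value, m, epsilon, rng):
    """Client side (hypothetical): report one randomly chosen, noisy Hadamard coefficient.

    m must be a power of two with m > value. The row index is sampled
    independently of the value, so only the sign bit carries private information.
    """
    row = rng.randrange(m)
    bit = hadamard_entry(row, value)
    keep_probability = math.exp(epsilon) / (math.exp(epsilon) + 1)
    if rng.random() >= keep_probability:
        bit = -bit  # binary randomized response on the sign
    return row, bit


def estimate_frequencies(reports, domain_size, epsilon):
    """Server side (hypothetical): unbiased frequency estimate for each item."""
    scale = (math.exp(epsilon) + 1) / (math.exp(epsilon) - 1)
    n = len(reports)
    return [
        scale * sum(bit * hadamard_entry(row, item) for row, bit in reports) / n
        for item in range(domain_size)
    ]
```

The estimator is unbiased because distinct Hadamard columns are orthogonal, so contributions from users holding a different item cancel in expectation. The factor `scale` undoes the shrinkage introduced by the sign flip.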
This paper proposes a new method to cluster law texts based on the referential relations between laws. We extract law entities (an entity represents a law) and their referential relations from law texts. Then SimRank algorithm i...
In this paper we present a new approach for the automatic identification of domain-relevant concepts and entities of a given domain using the category and page structures of Wikipedia in a language-independent way...
Data partitioning techniques are pivotal for optimal data placement across storage devices, thereby enhancing resource utilization and overall system ***, the design of effective partition schemes faces multiple challenges, including considerations of the cluster environment, storage device characteristics, optimization objectives, and the balance between partition quality and computational ***, dynamic environments necessitate robust partition detection *** paper presents a comprehensive survey structured around partition deployment environments, outlining the distinguishing features and applicability of various partitioning strategies while delving into how these challenges are *** discuss partitioning features pertaining to database schema, table data, workload, and runtime *** then delve into the partition generation process, segmenting it into initialization and optimization stages. A comparative analysis of partition generation and update algorithms is provided, emphasizing their suitability for different scenarios and optimization ***, we illustrate the applications of partitioning in prevalent database products and suggest potential future research directions and *** survey aims to foster the implementation, deployment, and updating of high-quality partitions for specific system scenarios.
Chinese radicals play important roles in forming the semantic meaning of Chinese characters. The semantic properties of radicals make them a promising source of information to be analyzed in text mining and content extr...
The processing of XML queries can require the evaluation of various structural relationships. Efficient algorithms for evaluating ancestor-descendant and parent-child relationships have been proposed. However, the problems of evaluating preceding-sibling/following-sibling and preceding/following relationships are still open. In this paper, we study the structural join and the staircase join for the sibling relationship. First, we introduce the idea of using the parent's structural information to filter out and minimize unnecessary reads of elements, which can accelerate structural joins for both parent-child and preceding-sibling/following-sibling relationships. Second, we propose two efficient structural join algorithms for the sibling relationship. These algorithms lead to optimal join performance: nodes that do not participate in the join can be identified beforehand and then skipped using a B+-tree index. In addition, each joined element list is scanned sequentially at most once, and the join results are output in document order. We also discuss the staircase join algorithm for sibling axes. Our study shows that the staircase join for sibling axes is close to the structural join for sibling axes and shares the same high efficiency. Our experimental results demonstrate both the effectiveness of our optimization techniques for sibling axes and the efficiency of our algorithms. As far as we know, this is the first work that specifically addresses this problem.
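As an illustration of what a following-sibling join computes, here is a small baseline sketch. It is not one of the two algorithms proposed in the paper and performs no B+-tree skipping. It assumes a hypothetical region encoding in which each element stores its `start` and `end` positions plus the `start` position of its parent, and that both input lists are already sorted in document order.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    start: int   # position of the opening tag in document order
    end: int     # position of the closing tag
    parent: int  # start position of the parent element (assumed encoding)


def following_sibling_join(left, right):
    """Return (a, b) pairs where b is a following sibling of a.

    Both inputs must be sorted by start. The output is sorted by a.start,
    then by b.start, i.e. in document order.
    """
    # Group the right-hand list by parent; each group stays sorted by start.
    by_parent = defaultdict(list)
    for node in right:
        by_parent[node.parent].append(node)
    starts = {p: [n.start for n in nodes] for p, nodes in by_parent.items()}

    result = []
    for a in left:
        siblings = by_parent.get(a.parent)
        if not siblings:
            continue  # the parent information rules this node out early
        first = bisect_right(starts[a.parent], a.start)
        result.extend((a, b) for b in siblings[first:])
    return result
```

Grouping by parent reflects the abstract's observation that the parent's structural information can rule out elements before any pairwise comparison is made.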
ISBN (print): 9783642235344; 9783642235351
Duplicate detection is well recognized as a crucial task for improving data quality. Related work on this problem mainly proposes efficient approaches for a single machine. However, as data volume increases, the performance of duplicate identification is still far from satisfactory. Hence, we address the problem of duplicate detection over MapReduce, a shared-nothing paradigm. We argue that the performance of MapReduce-based duplicate detection depends mainly on the number of candidate record pairs. In this paper, we propose a new signature scheme with a new pruning strategy over MapReduce to minimize the number of candidate record pairs. Our experimental results on both real and synthetic datasets demonstrate that the proposed signature-based method is efficient and scalable.
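The abstract does not specify the signature scheme, so the sketch below substitutes a well-known alternative, prefix filtering for Jaccard similarity, to show how signatures limit the number of candidate pairs in a map/reduce-style pipeline. The map and reduce phases are simulated in a single process, and all function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def map_phase(records, threshold):
    """Emit (signature, record_id) pairs; the signatures are prefix tokens."""
    for rid, tokens in records.items():
        ordered = sorted(set(tokens))  # any fixed global token order works
        prefix_len = len(ordered) - math.ceil(threshold * len(ordered)) + 1
        for token in ordered[:prefix_len]:
            yield token, rid


def reduce_phase(pairs):
    """Group record ids by signature and pair up records that share one."""
    groups = defaultdict(set)
    for signature, rid in pairs:
        groups[signature].add(rid)
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))
    return candidates


def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_duplicates(records, threshold):
    """Verify only the candidate pairs produced by the signature step."""
    candidates = reduce_phase(map_phase(records, threshold))
    return sorted(
        (x, y) for x, y in candidates
        if jaccard(records[x], records[y]) >= threshold
    )
```

Only records that share at least one prefix token are ever compared, which is how a signature scheme of this kind reduces the candidate set that the abstract identifies as the main cost.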
ISBN (print): 9783642258558
Correctly predicting a new user's reaction to a recommended candidate partner is critical to improving recommendation accuracy in online dating systems. However, the new user (cold start) problem and the data sparseness problem in online dating systems make this task very challenging. In this paper, we propose a hybrid method called crowd wisdom based behavior prediction to solve these two problems and achieve good prediction accuracy. In this method, old users who have been recommended partners before are first separated into groups, such that users in each group have similar preferences for partners. Then, we propose a novel measure that combines the collective behavior of a group's users to predict one user's behavior, which addresses the data sparseness problem. By calculating the probability that a new user belongs to each group and utilizing each group's behavior, we address the new user problem. Based on these strategies, we develop a behavior prediction algorithm for new users. Experimental results on a real online dating dataset show that our proposed method performs better than other traditional methods.
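The final combination step can be sketched as a membership-weighted average of group behavior. This is only an assumed form: the abstract does not define the paper's measure or explain how the membership probabilities are computed, so both are taken as inputs here, and the function and parameter names are hypothetical.

```python
def predict_positive_reaction(membership, group_rates):
    """Mix per-group positive-reaction rates using membership probabilities.

    membership:  group id -> probability that the new user belongs to the group
                 (assumed to sum to 1)
    group_rates: group id -> observed rate at which the group's users reacted
                 positively to a given kind of candidate (assumed input)
    """
    return sum(
        probability * group_rates.get(group, 0.0)
        for group, probability in membership.items()
    )
```

Because the prediction relies on group-level statistics rather than the new user's own history, it remains defined even when that user has no recorded interactions.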
Implementing runtime integrity measurement in an acceptable way is a big challenge. We tackle this challenge by developing a framework called Patos. This paper discusses the design and implementation concepts of our o...