Author:
Sridhar, S.
Anna Univ, Dept Informat Sci & Technol, Chennai 600025, Tamil Nadu, India
This paper investigates the use of data mining techniques to improve the accuracy of diagnostic systems. Data mining algorithms aim to discover patterns and extract useful knowledge from facts recorded in databases. Expert systems are generally constructed to automate diagnostic procedures. The learning component uses data mining algorithms to extract the expert system rules from the database automatically, which can assist clinicians in extracting knowledge. As the number and variety of data sources are increasing dramatically, another way to acquire knowledge from databases is to apply various data mining algorithms that extract knowledge from data. Because data sets are inherently distributed, the distributed system uses agents to transport the trained classifiers and uses meta-learning to combine the knowledge. Commonsense reasoning is also used in association with distributed data mining to obtain better results. Combining human expert knowledge with knowledge obtained by data mining improves the performance of the diagnostic system. This work suggests a framework for combining human knowledge with the knowledge gained by better data mining algorithms on a renal and gallstone data set.
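The combination step described above (classifiers trained at separate sites, then merged by meta-learning) can be sketched as a simple majority vote. This is only an illustration under stated assumptions: the abstract does not say which meta-learning scheme is used, and the three site classifiers, their thresholds, and the labels below are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# A classifier is modeled as any callable mapping a feature vector to a label.
Classifier = Callable[[Sequence[float]], str]


def majority_vote(classifiers: Sequence[Classifier], x: Sequence[float]) -> str:
    """Combine the predictions of independently trained classifiers by voting."""
    votes = Counter(clf(x) for clf in classifiers)
    # With an odd number of binary classifiers there can be no tie.
    return votes.most_common(1)[0][0]


# Hypothetical classifiers, standing in for models trained at three sites
# and shipped to a central combiner (thresholds are made up).
def site_a(x: Sequence[float]) -> str:
    return "positive" if x[0] > 1.2 else "negative"


def site_b(x: Sequence[float]) -> str:
    return "positive" if x[1] > 40 else "negative"


def site_c(x: Sequence[float]) -> str:
    return "positive" if x[0] + x[1] / 100 > 1.5 else "negative"


if __name__ == "__main__":
    print(majority_vote([site_a, site_b, site_c], (1.5, 30)))
```

Voting is the simplest combiner; a stacked meta-classifier trained on the base predictions would be another option, at the cost of needing a shared validation set.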
Huge amounts of data are collected in numerous independent data storage facilities around the world. However, how the data is distributed between physical locations remains unspecified. Downloading all of the data for the purpose of processing is undesirable and sometimes even impossible. Various methods have been proposed for performing data mining tasks, but the main problem is the lack of an objective strategy for comparing them. The authors present current research on a novel evaluation platform for distributed data mining (DDM) algorithms. The proposed platform makes it possible to evaluate algorithms not only in terms of result quality, data transfer, and speed, but also under a non-uniform data distribution among independent nodes. This work introduces the term "data partitioning strategy", which refers to a specific, not necessarily uniform, data distribution. A brief evaluation of three clustering algorithms is also reported, showing how usable the platform is and how easily it reveals differences in processing.
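A "data partitioning strategy" in the sense above can be illustrated with a short sketch that spreads records over nodes according to non-uniform weights. The function name `partition`, the weights, and the seed are assumptions made for illustration; the platform's actual interface is not described in the abstract.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def partition(records: Sequence[T], weights: Sequence[float], seed: int = 0) -> List[List[T]]:
    """Assign each record to one node, with probability proportional to the node's weight."""
    rng = random.Random(seed)  # seeded so that an evaluation run is repeatable
    nodes: List[List[T]] = [[] for _ in weights]
    for rec in records:
        idx = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        nodes[idx].append(rec)
    return nodes


if __name__ == "__main__":
    # Hypothetical skewed strategy: the first node receives most of the data.
    parts = partition(list(range(1000)), weights=[0.7, 0.2, 0.1])
    print([len(p) for p in parts])
```

Running the same algorithm over several such strategies, from uniform to heavily skewed, is one way to expose sensitivity to the data distribution.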
ISBN: (print) 9798400701245
Maximizing a monotone submodular function is a fundamental task in data mining, machine learning, economics, and statistics. In this paper, we present two communication-efficient decentralized online algorithms for the monotone continuous DR-submodular maximization problem, both of which reduce the number of per-function gradient evaluations and the per-round communication complexity from $T^{3/2}$ to $1$. The first one, One-shot Decentralized Meta-Frank-Wolfe (Mono-DMFW), achieves a $(1 - 1/e)$-regret bound of $O(T^{4/5})$. As far as we know, this is the first one-shot and projection-free decentralized online algorithm for monotone continuous DR-submodular maximization. Next, inspired by the non-oblivious boosting function [29], we propose the Decentralized Online Boosting Gradient Ascent (DOBGA) algorithm, which attains a $(1 - 1/e)$-regret of $O(\sqrt{T})$. To the best of our knowledge, this is the first result to obtain the optimal $O(\sqrt{T})$ regret against a $(1 - 1/e)$-approximation with only one gradient inquiry for each local objective function per step. Finally, various experimental results confirm the effectiveness of the proposed methods.
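For readers unfamiliar with the term, the $(1 - 1/e)$-regret mentioned above is usually defined as follows. This is the generic single-learner form, given here as an assumption; the paper's decentralized definition may differ in detail, and the symbols $\mathcal{K}$ (feasible set), $f_t$ (objective revealed in round $t$), and $x_t$ (decision played in round $t$) are introduced only for this illustration.

```latex
% Standard alpha-regret with alpha = 1 - 1/e (generic form, assumed for illustration)
\[
  \mathcal{R}_{\alpha}(T)
  \;=\; \alpha \max_{x \in \mathcal{K}} \sum_{t=1}^{T} f_t(x)
  \;-\; \sum_{t=1}^{T} f_t(x_t),
  \qquad \alpha = 1 - \frac{1}{e}.
\]
```

A bound of $O(\sqrt{T})$ on this quantity means the average per-round gap to the $(1 - 1/e)$-approximate benchmark shrinks at rate $O(1/\sqrt{T})$, whereas $O(T^{4/5})$ gives the slower rate $O(T^{-1/5})$.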
ISBN: (print) 9780769551098
The quantity of data that is captured, collected, and stored by a wide variety of organizations is growing at an exponential rate. The potential for such data to support scientific discovery and optimization of existing systems is significant, but only if it can be integrated and analyzed in a meaningful way by a wide range of investigators. While many believe that data sharing is desirable, there are also privacy and security concerns, rooted in ethics and the law, that often prevent many legitimate and noteworthy applications. In this talk, we will provide an overview of research on how to integrate and mine large amounts of privacy-sensitive distributed data without violating such constraints. In particular, we will discuss how to incentivize data sharing in privacy-preserving distributed data mining applications. This work will draw upon examples from the biomedical domain and discuss recent research on privacy-preserving mining of genomic databases.
ISBN: (print) 9789897581311
Collaborative data mining has become very useful today with the immense increase in the amount of data collected and the increase in competition. This in turn increases the need to preserve the participants' privacy. A number of approaches have been proposed that use secret sharing for privacy preservation in Secure Multiparty Computation (SMC) in different setups and applications. The different multiparty scenarios may have parties that are semi-honest, rational, or malicious. A number of approaches have been proposed for semi-honest parties in this setup. The problem, however, is that in reality we have to deal with parties that are rational and act in their own self-interest. These rational parties may try to attain maximum gain without disrupting the protocol. Moreover, if cautioned, these parties would correct themselves to obtain maximum individual gain in the future. We therefore propose a new, practical game-theoretic approach with three novel punishment policies, whose primary advantage is that it avoids expensive techniques such as homomorphic encryption. Our proposed approach is applicable to the secret sharing scheme among rational parties in distributed data mining. We have analysed the proposed punishment policies theoretically. We have also implemented our scheme in Java and evaluated it empirically. We compare the proposed punishment policies in terms of the number of rounds required to attain the Nash equilibrium, with eventually no bad rational nodes, for different percentages of initial bad nodes.
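The secret-sharing building block mentioned above can be sketched as a secure sum with additive shares. This is an assumption-laden illustration: the abstract does not name the sharing scheme, the modulus `PRIME` and all function names are hypothetical, and the sketch covers only well-behaved parties, without the game-theoretic punishment policies the paper contributes.

```python
import secrets
from typing import List

# Hypothetical public modulus; all arithmetic is done modulo this prime.
PRIME = 2**61 - 1


def share(secret: int, n_parties: int) -> List[int]:
    """Split a value into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recover the shared value by adding all shares modulo PRIME."""
    return sum(shares) % PRIME


def secure_sum(private_values: List[int]) -> int:
    """Compute the sum of all private inputs without revealing any single input."""
    n = len(private_values)
    # Party i splits its value into n shares and sends share j to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j adds up the shares it received and publishes only that partial sum.
    partial = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return reconstruct(partial)


if __name__ == "__main__":
    print(secure_sum([12, 7, 30]))  # 49
```

Each individual share is uniformly random, so a party learns nothing about another party's input from the shares it receives; only the published partial sums, taken together, reveal the total.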
ISBN: (print) 9781479928293
In the context of processing high volumes of data, recent developments have led to numerous models and frameworks for distributed processing running on clusters of commodity hardware. On the other hand, the Graphics Processing Unit (GPU) has seen much enthusiastic development as a device for general-purpose, intensive parallel computation. In this paper we propose a framework that combines both approaches, and we evaluate the relevance of having nodes in a distributed processing cluster that use GPUs for further fine-grained parallel processing. We have engineered parallel and distributed versions of two data mining problems, the naive Bayes classifier and the k-means clustering algorithm, to run on the framework and have evaluated the performance gain. Finally, we also discuss the requirements and perspectives of integrating GPUs in a distributed processing cluster, introducing a fully distributed heterogeneous computing cluster.
ISBN: (digital) 9783031001260
ISBN: (print) 9783031001260; 9783031001253
Federated learning (FL) is a rapidly growing privacy-preserving collaborative machine learning paradigm. In practical FL applications, local data from each data silo reflect local usage patterns. Therefore, there exists heterogeneity of data distributions among data owners (a.k.a. FL clients). If not handled properly, this can lead to model performance degradation. This challenge has inspired the research field of heterogeneous federated learning, which currently remains open. In this paper, we propose a data heterogeneity-robust FL approach, FEDGSP, to address this challenge by leveraging a novel concept of dynamic Sequential-to-Parallel (STP) collaborative training. FEDGSP assigns FL clients to homogeneous groups to minimize the overall distribution divergence among groups, and increases the degree of parallelism by reassigning more groups in each round. It also incorporates a novel Inter-Cluster Grouping (ICG) algorithm to assist in group assignment, which uses the centroid equivalence theorem to simplify the NP-hard grouping problem and make it solvable. Extensive experiments have been conducted on the non-i.i.d. FEMNIST dataset. The results show that FEDGSP improves accuracy by 3.7% on average compared with seven state-of-the-art approaches, and reduces the training time and communication overhead by more than 90%.
The privacy issues arising in data mining applications have attracted much attention. In the context of distributed data mining, a major concern of a participant is that its private information may be disclosed to other participants or to a third party. To protect privacy, one can apply a differential privacy approach to perturb the data before sharing them with others, which generally has a negative effect on the mining result. Thus there is a trade-off between privacy and the quality of the mining result. In this paper, we study a distributed classification scenario in which a mediator builds a classifier based on the perturbed query results returned by a number of users. We propose a game-theoretic approach to analyze how users choose their privacy budgets. Specifically, interactions among users are modeled as a game in satisfaction form, and an algorithm is proposed for users to learn the satisfaction equilibrium (SE) of the game. Experimental results demonstrate that, when the differences among users' expectations are not significant, the proposed learning algorithm can converge to an SE, at which every user achieves a balance between the accuracy of the classifier and the preserved privacy.
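The trade-off described above can be made concrete with a small sketch of a user perturbing a count query before returning it to the mediator. The Laplace mechanism is assumed here as the perturbation method (the abstract says only that a differential privacy approach is used), and the function names, the sensitivity of 1, and the budget values are hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturbed_count(true_count: float, epsilon: float, rng: random.Random,
                    sensitivity: float = 1.0) -> float:
    """Answer a count query under privacy budget epsilon (Laplace mechanism)."""
    # A smaller budget gives a larger noise scale: more privacy, less accuracy.
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Estimate the average distortion a given privacy budget causes."""
    rng = random.Random(seed)
    return sum(abs(perturbed_count(100, epsilon, rng) - 100) for _ in range(trials)) / trials


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(eps, round(mean_abs_error(eps), 3))
```

The expected absolute error equals the noise scale, sensitivity / epsilon, which is the quantity each user implicitly trades against classifier accuracy when picking a budget.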
Distributed data mining implements techniques for analyzing data on distributed computing systems by exploiting data distribution and parallel algorithms. The grid is a computing infrastructure for implementing distributed high-performance applications and solving complex problems, offering effective support for the implementation and use of data mining and knowledge discovery systems. The Web Services Resource Framework has become the standard for the implementation of grid services and applications, and it can be exploited for developing high-level services for distributed data mining applications. This paper describes how distributed data mining patterns, such as collective learning, ensemble learning, and meta-learning models, can be implemented as Web Services Resource Framework mining services by exploiting the grid infrastructure. The goal of this work was to design a distributed architectural model that can be exploited for different distributed mining patterns deployed as grid services for the analysis of dispersed data sources. To validate this approach, we also presented the implementation of two clustering algorithms on the developed architecture. In particular, distributed k-means and distributed expectation maximization were used as pilot examples to show the suitability of the implemented service-oriented framework. An extensive evaluation of its performance was provided. Copyright (c) 2011 John Wiley & Sons, Ltd.
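The distributed k-means pattern named above can be sketched as follows: each node summarizes its local points as per-cluster sums and counts, and a coordinator merges those summaries to update the centroids. This is an in-process illustration under stated assumptions, not the paper's Web Services Resource Framework implementation; all function names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Stats = Tuple[List[List[float]], List[int]]


def local_stats(points: Sequence[Point], centroids: Sequence[Point]) -> Stats:
    """On one node: per-cluster coordinate sums and counts for the local points."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Index of the nearest centroid by squared Euclidean distance.
        j = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(stats: Sequence[Stats], centroids: Sequence[Point]) -> List[List[float]]:
    """On the coordinator: combine node summaries into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    updated = []
    for j in range(k):
        total = sum(counts[j] for _, counts in stats)
        if total == 0:
            updated.append(list(centroids[j]))  # keep an empty cluster where it was
        else:
            updated.append([sum(sums[j][d] for sums, _ in stats) / total for d in range(dim)])
    return updated


def distributed_kmeans(node_data, centroids, rounds: int = 10) -> List[List[float]]:
    """Run a fixed number of rounds; only summaries leave the nodes, never raw points."""
    for _ in range(rounds):
        stats = [local_stats(points, centroids) for points in node_data]
        centroids = merge_and_update(stats, centroids)
    return [list(c) for c in centroids]


if __name__ == "__main__":
    nodes = [[(0, 0), (10, 10)], [(0, 2), (10, 12)]]
    print(distributed_kmeans(nodes, [[1.0, 1.0], [9.0, 9.0]]))  # [[0.0, 1.0], [10.0, 11.0]]
```

Because sums and counts are additive, the merged update is identical to running k-means on the pooled data, while the communication per round is proportional to the number of clusters rather than the number of points.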
ISBN: (print) 9781665408981
Acquiring high-quality misspelling data at large scale to train quality search speller models is key but challenging. Synthetic data generation approaches, a major focus in the literature, are usually linguistics-dependent, hard to adapt across domains, and centered around empirical choices based on error patterns generalized from limited annotated datasets. Mining-based approaches, on the other hand, are not sufficiently studied and do not ensure ground-truth corrections. Both methodologies also lack focus on other strategic considerations that matter for the final quality of the data and model. We introduce a novel, comprehensive, and production-proven distributed mining framework that is able to generate large-scale quality data to train search speller models. The enabling method eliminates the dependency on human-judged data, and fully scales the exploration, training, and deployment of high-quality speller models with maximal efficiency. The work has been demonstrated by production launches of spell correction to worldwide markets for Apple Maps search. Our approach should also help general synthetic data generation approaches in applicable domains to remove the dependency on human annotation.