In the big data setting, working data sets are often distributed across multiple machines, whereas classical statistical methods are typically developed for estimation or inference on a single data set. We employ a novel parallel quasi-likelihood method in generalized linear models that makes the variances of the different sub-estimators relatively similar. Estimates are obtained from projection subsets of the data and then combined with suitably chosen unknown weights. We also show that the proposed method achieves better asymptotic efficiency than the simple average. Furthermore, simulation examples show that the proposed method can significantly improve statistical inference.
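As an illustrative sketch (not the authors' method, whose weights are estimated from the data), the idea of improving on the simple average by weighting sub-estimators can be seen with inverse-variance weights; all quantities below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# K sub-estimators of a common parameter, with unequal variances across
# data subsets (the situation the weighted combination is designed for).
K = 5
true_theta = 2.0
variances = np.array([0.1, 0.4, 0.2, 0.8, 0.3])   # hypothetical per-subset variances
estimates = true_theta + rng.normal(0.0, np.sqrt(variances))

# (a) simple average of the sub-estimators
simple_avg = estimates.mean()

# (b) inverse-variance weights, which minimize the combined variance
w = (1.0 / variances) / (1.0 / variances).sum()
weighted = np.dot(w, estimates)

# Var(simple average) = mean(variances) / K; Var(weighted) = 1 / sum(1/variances)
var_simple = variances.mean() / K
var_weighted = 1.0 / (1.0 / variances).sum()
```

The weighted combination is never less efficient than the simple average, and the gap grows with the heterogeneity of the subset variances.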
Personal credit has always been a hot topic in society. In particular, the evaluation of default risk receives close attention, since robust estimation based on personal information can both help needy individuals obtain loans and help financial institutions avoid losses. So far there have been no good solutions due to limited data, especially default information. With the advent of the era of big data, it is possible to improve the effectiveness of estimates by using auxiliary information from external studies or public domains. However, individual-level data cannot be obtained directly because of the emphasis on data privacy; that is, only summary statistics carrying auxiliary information are allowed to be shared. To effectively utilize external aggregated auxiliary information to improve the accuracy of default-risk estimation, this paper introduces a unified auxiliary-information framework, referred to as the enhanced GEE method, which incorporates various external summary results by employing the generalized estimating equations (GEE) approach and augmenting the GEE function with a weighted logarithm of a confidence density. We establish asymptotic properties for the new method and prove that it gains statistical efficiency over the study-specific estimator that uses no auxiliary information. In addition, a low-cost MapReduce procedure for distributed statistical inference with the enhanced GEE method in big data is developed that achieves the same efficiency as the oracle enhanced GEE approach under mild conditions. The method is demonstrated in an application predicting the loan default risk of bank customers in Shanghai and shown to be more effective and reliable than the method based on the bank's own data only. Furthermore, the advantages of our approach, especially the construction of tighter confidence intervals, are illustrated with extensive simulation studies and a real personal default-risk case.
Electronic health records (EHRs) offer great promise for advancing precision medicine and, at the same time, present significant analytical challenges. In particular, it is often the case that patient-level data in EHRs cannot be shared across institutions (data sources) due to government regulations and/or institutional policies. As a result, there is growing interest in distributed learning over multiple EHR databases without sharing patient-level data. To tackle such challenges, we propose a novel communication-efficient method that aggregates the optimal estimates of external sites by turning the problem into a missing-data problem. In addition, we propose incorporating posterior samples from remote sites, which can provide partial information on the missing quantities and improve the efficiency of parameter estimates while having the differential-privacy property, thus reducing the risk of information leakage. The proposed approach allows for proper statistical inference without sharing raw patient-level data. We provide a theoretical investigation of the asymptotic properties of the proposed method for statistical inference as well as its differential privacy, and evaluate its performance in simulations and real data analyses in comparison with several recently developed methods.
With the rapid advancement of information technology, data analysis has become increasingly vital in various fields. Balancing the utility of data with the protection of individual privacy has become a hot topic for both academic research and practical applications. As a technology that can provide strict privacy guarantees, differential privacy has attracted widespread attention in recent years. In this paper, we study statistical inference for differentially private data based on empirical likelihood. Specifically, we develop two novel privacy-preserving statistical inference methods: differentially private distributed empirical likelihood and balanced augmented differentially private distributed empirical likelihood. Under mild conditions, the asymptotic properties of the proposed methods are derived. We also illustrate the finite-sample performance of the proposed approaches via simulation studies and real data analysis.
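The empirical-likelihood constructions above are more involved, but the underlying privacy primitive can be sketched with the standard Laplace mechanism for releasing a differentially private mean; the function name and parameters here are illustrative assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(1)

def private_mean(x, lower, upper, epsilon, rng):
    """Release an epsilon-differentially private mean via the Laplace mechanism.

    After clipping each value to [lower, upper], one record changes the mean
    by at most (upper - lower) / n, so Laplace noise with scale
    sensitivity / epsilon suffices for epsilon-differential privacy.
    """
    x = np.clip(x, lower, upper)
    sensitivity = (upper - lower) / len(x)
    return x.mean() + rng.laplace(0.0, sensitivity / epsilon)

# Hypothetical data: 10,000 observations with true mean 5.0.
x = rng.normal(5.0, 1.0, size=10_000)
est = private_mean(x, lower=0.0, upper=10.0, epsilon=1.0, rng=rng)
```

For large samples the added noise (scale 0.001 here) is negligible next to the sampling error, which is why privacy and accurate inference can coexist.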
We consider the problem of sparse normal-means estimation in a distributed setting with communication constraints. We assume there are M machines, each holding d-dimensional observations of a K-sparse vector mu corrupted by additive Gaussian noise. The M machines are connected in a star topology to a fusion center, whose goal is to estimate the vector mu with a low communication budget. Previous works have shown that to achieve the centralized minimax rate for the l_2 risk, the total communication must be high, at least linear in the dimension d. This phenomenon occurs, however, only at very weak signals. We show that at signal-to-noise ratios (SNRs) that are sufficiently high, but not high enough for recovery by any individual machine, the support of mu can be correctly recovered with significantly less communication. Specifically, we present two algorithms for distributed estimation of a sparse mean vector corrupted by either Gaussian or sub-Gaussian noise. We then prove that above certain SNR thresholds, with high probability, these algorithms recover the correct support with total communication that is sublinear in the dimension d. Furthermore, the communication decreases exponentially as a function of signal strength. If, in addition, KM << d/log d, then with an additional round of sublinear communication our algorithms achieve the centralized rate for the l_2 risk. Finally, we present simulations that illustrate the performance of our algorithms in different parameter regimes.
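A minimal sketch of the thresholding-and-voting flavor of such schemes (an assumed form, not the paper's exact algorithms): each machine transmits only the indices whose observations exceed a threshold, and the fusion center keeps indices reported by a majority, so communication scales with K rather than d:

```python
import numpy as np

rng = np.random.default_rng(2)

d, K, M = 1000, 10, 20                    # dimension, sparsity, machines
support = rng.choice(d, size=K, replace=False)
mu = np.zeros(d)
mu[support] = 6.0                         # signal well above the noise level

tau = np.sqrt(2 * np.log(d))              # per-machine threshold
votes = np.zeros(d)
for _ in range(M):
    y = mu + rng.normal(size=d)           # one machine's noisy observation
    votes[np.abs(y) > tau] += 1           # transmit only the exceeding indices

# Fusion center: keep indices reported by more than half the machines.
recovered = np.flatnonzero(votes > M / 2)
```

Each machine sends roughly K indices on average, so the total communication is about M*K*log(d) bits, sublinear in d when K is small.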
Market beta is a measure of the volatility, or systematic risk, of a security or portfolio compared to the market as a whole. This paper considers the distributed estimation of market beta in the case of massive data and establishes the consistency and asymptotic normality of the estimator. Furthermore, simulations illustrate the finite-sample properties of the estimator.
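As a sketch of one assumed setup (not necessarily the paper's estimator): beta is the slope of the CAPM regression r_asset = alpha + beta * r_market + noise, and it can be computed over partitioned data by exchanging only five sufficient statistics per machine:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated returns with true beta = 1.3 (all parameters illustrative).
true_beta = 1.3
n = 100_000
r_market = rng.normal(0.0, 0.02, size=n)
r_asset = 0.001 + true_beta * r_market + rng.normal(0.0, 0.01, size=n)

# Split the sample across 10 "machines"; each reports
# (count, sum_x, sum_y, sum_xy, sum_xx) rather than raw returns.
chunks = zip(np.array_split(r_market, 10), np.array_split(r_asset, 10))
stats = np.array([(len(x), x.sum(), y.sum(), (x * y).sum(), (x * x).sum())
                  for x, y in chunks])

# The center assembles the pooled OLS slope from the summed statistics.
cnt, sx, sy, sxy, sxx = stats.sum(axis=0)
beta_hat = (sxy - sx * sy / cnt) / (sxx - sx * sx / cnt)
```

Because the per-machine sums add up to the global sums, this distributed estimator is numerically identical to the centralized OLS slope, so no efficiency is lost.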
Many data are sensitive in areas such as finance, economics, and other social sciences. We propose an ER (encryption and recovery) algorithm that allows a central administrator to carry out statistical inference based on the encrypted data while still preserving each party's privacy, even against a colluding majority in the presence of cyberattacks. We demonstrate the applications of our algorithm to linear regression, logistic regression, maximum likelihood estimation, the method of moments, and estimation of empirical distributions. Moreover, our algorithm can help to address another practically significant issue: privacy preservation for distributed statistical inference when data are allocated to different parties who are unwilling to share their own data with others. Finally, we provide two extensions of the applications of our algorithm: the combination of our algorithm with Fourier transforms, and the development of a modified root-finding method for recovering quantiles with privacy preservation.
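The general flavor of such schemes can be sketched with additive secret sharing over a prime field; this is an assumption about the style of "encryption and recovery", not the ER algorithm itself:

```python
import secrets

P = 2**61 - 1  # a Mersenne prime used as the field modulus

def share(value, n_parties):
    """Split an integer into n_parties random shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

# Three parties hold private values; each splits its value into shares
# and distributes one share to each party (including itself).
values = [42, 17, 99]
all_shares = [share(v, 3) for v in values]

# Each party sends the administrator only the sum of the shares it holds;
# individual shares are uniformly random, so no single value is revealed.
party_sums = [sum(col) % P for col in zip(*all_shares)]
total = sum(party_sums) % P   # the administrator recovers only the total
```

The administrator learns the aggregate needed for inference (here a sum, the building block of moment and likelihood computations) without ever seeing any party's raw value.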
Aggregated inference on distributed data is becoming more and more important due to the growing size of data collected in different industries. Modeling and inference are needed when data cannot be gathered at a central location; aggregated statistical inference is a major tool for solving such problems. In the literature, problems under the setting of regression models (more generally, M-estimators) are extensively studied. There are at least two popular techniques for distributed estimation: (a) averaging estimators computed at local locations and (b) the one-step approach, which combines the simple averaging estimator with a classical Newton's method (using the local Hessian matrices) to generate a "one-step" estimator. It has been proved that under certain assumptions, these estimators enjoy the same asymptotic properties as the centralized estimator, which is obtained as if all data were available at a central location. We review these two major estimation schemes. It can be seen that, in big-data problems, dividing the data across multiple machines and then using aggregation techniques to solve the estimation problem in parallel can speed up the computation with little compromise in the quality of the estimators. We discuss potential extensions to other models, such as support vector machines, principal component analysis, and so on. Numerical examples are omitted due to space limitations; they can easily be found in the literature. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Knowledge Discovery; Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods; Statistical Models > Fitting Models; Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms.
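The two techniques, (a) simple averaging and (b) the one-step refinement, can be sketched for logistic regression as a representative M-estimator; the data, sample sizes, and helper names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_fit(X, y, iters=30):
    """Local logistic MLE by Newton's method on one machine's data."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        g = X.T @ (y - p)                          # gradient of log-likelihood
        H = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta += np.linalg.solve(H, g)
    return beta

# Simulated data split across 10 machines.
true_beta = np.array([0.5, -1.0])
X = rng.normal(size=(20_000, 2))
y = (rng.random(20_000) < sigmoid(X @ true_beta)).astype(float)
parts = list(zip(np.array_split(X, 10), np.array_split(y, 10)))

# (a) averaging: each machine fits locally, the center averages.
avg = np.mean([local_fit(Xk, yk) for Xk, yk in parts], axis=0)

# (b) one-step: one Newton update of the average, pooling the local
# gradients and Hessians evaluated at the averaged estimator.
p_parts = [(Xk, yk, sigmoid(Xk @ avg)) for Xk, yk in parts]
g = sum(Xk.T @ (yk - p) for Xk, yk, p in p_parts)
H = sum(Xk.T @ (Xk * (p * (1 - p))[:, None]) for Xk, yk, p in p_parts)
one_step = avg + np.linalg.solve(H, g)
```

The averaging step needs one round of communication (each machine sends its local estimate), and the one-step refinement needs one more (each machine sends a gradient and a Hessian), which is how both methods stay communication-efficient.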