Antijoin cardinality estimation is among a handful of problems that have eluded accurate, efficient solutions amenable to implementation in relational query optimizers. Given the widespread use of antijoin and subset-based queries in analytical workloads and the extensive research targeted at join cardinality estimation (a seemingly related problem), the lack of adequate solutions for antijoin cardinality estimation is intriguing. In this article, we introduce a novel sampling-based estimator for antijoin cardinality that, unlike existing estimators, provides sufficient accuracy and efficiency to be implemented in a query optimizer. The proposed estimator incorporates three novel ideas. First, we use prior workload information when learning a mixture superpopulation model of the data offline. Second, we design a Bayesian statistics framework that updates the superpopulation model according to the live queries, thus allowing the estimator to adapt dynamically to the online workload. Third, we develop an efficient algorithm for sampling from a hypergeometric distribution in order to generate Monte Carlo trials, without explicitly instantiating either the population or the sample. Taken together, these ideas form the basis of an efficient antijoin cardinality estimator that satisfies the strict requirements of a query optimizer, as shown by extensive experimental results over synthetically generated as well as massive TPC-H data.
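To make the third idea concrete, the following is a minimal sketch of one textbook way to draw a hypergeometric variate while tracking only counters. It is not the algorithm from the article, whose details are not given in the abstract; the function name `sample_hypergeometric` and its parameters are assumptions chosen for illustration.

```python
import random


def sample_hypergeometric(population_size, num_successes, sample_size, rng):
    """Draw one hypergeometric variate by simulating draws without replacement.

    Only two counters are updated per draw, so neither the population nor the
    sample is ever materialized. This is an illustrative textbook method, not
    the estimator's actual sampling algorithm.
    """
    if not 0 <= num_successes <= population_size:
        raise ValueError("num_successes must lie in [0, population_size]")
    if not 0 <= sample_size <= population_size:
        raise ValueError("sample_size must lie in [0, population_size]")

    remaining_total = population_size
    remaining_successes = num_successes
    hits = 0
    for _ in range(sample_size):
        # The next draw is a success with probability
        # remaining_successes / remaining_total.
        if rng.random() * remaining_total < remaining_successes:
            hits += 1
            remaining_successes -= 1
        remaining_total -= 1
    return hits


# One Monte Carlo trial: 10 draws from a population of 100 with 30 successes.
rng = random.Random(0)
trial = sample_hypergeometric(100, 30, 10, rng)
```

This sketch costs time linear in the sample size per trial; a production estimator would presumably need something faster, which is the gap the article's algorithm is described as filling.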
Stochastic gradient descent (SGD) is a widely used technique for implementing matrix factorization. SGD-based matrix factorization involves many iterative computations. Therefore, according to the sequential composition theorem of differential privacy, conventional implementation strategies for differentially private matrix factorization may lead to significant error accumulation, regardless of whether the Laplace noise is added to the original matrix or to the factorized matrices. In fact, implementing differentially private matrix factorization is so challenging that the results proposed to date suffer from inefficient privacy protection and poor data utility. In this paper, we employ the objective perturbation method to address the challenge; this method dramatically alleviates error accumulation by perturbing the objective function instead of perturbing the results. Our method outperforms state-of-the-art methods because it requires only scalar noise rather than vector noise to achieve the same level of privacy. Furthermore, our method can learn the resulting matrices by joint optimization, which follows the conventional learning procedure of SGD and optimizes its convergence speed and accuracy as much as possible. In addition to the differential privacy guarantee, we also show empirically how the novel model works together with k-coRating, a k-anonymity-like privacy-preserving model, to enhance data utility. (C) 2018 Elsevier Inc. All rights reserved.
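For readers unfamiliar with the baseline, here is a minimal sketch of plain (non-private) SGD matrix factorization, showing the many per-rating iterative updates the abstract refers to. It contains no noise and no objective perturbation, so it does not reproduce the paper's method; all names, hyperparameters, and the toy ratings are assumptions for illustration.

```python
import math
import random


def predict(P, Q, u, i):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pk * qk for pk, qk in zip(P[u], Q[i]))


def train_rmse(ratings, P, Q):
    """Root mean square error over the observed (user, item, rating) triples."""
    sq = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(sq / len(ratings))


def sgd_matrix_factorization(ratings, num_users, num_items, rank=2,
                             lr=0.01, reg=0.02, epochs=200, seed=0):
    """Factorize a sparse rating matrix with plain SGD (no privacy mechanism)."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(num_items)]
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)
        for u, i, r in order:
            err = r - predict(P, Q, u, i)
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on the L2-regularized squared error.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


# Hypothetical toy data: (user index, item index, rating).
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q = sgd_matrix_factorization(toy_ratings, num_users=3, num_items=3)
```

Each epoch touches every factor entry involved in an observed rating, so a mechanism that adds noise at every step would compose over all of these updates.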
Before deploying a recommender system, its performance must be measured and understood. Evaluation is therefore an integral part of designing and implementing recommender systems. In collaborative filtering, there are many metrics for evaluating recommender systems. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are among the most important and representative ones. To calculate MAE/RMSE, predicted ratings are compared with their corresponding true ratings. To predict item ratings, similarities between active users and their candidate neighbors need to be calculated. The traditional, naive similarity calculation for a pair of users u and v has complexity quadratic in the number of items rated by u and v. In this paper, we explore the mathematical regularities underlying the similarity formulas, introduce a novel data structure, and design linear-time algorithms to calculate the similarities. This complexity improvement shortens the evaluation time and ultimately makes the design and development of recommender systems more efficient. Experimental results confirm the claim. (C) 2016 Elsevier B.V. All rights reserved.
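As a small illustration of the two metrics named above, the sketch below computes MAE and RMSE from paired predicted and true ratings using their standard definitions. The function names and sample ratings are assumptions for illustration; the paper's similarity algorithms and data structure are not reproduced here.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error: average of |prediction - truth|."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted, actual):
    """Root Mean Square Error: square root of the mean squared difference."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    return math.sqrt(mean_sq)


# Hypothetical ratings: absolute errors are 0, 1, and 2.
predicted = [3, 4, 5]
actual = [3, 5, 3]
print(mae(predicted, actual))   # 1.0
print(rmse(predicted, actual))  # sqrt(5/3), about 1.29
```

RMSE squares each error before averaging, so it penalizes the single error of 2 more heavily than MAE does.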
ISBN: (print) 9781479955497
The scale of cloud services keeps increasing over time, introducing significant challenges in system manageability and reliability. Designing coordination services for the cloud is the right approach to solving these problems. However, existing coordination services (e.g., Chubby and ZooKeeper) perform well only in read-intensive scenarios and at small ensemble scales. To this end, we propose Giraffe, a scalable distributed coordination service. There are three important contributions in our design. (1) Giraffe organizes coordination servers using interior-node-disjoint trees for better scalability. (2) Giraffe employs a novel Paxos protocol for strong consistency and fault tolerance. (3) Giraffe supports hierarchical data organization and in-memory storage for high throughput and low latency. We evaluate Giraffe on a high-performance computing test bed. The experimental results show that Giraffe achieves much better write performance than ZooKeeper when the server ensemble is large. Giraffe is nearly 300% faster than ZooKeeper on update operations when the ensemble size is 50 servers. Experiments also show that Giraffe reacts to and recovers from node failures more quickly than ZooKeeper.
Ensuring privacy in recommender systems for smart cities remains a research challenge, and in this paper we study collaborative filtering recommender systems for privacy-aware smart cities. Specifically, we use the rating matrix to establish connections between a privacy-aware smart city and k-coRating, a novel privacy-preserving rating data publishing model. First, we model privacy concerns in a smart city as the problem of privacy-preserving collaborative filtering recommendation. Then, we introduce k-coRating to address privacy concerns in published rating matrices by filling the null ratings with predicted scores. This allows us to mask the original ratings to preserve k-anonymity-like data privacy and to enhance data utility (quantified using prediction accuracy in this paper). We show that finding the optimal k-coRated mapping is an NP-hard problem and design an efficient greedy algorithm to achieve k-coRating. We then demonstrate the utility of our approach empirically. (C) 2018 Elsevier Inc. All rights reserved.