ISBN: 9781509006212 (print)
HyperLogLog Counting is widely used for cardinality estimation and underpins many algorithms in data analysis, commodity recommendation, and database optimization. For large-scale internet businesses such as e-commerce, companies urgently need distributed, real-time cardinality estimation with high accuracy and low time cost. In this paper, we propose a distributed real-time cardinality estimation algorithm named Hermes. Hermes dynamically adjusts the estimated cardinality according to the result of HyperLogLog Counting and also optimizes the data distribution strategy of existing distributed cardinality estimation algorithms. Experimental results show that Hermes achieves lower estimation error and time cost than existing algorithms.
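As background for the abstract above, the following is a minimal sketch of plain HyperLogLog counting, including the register-wise merge that makes it usable across distributed nodes. It is not the Hermes algorithm: the class name, the choice of SHA-1 as the hash function, and the default of 12 index bits are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative only; not Hermes)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits (assumed default)
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an assumption).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Sketches built on different nodes combine by register-wise maximum.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw
```

Because merging is a register-wise maximum, two nodes can each summarize their own stream and combine the sketches later without double-counting items seen by both.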
ISBN: 9781510835368 (print)
In recent years, dynamically growing data and large-scale data classification *** traditional methods struggle to balance the precision and computational burden when data and its number of classes ***, some methods have weak precision, and the others are *** this paper, we propose an incremental learning method, namely heterogeneous incremental Nearest Class Mean Random Forest (hi-RF), to handle this *** is a heterogeneous method that adaptively either replaces trees or updates tree leaves in the random forest, reducing the computational time at comparable performance when data of new classes ***, to keep the accuracy, a proportion of the trees are replaced by new NCM decision trees; to reduce the computational load, the remaining trees have their leaf probabilities updated *** of all, out-of-bag estimation and out-of-bag boosting are proposed to balance the accuracy and the computational *** experiments were conducted and demonstrated its comparable precision with much less computational time.
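To illustrate the Nearest Class Mean (NCM) idea that hi-RF builds on, here is a minimal sketch of a flat NCM classifier that accepts new classes incrementally. It is not the hi-RF method and contains no trees; the class name `NearestClassMean` and its method names are hypothetical.

```python
class NearestClassMean:
    """Flat nearest-class-mean classifier (illustrative; not hi-RF)."""

    def __init__(self):
        self.sums = {}    # label -> per-dimension running sum
        self.counts = {}  # label -> number of samples seen

    def partial_fit(self, X, y):
        # New labels are registered on first sight, so classes can arrive later.
        for x, label in zip(X, y):
            if label not in self.sums:
                self.sums[label] = [0.0] * len(x)
                self.counts[label] = 0
            self.sums[label] = [s + v for s, v in zip(self.sums[label], x)]
            self.counts[label] += 1

    def predict(self, x):
        # Return the label whose class mean is closest in squared Euclidean distance.
        def sq_dist(label):
            n = self.counts[label]
            return sum((v - s / n) ** 2 for v, s in zip(x, self.sums[label]))
        return min(self.sums, key=sq_dist)
```

Adding a class only requires accumulating its mean, which is why NCM-style models are a natural fit for data whose set of classes grows over time.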
ISBN: 9781509006212 (print)
Real-life behaviors of mobile users typically contain considerable noise, making it hard to construct an effective recommendation engine. In this paper, we present a fused model based on the LR algorithm and the GBDT algorithm to recommend vertical-industry commodities in a mobile setting. A set of specifically designed methods is proposed to handle data preprocessing and feature extraction for the mobile recommendation scenario. The proposed method is evaluated on a large-scale real-world dataset provided by the Alibaba mobile shopping department. The F1 score improves by 2%-36% compared with the baseline.
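For readers unfamiliar with the metric reported above, the following sketch shows one common way to compute an F1 score for a recommendation list against the items a user actually purchased. The function name and the set-based formulation are assumptions; the abstract does not state how the authors computed the score.

```python
def f1_score(recommended, purchased):
    """F1 of a recommended item set against the purchased item set."""
    recommended, purchased = set(recommended), set(purchased)
    hits = len(recommended & purchased)
    if hits == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = hits / len(recommended)  # fraction of recommendations that were bought
    recall = hits / len(purchased)       # fraction of purchases that were recommended
    return 2 * precision * recall / (precision + recall)
```

F1 is the harmonic mean of precision and recall, so a model cannot score well by recommending everything or by recommending almost nothing.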
ISBN: 9781467399562 (print)
Locality Sensitive Hashing (LSH) is an important indexing technique for approximate similarity search in high-dimensional spaces. An obvious limitation of LSH approaches is their lack of capability and scalability in dealing with massive data. This paper proposes a distributed variant of LSH called Spark-LSH, which is implemented on Apache Spark, a well-known distributed computing framework. We design a shuffle-efficient indexing scheme for Spark-LSH that reduces data shuffle and improves network efficiency when constructing the hash table indices. Furthermore, we propose a location-aware querying scheme to improve query performance. Experiments show that Spark-LSH remarkably reduces network shuffle overhead and significantly accelerates queries.
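As a single-machine illustration of the LSH indexing idea described above, here is a minimal sketch using random-hyperplane hashing with several hash tables. It is not Spark-LSH and does not use Apache Spark; the class name `HyperplaneLSH`, the hash family, and the parameter defaults are assumptions chosen for brevity.

```python
import random
from collections import defaultdict


class HyperplaneLSH:
    """Random-hyperplane LSH with multiple tables (illustrative; not Spark-LSH)."""

    def __init__(self, dim, num_bits=8, num_tables=4, seed=0):
        rng = random.Random(seed)
        # One set of num_bits random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _key(self, planes, vec):
        # Each bit records which side of a hyperplane the vector falls on.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)

    def index(self, item_id, vec):
        for planes, table in zip(self.planes, self.tables):
            table[self._key(planes, vec)].append(item_id)

    def candidates(self, vec):
        # Union of the matching bucket from every table.
        found = set()
        for planes, table in zip(self.planes, self.tables):
            found.update(table.get(self._key(planes, vec), []))
        return found
```

Vectors separated by a small angle tend to share a bucket in at least one table, so a query only compares against the returned candidates instead of the whole dataset. In a distributed setting, moving vectors to the nodes that own their buckets is the step that generates shuffle traffic.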
ISBN: 9781479975761 (print)
The Last-Level Cache (LLC) plays an important role in Chip Multi-Processors (CMPs). The objective of this work is to optimize the structure and management strategy of the LLC. For an 8-core CMP, an LLC structure based on grouped cores is proposed, in which the 8 cores are divided into 4 groups. All LLC resources are classified into three types: fixed private cache, dynamic private cache, and dynamic shared cache. The layout of the LLC structure and the corresponding dynamic partitioning strategy are designed to achieve low access latency and high efficiency. Experimental results on a full-system simulator suggest that the proposed structure and method reduce access latency by 2% to 12% compared with previous works, such as the tiled structure, the cache-centered structure, and the core-centered structure. Consequently, performance measured by IPC improves by up to 7%. The contribution of this paper benefits CMP performance and applies not only to 8-core CMPs but to all small-scale CMPs.
The growing scale and complexity of component interactions in cloud computing systems pose great challenges for operators trying to understand the characteristics of system performance. Profiling has long been proven to be an effective approach to performance analysis; however, existing approaches confront new challenges that emerge in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, service-oriented profiling should be considered to support separation-of-concerns performance analysis. To address these issues, we present P-Tracer, an online performance profiling tool specifically tailored for cloud computing systems. First, P-Tracer constructs a specific search engine that proactively processes performance logs and generates a particular index for fast queries; second, for each service, P-Tracer retrieves a statistical insight into performance characteristics from multiple dimensions and provides operators with a suite of web-based interfaces to query the critical information. We evaluate P-Tracer in terms of tracing overhead, data preprocessing scalability, and querying efficiency. Three real-world case studies from the Alibaba cloud computing platform demonstrate that P-Tracer can help operators understand software behaviors and localize the primary causes of performance anomalies effectively and efficiently.
Modular datacenters (MDCs) use shipping containers, encapsulating thousands of servers, as large pluggable building blocks for mega *** MDC’s "service-free" model poses stricter demand on fault-tolerance of the modular datacenter network (MDCN). Based on the "scale-out" principle, in this paper we propose SCautz, a novel hybrid intra-container network for *** comprises a base Kautz topology, created by interconnecting servers, and a small number of COTS (commercial off-the-shelf) ***, each switch connects a specific number of servers forming "clusters", which, as logical nodes, form multiple higher-level logical Kautz ***’s hybrid structure has several ***, it supports multiple running modes for the MDC, while its full structure increases network capacity ***, it retains the throughput for processing one-to-x traffic in the presence of ***, it achieves much more graceful network performance degradation than computation and storage capacity *** from theoretical analysis and simulations show that SCautz is more viable for intra-container networks.
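The base Kautz topology mentioned above can be sketched as follows. This builds only a plain directed Kautz graph under one common labeling convention (nodes are strings of length k over d + 1 symbols with no two adjacent symbols equal); it does not model SCautz's switches or clusters, and the function name `kautz_graph` is hypothetical.

```python
from itertools import product


def kautz_graph(d, k):
    """Directed Kautz graph on strings of length k over d + 1 symbols."""
    symbols = range(d + 1)
    # Valid node labels: no two consecutive symbols are equal.
    nodes = [
        s for s in product(symbols, repeat=k)
        if all(a != b for a, b in zip(s, s[1:]))
    ]
    # Edge rule: drop the first symbol, append any symbol that differs from the last.
    return {s: [s[1:] + (x,) for x in symbols if x != s[-1]] for s in nodes}
```

Under this convention the graph has (d + 1) * d^(k - 1) nodes, each with out-degree d, which is what makes the topology attractive for interconnecting many servers with few ports per server.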
Many recent applications involve processing and analyzing uncertain data. Recently, several research efforts have addressed answering skyline queries efficiently on massive uncertain datasets. However, the research la...
In large-scale cloud computing systems, the growing scale and complexity of component interactions pose great challenges for operators trying to understand the characteristics of system performance. Performance profiling has long been proven to be an effective approach to performance analysis; however, existing approaches do not consider two new requirements that emerge in cloud computing systems. First, the efficiency of profiling becomes a critical concern; second, visual analytics should be used to make profiling results more readable. To address these two issues, we present P-Tracer, an online performance profiling approach specifically tailored for large-scale cloud computing systems. P-Tracer constructs a specific search engine that proactively processes performance logs and generates particular indices for fast queries; furthermore, P-Tracer provides users with a suite of web-based interfaces to query statistical information on all kinds of services, which helps them understand system behavior quickly and intuitively. The approach has been successfully applied at Alibaba Cloud Computing Inc. to conduct online performance profiling in both production and test clusters. Experience with one real-world case demonstrates that P-Tracer can effectively and efficiently help users conduct performance profiling and localize the primary causes of performance anomalies.