To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for MapReduce jobs in today's production environments. Much effort has been devoted to the problem of meeting deadlines, and solutions typically fall into two categories. The first allocates appropriate resources to complete the entire job before the specified time limit, but misses deadlines when the deadline constraint is tight or resources are scarce; the second runs a pre-constructed sample chosen according to the deadline constraint, which satisfies the time requirement but fails to maximize the volume of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named 'Dart', to address this problem. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, based on both historical data and the job's running status, to precisely estimate the real-time job completion time. Based on the estimated time, Dart uses an approach-revise algorithm to make dynamic scheduling decisions that meet the deadline while maximizing the amount of processed data and mitigating stragglers. Dart also handles task failures and data skew efficiently, protecting its performance from degradation. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart not only effectively meets the deadline but also processes a near-maximum volume of data, even with tight deadlines and limited resources.
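The estimate-then-revise loop can be illustrated with a short sketch. The Python below is a minimal, hypothetical rendering (the function and variable names are ours, and the wave-based estimator is a common simplification rather than the paper's exact model): it estimates the finish time from observed task durations, then sheds the fewest tasks needed for the estimate to fit the deadline, so the amount of data still processed stays near the maximum.

```python
# Minimal sketch of deadline-oriented scheduling in the style of Dart.
# All names are hypothetical; the wave-based estimator is a common
# simplification, not the paper's exact iterative model.

import math

def estimate_completion_time(now, avg_task_time, tasks_remaining, slots):
    """Estimate job finish time from running status: remaining tasks run
    in waves of `slots` parallel tasks, each taking `avg_task_time`."""
    waves = math.ceil(tasks_remaining / slots)
    return now + waves * avg_task_time

def approach_revise(now, deadline, avg_task_time, tasks_remaining, slots):
    """Revise the schedule: shed the fewest (lowest-priority) tasks so the
    estimated finish time stays within the deadline, maximizing the data
    that is still processed."""
    while tasks_remaining > 0:
        eta = estimate_completion_time(now, avg_task_time,
                                       tasks_remaining, slots)
        if eta <= deadline:
            break
        tasks_remaining -= 1
    return tasks_remaining

# Example: 200 tasks of ~12s each on 32 slots against a 60s deadline
# -> the revise step keeps 160 tasks (5 waves, 60s estimated finish).
print(approach_revise(now=0.0, deadline=60.0, avg_task_time=12.0,
                      tasks_remaining=200, slots=32))
```

In the real system the estimate is refreshed as tasks complete, so the same revise step also absorbs stragglers, task failures, and data skew by re-running against the updated estimate.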
Real-life behaviors shown by mobile users typically exhibit plenty of noise, making it hard to construct an effective recommendation engine. In this paper, we present a fused model based on the LR algorithm and the ...
As the basis of many knowledge graph completion tasks, the embedding representation of entities and relations in a knowledge graph (KG) is an important task in the fields of Natural Language Processing (NLP) and Artific...
To reduce the time required to complete the regeneration process of erasure codes, we propose a Tree-structured Parallel Regeneration (TPR) scheme for multiple data losses in distributed storage systems. Under this scheme, two algorithms are proposed for the construction of multiple regeneration trees: the edge-disjoint algorithm and the edge-sharing algorithm. The edge-disjoint algorithm constructs multiple independent trees; it is simple and appropriate for environments where newcomers and their providers are distributed over a large area and have few intersections. The edge-sharing algorithm constructs multiple trees that compete for the bandwidth and makes better use of it, although it needs to measure the available bandwidth and deal with bandwidth changes, and is therefore difficult to implement in practical systems. The parallel regeneration for multiple data losses in TPR relies primarily on two optimizations: first, transferring data through bandwidth-optimized paths in a pipelined manner; second, executing data regeneration over multiple trees in parallel. To evaluate the proposal, we implement an event-based simulator and make a detailed comparison with several popular regeneration methods. The quantitative results show that TPR, using either the edge-disjoint or the edge-sharing algorithm, reduces the regeneration time significantly.
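To make the tree-construction idea concrete, here is a small, hypothetical Python sketch of the edge-disjoint variant (the names and the maximum-spanning-tree formulation are our own illustration, not the paper's exact algorithm): each regeneration tree is grown greedily over the highest-bandwidth links, and a tree's links are withheld from later trees so that no two trees ever share an edge.

```python
# Hypothetical sketch of edge-disjoint regeneration trees: grow each tree
# greedily over the highest-bandwidth links (a maximum spanning tree), then
# withhold its edges from later trees so trees never compete for a link.
# Assumes enough unused links remain for every tree to span all nodes.

import heapq

def max_spanning_tree(nodes, bandwidth, used):
    """Lazy Prim's algorithm on negated weights. `bandwidth` maps frozenset
    node pairs to link bandwidth; edges already in `used` are skipped."""
    nodes = set(nodes)
    root = next(iter(nodes))
    in_tree, edges, heap = {root}, [], []

    def push(u):
        for v in nodes - in_tree:
            e = frozenset((u, v))
            if e not in used:
                heapq.heappush(heap, (-bandwidth[e], u, v))

    push(root)
    while heap and len(in_tree) < len(nodes):
        w, u, v = heapq.heappop(heap)
        if v not in in_tree:
            in_tree.add(v)
            edges.append((u, v, -w))
            push(v)
    return edges

def edge_disjoint_trees(nodes, bandwidth, k):
    """Construct k regeneration trees with pairwise-disjoint edges."""
    used, trees = set(), []
    for _ in range(k):
        tree = max_spanning_tree(nodes, bandwidth, used)
        used.update(frozenset((u, v)) for u, v, _ in tree)
        trees.append(tree)
    return trees
```

Note that a complete graph on n providers contains at most ⌊n/2⌋ edge-disjoint spanning trees, which bounds how many losses this variant can regenerate fully in parallel; the edge-sharing algorithm avoids that bound at the cost of measuring and reacting to bandwidth contention.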
The original contour preserving classification technique was proposed to improve the robustness and weight fault tolerance of a neural network applied to a two-class linearly separable problem. It was recently found...
A high-baud-rate optical transceiver based on time-division multiplexing technology is proposed. A communication channel at 80 GBaud with 4 bit streams at 20 Gbps is realized by 4-stage cascaded high-speed switches with s...
Although attention has recently shifted toward solving non-convex problems, convex optimization remains important in machine learning, especially when the situation requires an interpretable model. The solution to a convex problem is a global minimum, and the final model can be explained mathematically. Typically, the convex problem is recast as a regularized risk minimization problem to prevent overfitting. The cutting plane method (CPM) is one of the best solvers for convex problems, irrespective of whether the objective function is differentiable. However, CPM and its variants fail to adequately address large-scale, data-intensive cases because these algorithms access the entire dataset in each iteration, which substantially increases the computational burden and memory cost. To alleviate this problem, we propose a novel algorithm named the mini-batch cutting plane method (MBCPM), which iterates with estimated cutting planes calculated on a small batch of sampled data and is capable of handling large-scale problems. Furthermore, MBCPM adopts a "sink" operation that detects and adjusts noisy estimations to guarantee convergence. Numerical experiments on extensive real-world datasets demonstrate the effectiveness of MBCPM, which is superior to bundle methods for regularized risk minimization as well as popular stochastic gradient descent methods in terms of convergence speed.
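The core loop can be sketched in a few lines of NumPy. The example below is a toy rendering for L2-regularized hinge loss (the names are ours, the master problem is solved approximately in its dual by exponentiated-gradient steps, and the paper's "sink" operation is omitted): each iteration adds one cutting plane, i.e. an affine lower bound on the empirical risk estimated from a sampled mini-batch, and then re-solves the master problem over all planes collected so far.

```python
# Toy sketch of mini-batch cutting plane iterations for L2-regularized
# hinge loss. Names are ours; the "sink" operation that guards against
# noisy planes is omitted, and the master problem is solved approximately.

import numpy as np

def mbcpm(X, y, lam=0.1, batch=64, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    A, B = [], []                      # planes: risk(w) >= a @ w + b
    for _ in range(iters):
        idx = rng.choice(n, size=min(batch, n), replace=False)
        Xb, yb = X[idx], y[idx]
        margin = yb * (Xb @ w)
        act = margin < 1.0             # points with nonzero hinge loss
        risk = np.maximum(0.0, 1.0 - margin).mean()
        a = -(yb[act, None] * Xb[act]).sum(axis=0) / len(idx)  # subgradient
        A.append(a)
        B.append(risk - a @ w)
        Am, Bm = np.array(A), np.array(B)
        # Dual of  min_w lam/2 ||w||^2 + max_t (a_t @ w + b_t):
        # maximize  alpha @ B - ||Am.T @ alpha||^2 / (2 lam)  on the simplex.
        alpha = np.full(len(A), 1.0 / len(A))
        for _ in range(200):           # exponentiated-gradient ascent
            g = Bm - Am @ (Am.T @ alpha) / lam
            alpha *= np.exp(0.5 * (g - g.max()))   # numerically stable step
            alpha /= alpha.sum()
        w = -(Am.T @ alpha) / lam      # primal solution recovered from dual
    return w
```

A plane built from an unlucky batch can cut off the true optimum, which is exactly the failure mode the paper's "sink" operation is introduced to detect and adjust.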
The development of multi-core processors makes the parallelization of traditional sequential algorithms increasingly important. Meanwhile, transactional memory serves as a good parallel programming model. This paper takes...
Knowledge representation learning (KRL) is one of the important research topics in artificial intelligence and natural language processing. It can efficiently calculate the semantics of entities and relations in a low...
Many recent applications involve processing and analyzing uncertain data. Recently, several research efforts have addressed answering skyline queries efficiently on massive uncertain datasets. However, the research la...