The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query poin...
详细信息
The concept of similarity is used as the basis for many data exploration and data mining tasks. Nearest neighbor (NN) queries identify the most similar items, or in terms of distance the closest points to a query point. Similarity is traditionally characterized using a distance function between multi-dimensional feature vectors. However, when the data is high-dimensional, traditional distance functions fail to significantly distinguish between the closest and furthest points, as few dissimilar dimensions dominate the distance function. Localized similarity functions, i.e. functions that only consider dimensions close to the query, quantize each dimension independently and only compute similarity for the dimensions where the query and the points fall into the same bin. These quantizations are query-agnostic and there is potential to improve accuracy when a query-dependent quantization is used. In this work we propose a query dependent equi-depth (QED) on-the-fly quantization method to improve high-dimensional similarity searches. The quantization is done for each dimension at query time and localized scores are generated for the closest p fraction of the points while a constant penalty is applied for the rest of the points. QED not only improves the quality of the distance metric, but also improves query time performance by filtering out non relevant data. We propose a distributed indexing and query algorithm to efficiently compute QED. Our experimental results show improvements in classification accuracy as well as query performance up to one order of magnitude faster than Manhattan-based sequential scan NN queries over datasets with hundreds of dimensions. Furthermore, similarity searches with QED show linear or better scalability in relation to the number of dimensions, and the number of compute nodes.
The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications,...
详细信息
The skyline operator has attracted considerable attention recently due to its broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this paper, we propose the efficient parallel algorithm SKY-MR+ for processing skyline queries using MapReduce. We first build a quadtree-based histogram for space partitioning by deciding whether to split each leaf node judiciously based on the benefit of splitting in terms of the estimated execution time. In addition, we apply the dominance power filtering method to effectively prune non-skyline points in advance. We next partition data based on the regions divided by the quadtree and compute candidate skyline points for each partition using MapReduce. Finally, we check whether each skyline candidate point is actually a skyline point in every partition using MapReduce. We also develop the workload balancing methods to make the estimated execution times of all available machines to be similar. We did experiments to compare SKY-MR+ with the state-of-the-art algorithms using MapReduce and confirmed the effectiveness as well as the scalability of SKY-MR+.
This paper addresses dynamic bandwidth allocation for virtual network (VN) resources to respond to increasing or decreasing applications requirements in cloud environments. A distributed and local-view framework, comp...
详细信息
ISBN:
(纸本)9781509002238
This paper addresses dynamic bandwidth allocation for virtual network (VN) resources to respond to increasing or decreasing applications requirements in cloud environments. A distributed and local-view framework, composed of a controller and three algorithms running in substrate nodes, is proposed to deal with all types of bandwidth demand fluctuations in embedded virtual networks. The framework is based on the Self-Stabilization concept to drive the system back to a "stable state" when new bandwidth demands drift the system away into an "unstable state". Performance evaluation results demonstrate the effectiveness of our proposal in handling bandwidth demand fluctuations in convergence speed and cost.
In this paper, we propose a novel exact/successive line search method for stepsize calculation in iterative algorithms for nonsmooth optimization problems. The proposed approach is to perform line search over a proper...
详细信息
ISBN:
(纸本)9780992862633
In this paper, we propose a novel exact/successive line search method for stepsize calculation in iterative algorithms for nonsmooth optimization problems. The proposed approach is to perform line search over a properly constructed differentiable function based on the original nonsmooth objective function, and it outperforms state-of-the-art techniques from the perspective of convergence speed, computational complexity and signaling burden. When applied to LASSO, the proposed exact line search is shown, either analytically or numerically, to exhibit several desirable advantages, namely;it is implementable in closed-form, converges fast and is robust with respect to the choice of problem parameters.
Searching for new efficient and exact heuristic optimization algorithms in big search spaces currently remains as an open problem. The search space increases exponentially with the problem size, making impossible to f...
详细信息
ISBN:
(纸本)9783319192581;9783319192574
Searching for new efficient and exact heuristic optimization algorithms in big search spaces currently remains as an open problem. The search space increases exponentially with the problem size, making impossible to find a solution through a mere blind search. Several heuristic approaches inspired by nature have been adopted as suitable algorithms to solve complex optimization problems in many different areas. Networks of Bio-inspired Processors (NBP) is a formal framework formed of highly parallel and distributed computing models inspired and abstracted by biological evolution. From a theoretical point of view, NBP has been proved broadly to be an efficient solving of NP complete problems. The aim of this paper is to explore the expressive power of NBP to solve hard optimization problems with a big search space, using massively parallel architectures. We use the basic concepts and principles of some metaheuristic approaches to propose an extension of the NBP model, which is able to solve actual problems in the optimization field from a practical point of view.
In this paper, we propose a novel exact/successive line search method for stepsize calculation in iterative algorithms for nonsmooth optimization problems. The proposed approach is to perform line search over a proper...
详细信息
ISBN:
(纸本)9781479988518
In this paper, we propose a novel exact/successive line search method for stepsize calculation in iterative algorithms for nonsmooth optimization problems. The proposed approach is to perform line search over a properly constructed differentiable function based on the original nonsmooth objective function, and it outperforms state-of-the-art techniques from the perspective of convergence speed, computational complexity and signaling burden. When applied to LASSO, the proposed exact line search is shown, either analytically or numerically, to exhibit several desirable advantages, namely: it is implementable in closed-form, converges fast and is robust with respect to the choice of problem parameters.
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is o...
详细信息
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over 100 processors and indicate that similar speedup can be obtained with several hundred processors. (c) 2006 Elsevier Inc. All rights reserved.
distributed computations are concurrent programs in which processes communicate by message passing. Such programs typically execute on network architectures such as networks of workstations or distributed memory paral...
详细信息
distributed computations are concurrent programs in which processes communicate by message passing. Such programs typically execute on network architectures such as networks of workstations or distributed memory parallel machines (i.e., multicomputers such as hypercubes). Several paradigms-examples or models-for process interaction in distributed computations are described. These include networks of filters, clients, and servers, heartbeat algorithms, probe echo algorithms, broadcast algorithms, token-passing algorithms, decentralized servers, and bags of tasks. These paradigms are applicable to numerous practical problems. They are illustrated by solving problems, including parallel sorting, file servers, computing the topology of a network, distributed termination detection, replicated databases, and parallel adaptive quadrature. Solutions to all problems are derived in a step-wise fashion from a general specification of the problem to a concrete solution. The derivations illustrate techniques for developing distributedalgorithms.
暂无评论