top-k and selection operations are critical in data processing and analysis, and their efficient implementation on GPUs is increasingly important due to the growing demands of data analysis. Existing methods, primaril...
详细信息
top-k and selection operations are critical in data processing and analysis, and their efficient implementation on GPUs is increasingly important due to the growing demands of data analysis. Existing methods, primarily relying on the bucket partition execution model, encounter challenges such as uneven bucket distribution and latency in merging processes. To address these issues, we introduce a novel Split-Bucket Partition (SBP) execution model that specifically addresses these challenges. Additionally, we propose task and control flow optimizations targeted at top-k and selection algorithms, which further contribute to performance improvements. Our optimized algorithms significantly outperform existing approaches, delivering performance gains of up to 2.3 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2.3$$\end{document} times and 1.6 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.6$$\end{document} times for different bucket partitioning rules. Our algorithms show robust performance improvements in non-uniform data scenarios, with gains ranging from 1.9 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$1.9$$\end{document} times to 15.5 \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$15.5$$\end{document} times. However, it should be noted that the SB
Result diversification is extensively studied in the context of search, recommendation, and data exploration. There are numerous algorithms that return top-k results that are both diverse and relevant. These algorithm...
详细信息
Result diversification is extensively studied in the context of search, recommendation, and data exploration. There are numerous algorithms that return top-k results that are both diverse and relevant. These algorithms typically have computational loops that compare the pairwise diversity of records to decide which ones to retain. We propose an access primitive DivGetBatch() that replaces repeated pairwise comparisons of diversity scores of records by pairwise comparisons of "aggregate" diversity scores of a group of records, thereby improving the running time of these algorithms while preserving the same results. We integrate the access primitive inside three representative diversity algorithms and prove that the augmented algorithms leveraging the access primitive preserve original results. We analyze the worst and expected case running times of these algorithms. We propose a computational framework to design this access primitive that has a pre-computed index structure I-tree that is agnostic to the specific details of diversity algorithms. We develop principled solutions to construct and maintain I-tree. Our experiments on multiple large real-world datasets corroborate our theoretical findings, while ensuring up to a 24x speedup.
We address the problem of search on graphs with multiple nodal attributes. We call such graphs weighted attribute graphs (WAGs). Nodes of a WAG exhibit multiple attributes with varying, non-negative weights. WAGs are ...
详细信息
We address the problem of search on graphs with multiple nodal attributes. We call such graphs weighted attribute graphs (WAGs). Nodes of a WAG exhibit multiple attributes with varying, non-negative weights. WAGs are ubiquitous in real-world applications. For example, in a co-authorship WAG, each author is a node;each attribute corresponds to a particular topic (e.g., databases, data mining, and machine learning);and the amount of expertise in a particular topic is represented by a non-negative weight on that attribute. A typical search in this setting specifies both connectivity between nodes and constraints on weights of nodal attributes. For example, a user's search may be: find three coauthors (i.e., a triangle) where each author's expertise is greater than 50 percent in at least one topic area (i.e., attribute). We propose a ranking function which unifies ranking between the graph structure and attribute weights of nodes. We prove that the problem of retrieving the optimal answer for graph search on WAGs is NP-complete. Moreover, we propose a fast and effective top-k graph search algorithm for WAGs. In an extensive experimental study with multiple real-world graphs, our proposed algorithm exhibits significant speed-up over competing approaches. On average, our proposed method is more than 7x faster in query processing than the best competitor.
In recent years, there has been a great deal of interest in developing effective techniques for ad-hoc search and retrieval in relational databases, document and multimedia databases, scientific information systems, a...
详细信息
ISBN:
(纸本)9783642008863
In recent years, there has been a great deal of interest in developing effective techniques for ad-hoc search and retrieval in relational databases, document and multimedia databases, scientific information systems, and so on. A popular paradigm for tackling this problem is top-k querying, i.e., the rankinf of the results and returning the k results with the highest scores. Numerous variants of the top-k retrieval problem and several algorithms have been introduced in recent years. In this tutorial we shall discuss the top-k problem in detail, especially the fundamental algorithms such as FA and TA, important variants such as algorithms operating under restricted sorted/random access, deterministic and probabilistic approximations, as well as distributed and streaming top-k computations. A significant portion of the tutorial will be focused on applications of these top-k algorithms, especially in the context of the Web services and online databases, multimedia, documents and relational databases.
We propose a new method for approximate k-NN search in large scale image databases, based on top-k multi-criteria search techniques. The method defines a simple index structure based on sorted lists, which provides a ...
详细信息
ISBN:
(纸本)9781479923410
We propose a new method for approximate k-NN search in large scale image databases, based on top-k multi-criteria search techniques. The method defines a simple index structure based on sorted lists, which provides a good compromise between fast retrieval, storage requirements and update cost. The search algorithm delivers approximate results with guarantees about false negatives, with fast emergence of good approximations, monotonically improved and leading if necessary to an exact result. Experiments with the on-disk implementation show that our method produces very good approximate results several times faster than the Baseline method.
We propose a new method for approximate k-NN search in large scale image databases, based on top-k multi-criteria search techniques. The method defines a simple index structure based on sorted lists, which provides a ...
详细信息
ISBN:
(纸本)9781479923427
We propose a new method for approximate k-NN search in large scale image databases, based on top-k multi-criteria search techniques. The method defines a simple index structure based on sorted lists, which provides a good compromise between fast retrieval, storage requirements and update cost. The search algorithm delivers approximate results with guarantees about false negatives, with fast emergence of good approximations, monotonically improved and leading if necessary to an exact result. Experiments with the on-disk implementation show that our method produces very good approximate results several times faster than the Baseline method.
Identifying the top-k most frequent elements is one of the many problems associated with data streams analysis. It is a well-known and difficult problem, especially if the analysis is to be performed and maintained up...
详细信息
Identifying the top-k most frequent elements is one of the many problems associated with data streams analysis. It is a well-known and difficult problem, especially if the analysis is to be performed and maintained up to date in near real time. Analyzing data streams in time sliding window model is of particular interest as only the most recent, more relevant events are considered. Approximate answers are usually adequate when dealing with this problem. This paper presents a new and innovative algorithm, the Filtered Space-Saving with Sliding Window Algorithm (FSW) that addresses this problem by introducing in the Filtered Space Saving ( FSS) algorithm an approximated time sliding window counter. The algorithm provides the top-k list of elements, their frequency and an error estimate for each frequency value within the sliding window. It provides strong guarantees on the results, depending on the elements real frequencies. Experimental results detail performance on real life cases.
Identifying the most frequent elements in a data stream is a well known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. The use ...
详细信息
Identifying the most frequent elements in a data stream is a well known and difficult problem. Identifying the most frequent elements for each individual, especially in very large populations, is even harder. The use of fast and small memory footprint algorithms is paramount when the number of individuals is very large. In many situations such analysis needs to be performed and kept up to date in near real time. Fortunately, approximate answers are usually adequate when dealing with this problem. This paper presents a new and innovative algorithm that addresses this problem by merging the commonly used counter-based and sketch-based techniques for top-k identification. The algorithm provides the top-k list of elements, their frequency and an error estimate for each frequency value. It also provides strong guarantees on the error estimate, order of elements and inclusion of elements in the list depending on their real frequency. Additionally the algorithm provides stochastic bounds on the error and expected error estimates. Telecommunications customer's behavior and voice call data is used to present concrete results obtained with this algorithm and to illustrate improvements over previously existing algorithms. (C) 2010 Elsevier Inc. All rights reserved.
暂无评论