Scan is a fundamental operation widely used in main-memory analytical database systems. To accelerate scans, previous studies build either record-order or sort-order structures known as scan indices. While achieving g...
详细信息
Scan is a fundamental operation widely used in main-memory analytical database systems. To accelerate scans, previous studies build either record-order or sort-order structures known as scan indices. While achieving good performance, scan indices often incur significant space overhead, limiting their use in main-memory databases. For example, the most recent and best performing scan index, BinDex, consists of a sort-order position array, which is an array of rowIDs in the value order, and a set of record-order bit vectors, representing records in pre-defined value intervals. The structures can be much larger than the base data column *** this paper, we propose a novel scan index, Cabin, that exploits the following three techniques for better time-space tradeoff. 1) filter sketches that represent every 2^w-2 value intervals with a w-bit sketched vector, thereby exponentially reducing the space for the bit vectors; 2) selective position array that removes the rowID array for a fraction of intervals in order to lower the space overhead for the position array; and 3) data-aware intervals that judiciously select interval boundaries based on the data characteristics to better support popular values in skewed data distributions or categorical attributes. Experimental results show that compared with state-of-the-art scan solutions, Cabin achieves better time-space tradeoff, and attains 1.70 -- 4.48x improvement for average scan performance given the same space budget.
Hash join is one of the most important and widely used query processing operations. In many real-world applications, such as graph data analysis, join keys can be highly skewed. However, data skew is often only a seco...
详细信息
ISBN:
(数字)9798350384031
ISBN:
(纸本)9798350384048
Hash join is one of the most important and widely used query processing operations. In many real-world applications, such as graph data analysis, join keys can be highly skewed. However, data skew is often only a secondary design consideration in existing CPU and GPU hash join algorithms. We find that the data structures and algorithm steps in existing hash joins are less efficient for handling highly skewed join keys, leading to significant performance degradation. In this paper, we optimize hash joins for skewed data, and propose a CPU Skewconscious Hash join (CSH) and a GPU Skew-conscious Hash join (GSH). Our main idea is to detect skewed join keys and handle skewed vs. normal keys in separate routines optimized for their target cases. Our preliminary experiments show that compared to state-of-the-art CPU and GPU hash joins with existing skew-handling techniques, our proposed CSH and GSH achieve up to 8.0x and 13.5x improvement, respectively.
Hashing index is widely used to support efficient point operations. We observe that there is a conflict between performance and memory utilization goals. Existing hashing indices often have to trade off hash table acc...
详细信息
Hashing index is widely used to support efficient point operations. We observe that there is a conflict between performance and memory utilization goals. Existing hashing indices often have to trade off hash table access latency for better memory utilization. Moreover, many designs support only unique keys, and their performance is often suboptimal with skew *** this paper, we propose Pea Hash with two techniques to address the above two problems: (i) adaptive hashing strategy that holistically optimizes both access latency and memory utilization, and (ii) data-aware adaptive buckets that accommodate unique keys, and keys with various numbers of duplicates. We develop both an NVM-optimized Pea Hash and a DRAM-based Pea Hash index. Experiments on a machine equipped with Intel Optane DC Persistent memory show that compared to state-of-the-art NVM-optimized hashing indices, the NVM-optimized Pea Hash achieves up to 13.8x performance improvements with similar memory utilization. The DRAM-based Pea Hash outperforms existing in-DRAM hashing index designs, showing the generality of the proposed techniques.
暂无评论