Sequences of integers are common data types, occurring either as primary data or ancillary structures. The sizes of sequences can be large, making compression an interesting option. Effective compression presupposes v...
详细信息
Sequences of integers are common data types, occurring either as primary data or ancillary structures. The sizes of sequences can be large, making compression an interesting option. Effective compression presupposes variable-length coding, which destroys the regular alignment of values. Yet it would often be desirable to access only a small subset of the entries, either by position (ordinal number) or by content (element value), without having to decode most of the sequence from the start. Here such a random access technique for compressed integers is described, with the special feature that no auxiliary index is needed. The solution applies a method called interpolative coding, which is one of the most efficient non-statistical codes for integers. Indexing is avoided by address calculation guaranteeing sufficient space for codes even in the worst case. The additional redundancy, compared to regular interpolative coding, is only about 1 bit per source integer for uniform distribution. The time complexity of random access is logarithmic with respect to the source size for both position-based and content-based retrieval. According to experiments, random access is faster than full decoding when the number of accessed integers is not more than approximately 0.75. n/log(2)n for sequence length n. The tests also confirm that the method is quite competitive with other approaches to random access coding, suggested in the literature. (C) 2010 Elsevier Ltd. All rights reserved.
Lossless compression methods based on the Burrows-Wheeler transform (BWT) are regarded as an excellent compromise between speed and compression efficiency: they provide compression rates close to the PPM algorithms, w...
详细信息
Lossless compression methods based on the Burrows-Wheeler transform (BWT) are regarded as an excellent compromise between speed and compression efficiency: they provide compression rates close to the PPM algorithms, with the speed of dictionary-based methods. Instead of the laborious statistics-gathering process used in PPM, the BWT reversibly sorts the input symbols, using as the sort key as many following characters as necessary to make the sort unique. Characters occurring in similar contexts are sorted close together, resulting in a clustered symbol sequence. Run-length encoding and Move-to-Front (MTF) recoding, combined with a statistical Huffman or arithmetic coder, is then typically used to exploit the clustering. A drawback of the MTF recoding is that knowledge of the character that produced the MTF number is lost. In this paper, we present a new, competitive Burrows-Wheeler posttransform stage that takes advantage of interpolative coding-a fast binary encoding method for integer sequences, being able to exploit clusters without requiring explicit statistics. We introduce a fast and simple way to retain knowledge of the run characters during the MTF recoding and use this to improve the clustering of MTF numbers and run-lengths by applying reversible, stable sorting, with the run characters as sort keys, achieving significant improvement in the compression rate, as shown here by experiments on common text corpora.
This paper presents a size reduction method for the inverted file, the most suitable indexing structure for an information retrieval system (IRS). We notice that in an inverted file the document identifiers for a give...
详细信息
This paper presents a size reduction method for the inverted file, the most suitable indexing structure for an information retrieval system (IRS). We notice that in an inverted file the document identifiers for a given word are usually clustered. While this Clustering property can be used in reducing the size of the inverted file, good compression as well as fast decompression must both be available. In this paper, we present it method that can facilitate coding and decoding processes for interpolative coding using recursion elimination and loop unwinding. We call this method the unique-order interpolative coding. It can calculate the lower and upper bounds of every document identifier for a binary code without using a recursive process, hence the decompression time can be greatly reduced. Moreover, it also can exploit document identifier Clustering to compress the inverted file efficiently. Compared with the other well-known compression methods, our method provides fast decoding speed and excellent compression. This method can also be used to support a self-indexing strategy. Therefore our research work in this paper provides a feasible way to build a fast and space-economical IRS. (c) 2005 Elsevier Ltd. All rights reserved.
Inverted indexes are mainstream in Information Retrieval systems and many compression techniques have been proposed. The purpose of this paper is to explore the compression efficiency on a two-level inverted index tai...
详细信息
Inverted indexes are mainstream in Information Retrieval systems and many compression techniques have been proposed. The purpose of this paper is to explore the compression efficiency on a two-level inverted index tailored for n-gram indices. We use two compression techniques Optimal PForDelta and IPC. Both techniques are applied to a previous work of us, that has focused on developing a threshold to efficiently store subsequences inside a one or two level inverted index, based on their number of occurrences inside a biological sequence. We study the performance of these two compression algorithms over different fluctuations of the threshold. The compression ratio of the OptPFD is affected by the changes in the threshold and is also efficient as in text documents. Whereas, IPC has a different performance for each threshold and it is more stable, although it is much less efficient than in text documents. (C) 2019 Elsevier Inc. All rights reserved.
Pulse-code modulation (PCM) with embedded quantization allows the rate of the PCM bitstream to be reduced by simply removing a fixed number of least significant bits from each codeword. Although this source coding tec...
详细信息
Pulse-code modulation (PCM) with embedded quantization allows the rate of the PCM bitstream to be reduced by simply removing a fixed number of least significant bits from each codeword. Although this source coding technique is extremely simple, it has poor coding efficiency. In this paper, we present a generalized PCM (GPCM) algorithm for images that simply removes bits from each codeword. In contrast to PCM, however, the number and the specific bits that a GPCM encoder removes in each codeword depends on its position in the bitstream and the statistics of the image. Since GPCM allows the encoding to be performed with different degrees of computational complexity, it can adapt to the computational resources that are available in each application. Experimental results show that GPCM outperforms PCM with a gain that depends on the rate, the computational complexity of the encoding, and the degree of inter-pixel correlation of the image.
Binary tree predictive coding (BTPC) is an efficient general-purpose still-image compression scheme, competitive with JPEG for natural image coding and with GIF for graphics. We report in this paper the extension of B...
详细信息
Binary tree predictive coding (BTPC) is an efficient general-purpose still-image compression scheme, competitive with JPEG for natural image coding and with GIF for graphics. We report in this paper the extension of BTPC to video compression using motion estimation and compensation techniques which are simple, efficient, nonlinear and predictive. The new methods, binary tree recursive motion estimation coding (BTRMEC), and binary tree residue coding (BTRC) exploit the hierarchical structure of BTPC, in the first case giving progressively refined motion estimates for increasing numbers of pels and in the second case providing efficient residue coding. Compression results for BTRMEC and BTRC are compared against conventional block-based motion compensated coding as provided by MPEG, They show that both BTRMEC and BTRC are efficient methods to code video sequences.
A new, simple non-statistical source coding technique for sequences of integers is suggested. The method is based on a tournament scheme, with the sequence arranged into pairs, where maxima ('winners') are enc...
详细信息
A new, simple non-statistical source coding technique for sequences of integers is suggested. The method is based on a tournament scheme, with the sequence arranged into pairs, where maxima ('winners') are encoded recursively, and minima are encoded by semi-fixed-length codes using the related maxima to bound the code lengths. In the experiments, tournament coding has outperformed the other non-statistical methods (gamma, delta, Fibonacci and interpolative coding) for uniform distribution of numbers. Also for non-uniform distributions the method is quite competitive.
暂无评论