Background: The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decline in disk cost, and the storage of huge sequencing datasets has become a bottleneck for downstream analyses. Data compression is considered a solution to reduce the dependency on storage, so efficient sequencing-data compression methods are in high demand. Results: In this article, we present a lossless reference-based compression method named LW-FQZip 2, targeted at FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 is equipped with a light-weight mapping model, a bitwise prediction-by-partial-matching model, arithmetic coding, and multi-threaded parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs. Conclusions: This competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://***/staff/zhuzx/LWFQZip2 and https://***/Zhuzxlab/LW-FQZip2.
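The abstract names the ingredients (a light-weight mapping model, bitwise prediction by partial matching, arithmetic coding) without detail. As a minimal illustration of the generic idea behind reference-based read compression, storing a mapped read as a reference offset plus its mismatches, here is a hypothetical Python sketch; it is not LW-FQZip 2's actual model, and all names are invented.

```python
# Hypothetical sketch of generic reference-based read encoding,
# not the actual LW-FQZip 2 mapping model.

def encode_read(read: str, reference: str, pos: int):
    """Encode a read mapped at `pos` as (pos, length, [(offset, base), ...])."""
    mismatches = [(i, b) for i, b in enumerate(read)
                  if reference[pos + i] != b]
    return pos, len(read), mismatches

def decode_read(pos: int, length: int, mismatches, reference: str) -> str:
    """Reconstruct the read from the reference and its mismatch list."""
    bases = list(reference[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)

ref = "ACGTACGTACGT"
pos, length, mm = encode_read("ACGAACGT", ref, 0)   # one mismatch at offset 3
assert decode_read(pos, length, mm, ref) == "ACGAACGT"
```

Matched bases cost almost nothing under this representation; an entropy coder (such as the arithmetic coder mentioned above) then squeezes the remaining positions and mismatch lists.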
ISBN (print): 9781450366663
Transmission, storage, and archival of high-throughput sequencing (HTS) short-read datasets pose significant challenges due to their large size. Constant improvements to HTS technology, in the form of increasing throughput and decreasing cost, together with its growing adoption, amplify the problem. General-purpose compression algorithms have been widely adopted for representing read datasets in a compact form, but they are unable to fully leverage the domain-specific properties of read datasets. In response, researchers have proposed special-purpose compression algorithms that improve upon the compression efficiency of general-purpose algorithms. In this paper, we present ParRefCom, a parallel reference-based algorithm for compressing HTS genomics short-read datasets. HTS instruments are typically used to generate paired-end reads, as these are significant for biological analysis. In contrast to existing special-purpose compression algorithms, ParRefCom treats paired-end reads as first-class citizens. Owing to this treatment, our algorithm significantly improves compression efficiency over the state of the art: for a benchmark human dataset, the compressed output is 21% smaller than that produced by the current best algorithm. Further, ParRefCom is scalable, and its compression and decompression speeds are better than those of reference-free methods.
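The abstract does not specify how ParRefCom encodes pairs; the sketch below only illustrates why joint treatment of paired-end reads can help: the second mate can be stored as a small delta from the expected insert size rather than as an independent absolute position, and small deltas entropy-code far more cheaply. All names and the default insert size are hypothetical.

```python
# Hypothetical illustration of paired-end-aware position encoding;
# not ParRefCom's actual format.

def encode_pair_positions(pos1: int, pos2: int, expected_insert: int = 300):
    """Store mate 1 absolutely and mate 2 as a (usually tiny) delta
    from the expected insert size."""
    return pos1, (pos2 - pos1) - expected_insert

def decode_pair_positions(pos1: int, delta: int, expected_insert: int = 300):
    return pos1, pos1 + expected_insert + delta

p1, d = encode_pair_positions(1_000_000, 1_000_310)
assert d == 10                                   # tiny value instead of ~10^6
assert decode_pair_positions(p1, d) == (1_000_000, 1_000_310)
```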
In the blossoming age of Next Generation Sequencing (NGS) technologies, genome sequencing has become much easier and more affordable. The large number of enormous genomic sequences obtained demands huge storage space if they are to be kept for analysis. Since storage cost has become an impediment for biologists, there is a constant need for software that compresses genomic sequences efficiently. Most general-purpose compression algorithms do not exploit the inherent redundancies in genomic sequences, which is the reason for the success and popularity of reference-based compression algorithms. In this research, a new reference-based lossless compression technique is proposed for deoxyribonucleic acid (DNA) sequences stored in FASTA format, which can act as a layer above gzip compression. Several experiments were performed to evaluate this technique, and the results show that it obtains promising compression ratios, saving up to 99.9% of space and reaching a gain of 80% for some plant genomes. The proposed technique also performs the compression in acceptable time, even saving more than 50% of the time taken by ERGC in most experiments.
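The phrase "a layer above gzip" suggests the familiar pattern of re-expressing the target relative to a reference so that the residual stream compresses extremely well under gzip. A minimal sketch of that layering idea, assuming equal-length sequences and a '.' placeholder for matching bases; this is not the paper's actual encoding.

```python
# Minimal sketch of a "layer above gzip": re-express the target relative
# to a reference, then let gzip compress the residual. Hypothetical format.
import gzip

def delta_encode(target: str, reference: str) -> bytes:
    # Emit '.' where target matches the reference, the literal base otherwise.
    residual = "".join("." if i < len(reference) and b == reference[i] else b
                       for i, b in enumerate(target))
    return gzip.compress(residual.encode())

def delta_decode(blob: bytes, reference: str) -> str:
    residual = gzip.decompress(blob).decode()
    return "".join(reference[i] if c == "." else c
                   for i, c in enumerate(residual))

ref = "ACGT" * 1000
tgt = ref[:2000] + "T" + ref[2001:]              # one substitution
assert delta_decode(delta_encode(tgt, ref), ref) == tgt
```

Long runs of '.' are what gzip handles best, which is why such a layer can lift its compression ratio dramatically on near-identical genomes.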
The problem of lossless data compression with side information available to both the encoder and the decoder is considered. The finite-blocklength fundamental limits of the best achievable performance are defined, in two different versions of the problem: reference-based compression, when a single side information string is used repeatedly in compressing different source messages, and pair-based compression, where a different side information string is used for each source message. General achievability and converse theorems are established for arbitrary source-side information pairs. Nonasymptotic normal approximation expansions are proved for the optimal rate in both the reference-based and pair-based settings, for memoryless sources. These are stated in terms of explicit, finite-blocklength bounds that are tight up to third-order terms. Extensions that go significantly beyond the class of memoryless sources are obtained. The relevant source dispersion is identified and its relationship with the conditional varentropy rate is established. Interestingly, the dispersion is different in reference-based and pair-based compression, and it is proved that the reference-based dispersion is in general smaller.
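For readers unfamiliar with such expansions, the pair-based result takes the standard second-order (normal approximation) form, stated here schematically; the paper gives the precise conditions and the third-order terms.

```latex
% Schematic second-order form of the pair-based normal approximation for a
% memoryless source-side-information pair (X, Y); conditions and third-order
% terms are as in the paper.
\[
  R^*(n,\epsilon) \;=\; H(X \mid Y)
  \;+\; \sqrt{\frac{\sigma^2}{n}}\, Q^{-1}(\epsilon)
  \;+\; O\!\left(\frac{\log n}{n}\right),
\]
% where H(X|Y) = E[-log P(X|Y)] is the conditional entropy,
% sigma^2 = Var(-log P(X|Y)) is the conditional varentropy (the dispersion),
% and Q^{-1} is the inverse Gaussian tail function. The abstract's key point
% is that in the reference-based setting the quantity playing the role of
% sigma^2 is in general smaller.
```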
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, and many methods have been developed to reduce the storage size of genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they can be meaningfully analyzed, leaving large scope for novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, and compression and decompression time. We further present the remaining challenges and potential directions for future research.
Genomic science is facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer costs, and thus to boost the efficiency of data distribution and utilization. To date, a number of algorithms/tools have been developed for compressing genomic sequences. Unlike existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (Compressing Genomes as an Image), which transforms genomic sequences into a two-dimensional binary image (bitmap) and then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based or a reference-free compressor; for the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves a comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM, a state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. CoGI can thus serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://***/projects/***.
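The central step in CoGI is the sequence-to-bitmap transform. Below is a hypothetical sketch of one natural such mapping (2 bits per base, folded into fixed-width rows); the paper's actual bit layout and rectangular partition coder are not reproduced here.

```python
# Hypothetical sketch of a genome-to-bitmap transform in the spirit of CoGI:
# each base becomes 2 bits and the bit stream is folded into a 2-D binary
# image of fixed row width. Not CoGI's actual layout or partition coder.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def to_bitmap(seq: str, width: int = 8):
    bits = [b for base in seq for b in BITS[base]]
    bits += [0] * (-len(bits) % width)          # pad the last row
    return [bits[i:i + width] for i in range(0, len(bits), width)]

for row in to_bitmap("ACGTACGTACGT"):
    print("".join(map(str, row)))
# A rectangular-partition coder would then cover the 0-regions (or 1-regions)
# of this bitmap with as few rectangles as possible and encode the rectangles.
```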
ISBN (print): 9783319608167; 9783319608150
In this paper, we propose a computational approach to quantify inverted repeats. This is important because the presence of inverted repeats in genomic data is known to be associated with certain chromosomal rearrangements. First, we present a reference-based relative compression method that employs statistical characteristics of the genomic data. Then, to determine the similarity between genomic sequences, we use the normalized relative compression measure, which is light-weight in computational time and memory. Testing this approach on various species, including human, chimpanzee, gorilla, chicken, turkey and archaea genomes, we unveil previously unreported results that may support several insights into evolution.
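The normalized relative compression (NRC) measure has, in its usual form, a simple definition: the bits a relative compressor needs to describe x using only the reference y, normalized by the maximal cost of describing x literally. Stated schematically (the paper specifies the exact compressor C):

```latex
% Usual definition of normalized relative compression, stated schematically.
\[
  \mathrm{NRC}(x \,\|\, y) \;=\; \frac{C(x \,\|\, y)}{|x|\,\log_2 |\Sigma|},
\]
% where C(x||y) is the number of bits the relative compressor needs to
% represent x using only the reference y, |x| is the length of x, and
% |Sigma| is the alphabet size (4 for DNA). Values near 0 indicate high
% similarity to the reference; values near 1 indicate near-independence.
```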
Background: With the falling cost of gene sequencing and the demand from emerging technologies such as precision medicine and deep learning on genomes, we are in an era of gene data outbreaks. How to store, transmit and analyze these data has become a hotspot of current research. Reference-based compression algorithms are now widely used due to their high compression ratios, but a major problem remains: data from different gene banks cannot be merged directly or share information efficiently, because they are usually compressed against different references. The traditional workflow, decompression-and-recompression, is simple but time-consuming and needs to be improved and accelerated. Results: In this paper, we focus on this problem and propose a set of transformation algorithms to cope with it. We 1) analyze several compression algorithms to find their similarities and differences, 2) propose a naive method named TDM for data transformation between different gene banks, and 3) optimize TDM into two further methods, named WI and TGI. Experimental results show that the three proposed algorithms are an order of magnitude faster than the traditional decompression-and-recompression workflow. Conclusions: All three algorithms perform well in terms of time, and each has its own advantages for different datasets or situations: TDM and WI are more suitable for small-scale gene data transformation, while TGI is more suitable for large-scale gene data transformation.
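The abstract does not describe the internals of TDM, WI or TGI, so the sketch below only restates the problem being solved: turning a record compressed against reference A into a record against reference B, per record rather than by round-tripping the whole archive. Everything here (the record layout, the remap index) is hypothetical.

```python
# Hypothetical illustration of per-record reference transformation: a read
# stored as (position, length, mismatches) against ref_a is reconstructed
# and re-encoded against ref_b. TDM/WI/TGI presumably avoid or accelerate
# the re-mapping step; this naive version only restates the problem.

def transform_record(record, ref_a: str, ref_b: str, remap):
    pos_a, length, mm_a = record
    bases = list(ref_a[pos_a:pos_a + length])
    for i, b in mm_a:                    # reconstruct the original read
        bases[i] = b
    read = "".join(bases)
    pos_b = remap(pos_a)                 # hypothetical A->B coordinate index
    mm_b = [(i, b) for i, b in enumerate(read) if ref_b[pos_b + i] != b]
    return pos_b, length, mm_b

# With identical references and an identity remap, the record is unchanged:
ref = "ACGTACGTACGT"
assert transform_record((4, 4, [(1, "T")]), ref, ref, lambda p: p) == (4, 4, [(1, "T")])
```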
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to expand as the cost of sequencing decreases. A much more efficient and scalable genome compression program is therefore urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on a single worker node and scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://***/haichangyao/SparkGC.
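The abstract describes the pipeline only as first-order compression, in-memory caching, then second-order compression. As a rough, hypothetical PySpark illustration of that shape (SparkGC itself is a separate implementation; file paths and function names here are invented):

```python
# Rough, hypothetical PySpark shape of a two-stage genome compressor:
# stage 1 delta-encodes each genome against a broadcast reference in
# parallel; the residuals are cached in memory (mirroring the abstract's
# "keeping data active in memory between the first-order and second-order
# compression") before stage 2 compresses them. Not SparkGC's actual code.
import zlib
from pyspark import SparkContext

def first_order(seq: str, ref: str) -> str:
    # Per-genome residual against the reference: '.' marks a matching base.
    return "".join("." if i < len(ref) and b == ref[i] else b
                   for i, b in enumerate(seq))

if __name__ == "__main__":
    sc = SparkContext(appName="two-stage-genome-compression-sketch")
    ref = sc.broadcast(open("reference.seq").read())   # hypothetical input file
    genomes = sc.textFile("genomes/*.seq")             # one sequence per line
    residuals = genomes.map(lambda s: first_order(s, ref.value)).cache()
    # Second-order stage: generic byte-level compression of each residual.
    blobs = residuals.map(lambda r: zlib.compress(r.encode()).hex())
    blobs.saveAsTextFile("compressed-out")             # hypothetical sink
    sc.stop()
```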