Nowadays, genomics has gained relevance since it allows preventing, diagnosing and treating diseases in a personalized way. The reduction in sequencing time and cost has increased the demand and, thus, the amount of g...
详细信息
ISBN:
(纸本)9783031708060;9783031708077
Nowadays, genomics has gained relevance since it allows preventing, diagnosing and treating diseases in a personalized way. The reduction in sequencing time and cost has increased the demand and, thus, the amount of genomicdata that must be stored or transferred. Consequently, it becomes necessary to develop genome compression algorithms that help to reduce storage usage without consuming too much time. This is now possible thanks to modern multicore machines. This paper improves MtHRCM, a multi-threaded compression algorithm for large collections of genomes, by reducing its sequential component in order to enhance performance and scalability. Experimental results show that our optimized version is faster than MtHRCM and achieves the same compression ratio. Also, they reveal that this new version scales well when increasing the number of threads/cores for smaller test collections, while the high amount of simultaneous I/O requests to disk limits the scalability for larger test collections.
The pressure to find efficient genomiccompression algorithms is being felt worldwide, as proved by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of t...
详细信息
ISBN:
(纸本)9780992862619
The pressure to find efficient genomiccompression algorithms is being felt worldwide, as proved by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy enables us to use deeper context models, supported by hash-tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case of genomic sequences. These deeper context models show very high compression capabilities in very repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be used in any type of textual data (such as quality-scores).
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read and quality information for millions or billions of reads. The p...
详细信息
ISBN:
(纸本)9780769545745
In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass it gathers statistics which are then used to choose a representation for each field that can give the best compression. Text data is further parsed into repeating and variable components and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat finding based methods are used for DNA read compression. Finally, we propose several run length based methods to encode quality data choosing the method that gives the best performance for a given set of quality values. The compression system provides features for lossless and nearly lossless compression as well as compressing only read and read + quality data. We compare its performance to bzip2 text compression utility and an existing benchmark algorithm. We observe that the performance of the proposed system is superior to that of both the systems.
As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research an...
详细信息
As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capabilities, particularly when sequencing is utilized for longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve their full potential. We present a novel lossless, reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in the underlying components of FASTQ files to solve this need. GeneSqueeze's benefits include an auto-tuning compression protocol based on each file's distribution, lossless preservation of IUPAC nucleotides and read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). We compared GeneSqueeze to the general-purpose compressor, gzip, and to a domain-specific compressor, SPRING, to assess performance. Due to GeneSqueeze's current Python implementation, GeneSqueeze underperformed as compared to gzip and SPRING in the time domain. GeneSqueeze and gzip achieved 100% lossless compression across all elements of the FASTQ files (i.e. the read identifier, sequence, quality score and ' + ' lines). GeneSqueeze and gzip compressed all files losslessly, while both SPRING's traditional and lossless modes exhibited data loss of non-ACGTN IUPAC nucleotides and of metadata following the ' + ' on the separator line. GeneSqueeze showed up to three times higher compression ratios as compared to gzip, regardless of read length, number of reads, or file size, and had comparable compression ratios to SPRING across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences. As such, GeneSqueeze has the potential to significantly
The research and development of tools for genomic data compression has focused so far on data generated by second-generation sequencing technologies, while third-generation technologies, such as nanopore technologies,...
详细信息
ISBN:
(纸本)9783030179380;9783030179373
The research and development of tools for genomic data compression has focused so far on data generated by second-generation sequencing technologies, while third-generation technologies, such as nanopore technologies, have received little attention in the datacompression research community. In this paper, we investigate compression schemes for nanopore FASTQ files. We propose a nanopore quality scores compressor, called DualCtx, which yields significant improvements in compression performance with respect to the state-of-the-art. We also extend DualCtx to a full FASTQ compressor, termed DualFqz, by substituting DualCtx for the quality score compression module in a variant of Fqzcomp. We tested DualFqz and various existing compressors on a large nanopore data set. The results show that DualFqz achieves the best compression performance. The experiments also show that most current implementations of compressors fail to execute correctly on files with long variable length reads. DualCtx and DualFqz are freely available for download at: https:// ***/guidufort/DualFqz.
暂无评论