Search Results - Inner Mongolia University Library

12th Conference on Cloud Computing, Big data and Emerging Topics

Authors: Sanz, Victoria; Pousa, Adrian; Naiouf, Marcelo; De Giusti, Armando

Affiliations: Natl Univ Plata Sch Comp Sci III LIDI La Plata Argentina; CIC Buenos Aires DF Argentina; Consejo Nacl Invest Cient & Tecn Buenos Aires DF Argentina

ISBN: (print) 9783031708060; 9783031708077

Nowadays, genomics has gained relevance since it allows preventing, diagnosing and treating diseases in a personalized way. The reduction in sequencing time and cost has increased the demand and, thus, the amount of genomic data that must be stored or transferred. Consequently, it becomes necessary to develop genome compression algorithms that help to reduce storage usage without consuming too much time. This is now possible thanks to modern multicore machines. This paper improves MtHRCM, a multi-threaded compression algorithm for large collections of genomes, by reducing its sequential component in order to enhance performance and scalability. Experimental results show that our optimized version is faster than MtHRCM and achieves the same compression ratio. Also, they reveal that this new version scales well when increasing the number of threads/cores for smaller test collections, while the high amount of simultaneous I/O requests to disk limits the scalability for larger test collections.

Keywords: genomic data compression; Multi-threaded Hybrid Referential Compression Method; Multicore; Performance; DNA

EXPLORING DEEP MARKOV MODELS IN GENOMIC DATA COMPRESSION USING SEQUENCE PRE-ANALYSIS

22nd European Signal Processing Conference (EUSIPCO)

Authors: Pratas, Diogo; Pinho, Armando J.

Affiliation: Univ Aveiro Signal Proc Lab DETI IEETA P-3810193 Aveiro Portugal

ISBN: (print) 9780992862619

The pressure to find efficient genomic compression algorithms is being felt worldwide, as proved by several prizes and competitions. In this paper, we propose a compression algorithm that relies on a pre-analysis of the data before compression, with the aim of identifying regions of low complexity. This strategy enables us to use deeper context models, supported by hash-tables, without requiring huge amounts of memory. As an example, context depths as large as 32 are attainable for alphabets of four symbols, as is the case of genomic sequences. These deeper context models show very high compression capabilities in very repetitive genomic sequences, yielding improvements over previous algorithms. Furthermore, this method is universal, in the sense that it can be used in any type of textual data (such as quality-scores).

Keywords: genomic data compression; hash-tables; finite-context models
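
The hash-table-backed context modeling mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it omits the pre-analysis step and the arithmetic coder, and only estimates how many bits an adaptive order-k model would spend on a DNA string. The function name `fcm_bits`, the default `order`, and the smoothing parameter `alpha` are assumptions introduced for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"
INDEX = {s: i for i, s in enumerate(ALPHABET)}

def fcm_bits(seq, order=12, alpha=1.0):
    """Estimate the code length (in bits) of `seq` under an adaptive
    finite-context model of the given order.

    Counts are kept in a dict (a hash table) keyed by the context string,
    so memory grows with the number of contexts actually seen rather than
    with 4**order. Hypothetical helper, not the paper's implementation.
    """
    counts = defaultdict(lambda: [0] * len(ALPHABET))
    total_bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]      # up to `order` preceding symbols
        c = counts[ctx]
        # Additive smoothing so unseen symbols never get probability zero.
        p = (c[INDEX[sym]] + alpha) / (sum(c) + alpha * len(ALPHABET))
        total_bits += -math.log2(p)
        c[INDEX[sym]] += 1                  # update the model after "coding"
    return total_bits

repetitive = "ACGT" * 200
print(fcm_bits(repetitive) / len(repetitive))  # bits per base, below 2 for repetitive input
```

On highly repetitive input the estimate falls well below the 2 bits per base of a fixed-length code, which is the effect the abstract attributes to deep context models.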

No-Reference Compression of Genomic Data Stored In FASTQ Format

IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

Authors: Bhola, Vishal; Bopardikar, Ajit S.; Narayanan, Rangavittal; Lee, Kyusang; Ahn, TaeJin

Affiliations: Samsung India Software Operat SAIT India Bangalore Karnataka India; Samsung Elect Co Ltd Suwon SAIT Suwon South Korea

ISBN: (print) 9780769545745

In this paper, we propose a system to compress Next Generation Sequencing (NGS) information stored in a FASTQ file. A FASTQ file contains text, DNA read and quality information for millions or billions of reads. The proposed system first parses the FASTQ file into its component fields. In a partial first pass it gathers statistics, which are then used to choose a representation for each field that can give the best compression. Text data is further parsed into repeating and variable components, and entropy coding is used to compress the latter. Similarly, Markov encoding and repeat-finding based methods are used for DNA read compression. Finally, we propose several run-length based methods to encode quality data, choosing the method that gives the best performance for a given set of quality values. The compression system provides features for lossless and nearly lossless compression as well as compressing only read and read + quality data. We compare its performance to the bzip2 text compression utility and an existing benchmark algorithm. We observe that the performance of the proposed system is superior to that of both systems.

Keywords: FASTQ; Next generation sequencing; genomic data compression
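
Two of the steps named in the abstract above, splitting a FASTQ file into its component fields and run-length coding the quality values, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the proposed system: records are assumed to span exactly four lines, the content of the '+' separator line is discarded, and the helper names `split_fastq`, `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby

def split_fastq(text):
    """Split FASTQ text into three field streams: identifiers, reads, qualities.

    Assumes the common four-line record layout (@id, read, +, quality).
    """
    lines = text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    ids, reads, quals = [], [], []
    for i in range(0, len(lines), 4):
        ids.append(lines[i][1:])      # drop the leading '@'
        reads.append(lines[i + 1])
        quals.append(lines[i + 3])    # line i + 2 is the '+' separator
    return ids, reads, quals

def rle_encode(s):
    """Run-length encode a string as (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(s)]

def rle_decode(pairs):
    """Invert rle_encode."""
    return "".join(ch * n for ch, n in pairs)

sample = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nII##\n"
ids, reads, quals = split_fastq(sample)
print(ids, reads, [rle_encode(q) for q in quals])
```

Separating the streams lets each field be coded with a method suited to its statistics; quality strings with long runs of the same value are the case where run-length coding pays off.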

Lossless and reference-free compression of FASTQ/A files using GeneSqueeze

SCIENTIFIC REPORTS, 2025, vol. 15, no. 1, pp. 1-19

Authors: Nazari, Foad; Patel, Sneh; Larocca, Melissa; Sansevich, Alina; Czarny, Ryan; Schena, Giana; Murray, Emma K.

Affiliation: Rajant Hlth Inc 200 Chesterfield Pkwy Malvern PA 19355 USA

As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capabilities, particularly when sequencing is utilized for longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve their full potential. We present a novel lossless, reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in the underlying components of FASTQ files to solve this need. GeneSqueeze's benefits include an auto-tuning compression protocol based on each file's distribution, lossless preservation of IUPAC nucleotides and read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). We compared GeneSqueeze to the general-purpose compressor, gzip, and to a domain-specific compressor, SPRING, to assess performance. Due to GeneSqueeze's current Python implementation, GeneSqueeze underperformed as compared to gzip and SPRING in the time domain. GeneSqueeze and gzip achieved 100% lossless compression across all elements of the FASTQ files (i.e. the read identifier, sequence, quality score and ' + ' lines). GeneSqueeze and gzip compressed all files losslessly, while both SPRING's traditional and lossless modes exhibited data loss of non-ACGTN IUPAC nucleotides and of metadata following the ' + ' on the separator line. GeneSqueeze showed up to three times higher compression ratios as compared to gzip, regardless of read length, number of reads, or file size, and had comparable compression ratios to SPRING across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences. As such, GeneSqueeze has the potential to significantly

Keywords: genomic data compression; FASTQ; FASTA; Lossless compression; k-mer sequence; DNA; RNA; Next-generation sequencing; Storage
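
The two quantities compared in the abstract above, compression ratio and losslessness, can be measured for the gzip baseline with Python's standard library. The sketch below is a generic measurement helper, not part of GeneSqueeze or its evaluation; the function name `gzip_ratio` and the toy input are assumptions.

```python
import gzip

def gzip_ratio(data: bytes):
    """Return (compression ratio, lossless flag) for gzip on `data`.

    The ratio is original size divided by compressed size; the flag
    reports whether decompression reproduces the input byte for byte.
    """
    packed = gzip.compress(data)
    lossless = gzip.decompress(packed) == data
    return len(data) / len(packed), lossless

# Toy FASTQ-like input; real files would be read from disk in binary mode.
toy = b"@r1\nACGT\n+\nIIII\n" * 1000
ratio, lossless = gzip_ratio(toy)
print(f"ratio={ratio:.1f} lossless={lossless}")
```

A byte-for-byte comparison of this kind is what detects losses such as dropped metadata on the separator line.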

Compression of Nanopore FASTQ Files

7th International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO)

Authors: Dufort y Alvarez, Guillermo; Seroussi, Gadiel; Smircich, Pablo; Sotelo, Jose; Ochoa, Idoia; Martin, Alvaro

Affiliations: Univ Republica Fac Ingn Montevideo Uruguay; Xperi Corp San Jose CA USA; Univ Republica Fac Ciencias Montevideo Uruguay; Inst Invest Biol Clemente Estable Dept Genom Montevideo Uruguay; Univ Illinois Elect & Comp Engn Urbana IL 61801 USA

ISBN: (print) 9783030179380; 9783030179373

The research and development of tools for genomic data compression has focused so far on data generated by second-generation sequencing technologies, while third-generation technologies, such as nanopore technologies, have received little attention in the data compression research community. In this paper, we investigate compression schemes for nanopore FASTQ files. We propose a nanopore quality scores compressor, called DualCtx, which yields significant improvements in compression performance with respect to the state-of-the-art. We also extend DualCtx to a full FASTQ compressor, termed DualFqz, by substituting DualCtx for the quality score compression module in a variant of Fqzcomp. We tested DualFqz and various existing compressors on a large nanopore data set. The results show that DualFqz achieves the best compression performance. The experiments also show that most current implementations of compressors fail to execute correctly on files with long variable length reads. DualCtx and DualFqz are freely available for download at: https:// ***/guidufort/DualFqz.

Keywords: genomic data compression; FASTQ compression; Nanopore sequencing technology
