Background: The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decline in disk cost, and the storage of huge sequencing datasets has become a bottleneck for downstream analyses. Data compression is considered a solution to reduce the dependency on storage, so efficient sequencing-data compression methods are in high demand. Results: In this article, we present a lossless reference-based compression method named LW-FQZip 2, targeted at FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 is equipped with a light-weight mapping model, a bitwise prediction-by-partial-matching model, arithmetic coding, and multi-threaded parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs. Conclusions: This competence enables LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://***/staff/zhuzx/LWFQZip2 and https://***/Zhuzxlab/LW-FQZip2.
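The abstract names the ingredients (a light-weight mapping model, bitwise prediction by partial matching, arithmetic coding) without detail. As a minimal illustration of the generic idea behind reference-based read compression, storing a mapped read as a reference offset plus its mismatches, here is a hypothetical Python sketch; it is not LW-FQZip 2's actual model, and all names are invented.

```python
# Hypothetical sketch of generic reference-based read encoding,
# not the actual LW-FQZip 2 mapping model.

def encode_read(read: str, reference: str, pos: int):
    """Encode a read mapped at `pos` as (pos, length, [(offset, base), ...])."""
    mismatches = [(i, b) for i, b in enumerate(read)
                  if reference[pos + i] != b]
    return pos, len(read), mismatches

def decode_read(pos: int, length: int, mismatches, reference: str) -> str:
    """Reconstruct the read from the reference and its mismatch list."""
    bases = list(reference[pos:pos + length])
    for i, b in mismatches:
        bases[i] = b
    return "".join(bases)

ref = "ACGTACGTACGT"
pos, length, mm = encode_read("ACGAACGT", ref, 0)   # one mismatch at offset 3
assert decode_read(pos, length, mm, ref) == "ACGAACGT"
```

Matched bases cost almost nothing under this representation; an entropy coder (such as the arithmetic coder mentioned above) then squeezes the remaining positions and mismatch lists.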
ISBN (print): 9781450366663
Transmission, storage, and archival of high-throughput sequencing (HTS) short-read datasets pose significant challenges due to their large size. Constant improvements to HTS technology, in the form of increasing throughput and decreasing cost, together with its growing adoption, amplify the problem. General-purpose compression algorithms have been widely adopted for representing read datasets in a compact form, but they are unable to fully leverage the domain-specific properties of read datasets. In response, researchers have proposed special-purpose compression algorithms that improve upon the compression efficiency of general-purpose algorithms. In this paper, we present ParRefCom, a parallel reference-based algorithm for compressing HTS genomics short-read datasets. HTS instruments are typically used to generate paired-end reads, as these are significant for biological analysis. In contrast to existing special-purpose compression algorithms, ParRefCom treats paired-end reads as first-class citizens. Owing to this treatment, our algorithm significantly improves compression efficiency over the state of the art: for a benchmark human dataset, the compressed output is 21% smaller than that produced by the current best algorithm. Further, ParRefCom is scalable, and its compression and decompression speeds are better than those of reference-free methods.
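The abstract does not specify how ParRefCom encodes pairs; the sketch below only illustrates why joint treatment of paired-end reads can help: the second mate can be stored as a small delta from the expected insert size rather than as an independent absolute position, and small deltas entropy-code far more cheaply. All names and the default insert size are hypothetical.

```python
# Hypothetical illustration of paired-end-aware position encoding;
# not ParRefCom's actual format.

def encode_pair_positions(pos1: int, pos2: int, expected_insert: int = 300):
    """Store mate 1 absolutely and mate 2 as a (usually tiny) delta
    from the expected insert size."""
    return pos1, (pos2 - pos1) - expected_insert

def decode_pair_positions(pos1: int, delta: int, expected_insert: int = 300):
    return pos1, pos1 + expected_insert + delta

p1, d = encode_pair_positions(1_000_000, 1_000_310)
assert d == 10                                   # tiny value instead of ~10^6
assert decode_pair_positions(p1, d) == (1_000_000, 1_000_310)
```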
In the blossoming age of Next Generation Sequencing (NGS) technologies, genome sequencing has become much easier and more affordable. The large number of enormous genomic sequences obtained demands huge storage space if they are to be kept for analysis. Since storage cost has become an impediment for biologists, there is a constant need for software that compresses genomic sequences efficiently. Most general-purpose compression algorithms do not exploit the inherent redundancies in genomic sequences, which is the reason for the success and popularity of reference-based compression algorithms. In this research, a new reference-based lossless compression technique is proposed for deoxyribonucleic acid (DNA) sequences stored in FASTA format, which can act as a layer above gzip compression. Several experiments were performed to evaluate this technique, and the results show that it obtains promising compression ratios, saving up to 99.9% of space and reaching a gain of 80% for some plant genomes. The proposed technique also performs the compression in acceptable time, even saving more than 50% of the time taken by ERGC in most experiments.
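The phrase "a layer above gzip" suggests the familiar pattern of re-expressing the target relative to a reference so that the residual stream compresses extremely well under gzip. A minimal sketch of that layering idea, assuming equal-length sequences and a '.' placeholder for matching bases; this is not the paper's actual encoding.

```python
# Minimal sketch of a "layer above gzip": re-express the target relative
# to a reference, then let gzip compress the residual. Hypothetical format.
import gzip

def delta_encode(target: str, reference: str) -> bytes:
    # Emit '.' where target matches the reference, the literal base otherwise.
    residual = "".join("." if i < len(reference) and b == reference[i] else b
                       for i, b in enumerate(target))
    return gzip.compress(residual.encode())

def delta_decode(blob: bytes, reference: str) -> str:
    residual = gzip.decompress(blob).decode()
    return "".join(reference[i] if c == "." else c
                   for i, c in enumerate(residual))

ref = "ACGT" * 1000
tgt = ref[:2000] + "T" + ref[2001:]              # one substitution
assert delta_decode(delta_encode(tgt, ref), ref) == tgt
```

Long runs of '.' are what gzip handles best, which is why such a layer can lift its compression ratio dramatically on near-identical genomes.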
The problem of lossless data compression with side information available to both the encoder and the decoder is considered. The finite-blocklength fundamental limits of the best achievable performance are defined, in two different versions of the problem: reference-based compression, when a single side information string is used repeatedly in compressing different source messages, and pair-based compression, where a different side information string is used for each source message. General achievability and converse theorems are established for arbitrary source-side information pairs. Nonasymptotic normal approximation expansions are proved for the optimal rate in both the reference-based and pair-based settings, for memoryless sources. These are stated in terms of explicit, finite-blocklength bounds that are tight up to third-order terms. Extensions that go significantly beyond the class of memoryless sources are obtained. The relevant source dispersion is identified and its relationship with the conditional varentropy rate is established. Interestingly, the dispersion is different in reference-based and pair-based compression, and it is proved that the reference-based dispersion is in general smaller.
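For readers unfamiliar with such expansions, the pair-based result takes the standard second-order (normal approximation) form, stated here schematically; the paper gives the precise conditions and the third-order terms.

```latex
% Schematic second-order form of the pair-based normal approximation for a
% memoryless source-side-information pair (X, Y); conditions and third-order
% terms are as in the paper.
\[
  R^*(n,\epsilon) \;=\; H(X \mid Y)
  \;+\; \sqrt{\frac{\sigma^2}{n}}\, Q^{-1}(\epsilon)
  \;+\; O\!\left(\frac{\log n}{n}\right),
\]
% where H(X|Y) = E[-log P(X|Y)] is the conditional entropy,
% sigma^2 = Var(-log P(X|Y)) is the conditional varentropy (the dispersion),
% and Q^{-1} is the inverse Gaussian tail function. The abstract's key point
% is that in the reference-based setting the quantity playing the role of
% sigma^2 is in general smaller.
```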
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, and many methods have been developed to reduce the storage size of genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they can be meaningfully analyzed, leaving large scope for novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, and compression and decompression time. We further present the remaining challenges and potential directions for future research.
Genomic science is facing an explosive increase of data thanks to the fast development of sequencing technology. This situation poses serious challenges to genomic data storage and transfer. It is desirable to compress data to reduce storage and transfer costs, and thus to boost the efficiency of data distribution and utilization. To date, a number of algorithms/tools have been developed for compressing genomic sequences. Unlike existing algorithms, most of which treat genomes as one-dimensional text strings and compress them based on dictionaries or probability models, this paper proposes a novel approach called CoGI (Compressing Genomes as an Image), which transforms genomic sequences into a two-dimensional binary image (bitmap) and then applies a rectangular partition coding algorithm to compress the binary image. CoGI can be used as either a reference-based or a reference-free compressor; for the former, we develop two entropy-based algorithms to select a proper reference genome. Performance evaluation is conducted on various genomes. Experimental results show that the reference-based CoGI significantly outperforms two state-of-the-art reference-based genome compressors, GReEn and RLZ-opt, in both compression ratio and compression efficiency. It also achieves a comparable compression ratio but two orders of magnitude higher compression efficiency in comparison with XM, a state-of-the-art reference-free genome compressor. Furthermore, our approach performs much better than Gzip, a general-purpose and widely used compressor, in both compression speed and compression ratio. CoGI can thus serve as an effective and practical genome compressor. The source code and other related documents of CoGI are available at: http://***/projects/***.
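The central step in CoGI is the sequence-to-bitmap transform. Below is a hypothetical sketch of one natural such mapping (2 bits per base, folded into fixed-width rows); the paper's actual bit layout and rectangular partition coder are not reproduced here.

```python
# Hypothetical sketch of a genome-to-bitmap transform in the spirit of CoGI:
# each base becomes 2 bits and the bit stream is folded into a 2-D binary
# image of fixed row width. Not CoGI's actual layout or partition coder.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def to_bitmap(seq: str, width: int = 8):
    bits = [b for base in seq for b in BITS[base]]
    bits += [0] * (-len(bits) % width)          # pad the last row
    return [bits[i:i + width] for i in range(0, len(bits), width)]

for row in to_bitmap("ACGTACGTACGT"):
    print("".join(map(str, row)))
# A rectangular-partition coder would then cover the 0-regions (or 1-regions)
# of this bitmap with as few rectangles as possible and encode the rectangles.
```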
ISBN (print): 9783319608167; 9783319608150
In this paper, we propose a computational approach to quantify inverted repeats. This is important because the presence of inverted repeats in genomic data is known to be associated with certain chromosomal rearrangements. First, we present a reference-based relative compression method that employs statistical characteristics of the genomic data. Then, to determine the similarity between genomic sequences, we use the normalized relative compression measure, which is light-weight in computational time and memory. Testing this approach on various species, including human, chimpanzee, gorilla, chicken, turkey and archaea genomes, we unveil previously unreported results that may support several insights into evolution.
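The normalized relative compression (NRC) measure has, in its usual form, a simple definition: the bits a relative compressor needs to describe x using only the reference y, normalized by the maximal cost of describing x literally. Stated schematically (the paper specifies the exact compressor C):

```latex
% Usual definition of normalized relative compression, stated schematically.
\[
  \mathrm{NRC}(x \,\|\, y) \;=\; \frac{C(x \,\|\, y)}{|x|\,\log_2 |\Sigma|},
\]
% where C(x||y) is the number of bits the relative compressor needs to
% represent x using only the reference y, |x| is the length of x, and
% |Sigma| is the alphabet size (4 for DNA). Values near 0 indicate high
% similarity to the reference; values near 1 indicate near-independence.
```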
Background: With the falling cost of gene sequencing and the demand from emerging technologies such as precision medicine and deep learning on genomes, we are in an era of gene data outbreaks. How to store, transmit and analyze these data has become a hotspot of current research. Reference-based compression algorithms are now widely used due to their high compression ratios, but a major problem remains: data from different gene banks cannot be merged directly or share information efficiently, because they are usually compressed against different references. The traditional workflow, decompression-and-recompression, is simple but time-consuming and needs to be improved and accelerated. Results: In this paper, we focus on this problem and propose a set of transformation algorithms to cope with it. We 1) analyze several compression algorithms to find their similarities and differences, 2) propose a naive method named TDM for data transformation between different gene banks, and 3) optimize TDM into two further methods, named WI and TGI. Experimental results show that the three proposed algorithms are an order of magnitude faster than the traditional decompression-and-recompression workflow. Conclusions: All three algorithms perform well in terms of time, and each has its own advantages for different datasets or situations: TDM and WI are more suitable for small-scale gene data transformation, while TGI is more suitable for large-scale gene data transformation.
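The abstract does not describe the internals of TDM, WI or TGI, so the sketch below only restates the problem being solved: turning a record compressed against reference A into a record against reference B, per record rather than by round-tripping the whole archive. Everything here (the record layout, the remap index) is hypothetical.

```python
# Hypothetical illustration of per-record reference transformation: a read
# stored as (position, length, mismatches) against ref_a is reconstructed
# and re-encoded against ref_b. TDM/WI/TGI presumably avoid or accelerate
# the re-mapping step; this naive version only restates the problem.

def transform_record(record, ref_a: str, ref_b: str, remap):
    pos_a, length, mm_a = record
    bases = list(ref_a[pos_a:pos_a + length])
    for i, b in mm_a:                    # reconstruct the original read
        bases[i] = b
    read = "".join(bases)
    pos_b = remap(pos_a)                 # hypothetical A->B coordinate index
    mm_b = [(i, b) for i, b in enumerate(read) if ref_b[pos_b + i] != b]
    return pos_b, length, mm_b

# With identical references and an identity remap, the record is unchanged:
ref = "ACGTACGTACGT"
assert transform_record((4, 4, [(1, "T")]), ref, ref, lambda p: p) == (4, 4, [(1, "T")])
```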
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous volume of genomic datasets, which continue to expand as the cost of sequencing decreases. A much more efficient and scalable genome compression program is therefore urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on a single worker node and scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://***/haichangyao/SparkGC.
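The abstract describes the pipeline only as first-order compression, in-memory caching, then second-order compression. As a rough, hypothetical PySpark illustration of that shape (SparkGC itself is a separate implementation; file paths and function names here are invented):

```python
# Rough, hypothetical PySpark shape of a two-stage genome compressor:
# stage 1 delta-encodes each genome against a broadcast reference in
# parallel; the residuals are cached in memory (mirroring the abstract's
# "keeping data active in memory between the first-order and second-order
# compression") before stage 2 compresses them. Not SparkGC's actual code.
import zlib
from pyspark import SparkContext

def first_order(seq: str, ref: str) -> str:
    # Per-genome residual against the reference: '.' marks a matching base.
    return "".join("." if i < len(ref) and b == ref[i] else b
                   for i, b in enumerate(seq))

if __name__ == "__main__":
    sc = SparkContext(appName="two-stage-genome-compression-sketch")
    ref = sc.broadcast(open("reference.seq").read())   # hypothetical input file
    genomes = sc.textFile("genomes/*.seq")             # one sequence per line
    residuals = genomes.map(lambda s: first_order(s, ref.value)).cache()
    # Second-order stage: generic byte-level compression of each residual.
    blobs = residuals.map(lambda r: zlib.compress(r.encode()).hex())
    blobs.saveAsTextFile("compressed-out")             # hypothetical sink
    sc.stop()
```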