arXiv

Portability of Scientific Workflows in NGS Data Analysis: A Case Study

Authors: Schiefer, Christopher; Bux, Marc; Brandt, Jörgen; Messerschmidt, Clemens; Reinert, Knut; Beule, Dieter; Leser, Ulf

Affiliations: Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, Berlin, Germany; Berlin, Germany; Charité - Universitätsmedizin Berlin, Berlin, Germany; Freie Universität Berlin, Algorithms in Bioinformatics, Berlin, Germany; Max Delbrück Center for Molecular Medicine, Berlin, Germany

Publication: arXiv

Year/Volume/Issue: 2020

Core indexing:

Subject: Reusability

Abstract: The analysis of next-generation sequencing (NGS) data requires complex computational workflows consisting of dozens of autonomously developed yet interdependent processing steps. Whenever large amounts of data need to be processed, these workflows must be executed on parallel and/or distributed systems to ensure reasonable runtime. To simplify the development and parallel execution of workflows, researchers rely on existing services such as distributed file systems, specialized workflow languages, resource managers, or workflow scheduling tools. Systems that cover some or all of these functionalities are categorized under labels like scientific workflow management systems, big data processing frameworks, or batch-queuing systems. Porting a workflow developed for a particular system on a particular hardware infrastructure to another system or to another infrastructure is non-trivial, which poses a major impediment to the scientific necessities of workflow reproducibility and workflow reusability. In this work, we describe our efforts to port a state-of-the-art workflow for the detection of specific variants in whole-exome sequencing of mice. The workflow was originally developed in the scientific workflow system Snakemake for execution on a high-performance cluster controlled by Sun Grid Engine. In the project, we ported it to the scientific workflow system SaasFee, which can execute workflows on (multi-core) stand-alone servers or on clusters of arbitrary size using the Hadoop cluster management software. The purpose of this port was to make the workflow usable by owners of low-cost hardware infrastructures as well, the kind of infrastructure Hadoop was made for. Although both the source and the target system are called scientific workflow systems, they differ in numerous aspects, ranging from the workflow languages to the scheduling mechanisms and the file access interfaces.
These differences resulted in various problems, some expected and more unexpected, that had to be resol
