The ability to store, organize, process and distribute experimental data effectively, efficiently and securely is particularly important for large user facilities like the Advanced Photon Source (APS). In this article, the deployment of the APS Data Management System (DM) at the 1-ID and 6-BM beamlines of the APS is described. These two beamlines support a wide range of experimental techniques and generate data at relatively high rates, making them ideal candidates for illustrating the deployment and customization of the DM system and its tools. Several usage examples from these beamlines illustrate the various capabilities of the DM system.
Multi-OMICS approaches aim at the integration of quantitative data obtained for different biological molecules in order to understand their interrelation and the functioning of larger systems. This paper deals with several data integration and data processing issues that frequently occur within this context. To this end, the data processing workflow within the PROFILE project is presented, a multi-OMICS project that aims at the identification of novel biomarkers and the development of new therapeutic targets for seven important liver diseases. Furthermore, a software tool called CrossPlatformCommander is sketched, which facilitates several steps of the proposed workflow in a semi-automatic manner. Application of the software is presented for the detection of novel biomarkers, their ranking and their annotation with existing knowledge, using the example of corresponding Transcriptomics and Proteomics data sets obtained from patients suffering from hepatocellular carcinoma. Additionally, a linear regression analysis of Transcriptomics vs. Proteomics data is presented and its performance assessed. It was shown that, for capturing profound relations between Transcriptomics and Proteomics data, a simple linear regression analysis is not sufficient, and the implementation and evaluation of alternative statistical approaches are needed. Additionally, the integration of multivariate variable selection and classification approaches is intended for further development of the software. Although this paper focuses only on the combination of data obtained from quantitative Proteomics and Transcriptomics experiments, several approaches and data integration steps are also applicable to other OMICS technologies. Keeping specific restrictions in mind, the suggested workflow (or at least parts of it) may be used as a template for similar projects that make use of different high-throughput techniques. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era.
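As a simple illustration of the kind of gene-wise linear regression referred to above, the sketch below fits log-scale protein abundance against matched mRNA abundance and reports the coefficient of determination; the data are randomly generated stand-ins and the analysis is far simpler than what CrossPlatformCommander performs.

```python
# Minimal sketch of a Transcriptomics-vs-Proteomics linear regression of the
# kind discussed above; the data and the coupling strength are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Matched log2 abundances for the same genes (illustrative example data):
# mRNA levels from Transcriptomics, protein levels from Proteomics.
log_mrna = rng.normal(8.0, 2.0, size=500)
log_protein = 0.6 * log_mrna + rng.normal(0.0, 1.5, size=500)  # weak coupling

# Ordinary least-squares fit: protein ~ slope * mRNA + intercept
slope, intercept = np.polyfit(log_mrna, log_protein, deg=1)
predicted = slope * log_mrna + intercept

# Coefficient of determination as a simple performance measure
ss_res = np.sum((log_protein - predicted) ** 2)
ss_tot = np.sum((log_protein - log_protein.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.2f}")
```

A low R-squared in such a fit is exactly the situation the abstract describes: simple linear regression does not capture the deeper relations between transcript and protein levels.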
Due to the increasing volume of data to be analyzed and the need for global collaborations, many scientific applications have been deployed in a geo-distributed manner. Scientific workflows provide a good model for running and managing geo-distributed scientific data analytics. However, due to the multi-level data privacy requirements in geo-distributed data centers (DCs), as well as the costly and heterogeneous inter-DC network performance, executing scientific workflows efficiently in such a geo-distributed environment is not easy. In this paper, we propose a privacy-preserving workflow scheduling algorithm named PPPS, which aims at minimizing the inter-DC data transfer time for workflows while satisfying data privacy requirements. We compare PPPS with five state-of-the-art workflow scheduling algorithms using Windows Azure cloud performance traces and real scientific workflows. Experimental results show that PPPS can reduce workflow execution time by up to 93% compared to the other algorithms while satisfying complicated data privacy constraints. (C) 2021 Elsevier B.V. All rights reserved.
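PPPS itself is not reproduced here; the sketch below only illustrates the general idea of privacy-constrained, transfer-aware task placement, with hypothetical data-center bandwidths, task input sizes and privacy-allowed DC sets.

```python
# Illustrative greedy placement in the spirit of privacy-constrained,
# transfer-aware workflow scheduling; this is NOT the PPPS algorithm itself,
# and all sizes, bandwidths and privacy labels below are hypothetical.

# Inter-DC bandwidth in MB/s (symmetric, hypothetical measurements)
bandwidth = {("dc1", "dc2"): 40.0, ("dc2", "dc1"): 40.0,
             ("dc1", "dc3"): 15.0, ("dc3", "dc1"): 15.0,
             ("dc2", "dc3"): 25.0, ("dc3", "dc2"): 25.0}

def transfer_time(src, dst, size_mb):
    """Time to move a dataset between two data centers (0 if co-located)."""
    return 0.0 if src == dst else size_mb / bandwidth[(src, dst)]

# Each task: input datasets (location, size in MB) and the set of DCs that
# satisfy its data-privacy requirements.
tasks = {
    "align":   {"inputs": [("dc1", 800.0)], "allowed": {"dc1", "dc2"}},
    "filter":  {"inputs": [("dc2", 300.0)], "allowed": {"dc2", "dc3"}},
    "combine": {"inputs": [("dc1", 800.0), ("dc2", 300.0)], "allowed": {"dc2"}},
}

placement = {}
for name, task in tasks.items():
    # Pick the allowed DC that minimizes total inter-DC input transfer time.
    best_dc = min(task["allowed"],
                  key=lambda dc: sum(transfer_time(src, dc, size)
                                     for src, size in task["inputs"]))
    placement[name] = best_dc

print(placement)
```

The point of the sketch is the interplay of the two constraints: the privacy-allowed set prunes candidate DCs first, and only then is the transfer-time objective minimized over what remains.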
Aim: Palaeoecological data are crucial for comprehending large-scale biodiversity patterns and the natural and anthropogenic drivers that influence them over time. Over the last decade, the availability of open-access research databases of palaeoecological proxies has substantially increased. These databases open the door to research questions needing advanced numerical analyses and modelling based on big-data compilations. However, compiling and analysing palaeoecological data pose unique challenges that require a guide for producing standardized and reproducible compilations. Innovation: We present a step-by-step guide for processing fossil pollen data into a standardized dataset compilation ready for macroecological and palaeoecological analyses. We describe successive criteria that will enhance the quality of the compilations. Though these criteria are project- and research-question-dependent, we discuss the most important assumptions that should be considered and adjusted accordingly. Our guide is accompanied by an R workflow, called FOSSILPOL, and a corresponding R package, called R-Fossilpol, that provide a detailed protocol ready for interdisciplinary users. We illustrate the workflow by sourcing and processing Scandinavian fossil pollen datasets and show the reproducibility of continental-scale data processing. Main Conclusions: The study of biodiversity and macroecological patterns through time and space requires large-scale syntheses of palaeoecological datasets. The data preparation for such syntheses must be transparent and reproducible. With our FOSSILPOL workflow and R package, we provide a protocol for optimal handling of large compilations of fossil pollen datasets and for workflow reproducibility. Our workflow is also relevant for the compilation and synthesis of other palaeoecological proxies and as such offers a guide for synthetic and cross-disciplinary analyses with macroecological, biogeographical and palaeoecological perspectives. However, we emphasize that ...
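FOSSILPOL and R-Fossilpol are R tools; to keep the code examples in this compilation in a single language, the sketch below illustrates in Python/pandas the kind of project-dependent, record-level filtering criteria (age limits, minimum pollen counts, chronology-control requirements) such a compilation step applies. All column names and thresholds are hypothetical stand-ins, not the package's actual defaults.

```python
# Illustrative sketch of record-level filtering criteria a fossil pollen
# compilation might apply before analysis; columns and thresholds are made up.
import pandas as pd

samples = pd.DataFrame({
    "dataset_id":     [101, 101, 102, 103, 103],
    "age_cal_bp":     [150, 9500, 21000, 300, 4800],   # calibrated years BP
    "pollen_count":   [420, 180, 90, 510, 260],        # grains per sample
    "n_chron_points": [6, 6, 2, 8, 8],                 # chronology controls
})

# Project-dependent criteria (to be adjusted per research question):
MAX_AGE      = 12000   # keep Holocene-aged samples only
MIN_COUNT    = 150     # minimum pollen sum per sample
MIN_CONTROLS = 3       # minimum chronology control points per dataset

filtered = samples[
    (samples["age_cal_bp"] <= MAX_AGE)
    & (samples["pollen_count"] >= MIN_COUNT)
    & (samples["n_chron_points"] >= MIN_CONTROLS)
]
print(filtered)
```

Keeping such thresholds explicit and version-controlled is what makes a continental-scale compilation reproducible rather than an ad hoc selection.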
ISBN: (Print) 9783031210464; 9783031210471
In business applications, data integration is typically implemented as a data warehouse architecture. In this architecture, heterogeneous and distributed data sources are accessed and integrated by means of Extract-Transform-Load (ETL) processes. Designing these processes is challenging due to the heterogeneity of data models and formats, data errors and missing values, and multiple data pieces representing the same real-world objects. As a consequence, ETL processes are very complex, which results in high development and maintenance costs as well as long runtimes. To ease the development of ETL processes, various research and technological solutions were developed. They include, among others: (1) ETL design methods, (2) data cleaning pipelines, (3) data deduplication pipelines, and (4) performance optimization techniques. In spite of the fact that these solutions were included in commercial (and some open-license) ETL design environments and ETL engines, there still exist multiple open issues and the existing solutions still need to advance. In this paper (and its accompanying talk), I will provoke a discussion on what problems one can encounter while implementing ETL pipelines in real business (industrial) projects. The presented findings are based on my experience from research and commercial data integration projects in the financial, healthcare, and software development sectors. In particular, I will focus on a few particular issues, namely: (1) performance optimization of ETL processes, (2) cleaning and deduplicating large row-like data sets, and (3) integrating medical data.
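The sketch below is a deliberately minimal extract-clean-deduplicate-load pipeline meant only to illustrate the cleaning and deduplication issues mentioned above; the record layout and the name-based matching rule are hypothetical simplifications of what production ETL tools do.

```python
# Minimal extract-clean-deduplicate-load sketch; sources, schema and the
# matching key are hypothetical simplifications.
from typing import Iterable

def extract() -> Iterable[dict]:
    """Stand-in for reading from heterogeneous sources."""
    return [
        {"name": " ACME Corp ", "city": "Boston", "revenue": "120"},
        {"name": "acme corp",   "city": "Boston", "revenue": None},
        {"name": "Globex",      "city": None,     "revenue": "300"},
    ]

def transform(rows: Iterable[dict]) -> list[dict]:
    """Clean values and collapse records describing the same real-world object."""
    deduped: dict[str, dict] = {}
    for row in rows:
        key = row["name"].strip().lower()          # simple matching key
        clean = {
            "name": row["name"].strip(),
            "city": row["city"] or "UNKNOWN",      # handle missing values
            "revenue": float(row["revenue"]) if row["revenue"] else 0.0,
        }
        # Keep the record with the more complete revenue figure.
        if key not in deduped or deduped[key]["revenue"] == 0.0:
            deduped[key] = clean
    return list(deduped.values())

def load(rows: list[dict]) -> None:
    """Stand-in for writing into the data warehouse."""
    for row in rows:
        print(row)

load(transform(extract()))
```

Even this toy version shows where the costs come from: every rule about trimming, defaulting and record matching has to be encoded, tested and maintained, and on large row-like data sets each of these steps also becomes a performance concern.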
Simple light isotope metabolic labeling (SLIM labeling) is an innovative method to quantify variations in the proteome based on an original in vivo labeling strategy. Heterotrophic cells grown in U-[C-12] as the sole source of carbon synthesize U-[C-12]-amino acids, which are incorporated into proteins, giving rise to U-[C-12]-proteins. This results in a large increase in the intensity of the monoisotope ion of peptides and proteins, thus allowing higher identification scores and protein sequence coverage in mass spectrometry experiments. This method, initially developed for signal processing and quantification of the incorporation rate of C-12 into peptides, was based on a multistep process that was difficult for many laboratories to implement. To overcome these limitations, we developed a new theoretical background to analyze bottom-up proteomics data using SLIM labeling (bSLIM) and established simple procedures based on open-source software, using dedicated OpenMS modules and embedded R scripts, to process the bSLIM experimental data. These new tools allow computation of both the C-12 abundance in peptides, to follow the kinetics of protein labeling, and the molar fraction of unlabeled and C-12-labeled peptides in multiplexing experiments, to determine the relative abundance of proteins extracted under different biological conditions. They also make it possible to consider incomplete C-12 labeling, such as that observed in cells with nutritional requirements for nonlabeled amino acids. These tools were validated on an experimental dataset produced using various strains of the yeast Saccharomyces cerevisiae and various growth conditions. The workflows are built on the implementation of appropriate calculation modules in a KNIME working environment. These new integrated tools provide a convenient framework for the wider use of the SLIM-labeling strategy.
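The arithmetic behind interpreting the enlarged monoisotopic peak can be sketched with a simple carbon-only binomial model: the monoisotopic fraction of a peptide with N carbons is (1 - p13)^N, and the molar fraction of labeled peptide in a two-component mixture follows from the observed value. The residual 13C abundance, the carbon count and the model itself are illustrative assumptions; the published bSLIM workflow relies on OpenMS modules and embedded R scripts within KNIME rather than on this script.

```python
# Back-of-the-envelope sketch of the isotope arithmetic behind SLIM labeling;
# the simple binomial (carbon-only) model and the numbers are illustrative.

NATURAL_13C = 0.0107    # natural 13C abundance
LABELED_13C = 0.0010    # residual 13C in U-[12C]-grown cells (assumed value)

def mono_fraction(n_carbons: int, p13: float) -> float:
    """Probability that a peptide with n_carbons carries no 13C atom,
    i.e. the share of signal in the monoisotopic peak (carbon-only model)."""
    return (1.0 - p13) ** n_carbons

def labeled_molar_fraction(observed_m0: float, n_carbons: int) -> float:
    """Solve observed_m0 = x * M0_labeled + (1 - x) * M0_natural for x,
    the molar fraction of the 12C-labeled peptide in a two-component mix."""
    m0_nat = mono_fraction(n_carbons, NATURAL_13C)
    m0_lab = mono_fraction(n_carbons, LABELED_13C)
    return (observed_m0 - m0_nat) / (m0_lab - m0_nat)

# Example: a tryptic peptide with 60 carbon atoms
print(f"natural M0: {mono_fraction(60, NATURAL_13C):.3f}")   # ~0.52
print(f"labeled M0: {mono_fraction(60, LABELED_13C):.3f}")   # ~0.94
print(f"mix at observed M0=0.80 -> labeled fraction "
      f"{labeled_molar_fraction(0.80, 60):.2f}")
```

The jump from roughly half of the signal in the monoisotopic peak to almost all of it is what drives the higher identification scores; allowing a nonzero residual 13C value is the simple way the model accommodates incomplete labeling.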
During recent years, we have observed a widespread emergence of new data sources, especially all types of social media and IoT devices, which produce huge data volumes whose content ranges from fully structured to totally unstructured ...
In mass spectrometry-based proteomics, peptides are typically identified from tandem mass spectra using spectrum comparison. A sequence search engine compares experimentally obtained spectra with those predicted from protein sequences, applying enzyme cleavage and fragmentation rules. There are two main alternatives to this approach: spectral libraries and de novo sequencing. The former compares measured spectra with a collection of previously acquired and identified spectra in a library. De novo sequencing attempts to derive peptide sequences from the tandem mass spectra alone. Here we present a theoretical framework and a data processing workflow for visualizing and comparing the results of these different types of algorithms. The method considers the three search strategies as different dimensions, identifies distinct agreement classes and visualizes the complementarity of the search strategies. We have included X! Tandem, SpectraST and PepNovo, as they are in common use and representative of algorithms of each type. Our method allows advanced investigation of how the three search methods perform relative to each other and shows the impact of the currently used decoy sequences for evaluating false discovery rates.
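A minimal way to picture the agreement-class idea is to bin each spectrum by which engines identified it and whether their peptide calls coincide; the sketch below does exactly that with made-up spectra and peptides, and is much cruder than the three-dimensional comparison described in the paper.

```python
# Sketch of binning per-spectrum results from three search strategies into
# agreement classes; engine names follow the paper, but the spectrum IDs and
# peptide sequences are made up.

# Peptide assigned to each spectrum by each strategy (None = no identification)
xtandem   = {"s1": "PEPTIDEK", "s2": "LMNSTR", "s3": None,      "s4": "AGHK"}
spectrast = {"s1": "PEPTIDEK", "s2": "LMNSTR", "s3": "QWERTYK", "s4": None}
pepnovo   = {"s1": "PEPTIDEK", "s2": "LMNPTR", "s3": "QWERTYK", "s4": None}

def agreement_class(spectrum: str) -> str:
    calls = {"X!Tandem": xtandem.get(spectrum),
             "SpectraST": spectrast.get(spectrum),
             "PepNovo": pepnovo.get(spectrum)}
    made = {engine: pep for engine, pep in calls.items() if pep is not None}
    if not made:
        return "unidentified"
    if len(made) == 1:
        return f"identified by {next(iter(made))} only"
    if len(set(made.values())) == 1:
        return "agreement: " + " + ".join(sorted(made))
    return "conflicting identifications"

for spectrum in ("s1", "s2", "s3", "s4"):
    print(spectrum, "->", agreement_class(spectrum))
```

Counting spectra per class across a whole dataset is what reveals the complementarity of the strategies, e.g. how many peptides only de novo sequencing recovers.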
Data-intensive and long-lasting applications running in the form of workflows are being increasingly dispatched to cloud computing systems. Current scheduling approaches for graphs of dependencies fail to deliver high resource efficiency while keeping computation costs low, especially for continuous data processing workflows, where the scheduler does not perform any reasoning about the impact new input data may have on the workflow's final output. To face this challenge, we introduce a new scheduling criterion, Quality-of-Data (QoD), which describes the requirements on the data that make the triggering of tasks in workflows worthwhile. Based on the QoD notion, we propose a novel service-oriented scheduler planner for continuous data processing workflows that is capable of enforcing QoD constraints and guiding the scheduling to attain resource efficiency, overall controlled performance and task prioritization. To contrast the advantages of our scheduling model against others, we developed WaaS (Workflow-as-a-Service), a workflow coordinator system for the Cloud in which data is shared among tasks via a cloud columnar database.
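The Quality-of-Data notion can be pictured as a guard that lets a downstream task fire only once enough new or sufficiently different input has accumulated, or once results have grown too stale; the threshold dimensions and the guard class below are illustrative assumptions, not the WaaS implementation.

```python
# Sketch of QoD-style triggering: a downstream task only fires when newly
# arrived input is "worth" reprocessing. The three dimensions used here
# (volume, value change, staleness) are illustrative choices.
import time

class QoDGuard:
    def __init__(self, min_new_rows: int, min_delta: float, max_age_s: float):
        self.min_new_rows = min_new_rows   # how much new data must accumulate
        self.min_delta = min_delta         # how much aggregate values must move
        self.max_age_s = max_age_s         # never let results grow too stale
        self.pending_rows = 0
        self.pending_delta = 0.0
        self.last_fire = time.monotonic()

    def observe(self, n_rows: int, delta: float) -> None:
        """Record newly written input for the guarded task."""
        self.pending_rows += n_rows
        self.pending_delta += abs(delta)

    def should_trigger(self) -> bool:
        stale = time.monotonic() - self.last_fire > self.max_age_s
        return (self.pending_rows >= self.min_new_rows
                or self.pending_delta >= self.min_delta
                or stale)

    def fired(self) -> None:
        """Reset the accumulated slack after the task has been scheduled."""
        self.pending_rows, self.pending_delta = 0, 0.0
        self.last_fire = time.monotonic()

guard = QoDGuard(min_new_rows=1000, min_delta=50.0, max_age_s=300.0)
guard.observe(n_rows=400, delta=12.5)
print(guard.should_trigger())   # False: not enough new data yet
guard.observe(n_rows=700, delta=3.0)
print(guard.should_trigger())   # True: row threshold crossed
```

Relaxing or tightening such thresholds is the lever that trades result freshness for resource efficiency, which is precisely the trade-off the QoD scheduler exposes.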