ISBN (Print): 9781450376280
Big data workflow management systems (BDWFMSs) have recently emerged as popular platforms to perform large-scale data analytics in the cloud. However, the protection of data confidentiality and secure execution of workflow applications remain important and challenging problems. Although a few data analytics systems have been developed to address these problems, they are limited to specific structures such as MapReduce-style workflows and SQL queries. This paper proposes SecdataVIEW, a BDWFMS that leverages Intel Software Guard eXtensions (SGX) and AMD Secure Encrypted Virtualization (SEV) to build a heterogeneous trusted execution environment for workflows. SecdataVIEW aims to (1) provide confidentiality and integrity of code and data for workflows running on untrusted public clouds, (2) minimize the TCB size of a BDWFMS, (3) enable a trade-off between security and performance for workflows, and (4) support the execution of Java-based workflow tasks in SGX. Our experimental results show that SecdataVIEW imposes 1.69x to 2.62x overhead on workflow execution time on SGX worker nodes, 1.04x to 1.29x on SEV worker nodes, and 1.20x to 1.43x in a heterogeneous setting in which both SGX and SEV worker nodes are used.
Cloud computing is one of the critical technologies that meet the demand of various businesses for the high-capacity computational processing power needed to gain knowledge from their ever-growing business data. When utilizing cloud computing resources for big data processing, companies face the challenge of determining the optimal use of resources within their business processes. Miscalculating the necessary resources directly affects their budget and can delay the cycle time of their key processes. This study investigates the simulation of cloud resource optimization for big data workflows modeled with the Business Process Modeling Notation (BPMN). To this end, a BPMN performance evaluation framework was developed. The framework's capabilities were demonstrated on a real-world data science workflow and later evaluated on workflows consisting of 13, 52, and 104 tasks. The results show that the developed framework is adequate for estimating the overall run-time distribution and optimizing cloud resource deployment, and that BPMN can be utilized for big data processing workflows. This study therefore contributes a tool that lets BPMN practitioners apply BPMN to their big data workflows, and gives decision-makers critical insights into their key business processes. The framework source code is available at https://***/ntankovic/python-bpmn-engine.
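The run-time-distribution estimation described above can be illustrated with a small Monte Carlo sketch. This is a hypothetical example, not the paper's framework: task names, durations, and the triangular-distribution parameters are invented, and tasks form a simple chain where each starts after its predecessors finish.

```python
import random
import statistics

# Illustrative only: each task's duration is sampled from a triangular
# distribution (low, high, mode); a task starts once all predecessors finish.
TASKS = {                      # task -> (low, high, mode) duration in seconds
    "ingest": (5, 15, 8),
    "clean":  (10, 20, 12),
    "train":  (30, 90, 40),
    "report": (2, 5, 3),
}
DEPS = {                       # task -> predecessors
    "clean":  ["ingest"],
    "train":  ["clean"],
    "report": ["train"],
}

def simulate_once(rng):
    finish = {}
    for task in ["ingest", "clean", "train", "report"]:  # topological order
        start = max((finish[p] for p in DEPS.get(task, [])), default=0.0)
        finish[task] = start + rng.triangular(*TASKS[task])
    return max(finish.values())            # workflow makespan

rng = random.Random(42)
samples = [simulate_once(rng) for _ in range(10_000)]
print(f"mean makespan: {statistics.mean(samples):.1f}s, "
      f"p95: {sorted(samples)[9499]:.1f}s")
```

Repeating the simulation many times yields an empirical makespan distribution, from which percentiles can guide resource provisioning decisions.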
Data validation is about verifying the correctness of data. When organisations update and refine their data transformations to meet evolving requirements, it is imperative to ensure that the new version of a workflow still produces the correct output. We motivate the need for workflow validation and describe the implementation of a validation tool called Diftong. This tool compares two tabular databases resulting from different versions of a workflow to detect and prevent potential unwanted alterations. Row-based and column-based statistics are used to quantify the results of the database comparison. Diftong was shown to provide accurate results in test scenarios, bringing benefits to companies that need to validate the outputs of their workflows. By automating this process, the risk of human error is also eliminated. Compared to the more labour-intensive manual alternative, it has the added benefit of improved turnaround time for the validation process. Together this allows for a more agile way of updating data transformation workflows.
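The row- and column-based comparison idea can be sketched in a few lines. This is a hypothetical illustration, not Diftong itself: the statistics chosen (row count, distinct count, mean) and the table representation are invented for the example.

```python
# Compare two versions of a tabular dataset with simple summary statistics,
# flagging columns whose statistics drifted between workflow versions.

def column_stats(rows, column):
    values = [r[column] for r in rows if r[column] is not None]
    numeric = values and isinstance(values[0], (int, float))
    return {
        "count": len(values),
        "distinct": len(set(values)),
        "mean": sum(values) / len(values) if numeric else None,
    }

def compare_tables(old_rows, new_rows, columns):
    report = {"row_count_delta": len(new_rows) - len(old_rows)}
    for col in columns:
        old, new = column_stats(old_rows, col), column_stats(new_rows, col)
        if old != new:                  # only report columns that changed
            report[col] = {"old": old, "new": new}
    return report

old = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
new = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]
print(compare_tables(old, new, ["id", "amount"]))
```

A changed `amount` column is flagged while the unchanged `id` column is not, which is the kind of signal a validation tool can surface for review.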
ISBN (Print): 9781728111414
Many large-scale applications in various domains are generating big data, which is increasingly processed and analyzed by MapReduce-based workflows deployed in Hadoop systems. In addition to computing time, the makespan of such data-intensive workflows is also largely affected by communication cost. In particular, there are two levels of data movement during the execution of distributed workflows in Hadoop: i) from map tasks to reduce tasks within each individual MapReduce module and ii) between each pair of adjacent modules in the workflow. Traditionally, these two aspects of network traffic have been treated separately as data locality at the task and module (or job) level, respectively. However, the interactions between these two levels of data movement may create complicated dynamics, and their compound effects remain largely unexplored. In this paper, we formulate a task scheduling problem that considers data movement at both levels to minimize the end-to-end delay of a MapReduce-based workflow. We show this problem to be NP-complete and design a storage-aware big data workflow scheduling algorithm, referred to as SA-BWS, to optimize workflow performance in Hadoop environments. The performance superiority of SA-BWS is illustrated by extensive simulations in comparison with the default workflow engine in Hadoop and existing scheduling methods.
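The two levels of data movement described above can be made concrete with a toy placement decision. This is an invented illustration, not the SA-BWS algorithm: node names and byte counts are made up, and the heuristic simply picks the node minimizing total remote fetches at both levels.

```python
# For a reduce task, count the data it must pull over the network if placed
# on a given node: map outputs on other nodes (level i) plus the upstream
# module's output if it lives elsewhere (level ii).

def fetch_cost(node, map_outputs, upstream_output):
    cost = sum(size for n, size in map_outputs if n != node)         # level i
    cost += upstream_output[1] if upstream_output[0] != node else 0  # level ii
    return cost

def place_task(candidates, map_outputs, upstream_output):
    return min(candidates,
               key=lambda n: fetch_cost(n, map_outputs, upstream_output))

# (node holding the partition, bytes); upstream module output on "n1"
maps = [("n1", 400), ("n2", 100), ("n3", 100)]
upstream = ("n1", 300)
print(place_task(["n1", "n2", "n3"], maps, upstream))  # -> n1
```

Considering only one level would weigh map outputs or the upstream module's output alone; accounting for both can change which node wins, which is the compound effect the paper targets.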
ISBN (Print): 9781479999255
Workflow makespan is the total execution time for running a workflow in the Cloud. The makespan depends significantly on how the workflow tasks and datasets are allocated and placed in a distributed computing environment such as Clouds. Incorporating data and task allocation strategies to minimize makespan delivers significant benefits to scientific users in receiving their results in time. The main goal of a task placement algorithm is to minimize the total amount of data movement between virtual machines during the execution of the workflows. In this paper, we: 1) formalize the task placement problem in big data workflows; 2) propose a task placement strategy (TPS) that considers both initial input datasets and intermediate datasets to calculate the dependency between workflow tasks; and 3) perform extensive experiments in a distributed environment to demonstrate that the proposed strategy provides an effective task distribution and placement tool.
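The dependency idea can be sketched as follows. This is a hypothetical example in the spirit of a dependency-driven placement, not the paper's TPS: task names, dataset sizes, and the greedy co-location rule are invented.

```python
from itertools import combinations

task_inputs = {                 # task -> {dataset: size in GB}
    "t1": {"d_in": 8, "d_mid": 4},
    "t2": {"d_mid": 4, "d_out": 2},
    "t3": {"d_in": 8},
}

def dependency(a, b):
    """Total size of the datasets two tasks share (initial + intermediate)."""
    shared = task_inputs[a].keys() & task_inputs[b].keys()
    return sum(task_inputs[a][d] for d in shared)

# Greedily co-locate the most data-dependent pairs on the same VM.
pairs = sorted(combinations(task_inputs, 2),
               key=lambda p: dependency(*p), reverse=True)
placement, vms = {}, {0: set(), 1: set()}
for a, b in pairs:
    vm = min(vms, key=lambda v: len(vms[v]))   # naive load balancing
    for t in (a, b):
        placement.setdefault(t, vm)
        vms[placement[t]].add(t)
print(placement)
```

The heaviest pair (`t1`, `t3`, sharing an 8 GB input) ends up on one VM, so that dataset never crosses the network between them.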
ISBN (Print): 9783031425073; 9783031425080
The objective of this work is to offer a workflow enabling both empirical and analytical studies of enzyme kinetics. For this purpose, on the one hand, we build on a series of experimental studies involving the traditional methods and techniques used when studying biochemical reactions and designing electrochemical biosensors: conductance research, spectroscopy, and electromagnetic field study. On the other hand, when studying enzyme kinetics analytically, we employ the Michaelis-Menten approach to model enzyme-substrate-inhibitor interactions and extend it to multi-substrate, multi-inhibitor complexes. Support for a traditional big data workflow is provided through the meta-analysis facilities of existing repositories of biochemical studies hosted on the BRENDA platform.
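The analytical side rests on the classical Michaelis-Menten rate law; a minimal sketch, extended with a single competitive inhibitor, is shown below. The parameter values are illustrative only, not drawn from any experiment, and the multi-substrate, multi-inhibitor extension the abstract mentions is not reproduced here.

```python
def mm_rate(s, vmax, km, i=0.0, ki=float("inf")):
    """Reaction velocity v = Vmax*[S] / (Km*(1 + [I]/Ki) + [S]).

    With no inhibitor (i = 0) this reduces to plain Michaelis-Menten;
    a competitive inhibitor raises the apparent Km by (1 + [I]/Ki).
    """
    return vmax * s / (km * (1.0 + i / ki) + s)

v0 = mm_rate(s=2.0, vmax=10.0, km=2.0)                 # [S] = Km -> v = Vmax/2
v1 = mm_rate(s=2.0, vmax=10.0, km=2.0, i=2.0, ki=2.0)  # [I] = Ki doubles Km
print(v0, v1)
```

At [S] = Km the uninhibited velocity is exactly half of Vmax, and adding inhibitor at [I] = Ki lowers it further, matching the competitive-inhibition picture.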
In the big data era, workflow systems must embrace data-parallel computing techniques for efficient data analysis and analytics. Here, an easy-to-use, scalable approach is presented to build and execute big data applications using actor-oriented modeling in data-parallel computing. Two bioinformatics use cases for next-generation sequencing data analysis demonstrate the approach's feasibility.
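The data-parallel pattern behind this kind of approach can be sketched simply: the same pure function (playing the role of an actor) is applied to partitions of the input concurrently, and the partial results are reduced. This is a hypothetical toy, not the paper's system; the GC-counting task and the reads are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def count_gc(partition):
    """Count G/C bases in one partition of sequencing reads."""
    return sum(read.count("G") + read.count("C") for read in partition)

reads = ["ACGT", "GGCC", "ATAT", "CGCG"]
partitions = [reads[:2], reads[2:]]           # split input into partitions

with ThreadPoolExecutor() as pool:            # apply the same task per partition
    total = sum(pool.map(count_gc, partitions))
print(total)  # -> 10
```

Because the per-partition computation is independent, the same model scales out by adding partitions and workers without changing the task itself.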
Big data workflow management systems (BDWMSs) have recently emerged as popular platforms to conduct large-scale data analytics in the cloud. However, the protection of data confidentiality and secure execution of workflow applications remain important and challenging problems. Although a few data analytics systems, such as VC3 and Opaque, were developed to address security problems, they are limited to specific domains such as MapReduce-style and SQL-query workflows. A generic secure framework for BDWMSs is still missing. In this article, we propose SecdataVIEW, a distributed BDWMS that employs heterogeneous workers, such as Intel SGX and AMD SEV, to protect the execution of both workflows and workflow data, addressing three major security challenges: (1) reducing the TCB size of the big data workflow management system in the untrusted cloud by leveraging hardware-assisted TEEs and software attestation; (2) supporting Java-written workflow tasks to overcome SGX's lack of support for Java programs; and (3) reducing the adverse impact of SGX enclave memory-paging overhead through a "hybrid" workflow task scheduling system that selectively deploys sensitive tasks to a mix of SGX and SEV worker nodes. Our experimental results show that SecdataVIEW imposes moderate overhead on workflow execution time.
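The hybrid-scheduling trade-off between SGX's limited enclave memory and SEV's whole-VM encryption can be sketched with a toy routing rule. This is an invented heuristic, not SecdataVIEW's actual scheduler: the memory budget, task attributes, and worker labels are all assumptions for illustration.

```python
EPC_BUDGET_MB = 90   # assumed usable SGX enclave page cache, illustrative only

def pick_worker(task):
    """Route a task to a worker class by sensitivity and memory footprint."""
    if not task["sensitive"]:
        return "plain"                    # no TEE needed
    if task["mem_mb"] <= EPC_BUDGET_MB:
        return "sgx"                      # fits the enclave: no paging penalty
    return "sev"                          # large working set: avoid EPC paging

tasks = [
    {"name": "parse", "sensitive": False, "mem_mb": 50},
    {"name": "join",  "sensitive": True,  "mem_mb": 40},
    {"name": "train", "sensitive": True,  "mem_mb": 4000},
]
print({t["name"]: pick_worker(t) for t in tasks})
```

The rule captures the reported overhead profile: small sensitive tasks take the higher but bounded SGX cost, while memory-hungry sensitive tasks run under SEV, whose overhead stays low because it avoids enclave paging.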