ISBN: (Print) 9783030576752; 9783030576745
Current scientific workflows are large and complex. They normally perform thousands of simulations whose results, combined with search and data-analytics algorithms to infer new knowledge, generate very large amounts of data. To this end, workflows comprise many tasks, and some of them may fail. Most of the work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some failures can be caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault-tolerance mechanisms are not sufficient to achieve a successful workflow execution. In these cases, developers have to add code to their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-level failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
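The abstract above describes letting tasks declare how application-level failures should be handled. A minimal sketch of that idea, with entirely hypothetical names (this is not the paper's actual interface), could look like a decorator that attaches a per-task failure policy:

```python
import functools

# Hypothetical sketch of an application-level failure policy for
# task-based workflows (names are illustrative, not a real API).
# A task declares what the runtime should do when it raises:
# "retry" re-executes it a bounded number of times, "ignore"
# substitutes a default result so successor tasks can proceed.

def task(on_failure="retry", max_retries=2, default=None):
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            attempts = 0
            while True:
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    attempts += 1
                    if on_failure == "retry" and attempts <= max_retries:
                        continue  # resubmit the failed computation
                    if on_failure == "ignore":
                        return default  # let the workflow continue
                    raise  # propagate: dependent tasks are cancelled

        return wrapper
    return decorate

@task(on_failure="ignore", default=float("nan"))
def simulate(param):
    if param < 0:
        raise ValueError("algorithm does not converge")  # application failure
    return param ** 0.5

results = [simulate(p) for p in (4.0, -1.0, 9.0)]
```

The key point the paper makes is that this policy lives in the interface, not scattered through application code; the sketch only illustrates the shape of such an interface.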
ISBN: (Print) 9781728161273
An essential ingredient of the smart grid is software-based services. Increasingly, software is used to support control strategies and services that are critical to the grid's operation; therefore, its correct operation is essential. For various reasons, software and its configuration need to be updated. This update process represents a significant overhead for smart grid operators, and failures can result in financial losses and grid instabilities. In this paper, we present a framework for determining the root causes of software rollout failures in the smart grid. It uses distributed sensors that indicate potential issues, such as anomalous grid states and cyber-attacks, and a causal inference engine based on a formalism called evidential networks. The aim of the framework is to support an adaptive approach to software rollouts, ensuring that a campaign completes in a timely and secure manner. The framework is evaluated on a software rollout use case in a low-voltage distribution grid. Experimental results indicate it can successfully discriminate between different root causes of failure, supporting an adaptive rollout strategy.
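The inference step described above, combining distributed sensor indications into a root-cause estimate, can be sketched as a toy Bayesian evidence combination. The evidential-network formalism the paper uses is richer than this; the sensor names and probabilities below are invented for illustration:

```python
# Toy sketch of discriminating root causes of a rollout failure from
# sensor evidence. Illustrative only: a naive Bayesian combination,
# not the paper's evidential-network formalism.

# P(sensor fires | root cause) for two hypothetical sensors.
likelihood = {
    "cyber_attack": {"anomalous_traffic": 0.9, "grid_instability": 0.3},
    "bad_config":   {"anomalous_traffic": 0.1, "grid_instability": 0.8},
}
prior = {"cyber_attack": 0.5, "bad_config": 0.5}

def posterior(observed):
    """Combine independent sensor observations via Bayes' rule."""
    scores = {}
    for cause, p in prior.items():
        for sensor, fired in observed.items():
            l = likelihood[cause][sensor]
            p *= l if fired else (1.0 - l)
        scores[cause] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Grid instability observed, but no anomalous traffic:
post = posterior({"anomalous_traffic": False, "grid_instability": True})
```

With this evidence pattern the misconfiguration hypothesis dominates, which is the kind of discrimination between failure root causes the framework aims to automate.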
ISBN: (Print) 9781450397612
Generative Adversarial Networks (GANs) are widely used for data augmentation, which facilitates the development of more accurate detection models for rare or unbalanced datasets. Computer-assisted diagnostic methods can be made more reliable by using synthetic images generated by GANs. GANs are challenging to train because highly unpredictable training dynamics, such as mode collapse and vanishing gradients, may occur throughout the learning process. For accurate and faster results, the GAN needs to be trained in a parallel and distributed manner. We enhance the speed and precision of the Deep Convolutional Generative Adversarial Network (DCGAN) architecture by exploiting its parallelism and executing it on high-performance computing platforms. We analyze a DCGAN on Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU) platforms, examining the execution pattern of each layer and identifying the bottleneck of the GAN structure on each execution platform. The Central Processing Unit (CPU) is capable of processing neural network models, but it requires a great deal of time to do so. GPUs, in contrast, are a hundred times faster than CPUs for neural networks, but they are considerably more expensive. Using its systolic array structure, the TPU performs well on neural networks with large batch sizes, but for GANs the switching overhead between the CPU and the TPU is large, so it does not perform well.
ISBN: (Print) 9781728165820
The proceedings contain 67 papers. The topics discussed include: windsurfing with APPA: automating computational fluid dynamics simulations of wind flow using cloud computing;parallel comparison of huge DNA sequences in multiple GPUs with block pruning;accelerating deep learning using multiple GPUs and FPGA-based 10GbE switch;adaptive load balancing based on machine learning for iterative parallel applications;switching at flit level: a congestion efficient flow control strategy for network-on-chip;robustness and energy-elasticity of crown schedules for sets of parallelizable tasks on many-core systems with DVFS;and scalable parallel genetic algorithm for solving large integer linear programming models derived from behavioral synthesis.
Distributed generators are typically interfaced to the grid via power electronic converters that are usually operated in pulse width modulation mode. This results in undesirable higher-order harmonics in currents and ...
ISBN: (Print) 9781665408790
The major use of a distributed block storage system integrated with a cloud computing platform is to provide storage for VM (virtual machine) instances. Traditional desktop and server applications tend to be written with small I/O being dominant and with limited parallelism. Hence, the performance of block storage serving these applications once migrated to the cloud is largely determined by small-I/O latency. This paper presents IndigoStore, an optimized Ceph backend implementing cloud-scale block storage that provides virtual disks for cloud VMs. The design of IndigoStore aims to optimize Ceph's BlueStore backend, the state-of-the-art distributed storage backend, to reduce both the average and tail latency of small I/O while not wasting disk bandwidth on large I/O. We use both microbenchmarks and our production workloads to demonstrate that IndigoStore achieves 29%–44% lower average latency and up to 1.23× lower 99.99th-percentile tail latency than BlueStore, without any notable negative effects on other performance metrics.
ISBN: (Print) 9781728147161
The increased usage of IoT, containerization, and multiple clouds not only changed the way IT works but also the way IT Operations, i.e., the monitoring and management of IT assets, works. Monitoring a complex IT environment leads to massive amounts of heterogeneous context data, usually spread across multiple data silos, which needs to be analyzed and acted upon autonomously. However, for a holistic overview of the IT environment, context data needs to be consolidated which leads to several problems. For scalable and automated processes, it is essential to know what context is required for a given monitored resource, where the context data are originating from, and how to access them across the data silos. Therefore, we introduce the Monitoring Resource Model for the holistic management of context data. We show what context is essential for the management of monitored resources and how it can be used for context reasoning. Furthermore, we propose a multi-layered framework for IT Operations with which we present the benefits of the Monitoring Resource Model.
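The requirements named above, knowing what context a monitored resource needs, where the context data originates, and how to access it across silos, can be sketched as a small data model. The field and class names below are invented for illustration and are not the paper's actual Monitoring Resource Model:

```python
from dataclasses import dataclass, field

# Illustrative sketch of a resource-to-context mapping: each monitored
# resource records which context it requires, which data silo holds
# it, and how to reach it. Names are hypothetical.

@dataclass
class ContextSource:
    silo: str        # which data silo holds the context data
    access_uri: str  # how to access it across silos

@dataclass
class MonitoredResource:
    name: str
    required_context: dict = field(default_factory=dict)  # kind -> source

    def resolve(self, kind):
        """Return the silo/URI pair for a required context kind."""
        return self.required_context[kind]

vm = MonitoredResource("vm-42")
vm.required_context["cpu_metrics"] = ContextSource("metrics-silo", "http://metrics.local/vm-42")
vm.required_context["topology"] = ContextSource("cmdb-silo", "http://cmdb.local/vm-42")

src = vm.resolve("cpu_metrics")
```

Making this mapping explicit per resource is what enables the automated, scalable context consolidation the abstract argues for.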
We present a design and implementation of distributed sparse block grids that transparently scale from a single CPU to multi-GPU clusters. We support dynamic sparse grids as, e.g., occur in computer graphics with comp...
The traditional partial wave analysis (PWA) algorithm is designed to process data serially, which requires a large amount of memory that may exceed the capacity of a single node to store runtime data. It is therefore necessary to parallelize this algorithm in a distributed data computing framework to improve its performance. Within an existing production-level Hadoop cluster, we implement the PWA algorithm on top of Spark to process data stored on the low-level storage system HDFS. In this setting, however, sharing data through HDFS or Spark's internal data communication mechanism is extremely inefficient. To solve this problem, this paper presents an in-memory parallel computing method for the PWA algorithm. With this system, we can easily share runtime data among parallel algorithms. Owing to the data management mechanism of Alluxio, we can ensure complete data locality, remain compatible with the traditional data input/output path, and cache the most repeatedly used data in memory to improve performance.
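The caching behavior described above, keeping repeatedly used runtime data in memory so workers avoid re-reading it from the distributed file system, can be sketched with a minimal memoizing store. This only stands in for what Alluxio provides; the loader and names are illustrative:

```python
# Minimal sketch of in-memory data sharing: partitions are loaded from
# slow storage once, then served from memory on every reuse. Stands in
# for an Alluxio-style memory tier; names are hypothetical.

class InMemoryStore:
    def __init__(self, load_fn):
        self.load_fn = load_fn   # fallback loader (e.g. a read from HDFS)
        self.cache = {}
        self.loads = 0           # how many times slow storage was hit

    def get(self, partition):
        if partition not in self.cache:
            self.loads += 1
            self.cache[partition] = self.load_fn(partition)
        return self.cache[partition]

def read_from_hdfs(partition):   # stand-in for a slow distributed read
    return [partition * i for i in range(5)]

store = InMemoryStore(read_from_hdfs)
# Repeated accesses to the same runtime data, as in an iterative fit:
for _ in range(3):
    data = store.get(7)
```

In an iterative likelihood fit like PWA, where the same event data is scanned every iteration, this reuse pattern is exactly what makes an in-memory tier pay off.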
ISBN: (Print) 9781728190747
Cloud-based deep learning (DL) solutions have been widely used in applications ranging from image recognition to speech recognition. Meanwhile, as commercial software and services, such solutions have raised the need for intellectual property protection of the underlying DL models. Watermarking is the mainstream approach to address this concern, primarily by embedding pre-defined secrets in a model's training process. However, existing efforts almost exclusively focus on detecting whether a target model is pirated, without considering traitor tracing. In this paper, we present SecureMark_DL, which enables a model owner to embed a unique fingerprint for every customer within the parameters of a DL model, to extract and verify the fingerprint from a pirated model, and hence to trace the rogue customer who illegally distributed the model for profit. We demonstrate that SecureMark_DL is robust against various attacks, including fingerprint collusion and network transformation (e.g., model compression and model fine-tuning). Extensive experiments conducted on the MNIST and CIFAR10 datasets, as well as on various types of deep neural networks, show the superiority of SecureMark_DL in terms of training accuracy and robustness against various types of attacks.
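The embed/extract/verify cycle described above can be sketched with a deliberately simplistic scheme: encode each fingerprint bit in the sign of a secretly chosen weight. This is not SecureMark_DL's actual construction, which must survive compression, fine-tuning, and collusion; it only illustrates the cycle:

```python
import random

# Toy sketch of per-customer model fingerprinting: embed a bit string
# by forcing secretly chosen weights positive (bit 1) or negative
# (bit 0), then recover it from a (possibly copied) model. Real
# schemes are far more robust than this sign-encoding toy.

def embed(weights, fingerprint, secret_indices):
    marked = list(weights)
    for idx, bit in zip(secret_indices, fingerprint):
        magnitude = abs(marked[idx]) or 1e-3  # avoid a zero weight
        marked[idx] = magnitude if bit else -magnitude
    return marked

def extract(weights, secret_indices):
    return [1 if weights[idx] > 0 else 0 for idx in secret_indices]

random.seed(1)
weights = [random.uniform(-1, 1) for _ in range(100)]
secret_indices = random.sample(range(100), 8)  # owner's secret key
fingerprint = [1, 0, 1, 1, 0, 0, 1, 0]         # unique per customer

marked = embed(weights, fingerprint, secret_indices)
recovered = extract(marked, secret_indices)
```

Because each customer gets a distinct bit string, recovering it from a pirated copy identifies the leaking customer, which is the traitor-tracing capability the paper adds over plain watermarking.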