ISBN: (Print) 9781538649756
Real-time data processing is an important component of particle physics experiments with large computing resource requirements. As the Large Hadron Collider (LHC) at CERN prepares for its next upgrade, the LHCb experiment is upgrading its detector for a 30x increase in data throughput. In preparation for this upgrade, the experiment is considering a number of architectural improvements encompassing both its software and hardware infrastructure. One of the hardware platforms under consideration is the Intel Xeon Phi Knights Landing processor. Thanks to its on-package high-bandwidth memory and many-core architecture, it offers an interesting alternative to more traditional server systems. We present a scalable, multi-threaded and NUMA-aware Kalman filter proto-application for particle track fitting, expressed in terms of generic parallel patterns using the GRPPI interface. We show how code maintainability and readability improve while maintaining levels of performance comparable to the baseline implementation. This is achieved by keeping the parallel algorithms in the underlying framework generic but topology-aware through the use of the Portable Hardware Locality (hwloc) library, which allows us to target different architectures with the same program. We measure the performance of our topology-aware GRPPI Kalman filter implementation on the Intel Xeon Phi Knights Landing platform and conclude on the feasibility of integrating such high-level parallelization libraries in complex software frameworks such as LHCb's Gaudi framework.
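For readers unfamiliar with the core computation, the following is a minimal sketch of one linear Kalman filter predict/update step in Python. It is purely illustrative: the matrices and function names are hypothetical placeholders, not the LHCb/GRPPI track-fitting implementation described in the abstract.

```python
# Minimal linear Kalman filter predict + update step (illustrative only;
# not the GRPPI-based LHCb implementation). All inputs are numpy arrays.
import numpy as np

def kalman_step(x, P, F, Q, H, R, z):
    """x: state estimate, P: state covariance, F: transition matrix,
    Q: process noise, H: measurement matrix, R: measurement noise, z: measurement."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

In a track fit, a step like this is applied once per detector measurement along the track, which is what makes the workload amenable to the data-parallel patterns mentioned above.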
ISBN: (Print) 9781450365239
Scheduling algorithms have a significant impact on the optimal utilization of HPC facilities. Waiting time, response time, slowdown and weighted slowdown are classical metrics used to compare the performance of different scheduling algorithms. This paper investigates the effects of four artefacts, namely non-determinism, shuffling, time shrinking and sampling, on these metrics. We present a scheduling framework based on emulation, that is, using a real scheduler (Slurm) with a sleep program able to take into account periods of suspension. The framework is able to emulate a 50K-core cluster using 10 virtualized nodes, with the scheduler running on an isolated node. We find that the non-determinism in repeatedly running a workload has a small but discernible effect on these metrics, and that shuffling job order in a workload increases this effect by a factor of 5-10. Experiments with shuffled workloads indicate that the average difference between the performance of the Backfill and Suspend-Resume strategies lies within this variation. We also propose methodologies for time shrinking and sampling to decrease the duration of emulations, while aiming to keep these metrics invariant (or linearly varying) with respect to the original workload. We find that time shrinking by a factor of up to 90% can have an effect on the metrics similar to that of non-determinism. For sampling, our methodology preserved the distribution of job sizes to a large extent, but showed a variation in the metrics somewhat greater than for shuffling. Finally, we use our framework to study Slurm's scheduling performance in depth, and discover a deficiency in the Suspend-Resume implementation.
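The metrics named above have widely used textbook definitions. A small Python sketch of how they could be computed from job records follows; the record field names are assumptions for illustration, not the paper's framework.

```python
# Sketch: classical scheduling metrics from job records.
# Each job is a dict with 'submit', 'start', 'finish' timestamps (hypothetical fields).
def scheduling_metrics(jobs):
    waits, responses, slowdowns = [], [], []
    for j in jobs:
        wait = j['start'] - j['submit']
        response = j['finish'] - j['submit']
        runtime = j['finish'] - j['start']
        waits.append(wait)
        responses.append(response)
        # Guard against division by near-zero runtimes (bounded-slowdown style).
        slowdowns.append(response / max(runtime, 1))
    n = len(jobs)
    return {
        'avg_wait': sum(waits) / n,
        'avg_response': sum(responses) / n,
        'avg_slowdown': sum(slowdowns) / n,
    }
```

Weighted slowdown would additionally scale each job's slowdown by a weight such as its core count; the exact weighting used in the paper is not reproduced here.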
ISBN: (Print) 9781450364607
A build system, which converts source code into applications, is essential for software development. General build systems that rely on a single physical or cloud host suffer from problems such as poor system security, resource shortages, overload, and low availability in the face of massive numbers of build requests. After modularizing and streamlining the steps of a build process, this paper proposes a system that introduces container technology and constructs a large-scale, real-time build system with support for high concurrency on top of Kubernetes [1]. The system provides a highly scalable and feature-stable cloud architecture that supports high concurrency with low resource consumption. It also tightly controls the behavior of programs to avoid potential security and resource issues, and shows excellent performance in concurrency, scalability, security, and load balancing even when handling a large number of build tasks.
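As a rough illustration of the pattern of running one build task per container, the sketch below submits a single build as a Kubernetes Job using the official kubernetes Python client. The image name, namespace, and build command are hypothetical; this is not the architecture from the paper.

```python
# Sketch: one containerized build task as a Kubernetes Job (illustrative only).
from kubernetes import client, config

def submit_build_job(name, repo_url, namespace="builds"):
    config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
    container = client.V1Container(
        name="builder",
        image="example/builder:latest",  # hypothetical builder image
        command=["sh", "-c", f"git clone {repo_url} src && cd src && make"],
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)
```

Isolating each build in its own short-lived container is what bounds the blast radius of misbehaving build scripts and lets the cluster scheduler absorb bursts of concurrent requests.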
Synchronization aspects of the large-scale simulation method known as parallel discrete event simulation (PDES) are analyzed using models of time-profile evolution. The time profile is formed with the local vi...
ISBN: (Print) 9783030028510
The proceedings contain 21 papers. The special focus in this conference is on Model and Data Engineering. The topics include: towards a requirements engineering approach for capturing uncertainty in cyber-physical systems environment; assessment of emerging standards for safety and security co-design on a railway case study; generation of behavior-driven development C++ tests from abstract state machine scenarios; hybrid systems and Event-B: a formal approach to signalised left-turn assist; handling reparation in incremental construction of realizable conversation protocols; analyzing a ROS-based architecture for its cross reuse in ISO26262 settings; reliability in fully probabilistic Event-B: how to bound the enabling of events; systematic construction of critical embedded systems using Event-B; component design and adaptation based on behavioral contracts; an MDA approach for the specification of relay-based diagrams; towards real-time semantics for a distributed event-based MOP language; automatic planning: from Event-B to PDDL; a problem-oriented approach to critical system design and diagnosis support; formal specification and verification of cloud resource allocation using timed Petri nets; Petri nets to Event-B: handling mathematical sequences through an ERTMS L3 case; model-based verification and testing methodology for safety-critical airborne systems; gamification and serious games based learning for early childhood in rural areas; context-based sentiment analysis: a survey; a multi-agent system-based distributed intrusion detection system for a cloud computing.
We present a complete approach to highly efficient image registration for embedded systems, covering all steps from theory to practice. An optimization-based image registration algorithm using a least-squares data term is implemented on an embedded distributed multicore digital signal processor (DSP) architecture. All relevant parts are optimized, ranging from mathematics, algorithmics, and data transfer to hardware architecture and electronic components. The optimization for the rigid alignment of two-dimensional images is performed in a multilevel Gauss-Newton minimization framework. We propose a reformulation of the necessary derivative computations, which eliminates all sparse matrix operations and allows for parallel, memory-efficient computation. The pixelwise parallelism forms an ideal starting point for our implementation on a multicore, multichip DSP architecture. The reduction of data transfer to the particular DSP chips is key for an efficient calculation. By determining worst cases for the subimages needed on each DSP, we can substantially reduce data transfer and memory requirements. This is accompanied by a sophisticated padding mechanism that eliminates pipeline hazards and speeds up the generation of the multilevel pyramid. Finally, we present a reference hardware architecture consisting of four TI C6678 DSPs with eight cores each. We show that it is possible to register high-resolution images within milliseconds on an embedded device. In our example, we register two images with 4096 x 4096 pixels within 93 ms, while off-loading the CPU by a factor of 20 and requiring 3.12 times less electrical energy.
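To recall the basic iteration underlying such a framework, the following is a generic Gauss-Newton sketch for a least-squares objective in Python. The residual and Jacobian callables, iteration count and tolerance are assumptions for illustration; the paper's reformulated, DSP-optimized derivative computation is not reproduced here.

```python
# Illustrative Gauss-Newton loop for minimizing 0.5 * ||r(p)||^2,
# iterating p <- p - (J^T J)^{-1} J^T r (not the paper's optimized kernel).
import numpy as np

def gauss_newton(p0, residual, jacobian, iters=20, tol=1e-8):
    """residual(p): residual vector, e.g. reference image minus warped image;
    jacobian(p): derivative of the residual with respect to the parameters p."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        r = residual(p)
        J = jacobian(p)
        step = np.linalg.solve(J.T @ J, J.T @ r)  # normal-equations solve
        p = p - step
        if np.linalg.norm(step) < tol:
            break
    return p
```

In a multilevel setting this loop is run coarse-to-fine, with the result of each pyramid level used as the starting guess for the next finer level.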
ISBN: (Print) 9781538655559
The proceedings contain 145 papers. The topics discussed include: user-transparent translation of machine instructions to programmable hardware;approximation algorithm for scheduling applications on hybrid multi-core machines with communications delays;large scale data centers simulation based on baseline test model;application performance on a cluster-booster system;transport-triggered soft cores;robustness of surface EMG classifiers with fixed-point decomposition on reconfigurable architecture;streaming architecture for large-scale quantized neural networks on an FPGA-based dataflow platform;high-level reliability evaluation of reconfiguration-based fault tolerance techniques;dynamic reconfiguration for real-time automotive embedded systems in fail-operational context;and rerooting trees increases opportunities for concurrent computation and results in markedly improved performance for phylogenetic inference.
We present in this paper a novel load balancing and rescheduling approach based on the concept of the Sandpile cellular automaton: a decentralized multi-agent system working in a critical state at the edge of chaos. Our goal is to provide fairness between concurrent job submissions in highly parallel and distributed environments, such as currently deployed cloud computing systems, by minimizing the slowdown of individual applications and dynamically rescheduling them to the best-suited resources. The algorithm design is validated by a number of numerical experiments showing the effectiveness and scalability of the scheme in the presence of a large number of jobs and resources, and its ability to react to dynamic changes in real time.
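To convey the sandpile principle referenced above, here is a toy Bak-Tang-Wiesenfeld style relaxation on a one-dimensional ring of nodes: any node whose load crosses a threshold sheds work to its neighbours until the system settles. The threshold and topology are arbitrary choices for illustration, not the paper's rescheduling algorithm.

```python
# Toy sandpile toppling on a ring of nodes (illustrative only).
def relax(loads, threshold=4):
    n = len(loads)
    unstable = True
    while unstable:
        unstable = False
        for i in range(n):
            if loads[i] >= threshold:
                loads[i] -= 2              # topple: shed two units of work
                loads[(i - 1) % n] += 1    # one unit to the left neighbour
                loads[(i + 1) % n] += 1    # one unit to the right neighbour
                unstable = True
    return loads

# A single overloaded node relaxes towards a balanced configuration:
print(relax([0, 1, 7, 1, 0]))  # -> [0, 3, 3, 3, 0]
```

The appeal of the scheme is that balancing decisions are purely local, so no central scheduler needs a global view of the system.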
This paper develops an offset-based response-time analysis technique for analyzing complex distributed real-time systems in which processing and communication resources use a time-partitioning strategy to isolate the operation of separate software components. Time partitioning may be provided in the processors by an ARINC 653 compliant operating system, and in the networks via the TTP communication protocol. The software components executed by the system may themselves be distributed and complex, composed of many concurrent tasks and of one or more end-to-end flows that may have end-to-end timing requirements. The developed analysis supports hierarchical scheduling, where a primary scheduler performs time partitioning into separate partitions and secondary fixed-priority schedulers dispatch the different concurrent tasks inside each partition. It also supports end-to-end flows that are either synchronized with the partition schedule or not. This is the first time that this kind of analysis has been developed. An evaluation of an improvement introduced in the analysis is discussed, and two representative case studies are described.
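For context, the classical single-processor fixed-priority response-time recurrence that offset-based, time-partitioned analyses generalize is R = C + sum_j ceil(R / T_j) * C_j over higher-priority tasks. The short iterative sketch below solves it; it is only a reminder of the baseline analysis, not the technique developed in the paper.

```python
# Classical fixed-priority response-time iteration (single resource, no offsets).
import math

def response_time(C, higher_prio, deadline):
    """C: WCET of the task under analysis; higher_prio: list of (C_j, T_j) pairs."""
    R = C
    while True:
        interference = sum(math.ceil(R / T_j) * C_j for C_j, T_j in higher_prio)
        R_next = C + interference
        if R_next == R:
            return R        # fixed point reached: worst-case response time
        if R_next > deadline:
            return None     # not schedulable within the deadline
        R = R_next
```

Offsets, partition windows and end-to-end flows make the interference terms considerably more involved, which is precisely what the paper's analysis addresses.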
ISBN: (Print) 9789811328534; 9789811328527
In recent years, the size and complexity of the datasets generated by large-scale numerical simulations on modern HPC (High Performance Computing) systems have been continuously increasing. These generated datasets can possess different formats, types, and attributes. In this work, we focus on large-scale distributed unstructured volume datasets, which are still widely used in numerical simulations across a variety of scientific and engineering fields. Although volume rendering is one of the most popular techniques for analyzing and exploring a given volume dataset, in the case of unstructured volume data the time-consuming visibility sorting becomes problematic as the data size increases. Focusing on effective volume rendering of large-scale distributed unstructured volume datasets generated in HPC environments, we opted for the well-known PBVR (Particle-based Volume Rendering) method. Although PBVR does not require any visibility sorting during the rendering process, the CPU-based approach has a well-known tradeoff between image quality and memory consumption. This is because the entire set of intermediate rendering primitives (particles) must be stored prior to the rendering processing. To reduce this memory pressure, we propose a fully parallel PBVR approach that eliminates the need to store these intermediate rendering primitives, as required by existing approaches. In the proposed method, each set of rendering primitives is directly converted into a partial image by its owning process, and the partial images are then gathered and merged by the parallel image composition library 234Compositor. We evaluated the memory cost and processing time using a real CFD simulation result, and verified the effectiveness of our proposed method compared to the existing parallel PBVR method.
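A simplified picture of the final merge step is a depth-based composition of per-process partial images. The serial reduction below illustrates the idea only; the paper relies on the 234Compositor library for the actual parallel composition, and the array layout here is an assumption.

```python
# Sketch: depth-based composition of per-process partial images (illustrative only).
import numpy as np

def composite(partials):
    """partials: list of (rgb, depth) pairs, one per process.
    rgb: H x W x 3 float array; depth: H x W float array (smaller = closer)."""
    rgb_out, depth_out = partials[0]
    rgb_out, depth_out = rgb_out.copy(), depth_out.copy()
    for rgb, depth in partials[1:]:
        closer = depth < depth_out        # pixels where this partial image wins
        rgb_out[closer] = rgb[closer]
        depth_out[closer] = depth[closer]
    return rgb_out
```

Because each process turns its local primitives directly into such a partial image, nothing but small image buffers needs to be exchanged at composition time, which is where the memory savings come from.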