parallel and distributed computing have enabled development of much more scalable software. However, developing concurrent software requires the programmer to be aware of non-determinism, data races, and deadlocks. MP...
详细信息
ISBN:
(纸本)9781538609415
parallel and distributed computing have enabled development of much more scalable software. However, developing concurrent software requires the programmer to be aware of non-determinism, data races, and deadlocks. MPI (message passing interface) is a popular standard for writing message-oriented distributed applications. Some messages in MPI systems can be processed by one of the many machines and in many possible orders. This non-determinism can affect the result of an MPI application. The alternate results may or may not be correct. To verify MPI applications, we need to check all these possible orderings and use an application specific oracle to decide if these orderings give correct output. MPJ Express is an open source Java implementation of the MPI standard. Model checking of MPI Java programs is a challenging task due to their parallel nature. We developed a Java based model of MPJ Express, where processes are modeled as threads, and which can run unmodified MPI Java programs on a single system. This model enabled us to adapt the Java PathFinder explicit state software model checker (JPF) using a custom listener to verify our model running real MPI Java programs. The evaluation of our approach shows that model checking reveals incorrect system behavior that results in very intricate message orderings.
We present a highly scalable, parallel implementation of first-principles electron dynamics coupled with molecular dynamics (MD). By using optimized kernels, network topology aware communication, and by fully distribu...
详细信息
ISBN:
(纸本)9781509021406
We present a highly scalable, parallel implementation of first-principles electron dynamics coupled with molecular dynamics (MD). By using optimized kernels, network topology aware communication, and by fully distributing all terms in the time-dependent Kohn-Sham equation, we demonstrate unprecedented time to solution for disordered aluminum systems of 2000 atoms (22,000 electrons) and 5400 atoms (59,400 electrons), with wall clock time as low as 7.5 s per MD time step. Despite a significant amount of non-local communication required in every iteration, we achieved excellent strong scaling and sustained performance on the Sequoia Blue Gene/Qsupercomputer at LLNL. We obtained up to 59% of the theoretical sustained peak performance on 16,384 nodes and performance of 8.75 Petaflop/s (43% of theoretical peak) on the full 98,304 node machine (1,572,864 cores). Scalable explicit electron dynamics allows for the study of phenomena beyond the reach of standard first-principles MD, in particular, materials subject to strong or rapid perturbations, such as pulsed electromagnetic radiation, particle irradiation, or strong electric currents. Published by Elsevier Inc.
We consider the problem of computing the convolution of two long vectors using parallel processors in the presence of "stragglers". Stragglers refer to the small fraction of faulty or slow processors that de...
详细信息
ISBN:
(纸本)9781509040964
We consider the problem of computing the convolution of two long vectors using parallel processors in the presence of "stragglers". Stragglers refer to the small fraction of faulty or slow processors that delays the entire computation in time-critical distributedsystems. We first show that splitting the vectors into smaller pieces and using a linear code to encode these pieces provides improved resilience against stragglers than replication-based schemes under a simple, worst-case straggler analysis. We then demonstrate that under commonly used models of computation time, coding can dramatically improve the probability of finishing the computation within a target "deadline" time. As opposed to the more commonly used technique of expected computation time analysis, we quantify the exponents of the probability of failure in the limit of large deadlines. Our exponent metric captures the probability of failing to finish before a specified deadline time, i.e., the behavior of the "tail". Moreover, our technique also allows for simple closed form expressions for more general models of computation time, e.g. shifted Weibull models instead of only shifted exponentials. Thus, through this problem of coded convolution, we establish the utility of a novel asymptotic failure exponent analysis for distributedsystems.
Stream Processors (SPs) continuously transform huge volumes of input streams with a computational model that is inherently distributed, scalable, and fault-tolerant. For these reasons they are used in application envi...
详细信息
Many distributedsystems require coordination between the components involved. With the steady growth of such systems, the probability of failures increases, which necessitates scalable fault-tolerant agreement protoc...
详细信息
ISBN:
(纸本)9781450346993
Many distributedsystems require coordination between the components involved. With the steady growth of such systems, the probability of failures increases, which necessitates scalable fault-tolerant agreement protocols. The most common practical agreement protocol, for such scenarios, is leader-based atomic broadcast. In this work, we propose ALLCONCUR, a distributed system that provides agreement through a leaderless concurrent atomic broadcast algorithm, thus, not suffering from the bottleneck of a central coordinator. In ALLCONCUR, all components exchange messages concurrently through a logical overlay network that employs early termination to minimize the agreement latency. Our implementation of ALLCONCUR supports standard sockets-based TCP as well as high-performance InfiniBand Verbs communications. ALLCONCUR can handle up to 135 million requests per second and achieves 17x higher throughput than today's standard leader-based protocols, such as Libpaxos. Thus, ALLCONCUR is highly competitive with regard to existing solutions and, due to its decentralized approach, enables hitherto unattainable system designs in a variety of fields.
Highly-parallel, high-performance scientific applications must maximize performance inside of a power envelope while maintaining scalability. Emergent parallel and distributedsystems offer a growing number of operati...
详细信息
ISBN:
(纸本)9781450346993
Highly-parallel, high-performance scientific applications must maximize performance inside of a power envelope while maintaining scalability. Emergent parallel and distributedsystems offer a growing number of operating modes that provide unprecedented control of processor speed, memory latency, and memory bandwidth. Optimizing these systems for performance and power requires an understanding of the combined effects of these modes and thread concurrency on execution time. In this paper, we describe how an analytical performance model that separates pure computation time (C) and pure stall time (S) from computation-memory overlap time (0) can accurately capture these combined effects. We apply the COS model to predict the performance of thread and power mode combinations to within 7% and 17% for parallel applications (e.g. LULESH) on Intel x86 and IBM BG/Q architectures, respectively. The key insight of the COS model is that the combined effects of processor and memory throttling and concurrency on overlap trend differently than the combined effects on pure computation and pure stall time. The COS model is novel in that it enables independent approximation of overlap which leads to capabilities and accuracies that are as good or better than the best available approaches.
The proceedings contain 6 papers. The topics discussed include: impact of user’s emotion on software adaptation;experiences with data lineage metadata storing in relational and graph database;parallel itemset mining ...
ISBN:
(纸本)9788001061381
The proceedings contain 6 papers. The topics discussed include: impact of user’s emotion on software adaptation;experiences with data lineage metadata storing in relational and graph database;parallel itemset mining algorithms – an overview;framework and automated prioritization procedure for model-based testing of automotive distributedsystems;model transformations via XSLT;and analyzing musical pieces using *** tools.
Erasure codes are widely used in practical storage systems to prevent disk failure and data loss. However, these codes require excessive disk I/Os and network traffic for recovering unavailable data. As a result, the ...
详细信息
ISBN:
(纸本)9781538616796
Erasure codes are widely used in practical storage systems to prevent disk failure and data loss. However, these codes require excessive disk I/Os and network traffic for recovering unavailable data. As a result, the recovery performance of erasure codes is suboptimal. Among all erasure codes, Minimum Storage Regenerating (MSR) codes can achieve optimal repair bandwidth under the minimum storage during recovery, but some open issues remain to be addressed before applying them in real systems. In this paper, we present Hybrid Regenerating Codes (HybridRC), a new set of erasure codes with optimized recovery performance and low storage overhead. The codes utilize the superiority of MSR codes to compute a subset of data blocks while some other parity blocks are used for reliability maintenance. As a result, our design is near-optimal with respect to storage and network traffic. We show that Hybrid-RC reduces the reconstruction cost by up to 21% compared to the Local Reconstruction Codes (LRC) with the same storage overhead. Most importantly, in Hybrid-RC, each block contributes only half the amount of data when processing a single block failure. Therefore, the number of I/Os consumed per block is reduced by 50%, which is of great help to balance the network load and reduce the latency.
In bioinformatics, pair-wise alignment plays a significant role insequence comparison by rating the similarities and distances between protein, DeoxyriboNucleic Acid (DNA), and RiboNucleic Acid (RNA)sequences. Sequenc...
详细信息
ISBN:
(纸本)9781450348447
In bioinformatics, pair-wise alignment plays a significant role insequence comparison by rating the similarities and distances between protein, DeoxyriboNucleic Acid (DNA), and RiboNucleic Acid (RNA)sequences. Sequence comparison considered as a key stone in building distance matrices. Due to the rapid growth of molecular databases, the need for faster sequence comparison and alignment has become anecessity. High performance computing impacthas increased in the last decade through providing many high performance architectures and tools. In this paper we present a parallel shared memory design for a dynamic programming algorithm named Hash Table-N-Gram-Hirschberg (HT-NGH) an extension of Hashing-N-Gram-Hirschberg (HNGH) and N-Gram-Hirschberg (NGH) algorithm, to speed up the sequence alignment construction process. The focus of the proposed method ison the transformation phase of HT-NGH algorithm since it takes10% of HT-NGH overall run time. The experimental evaluation of the proposed parallel designshows an enhancement in the execution time and speedup without sacrificing the accuracy. However,the decomposition method might slightly slowdown the proposed algorithm due to the differences in performance between the processing units.
We present NICE, a key-value storage system design that leverages new software-defined network capabilities to build cluster-based network-efficient storage system. NICE presents novel techniques to co-design network ...
详细信息
ISBN:
(纸本)9781450346993
We present NICE, a key-value storage system design that leverages new software-defined network capabilities to build cluster-based network-efficient storage system. NICE presents novel techniques to co-design network routing and multicast with storage replication, consistency, and load balancing to achieve higher efficiency, performance, and scalability. We implement the NICEKV prototype. NICEKV follows the NICE approach in designing four essential network-centric storage mechanisms: request routing, replication, consistency, and load balancing. Our evaluation shows that the proposed approach brings significant performance gains compared to the current key-value systems design: up to 7x put/get performance improvement, up to 2 x reduction in network load, 3x to 9x load reduction on the storage nodes, and the elimination of scalability bottlenecks present in current designs.
暂无评论