The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays, which are typically very large. The SIA consists of a domain-specific programming language, the Super Instruction Assembly Language (SIAL), and its runtime system, the Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedups ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the "programming-with-blocks" approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.
The introduction of various cryptographic modes of operation was motivated by known imperfections of symmetric block algorithms. The design of some modes of operation has already been exploited to parallelize the execution of certain algorithms. To the best of our knowledge, there is no evidence in the available literature that output feedback (OFB) mode, which is used in satellite communications, has ever been parallelized. In this paper, we consider the performance of a convenient mode of operation that performs tweakable parallel encryption using xor-encrypt-xor (XEX) and xor-encrypt (XE) constructions in an OFB-like mode. We use an idea similar to XTS-AES to create two parallel tweakable block ciphers. The first is designed using the XEX construction, while the second is based on the XE construction. Each cipher uses two threads to produce the corresponding keystreams. The keystreams are first merged with each other and then used in a modified tweakable parallel OFB mode of operation. As a proof of concept, we implemented a Java application in which these parallel solutions are applied to collect empirical data. The results show that, under certain conditions, tweakable parallel OFB modes using the XEX and XE constructions can achieve performance accelerations of up to 10% and 20%, respectively. Copyright (c) 2015 John Wiley & Sons, Ltd
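For readers unfamiliar with the two constructions named in this abstract, the following is a minimal Java sketch of the textbook XEX and XE building blocks over AES. It is not the authors' implementation: the class name `XexSketch`, the helper names, the use of a single key for both the tweak and the data (XTS-AES uses a separate tweak key), and the big-endian bit order in the field doubling are all assumptions made for illustration. It does not attempt the two-thread keystream merging or the modified OFB mode described in the abstract.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical illustration of the XEX and XE constructions; not the paper's code.
final class XexSketch {
    // Encrypt or decrypt exactly one 16-byte block with raw AES (ECB, no padding).
    private static byte[] aes(int mode, byte[] key, byte[] block) {
        try {
            Cipher c = Cipher.getInstance("AES/ECB/NoPadding");
            c.init(mode, new SecretKeySpec(key, "AES"));
            return c.doFinal(block);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Multiply a 16-byte value by x in GF(2^128) modulo x^128 + x^7 + x^2 + x + 1.
    // Assumption: big-endian bit order (byte 0 holds the highest-order bits).
    static byte[] doubleBlock(byte[] in) {
        byte[] out = new byte[16];
        int carry = 0;
        for (int i = 15; i >= 0; i--) {
            int v = in[i] & 0xFF;
            out[i] = (byte) ((v << 1) | carry);
            carry = v >>> 7;
        }
        if (carry == 1) {
            out[15] ^= (byte) 0x87; // reduce by the field polynomial
        }
        return out;
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = (byte) (a[i] ^ b[i]);
        }
        return out;
    }

    // Mask for block number `index`: E_K(tweak) doubled `index` times.
    static byte[] delta(byte[] key, byte[] tweak, int index) {
        byte[] d = aes(Cipher.ENCRYPT_MODE, key, tweak);
        for (int i = 0; i < index; i++) {
            d = doubleBlock(d);
        }
        return d;
    }

    // XEX: mask the input, encrypt, then mask the output.
    static byte[] xexEncrypt(byte[] key, byte[] tweak, int index, byte[] block) {
        byte[] d = delta(key, tweak, index);
        return xor(aes(Cipher.ENCRYPT_MODE, key, xor(block, d)), d);
    }

    static byte[] xexDecrypt(byte[] key, byte[] tweak, int index, byte[] block) {
        byte[] d = delta(key, tweak, index);
        return xor(aes(Cipher.DECRYPT_MODE, key, xor(block, d)), d);
    }

    // XE: mask the input and encrypt; the output is left unmasked.
    static byte[] xeEncrypt(byte[] key, byte[] tweak, int index, byte[] block) {
        return aes(Cipher.ENCRYPT_MODE, key, xor(block, delta(key, tweak, index)));
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];   // all-zero demo key, for illustration only
        byte[] tweak = new byte[16];
        byte[] msg = new byte[16];
        msg[0] = 42;
        byte[] ct = xexEncrypt(key, tweak, 1, msg);
        System.out.println(Arrays.equals(xexDecrypt(key, tweak, 1, ct), msg));
    }
}
```

The only difference between the two constructions in this sketch is the final output mask, so XE saves one XOR per block at the cost of the weaker security notion usually associated with it.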
Growing system sizes and complexity, along with the large amount of data provided by phasor measurement units (PMUs), drive the need for accurate state estimation algorithms for online monitoring and operation of power systems. In this paper, a distributed weighted-least-square state estimation method using an additive Schwarz domain decomposition technique is proposed to reduce the computational execution time. The proposed approach divides a data set into several subsets to be processed in parallel on a multiprocessor architecture, taking into account data exchange among distributed areas. The slow coherency method and balanced partitioning are used to reduce the communication overhead and increase accuracy. Moreover, bad data analysis is also investigated in a distributed manner. The performance of the proposed distributed state estimator, along with the speed-up for several test systems, was compared with the traditional centralized state estimator. The simulation results show a speed-up of 6.5 for a 4992-bus system.
Data parallel programming provides an accessible model for exploiting the power of parallel computing elements without resorting to the explicit use of low level programming techniques based on locks, threads and moni...
In "Process placement in multicore clusters: Algorithmic issues and practical techniques," Jeannot, Mercier, and Tessier presented an algorithm called TREEMATCH for determining the best placement of a set of communicating processes on a hierarchically structured computing architecture, described by a tree. In order to speed up the algorithm, it was suggested to decompose levels of the tree with high arity into several levels of smaller arity. The authors conjectured what the optimal strategy for decomposition is. In this contribution, we prove that their conjecture was right.
With a large variety and complexity of existing HPC machines and uncertainty regarding exact future Exascale hardware, it is not clear whether existing parallel scientific codes will perform well on future Exascale systems: they may have to be substantially modified or even completely rewritten from scratch. Therefore, it is now important to ensure that software is ready for Exascale computing and will utilize all Exascale resources well. Many parallel programming models try to take into account all possible hardware features and nuances. However, the HPC community does not yet have a precise answer as to whether Exascale computing calls for a natural evolution of existing models that interoperate with each other or for a disruptive approach. Here, we focus on the first option, particularly on a practical assessment of how some parallel programming models can coexist with each other. This work describes two API combination scenarios using the example of iPIC3D [26], an implicit Particle-in-Cell code for space weather applications written in C++ with MPI plus OpenMP. The first scenario is to enable multiple OpenMP threads to call MPI functions simultaneously, with no restrictions, using the MPI_THREAD_MULTIPLE thread safety level. The second scenario is to use the OpenMP tasking model on top of the first scenario. The paper reports a step-by-step methodology and experience with these API combinations in iPIC3D; provides scaling tests for these implementations on up to 2048 physical cores; discusses the interoperability issues that occurred; and provides suggestions to programmers and scientists who may adopt these API combinations in their own codes.
ISBN: 9781509030071 (print)
In this paper we present a correctness proof for the Infosoft e-Detailing 1.0 presentation software using the Isabelle proof assistant. This work illustrates a method of proving the correctness of parallel software using proof assistants. Here we concentrate on the state-based approach to proving a safety property. We also compare this approach with the correctness proof method that was applied to this system in previous work. The task, the rationale, the details of the proof, and a comparative analysis of the approaches are described in this paper.
The overwhelming wealth of parallelism exposed by Extreme-scale computing is rekindling interest in fine-grain multithreading, particularly at the intranode level. Indeed, popular parallel programming models, such as OpenMP, are integrating fine-grain tasking in their newest standards. Yet, classical coarse-grain constructs are still largely preferred, as they are considered a simpler way to express parallelism. In this paper, we present a Multigrain parallel programming environment that allows programmers to use these well-known coarse-grain constructs to generate a fine-grain multithreaded application to be run on top of a fine-grain event-driven program execution model. Experimental results with four scientific benchmarks (Graph500, NAS Data Cube, NWChem-SCF, and ExMatEx's CoMD) show that fine-grain applications generated by and run on our environment are competitive with, and even outperform, their OpenMP counterparts, especially for data-intensive workloads with irregular and dynamic parallelism, reaching speedups as high as 2.6x for Graph500 and 50x for NAS Data Cube.
Current monitor-based systems have some disadvantages for multi-object operations. They require programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, and (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory systems eliminate the need for explicit locks, but do not support conditional synchronization. They also require the ability to roll back transactions. In this paper, we propose new monitor-based methods that provide automatic signaling for global conditions that span multiple objects. Assuming that the global condition is a Boolean expression of local predicates, our method allows efficient monitoring of the conditions without any need for global locks. Furthermore, our system solves the monitor composition problem without requiring global locks. We have implemented our constructs on top of Java and have evaluated their overhead. Our results show that, in most of the test cases, our code is not only simpler but also faster than Java's reentrant lock as well as the Deuce transactional memory system.
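To make the three disadvantages listed in this abstract concrete, here is a minimal Java sketch of the conventional approach it argues against: a condition spanning two values is guarded by one global lock, and every writer must remember to signal by hand. This illustrates the baseline only, not the automatic-signaling constructs proposed in the paper; the class `GlobalConditionDemo` and all of its members are hypothetical names chosen for this example.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical baseline: a condition over two counters, using one global lock
// and explicit signaling. Not the constructs proposed in the paper.
final class GlobalConditionDemo {
    private final ReentrantLock globalLock = new ReentrantLock();
    private final Condition changed = globalLock.newCondition();
    private int a = 0;
    private int b = 0;

    void addToA(int amount) {
        globalLock.lock();
        try {
            a += amount;
            changed.signalAll(); // the programmer must remember to signal here
        } finally {
            globalLock.unlock();
        }
    }

    void addToB(int amount) {
        globalLock.lock();
        try {
            b += amount;
            changed.signalAll(); // ...and here, for every writer
        } finally {
            globalLock.unlock();
        }
    }

    // Blocks until a + b >= threshold, then returns the sum that was observed.
    int awaitSumAtLeast(int threshold) throws InterruptedException {
        globalLock.lock();
        try {
            while (a + b < threshold) {
                changed.await(); // releases the lock while waiting
            }
            return a + b;
        } finally {
            globalLock.unlock();
        }
    }

    // Two writer threads contribute 3 and 4; the caller waits for a sum of 7.
    static int runDemo() {
        GlobalConditionDemo demo = new GlobalConditionDemo();
        Thread first = new Thread(() -> demo.addToA(3));
        Thread second = new Thread(() -> demo.addToB(4));
        first.start();
        second.start();
        try {
            int seen = demo.awaitSumAtLeast(7);
            first.join();
            second.join();
            return seen;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints 7
    }
}
```

Because both counters share `globalLock`, unrelated updates to `a` and `b` serialize against each other, and forgetting a single `signalAll()` call would leave the waiter blocked indefinitely. These are the costs that the abstract says automatic signaling is intended to remove.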
In the current study, we examined how the saccadic system responds when visual information changes dynamically in our environment. Previous studies, using the double-step task, have shown that (a) saccade plans could overlap, such that saccade preparation to an object started even while the saccade preparation to another object was ongoing, and (b) saccade plans could be cancelled before they were completed. In these studies, saccade targets were restricted to a few, experimenter-defined locations. Here, we examined whether saccade plan overlap and cancellation mechanisms could be observed in free-viewing conditions. For each trial, we constructed sets of two images, each containing five objects. All objects had unique positions. Image 1 was presented for several fixations, before Image 2 was presented during a fixation, presumably while a saccade plan to an object in Image 1 was ongoing. There were two crucial findings: (a) the saccade immediately following the transition was sometimes executed towards objects in Image 2, and not an object in Image 1, suggesting that the earlier saccade plan to an Image 1 object had been cancelled; (b) analysis of the temporal data also suggested that preparation of the first post-transition saccade started before an earlier saccade plan to an Image 1 object was executed, implying that saccade plans overlapped. (C) 2016 Elsevier Ltd. All rights reserved.