Search Results - Inner Mongolia University Library

Exploiting GPUs with the Super Instruction Architecture

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2016, Vol. 44, No. 2, pp. 309-324

Authors: Jindal, Nakul; Lotrich, Victor; Deumens, Erik; Sanders, Beverly A. Affiliations: Univ Florida, Dept Comp & Informat Sci, Gainesville, FL 32611, USA; Univ Florida, Dept Chem, Gainesville, FL 32611, USA

The Super Instruction Architecture (SIA) is a parallel programming environment designed for problems in computational chemistry involving complicated expressions defined in terms of tensors. Tensors are represented by multidimensional arrays which are typically very large. The SIA consists of a domain specific programming language, Super Instruction Assembly Language (SIAL), and its runtime system, Super Instruction Processor. An important feature of SIAL is that algorithms are expressed in terms of blocks (or tiles) of multidimensional arrays rather than individual floating point numbers. In this paper, we describe how the SIA was enhanced to exploit GPUs, obtaining speedups ranging from two to nearly four for computational chemistry calculations, thus saving hours of elapsed time on large-scale computations. The results provide evidence that the "programming-with-blocks" approach embodied in the SIA will remain successful in modern, heterogeneous computing environments.

Keywords: parallel programming; Tensors; GPU; Domain specific language
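
The "programming-with-blocks" idea in the abstract above can be illustrated with a small sketch. This is plain Python, not SIAL; the function names and the default block size are assumptions made for illustration only. The point is that the outer loops range over blocks, so each iteration is a coarse-grained unit of work of the kind a runtime could hand to a GPU.

```python
# Illustrative only: block-wise matrix multiplication in plain Python.
# The names and the block size are hypothetical, not SIAL constructs.

def matmul_block(a, b, c, i0, j0, k0, bs, n):
    """Accumulate one block product into c: c[i0.., j0..] += a[i0.., k0..] * b[k0.., j0..]."""
    for i in range(i0, min(i0 + bs, n)):
        for j in range(j0, min(j0 + bs, n)):
            acc = 0.0
            for k in range(k0, min(k0 + bs, n)):
                acc += a[i][k] * b[k][j]
            c[i][j] += acc


def blocked_matmul(a, b, bs=2):
    """Multiply two square n x n matrices by looping over blocks, not elements."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    # Each (i0, j0, k0) triple names one block-level operation.
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                matmul_block(a, b, c, i0, j0, k0, bs, n)
    return c
```

Changing `bs` changes only how the work is partitioned, not the result, which is what makes the block the natural unit for scheduling.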

Tweakable parallel OFB mode of operation with delayed thread synchronization

SECURITY AND COMMUNICATION NETWORKS, 2016, Vol. 9, No. 10, pp. 1119-1131

Authors: Damjanovic, Boris; Simic, Dejan. Affiliations: Univ Belgrade, Fac Org Sci, Dept Informat Syst, Jove Ilica 154, Belgrade, Serbia; Univ Belgrade, Fac Org Sci, Dept Informat Technol, Jove Ilica 154, Belgrade, Serbia

The introduction of various cryptographic modes of operation was motivated by known imperfections of symmetric block algorithms. The design of some cryptographic modes of operation has already been exploited to parallelize the execution of certain algorithms. To the best of our knowledge, there is no evidence in the available literature that output feedback (OFB) mode, which is used in satellite communications, has ever been parallelized. In this paper, we consider the performance of a convenient mode of operation that performs tweakable parallel encryption using xor-encrypt-xor (XEX) and xor-encrypt (XE) constructions in an OFB-like mode. We use an idea similar to XTS-AES to create two parallel tweakable block ciphers. The first is designed using the XEX construction, while the second is based on the XE construction. Each cipher uses two threads to produce corresponding keystreams. The keystreams are first merged with each other and then used in a modified tweakable parallel OFB mode of operation. As a proof of concept, we have implemented a Java application in which these parallel solutions are applied to collect empirical data. The results show that, under certain conditions, tweakable parallel OFB modes using the XEX and XE constructions can achieve performance accelerations of up to 10% and 20%, respectively. Copyright (c) 2015 John Wiley & Sons, Ltd

Keywords: cryptography; parallel programming; performance analysis; AES
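
A rough sketch of the general shape described in the abstract above: two OFB-style feedback chains run on two threads and their keystream blocks are merged before being XORed with the data. This is not the authors' construction. The block function is a hash-based stand-in (not AES, not XEX or XE, and not secure), the tweak values are hypothetical, and block-wise interleaving is only one assumed way of merging, since the abstract does not say how the keystreams are combined.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # block size in bytes


def toy_block(key: bytes, tweak: bytes, block: bytes) -> bytes:
    """Stand-in for a tweakable block cipher call (NOT AES, NOT secure)."""
    return hashlib.sha256(key + tweak + block).digest()[:BLOCK]


def ofb_keystream(key: bytes, tweak: bytes, iv: bytes, nblocks: int) -> list:
    """OFB feedback: each output block is the encryption of the previous one."""
    out, state = [], iv
    for _ in range(nblocks):
        state = toy_block(key, tweak, state)
        out.append(state)
    return out


def merged_keystream(key: bytes, iv: bytes, nblocks: int) -> bytes:
    """Run two independent OFB chains on two threads and interleave their blocks."""
    half_a = (nblocks + 1) // 2
    half_b = nblocks // 2
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(ofb_keystream, key, b"tweak-A", iv, half_a)
        fut_b = pool.submit(ofb_keystream, key, b"tweak-B", iv, half_b)
        a, b = fut_a.result(), fut_b.result()
    merged = []
    for i in range(half_a):
        merged.append(a[i])
        if i < half_b:
            merged.append(b[i])
    return b"".join(merged)


def xor_stream(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing the data with the merged keystream."""
    nblocks = (len(data) + BLOCK - 1) // BLOCK
    keystream = merged_keystream(key, iv, nblocks)
    return bytes(d ^ k for d, k in zip(data, keystream))
```

Because the keystream does not depend on the plaintext, calling `xor_stream` twice with the same key and IV returns the original data. In CPython the two threads mostly illustrate the structure; the hashing work here is too small to show a real speedup.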

Parallel Domain-Decomposition-Based Distributed State Estimation for Large-Scale Power Systems

IEEE TRANSACTIONS ON INDUSTRY APPLICATIONS, 2016, Vol. 52, No. 2, pp. 1265-1269

Authors: Karimipour, Hadis; Dinavahi, Venkata. Affiliation: Univ Alberta, Edmonton, AB T6G 2V4, Canada

Growing system sizes and complexity, along with the large amount of data provided by phasor measurement units (PMUs), are the drivers for accurate state estimation algorithms for online monitoring and operation of power systems. In this paper, a distributed weighted-least-square state estimation method using an additive Schwarz domain decomposition technique is proposed to reduce the computational execution time. The proposed approach divides a data set into several subsets to be processed in parallel using a multiprocessor architecture, considering data exchange among distributed areas. The slow coherency method and balanced partitioning are utilized to reduce the communication overhead and increase accuracy. Moreover, bad data analysis is also investigated in a distributed manner. The performance of the proposed distributed state estimator, along with the speed-up for several test systems, was compared with the traditional centralized state estimator. The simulation results show a speed-up of 6.5 for a 4992-bus system.

Keywords: Bad data identification (BDI); distributed state estimation; domain decomposition; large-scale systems; parallel programming; phasor measurement units (PMUs); weighted least square (WLS)
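
For readers unfamiliar with the weighted-least-square estimator named in the abstract above, the textbook centralized formulation is sketched below. It is the standard Gauss-Newton form, not the paper's distributed variant; the symbols are assumptions of this sketch: $z$ is the measurement vector, $h(x)$ the measurement function of the state $x$, and $R$ the measurement error covariance matrix.

```latex
\begin{align}
  J(x) &= \bigl(z - h(x)\bigr)^{\top} R^{-1} \bigl(z - h(x)\bigr), \\
  H(x) &= \frac{\partial h(x)}{\partial x}, \qquad
  G(x) = H(x)^{\top} R^{-1} H(x), \\
  G\bigl(x^{(k)}\bigr)\,\Delta x^{(k)} &= H\bigl(x^{(k)}\bigr)^{\top} R^{-1} \bigl(z - h(x^{(k)})\bigr), \\
  x^{(k+1)} &= x^{(k)} + \Delta x^{(k)} .
\end{align}
```

Each iteration solves a linear system with the gain matrix $G$, which is the step a domain decomposition method would split across subdomains.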

An analysis of programmer productivity versus performance for high level data parallel programming

Concurrent Systems Engineering Series, 2011, Vol. 68, pp. 111-130

Authors: Cole, Alex; McEwan, Alistair; Singh, Satnam. Affiliations: Embedded Systems Lab., University of Leicester, United Kingdom; Microsoft Research, Cambridge, United Kingdom

Data parallel programming provides an accessible model for exploiting the power of parallel computing elements without resorting to the explicit use of low level programming techniques based on locks, threads and monitors. The emergence of Graphics Processing Units (GPUs) with hundreds or thousands of processing cores has made data parallel computing available to a wider class of programmers. GPUs can be used not only for accelerating the processing of computer graphics but also for general purpose data-parallel programming. Low level data-parallel programming languages based on the Compute Unified Device Architecture (CUDA) provide an approach for developing programs for GPUs, but these languages require explicit creation and coordination of threads and careful data layout and movement. This has created a demand for higher level programming languages and libraries which raise the abstraction level of data-parallel programming and increase programmer productivity. The Accelerator system was developed by Microsoft for writing data parallel code in a high level manner which can execute on GPUs, multicore processors using SSE3 vector instructions and FPGA chips. This paper compares the performance and development effort of the high level Accelerator system against lower level systems which are more difficult to use but may yield better results. Specifically, we compare against the NVIDIA CUDA compiler and sequential C++ code, considering both the level of abstraction in the implementation code and the execution models. We compare the performance of these systems using several case studies. For some classes of problems, Accelerator has a performance comparable to CUDA, but for others its performance is significantly reduced; however, in all cases it provides a model which is easier to use and enables greater programmer productivity. © 2011 The authors and IOS Press. All rights reserved.

Keywords: parallel programming

A Comment on "Process Placement in Multicore Clusters: Algorithmic Issues and Practical Techniques"

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2016, Vol. 27, No. 8, pp. 2475-2476

Author: Mann, Zoltan Adam. Affiliation: Budapest Univ Technol & Econ, Dept Comp Sci & Informat Theory, Budapest, Hungary

In "Process placement in multicore clusters: Algorithmic issues and practical techniques," Jeannot, Mercier, and Tessier presented an algorithm called TREEMATCH for determining the best placement of a set of communicating processes on a hierarchically structured computing architecture, described by a tree. In order to speed up the algorithm, it was suggested to decompose levels of the tree with high arity into several levels of smaller arity. The authors conjectured what the optimal strategy for decomposition is. In this contribution, we prove that their conjecture was right.

Keywords: parallel programming; high-performance computing; multicore processing

Performance Study of Multithreaded MPI and OpenMP Tasking in a Large Scientific Code

IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW)

Authors: Dana Akhmetova; Roman Iakymchuk; Orjan Ekeberg; Erwin Laure. Affiliation: Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden

With a large variety and complexity of existing HPC machines and uncertainty regarding exact future Exascale hardware, it is not clear whether existing parallel scientific codes will perform well on future Exascale systems: they can be largely modified or even completely rewritten from scratch. Therefore, it is now important to ensure that software is ready for Exascale computing and will utilize all Exascale resources well. Many parallel programming models try to take into account all possible hardware features and nuances. However, the HPC community does not yet have a precise answer as to whether, for Exascale computing, there should be a natural evolution of existing models interoperable with each other or a disruptive approach. Here, we focus on the first option, particularly on a practical assessment of how some parallel programming models can coexist with each other. This work describes two API combination scenarios using the example of iPIC3D [26], an implicit Particle-in-Cell code for space weather applications written in C++ and MPI plus OpenMP. The first scenario is to enable multiple OpenMP threads to call MPI functions simultaneously, with no restrictions, using the MPI_THREAD_MULTIPLE thread safety level. The second scenario is to utilize the OpenMP tasking model on top of the first scenario. The paper reports a step-by-step methodology and experience with these API combinations in iPIC3D; provides scaling tests for these implementations with up to 2048 physical cores; discusses the interoperability issues that occurred; and provides suggestions to programmers and scientists who may adopt these API combinations in their own codes.

Keywords: Message systems; Computational modeling; Parallel programming; Parallel processing; Hardware; Safety

A Formal Proof of Properties of a Presentation System using Isabelle

IEEE Ukraine Conference on Electrical and Computer Engineering

Authors: Taras Panchenko; Ievgen Ivanov. Affiliation: Taras Shevchenko National University of Kyiv

ISBN: (print) 9781509030071

In this paper we present a correctness proof for the Infosoft e-Detailing 1.0 presentation software using the Isabelle proof assistant. This work illustrates a method of proving the correctness of parallel software using proof assistants. Here we concentrate on the state-based approach for proving a safety property. We also compare this approach with the correctness proof method that was applied to this system in previous works. The task, the rationale, the details of the proof and a comparative analysis of the approaches are described in this paper.

Keywords: ISABELLE STORAGE RINGS; program verification; formal proof; Proof; Display systems; parallel programming

Multigrain Parallelism: Bridging Coarse-Grain Parallel Programs and Fine-Grain Event-Driven Multithreading

International Symposium on Parallel and Distributed Processing (IPDPS)

Authors: Jaime Arteaga; Stéphane Zuckerman; Guang R. Gao. Affiliations: Department of Electrical and Computer Engineering, University of Delaware, Newark, DE, USA; Department of Computer Science, Michigan Technological University, Houghton, MI, USA

The overwhelming wealth of parallelism exposed by Extreme-scale computing is rekindling interest in fine-grain multithreading, particularly at the intranode level. Indeed, popular parallel programming models, such as OpenMP, are integrating fine-grain tasking in their newest standards. Yet, classical coarse-grain constructs are still largely preferred, as they are considered simpler to express parallelism. In this paper, we present a Multigrain parallel programming environment that allows programmers to use these well-known coarse-grain constructs to generate a fine-grain multithreaded application to be run on top of a fine-grain event-driven program execution model. Experimental results with four scientific benchmarks (Graph500, NAS Data Cube, NWChem-SCF, and ExMatEx's CoMD) show that fine-grain applications generated by and run on our environment are competitive with and even outperform their OpenMP counterparts, especially for data-intensive workloads with irregular and dynamic parallelism, reaching speedups as high as 2.6x for Graph500 and 50x for NAS Data Cube.

Keywords: Runtime; Parallel processing; Copper; Parallel programming; Computational modeling; Program processors; Standards

Automatic-Signal Monitors with Multi-object Synchronization

International Symposium on Parallel and Distributed Processing (IPDPS)

Authors: Wei-Lun Hung; Vijay K. Garg. Affiliation: Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA

Current monitor-based systems have some disadvantages for multi-object operations. They require programmers to (1) manually determine the order of locking operations, (2) manually determine the points of execution where threads should signal other threads, and (3) use global locks or perform busy waiting for operations that depend upon a condition that spans multiple objects. Transactional memory systems eliminate the need for explicit locks, but do not support conditional synchronization. They also require the ability to roll back transactions. In this paper, we propose new monitor-based methods that provide automatic signaling for global conditions that span multiple objects. Assuming that the global condition is a Boolean expression of local predicates, our method allows efficient monitoring of the conditions without any need for global locks. Furthermore, our system solves the monitor composition problem without requiring global locks. We have implemented our constructs on top of Java and have evaluated their overhead. Our results show that, on most of the test cases, our code is not only simpler but also faster than Java's reentrant lock as well as the Deuce transactional memory system.

Keywords: Monitoring; Synchronization; System recovery; Instruction sets; Java; Parallel programming; Concurrent computing
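
The difference between explicit and automatic signaling mentioned in the abstract above can be sketched with a single-object monitor. This is a deliberately naive illustration in Python, not the authors' Java constructs and not their multi-object algorithm: the class and method names are hypothetical, and "automatic" here simply means that every state change wakes all waiters, each of which re-checks its own predicate.

```python
import threading
from collections import deque


class AutoSignalBuffer:
    """Bounded buffer whose callers state a predicate instead of signaling by hand."""

    def __init__(self, capacity: int):
        self._cond = threading.Condition()
        self._items = deque()
        self._capacity = capacity

    def _run_when(self, predicate, action):
        """Block until predicate() holds, run action(), then wake all waiters."""
        with self._cond:
            self._cond.wait_for(predicate)
            result = action()
            # Naive automatic signaling: callers never choose whom to notify.
            self._cond.notify_all()
            return result

    def put(self, item):
        """Append an item once there is free space."""
        self._run_when(lambda: len(self._items) < self._capacity,
                       lambda: self._items.append(item))

    def take(self):
        """Remove and return the oldest item once one is available."""
        return self._run_when(lambda: len(self._items) > 0,
                              self._items.popleft)
```

The methods `put` and `take` contain only a waiting condition and an action; no call site decides which thread to signal. Waking every waiter on every change is the simple but wasteful baseline that more selective schemes aim to improve on.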

Saccade plan overlap and cancellation during free viewing

VISION RESEARCH, 2016, Vol. 127, pp. 122-131

Authors: Wu, Esther X. W.; Chua, Fook-Kee; Yen, Shih-Cheng. Affiliations: Natl Univ Singapore, Ctr Life Sci, Singapore Inst Neurotechnol SINAPSE, 28 Med Dr05-COR, Singapore 117456, Singapore; Natl Univ Singapore, Fac Arts & Social Sci, Dept Psychol, Block AS402-079 Arts Link, Singapore 117570, Singapore; Natl Univ Singapore, Dept Elect & Comp Engn, Fac Engn, Blk E405-484 Engn Dr 3, Singapore 117576, Singapore

In the current study, we examined how the saccadic system responds when visual information changes dynamically in our environment. Previous studies, using the double-step task, have shown that (a) saccade plans could overlap, such that saccade preparation to an object started even while the saccade preparation to another object was ongoing, and (b) saccade plans could be cancelled before they were completed. In these studies, saccade targets were restricted to a few, experimenter-defined locations. Here, we examined whether saccade plan overlap and cancellation mechanisms could be observed in free-viewing conditions. For each trial, we constructed sets of two images, each containing five objects. All objects had unique positions. Image 1 was presented for several fixations, before Image 2 was presented during a fixation, presumably while a saccade plan to an object in Image 1 was ongoing. There were two crucial findings. First, the saccade immediately following the transition was sometimes executed towards objects in Image 2, and not an object in Image 1, suggesting that the earlier saccade plan to an Image 1 object had been cancelled. Second, analysis of the temporal data also suggested that preparation of the first post-transition saccade started before an earlier saccade plan to an Image 1 object was executed, implying that saccade plans overlapped. (C) 2016 Elsevier Ltd. All rights reserved.

Keywords: Eye movement; Scene transition; Free viewing; Saccade programming; Parallel programming
