ISBN:
(Print) 9781450348928
CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of the FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on the FPGA. This problem is even more challenging for irregular applications, in which the execution dependencies can only be determined at run time; existing works that rely on compile-time analysis to schedule the computation therefore generate over-serialized accelerators. In this work, we propose a comprehensive software-hardware co-design framework, which captures parallelism in irregular applications and aggressively schedules pipelined execution on the reconfigurable platform. Based on an inherently parallel abstraction that packages parallelism for runtime scheduling, our framework differs significantly from existing works that tend to schedule executions at compile time. An irregular application is formulated as a set of tasks, with their dependencies specified as rules describing the conditions under which a subset of tasks can be executed concurrently. Datapaths on the FPGA are then generated by transforming applications in this formulation into task pipelines orchestrated by evaluating the rules at runtime, which exploits fine-grained pipeline parallelism the way handcrafted accelerators do. Our evaluation shows that this framework produces datapaths whose quality is close to handcrafted designs. Experiments show that the generated accelerators are dramatically more efficient than those created by current high-level synthesis tools. Meanwhile, accelerators generated for a set of irregular applications attain 0.5x to 1.9x the performance of equivalent software implementations we selected on a server-grade 10-core processor, with the memory subsystem remaining the bottleneck.
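As a rough illustration of the task-and-rule formulation described above, the following C++ sketch (hypothetical names, software only, not the paper's generated hardware) models tasks whose guard rules are evaluated at run time to decide which of them may fire in each round:

```cpp
// Hypothetical software model of a rule-driven task formulation: each task
// carries a guard ("rule") over shared state and may fire whenever its rule
// evaluates to true at run time, rather than at a schedule fixed at compile time.
#include <functional>
#include <iostream>
#include <vector>

struct Task {
    std::function<bool()> rule;   // condition under which the task may fire
    std::function<void()> body;   // the actual work
};

int main() {
    int produced = 0, consumed = 0;               // shared state (illustrative)
    std::vector<Task> tasks = {
        { [&]{ return produced - consumed < 4; }, // bounded-buffer rule
          [&]{ ++produced; } },
        { [&]{ return consumed < produced; },     // data-available rule
          [&]{ ++consumed; } },
    };

    // Naive runtime scheduler: evaluate every rule each round and fire the
    // tasks that are enabled. Hardware would evaluate the rules in parallel.
    for (int round = 0; round < 10; ++round)
        for (auto& t : tasks)
            if (t.rule()) t.body();

    std::cout << "produced=" << produced << " consumed=" << consumed << "\n";
}
```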
ISBN:
(Print) 9783319633879; 9783319633862
The semantics of concurrent programs is now defined by a weak memory model, determined either by the programming language (e.g., in the case of C/C++11 or Java) or by the hardware architecture (e.g., for assembly and legacy C code). Since most work in concurrent software verification has been developed prior to weak memory consistency, it is natural to ask how these models affect formal reasoning about concurrent programs. In this overview paper, we show that verification is indeed affected: for example, the standard Owicki-Gries method is unsound under weak memory. Further, based on concurrent separation logic, we develop a number of sound program logics for fragments of the C/C++11 memory model. We show that these logics are useful not only for verifying concurrent programs, but also for explaining the weak memory constructs of C/C++.
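The classic store-buffering litmus test illustrates the point: under the C/C++11 model with relaxed atomics, both loads may return 0, an outcome that an Owicki-Gries-style proof assuming sequential consistency would rule out. A minimal C++ version:

```cpp
// Store-buffering (SB) litmus test under the C/C++11 memory model. With
// relaxed atomics, the outcome r1 == 0 && r2 == 0 is allowed, even though it
// is impossible under sequential consistency -- the kind of behaviour that
// makes the classic Owicki-Gries method unsound for weak memory.
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main() {
    std::thread a([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread b([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    a.join();
    b.join();
    // Under SC at least one of r1, r2 must be 1; C/C++11 also permits 0/0.
    std::cout << "r1=" << r1 << " r2=" << r2 << "\n";
}
```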
ISBN:
(Print) 9781509060580
Streaming environments and similar parallel platforms are widely used in image, signal, or general data processing as a means of achieving high performance. Unfortunately, they are often tied to specific programming languages and are thus hardly accessible to non-experts. In this paper, we present a framework for transforming C# procedural code into a Hybrid Flow Graph, a novel intermediate code which employs the streaming paradigm and can be further converted into a streaming application. This approach allows streaming applications, or parts of them, to be created in a widely known imperative language instead of an intricate streaming-specific language. In this paper, we focus on the transformation of control flow, which represents the main difference between procedural code, driven by control-flow constructs, and streaming environments, driven by data. Since the use of a streaming platform automatically enables parallelism and vectorization, we were able to demonstrate that the streaming applications generated by our method can outperform their original C# implementations.
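To give a flavour of the control-flow-to-dataflow idea (a simplified C++ sketch, not the paper's C# front end or Hybrid Flow Graph), an imperative loop with a branch can be recast as a chain of stream stages in which the branch becomes a filter:

```cpp
// Minimal sketch of recasting an imperative loop with a branch as data-driven
// stages: each stage consumes a stream and emits a stream, so control flow
// becomes routing between stages.
#include <iostream>
#include <vector>

// Imperative original:  for (x in input) if (x % 2 == 0) out.push(x * x);
std::vector<int> source() { return {1, 2, 3, 4, 5, 6}; }

std::vector<int> filter_even(const std::vector<int>& in) {   // branch -> filter stage
    std::vector<int> out;
    for (int x : in) if (x % 2 == 0) out.push_back(x);
    return out;
}

std::vector<int> square(const std::vector<int>& in) {        // loop body -> map stage
    std::vector<int> out;
    for (int x : in) out.push_back(x * x);
    return out;
}

int main() {
    for (int v : square(filter_even(source()))) std::cout << v << ' ';
    std::cout << '\n';   // prints: 4 16 36
}
```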
ISBN:
(Print) 9780769561493
Scaling up Artificial Intelligence (AI) algorithms for massive datasets to improve their performance is becoming crucial. In Machine Translation (MT), one of the most important research fields of AI, models based on Recurrent Neural Networks (RNN) have shown state-of-the-art performance in recent years, and many researchers keep working on improving RNN-based models to achieve better accuracy in translation tasks. Most implementations of Neural Machine Translation (NMT) models employ a padding strategy when processing a mini-batch, so that all sentences in a mini-batch have the same length. This enables efficient utilization of caches and GPU/SIMD parallelism but wastes computation time. In this paper, we implement and parallelize batch learning for a Sequence-to-Sequence (Seq2Seq) model, the most basic NMT model, without using a padding strategy. More specifically, our approach packs the vectors representing the input words, as well as the neural network's states at different time steps, into matrices when it processes one sentence; as a result, it makes better use of the cache and optimizes the weight and bias updates during the back-propagation phase. Our experimental evaluation shows that our implementation achieves better scalability on multi-core CPUs. We also discuss our approach's potential to be used in other implementations of RNN-based models.
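The following C++ sketch (illustrative only; the toy types and counting functions are not from the paper) contrasts the amount of work done with and without the padding strategy by counting the embedding rows actually pushed through the recurrent step:

```cpp
// Toy comparison: padding every sentence of a mini-batch to the batch maximum
// versus packing only each sentence's real word vectors into its matrix.
#include <algorithm>
#include <iostream>
#include <vector>

using Vec = std::vector<float>;   // one embedding / hidden state
using Mat = std::vector<Vec>;     // one row per time step of a sentence

// Rows processed when every sentence is padded to the longest one.
long rows_with_padding(const std::vector<Mat>& batch) {
    size_t max_len = 0;
    for (const auto& s : batch) max_len = std::max(max_len, s.size());
    return static_cast<long>(max_len * batch.size());   // pad rows included
}

// Rows processed when each sentence keeps only its real words.
long rows_without_padding(const std::vector<Mat>& batch) {
    long rows = 0;
    for (const auto& s : batch) rows += static_cast<long>(s.size());
    return rows;
}

int main() {
    Vec w(8, 0.1f);   // toy 8-dimensional embedding
    std::vector<Mat> batch = { Mat(3, w), Mat(12, w), Mat(5, w) };
    std::cout << "rows computed with padding:    " << rows_with_padding(batch) << "\n"
              << "rows computed without padding: " << rows_without_padding(batch) << "\n";
}
```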
With the growth of data, the need to parallelize treatments becomes crucial in numerous domains. But for non-specialists it is still difficult to tackle parallelism technicalities such as data distribution, communications, or load balancing. For the geoscience domain we propose a solution based on implicit parallel patterns. These patterns are abstract models of a class of algorithms which can be customized and automatically transformed into a parallel execution. In this paper, we describe a pattern for stencil computation and a novel pattern dealing with computation that follows a pre-defined order. Both are widely used in geosciences, and we illustrate them with the flow direction and flow accumulation computations. (C) 2017 The Authors. Published by Elsevier B.V.
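A minimal sequential sketch of the stencil pattern, assuming a hypothetical apply_stencil interface, is shown below; the per-cell kernel is a D8-style flow direction (index of the steepest downhill neighbour), while the real framework would also handle data distribution, halo exchange, and load balancing:

```cpp
// Sequential sketch of a stencil pattern: the user supplies only the per-cell
// kernel; the pattern owns the traversal (and, in a parallel framework, the
// data distribution). The kernel here picks the steepest downhill neighbour.
#include <iostream>
#include <vector>

template <typename Kernel>
std::vector<int> apply_stencil(const std::vector<float>& grid,
                               int w, int h, Kernel k) {
    std::vector<int> out(grid.size(), -1);
    for (int y = 1; y < h - 1; ++y)        // interior cells only, for brevity
        for (int x = 1; x < w - 1; ++x)
            out[y * w + x] = k(grid, w, x, y);
    return out;
}

int flow_direction(const std::vector<float>& g, int w, int x, int y) {
    static const int dx[8] = {1, 1, 0, -1, -1, -1, 0, 1};
    static const int dy[8] = {0, 1, 1, 1, 0, -1, -1, -1};
    int best = -1;
    float drop = 0.0f;
    for (int d = 0; d < 8; ++d) {
        float diff = g[y * w + x] - g[(y + dy[d]) * w + (x + dx[d])];
        if (diff > drop) { drop = diff; best = d; }
    }
    return best;   // -1 means a pit or flat cell
}

int main() {
    int w = 4, h = 4;
    std::vector<float> dem = { 9, 9, 9, 9,
                               9, 5, 6, 9,
                               9, 4, 5, 9,
                               9, 9, 9, 9 };
    auto dir = apply_stencil(dem, w, h, flow_direction);
    std::cout << "direction of cell (1,1): " << dir[1 * w + 1] << "\n";
}
```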
ISBN:
(Print) 9781450351232
We present Fortran 2018 teams (grouped processes) running a parallel ensemble of simulations built from a pre-existing Message Passing Interface (MPI) application. A challenge arises because the Fortran standard eschews any direct reference to lower-level communication substrates, such as MPI, leaving any interoperability between Fortran's parallel programming model, Coarray Fortran (CAF), and the supporting substrate to the quality of the compiler implementation. Our approach introduces CAF incrementally, a process we term "caffeination." By letting CAF initiate execution and exposing the underlying MPI communicator to the original application code, we create a one-to-one correspondence between MPI group colors and Fortran teams. We apply our approach to the National Center for Atmospheric Research (NCAR)'s Weather Research and Forecasting Hydrological Model (WRF-Hydro). The newly caffeinated main program replaces batch job submission scripts and forms teams that each execute one ensemble member. To support this work, we developed the first compiler front-end and parallel runtime library support for teams. This paper describes the required modifications to a public GNU Compiler Collection (GCC) fork, an OpenCoarrays [1] application binary interface (ABI) branch, and a WRF-Hydro branch.
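The MPI side of that one-to-one correspondence can be sketched with MPI_Comm_split (the C API used from C++; the colour choice below is illustrative, not WRF-Hydro's): each colour yields a sub-communicator that the caffeinated main program would pair with one Fortran team.

```cpp
// Sketch of grouping ranks into ensemble members with MPI_Comm_split. In the
// caffeinated application, each resulting colour corresponds to one Fortran
// 2018 team executing one ensemble member.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    const int members = 4;                // ensemble size (illustrative)
    int color = world_rank % members;     // which ensemble member this rank runs

    MPI_Comm member_comm;                 // plays the role of one team
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &member_comm);

    int team_rank;
    MPI_Comm_rank(member_comm, &team_rank);
    std::printf("world rank %d -> ensemble member %d, team rank %d\n",
                world_rank, color, team_rank);

    MPI_Comm_free(&member_comm);
    MPI_Finalize();
}
```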
ISBN:
(Digital) 9783319654829
ISBN:
(Print) 9783319654829; 9783319654812
This article presents massively parallel execution of the BLAST algorithm on supercomputers and HPC clusters using thousands of processors. Our work is based on optimally splitting up the set of queries run with the unmodified NCBI-BLAST package for sequence alignment. The work distribution and search management have been implemented in Java using the PCJ (Parallel Computing in Java) library. The PCJ-BLAST package is responsible for reading the sequences to be compared, splitting them up, and starting multiple NCBI-BLAST executables. We also investigated the problem of parallel I/O, and thanks to the PCJ library we deliver high-throughput execution of BLAST. The presented results show that using Java and the PCJ library we achieved very good performance and efficiency. As a result, we have significantly reduced the time required for sequence analysis. We have also demonstrated that the PCJ library can be used as an efficient tool for the fast development of scalable applications.
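A simplified sketch of the splitting idea, written here in C++ rather than Java and with hypothetical names, distributes the query set round-robin across workers; in the described system each worker would then launch an unmodified NCBI-BLAST executable on its own chunk:

```cpp
// Toy query-splitting sketch (not PCJ-BLAST itself): divide the query set
// into per-worker chunks; each worker would run blast on its chunk.
#include <iostream>
#include <string>
#include <vector>

std::vector<std::vector<std::string>>
split_queries(const std::vector<std::string>& queries, int workers) {
    std::vector<std::vector<std::string>> chunks(workers);
    for (size_t i = 0; i < queries.size(); ++i)
        chunks[i % workers].push_back(queries[i]);   // round-robin distribution
    return chunks;
}

int main() {
    std::vector<std::string> queries = {"q1.fa", "q2.fa", "q3.fa", "q4.fa", "q5.fa"};
    auto chunks = split_queries(queries, 2);
    for (size_t w = 0; w < chunks.size(); ++w) {
        std::cout << "worker " << w << ": ";
        // In the real system this is where an NCBI-BLAST process would be
        // started on the worker's chunk of queries.
        for (const auto& q : chunks[w]) std::cout << q << ' ';
        std::cout << '\n';
    }
}
```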
ISBN:
(Print) 9781538639788
The brute-force algorithm needs a large amount of computational resources. CUDA is one of the computing platforms suitable for supporting this algorithm. In this paper, we discuss five factors that may affect the performance of a GPU-based parallel program. We constructed custom and testbed algorithms to evaluate those factors. The testbed algorithms were based on a previous thesis on PDF password cracking. The final algorithm was constructed from the significantly affecting factors. All parallel algorithms were implemented on a Tesla C2075. The speedups of the final algorithm implementation are 2.92 for 2-byte alphanumeric passwords and 4.77 for 6-byte numeric passwords.
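The heart of such a brute-force search is a mapping from a linear index to a candidate password, which lets the search space be split into disjoint index ranges, one per GPU thread in a CUDA implementation. A plain C++ sketch of that mapping (illustrative, not the thesis code):

```cpp
// Map a linear index to a candidate password via mixed-radix decoding, so the
// search space can be partitioned into disjoint index ranges across threads.
#include <iostream>
#include <string>

const std::string ALPHABET = "0123456789";   // numeric passwords, as one test case

std::string candidate_from_index(unsigned long long idx, int length) {
    std::string s(length, ALPHABET[0]);
    for (int pos = length - 1; pos >= 0; --pos) {   // least-significant digit last
        s[pos] = ALPHABET[idx % ALPHABET.size()];
        idx /= ALPHABET.size();
    }
    return s;
}

int main() {
    // Thread t of T threads would scan indices t, t + T, t + 2T, ...
    for (unsigned long long i = 0; i < 5; ++i)
        std::cout << candidate_from_index(i, 6) << "\n";   // 000000 ... 000004
}
```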
ISBN:
(Print) 9781509042432
Graphene, which can be considered an infinitely thin two-dimensional material, is a very promising optoelectronic material and has received much attention due to its outstanding electrical and optical properties. This paper describes an efficient message-passing interface (MPI) parallel implementation of the finite-difference time-domain (FDTD) algorithm for modeling an infinite graphene sheet. The algorithm, which is based on the domain decomposition approach, reduces the number of field components to be exchanged between neighboring processors compared with the conventional parallel MPI FDTD implementation. Numerical simulations are included to show the effectiveness of the proposed parallel algorithm.
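The communication pattern behind such a domain-decomposition FDTD code can be sketched as a halo exchange; the C++/MPI skeleton below assumes a 1-D decomposition of a single field component and is only meant to show which boundary values neighbouring processes trade each time step:

```cpp
// Halo-exchange skeleton for a 1-D domain decomposition: each process sends
// its first and last interior cells and receives the neighbours' values into
// its ghost cells, as an FDTD update step would require.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nlocal = 64;                     // local cells per process
    std::vector<double> ez(nlocal + 2, 0.0);   // +2 ghost cells

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    // One time step's exchange (tags are arbitrary but must match).
    MPI_Sendrecv(&ez[1],          1, MPI_DOUBLE, left,  0,
                 &ez[nlocal + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&ez[nlocal],     1, MPI_DOUBLE, right, 1,
                 &ez[0],          1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) std::printf("halo exchange complete on %d ranks\n", size);
    MPI_Finalize();
}
```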
ISBN:
(Print) 9780983567875
Herding cats can lead to coalition (of cheetahs), intrigue (of kittens), ambush (of tigers), destruction (of wild cats) or pride (of lions). In this tutorial, I will present the cat language to write consistency models as a set of constraints on the executions of concurrent programs. A cat model can be executed within the herd tool [3], which I will use during the tutorial.