ISBN:
(Print) 9781450361842
The proceedings contain 45 papers. The topics discussed include: the price of clustering in bin-packing with applications to bin-packing with delays; faster matrix multiplication via sparse decomposition; NC algorithms for computing a perfect matching, the number of perfect matchings, and a maximum flow in one-crossing-minor-free graphs; improved MPC algorithms for edit distance and Ulam distance; brief announcement: scalable diversity maximization via small-size composable core-sets; brief announcement: eccentricities via parallel set cover; dynamic algorithms for the massively parallel computation model; massively parallel computation via remote memory access; and brief announcement: ultra-fast asynchronous randomized rumor spreading.
The rising trend and advancements in machine learning have resulted in numerous applications, from computer vision and pattern recognition to providing security for hardware devices. Even though the proven a...
ISBN:
(Print) 9781538674314
In complex System-on-a-Chip (SoC) projects, the conclusion of the project depends on the functional verification phase, which takes a long time. Synchronizing distributed and heterogeneous components in a functional verification environment might not be a simple task. This work aims to present a distributed verification environment that allows the integration of heterogeneous components. In this environment, it is possible to perform the functional verification of multiple components on heterogeneous architectures in a parallel and distributed fashion. To this end, an intercommunication framework previously developed by the authors is used, based on the High Level Architecture (IEEE 1516) standard. Thus, this article also demonstrates how the proposed architecture abstracts communication and synchronization details to make the functional verification process across distributed components as straightforward as possible. As a demonstration of the developed solution, an experiment is presented with the functional verification of parallel algorithms on a GPU and on an FPGA, besides verification using a CPU.
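The coordination idea behind an HLA (IEEE 1516)-style environment can be sketched in a few lines. The following is a minimal illustrative model, not the authors' framework: each federate (component model) may only advance to a requested time once every other federate has been granted at least that time, so heterogeneous components stay synchronized.

```python
# Minimal sketch of HLA-style conservative time synchronization.
# Federate names and step sizes below are illustrative assumptions.

class Federate:
    def __init__(self, name, step):
        self.name = name
        self.step = step          # local time step of this component model
        self.time = 0.0

    def next_request(self):
        return self.time + self.step

def advance_all(federates, horizon):
    """Advance federates in lockstep: the grant is the minimum of all
    pending time-advance requests, so no federate overtakes the others."""
    trace = []
    while True:
        grant = min(f.next_request() for f in federates)
        if grant > horizon:
            break
        for f in federates:
            if f.next_request() == grant:
                f.time = grant            # this federate's request is granted
                trace.append((f.name, grant))
    return trace

# e.g. a GPU model stepping at 2.0 and an FPGA model stepping at 3.0
trace = advance_all([Federate("gpu", 2.0), Federate("fpga", 3.0)], horizon=6.0)
```

The trace interleaves the two models in non-decreasing time order, which is the property a distributed verification environment relies on when comparing outputs against a reference model.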
ISBN:
(Print) 9781538639146
Finding regions of local similarity between biological sequences is a fundamental task in computational biology. BLAST is the most widely used tool for this purpose, but it suffers from irregularities due to its heuristic nature. To achieve fast search, recent approaches construct the index from the database instead of the input query. However, database indexing introduces more challenges in the design of index structures and algorithms, especially for data access through the memory hierarchy on modern multicore processors. In this paper, based on existing heuristic algorithms, we design and develop a database-indexed BLAST with the same sensitivity as query-indexed BLAST (i.e., NCBI-BLAST). We then identify that the existing heuristic algorithms of BLAST can result in serious irregularities in database-indexed search. To eliminate these irregularities, we propose muBLASTP, which uses multiple optimizations to improve data locality and parallel efficiency for multicore architectures and multi-node systems. Experiments on a single node demonstrate up to a 5.1-fold speedup over multi-threaded NCBI BLAST. For inter-node parallelism, we achieve nearly linear scaling on up to 128 nodes and a speedup of up to 8.9-fold over mpiBLAST.
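The core idea of database indexing for seeded search can be illustrated with a toy example. This is a hedged sketch of the generic seed-lookup pattern, not muBLASTP's actual index structure; the word length and sequences are illustrative: every fixed-length word of the database is indexed once, and query words are then looked up to find candidate hits, instead of re-indexing each query.

```python
# Toy database index for seeded sequence search (illustrative only).
from collections import defaultdict

def build_index(database, w=3):
    """Map every length-w word to the (sequence id, offset) positions
    where it occurs in the database."""
    index = defaultdict(list)
    for sid, seq in enumerate(database):
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((sid, i))
    return index

def seed_hits(index, query, w=3):
    """Exact-match seeds as (query offset, sequence id, database offset)."""
    hits = []
    for q in range(len(query) - w + 1):
        for sid, d in index.get(query[q:q + w], []):
            hits.append((q, sid, d))
    return hits

db = ["MKVLAT", "GGMKVC"]   # two tiny made-up protein fragments
idx = build_index(db)
hits = seed_hits(idx, "AMKVL")
```

The irregularity the paper targets is visible even here: hits for one query word scatter across unrelated database positions, which is what hurts locality when the index is large.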
ISBN:
(Print) 9780769561493
Some newer processor architectures are no longer based on registers, in order to increase their potential for instruction-level parallelism. Instead, they expose their data paths to the compiler so that the program can directly move data values between function units using suitable instructions. Some of these architectures require a synchronous transfer of data values, while others use asynchronous transfers by buffering values. In this paper, we discuss the out-of-order execution of function units in exposed-data-path architectures with asynchronous data transfers. The execution of these function units may locally deviate from program order, analogous to the dynamic scheduling used by processors with out-of-order execution. Since our out-of-order execution only has effects inside the function units, it requires no modifications to the compiler or instruction set. We have implemented different variants on FPGAs and evaluated them for a set of application scenarios, showing that the out-of-order extension can considerably increase the performance of these architectures.
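Why buffered (asynchronous) transfers enable a local deviation from program order can be shown with a toy timing model. This is an illustration of the general principle, not the paper's hardware: an operation whose operand has arrived may fire before an earlier operation that is still waiting for its input.

```python
# Toy single-issue function unit: operands arrive at given cycles;
# compare in-order vs. out-of-order issue. Latencies are illustrative.

def completion_times(ops, latency=1, out_of_order=True):
    """ops: operand-arrival cycles, in program order.
    Returns the completion cycle of each op."""
    pending = list(enumerate(ops))        # (program index, arrival cycle)
    done, t = {}, 0
    while pending:
        ready = [p for p in pending if p[1] <= t]
        if not out_of_order and ready and ready[0] != pending[0]:
            ready = []                    # in-order: only the head may issue
        if ready:
            i, _ = ready[0]
            pending.remove(ready[0])
            t += latency
            done[i] = t
        else:
            t += 1                        # stall waiting for an operand
    return [done[i] for i in range(len(ops))]
```

With arrivals `[5, 0, 1]`, in-order issue serializes everything behind the slow first operand, while out-of-order issue lets the later operations complete early, which is exactly the effect the paper exploits inside each function unit.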
ISBN:
(Print) 9780769561493
We investigate several parallel algorithmic variants of LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic to the basic linear algebra kernels that appear during the factorization, but exerts strict control over the cache memory and a static mapping of kernels to cores. A second variant relaxes this task-constrained scenario by introducing a look-ahead of depth one to increase task-parallelism, raising the pressure on the cache system in terms of cache misses. Finally, the third variant orchestrates an execution where the degree of concurrency is limited only by the actual data dependencies in LUpp, potentially yielding a higher volume of conflicts due to competition for cache memory resources. The target platform for our implementations and experiments is a specific asymmetric multicore processor (AMP) from ARM, which introduces the additional scheduling complexity of having to deal with two distinct types of cores, and a shared L2 cache per cluster of the AMP, which results in more contention in the access to this key cache level.
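For reference, the sequential kernel that all three variants reorganize is the textbook LU factorization with partial pivoting. The sketch below is the plain unblocked algorithm on Python lists, with no blocking, look-ahead, or task scheduling:

```python
# Textbook LU factorization with partial pivoting (unblocked reference).

def lu_pp(A):
    """In-place LUpp: returns (A, piv) with unit-lower L and U packed in A."""
    n = len(A)
    piv = list(range(n))
    for k in range(n):
        # partial pivoting: bring the largest |A[i][k]|, i >= k, to row k
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                   # multiplier, stored in L
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]     # trailing-matrix update
    return A, piv
```

The pivot search at step k depends on the full trailing update of column k, which is the data dependency that limits concurrency in the third variant and that look-ahead partially hides in the second.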
ISBN:
(Print) 9780769561493
The Convolutional Neural Network (CNN) is a deep learning algorithm extended from the Artificial Neural Network (ANN) and widely used for image classification and recognition, thanks to its invariance to distortions. The recent rapid growth of applications based on deep learning algorithms, especially in the context of Big Data analytics, has dramatically intensified both industrial and academic research into optimized implementations of CNNs on accelerators such as GPUs, FPGAs and ASICs, as general-purpose processors can hardly meet the ever-increasing performance and energy-efficiency requirements. FPGAs in particular are one of the most attractive alternatives, as they allow the exploitation of the implicit parallelism of the algorithm and the acceleration of the different layers of a CNN with custom optimizations, while retaining extreme flexibility thanks to their reconfigurability. In this work, we propose a methodology to implement CNNs on FPGAs in a modular, scalable way. This is done by exploiting the dataflow pattern of convolutions, using an approach derived from previous work on the acceleration of Iterative Stencil Loops (ISLs), a computational pattern that shares some characteristics with convolutions. Furthermore, this approach allows the implementation of a high-level pipeline between the different network layers, resulting in an increase in overall performance when the CNN is employed to process batches of multiple images, as would happen in real-life scenarios.
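The computational pattern being mapped to FPGA dataflow is the sliding-window convolution. The following plain Python version (an illustration, not the paper's accelerator; the image and kernel values are made up) makes the stencil-like data reuse visible: adjacent output pixels share most of their input window, which is what an ISL-style pipeline exploits.

```python
# Direct valid-mode 2D convolution (as in CNNs, i.e. cross-correlation).

def conv2d(image, kernel):
    H, W = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(H - kh + 1):
        row = []
        for j in range(W - kw + 1):
            acc = 0
            for di in range(kh):            # slide the kernel window
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
edge = [[1, 0],
        [0, -1]]   # illustrative 2x2 kernel
```

Chaining `conv2d` calls layer after layer is the software analogue of the high-level inter-layer pipeline the methodology builds in hardware.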
ISBN:
(Print) 9781538639146
Increasingly complex memory systems and on-chip interconnects are developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU, with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable, and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users in navigating the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for the Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel's tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high-bandwidth MCDRAM does not improve bitonic sort performance over DRAM.
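A capability model records sustained, measured rates rather than datasheet peaks. The sketch below is a deliberately crude STREAM-style copy probe, far simpler than the paper's systematic measurement methodology, but it shows the basic primitive: time a bulk copy, take the best of several repetitions, and report achieved rather than peak bandwidth.

```python
# Crude sustained-bandwidth probe (illustrative; sizes are assumptions).
import time
from array import array

def copy_bandwidth(n_bytes=8 * 1_000_000, reps=3):
    """Time a flat array copy and report the best sustained rate in GB/s."""
    n = n_bytes // 8
    src = array("d", range(n))
    dst = array("d", [0.0]) * n
    best = float("inf")
    for _ in range(reps):                   # best-of-reps filters warm-up noise
        t0 = time.perf_counter()
        dst[:] = src                        # bulk copy through the memory system
        best = min(best, time.perf_counter() - t0)
    return (2 * n_bytes) / best / 1e9       # count read + write traffic

gbs = copy_bandwidth()
```

Running this at several buffer sizes (fitting in L2, in MCDRAM, in DRAM) would reproduce, in miniature, the per-level capability curves the paper builds, which is how one would see that a streaming kernel like bitonic sort can be latency- rather than bandwidth-bound.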
ISBN:
(Print) 9780769561493
Determining key characteristics of High Performance Computing machines that allow users to predict their performance is an old and recurrent dream. This was, for example, the rationale behind the design of the LogP model, which later evolved into many variants (LogGP, LogGPS, LoGPS, ...) to cope with the evolution and complexity of network technology. Although the network has received a lot of attention, predicting the performance of computation kernels can be very challenging as well. In particular, the tremendous increase in internal parallelism and deep memory hierarchies in modern multi-core architectures often leaves applications limited by the memory access rate. In this context, determining the key characteristics of a machine, such as the peak bandwidth of each cache level, as well as how an application uses the memory hierarchy, can be the key to predicting or extrapolating application performance. Based on such performance models, most high-level simulation-based frameworks characterize a machine and an application separately, later convolving both signatures to predict overall performance. We evaluate the suitability of such approaches to modern architectures and applications by trying to reproduce the work of others. When trying to build our own framework, we realized that, regardless of the quality of the underlying models or software, most of these frameworks rely on "opaque" benchmarks to characterize the platform. In this article, we report the many pitfalls we encountered when trying to characterize both the network and the memory performance of modern machines. We claim that opaque benchmarks that do not clearly separate experiment design, measurements, and analysis should be avoided as much as possible in a modeling context. Likewise, an a priori identification of experimental factors should be done to make sure the experimental conditions are adequate.
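The measurement discipline argued for here, separating design, measurement, and analysis, can be made concrete with a small harness. This is an illustrative sketch (the workload and factor names are assumptions, not the authors' tooling): the design is an explicit list of factor levels, the harness records every raw sample, and summary statistics are computed in a separate step so nothing is hidden inside the benchmark.

```python
# Non-opaque benchmark harness: design, raw measurement, separate analysis.
import statistics
import time

def measure(fn, design):
    """Run fn at each factor setting in `design`, keeping all raw samples."""
    samples = []
    for setting in design:                  # a priori experimental design
        for rep in range(setting["reps"]):
            t0 = time.perf_counter()
            fn(setting["n"])
            samples.append({"n": setting["n"], "rep": rep,
                            "seconds": time.perf_counter() - t0})
    return samples                          # analysis happens elsewhere

def summarize(samples):
    """Separate analysis step: robust per-setting summary of raw samples."""
    by_n = {}
    for s in samples:
        by_n.setdefault(s["n"], []).append(s["seconds"])
    return {n: statistics.median(ts) for n, ts in by_n.items()}

raw = measure(lambda n: sum(range(n)), [{"n": 10_000, "reps": 5}])
```

Because `raw` survives, outliers and suspicious experimental conditions remain inspectable after the fact, which is exactly what an opaque benchmark that reports only a final number throws away.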
ISBN:
(Print) 9780769561493
The increased use of application-specific computational devices turns even low-power chips into high-performance computers. Not only additional accelerators (e.g., GPU, DSP, or even FPGA), but also heterogeneous CPU clusters form modern computer systems. Programming these chips is challenging, however, due to management overhead, data transfer delays, and the lack of a unified programming flow. Moreover, most accelerators require device-specific optimizations. Thus, for application developers, fulfilling software's initial intention of high portability is one of the most ambitious objectives. In this work, we present a software abstraction layer unifying the programming flow for parallel and heterogeneous platforms. To this end, we offer a generic C++ API for parallelizing on heterogeneous CPU clusters and offloading to accelerators, specifically addressing applications with strict real-time constraints. Alongside a freely configurable choice of parallelization and offloading frameworks (e.g., TBB, OpenCL) that does not affect portability, we also include automatic profiling methods. While offering high configurability of the architecture mapping, these methods ease the development of optimal scheduling strategies, e.g., in terms of power, throughput, or latency. To demonstrate the use of the proposed methods, we present heterogeneous implementations of the Semi-Global Matching and Histograms of Oriented Gradients algorithms as exemplary advanced driver-assistance algorithms. We provide an in-depth discussion of scheduling strategies for execution on a Samsung Exynos 5422 MPSoC, an Intel Xeon Phi manycore, and a general-purpose processor equipped with a Nallatech PCIe-385N FPGA accelerator card.
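The shape of such an abstraction layer can be sketched independently of the authors' C++ API. The toy below (in Python, purely illustrative; the backend names are assumptions) shows the key design choice: work is expressed once against a single entry point, and a configurable backend registry decides where it runs, so swapping frameworks never touches application code.

```python
# Toy unified-offloading layer: one task expression, pluggable backends.

BACKENDS = {}

def backend(name):
    """Register a backend under a configurable name."""
    def register(fn):
        BACKENDS[name] = fn
        return fn
    return register

@backend("serial")
def run_serial(task, chunks):
    return [task(c) for c in chunks]

@backend("threads")
def run_threads(task, chunks):
    from concurrent.futures import ThreadPoolExecutor
    with ThreadPoolExecutor() as pool:
        return list(pool.map(task, chunks))

def offload(task, chunks, where="serial"):
    """Single entry point; `where` plays the role of the framework choice
    (TBB, OpenCL, ...) in the abstraction layer described above."""
    return BACKENDS[where](task, chunks)

out = offload(sum, [[1, 2], [3, 4]], where="threads")
```

A profiling method in this scheme would simply time `offload` under each registered backend and pick the mapping that best meets the power, throughput, or latency constraint.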