ISBN (print): 9781450364942
The proceedings contain 33 papers. The topics discussed include: a performance evaluation of multi-FPGA architectures for computations of information transfer; massively parallel computation of linear recurrence equations with graphics processing units; a first-order approximation of microarchitecture energy-efficiency; delays and states in dataflow models of computation; communication-aware scheduling algorithms for synchronous dataflow graphs on multicore systems; towards power management verification of time-triggered systems using virtual platforms; architectural considerations for FPGA acceleration of machine learning applications in MapReduce; and fast parallel simulation of a manycore architecture with a flit-level on-chip network model.
Computer models have been used as a bridge between parallel algorithms and hardware architectures. The Bulk-Synchronous Parallel (BSP) model is a well-known computing model originally devised for distributed algorithms running on clusters of single-core processors. The Multi-BSP model, which extends the classic BSP model, was recently proposed for multi-core processors. However, this model, implemented through the MulticoreBSP-for-C library, presents some restrictions, such as the explicit synchronizations between cores, which raises the question of which hardware characteristics should be taken into account to properly model parallel algorithms. Therefore, we explore the suitability of these models for Dell multi-core architectures. The objectives of this contribution are twofold. First, we model two different multi-core Dell architectures. Second, we show that a simple model with few parameters can be easily adapted to each Dell platform, in contrast with complex models that tend to rely on tricky hardware parameters.
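As a rough illustration of the superstep structure these models assume, the C sketch below uses the BSPlib-style interface that MulticoreBSP-for-C exposes (bsp_begin, bsp_pid, bsp_put, bsp_sync). The header name, the fixed 64-slot buffer, and the toy reduction are assumptions made for brevity, not code from the paper.

/* Minimal BSP superstep sketch: each core computes a partial sum, exchanges
 * it with every other core, and reduces locally. Assumes at most 64 cores. */
#include <bsp.h>
#include <stdio.h>

static void spmd(void) {
    bsp_begin(bsp_nprocs());              /* start SPMD section on all cores */
    const unsigned p = bsp_nprocs();
    const unsigned s = bsp_pid();

    /* Superstep 1: every core computes a local partial sum. */
    unsigned long local = 0;
    for (unsigned i = s; i < 1000; i += p) local += i;

    unsigned long partial[64];            /* one slot per core (assumes p <= 64) */
    bsp_push_reg(partial, sizeof(partial));
    bsp_sync();                           /* barrier: registration takes effect */

    for (unsigned t = 0; t < p; ++t)      /* one-sided put to every core        */
        bsp_put(t, &local, partial, s * sizeof(unsigned long), sizeof(unsigned long));
    bsp_sync();                           /* all puts are visible after this    */

    unsigned long total = 0;
    for (unsigned t = 0; t < p; ++t) total += partial[t];
    if (s == 0) printf("sum = %lu\n", total);
    bsp_end();
}

int main(int argc, char **argv) {
    bsp_init(spmd, argc, argv);           /* required before bsp_begin          */
    spmd();
    return 0;
}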
In this paper, a software/hardware framework is proposed for generating uniform random numbers in parallel. Using the fast jump-ahead technique, the software produces initial states for each generator to guarantee the independence of the different sub-streams. With support from the software, the hardware structure can be constructed by simply replicating the single generator. We apply the framework to parallelize the MT19937 algorithm. Experimental results show that our framework is capable of generating an arbitrary number of independent parallel random sequences while obtaining a speedup roughly proportional to the number of parallel cores. Meanwhile, our framework is superior to the existing architectures reported in the literature in both throughput rate and scalability. Furthermore, we implement 149 parallel instances of the MT19937 generator on a Xilinx Virtex-5 FPGA device, achieving a throughput of 42.61M samples/s. Compared to CPU and GPU implementations, the throughput is 10.0 and 2.5 times higher, while the throughput-power efficiency achieves 167.3x and 18.1x improvements, respectively.
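The jump-ahead-then-replicate idea can be illustrated with a much simpler generator than MT19937, whose fast jump-ahead requires polynomial arithmetic over GF(2). The C sketch below uses a 64-bit LCG (MMIX constants) as a stand-in: the "software side" jumps the seed ahead to carve out disjoint sub-streams, and the "hardware side" is just replicated generators. All constants and the sub-stream length are illustrative assumptions.

/* Jump-ahead splitting with an LCG: state advances by n steps in O(log n)
 * by binary exponentiation of the affine map x -> a*x + c (mod 2^64).      */
#include <stdint.h>
#include <stdio.h>

#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

static uint64_t lcg_jump(uint64_t x, uint64_t n) {
    uint64_t ap = LCG_A, cp = LCG_C;   /* current map f^(2^k)                 */
    uint64_t aj = 1,     cj = 0;       /* accumulated map, starts at identity */
    while (n) {
        if (n & 1) { aj = ap * aj; cj = ap * cj + cp; }
        cp = (ap + 1) * cp;            /* square the current map              */
        ap = ap * ap;
        n >>= 1;
    }
    return aj * x + cj;                /* apply the composed map to x         */
}

static uint64_t lcg_next(uint64_t *x) { *x = LCG_A * (*x) + LCG_C; return *x; }

int main(void) {
    const int P = 4;                   /* number of parallel generators       */
    const uint64_t block = 1ULL << 40; /* sub-stream length per generator     */
    uint64_t seed = 0x9e3779b97f4a7c15ULL, state[4];

    /* Software side: one jumped-ahead initial state per generator, so the
     * P sub-streams are disjoint pieces of one long sequence.               */
    for (int i = 0; i < P; ++i) state[i] = lcg_jump(seed, (uint64_t)i * block);

    /* Hardware side (here: plain loops) just replicates the generator.      */
    for (int i = 0; i < P; ++i)
        printf("gen %d first output: %016llx\n", i,
               (unsigned long long)lcg_next(&state[i]));
    return 0;
}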
A high-speed frequency-domain bit synchronization algorithm for parallel implementation is proposed in this paper, aimed at integrating high-speed laser communication with distance measurement. The algorithm combines bit synchronization with frequency-domain parallel matched filtering so that the filter output contains only the data at the optimum sampling points; its implementation complexity is therefore noticeably lower than that of a parallel time-domain bit synchronization algorithm. Moreover, the algorithm adopts tracking timing adjustment to achieve better accuracy. This paper describes the algorithm, including the formula derivation, implementation structure, and simulation verification, and compares the time-domain and frequency-domain algorithms. According to the simulation results, the proposed algorithm performs better.
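The toy C sketch below illustrates the core idea as described, not the paper's implementation: an oversampled NRZ stream is matched-filtered by multiplying spectra, and only one sample per bit, at the phase with the largest filter output, is kept. A naive DFT stands in for the FFT a real parallel implementation would use, and the noiseless rectangular-pulse signal model is an assumption for brevity.

/* Frequency-domain matched filtering + bit synchronization on a toy signal. */
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define BITS 16
#define SPB  8                       /* samples per bit (oversampling)       */
#define N    (BITS * SPB)
#define PI   3.141592653589793

static void dft(const double complex *in, double complex *out, int inverse) {
    for (int k = 0; k < N; ++k) {
        double complex acc = 0;
        for (int n = 0; n < N; ++n)
            acc += in[n] * cexp((inverse ? I : -I) * 2.0 * PI * k * n / N);
        out[k] = inverse ? acc / N : acc;
    }
}

int main(void) {
    int bits[BITS] = {1,0,1,1,0,0,1,0, 1,1,1,0,0,1,0,0};
    double complex x[N], h[N], X[N], H[N], Y[N], y[N];

    for (int n = 0; n < N; ++n) x[n] = bits[n / SPB] ? 1.0 : -1.0;
    for (int n = 0; n < N; ++n) h[n] = (n < SPB) ? 1.0 : 0.0;  /* matched filter */

    dft(x, X, 0); dft(h, H, 0);
    for (int k = 0; k < N; ++k) Y[k] = X[k] * H[k];   /* convolution in frequency */
    dft(Y, y, 1);

    /* Bit synchronization: pick the sampling phase with the largest mean |y|. */
    int best = 0; double bestE = -1;
    for (int p = 0; p < SPB; ++p) {
        double e = 0;
        for (int b = 0; b < BITS; ++b) e += cabs(y[b * SPB + p]);
        if (e > bestE) { bestE = e; best = p; }
    }

    printf("optimum sampling phase: %d\n", best);
    for (int b = 0; b < BITS; ++b)        /* decimated output: one sample per bit */
        printf("bit %2d: sent %d, detected %d\n", b, bits[b],
               creal(y[b * SPB + best]) > 0 ? 1 : 0);
    return 0;
}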
ISBN (print): 9781538610428
In parallel computing, a valid graph coloring yields lock-free processing of the colored tasks, data points, etc., without expensive synchronization mechanisms. However, coloring is not free and the overhead can be significant. In particular, for the bipartite-graph partial coloring (BGPC) and distance-2 graph coloring (D2GC) problems, which have various use cases within the scientific computing and numerical optimization domains, the coloring overhead can be on the order of minutes with a single thread for many real-life graphs. In this work, we propose parallel algorithms for bipartite-graph partial coloring on shared-memory architectures. Compared to the existing shared-memory BGPC algorithms, the proposed ones employ greedier and more optimistic techniques that yield better parallel coloring performance. In particular, on 16 cores, the proposed algorithms are more than 4x faster than their counterparts in the ColPack library, which is, to the best of our knowledge, the only publicly available coloring library for multicore architectures. In addition to BGPC, the proposed techniques are employed to devise parallel distance-2 graph coloring algorithms, and similar performance improvements have been observed. Finally, we propose two costless balancing heuristics for BGPC that can reduce the skewness and imbalance of the color-set cardinalities (almost) for free. The heuristics can also be used for the D2GC problem and, in general, they are likely to yield better color-based parallelization performance, especially on many-core architectures.
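For reference, a minimal sequential sketch of the greedy first-fit coloring underlying BGPC is shown below (in C, over a toy bipartite graph in CSR form). The parallel, optimistic algorithms proposed in the paper build on this pattern with speculation and conflict resolution, which the sketch does not attempt to reproduce.

/* Greedy BGPC: color left vertices so that two left vertices sharing a
 * right-side neighbor never receive the same color (distance-2 constraint). */
#include <stdio.h>
#include <string.h>

#define NL 4   /* left vertices  */
#define NR 3   /* right vertices */
#define NE 7   /* edges          */

int main(void) {
    /* CSR of the left side: adjacency from left vertices to right vertices. */
    int xadj_l[NL + 1] = {0, 2, 4, 6, 7};
    int adj_l[NE]      = {0, 1, 0, 2, 1, 2, 2};
    /* CSR of the right side (the transpose), needed to walk distance-2 paths. */
    int xadj_r[NR + 1] = {0, 2, 4, 7};
    int adj_r[NE]      = {0, 1, 0, 2, 1, 2, 3};

    int color[NL], forbidden[NL + 1];
    for (int v = 0; v < NL; ++v) color[v] = -1;
    memset(forbidden, 0xff, sizeof(forbidden));   /* all entries become -1    */

    for (int v = 0; v < NL; ++v) {
        /* Mark colors used by all distance-2 neighbors of v as forbidden.    */
        for (int i = xadj_l[v]; i < xadj_l[v + 1]; ++i) {
            int r = adj_l[i];
            for (int j = xadj_r[r]; j < xadj_r[r + 1]; ++j) {
                int w = adj_r[j];
                if (w != v && color[w] >= 0) forbidden[color[w]] = v;
            }
        }
        /* First-fit: smallest color not marked forbidden in this round.      */
        int c = 0;
        while (forbidden[c] == v) ++c;
        color[v] = c;
    }
    for (int v = 0; v < NL; ++v) printf("left vertex %d -> color %d\n", v, color[v]);
    return 0;
}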
ISBN (print): 9783319654829; 9783319654812
Processing of large scale-free graphs on parallel architectures offers high parallelization opportunities but comes with considerable overheads. Due to the skewed degree distribution, each thread receives a different amount of computational workload. In this paper we present a method that addresses this challenge by modifying the CSR data structure and redistributing work across threads. The method was implemented in breadth-first search and single-source shortest path algorithms for the GPU architecture.
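A CPU-side sketch of the baseline this method improves on may help: level-synchronous BFS over a CSR graph, here with OpenMP dynamic scheduling as a crude stand-in for the paper's GPU work redistribution (the actual method reshapes the CSR structure and splits the edges of high-degree vertices across threads). The toy graph and scheduling choice are assumptions, not taken from the paper.

/* Level-synchronous BFS over CSR with dynamic scheduling over the frontier. */
#include <limits.h>
#include <stdio.h>

#define NV 6
#define NE 8

int main(void) {
    /* Small directed graph in CSR form; vertex 0 is the BFS source.          */
    int xadj[NV + 1] = {0, 3, 4, 6, 7, 8, 8};
    int adj[NE]      = {1, 2, 3, 4, 4, 5, 5, 5};
    int dist[NV], frontier[NV], next[NV];

    for (int v = 0; v < NV; ++v) dist[v] = INT_MAX;
    dist[0] = 0;
    frontier[0] = 0;
    int fsize = 1, level = 0;

    while (fsize > 0) {
        int nsize = 0;
        /* Dynamic scheduling hands vertices out one at a time, so a thread
         * that drew a hub vertex does not stall the others for long.         */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < fsize; ++i) {
            int u = frontier[i];
            for (int e = xadj[u]; e < xadj[u + 1]; ++e) {
                int v = adj[e];
                int claimed = 0;
                /* Atomically claim unvisited neighbours for the next level.  */
                #pragma omp critical
                if (dist[v] == INT_MAX) { dist[v] = level + 1; claimed = 1; }
                if (claimed) {
                    int slot;
                    #pragma omp atomic capture
                    slot = nsize++;
                    next[slot] = v;
                }
            }
        }
        for (int i = 0; i < nsize; ++i) frontier[i] = next[i];
        fsize = nsize;
        ++level;
    }
    for (int v = 0; v < NV; ++v) printf("dist(0 -> %d) = %d\n", v, dist[v]);
    return 0;
}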
ISBN (print): 9781538661192
In order to satisfy the high mobility and reliability requirements of space optical networks, a parallel Wavelength Fault Tolerant Clos-network, PW-FTC, is proposed to resist faults induced by space effects. At the network level, the minimal number of wavelengths required to resist link failures, W_min, is obtained by solving a linear programming model. At the node level, the PW-FTC network consists of W_min Fault Tolerant Clos-network (FTC) planes, each of which accomplishes the switching of one wavelength. Both theoretical analysis and numerical results demonstrate that the PW-FTC outperforms the parallel Wavelength Clos-network (PW-Clos) and the parallel Spare Wavelength Clos-network (PSW-Clos) in reliability, at the cost of additional spare switching elements.
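The abstract does not reproduce the linear programming model, so the following C snippet only conveys the flavor of choosing a minimal number of parallel planes: assuming independent per-plane end-to-end failures with probability p_fail, it searches for the smallest W whose survival probability meets a target. Both parameters are made-up illustrative values, and brute-force search stands in for the LP.

/* Toy search for the minimal number of parallel wavelength planes W such
 * that at least one plane survives with probability >= target.             */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double p_fail = 0.05;      /* assumed per-plane failure probability */
    const double target = 0.999999;  /* assumed reliability requirement       */

    for (int W = 1; W <= 16; ++W) {
        double reliability = 1.0 - pow(p_fail, W);   /* all W planes must fail */
        if (reliability >= target) {
            printf("W_min = %d (reliability %.8f)\n", W, reliability);
            return 0;
        }
    }
    printf("no W <= 16 meets the target\n");
    return 0;
}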
The Discrete Periodic Radon Transform (DPRT) has many important applications in reconstructing images from their projections and has recently been used in fast and scalable architectures for computing 2D convolutions. Unfortunately, the direct computation of the DPRT involves O(N^3) additions and memory accesses, which can be very costly on single-core architectures. The current paper presents new and efficient algorithms for computing the DPRT and its inverse on multi-core CPUs and GPUs, and compares the results against specialized hardware implementations (FPGAs/ASICs). The results provide significant evidence of the success of the new algorithms. On an 8-core CPU (Intel Xeon) with support for two threads per core, FastDirDPRT and FastDirInvDPRT achieve a speedup of approximately 10x (up to 12.83x) over the single-core CPU implementation. On a 2048-core GPU (GTX 980), FastRayDPRT and FastRayInvDPRT achieve speedups in the range of 526x (for 127 x 127 images) to 873x (for 1021 x 1021 images), which approach the ideal achievable speedups. The DPRT can be computed exactly and in real time (30 frames per second) for 1471 x 1471 images using FastRayDPRT on the GPU. Furthermore, the GPU algorithms approximate the performance of an efficient FPGA implementation using 2N parallel cores at 100 MHz.
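A direct-computation sketch makes the O(N^3) cost concrete. The C code below uses one common DPRT convention for a prime-sized N x N image (N "diagonal" projections plus one extra row-sum projection, each an N x N accumulation) and parallelizes over projections with OpenMP; the convention, test image, and coarse-grained parallelization are assumptions, and the paper's FastDirDPRT/FastRayDPRT algorithms are substantially more refined.

/* Direct DPRT of a prime-sized test image: N+1 projections, O(N^3) additions. */
#include <stdio.h>

#define N 7   /* must be prime for this DPRT convention */

int main(void) {
    static int f[N][N], R[N + 1][N];

    for (int i = 0; i < N; ++i)            /* small synthetic test image       */
        for (int j = 0; j < N; ++j) f[i][j] = i + j;

    #pragma omp parallel for
    for (int m = 0; m <= N; ++m)
        for (int d = 0; d < N; ++d) {
            int sum = 0;
            if (m < N)                      /* "diagonal" projections           */
                for (int j = 0; j < N; ++j) sum += f[(d + m * j) % N][j];
            else                            /* the extra horizontal projection  */
                for (int j = 0; j < N; ++j) sum += f[d][j];
            R[m][d] = sum;
        }

    /* Sanity check: every projection redistributes the same total mass.      */
    long total = 0, proj0 = 0;
    for (int i = 0; i < N; ++i) for (int j = 0; j < N; ++j) total += f[i][j];
    for (int d = 0; d < N; ++d) proj0 += R[0][d];
    printf("image sum = %ld, projection 0 sum = %ld\n", total, proj0);
    return 0;
}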
ISBN (print): 9781728102139
Dataflow computing architectures exploit dynamic parallelism at the fine granularity of individual operations and provide a pathway to overcome the performance and energy limits of conventional von Neumann models. In this vein, we present DaCO (Dataflow Coprocessor FPGA Overlay), a high-performance compute organization for FPGAs that delivers up to 2.5x speedup over existing dataflow alternatives. Historically, dataflow-style execution has been viewed as an attractive parallel computing paradigm due to the self-timed, decentralized implementation of dataflow dependencies and the absence of sequential program counters. However, realising high-performance dataflow computers has remained elusive, largely due to the complexity of scheduling this parallelism and data communication bottlenecks. DaCO addresses these issues by (1) supporting large-scale (1000s of nodes) out-of-order scheduling using hierarchical lookup, (2) priority-aware routing of dataflow dependencies using the efficient Hoplite-Q NoC, and (3) clustering techniques to exploit data locality in the communication network organization. Each DaCO processing element is a programmable soft processor that communicates with the others over a packet-switched network-on-chip (PSNoC). We target the Arria 10 AX115S FPGA to take advantage of the hard floating-point DSP blocks, and maximize performance by multipumping the M20K Block RAMs. Overall, we can scale DaCO to 450 processors operating at an f_max of 250 MHz on the target platform. Each soft processor consumes 779 ALMs, 4 M20K BRAMs, and 3 hard floating-point DSP blocks for optimum balance, while the on-chip communication framework consumes less than 15% of the on-chip resources.
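The self-timed firing rule that motivates DaCO can be sketched in a few lines of C: a node fires as soon as all of its input tokens have arrived, and execution order comes from a ready queue rather than a program counter. The tiny three-node graph, which computes (a + b) * (a - b), and the plain FIFO queue are illustrative stand-ins for DaCO's hierarchical, priority-aware scheduler and NoC.

/* Software sketch of dataflow firing: tokens arrive, nodes become ready, fire. */
#include <stdio.h>

#define NODES 3
#define MAX_IN 2

typedef struct {
    char op;                    /* '+', '-', '*'                               */
    double in[MAX_IN];
    int have, need;             /* tokens arrived vs. tokens required          */
    int dest, dest_port;        /* where the result token goes (-1: output)    */
} Node;

static Node g[NODES] = {
    {'+', {0, 0}, 0, 2, 2, 0},  /* node 0: a + b  -> node 2, port 0            */
    {'-', {0, 0}, 0, 2, 2, 1},  /* node 1: a - b  -> node 2, port 1            */
    {'*', {0, 0}, 0, 2, -1, 0}, /* node 2: product -> program output            */
};

static int ready[NODES], rhead = 0, rtail = 0;

static void send_token(int node, int port, double v) {
    g[node].in[port] = v;
    if (++g[node].have == g[node].need) ready[rtail++] = node;  /* firing rule  */
}

int main(void) {
    double a = 3.0, b = 5.0;
    send_token(0, 0, a); send_token(0, 1, b);   /* inject input tokens          */
    send_token(1, 0, a); send_token(1, 1, b);

    while (rhead < rtail) {                     /* drain the ready queue        */
        Node *n = &g[ready[rhead++]];
        double r = n->op == '+' ? n->in[0] + n->in[1]
                 : n->op == '-' ? n->in[0] - n->in[1]
                 :                n->in[0] * n->in[1];
        if (n->dest >= 0) send_token(n->dest, n->dest_port, r);
        else printf("output token: %g\n", r);   /* (a+b)*(a-b) = -16            */
    }
    return 0;
}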
Dynamic programming techniques are well established and employed by various practical algorithms, including the edit-distance algorithm and the dynamic time warping algorithm. These algorithms usually operate in an iteration-based manner, where new values are computed from the values of the previous iteration. The data dependencies enforce synchronization, which limits the possibilities for internal parallel processing. In this paper, we investigate parallel approaches to processing matrix-based dynamic programming algorithms on modern multicore CPUs, Intel Xeon Phi accelerators, and general-purpose GPUs. We address both the problem of computing a single distance on large inputs and the problem of computing a number of distances of smaller inputs simultaneously (e.g., when a similarity query is being resolved). Our proposed solutions yielded significant improvements in performance and achieved speedups of two orders of magnitude when compared to the serial baseline.
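One standard way to expose the parallelism discussed here is the anti-diagonal (wavefront) sweep: all cells of one anti-diagonal of the edit-distance matrix depend only on the two previous diagonals, so they can be computed concurrently, while the dependency between consecutive diagonals is exactly the synchronization the abstract refers to. The short C/OpenMP sketch below illustrates this; the fixed-size matrix and example strings are assumptions for brevity.

/* Edit distance with anti-diagonal (wavefront) parallelization.             */
#include <stdio.h>
#include <string.h>

static int min3(int a, int b, int c) { return a < b ? (a < c ? a : c) : (b < c ? b : c); }

int main(void) {
    const char *s = "kitten", *t = "sitting";
    const int n = (int)strlen(s), m = (int)strlen(t);
    static int D[64][64];                        /* assumes short inputs       */

    for (int i = 0; i <= n; ++i) D[i][0] = i;
    for (int j = 0; j <= m; ++j) D[0][j] = j;

    /* Sweep anti-diagonals d = i + j; cells of one diagonal are independent. */
    for (int d = 2; d <= n + m; ++d) {
        int ilo = d - m > 1 ? d - m : 1;
        int ihi = d - 1 < n ? d - 1 : n;
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; ++i) {
            int j = d - i;
            int sub = D[i - 1][j - 1] + (s[i - 1] != t[j - 1]);
            D[i][j] = min3(sub, D[i - 1][j] + 1, D[i][j - 1] + 1);
        }
    }
    printf("edit distance(%s, %s) = %d\n", s, t, D[n][m]);   /* expected: 3    */
    return 0;
}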