This paper discusses modifications to algorithms for computing cubes within a parallel cubing unit, presenting several architectures for operand sizes ranging from 8 to 32 bits. The proposed method separates the cubing partial-product matrix into smaller elements and organizes these partial products into repeatable, manageable groups. Consequently, the overall partial-product matrix is substantially reduced compared to previous methods. An algorithmic analysis is also presented that demonstrates reductions in area and delay for several operand widths, together with implementations on Xilinx Virtex-5 FPGAs and in an IBM 65 nm ASIC standard-cell library.
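As a rough software illustration of the idea behind cubing partial-product matrices (not the paper's exact grouping), the sketch below forms x^3 by enumerating bit triples and folding the symmetric triples into weighted groups; the function name and the multiplicity-based grouping are illustrative assumptions.

```python
def cube_via_partial_products(x: int, n: int) -> int:
    """Compute x**3 by summing weighted bit-triple partial products.

    Mirrors (in software) how a cubing unit forms the partial-product
    matrix: each triple of operand bits (i, j, k) contributes
    x_i * x_j * x_k at column weight 2**(i + j + k).  Collapsing the
    symmetric triples i <= j <= k with multiplicities 1, 3, or 6 is a
    simple stand-in for the paper's repeatable partial-product groups.
    """
    bits = [(x >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(i, n):
            for k in range(j, n):
                if not (bits[i] and bits[j] and bits[k]):
                    continue
                if i == j == k:
                    mult = 1   # x_i^3 = x_i for a single bit
                elif i == j or j == k:
                    mult = 3   # two equal indices: 3 orderings
                else:
                    mult = 6   # all distinct: 6 orderings
                total += mult << (i + j + k)
    return total

assert cube_via_partial_products(13, 8) == 13 ** 3
```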
ISBN (print): 9781581131857
In this paper we present fine-grained multithreaded algorithms and implementations for the Fast Fourier Transform (FFT) problem. The FFT problem has been formulated using two distinct approaches based on dataflow concepts. The first approach, referred to as the receiver-initiated algorithm, realizes the FFT iterations as a parent-child relationship while fully exploiting the underlying parallelism. The second approach, referred to as the sender-initiated algorithm, follows a dataflow model based on the producer-consumer style of programming and can be adapted to different architectural parameters for achieving high performance. The implementations of the proposed algorithms have been carried out on the EARTH (Efficient Architecture for Running THreads) platform. For both algorithms, we analyze the ratio of remote to local threads and study its impact on the experimental results. Our implementation results show that for certain block sizes, on a fixed problem size and machine size, the receiver-initiated approach performs better than the sender-initiated approach. For large numbers of processors, both algorithms perform well, yielding execution times of only 10 ms for an input of 16K data points on a 64-processor machine with each processor running at a 140 MHz clock speed.
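A minimal sketch of the parent-child (receiver-initiated) flavor, assuming a plain Python thread pool rather than EARTH threads: the parent splits each FFT stage and waits on its children, with a max_depth knob as a crude stand-in for tuning the remote/local thread ratio.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def fft(x, pool, depth=0, max_depth=2):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    Parent-child style: the parent hands the even/odd halves to child
    threads and combines their results.  Beyond max_depth the recursion
    stays local in the current thread.
    """
    n = len(x)
    if n == 1:
        return x
    if depth < max_depth:
        f_even = pool.submit(fft, x[0::2], pool, depth + 1, max_depth)
        f_odd = pool.submit(fft, x[1::2], pool, depth + 1, max_depth)
        even, odd = f_even.result(), f_odd.result()
    else:
        even = fft(x[0::2], pool, depth + 1, max_depth)
        odd = fft(x[1::2], pool, depth + 1, max_depth)
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]

# max_workers must cover all in-flight tasks, since parents block on
# their children: with max_depth=2 at most 6 tasks are submitted here.
with ThreadPoolExecutor(max_workers=8) as pool:
    print(fft([1, 1, 1, 1, 0, 0, 0, 0], pool))
```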
ISBN (print): 9781538681114
Layered (or turbo) decoding of Low-Density Parity-Check (LDPC) codes is a decoding schedule that facilitates partially parallel architectures for performing iterative algorithms based on belief propagation. It has, on one hand, reduced implementation complexity and memory overhead compared to fully parallel architectures and, on the other hand, higher convergence speed compared to both serial and parallel architectures. In this paper, we introduce a general form of shuffling of the parity-check matrix of quasi-cyclic LDPC (QC-LDPC) codes which can split the critical-path delay in layered decoding and therefore improve throughput by allowing higher clock rates. We also reveal a valuable property of Latin-square QC-LDPC codes which makes them a good candidate for the proposed shuffling method. As a result of that property, no special care in choosing offset values is required in the proposed generalized shuffling method.
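A toy analogue, not the paper's generalized shuffling: the sketch below expands a QC-LDPC base matrix of circulant shifts and applies per-block-column shift offsets, the kind of code-preserving rearrangement a layered decoder can exploit. The base matrix and offsets are made up for illustration.

```python
import numpy as np

def expand_qc(base, z):
    """Expand a QC-LDPC base matrix of circulant shifts into H.

    base[i][j] = -1 means an all-zero z-by-z block; a value s >= 0
    means the z-by-z identity cyclically shifted by s columns.
    """
    m, n = len(base), len(base[0])
    H = np.zeros((m * z, n * z), dtype=np.uint8)
    for i in range(m):
        for j in range(n):
            s = base[i][j]
            if s >= 0:
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(
                    np.eye(z, dtype=np.uint8), s, axis=1)
    return H

def shuffle_columns(base, offsets):
    """Add a per-block-column offset to every non-zero shift (np.roll
    reduces mod z at expansion time).  Re-shifting a whole block column
    only relabels variables within that block, so the code is unchanged
    while the update order seen by a layered decoder is rearranged."""
    return [[(s + off) if s >= 0 else -1
             for s, off in zip(row, offsets)] for row in base]

base = [[0, 1, -1, 2],
        [2, -1, 1, 0]]
H = expand_qc(shuffle_columns(base, [0, 1, 1, 0]), z=4)
print(H.shape)  # (8, 16)
```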
ISBN (print): 9781467355247
Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.
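A minimal sketch of a variance analysis, assuming a toy straight-line IR (the paper's analyses additionally handle arbitrary control flow via convergence analysis): values are variant if they transitively depend on the thread ID, and everything else is a scalarization candidate. The IR encoding and value names here are invented for illustration.

```python
def variance_analysis(ir, variant_seeds=("tid",)):
    """Forward pass: a result is variant if any operand is variant."""
    variant = set(variant_seeds)
    for dest, _op, operands in ir:
        if any(src in variant for src in operands):
            variant.add(dest)
    return variant

ir = [
    ("base",  "load", ["ptr"]),          # same value for every thread
    ("offs",  "mul",  ["tid", "four"]),  # depends on the thread ID
    ("addr",  "add",  ["base", "offs"]),
    ("scale", "mul",  ["base", "two"]),  # convergent: scalarizable
]
variant = variance_analysis(ir)
scalar = [dest for dest, _, _ in ir if dest not in variant]
print("scalarizable:", scalar)  # ['base', 'scale']
```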
Grid or mesh techniques are frequently used to approximate continuous entities that behave in a wave-like or fluid-like fashion. Partial Differential Equations (PDEs) are usually involved in the description of such entities or processes. Distributed parallel computation was used in various computer-cluster configurations to calculate PDE solutions for an electrostatic field. The study examined the efficacy of the selected architectures when using mesh techniques, and investigated the match between algorithm and architecture in achieving maximum computational performance. The developed architectures, algorithms, and findings are presented in the paper.
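As a concrete single-process illustration of the grid method (the paper distributes the mesh across cluster nodes), here is a Jacobi sweep for Laplace's equation, the PDE governing the electrostatic potential in a charge-free region; the grid size and boundary values are arbitrary.

```python
import numpy as np

def jacobi_laplace(n=64, iters=2000):
    """Solve Laplace's equation on an n-by-n mesh by Jacobi iteration.

    In the cluster setting each rank would own a strip of the grid and
    exchange boundary rows with its neighbors every sweep; here the
    whole grid lives in one process.
    """
    phi = np.zeros((n, n))
    phi[0, :] = 1.0  # boundary condition: top edge held at 1 V
    for _ in range(iters):
        # each interior point becomes the average of its 4 neighbors
        interior = 0.25 * (phi[:-2, 1:-1] + phi[2:, 1:-1] +
                           phi[1:-1, :-2] + phi[1:-1, 2:])
        phi[1:-1, 1:-1] = interior
    return phi

phi = jacobi_laplace()
print(round(float(phi[32, 32]), 4))  # center value approaches 0.25
```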
ISBN (print): 9781509035090
Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known prior to execution. However, modern GPU architectures do not support dynamic parallelism efficiently due to the high kernel-launch overhead, the limited number of simultaneous kernels, and the limited depth of dynamic calls a device can support. In this paper, we propose Kernel Launch Aggregation and Promotion (KLAP), a set of compiler techniques that improve the performance of kernels which use dynamic parallelism. Kernel launch aggregation fuses kernels launched by threads in the same warp, block, or kernel into a single aggregated kernel, thereby reducing the total number of kernels spawned and increasing the amount of work per kernel to improve occupancy. Kernel launch promotion enables early launch of child kernels to extract more parallelism between parents and children, and aggregates kernel launches across generations, mitigating the problem of limited depth. We implement our techniques in a real compiler and show that kernel launch aggregation obtains a geometric-mean speedup of 6.58x over regular dynamic parallelism. We also show that kernel launch promotion enables cases that were not originally possible, improving throughput by a geometric mean of 30.44x.
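A Python analogue of kernel launch aggregation, with illustrative names throughout (this is not KLAP's generated code): instead of one child launch per parent thread, parents record offsets into a shared buffer and a single aggregated launch processes all children.

```python
def child_work(item):
    return item * item

def naive_dynamic_parallelism(parent_items):
    # one "launch" per parent thread: on a GPU, each incurs the full
    # kernel-launch overhead and competes for limited launch slots
    results = []
    for item in parent_items:
        results.append([child_work(x) for x in range(item)])
    return results

def aggregated_launch(parent_items):
    # parents only deposit (offset, count) into a shared buffer; one
    # aggregated "kernel" then processes the concatenated work, which
    # is the occupancy win the aggregation transform targets
    offsets, flat = [], []
    for item in parent_items:
        offsets.append((len(flat), item))
        flat.extend(range(item))
    processed = [child_work(x) for x in flat]  # single launch
    return [processed[o:o + c] for o, c in offsets]

assert naive_dynamic_parallelism([3, 1, 4]) == aggregated_launch([3, 1, 4])
```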
ISBN (print): 9781595934529
Suppose we have a parallel or distributed system whose nodes have limited capacities, such as processing speed, bandwidth, memory, or disk space. How does the performance of the system depend on the amount of heterogeneity of its capacity distribution? We propose a general framework to quantify the worst-case effect of increasing heterogeneity in models of parallel systems. Given a cost function g(C,W) representing the system's performance as a function of its nodes' capacities C and workload W (such as the completion time of an optimum schedule of jobs W on machines C), we say that g has price of heterogeneity α when, for any workload, the cost cannot increase by more than a factor of α if node capacities become arbitrarily more heterogeneous. We give constant bounds on the price of heterogeneity of several well-known job scheduling and graph degree/diameter problems, indicating that increasing heterogeneity can never be much of a disadvantage. On the other hand, with the introduction of timing constraints such as release times or precedence constraints on the jobs, the dependence on node capacities becomes more complex, so that increasing heterogeneity may be quite detrimental.
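A toy instance of the framework, assuming greedy LPT list scheduling as a stand-in for the optimum schedule: with the same jobs and the same total capacity, making the speed vector more heterogeneous can raise the makespan, though per the paper's bounds only by a constant factor for this kind of problem.

```python
def makespan(speeds, jobs):
    """Cost g(C, W): makespan of a greedy LPT schedule of jobs (W) on
    uniformly related machines with the given speeds (C)."""
    loads = [0.0] * len(speeds)
    for w in sorted(jobs, reverse=True):  # longest processing time first
        i = min(range(len(speeds)),
                key=lambda i: (loads[i] + w) / speeds[i])
        loads[i] += w
    return max(load / s for load, s in zip(loads, speeds))

jobs = [7, 5, 4, 3, 3, 2, 1, 1]
print(makespan([1, 1, 1, 1], jobs))          # homogeneous capacities
print(makespan([2.5, 0.5, 0.5, 0.5], jobs))  # same total, more heterogeneous
```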