ISBN (print): 9781450315722
We present novel work-optimal PRAM algorithms for Burrows-Wheeler (BW) compression and decompression of strings over a constant alphabet. For a string of length n, the depth of the compression algorithm is O(log² n), and the depth of the corresponding decompression algorithm is O(log n). These appear to be the first polylogarithmic-time work-optimal parallel algorithms for any standard lossless compression problem. Our algorithms for the individual stages of compression and decompression may also be of independent interest: 1. a novel O(log n)-time, O(n)-work PRAM algorithm for Huffman decoding; 2. original insights into the stages of the BW compression and decompression problems, bringing out parallelism that was not readily apparent. We then mapped such parallelism in interesting ways to elementary parallel routines that have O(log n)-time, O(n)-work solutions, such as: (i) prefix-sums problems with an appropriately-defined associative binary operator for several stages, and (ii) list ranking for the final stage of decompression (inverse block-sorting transform). Companion work reports empirical speedups of up to 25x for compression and up to 13x for decompression. This reflects a speedup of 70x over recent work on BW compression on GPUs.
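The final decompression stage mentioned in (ii) inverts the block-sorting transform by following a permutation through the text one character at a time; that chain of pointers is exactly the linked list to which list ranking applies. As a minimal sequential sketch of the stage (not the paper's parallel list-ranking version), assuming the input carries a unique smallest sentinel character `$`:

```python
def inverse_bwt(bwt: str) -> str:
    """Invert the Burrows-Wheeler transform of a string whose original
    text ends with a unique sentinel '$' (smallest character)."""
    n = len(bwt)
    # Stable sort of positions by character: T[j] is the row of the last
    # column holding the same text occurrence as row j of the first column.
    # Following row -> T[row] steps forward one character in the text;
    # the paper parallelizes this chain via list ranking.
    T = sorted(range(n), key=lambda i: bwt[i])
    first = sorted(bwt)  # first column of the sorted rotation matrix
    row = 0              # the unique row starting with the sentinel
    out = []
    for _ in range(n):
        row = T[row]
        out.append(first[row])
    return ''.join(out)
```

For example, `inverse_bwt("annb$aa")` recovers `"banana$"`; the sequential loop is inherently serial, which is why expressing it as list ranking is what exposes the O(log n)-time, O(n)-work parallelism.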
ISBN (print): 9781510819672
We consider two related scheduling problems: resource-constrained scheduling on identical parallel machines and a generalization with resource-dependent processing times. In both problems, jobs require a certain amount of an additional resource and have to be scheduled on machines minimizing the makespan, while at every point in time a given resource capacity is not exceeded. In the first variant of the problem the processing times and resource amounts are fixed, while in the second the former depend on the latter. We present asymptotic fully polynomial time approximation schemes (AFPTAS) for the problems: for any ε > 0, a schedule of length at most (1 + ε) times the optimum plus an additive term of O(1/ε²) is provided, and the running time is polynomially bounded in 1/ε and the input length. Up to now, only approximation algorithms with constant approximation ratios were known.
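The AFPTAS itself is intricate; to make the problem concrete, here is a minimal greedy baseline (not from the paper, and with no approximation guarantee) for the first variant with fixed processing times: each job starts at the earliest time at which both an idle machine and enough of the shared resource are available.

```python
import heapq

def greedy_makespan(jobs, m, R):
    """List-scheduling baseline for resource-constrained scheduling.
    jobs: list of (processing_time, resource_amount), each amount <= R.
    m: number of identical machines; R: shared resource capacity.
    Returns the makespan of the greedy schedule."""
    t = 0
    free_machines, free_resource = m, R
    running = []   # min-heap of (finish_time, resource_to_release)
    makespan = 0
    for p, r in jobs:
        # Advance time, releasing finished jobs, until this job fits.
        while free_machines < 1 or free_resource < r:
            finish, rel = heapq.heappop(running)
            t = max(t, finish)
            free_machines += 1
            free_resource += rel
        free_machines -= 1
        free_resource -= r
        heapq.heappush(running, (t + p, r))
        makespan = max(makespan, t + p)
    return makespan
```

For instance, with two machines, capacity R = 3, and jobs (3, 2), (2, 2), (2, 1), the second job must wait for the first to release its resource, giving a makespan of 5. The AFPTAS improves on such heuristics by guaranteeing a (1 + ε)·OPT + O(1/ε²) bound.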
ISBN (print): 9781665417228
Deep neural network (DNN) algorithms are expected to be core components of next-generation applications. These high-performance sensing and recognition algorithms are key enabling technologies of smarter systems that make appropriate decisions about their environment. The integration of these compute-intensive and memory-hungry algorithms into embedded systems will require the use of specific energy-efficient hardware accelerators. The intrinsic parallelism of DNN algorithms allows for the use of a large number of small processing elements, and the tight exploitation of data reuse can significantly reduce power consumption. To meet these features, many dataflow models and on-chip communication proposals have been studied in recent years. This paper proposes a comprehensive study of on-chip communication properties based on the analysis of application-specific features, such as data reuse and communication models, as well as the results of mapping these applications to architectures of different sizes. In addition, the influence of mechanisms such as broadcast and multicast on performance and energy efficiency is analyzed. This study leads to the definition of overarching features to be integrated into next-generation on-chip communication infrastructures for CNN accelerators.