Priority queues (PQ) are an essential data structure for many important algorithms. Hardware implementations of priority queues can accelerate such algorithms significantly. Usually a choice has to be made between ver...
详细信息
ISBN:
(数字)9798350364606
ISBN:
(纸本)9798350364613
Priority queues (PQ) are an essential data structure for many important algorithms. Hardware implementations of priority queues can accelerate such algorithms significantly. Usually a choice has to be made between very fast (fixed-size) hardware PQs or scalable PQs. Fast hardware queues can provide single cycle push and pop operations, but are limited in size. Scalable hardware queues use embedded memory to achieve scalability, but thereby compromise on speed. This paper proposes a fast and scalable hardware priority queue that can be used for general purpose. It is based on a modular and flexible hybrid design, using a shift register queue combined with a heap-based queue. In the best case, i.e. when only the first queue is needed, push and pop operations only take a single cycle. In an experimental evaluation on a Xilinx FPGA we demonstrate that performance of the hybrid queue maintains a good balance between performance and scalability even for applications where the size of the working set of data can have large variance. We further propose an optimisation of the heap-based queue to support workloads that repeat several consecutive push operations followed by a single pop operation, which is the workload, for instance, in state space search or ray-tracing.
Scheduling task graphs with communication delay is a widely studied NP-hard problem. Many heuristics have been proposed, but there is no constant approximation algorithm for this classic model. In this paper, we focus...
Scheduling task graphs with communication delay is a widely studied NP-hard problem. Many heuristics have been proposed, but there is no constant approximation algorithm for this classic model. In this paper, we focus on the scheduling of the important class of fork-join task graphs (describing many types of common computations) on homogeneous processors. For this sub-case, we propose a guaranteed algorithm with a $\left( {1 + \frac{m}{{m - 1}}} \right)$-approximation factor, where m is the number of processors. The algorithm is not only the first constant approximation for an important sub-domain of the classic scheduling problem, it is also a practical algorithm that can obtain shorter makespans than known heuristics. To demonstrate this, we propose adaptations of known scheduling heuristic for the specific fork-join structure. In an extensive evaluation, we then implemented these algorithms and scheduled many fork-join graphs with up to thousands of tasks and various computation time distributions on up to hundreds of processors. Comparing the obtained results demonstrates the competitive nature of the proposed approximation algorithm.
A common optimisation problem in the high-level synthesis (HLS) of FPGA-based accelerators is to find a microarchitecture that maximises the performance while keeping the utilisation of the device's low-level reso...
A common optimisation problem in the high-level synthesis (HLS) of FPGA-based accelerators is to find a microarchitecture that maximises the performance while keeping the utilisation of the device's low-level resources below certain limits. We propose to tackle it directly as part of the HLS scheduler. To that end, we formalise a general, integrated scheduling and allocation problem for HLS kernels, and present SkyCastle, a novel resource-aware multi-loop scheduler using integer linear programming to solve it for a subclass of kernels composed of multiple, nested loops. In order to demonstrate the practical applicability of the approach, we model the scheduler in such a way as to be plug-in compatible with the Xilinx Vivado HLS engine, allowing the computed solutions to be fed back into its synthesis flow. We evaluate SkyCastle for three non-trivial kernels from the machine learning, signal processing, and physical simulation domains, on two FPGA devices. Additionally, we investigate the replication of slightly slower, but smaller accelerators as a means to further boost the overall performance. In contrast to Vivado HLS' default settings, which aim at maximum performance but may fail in later synthesis steps, the solutions computed by our scheduler always result in synthesisable designs.
GeMS is a customisable, open-source toolkit for generating random, yet constrained, modulo scheduling problems with a known optimal initiation interval. These can then be used to evaluate the behavior of different sch...
详细信息
GeMS is a customisable, open-source toolkit for generating random, yet constrained, modulo scheduling problems with a known optimal initiation interval. These can then be used to evaluate the behavior of different scheduling algorithms under controlled conditions.
A reduction is a parallel programming mechanism for combining two or more elements into one. Many parallel programming languages, tools and frameworks (e.g., OpenMP, MPI, etc.) directly support simple forms of reducti...
详细信息
ISBN:
(纸本)9781509042982
A reduction is a parallel programming mechanism for combining two or more elements into one. Many parallel programming languages, tools and frameworks (e.g., OpenMP, MPI, etc.) directly support simple forms of reductions (e.g., building a total sum out of partial sums). Some of those tools and frameworks allow more complex reductions to be implemented as custom reductions. However, the success of network-based application frameworks like Hadoop have shown that there is a strong need for reductions of aggregate data structures, such as the union of sets or maps. Usually parallel programming frameworks on shared-memory systems do not support these types of complex reductions directly, and a user needs to implement them manually. To address the gap thereof, this paper proposes an object-oriented reduction framework that supports reductions of aggregate types, and proposes the nesting of reduction objects for flexible extensions of reduction operations. Based on the proposed framework, a reduction library (RedLib) has been developed for Java with direct support for many common reduction operations on collections and maps. Furthermore, the paper studies the usage of the framework for common and complex cases and evaluates its performance, based on operations found in standard benchmarks.
The Square Kilometre Array will be an amazing instrument for pulsar astronomy. While the full SKA will be sensitive enough to detect all pulsars in the Galaxy visible from Earth, already with SKA1, pulsar searches wil...
详细信息
暂无评论