Over the last decade, heterogeneous architectures such as GPUs, TPUs, and FPGAs have made complex algorithms practical. Despite multiple efforts to study the scheduling of such parallel and complex tasks on heterogeneous architectures, the power and energy consumption of these platforms has yet to be well managed under real-time task deadlines. To establish high schedulability on heterogeneous architectures, many scheduling strategies and models, such as multi-segment self-suspension (MSSS), have been proposed by pioneering researchers. However, directly applying this model to heterogeneous architectures with multiple CPUs and many processing elements (PEs) suffers from aggravated power consumption due to pessimism in the scheduling algorithm and the tolerance margin in the worst-case execution time (WCET) model. Therefore, this paper presents an energy-efficient real-time scheduling approach called EESchedule, which works on heterogeneous architectures with guaranteed schedulability and improved power efficiency. In EESchedule, we build a general task execution model for heterogeneous architectures that integrate multiple CPUs and many PEs. Then, an energy-efficient real-time scheduling strategy is introduced. Next, the response-time and corresponding schedulability analyses are presented for EESchedule. Finally, extensive experiments on heterogeneous NVIDIA Jetson TX2 embedded systems and GPU servers with the Intel i9-10900x CPU and RTX 3080 GPU demonstrate that EESchedule achieves the same schedulability with 16.8%-40.7% lower power consumption and 39.0%-48.2% lower energy consumption compared with state-of-the-art scheduling algorithms.
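The abstract refers to response-time analysis for self-suspending tasks. As a rough illustration only, the sketch below implements the classical suspension-oblivious fixed-point iteration for fixed-priority tasks (every suspension counted as extra execution time); it is not the EESchedule analysis, and the task set and function names are hypothetical.

```python
# Illustrative suspension-oblivious response-time analysis for fixed-priority
# self-suspending tasks (NOT the EESchedule analysis from the paper above).
# Each task i is (C_i, S_i, T_i): worst-case execution, total suspension, period.
# Suspension is conservatively treated as additional execution time.

def response_time(i, tasks):
    """Fixed-point iteration R = C_i + S_i + sum_j ceil(R / T_j) * (C_j + S_j)
    over all higher-priority tasks j (tasks are listed highest priority first)."""
    C_i, S_i, T_i = tasks[i]
    R = C_i + S_i
    while True:
        interference = sum(-(-R // T_j) * (C_j + S_j)     # ceil(R / T_j) * (C_j + S_j)
                           for (C_j, S_j, T_j) in tasks[:i])
        R_next = C_i + S_i + interference
        if R_next > T_i:
            return None                                   # deemed unschedulable
        if R_next == R:
            return R
        R = R_next

# Hypothetical task set (execution, suspension, period), highest priority first.
tasks = [(2, 1, 10), (3, 2, 20), (5, 3, 50)]
print([response_time(i, tasks) for i in range(len(tasks))])
```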
ISBN (Print): 9781611972108
We show how to compute the width of a dynamic set of low-dimensional points in the streaming model. In particular, we assume the stream contains both insertions of points and deletions of points to a set S, and the goal is to compute the width of the set S, namely the minimal distance between two parallel lines sandwiching the pointset S. Our algorithm (1 + ε)-approximates the width of the set S using space polylogarithmic in the size of S and the aspect ratio of S. This is the first such algorithm that supports both insertions and deletions of points to the set S: previous algorithms for approximating the width of a pointset only supported additions [AHPV04, Cha06], or a sliding window [CS06]. This solves an open question from the "2009 Kanpur list" of Open Problems in Data Streams, Property Testing, and Related Topics [IMNO11].
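For readers unfamiliar with the width objective, here is a minimal non-streaming, exact sketch for a small 2D point set: an optimal pair of parallel lines contains a convex-hull edge, so trying the direction of every pair of points and measuring the perpendicular extent suffices. The function and example data are illustrative, not part of the paper.

```python
# Exact width of a small 2D point set: the minimum, over all orientations,
# of the distance between two parallel lines enclosing the points.
# An optimal line pair contains a convex-hull edge, so every point-pair
# direction (a superset of hull edges) is a valid candidate.  O(n^3) brute
# force, only meant to illustrate the quantity the streaming algorithm
# approximates.
import math

def width_2d(points):
    best = math.inf
    for i, (ax, ay) in enumerate(points):
        for bx, by in points[i + 1:]:
            dx, dy = bx - ax, by - ay
            norm = math.hypot(dx, dy)
            if norm == 0:
                continue                      # coincident points define no direction
            nx, ny = -dy / norm, dx / norm    # unit normal to the candidate edge
            proj = [px * nx + py * ny for px, py in points]
            best = min(best, max(proj) - min(proj))
    return best

# Example: a 2x1 rectangle plus an interior point -> width is 1.0.
print(width_2d([(0, 0), (2, 0), (2, 1), (0, 1), (1, 0.5)]))
```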
One objective in establishing our NSF ILI-funded parallel computation laboratory was to use closed, formal laboratory assignments to introduce parallelism throughout the core computer science curriculum. We discuss la...
ISBN (Print): 9780818652806
Extensive research has been done on extracting parallelism from single instruction stream processors. The authors present some results of an investigation into ways to modify MIMD architectures to allow them to extract the instruction-level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from exploiting instruction-level parallelism. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.
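The abstract only names the code-scheduling mechanism; as a loose, hypothetical illustration of what "partitioning instructions across multiple processing elements" can mean, the sketch below greedily list-schedules a dependence DAG of single-cycle instructions onto a fixed number of PEs. It is not the mechanism proposed in the paper, and all identifiers are made up.

```python
# Toy list scheduler: assign instructions (nodes of a dependence DAG) to a
# fixed number of processing elements while respecting data dependences.
# Purely illustrative; not the code scheduling mechanism from the paper.

def list_schedule(deps, num_pes):
    """deps maps instruction -> list of instructions it depends on.
    Every instruction takes one cycle.  Returns {instr: (pe, start_cycle)}."""
    schedule = {}
    pe_free = [0] * num_pes                      # next free cycle of each PE
    remaining = dict(deps)
    while remaining:
        # Instructions whose dependences have all been scheduled are ready.
        ready = [i for i, ds in remaining.items() if all(d in schedule for d in ds)]
        for instr in ready:
            earliest = max((schedule[d][1] + 1 for d in remaining[instr]), default=0)
            pe = min(range(num_pes), key=lambda p: max(pe_free[p], earliest))
            start = max(pe_free[pe], earliest)
            schedule[instr] = (pe, start)
            pe_free[pe] = start + 1
            del remaining[instr]
    return schedule

# Hypothetical basic block: i3 depends on i1 and i2; i4 on i1; i5 on i3.
deps = {"i1": [], "i2": [], "i3": ["i1", "i2"], "i4": ["i1"], "i5": ["i3"]}
print(list_schedule(deps, num_pes=2))
```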
ISBN (Print): 9781450344937
In the many-core era, the performance of MPI collectives depends increasingly on the intra-node communication component. However, the communication algorithms are generally inherited from the inter-node versions and ignore cache complexity. We propose cache-oblivious algorithms for MPI all-to-all operations, in which data blocks are copied into the receive buffers in Morton order to exploit data locality. Experimental results on different many-core architectures show that our cache-oblivious implementations significantly outperform the naive implementations based on a shared heap as well as highly optimized MPI libraries.
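To make the Morton-order copying idea concrete, the sketch below interleaves the bits of a (source rank, destination rank) pair and uses the resulting Z-order index to decide the order in which block copies would be performed; the actual shared-memory copies and MPI plumbing are omitted, and all names are illustrative rather than taken from the paper or any MPI library.

```python
# Sketch of Morton (Z-order) ordering of intra-node all-to-all block copies.
# Visiting (src, dst) block pairs in Morton order keeps successive copies close
# together in both buffers, which is the locality idea behind the
# cache-oblivious all-to-all described above.  Names are illustrative.

def morton_index(src, dst, bits=16):
    """Interleave the bits of src and dst into a single Z-order index."""
    z = 0
    for b in range(bits):
        z |= ((src >> b) & 1) << (2 * b)
        z |= ((dst >> b) & 1) << (2 * b + 1)
    return z

def copy_order(nprocs):
    """All (src, dst) block pairs, sorted by their Morton index."""
    pairs = [(s, d) for s in range(nprocs) for d in range(nprocs)]
    return sorted(pairs, key=lambda p: morton_index(*p))

# With 4 ranks, copies proceed through 2x2 sub-blocks instead of row by row.
print(copy_order(4))
```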
ISBN (Print): 9781611976465
We present a randomized parallel algorithm, in the Exclusive-Read Exclusive-Write (EREW) PRAM model, that computes a Maximal Independent Set (MIS) in O(log n) time and using O(m log² n) work, with high probability. Thus, MIS ∈ RNC¹. This time complexity is optimal, and it improves on the celebrated O(log² n)-time algorithms of Luby [STOC'85] and Alon, Babai, and Itai [JALG'86], which had remained the state of the art for the past 35 years.
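For context, the following is a plain sequential simulation of the classical Luby-style round-based MIS computation (the O(log² n)-time PRAM baseline the abstract improves on), not the new O(log n) EREW algorithm; the graph representation and names are illustrative.

```python
# Sequential simulation of a Luby-style randomized MIS: each round, every
# undecided vertex draws a random priority, local minima join the MIS, and
# they and their neighbors are removed.  This is the classical O(log n)-round
# scheme, not the new algorithm described in the abstract above.
import random

def luby_mis(adj):
    active = set(adj)                       # vertices still undecided
    mis = set()
    while active:
        r = {v: random.random() for v in active}
        # A vertex enters the MIS if it beats all of its active neighbors.
        winners = {v for v in active
                   if all(r[v] < r[u] for u in adj[v] if u in active)}
        mis |= winners
        removed = set(winners)
        for v in winners:
            removed |= adj[v] & active      # neighbors of winners are excluded
        active -= removed
    return mis

# Small illustrative graph: a 5-cycle.
adj = {0: {1, 4}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 0}}
print(luby_mis(adj))
```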
ISBN (Print): 9780818652806
Presents a comparison of superscalar and decoupled access/execute architectures. Both architectures attempt to exploit instruction-level parallelism by issuing multiple instructions per cycle, employing dynamic scheduling to maximize performance. Simulation results are presented for four different configurations, demonstrating that the architectural queues of the decoupled machines provide functionality similar to the register renaming, dynamic loop unrolling, and out-of-order execution of the superscalar machines, with significantly less complexity.