ISBN: 9781479956180 (print)
Computers have been moving toward a multicore paradigm for the last several years. As a result, software developers must design applications that exploit the inherent parallelism of modern computing architectures. One area of research that aims to simplify this shift is the development of dynamic scheduling utilities, which allow the developer to write serial code that is then parallelized by a library or compiler technology. While these tools certainly increase the developer's productivity, they can obfuscate performance bottlenecks. For this reason, it is important to evaluate algorithm performance to ensure that the expected performance of a given algorithm is actually realized under dynamic scheduling. This paper presents the methodology and results of a new performance analysis tool that aims to accurately simulate the performance of various superscalar schedulers, including OmpSs, StarPU, and QUARK. The process begins with careful timing of each of the computational routines that make up the algorithm. The simulation tool then combines these kernel timings with the dependency management provided by the superscalar scheduler to simulate the execution time of the algorithm. The tool demonstrates that simulating various algorithms can accurately predict the performance of a complex dynamic scheduling system.
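The idea of combining measured kernel timings with task dependencies can be illustrated with a minimal sketch. This is not the tool described in the abstract and does not use the OmpSs, StarPU, or QUARK APIs; it is a generic greedy list-scheduling simulation under stated assumptions: identical workers, fixed per-task durations, no scheduling overhead, and tasks started in first-ready order. The function name `simulate_makespan` and all task names are hypothetical.

```python
import heapq


def simulate_makespan(durations, deps, num_workers):
    """Estimate total execution time of a task graph (illustrative only).

    durations:   dict mapping task name -> measured kernel time
    deps:        dict mapping task name -> set of tasks it must wait for
    num_workers: number of identical workers (an assumption of this sketch)
    """
    # Count unfinished prerequisites and build the reverse edges.
    remaining = {t: len(deps.get(t, ())) for t in durations}
    children = {t: [] for t in durations}
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    ready = sorted(t for t, n in remaining.items() if n == 0)
    running = []  # min-heap of (finish_time, task)
    clock = 0.0
    idle = num_workers
    finished = 0

    while finished < len(durations):
        # Start ready tasks while workers are idle.
        while ready and idle > 0:
            task = ready.pop(0)
            heapq.heappush(running, (clock + durations[task], task))
            idle -= 1
        if not running:
            raise ValueError("dependency cycle or no workers available")
        # Advance simulated time to the next task completion.
        clock, done = heapq.heappop(running)
        idle += 1
        finished += 1
        # Release tasks whose prerequisites are now all complete.
        for child in children[done]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return clock


# Hypothetical diamond-shaped graph: "a" feeds "b" and "c", which feed "d".
durations = {"a": 1.0, "b": 2.0, "c": 2.0, "d": 1.0}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(simulate_makespan(durations, deps, num_workers=2))  # 4.0
print(simulate_makespan(durations, deps, num_workers=1))  # 6.0
```

With two workers, "b" and "c" overlap, so the simulated time equals the critical path; with one worker it equals the sum of all kernel times. A real scheduler's policy and overheads would change these numbers.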
ISBN: 9781479964925 (print)
This paper makes two observations that lead to a new heterogeneous core design. First, we observe that most serial code exhibits fine-grained heterogeneity: at the scale of tens or hundreds of instructions, regions of code fit different microarchitectures better (at the same point or at different points in time). Second, we observe that by grouping contiguous regions of instructions into blocks that are executed atomically, a core can exploit this fine-grained heterogeneity: atomicity allows each block to be executed independently on its own execution backend that fits its characteristics best. Based on these observations, we propose a fine-grained heterogeneous core design, called the heterogeneous block architecture (HBA), that combines heterogeneous execution backends into one core. HBA breaks the program into blocks of code, determines the best backend for each block, and specializes the block for that backend. As an example HBA design, we combine out-of-order, VLIW, and in-order backends, using simple heuristics to choose backends for different dynamic instruction blocks. Our extensive evaluations compare this example HBA design to multiple baseline core designs (including monolithic out-of-order, clustered out-of-order, in-order, and a state-of-the-art heterogeneous core design) and show that it provides significantly better energy efficiency than all designs at similar performance.
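The abstract does not state which heuristics HBA uses, so the following is only a hypothetical illustration of per-block backend selection, not the paper's method. It assumes each block comes with estimated cycle and energy costs for each backend (the `BlockEstimate` type, the `slack` parameter, and all numbers are invented for illustration) and picks the lowest-energy backend whose estimated cycle count is within a tolerance of the fastest one.

```python
from dataclasses import dataclass

# Backend names taken from the abstract's example design.
BACKENDS = ("out_of_order", "vliw", "in_order")


@dataclass
class BlockEstimate:
    """Hypothetical per-block cost estimates, keyed by backend name."""
    cycles: dict
    energy: dict


def choose_backend(est, slack=1.10):
    """Pick the lowest-energy backend within `slack` of the fastest.

    This rule is an assumption of this sketch, not the HBA heuristic.
    """
    fastest = min(est.cycles[b] for b in BACKENDS)
    candidates = [b for b in BACKENDS if est.cycles[b] <= slack * fastest]
    return min(candidates, key=lambda b: est.energy[b])


# Invented numbers for one block: VLIW is nearly as fast as out-of-order
# and cheaper, while in-order is cheapest but much slower.
block = BlockEstimate(
    cycles={"out_of_order": 100, "vliw": 105, "in_order": 150},
    energy={"out_of_order": 10, "vliw": 6, "in_order": 4},
)
print(choose_backend(block))             # vliw
print(choose_backend(block, slack=2.0))  # in_order
```

Loosening `slack` trades performance for energy, which mirrors the abstract's framing of energy efficiency at similar performance without claiming to reproduce its evaluation.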