检索结果-内蒙古大学图书馆

Available task-level parallelism on the Cell BE

SCIENTIFIC programming 2009年第1-2期17卷 59-76页

作者： Rico, Alejandro Ramirez, Alex Valero, Mateo Univ Politecn Cataluna ES-08034 Barcelona Spain Barcelona Supercomp Ctr Barcelona Spain

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads they will execute. programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models which allow the programmer to identify parallel tasks, and the runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to less than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental aspect in performance, while the working set is not that large. We expect a significant performance impact if task management would run using a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.

关键词： Scalability multicore task-based programming models Cell BE

来源：评论

学校读者我要写书评

暂无评论

Leveraging iterative applications to improve the scalability of task-based programming models on distributed systems

引用

ACM Transactions on Architecture and Code Optimization 1000年

作者： Omar Shaaban Ibrahim Ali Juliette Fournis d'Albiat Isabel Piedrahita Vicenç Beltran Xavier Martorell Paul Carpenter Eduard Ayguadé Jesus Labarta Computer Sciences Barcelona Supercomputing Center Barcelona Spain Barcelona Supercomputing Center Barcelona Spain

Distributed tasking models such as OmpSs-2@Cluster, StarPU-MPI, and PaRSEC express HPC applications as task graphs with explicit dependencies. The single task graph unifies the representation of parallelism across CPU cores, accelerators, and distributed-memory nodes, offering higher programmer productivity compared to traditional MPI + X. Most task-based models construct the task graph sequentially, which provides a clear and familiar programming model, simplifying code development, maintenance and porting. However, this design introduces a bottleneck in task creation and dependency management, limiting performance and scalability. As a result, unless the tasks are very coarse-grained, current distributed sequential tasking models cannot match the performance of MPI + *** scientific applications, however, are iterative in nature, constructing the same directed acyclic task graph at each timestep. We exploit this structure to eliminate the sequential bottleneck and control message overhead in a sequentially-constructed distributed tasking model, while preserving its simplicity and productivity. Our approach builts on the recently proposed taskiter directive for OpenMP and OmpSs-2, allowing a single iteration to be expressed as a cyclic graph. The runtime partitions the cyclic graph across nodes, precomputes the MPI transfers, and then executes the loop body at low overhead. By integrating the MPI communications directly into the application’s task graph, our approach naturally overlaps computation and communication, in some cases exposing dramatically more parallelism than fork–join MPI + OpenMP. We define the programming model and describe the full runtime implementation, and integrate our proposal into OmpSs-2@Cluster. We evaluate it using five benchmarks on up to 128 nodes of the MareNostrum 5 supercomputer. For applications with fork–join parallelism, our approach has performance similar to fork–join MPI + OpenMP, making it a viable productive alternative, un

关键词： OpenMP OmpSs-2@Cluster MPI Distributed Computing Runtime Systems task-based programming models Iterative Methods

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：