Gather and scatter are data redistribution functions of long-standing importance to high performance computing. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched by other array languages. We discuss an efficient parallel implementation, introducing three new optimizations (schedule compression, dead array reuse, and direct communication) that reduce the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate performance comparable to the hand-coded Fortran + MPI versions of the NAS FT and CG benchmarks.
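As a rough serial illustration of what such an operator computes, the Python sketch below implements plain gather and scatter over an explicit index map; the names remap_gather and remap_scatter are hypothetical stand-ins for the ZPL operator, which additionally handles data distribution and the optimizations named above.

    def remap_gather(src, index_map):
        """Gather: dst[i] = src[index_map[i]] for an arbitrary index map."""
        return [src[j] for j in index_map]

    def remap_scatter(src, index_map, size):
        """Scatter: dst[index_map[i]] = src[i]; later writes win on collisions."""
        dst = [None] * size
        for i, j in enumerate(index_map):
            dst[j] = src[i]
        return dst

    src = [10, 20, 30, 40]
    print(remap_gather(src, [3, 1, 0, 2]))      # [40, 20, 10, 30]
    print(remap_scatter(src, [2, 0, 3, 1], 4))  # [20, 40, 10, 30]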
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. However, currently only distributed address-space systems scale beyond the low hundreds of processors. In this paper we introduce a hybrid architecture that allows users to trade additional local memory usage for reduced coherence communication, making larger-scale shared memory architectures possible. We introduce a programming model and examine possible implementations of hardware mechanisms, evaluating some of the trade-offs inherent in each. Preliminary experiments on an application with particularly fine-grained communication requirements indicate that effective placement of directives can reduce coherence communication by more than a factor of 10 on 64 processors.
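A toy model of the trade-off, under the assumption that a directive marks read-mostly data for local replication; the Node class, the replicate directive, and the message counting below are illustrative inventions, not the paper's hardware mechanism:

    class Node:
        def __init__(self, home):
            self.home = home          # home node's memory (shared data)
            self.local = {}           # local replicas (extra memory used)
            self.coherence_msgs = 0

        def replicate(self, key):
            # directive: spend local memory to avoid coherence traffic
            self.local[key] = self.home[key]

        def read(self, key):
            if key in self.local:
                return self.local[key]
            self.coherence_msgs += 1  # remote read -> coherence message
            return self.home[key]

    home = {"x": 42}
    n = Node(home)
    for _ in range(100):
        n.read("x")
    print(n.coherence_msgs)   # 100: every read went remote

    n.replicate("x")
    for _ in range(100):
        n.read("x")
    print(n.coherence_msgs)   # still 100: replicated reads are local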
In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented, along with results indicating the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also gauge the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.
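The source-level shape of manual TLS parallelization can be sketched as follows; the TLSRuntime API is hypothetical, and its stub simply runs iterations sequentially where real TLS hardware would execute them speculatively and squash on dependence violations:

    class TLSRuntime:
        def parallel_for(self, indices, body):
            # Real TLS: fork one speculative thread per iteration, track
            # read/write sets, squash and re-execute on conflicts, commit
            # in program order. Stub: sequential execution, same semantics.
            for i in indices:
                body(i)

    tls = TLSRuntime()
    a = [1, 2, 3, 4]
    b = [0] * 4

    def body(i):
        # possible cross-iteration dependence through b; under TLS the
        # programmer speculates it is rare instead of proving it absent
        b[i] = a[i] + (b[i - 1] if i > 0 else 0)

    tls.parallel_for(range(4), body)
    print(b)  # [1, 3, 6, 10]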
Programmable network interfaces provide the potential to extend the functionality of network services but incur instruction processing overheads compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization yields performance within 10-20% of a modern ASIC-based network interface for real network services.
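The partitioning idea can be sketched as a static event-type-to-processor map, so handlers that touch the same state always run on the same processor; the event names and two-queue scheme below are illustrative, not the actual firmware:

    import queue, threading

    PARTITION = {                                # event type -> processor
        "send_request": 0, "send_complete": 0,   # transmit-side state
        "recv_dma_done": 1, "recv_complete": 1,  # receive-side state
    }

    queues = [queue.Queue(), queue.Queue()]

    def cpu(q):
        while True:
            event = q.get()
            if event is None:
                return
            # handler runs lock-free: all events touching its state
            # were routed to this same processor
            print(f"handled {event}")

    workers = [threading.Thread(target=cpu, args=(q,)) for q in queues]
    for w in workers: w.start()
    for ev in ["send_request", "recv_dma_done", "send_complete", "recv_complete"]:
        queues[PARTITION[ev]].put(ev)
    for q in queues: q.put(None)
    for w in workers: w.join()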
This paper proposes Phoenix, a programming model for writing parallel and distributed applications that accommodate dynamically joining/leaving compute resources. In the proposed model, nodes involved in an application see a large and fixed virtual node name space. They communicate via messages whose destinations are specified by virtual node names rather than names bound to a physical resource. We describe the Phoenix API and show how it allows transparent migration of application state, with dynamically joining/leaving nodes supported as a by-product. We also demonstrate through several application studies that the Phoenix model is close enough to regular message passing to serve as a general programming model that facilitates porting many parallel applications/algorithms to more dynamic environments. Experimental results indicate that applications with a small task migration cost can quickly take advantage of dynamically joining resources using Phoenix. Divide-and-conquer algorithms written in Phoenix achieved good speedup on a large number of nodes across multiple LANs (a 120-fold speedup using 169 CPUs across three LANs). We believe Phoenix provides a useful programming abstraction and platform for emerging parallel applications that must be deployed across multiple LANs and/or shared clusters with dynamically varying resource conditions.
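A minimal sketch of the virtual-name idea, assuming a toy routing table that maps each virtual node name to its current physical owner; none of these class or method names are the actual Phoenix API:

    VIRTUAL_SPACE = 8        # fixed virtual node name space: 0..7

    class Routing:
        def __init__(self):
            self.owner = {}  # virtual name -> physical node

        def assume(self, node, names):      # node takes over these names
            for v in names:
                self.owner[v] = node

        def send(self, vname, msg):         # address by virtual name only
            self.owner[vname].deliver(vname, msg)

    class PhysicalNode:
        def __init__(self, host):
            self.host = host
        def deliver(self, vname, msg):
            print(f"{self.host} (as virtual {vname}) got: {msg}")

    rt = Routing()
    a, b = PhysicalNode("hostA"), PhysicalNode("hostB")
    rt.assume(a, range(0, 8))        # hostA initially owns everything
    rt.send(5, "hello")              # delivered to hostA

    rt.assume(b, range(4, 8))        # hostB joins: names 4..7 migrate
    rt.send(5, "hello again")        # same destination name, new owner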
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations, and in combinations thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC, and IBM RS6000 SP cluster.
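The aggregation mechanism can be illustrated by buffering asynchronous RMI requests per destination and shipping them in batches; the class, method names, and batch size below are assumptions for the sketch, not the ARMI interface:

    class AggregatingRMI:
        def __init__(self, transport, max_batch=4):
            self.transport = transport        # callable(dest, list_of_calls)
            self.max_batch = max_batch
            self.pending = {}                 # dest -> buffered calls

        def async_rmi(self, dest, method, *args):
            batch = self.pending.setdefault(dest, [])
            batch.append((method, args))
            if len(batch) >= self.max_batch:  # buffer full: one big message
                self.flush(dest)

        def flush(self, dest):
            self.transport(dest, self.pending.pop(dest, []))

    rmi = AggregatingRMI(lambda d, calls: print(f"-> node {d}: {calls}"))
    for i in range(9):
        rmi.async_rmi(1, "push_back", i)      # 9 fine-grain calls, 2 messages
    rmi.flush(1)                              # drain the remaining one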
A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a re-usable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO2P3S (Correct Object-Oriented Pattern-based Parallel Programming System) and MetaCO2P3S, use generative design patterns. A programmer selects the parallel design patterns that are appropriate for an application, and then adapts the patterns for that specific application by selecting from a small set of code-configuration options. CO2P3S then generates a custom framework for the application that includes all of the structural code necessary for the application to run in parallel. The programmer is only required to write simple code that launches the application and to fill in some application-specific sequential hook routines. We use generative design patterns to take an application specification (parallel design patterns + sequential user code) and generate parallel application code that achieves good performance in shared memory and distributed memory environments. Although our implementations are for Java, the approach we describe is tool and language independent. This paper describes generalizing CO2P3S to generate distributed-memory parallel solutions.
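A toy Python stand-in for the generated-framework idea (CO2P3S itself emits Java): the function below represents structural code the tool would generate for a master-worker style pattern, and the user writes only the sequential hook routine:

    from concurrent.futures import ThreadPoolExecutor

    def generated_master_worker(items, user_hook, workers=4):
        # structural code the programmer never writes or edits
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(user_hook, items))

    # application-specific sequential hook: the only code the user writes
    def square(x):
        return x * x

    print(generated_master_worker(range(8), square))  # [0, 1, 4, ..., 49]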
Collecting a program's execution profile is important for many reasons: code optimization, memory layout, program debugging, and program comprehension. Path-based execution profiles are more detailed than count-based execution profiles, since they present the order of execution of the various blocks in a program: modules, procedures, basic blocks, etc. Recently, online string compression techniques have been employed for collecting compact representations of sequential program executions. In this paper, we show how a similar approach can be taken for shared memory parallel programs. Our compaction scheme yields one to two orders of magnitude compression compared to the uncompressed parallel program trace on some of the SPLASH benchmarks. Our compressed execution traces contain detailed information about synchronization and control/data flow which can be exploited for post-mortem analysis. In particular, the information in our compact execution traces is useful for accurate data race detection (detecting unsynchronized shared variable accesses that occurred in the execution).
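The paper's compaction builds on online grammar-based string compression; as a much simpler stand-in that shows why block traces compress, the sketch below greedily folds short repeating patterns, the kind a loop emits, into (pattern, count) pairs:

    def fold_repeats(trace, max_period=4):
        """Greedily rewrite s s s ... as (s, count) for short periods s."""
        out, i = [], 0
        while i < len(trace):
            best = (1, 1)                             # (period, count)
            for p in range(1, max_period + 1):
                count = 1
                while trace[i + count * p : i + (count + 1) * p] == trace[i : i + p]:
                    count += 1
                if count * p > best[0] * best[1]:
                    best = (p, count)
            p, count = best
            out.append((tuple(trace[i : i + p]), count))
            i += p * count
        return out

    trace = ["A", "B", "C", "B", "C", "B", "C", "D"]
    print(fold_repeats(trace))  # [(('A',), 1), (('B', 'C'), 3), (('D',), 1)]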