Gather and scatter are data redistribution functions of long-standing importance to high performance computing. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched by other array languages. We discuss an efficient parallel implementation, introducing three new optimizations (schedule compression, dead array reuse, and direct communication) that reduce the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate performance comparable to the hand-coded Fortran + MPI versions of the NAS FT and CG benchmarks.
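As a rough serial illustration of what such an operator computes, the Python sketch below implements plain gather and scatter over an explicit index map; the names remap_gather and remap_scatter are hypothetical stand-ins for the ZPL operator, which additionally handles data distribution and the optimizations named above.

    def remap_gather(src, index_map):
        """Gather: dst[i] = src[index_map[i]] for an arbitrary index map."""
        return [src[j] for j in index_map]

    def remap_scatter(src, index_map, size):
        """Scatter: dst[index_map[i]] = src[i]; later writes win on collisions."""
        dst = [None] * size
        for i, j in enumerate(index_map):
            dst[j] = src[i]
        return dst

    src = [10, 20, 30, 40]
    print(remap_gather(src, [3, 1, 0, 2]))      # [40, 20, 10, 30]
    print(remap_scatter(src, [2, 0, 3, 1], 4))  # [20, 40, 10, 30]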
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. However, currently only distributed address-space systems scale beyond the low hundreds of processors. In this paper we introduce a hybrid architecture that allows users to trade additional local memory usage for reduced coherence communication, making larger-scale shared memory architectures possible. We introduce a programming model and examine possible implementations of hardware mechanisms, evaluating some of the trade-offs inherent in each. Preliminary experiments on an application with particularly fine-grained communication requirements indicate that effective placement of directives can reduce coherence communication by more than a factor of 10 on 64 processors.
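A toy model of the trade-off, under the assumption that a directive marks read-mostly data for local replication; the Node class, the replicate directive, and the message counting below are illustrative inventions, not the paper's hardware mechanism:

    class Node:
        def __init__(self, home):
            self.home = home          # home node's memory (shared data)
            self.local = {}           # local replicas (extra memory used)
            self.coherence_msgs = 0

        def replicate(self, key):
            # directive: spend local memory to avoid coherence traffic
            self.local[key] = self.home[key]

        def read(self, key):
            if key in self.local:
                return self.local[key]
            self.coherence_msgs += 1  # remote read -> coherence message
            return self.home[key]

    home = {"x": 42}
    n = Node(home)
    for _ in range(100):
        n.read("x")
    print(n.coherence_msgs)   # 100: every read went remote

    n.replicate("x")
    for _ in range(100):
        n.read("x")
    print(n.coherence_msgs)   # still 100: replicated reads are local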
In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented, along with results indicating the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also gauge the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.
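The source-level shape of manual TLS parallelization can be sketched as follows; the TLSRuntime API is hypothetical, and its stub simply runs iterations sequentially where real TLS hardware would execute them speculatively and squash on dependence violations:

    class TLSRuntime:
        def parallel_for(self, indices, body):
            # Real TLS: fork one speculative thread per iteration, track
            # read/write sets, squash and re-execute on conflicts, commit
            # in program order. Stub: sequential execution, same semantics.
            for i in indices:
                body(i)

    tls = TLSRuntime()
    a = [1, 2, 3, 4]
    b = [0] * 4

    def body(i):
        # possible cross-iteration dependence through b; under TLS the
        # programmer speculates it is rare instead of proving it absent
        b[i] = a[i] + (b[i - 1] if i > 0 else 0)

    tls.parallel_for(range(4), body)
    print(b)  # [1, 3, 6, 10]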
Programmable network interfaces provide the potential to extend the functionality of network services but incur instruction processing overheads compared to application-specific network interfaces. This paper aims to offset those performance disadvantages by exploiting task-level concurrency in the workload to parallelize the network interface firmware for a programmable controller with two processors. By carefully partitioning the handler procedures that process various events related to the progress of a packet, the system can minimize sharing, achieve load balance, and efficiently utilize on-chip storage. Compared to the uniprocessor firmware released by the manufacturer, the parallelized network interface firmware increases throughput by 65% for bidirectional UDP traffic of maximum-sized packets, 157% for bidirectional UDP traffic of minimum-sized packets, and 32-107% for real network services. This parallelization yields performance within 10-20% of a modern ASIC-based network interface for real network services.
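The partitioning idea can be sketched as a static event-type-to-processor map, so handlers that touch the same state always run on the same processor; the event names and two-queue scheme below are illustrative, not the actual firmware:

    import queue, threading

    PARTITION = {                                # event type -> processor
        "send_request": 0, "send_complete": 0,   # transmit-side state
        "recv_dma_done": 1, "recv_complete": 1,  # receive-side state
    }

    queues = [queue.Queue(), queue.Queue()]

    def cpu(q):
        while True:
            event = q.get()
            if event is None:
                return
            # handler runs lock-free: all events touching its state
            # were routed to this same processor
            print(f"handled {event}")

    workers = [threading.Thread(target=cpu, args=(q,)) for q in queues]
    for w in workers: w.start()
    for ev in ["send_request", "recv_dma_done", "send_complete", "recv_complete"]:
        queues[PARTITION[ev]].put(ev)
    for q in queues: q.put(None)
    for w in workers: w.join()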
This paper proposes Phoenix, a programming model for writing parallel and distributed applications that accommodate dynamically joining/leaving compute resources. In the proposed model, nodes involved in an application see a large and fixed virtual node name space. They communicate via messages whose destinations are specified by virtual node names rather than names bound to a physical resource. We describe the Phoenix API and show how it allows transparent migration of application state, with dynamically joining/leaving nodes supported as a by-product. We also demonstrate through several application studies that the Phoenix model is close enough to regular message passing to serve as a general programming model that facilitates porting many parallel applications/algorithms to more dynamic environments. Experimental results indicate that applications with a small task migration cost can quickly take advantage of dynamically joining resources using Phoenix. Divide-and-conquer algorithms written in Phoenix achieved good speedup on a large number of nodes across multiple LANs (a 120-fold speedup using 169 CPUs across three LANs). We believe Phoenix provides a useful programming abstraction and platform for emerging parallel applications that must be deployed across multiple LANs and/or shared clusters with dynamically varying resource conditions.
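A minimal sketch of the virtual-name idea, assuming a toy routing table that maps each virtual node name to its current physical owner; none of these class or method names are the actual Phoenix API:

    VIRTUAL_SPACE = 8        # fixed virtual node name space: 0..7

    class Routing:
        def __init__(self):
            self.owner = {}  # virtual name -> physical node

        def assume(self, node, names):      # node takes over these names
            for v in names:
                self.owner[v] = node

        def send(self, vname, msg):         # address by virtual name only
            self.owner[vname].deliver(vname, msg)

    class PhysicalNode:
        def __init__(self, host):
            self.host = host
        def deliver(self, vname, msg):
            print(f"{self.host} (as virtual {vname}) got: {msg}")

    rt = Routing()
    a, b = PhysicalNode("hostA"), PhysicalNode("hostB")
    rt.assume(a, range(0, 8))        # hostA initially owns everything
    rt.send(5, "hello")              # delivered to hostA

    rt.assume(b, range(4, 8))        # hostB joins: names 4..7 migrate
    rt.send(5, "hello again")        # same destination name, new owner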
ARMI is a communication library that provides a framework for expressing fine-grain parallelism and mapping it to a particular machine using shared-memory and message passing library calls. The library is an advanced implementation of the RMI protocol and handles low-level details such as scheduling incoming communication and aggregating outgoing communication to coarsen parallelism when necessary. These details can be tuned for different platforms to allow user codes to achieve the highest performance possible without manual modification. ARMI is used by STAPL, our generic parallel library, to provide a portable, user-transparent communication layer. We present the basic design as well as the mechanisms used in the current Pthreads/OpenMP and MPI implementations, and in combinations thereof. Performance comparisons between ARMI and explicit use of Pthreads or MPI are given on a variety of machines, including an HP V2200, SGI Origin 3800, IBM Regatta-HPC, and IBM RS6000 SP cluster.
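The aggregation mechanism can be illustrated by buffering asynchronous RMI requests per destination and shipping them in batches; the class, method names, and batch size below are assumptions for the sketch, not the ARMI interface:

    class AggregatingRMI:
        def __init__(self, transport, max_batch=4):
            self.transport = transport        # callable(dest, list_of_calls)
            self.max_batch = max_batch
            self.pending = {}                 # dest -> buffered calls

        def async_rmi(self, dest, method, *args):
            batch = self.pending.setdefault(dest, [])
            batch.append((method, args))
            if len(batch) >= self.max_batch:  # buffer full: one big message
                self.flush(dest)

        def flush(self, dest):
            self.transport(dest, self.pending.pop(dest, []))

    rmi = AggregatingRMI(lambda d, calls: print(f"-> node {d}: {calls}"))
    for i in range(9):
        rmi.async_rmi(1, "push_back", i)      # 9 fine-grain calls, 2 messages
    rmi.flush(1)                              # drain the remaining one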
A design pattern is a mechanism for encapsulating the knowledge of experienced designers into a re-usable artifact. Parallel design patterns reflect commonly occurring parallel communication and synchronization structures. Our tools, CO2P3S (Correct Object-Oriented Pattern-based Parallel Programming System) and MetaCO2P3S, use generative design patterns. A programmer selects the parallel design patterns that are appropriate for an application, and then adapts the patterns for that specific application by selecting from a small set of code-configuration options. CO2P3S then generates a custom framework for the application that includes all of the structural code necessary for the application to run in parallel. The programmer is only required to write simple code that launches the application and to fill in some application-specific sequential hook routines. We use generative design patterns to take an application specification (parallel design patterns + sequential user code) and generate parallel application code that achieves good performance in shared memory and distributed memory environments. Although our implementations are for Java, the approach we describe is tool and language independent. This paper describes generalizing CO2P3S to generate distributed-memory parallel solutions.
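A toy Python stand-in for the generated-framework idea (CO2P3S itself emits Java): the function below represents structural code the tool would generate for a master-worker style pattern, and the user writes only the sequential hook routine:

    from concurrent.futures import ThreadPoolExecutor

    def generated_master_worker(items, user_hook, workers=4):
        # structural code the programmer never writes or edits
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(user_hook, items))

    # application-specific sequential hook: the only code the user writes
    def square(x):
        return x * x

    print(generated_master_worker(range(8), square))  # [0, 1, 4, ..., 49]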
Collecting a program's execution profile is important for many reasons: code optimization, memory layout, program debugging, and program comprehension. Path-based execution profiles are more detailed than count-based execution profiles, since they present the order of execution of the various blocks in a program: modules, procedures, basic blocks, etc. Recently, online string compression techniques have been employed for collecting compact representations of sequential program executions. In this paper, we show how a similar approach can be taken for shared memory parallel programs. Our compaction scheme yields one to two orders of magnitude compression compared to the uncompressed parallel program trace on some of the SPLASH benchmarks. Our compressed execution traces contain detailed information about synchronization and control/data flow which can be exploited for post-mortem analysis. In particular, the information in our compact execution traces is useful for accurate data race detection (detecting unsynchronized shared variable accesses that occurred in the execution).
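The paper's compaction builds on online grammar-based string compression; as a much simpler stand-in that shows why block traces compress, the sketch below greedily folds short repeating patterns, the kind a loop emits, into (pattern, count) pairs:

    def fold_repeats(trace, max_period=4):
        """Greedily rewrite s s s ... as (s, count) for short periods s."""
        out, i = [], 0
        while i < len(trace):
            best = (1, 1)                             # (period, count)
            for p in range(1, max_period + 1):
                count = 1
                while trace[i + count * p : i + (count + 1) * p] == trace[i : i + p]:
                    count += 1
                if count * p > best[0] * best[1]:
                    best = (p, count)
            p, count = best
            out.append((tuple(trace[i : i + p]), count))
            i += p * count
        return out

    trace = ["A", "B", "C", "B", "C", "B", "C", "D"]
    print(fold_repeats(trace))  # [(('A',), 1), (('B', 'C'), 3), (('D',), 1)]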