Summary form only given. This paper describes a parallel debugger and the related debugging support implemented for CHARM++, a data-driven parallel programming language. Because we build extensive debugging support into the parallel runtime system, applications can be debugged at a very high level.
This work describes the authors' approach, used at their university over the last three years, to giving second-year computer science undergraduates hands-on experience with parallel and distributed computing. The goal is to give a solid understanding of parallel and distributed processing technologies and to build up basic skills in the field, such as parallel algorithms, multi-thread/network programming, IP/socket communication, the MVC paradigm, RPC/remote method invocation (RMI), databases/SQL, and Java/JDBC. The course features a combination of active experimental learning and an N-to-N networking approach. Unlike typical laboratories built around central parallel servers or parallel machines (N users to one system), our laboratories do without them and instead organize groups of student PCs into virtual parallel/distributed systems (N users to N systems). All PCs work as servers as well as clients. Parallel bucket sorting and a virtual shopping mall are implemented as the course projects. The course consists of 14 ninety-minute sessions within a semester, including introductory Java network programming and the two projects. As time is limited, homework and pre-laboratory experiments are encouraged. Web-based course material distribution and the virtual laboratory environment contributed to student success.
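A minimal sketch of the parallel bucket-sorting idea behind the first course project: keys are scattered into per-worker buckets by value range, each worker sorts its own bucket, and the sorted buckets are concatenated. The student implementations distribute the buckets across PCs over Java sockets/RMI; the sketch below instead uses POSIX threads on one machine, and all names and sizes are illustrative.

    /* Parallel bucket sort sketch: range-partition, sort each bucket in a
     * thread, then concatenate (buckets are already in range order). */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N        4096   /* total keys     */
    #define NBUCKETS 4      /* "worker" count */

    static double data[N];
    static double bucket[NBUCKETS][N];
    static int    count[NBUCKETS];

    static int cmp(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    static void *sort_bucket(void *arg)
    {
        int b = (int)(long)arg;
        qsort(bucket[b], count[b], sizeof(double), cmp);  /* local sort */
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; ++i)
            data[i] = rand() / (RAND_MAX + 1.0);          /* keys in [0,1) */

        /* scatter: key k in [0,1) goes to bucket floor(k * NBUCKETS) */
        for (int i = 0; i < N; ++i) {
            int b = (int)(data[i] * NBUCKETS);
            bucket[b][count[b]++] = data[i];
        }

        pthread_t tid[NBUCKETS];
        for (long b = 0; b < NBUCKETS; ++b)
            pthread_create(&tid[b], NULL, sort_bucket, (void *)b);
        for (int b = 0; b < NBUCKETS; ++b)
            pthread_join(tid[b], NULL);

        /* gather: concatenate the sorted buckets */
        int k = 0;
        for (int b = 0; b < NBUCKETS; ++b)
            for (int i = 0; i < count[b]; ++i)
                data[k++] = bucket[b][i];

        printf("min %.3f  max %.3f\n", data[0], data[N - 1]);
        return 0;
    }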
Summary form only given. This article studies a communication model that aims to extend the scope of computational grids by allowing the execution of parallel and/or distributed applications without imposing any programming constraints or the use of a particular communication layer. Such a model leads to the design of a communication framework for grids that allows the use of the middleware appropriate for the application rather than the one dictated by the available resources. Such a framework is able to handle any communication middleware, even several at the same time, on any kind of networking technology. Our proposed dual-abstraction (parallel and distributed) model is organized into three layers, which we highlight: arbitration, abstraction, and personalities. The performance obtained with PadicoTM, our open-source implementation of the proposed framework, shows that this functionality can be offered while still providing very high performance.
Since December 2002 the Naval Research Laboratory (NRL) has been evaluating an Altix 3000, the newest high performance computing system available from Silicon Graphics, Inc. (SGI). The Altix is a departure from previous SGI products in that, instead of MIPS processors and the IRIX operating system, it uses Intel Itanium IA-64 processors and the SGI ProPack operating system (based on Red Hat Linux). The Altix retains the brick concept of system configuration and uses SGI NUMAlink as the inter-module network for shared memory. The Altix runs under a single image of the operating system and supports parallel programming through OpenMP, MPI, Co-Array Fortran, and an automatic parallelizing compiler. Various codes have been evaluated with respect to their ease of portability and their performance on the Altix compared to other high performance computers.
Summary form only given. Parallel programming paradigms have, over the past decade, focused on how to harness the computational power of contemporary parallel machines. Ease of use and code-development productivity have been a secondary goal. Recently, however, there has been growing interest in understanding code-development productivity issues and their implications for the overall time to solution. Unified Parallel C (UPC) is a recently developed language that has been gaining attention. UPC holds the promise of combining the ease of use of the shared-memory model with the performance benefits of locality exploitation. The performance potential of UPC has been extensively studied in recent research efforts. The aim of this study, however, is to examine the impact of UPC on programmer productivity. We propose several productivity metrics and consider a wide array of high performance applications. Further, we compare UPC to the most widely used parallel programming paradigm, MPI. The results show that UPC compares favorably with MPI in programmer productivity.
We describe the design and implementation of a fault-tolerant GridRPC system, Ninf-C, designed for easy programming of large-scale master-worker programs that take from a few days to a few months to execute in a grid environment. Ninf-C employs Condor, developed at the University of Wisconsin, as the underlying middleware, supporting remote file transmission and checkpointing for system-wide robustness for application users on the grid. Ninf-C layers all the GridRPC communication and task-parallel programming features on top of Condor in a non-trivial fashion, assuming that the entire program is structured in a master-worker style; in fact, older Ninf master-worker programs can be run directly or trivially ported to Ninf-C. In contrast to the original Ninf, Ninf-C exploits and extends Condor features extensively for robustness and transparency, such as 1) checkpointing and stateful recovery of the master process, 2) communication between the master and workers using (remote) files rather than IP sockets, and 3) automated throttling of parallel GridRPC calls; and in contrast to using Condor directly, programmers can set up complex dynamic workflows as well as master-worker parallel structures with almost no learning curve involved. To prove the robustness of the system, we performed an experiment on a heterogeneous cluster consisting of x86 and SPARC CPUs, running a simple but long-running master-worker program with staged rebooting of multiple nodes to simulate serious fault situations. The program execution finished normally in spite of all the fault scenarios, demonstrating the robustness of Ninf-C.
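The master-worker structure that Ninf-C targets can be pictured with a generic sketch. The one below uses plain MPI messages purely for illustration, whereas Ninf-C itself moves tasks through Condor-managed (remote) files and adds checkpointing; the task payload, tags, and sizes are invented for the example and are not part of the Ninf-C API.

    /* Generic MPI master-worker sketch: rank 0 hands out task indices and
     * collects results; workers loop until told to stop. */
    #include <mpi.h>
    #include <stdio.h>

    #define NTASKS   32
    #define TAG_WORK 1
    #define TAG_STOP 2

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                       /* master */
            int next = 0, done = 0, result;
            MPI_Status st;
            /* prime every worker with one task */
            for (int w = 1; w < size && next < NTASKS; ++w) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                ++next;
            }
            /* hand out remaining tasks as results come back */
            while (done < next) {
                MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD, &st);
                ++done;
                if (next < NTASKS) {
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                             MPI_COMM_WORLD);
                    ++next;
                }
            }
            for (int w = 1; w < size; ++w)     /* tell workers to stop */
                MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
            printf("master: %d results collected\n", done);
        } else {                               /* worker */
            int task, result;
            MPI_Status st;
            for (;;) {
                MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG,
                         MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_STOP) break;
                result = task * task;          /* stand-in for the RPC body */
                MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }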
Summary form only given. The memory consistency model underlying the Unified Parallel C (UPC) language remains a promising but underused feature. We report on our efforts to understand the UPC memory model and assess its potential benefits. We describe problems we have uncovered in the current language specification. These results have inspired an effort in the UPC community to create an alternative memory model definition that avoids these problems. We give experimental results confirming the promise of performance gains afforded by the memory model's relaxed constraints on consistency.
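The strict-versus-relaxed distinction drawn by such a memory model is conceptually close to the ordering arguments of C11 atomics. The sketch below is a plain-C analogy, not UPC code: a release/acquire flag plays the role of a "strict" access, and the commented-out relaxed store shows the weaker ordering that a relaxed mode permits (the consumer could then observe the flag before the payload).

    /* Producer publishes data guarded by a flag; ordering choices decide
     * whether the consumer is guaranteed to see the payload. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int        payload;          /* ordinary shared data */
    static atomic_int ready = 0;        /* publication flag     */

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;                                        /* write data */
        /* relaxed store: no ordering with the payload write:             */
        /* atomic_store_explicit(&ready, 1, memory_order_relaxed);        */
        atomic_store_explicit(&ready, 1, memory_order_release); /* strict-like */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                            /* spin until published          */
        printf("payload = %d\n", payload);  /* guaranteed to print 42     */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }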
We present the multi-log processor, an event-driven multiprocessor. The functionality of the processor is defined by the triggering of events, maintained in a single event queue. The key feature of the multi-log is that the entire register file and the event queue are shared. We describe the network architecture of the multi-log and discuss optimal layout schemes. This article describes two scalable event-driven multiprocessor architectures, the multi-log I and the multi-log II, and compares their VLSI complexities (gate delays, wire-length delays, and area). Both multiprocessors are implemented by a large collection of ALUs with controllers and on-chip speculative L0 caches (together called logPs) connected by a network of parallel-prefix tree circuits. A fat-tree network connects an interleaved memory to the logPs. These networks provide superscalar uniprocessor-like functionality, including register renaming, out-of-order event execution, and speculative event execution. Given 1 billion transistors on a single chip, the multi-log I architecture would have 256 logPs on chip, while the multi-log II architecture would allow for 1024 logPs on chip. We propose a new strategy to handle non-local events by introducing a mechanism that allows event transfers over the network just described, by means of event stealing. We also propose an instruction set architecture for the multi-log processor and give a programming model for event-driven applications. Event scheduling and event stealing are implemented in software, and we suggest some innovative schemes for their implementation and analysis.
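A toy software event queue can make the event-driven programming model concrete. The sketch below is purely illustrative and single-threaded; in the multi-log, the queue and register file are shared hardware structures, and an idle logP "steals" pending events from that shared queue rather than popping a local one.

    /* Toy event queue: events carry a handler and an argument, and a
     * scheduler pops and runs them until the queue drains.  Handlers may
     * post further events. */
    #include <stdio.h>

    #define QCAP 64

    typedef struct {
        void (*handler)(int);   /* what to run */
        int  arg;               /* its operand */
    } event_t;

    static event_t queue[QCAP];
    static int head, tail;

    static void post(void (*h)(int), int arg)   /* enqueue an event */
    {
        queue[tail % QCAP] = (event_t){ h, arg };
        ++tail;
    }

    static void print_square(int x) { printf("%d^2 = %d\n", x, x * x); }

    static void spawn_more(int x)               /* handler that posts events */
    {
        if (x > 0) {
            post(print_square, x);
            post(spawn_more, x - 1);
        }
    }

    int main(void)
    {
        post(spawn_more, 3);
        while (head != tail) {                  /* run until queue drains */
            event_t e = queue[head % QCAP];
            ++head;
            e.handler(e.arg);
        }
        return 0;
    }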
We have developed a high-performance hybridized parallel finite-difference time-domain (FDTD) algorithm featuring both OpenMP shared-memory programming and MPI message passing. Our goal is to effectively model the optical characteristics of a novel light source created by utilizing a new class of materials known as photonic band-gap crystals. Our method is based on the solution of the second-order discretized Maxwell's equations in space and time. This novel hybrid parallelization scheme allows us to take advantage of the new generation of parallel machines built from interconnected SMP nodes. By using parallel computation, we are able to complete a calculation on 24 processors in less than a day, whereas a serial version would have taken over three weeks. We present a detailed study of this hybrid scheme on an SGI Origin 2000 distributed shared memory ccNUMA system, along with a thorough investigation of the advantages and drawbacks of the method.
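The hybrid scheme can be pictured as MPI ranks owning slabs of the grid and exchanging one-cell halos, while OpenMP threads share the update loops inside each rank. The 1-D sketch below is a minimal illustration of that pattern only; the field names, coefficients, and sizes are chosen for brevity and are not taken from the paper.

    /* 1-D FDTD-style hybrid sketch: MPI halo exchange between slabs,
     * OpenMP threads sharing the field-update loops within each rank. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdlib.h>

    #define NLOCAL 1024   /* cells per MPI rank, plus 2 ghost cells */
    #define NSTEPS 100

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        double *ez = calloc(NLOCAL + 2, sizeof *ez);  /* E field + ghosts */
        double *hy = calloc(NLOCAL + 2, sizeof *hy);  /* H field + ghosts */
        if (rank == 0) ez[NLOCAL / 2] = 1.0;          /* point excitation */

        for (int t = 0; t < NSTEPS; ++t) {
            /* exchange the boundary H value needed by the E update */
            MPI_Sendrecv(&hy[NLOCAL], 1, MPI_DOUBLE, right, 0,
                         &hy[0],      1, MPI_DOUBLE, left,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* threads inside the rank share the E-field update */
            #pragma omp parallel for
            for (int i = 1; i <= NLOCAL; ++i)
                ez[i] += 0.5 * (hy[i] - hy[i - 1]);

            /* exchange the boundary E value needed by the H update */
            MPI_Sendrecv(&ez[1],          1, MPI_DOUBLE, left,  1,
                         &ez[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            #pragma omp parallel for
            for (int i = 1; i <= NLOCAL; ++i)
                hy[i] += 0.5 * (ez[i + 1] - ez[i]);
        }

        free(ez); free(hy);
        MPI_Finalize();
        return 0;
    }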
This paper introduces a new high-level parallel programming construct called multiLoop, designed to extend existing imperative languages such as C and Java. A multiLoop statement translates to an SPMD specification of a named group of synchronous-iterative processes. For efficient iterative communication, multiLoop provides a new publish/subscribe model of shared-variable access. Under this model, the sequential consistency of shared memory is maintained by a new, simple, and efficient adaptation of the virtual-time paradigm. Virtual time is a localised message tagging and queuing procedure that provides a highly efficient alternative to barrier calls. ML-C, a prototype implementation based on C, has been developed. We describe the programming model, discuss its implementation, and present empirical data showing good performance for an example of the target class of applications.
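For contrast, the conventional barrier-based form of the synchronous-iterative SPMD pattern that a multiLoop statement generates looks roughly like the pthreads sketch below; multiLoop's contribution, not shown here, is to replace the two barriers with tagged, queued updates to published shared variables. The computation and names are illustrative only, not ML-C syntax.

    /* Barrier-based synchronous-iterative SPMD baseline: each member reads
     * its neighbours' published values, then all publish new ones. */
    #define _POSIX_C_SOURCE 200809L
    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4
    #define NITER 8

    static double shared_val[NPROC];     /* one published slot per member */
    static pthread_barrier_t bar;

    static void *member(void *arg)
    {
        int id = (int)(long)arg;
        shared_val[id] = id;
        for (int it = 0; it < NITER; ++it) {
            /* compute from the neighbours' currently published values */
            double left  = shared_val[(id + NPROC - 1) % NPROC];
            double right = shared_val[(id + 1) % NPROC];
            double next  = 0.5 * (left + right);

            pthread_barrier_wait(&bar);  /* everyone has read old values */
            shared_val[id] = next;       /* publish the update           */
            pthread_barrier_wait(&bar);  /* everyone sees the new values */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NPROC];
        pthread_barrier_init(&bar, NULL, NPROC);
        for (long i = 0; i < NPROC; ++i)
            pthread_create(&tid[i], NULL, member, (void *)i);
        for (int i = 0; i < NPROC; ++i)
            pthread_join(tid[i], NULL);
        printf("final value at member 0: %f\n", shared_val[0]);
        pthread_barrier_destroy(&bar);
        return 0;
    }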