ISBN (print): 9781581137620
Recent efforts in adapting computer-network concepts to systems-on-chip (SoC), known as networks-on-chip, represent a step backward from traditional computer systems in that they lack an effective programming model, while also failing to take full advantage of the almost unlimited on-chip bandwidth. In this paper, we propose a new programming model, called context-flow, that is simple, safe, and highly parallelizable, yet transparent to the underlying architectural details. An SoC platform architecture is then designed to support this programming model while fully exploiting the physical proximity between processing elements. We demonstrate the performance advantage of this architecture over bus-based and packet-switched networks through two case studies using a multiprocessor architecture simulator.
The authors give an overview of the Rewrite Rule Machine's (RRM's) architecture and discuss performance estimates based on very detailed register-level simulations at the chip level, together with more abstract simulations and modeling at higher levels. For a 10,000-ensemble RRM, the present estimates are as follows. (1) The raw peak performance is 576 trillion operations per second. (2) For general symbolic applications, the ensemble's Sun-relative speedup is roughly 6.7, and RRM performance with a wormhole network at 88% efficiency gives an idealized Sun-relative speedup of 59,000. (3) For highly regular symbolic applications (the sorting problem is taken as a typical example), ensemble performance is a Sun-relative speedup of 127, and RRM performance is estimated at over 80% efficiency (relative to cluster performance), yielding a Sun-relative speedup of over 91. (4) For systolic applications (a 2-D fluid flow problem is taken as a typical example), ensemble performance is a Sun-relative speedup of 400-670, and cluster-level performance, which should be attainable in practice, is at 82% efficiency.
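The idealized figure in estimate (2) can be reproduced from the quantities the abstract gives: a per-ensemble Sun-relative speedup of 6.7, 10,000 ensembles, and a wormhole network at 88% efficiency. The formula (speedup × ensembles × efficiency) is our reading of how the estimate composes, not something the text states explicitly:

```python
# Reconstructing estimate (2): idealized Sun-relative speedup of the full RRM.
# Assumed model: per-ensemble speedup scaled by ensemble count and network efficiency.
ensemble_speedup = 6.7      # Sun-relative speedup of one ensemble
n_ensembles = 10_000        # machine size used throughout the estimates
network_efficiency = 0.88   # wormhole network efficiency

idealized_speedup = ensemble_speedup * n_ensembles * network_efficiency
print(round(idealized_speedup))  # 58960, i.e. roughly the quoted 59,000
```

The product comes out at 58,960, which matches the rounded 59,000 in the abstract, supporting this reading of the estimate.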
A Reconfigurable Consistency Algorithm (RCA) is an algorithm that guarantees consistency in Distributed Shared Memory (DSM) systems. In an RCA, a Configuration Control Layer (CCL) is responsible for selecting the most suitable RCA configuration (behavior) for a specific workload and DSM system. In previous work, we defined an upper-bound performance for RCA based on an ideal CCL, which knows a priori the best configuration for each situation. This ideal CCL relies on a set of workload characteristics that, in most situations, are difficult to extract from the applications (the percentage of shared write and read operations, and the sharing patterns). In this paper we propose, develop, and present a heuristic configuration control mechanism for the CCL implementation. This mechanism is based on an easily obtained application parameter, the concurrency level. Our results show that this configuration control mechanism improves RCA performance by 15%, on average, compared to other traditional consistency algorithms. Furthermore, the CCL with this mechanism is independent of workload- and DSM-system-specific characteristics, such as sharing patterns and the percentage of writes and reads.
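The abstract does not give the decision rule itself, but a heuristic CCL of the kind described could map the observed concurrency level to an RCA configuration through simple thresholds. The configuration names and threshold values below are hypothetical, chosen only to illustrate the mechanism:

```python
def select_configuration(concurrency_level: int) -> str:
    """Pick an RCA configuration (behavior) from the concurrency level alone.

    Hypothetical rule: low concurrency favors an eager, strongly consistent
    behavior; high concurrency favors a lazier, relaxed behavior that tolerates
    more concurrent sharing. Thresholds are illustrative, not from the paper.
    """
    if concurrency_level <= 2:
        return "sequential-consistency"    # few concurrent writers: eager updates are cheap
    elif concurrency_level <= 8:
        return "release-consistency"       # moderate sharing: defer propagation to sync points
    else:
        return "lazy-release-consistency"  # heavy sharing: minimize coherence traffic

print(select_configuration(1), select_configuration(16))
```

The appeal of such a rule is exactly the point the abstract makes: the concurrency level is cheap to observe at runtime, whereas sharing patterns and read/write ratios are not.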
Distributed shared memory (DSM) has been recognized as an alternative programming model for exploiting parallelism in distributed-memory systems, since it provides a higher level of abstraction than simple message passing. DSM combines the simple programming model of shared memory with the scalability of distributed-memory machines. This paper presents DSMPI, a parallel library that runs atop MPI and provides a distributed shared memory abstraction. It offers an easy-to-use programming interface, is flexible and portable, and supports heterogeneity. Moreover, it supports different coherence protocols and consistency models. We present performance results obtained on a network of workstations and on a Cray T3D, which show that DSMPI can be competitive with MPI for some applications.
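DSMPI's actual interface is not shown in the abstract. As a minimal sketch of the idea, the toy below simulates a coherent global address space with a write-invalidate protocol over counted "messages", standing in for the MPI traffic such a library generates underneath; all names and structure are illustrative only:

```python
class ToyDSM:
    """Toy write-invalidate DSM: a coherent global address space simulated in
    one process. Not DSMPI's API; illustrative of the abstraction it provides."""

    def __init__(self, n_nodes: int):
        self.memory = {}                               # home copy: address -> value
        self.cache = [dict() for _ in range(n_nodes)]  # per-node cached copies
        self.messages = 0                              # message-passing cost underneath

    def read(self, node: int, addr: int):
        if addr not in self.cache[node]:   # miss: fetch from home (request + reply)
            self.messages += 2
            self.cache[node][addr] = self.memory.get(addr, 0)
        return self.cache[node][addr]

    def write(self, node: int, addr: int, value):
        for other, c in enumerate(self.cache):  # invalidate remote copies
            if other != node and addr in c:
                del c[addr]
                self.messages += 1
        self.memory[addr] = value
        self.cache[node][addr] = value

dsm = ToyDSM(n_nodes=4)
dsm.write(0, addr=42, value=7)
print(dsm.read(3, 42))  # 7: node 3 observes node 0's write
```

Swapping the invalidation loop for an update broadcast would give a write-update protocol, the kind of choice the abstract says DSMPI exposes.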
Multiprocessor architectures are converging. For the present, we urgently need to adopt common standards for message-passing programming. For the future, one can expect scalable virtual shared memory machines to dominate. The author discusses communication strategies, dedicated components, programming environments, and programming. An example listing of a ranking program is given that would require such a generation of machine to execute efficiently.
ISBN (digital): 9798350355543
ISBN (print): 9798350355550
We propose a CPU-GPU heterogeneous computing method for solving time-evolution partial differential equation problems many times with guaranteed accuracy, in short time-to-solution and low energy-to-solution. On a single-GH200 node, the proposed method improved the computation speed by factors of 86.4 and 8.67 compared to the conventional method run only on the CPU and only on the GPU, respectively. Furthermore, the energy-to-solution was reduced by 32.2-fold (from 9944 J to 309 J) and 7.01-fold (from 2163 J to 309 J) when compared to using only the CPU and GPU, respectively. Using the proposed method on the Alps supercomputer, a 51.6-fold and 6.98-fold speedup was attained when compared to using only the CPU and GPU, respectively, and a high weak-scaling efficiency of 94.3% was obtained up to 1,920 compute nodes. These implementations were realized using directive-based parallel programming models while retaining portability, indicating that directives are highly effective for analyses in heterogeneous computing environments.
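One building block of any such CPU-GPU method is deciding how much work each device gets. The paper's partitioning scheme is not described here; the sketch below shows the generic idea of splitting work proportionally to measured device throughput so both sides finish together (device names and rates are illustrative, not measurements from the paper):

```python
def partition(n_items: int, throughputs: dict) -> dict:
    """Split n_items across devices in proportion to their throughput, so all
    devices finish at roughly the same time (static load balancing).
    Device names and throughput values are hypothetical."""
    total = sum(throughputs.values())
    shares, assigned = {}, 0
    devices = list(throughputs.items())
    for i, (dev, rate) in enumerate(devices):
        if i == len(devices) - 1:
            shares[dev] = n_items - assigned          # remainder goes to the last device
        else:
            shares[dev] = int(n_items * rate / total)
            assigned += shares[dev]
    return shares

print(partition(1000, {"cpu": 1.0, "gpu": 9.0}))  # {'cpu': 100, 'gpu': 900}
```

With a GPU nine times faster than the CPU, the CPU still absorbs 10% of the work rather than idling, which is the essence of why the heterogeneous run beats the GPU-only run.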
ISBN (print): 0780372239
Presents a visualization technique based on particle tracking. The technique consists of defining a set of points distributed on a closed surface and following the surface's deformations as the velocity field changes in time. Deformations of the surface carry information about the dynamics of the flow; in particular, it is possible to identify zones where flow stretching and folding occur. Because the points on the surface are independent of each other, the trajectory of each point can be calculated concurrently. Two parallel algorithms are studied: the first for a shared-memory Origin 2000 supercomputer and the second for a distributed-memory PC cluster. The technique is applied to a fluid moving by natural convection inside a cubic container.
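The independence of the points is what makes both parallel algorithms straightforward: each worker advects its own subset of points with no communication. A minimal sketch, using forward Euler and a made-up uniform velocity field (the paper's convection field and integrator are not given):

```python
from concurrent.futures import ThreadPoolExecutor

def velocity(x, y, z, t):
    """Illustrative stand-in for the time-dependent flow field: uniform flow in x."""
    return (1.0, 0.0, 0.0)

def advect(point, t0=0.0, dt=0.1, steps=10):
    """Forward-Euler trajectory of one surface point. Points are independent,
    so a pool can advect them concurrently, as in both parallel algorithms."""
    x, y, z = point
    t = t0
    for _ in range(steps):
        u, v, w = velocity(x, y, z, t)
        x, y, z = x + dt * u, y + dt * v, z + dt * w
        t += dt
    return (x, y, z)

surface = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # two sample points on the closed surface
with ThreadPoolExecutor() as pool:
    deformed = list(pool.map(advect, surface))
print([round(p[0], 6) for p in deformed])  # each point advanced ~1.0 in x
```

On the shared-memory machine the pool maps onto threads over a common point array; on the cluster the same map becomes a scatter of point subsets to nodes.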
The VIPER tool visualises the execution of a parallel program. VIPER focuses on the class of parallel programs constructed around the Mona Lisa parallel programming paradigm. Mona Lisa is a typed paradigm, providing the user with a small set of high-level primitives for data exchange. The information provided by VIPER is directly related to the execution of these primitives. This makes the tool more suitable for behavioural analysis and debugging than paradigm-independent tools such as ParaGraph. Five graphical views are supplied by VIPER. The most important ones are an animation view showing the parallel program as a collection of interacting modules, and a space-time view displaying the module interaction over time. The construction of these views is based on trace messages produced by the parallel program during execution. The trace messages have to be correctly ordered to allow a consistent observation of the distributed computation. VIPER performs this run construction on the fly (allowing on-line visualisation), with minimal latency and maximum efficiency in terms of trace-message generation, size, and processing.
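VIPER's run-construction algorithm is not detailed in the abstract. A standard technique with the same consistency property (if event a happened-before event b, then a is ordered before b) is Lamport logical clocks, sketched here:

```python
# Consistent ordering of trace messages with Lamport logical clocks.
# This is a textbook technique, not necessarily VIPER's actual algorithm.

def stamp_events(events):
    """events: list of (pid, kind, partner) in per-process program order, with
    kind in {'local', 'send', 'recv'}; partner is the index of the matching
    send for 'recv' events. Returns (clock, pid, kind) tuples, sorted into one
    consistent total order of the run."""
    clocks = {}
    stamped = []
    for pid, kind, partner in events:
        c = clocks.get(pid, 0) + 1
        if kind == "recv":
            c = max(c, stamped[partner][0] + 1)  # a receive follows its send
        clocks[pid] = c
        stamped.append((c, pid, kind))
    return sorted(stamped)  # ties broken by pid: a consistent (not unique) order

trace = [(0, "send", None), (1, "recv", 0), (1, "send", None), (0, "recv", 2)]
print(stamp_events(trace))
# [(1, 0, 'send'), (2, 1, 'recv'), (3, 1, 'send'), (4, 0, 'recv')]
```

Because each timestamp depends only on earlier events, the ordering can be maintained incrementally as trace messages arrive, which is what makes on-line visualisation with low latency feasible.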
ISBN (digital): 9798350355543
ISBN (print): 9798350355550
The Common Workflow Language (CWL) is a widely adopted language for defining and sharing computational workflows. It is designed to be independent of the execution engine on which workflows are executed. In this paper, we describe our experiences integrating CWL with Parsl, a Python-based parallel programming library designed to manage the execution of workflows across diverse computing environments. We propose a new method that converts CWL CommandLineTool definitions into Parsl apps, enabling Parsl scripts to easily import and use tools represented in CWL. We describe a Parsl runner that is capable of executing a CWL CommandLineTool directly. We also describe a proof-of-concept extension to support inline Python in a CWL workflow definition, enabling seamless use in Parsl's Python ecosystem. We demonstrate the benefits of this integration by presenting example CWL CommandLineTool definitions that show how they can be used in Parsl, and by comparing the performance of executing an image-processing workflow using the Parsl integration with that of other CWL runners.
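The core of such a conversion is rendering a CommandLineTool description into an executable command line, which can then be wrapped as an app. The toy below handles only baseCommand plus positional and prefixed inputs, a small slice of the CWL spec, and is our sketch rather than the paper's converter:

```python
def cwl_to_command(tool: dict, inputs: dict) -> list:
    """Render a (simplified) CWL CommandLineTool description into an argv list.
    Toy sketch: supports only baseCommand and inputBinding position/prefix,
    not the full CWL CommandLineTool specification."""
    argv = list(tool["baseCommand"])
    bindings = []
    for name, spec in tool["inputs"].items():
        binding = spec.get("inputBinding", {})
        bindings.append((binding.get("position", 0), binding.get("prefix"), inputs[name]))
    for _, prefix, value in sorted(bindings, key=lambda b: b[0]):
        if prefix:
            argv.append(prefix)
        argv.append(str(value))
    return argv

# Hypothetical image-processing tool description in the CWL style.
tool = {
    "baseCommand": ["convert"],
    "inputs": {
        "infile":  {"inputBinding": {"position": 1}},
        "quality": {"inputBinding": {"position": 2, "prefix": "-quality"}},
    },
}
print(cwl_to_command(tool, {"infile": "in.png", "quality": 90}))
# ['convert', 'in.png', '-quality', '90']
```

In an actual integration, the resulting argv would be handed to a Parsl bash_app (or executed directly by the runner the paper describes) so the tool participates in Parsl's dataflow scheduling.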
Software distributed shared memory (DSM) improves the programmability of message-passing machines and workstation clusters by providing a shared-memory abstraction (i.e., a coherent global address space) to programmers. As in any distributed system, however, the probability of software DSM failures increases as the system size grows. This paper presents a new efficient logging protocol for adaptive software DSM (ADSM), called adaptive logging (AL). It is suitable for both coordinated and independent checkpointing, since it speeds up the recovery process and eliminates the unbounded-rollback problem associated with independent checkpointing. By leveraging the existing coherence data maintained by ADSM, our AL protocol adapts to log only unrecoverable data (data that cannot be recreated or retrieved after a failure) necessary for correct recovery, reducing both the number of messages logged and the amount of logged data. We have performed experiments on a cluster of eight Sun Ultra-5 workstations, comparing our AL protocol against the previous message logging (ML) protocol by implementing both protocols in a TreadMarks-based ADSM. The experimental results show that our AL protocol consistently outperforms the ML protocol: our protocol increases the execution time only slightly, by 2% to 10%, during failure-free execution, while the ML protocol lengthens the execution time many-fold due to its larger log size and higher number of messages logged. Our AL-based recovery also outperforms ML-based recovery by 9% to 17% for the parallel applications examined.
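The key idea, logging only what the coherence layer cannot reconstruct, can be illustrated with a toy decision rule. The condition below (log a page only when no other node holds a copy) is our simplification of that principle, not the protocol's exact test:

```python
def must_log(page: int, copyset: dict) -> bool:
    """Adaptive-logging decision in the spirit of AL: skip logging when the
    coherence data guarantees another node still holds a copy from which the
    page can be retrieved after a failure. 'copyset' maps a page to the set of
    nodes caching it. Simplified rule, not the paper's exact condition."""
    return len(copyset.get(page, set())) <= 1  # sole copy: unrecoverable, must log

copyset = {7: {0, 3}, 9: {2}}
print(must_log(7, copyset), must_log(9, copyset))  # False True
```

Skipping the log for page 7 is safe because node 0 or node 3 can supply it during recovery; page 9 exists only on node 2, so losing that node without a log entry would force an unbounded rollback.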