The authors give an overview of the Rewrite Rule Machine's (RRM's) architecture and discuss performance estimates based on very detailed register-level simulations at the chip level, together with more abstract simulations and modeling for higher levels. For a 10,000-ensemble RRM, the present estimates are as follows. (1) The raw peak performance is 576 trillion operations per second. (2) For general symbolic applications, ensemble Sun-relative speedup is roughly 6.7, and RRM performance with a wormhole network at 88% efficiency gives an idealized Sun-relative speedup of 59,000. (3) For highly regular symbolic applications (the sorting problem is taken as a typical example), ensemble performance is a Sun-relative speedup of 127, and RRM performance is estimated at over 80% efficiency (relative to cluster performance), yielding a Sun-relative speedup of over 91. (4) For systolic applications (a 2-D fluid flow problem is taken as a typical example), ensemble performance is a Sun-relative speedup of 400-670, and cluster-level performance, which should be attainable in practice, is at 82% efficiency.
ISBN:
(digital) 9798350355543
ISBN:
(print) 9798350355550
We propose a CPU-GPU heterogeneous computing method for solving time-evolution partial differential equation problems many times with guaranteed accuracy, in short time-to-solution and low energy-to-solution. On a single GH200 node, the proposed method improved the computation speed by 86.4 and 8.67 times compared to the conventional method run only on the CPU and only on the GPU, respectively. Furthermore, the energy-to-solution was reduced by 32.2-fold (from 9944 J to 309 J) and 7.01-fold (from 2163 J to 309 J) when compared to using only the CPU and GPU, respectively. Using the proposed method on the Alps supercomputer, a 51.6-fold and 6.98-fold speedup was attained when compared to using only the CPU and GPU, respectively, and a high weak-scaling efficiency of 94.3% was obtained up to 1,920 compute nodes. These implementations were realized using directive-based parallel programming models while retaining portability, indicating that directives are highly effective for analyses in heterogeneous computing environments.
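To make the kind of kernel involved concrete, the following sketch (not the paper's actual solver) shows one explicit time step of a 1-D heat equation, u_t = alpha * u_xx. The inner loop is the data-parallel stencil update that directive-based models (OpenMP target, OpenACC) can offload to a GPU with a single pragma; all names and values here are illustrative.

```python
def heat_step(u, alpha, dx, dt):
    """Advance u one explicit Euler step of u_t = alpha * u_xx (fixed boundaries)."""
    n = len(u)
    new = list(u)                # boundary values u[0], u[n-1] stay fixed
    r = alpha * dt / (dx * dx)   # explicit stability requires r <= 0.5
    # In C this loop is the offload target, e.g.:
    #   #pragma omp target teams distribute parallel for
    for i in range(1, n - 1):
        new[i] = u[i] + r * (u[i - 1] - 2.0 * u[i] + u[i + 1])
    return new

# Example: a unit spike in the middle diffuses symmetrically outward.
u = [0.0] * 5
u[2] = 1.0
u = heat_step(u, alpha=1.0, dx=1.0, dt=0.25)  # r = 0.25
```

Since each `new[i]` reads only the old array, the loop iterations are independent, which is exactly what makes a one-line offload directive sufficient.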
Multiprocessor architectures are converging. For the present, we urgently need to adopt common standards for message-passing programming. For the future, one can expect scalable virtual shared-memory machines to dominate. The author discusses communication strategies, dedicated components, programming environments, and programming. An example listing of a ranking program is given that would require such a generation of machine to execute efficiently.
A Reconfigurable Consistency Algorithm (RCA) is an algorithm that guarantees consistency in Distributed Shared Memory (DSM) systems. In an RCA, a Configuration Control Layer (CCL) is responsible for selecting the most suitable RCA configuration (behavior) for a specific workload and DSM system. In previous work, we defined an upper-bound performance for the RCA based on an ideal CCL, which knows a priori the best configuration for each situation. This ideal CCL relies on a set of workload characteristics that, in most situations, are difficult to extract from applications (the percentage of shared write and read operations, and the sharing patterns). In this paper we propose, develop, and present a heuristic configuration-control mechanism for the CCL implementation. The mechanism is based on an easily obtained application parameter, the concurrency level. Our results show that this mechanism improves RCA performance by 15%, on average, compared to other traditional consistency algorithms. Furthermore, a CCL with this mechanism is independent of workload- and DSM-system-specific characteristics such as sharing patterns and the percentage of writes and reads.
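The shape of such a heuristic can be sketched as a simple rule mapping the measured concurrency level to a configuration. The configuration names and thresholds below are hypothetical, chosen only to illustrate the idea; the paper's actual rule and configurations are not reproduced here.

```python
def choose_configuration(concurrency_level):
    """Illustrative CCL-style rule: map a measured concurrency level
    (number of processes concurrently touching shared data) to a
    consistency-algorithm configuration.  Names/thresholds are made up."""
    if concurrency_level <= 2:
        # little contention: eagerly propagating updates is cheap
        return "write-update"
    elif concurrency_level <= 8:
        # moderate contention: invalidate rather than broadcast updates
        return "write-invalidate"
    else:
        # heavy contention: delay propagation to synchronization points
        return "lazy-release"

selected = choose_configuration(4)
```

The point of the approach is that the input parameter is cheap to measure at run time, unlike sharing patterns or read/write ratios.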
Distributed shared memory (DSM) has been recognized as an alternative programming model for exploiting the parallelism in distributed-memory systems, since it provides a higher level of abstraction than simple message passing. DSM combines the simple programming model of shared memory with the scalability of distributed-memory machines. This paper presents DSMPI, a parallel library that runs atop MPI and provides a distributed shared memory abstraction. It offers an easy-to-use programming interface; it is flexible, portable, and supports heterogeneity. Moreover, it supports different coherence protocols and consistency models. We present performance results taken on a network of workstations and on a Cray T3D, which show that DSMPI can be competitive with MPI for some applications.
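A coherence protocol of the kind such a library can layer over message passing can be modeled in a few lines. The sketch below is a generic in-process model of write-invalidate coherence for one page; it is not DSMPI's API or implementation, and the method calls stand in for the MPI messages a real DSM would send.

```python
class Page:
    """Toy model of one DSM page under write-invalidate coherence."""

    def __init__(self, owner, value=0):
        self.owner = owner       # node holding the authoritative copy
        self.value = value
        self.copies = {owner}    # nodes currently holding a valid copy

    def read(self, node):
        # A read miss fetches the page from its owner; the reader then
        # caches a valid copy (in a real DSM: a request/reply message pair).
        self.copies.add(node)
        return self.value

    def write(self, node, value):
        # A write first invalidates every other copy (in a real DSM:
        # invalidation messages over MPI), then updates locally.
        self.copies = {node}
        self.owner = node
        self.value = value

p = Page(owner=0)
p.read(1); p.read(2)   # nodes 1 and 2 cache the page
p.write(1, 42)         # node 1's write invalidates the copies at 0 and 2
```

Swapping the `write` policy (e.g., updating copies instead of invalidating them) is precisely the kind of protocol choice the abstract says DSMPI exposes.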
A panel session organized to project what changes might occur in the near future to make parallel computers easier to program and use, and to explore how such computers could benefit many application areas, is reported. The following questions are discussed: (1) what types of applications will benefit on a widespread basis from improved performance; (2) whether serial and parallel programming will be integrated for greater reusability and portability; and (3) whether parallel computers will replace serial computers and, if so, what is needed so that concurrency can be handled more easily.
The purpose of this research was to construct an adaptive test on the computer. Adaptive testing is a new evaluation strategy for computer-assisted learning and e-learning. It provides more efficient test administration and intelligent learning evaluation, and is expected to increase the accuracy of estimating a learner's true ability while administering fewer, more appropriately selected questions to each individual. Item response theory (IRT) is the main theoretical basis for making tests adaptive and feasible. Adaptive testing requires high-speed calculation to process the complicated IRT functions, which is fortunately the advantage of computers.
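The core adaptive step can be sketched under the standard two-parameter logistic (2PL) IRT model: administer the item whose Fisher information is highest at the current ability estimate. The formulas are the standard 2PL ones; the item parameters in the example are made up for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response.
    a: discrimination, b: difficulty, theta: ability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of an item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, items):
    """Index of the most informative remaining item (items = [(a, b), ...])."""
    return max(range(len(items)),
               key=lambda i: information(theta, items[i][0], items[i][1]))

# Three illustrative items; for an examinee at theta = 0.0, the highly
# discriminating item whose difficulty matches the ability wins.
items = [(1.0, -1.0), (1.5, 0.0), (0.8, 2.0)]
best = next_item(0.0, items)
```

After each response, theta would be re-estimated (e.g., by maximum likelihood) and the selection repeated, which is where the high-speed calculation the abstract mentions comes in.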
Exploiting clusters of workstations as a single computational resource is an attractive alternative to conventional multiprocessor technologies. However, the class of parallel applications that can benefit from clusters is restricted due to their relatively high latency and low throughput, consequences of conventional networking. LANs offer the best performance but also limit the scope for effective clustering to a single room or building. Another major difference remains: multiprocessors can reasonably be programmed with an "error-free" assumption, but applications cannot be run on distributed clusters without programming against the potential for remote faults. Emerging high-speed switched networks such as ATM have the potential to reduce latency and increase bandwidth in the distributed scenario, and therefore to extend the class of applications suitable for running on clusters. In addition, the virtual-network capability of ATM removes some of the geographical constraints on clustering. But can ATM guarantee the kind of application-level connection reliability that is taken for granted in multiprocessor environments? This paper reviews the capabilities of modern high-speed networks, as exemplified by ATM, and their relevance to parallel and distributed systems. In particular, it asks whether Quality of Service (QoS) can benefit parallel programming on distributed platforms.
ISBN:
(print) 0780372239
Presents a visualization technique based on particle tracking. The technique consists of defining a set of points distributed on a closed surface and following the surface's deformations as the velocity field changes in time. Deformations of the surface carry information about the dynamics of the flow; in particular, it is possible to identify zones where flow stretching and folding occur. Because the points on the surface are independent of each other, the trajectory of each point can be calculated concurrently. Two parallel algorithms are studied: one for a shared-memory Origin 2000 supercomputer and one for a distributed-memory PC cluster. The technique is applied to a fluid moving by natural convection inside a cubic container.
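The embarrassingly parallel core can be sketched as follows: each point is advanced through the velocity field independently, so the points can simply be mapped across workers. A rigid-rotation field and a thread pool stand in here for the convection flow and the Origin 2000 / cluster parallelism; everything in this sketch is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def velocity(x, y):
    """Stand-in velocity field: rigid rotation about the origin."""
    return -y, x

def advect(point, dt=0.001, steps=1000):
    """Integrate one surface point with explicit Euler steps."""
    x, y = point
    for _ in range(steps):
        vx, vy = velocity(x, y)
        x, y = x + dt * vx, y + dt * vy
    return x, y

# Each point is an independent task: no communication until the final gather.
points = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
with ThreadPoolExecutor() as pool:
    trajectories = list(pool.map(advect, points))
```

In the shared-memory version the points would be split across threads exactly like this; in the distributed version each node would receive a block of points and return its trajectories at the end.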
ISBN:
(digital) 9798350355543
ISBN:
(print) 9798350355550
The Common Workflow Language (CWL) is a widely adopted language for defining and sharing computational workflows. It is designed to be independent of the execution engine on which workflows are executed. In this paper, we describe our experiences integrating CWL with Parsl, a Python-based parallel programming library designed to manage execution of workflows across diverse computing environments. We propose a new method that converts CWL CommandLineTool definitions into Parsl apps, enabling Parsl scripts to easily import and use tools represented in CWL. We describe a Parsl runner that is capable of executing a CWL CommandLineTool directly. We also describe a proof-of-concept extension to support inline Python in a CWL workflow definition, enabling seamless use in Parsl's Python ecosystem. We demonstrate the benefits of this integration by presenting example CWL CommandLineTool definitions that show how they can be used in Parsl, and by comparing the performance of executing an image-processing workflow using the Parsl integration with that of other CWL runners.
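The conversion idea, stripped to its essence, is turning a declarative command-line-tool description into a plain callable. The sketch below uses a toy dictionary schema as a stand-in for a CWL CommandLineTool; it is not the paper's converter and not Parsl's API (in Parsl, the resulting function would additionally be wrapped as a `bash_app` or `python_app` for asynchronous execution).

```python
import subprocess

def make_app(tool):
    """Build a callable that runs tool['baseCommand'] with the given
    inputs appended in declared order.  The schema is a simplified,
    hand-written stand-in for a CWL CommandLineTool."""
    def app(**inputs):
        argv = list(tool["baseCommand"])
        for name in tool["inputs"]:
            argv.append(str(inputs[name]))
        return subprocess.run(argv, capture_output=True, text=True).stdout
    return app

# Wrap a trivial tool and call it like an ordinary Python function.
echo_tool = {"baseCommand": ["echo"], "inputs": ["message"]}
echo = make_app(echo_tool)
out = echo(message="hello from a CWL-style tool")
```

A real converter must additionally honor CWL input bindings (positions, prefixes, types) and stage files, but the callable-from-description shape is the same.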