This paper describes an experimental message-driven programming system for fine-grain multicomputers. The initial target architecture is the J-machine designed at MIT. This machine combines a unique collection of architectural features that include fine-grain processes, on-chip associative memory, and hardware support for process synchronization. The programming system uses these mechanisms via a simple message-driven process model that blurs the distinction between processes and messages: messages correspond to processes that are executed elsewhere in the network. This model allows code and data to be distributed across the computers in the machine, and is supported at every stage of the program development cycle. The prototype system we have developed includes a basic set of programming tools to support the model; these include a compiler, linker, archiver, loader, and microkernel. Although the concepts are language independent, our prototype system is based on GNU-C.
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerance mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based, compiler-directed partial recomputing method for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially recomputing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault occurs.
Numerical approximate computations can solve large and complex problems *** have the advantage of high *** they only give approximate results, whereas we need exact results in some *** is a gap between approximate computations and exact results. In this paper, we build a bridge by which exact results can be obtained by numerical approximate computations.
Given a polynomial with symbolic/literal coefficients, a complete discrimination system is a set of explicit expressions in terms of the coefficients, which is sufficient for determining the numbers and multiplicities of the real and imaginary *** it is of great significance, such a criterion for root-classification has never been given for polynomials with degrees greater than *** lack of efficient tools in this aspect severely hinders computer implementations of Tarski’s and other methods in automated theorem *** remedy this defect, a generic algorithm is proposed to produce a complete discrimination system for a polynomial with any *** result has extensive applications in various fields, and its efficiency was demonstrated by computer implementations.
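To make the idea of a discrimination system concrete, the sketch below classifies the roots of a real cubic from explicit expressions in its coefficients alone. It is only the classical low-degree case, not the generic algorithm the abstract proposes, and the function name `classify_cubic_roots` is a hypothetical one chosen for illustration.

```python
from fractions import Fraction


def classify_cubic_roots(a, b, c, d):
    """Classify the roots of a*x^3 + b*x^2 + c*x + d (a != 0, real coefficients).

    Uses only explicit expressions in the coefficients: the classical cubic
    discriminant and the auxiliary quantity b^2 - 3ac. Illustrative only.
    """
    a, b, c, d = (Fraction(v) for v in (a, b, c, d))  # exact arithmetic
    if a == 0:
        raise ValueError("leading coefficient must be nonzero")
    disc = (18 * a * b * c * d - 4 * b**3 * d + b**2 * c**2
            - 4 * a * c**3 - 27 * a**2 * d**2)
    if disc > 0:
        return "three distinct real roots"
    if disc < 0:
        return "one real root and a pair of complex conjugate roots"
    # disc == 0: there is a repeated root, and all roots are real
    if b**2 - 3 * a * c == 0:
        return "one real root of multiplicity three"
    return "one real double root and one real simple root"
```

For example, `classify_cubic_roots(1, 0, -3, 2)` handles x^3 - 3x + 2 = (x - 1)^2 (x + 2), where the discriminant vanishes and the auxiliary expression separates the double-root case from the triple-root case. Exact rational arithmetic is used so that the sign tests are not affected by rounding.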
This paper introduces optimization and deoptimization technologies for escape analysis in an open world. These technologies are used in a novel escape analysis framework that has been implemented in the Open Runtime Platform, Intel's open-source Java virtual machine. We introduce the optimization technologies for synchronization removal and object stack allocation, as well as the runtime deoptimization and compensation work. The deoptimization and compensation technologies are crucial for practical escape analysis in an open world. We evaluate the runtime efficiency of the deoptimization and compensation work on benchmarks such as SPECjbb2000 and SPECjvm98.
This paper presents an extension of our Mathematica- and MathCode-based symbolic-numeric framework for solving a variety of partial differential equation (PDE) problems. The main features of our earlier work, which implemented explicit finite-difference schemes, include the ability to handle (1) an arbitrary number of dependent variables, (2) arbitrary dimensionality, and (3) arbitrary geometry, as well as (4) the development of finite-difference schemes to any desired order of approximation. In the present paper, extensions of this framework to implicit schemes and the method of lines are discussed. While C++ code is generated using the MathCode system for the implicit method, Modelica code is generated for the method of lines. The latter provides preliminary PDE support for the Modelica language. Examples illustrating the various aspects of the solver generator are presented.
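As a point of reference for what an explicit finite-difference scheme looks like, here is a minimal hand-written sketch of the forward-time, centered-space (FTCS) scheme for the 1D heat equation with fixed boundary values. It is not output of the framework described above, and the names `ftcs_step` and `solve_heat` are hypothetical.

```python
def ftcs_step(u, r):
    """Advance the 1D heat equation u_t = alpha * u_xx by one explicit time step.

    u: list of grid values; the two end values are held fixed (Dirichlet).
    r: alpha * dt / dx**2; the scheme is stable only for r <= 0.5.
    """
    if not 0 < r <= 0.5:
        raise ValueError("FTCS requires 0 < r <= 0.5 for stability")
    new_u = list(u)
    for i in range(1, len(u) - 1):
        # Second-order centered difference in space, first-order in time
        new_u[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return new_u


def solve_heat(u0, r, steps):
    """Apply ftcs_step repeatedly, starting from the initial profile u0."""
    u = list(u0)
    for _ in range(steps):
        u = ftcs_step(u, r)
    return u
```

Each new value depends only on values from the previous time level, which is what makes the scheme explicit; an implicit scheme would instead require solving a linear system at every step.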
Networks are among major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are...
By applying Fourier analysis, we study the spectral properties of R-filters. Further, we prove that R-filters are a generalization of least squares polynomial adjustment, and we give the geometric interpretation of R-filters.
In this paper, we develop a compositional denotational semantics for prioritized real-time distributed programming languages. One of the interesting features is that it extends the existing compositional theory proposed by Koymans et al. (1988) to prioritized real-time languages while preserving the compositionality of the semantics. The language permits users to define situations in which one action has priority over another without the requirement of preassigning priorities to actions to partially order the alphabet of actions. These features are part of languages such as Ada that were designed specifically with the needs of real-time embedded systems in view. Further, the approach does not have the restrictions of other approaches, such as the rule that prioritized internal moves can preempt unprioritized actions. Our notion of priority in the environment is based on the intuition that a low-priority action can proceed only if the high-priority action cannot proceed due to the lack of a handshaking partner at that point of execution. In other words, if some action is possible in that environment at some point of execution, then the action takes place without unnecessary waiting. The proposed semantic theory provides a clear distinction between the semantic model and the execution model; this has enabled us to fully ensure that there is no unnecessary waiting.
DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based optimization that works seamlessly together with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. Validation in simulation using a set of selected scientific and engineering benchmarks against a few representative memory controller optimizations shows that our method can reduce row buffer miss rates by up to 76% (with an average of 37.4%). This reduction in row buffer miss rates translates into performance speedups of up to 15% (with an average of 5%).
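The effect of page placement on row buffer conflicts can be illustrated with a toy model. The sketch below is not the paper's allocator or simulator: it assumes an open-page policy, one page per DRAM row, and a simple interleaved mapping (bank = frame mod number of banks), and the names `count_row_misses` and `frame_of` are hypothetical.

```python
def count_row_misses(page_trace, frame_of, num_banks):
    """Count row buffer misses for a trace of virtual page accesses.

    page_trace: sequence of virtual page numbers, in access order.
    frame_of:   dict mapping virtual page -> physical frame (the allocation).
    num_banks:  number of DRAM banks; each bank keeps one row open.

    Assumed mapping (illustrative): bank = frame % num_banks,
    row = frame // num_banks, one page per row.
    """
    open_row = {}  # bank -> row currently held in that bank's row buffer
    misses = 0
    for page in page_trace:
        frame = frame_of[page]
        bank, row = frame % num_banks, frame // num_banks
        if open_row.get(bank) != row:
            misses += 1            # cold miss or conflict: row must be opened
            open_row[bank] = row
    return misses


# Two pages accessed alternately, under two different allocations
trace = [0, 1] * 4
same_bank = {0: 0, 1: 4}    # both frames fall in bank 0, different rows
spread_out = {0: 0, 1: 1}   # frames fall in banks 0 and 1
```

With four banks, `count_row_misses(trace, same_bank, 4)` incurs a miss on every access because the two pages keep evicting each other's row, while `count_row_misses(trace, spread_out, 4)` incurs only the two initial cold misses. The trace is identical in both cases; only the allocation differs.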