CCL (checkpointing and communication library) is a software layer in support of optimistic parallel discrete event simulation (PDES) on Myrinet-based COTS clusters. Beyond classical low-latency message delivery functionalities, this library implements CPU-offloaded, non-blocking (asynchronous) checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engine on board Myrinet network cards. These functionalities are unique since optimistic simulation systems conventionally rely on checkpointing implemented as a synchronous, CPU-based data copy. Releases of CCL up to v2.4 only support monoprogrammed non-blocking checkpoints. This forces re-synchronization between CPU and DMA activities, which is a potential source of overhead, each time a new checkpoint request must be issued at the simulation application level while the last issued one is still being carried out by the DMA engine. In this paper we present a redesigned release of CCL (v3.0) that, exploiting hardware capabilities of more advanced Myrinet clusters, supports multiprogrammed non-blocking checkpoints. The multiprogrammed approach allows a higher degree of concurrency between checkpointing and other simulation-specific operations carried out by the CPU, with benefits on performance. We also report the results of the experimental evaluation of those benefits for the case of a Personal Communication System (PCS) simulation application, selected as a real-world test-bed. (c) 2007 Elsevier B.V. All rights reserved.
This paper deals with the problem of aligning data and computations when mapping uniform or affine loop nests onto SPMD distributed memory parallel computers. For affine loop nests we formulate the problem by introducing the communication graph, which can be viewed as the counterpart for the mapping problem of the dependence graph for scheduling. We illustrate the approach with several examples to show the difficulty of the problem. In the simplest case, that of perfect loop nests with uniform dependences, we show that minimizing the number of communications is NP-complete, although we are able to derive a good alignment heuristic in most practical cases.
Homogeneous job systems are systems in which all of a finite set of jobs to be processed by the system have exactly the same processing requirements. This paper assumes that each job first executes an input task requiring an input unit (channel or controller) for some amount of time Tc along with a memory unit. Then it executes a computational task requiring a processing unit and the memory for some amount of time Tp. Under these assumptions, it is possible to derive some inequalities concerning the relative number of memory, input, and processor units which can be efficiently used by the system as a function of Tc and Tp. The scheduling problem is to order tasks and assign resources to them in such a way as to minimize some cost function. The cost functions considered in this paper are job set finishing time and dwell time. Some theorems are stated and proved which yield closed form expressions for the minimum finishing time in batch and in time-shared systems as a function of the number of jobs, memories, processors, input units, and Tc and Tp. The purpose of this study is to derive some general results which aid in the efficient utilization of multiprocessor computer systems. Although this study is directed toward a specific type of homogeneous system, it is shown that the results are applicable to other systems (e.g., systems with output).
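The resource model in this abstract can be explored with a small simulation. The following is only an illustrative sketch, not the paper's closed-form results: it assumes a greedy first-come, first-served policy, and the function name `greedy_finish_time` and its parameters are hypothetical. Each job holds a memory unit from the start of its input task (duration Tc, on an input unit) until the end of its computational task (duration Tp, on a processor).

```python
import heapq


def greedy_finish_time(n_jobs, n_mem, n_in, n_proc, tc, tp):
    """Finishing time of a greedy FIFO schedule (an assumed policy,
    not necessarily the optimal one analysed in the paper)."""
    # Each heap stores the time at which a unit of that resource becomes free.
    mem = [0] * n_mem
    inp = [0] * n_in
    proc = [0] * n_proc
    finish = 0
    for _ in range(n_jobs):
        # The input task needs a memory unit and an input unit together.
        mem_free = heapq.heappop(mem)
        in_free = heapq.heappop(inp)
        end_input = max(mem_free, in_free) + tc
        heapq.heappush(inp, end_input)
        # The computational task needs a processor; the memory is still held.
        proc_free = heapq.heappop(proc)
        end_compute = max(end_input, proc_free) + tp
        heapq.heappush(proc, end_compute)
        heapq.heappush(mem, end_compute)  # memory is released only now
        finish = max(finish, end_compute)
    return finish


if __name__ == "__main__":
    # One memory unit serialises the jobs: 3 * (2 + 3) time units.
    print(greedy_finish_time(3, 1, 1, 1, tc=2, tp=3))
    # A second memory unit lets input overlap with computation.
    print(greedy_finish_time(2, 2, 1, 1, tc=2, tp=3))
```

Varying the unit counts for fixed Tc and Tp gives a rough feel for when extra memory, input, or processor units stop reducing the finishing time, which is the kind of trade-off the inequalities mentioned in the abstract address.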
Noisy intermediate-scale quantum computers are widely used for quantum computing (QC) from quantum cloud providers. Among them, superconducting quantum computers, with their high scalability and mature processing technology based on traditional silicon-based chips, have become the preferred solution for most commercial companies and research institutions to develop QC. However, superconducting quantum computers suffer from fluctuation due to noisy environments. To maintain reliability for every execution, calibration of the quantum processor is essential. During the long procedure to calibrate physical quantum bits (qubits), quantum processors must be taken offline. In this work, we propose a real-time calibration framework (RCF) to execute quantum program tasks and calibrate in-demand qubits simultaneously, without interrupting quantum processors. Across widely used noisy intermediate-scale quantum (NISQ) evaluation benchmark suites such as QASMBench, RCF achieves up to 18% reliability improvement for applications. For reliability on different physical qubits, RCF achieves an average gain of 15.7% (up to 36.7%). For cloud quantum machines, the throughput can be improved up to 9.5 throughput per minute (6.5 on average) based on baseline calibration time. In conclusion, RCF offers a reliable solution for large-scale, long-serving quantum machines.
Multiprogramming systems require that a fair, equitable algorithm be used for the scheduling of jobs. This paper discusses some of the problems associated with this and proposes an automatic job scheduling algorithm. The major parts of the algorithm have been implemented and have been in use for over one year. The user interface is simplified and the operational complexities are minimized. The parameters used for the algorithm are the estimates of the central processor time and the memory required by the job. All types of jobs, including those requiring operator attention during execution, are covered under the scheme. Operational data and the reactions from the users indicate that the results have been as expected.
The exact response time analysis for fixed priority scheduling (FPS) in the lowest priority first-based feasibility tests is commonly required as a part of system design tools. This letter proposes an efficient method for this, which we name the incremental lower bound (ILB) calculation method. Compared to the best algorithm known so far, the incremental calculation method, ILB reduces the feasibility test iterations and run times by more than 38% and 20%, respectively, regardless of utilization and the number of tasks in the task sets.
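For context, exact response time analysis is usually built on the classical fixed-point recurrence R = C_i + sum over higher-priority tasks j of ceil(R / T_j) * C_j. The sketch below shows that standard recurrence only; it is not the ILB method or the incremental calculation method named in the abstract, whose details are not given here. The task representation (tuples of integer cost, period, and deadline, sorted from highest to lowest priority) and the function names are assumptions made for illustration.

```python
def response_time(tasks, i):
    """Worst-case response time of tasks[i], or None if it misses its deadline.

    tasks: list of (C, T, D) integer tuples, highest priority first (assumed).
    """
    cost, _, deadline = tasks[i]
    r = cost  # start the iteration from the task's own execution time
    while True:
        # Interference from every higher-priority task released in [0, r).
        interference = sum(-(-r // t_j) * c_j for c_j, t_j, _ in tasks[:i])
        nxt = cost + interference
        if nxt > deadline:
            return None  # deadline exceeded, so the task is infeasible
        if nxt == r:
            return r  # fixed point reached
        r = nxt


def feasible(tasks):
    """Feasibility test that checks the lowest-priority task first."""
    return all(response_time(tasks, i) is not None
               for i in reversed(range(len(tasks))))


if __name__ == "__main__":
    task_set = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]
    print([response_time(task_set, i) for i in range(len(task_set))])
    print(feasible(task_set))
```

Checking the lowest-priority task first tends to reject infeasible task sets early, because that task suffers the most interference; the number of loop iterations in `response_time` is the kind of cost the abstract reports reducing.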
A description is given of a novel design, using a hierarchy of controllers, that effectively controls a multiuser, multiprogrammed parallel system. Such a structure allows dynamic repartitioning according to changing job requirements. The design goals are examined, and the principles of distributed hierarchical control are presented. Control over processors is discussed. Mapping and load balancing with distributed hierarchical control are considered. Support for gang scheduling as well as availability and fault tolerance is addressed. The use of distributed hierarchical control in memory management and I/O is discussed.
CPUs consume too much power. Modern complex cores sometimes waste power on functions that are not useful for the code they run. In particular, operating system kernels do not benefit from many power-consuming features intended to improve application performance. We advocate asymmetric single-ISA multicore systems, in which some cores are optimized to run OS code at greatly improved energy efficiency.
The high complexity of distributed computer systems requires new methodologies and languages especially designed for the characteristics of these systems. Declarative languages have been proposed as a promising alternative because they provide a way of leaving aside system details. However, the behaviour of reactive systems cannot be described in pure relational or functional terms. We propose a declarative environment for distributed programming based on the concurrent logic language Parlog, which can express concurrency, communication and non-determinism in a very natural way. That is, the intrinsic parallel semantics of concurrent logic languages make them appropriate for distributed programming. The proposed environment is particularly suitable for loosely coupled systems, and it contains mechanisms for distributed process control and for both real-time and object-oriented design. Each of these characteristics is achieved by integrating real-time and distributed processing control primitives and object-oriented constructs into the framework of the underlying concurrent logic language. From this viewpoint, an operational semantics is defined and some implementation issues are discussed.
Analytic queueing models of programs with internal concurrency are considered. The program behavior model allows a process to spawn two or more concurrent tasks at some point during its execution. Except for queueing effects, the tasks execute independently of one another, and at the end of their execution, either wait for all of their siblings to finish execution or merge with the parent if all have finished execution. Two approximate solution methods for the performance prediction of such systems are developed, and results of the approximations are compared to those of simulations. The approximations are both computationally efficient and highly accurate. The gain in performance due to multitasking and multiprocessing is studied with a series of examples.