Spatial multi-programming is one of the most efficient multi-programming methods on Graphics Processing Units (GPUs). This multi-programming scheme creates variety in the resource requirements of streaming multiprocessors (SMs) and opens opportunities for sharing the unused portion of each SM's resources with other SMs. Although this approach drastically improves GPU performance, in some cases it leads to performance degradation due to the shortage of resources allocated to each program. Considering shared memory as one of the main bottlenecks of thread-level parallelism (TLP), in this paper we propose an adaptive shared-memory sharing architecture, called ASHA. ASHA enhances spatial multi-programming performance and increases the utilization of GPU resources. Experimental results demonstrate that ASHA improves the speedup of a multi-programmed GPU by 17%-21%, on average, for 2- to 8-program execution scenarios, respectively. (C) 2016 Published by Elsevier B.V.
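The shared-memory bottleneck the abstract refers to can be illustrated with a back-of-the-envelope occupancy calculation. The sketch below is not ASHA's algorithm; it only shows, under assumed per-SM capacities and hypothetical per-block demands, how lending the unused portion of one program's shared-memory partition to a co-scheduled program raises the number of resident thread blocks (TLP).

# Hypothetical illustration of shared-memory-limited TLP under spatial multi-programming.
# All numbers (capacity, block limit, per-block demands) are assumptions, not figures from the paper.

SM_SHARED_MEM = 96 * 1024      # assumed shared-memory capacity of one SM, in bytes
MAX_BLOCKS_PER_SM = 8          # assumed hardware cap on resident thread blocks

def resident_blocks(shmem_per_block, shmem_budget):
    """Thread blocks that fit in a given shared-memory budget."""
    if shmem_per_block == 0:
        return MAX_BLOCKS_PER_SM
    return min(MAX_BLOCKS_PER_SM, shmem_budget // shmem_per_block)

prog_a_shmem = 24 * 1024       # shared-memory-hungry kernel (assumed)
prog_b_shmem = 2 * 1024        # kernel that barely uses shared memory (assumed)

half = SM_SHARED_MEM // 2      # static partition: each program gets 48 KB
static_a = resident_blocks(prog_a_shmem, half)   # 2 blocks
static_b = resident_blocks(prog_b_shmem, half)   # 8 blocks (hits the block limit, uses only 16 KB)

# Adaptive sharing: lend program A the shared memory that B's resident blocks leave unused.
unused_by_b = half - static_b * prog_b_shmem     # 32 KB of slack
shared_a = resident_blocks(prog_a_shmem, half + unused_by_b)   # 3 blocks

print("static partition :", static_a + static_b, "resident thread blocks")   # 10
print("adaptive sharing :", shared_a + static_b, "resident thread blocks")   # 11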
ISBN (print): 9781450341219
Current processors provide a variety of different processing units to improve performance and power efficiency. For example, ARM's big.LITTLE, AMD's APUs, and Oracle's M7 provide heterogeneous processors, on-die GPUs, and on-die accelerators. However, the performance experienced by programs using these processing units can vary widely due to contention from multiprogramming, thermal constraints, and other issues. In these systems, the decision of where to execute a task must consider not only the execution time of the task, but also current system conditions. We built Rinnegan, a Linux kernel extension and runtime library, to perform scheduling and handle task placement in heterogeneous systems. The Rinnegan kernel extension monitors and reports the utilization of all processing units to applications, which then make placement decisions at user level. The Rinnegan runtime provides a performance model to predict the speedup and overhead of offloading a task. With this model and the current utilization of processing units, the runtime can select the task placement that best achieves an application's performance goals, such as low latency, high throughput, or real-time deadlines. When integrated with StarPU, a runtime system for heterogeneous architectures, Rinnegan improves StarPU by performing 1.5-2x better than its native scheduling policies in a shared heterogeneous environment.
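The placement decision described above, combining a per-unit speedup/overhead prediction with current utilization and picking the unit that best meets the goal, can be sketched as follows. The unit names, speedups, overheads, and the simple contention model are illustrative assumptions, not Rinnegan's actual model or API.

# Hypothetical task-placement sketch in the spirit of a utilization-aware runtime:
# predicted time on a unit = cpu_time / speedup + offload_overhead, inflated by how busy
# the unit currently is. All numbers and the contention model are assumptions.

def predicted_time(cpu_time_s, unit, utilization):
    base = cpu_time_s / unit["speedup"] + unit["offload_overhead_s"]
    # crude contention model: a unit that is 80% utilized runs this task ~5x slower
    return base / max(1.0 - utilization, 0.05)

units = {
    "cpu":         {"speedup": 1.0,  "offload_overhead_s": 0.0},
    "on_die_gpu":  {"speedup": 6.0,  "offload_overhead_s": 0.002},
    "accelerator": {"speedup": 20.0, "offload_overhead_s": 0.010},
}

# Utilization as it might be reported by a kernel-side monitor (assumed values).
utilization = {"cpu": 0.70, "on_die_gpu": 0.20, "accelerator": 0.95}

def place(cpu_time_s):
    """Pick the processing unit with the lowest predicted completion time."""
    return min(units, key=lambda u: predicted_time(cpu_time_s, units[u], utilization[u]))

print(place(0.0001))  # tiny task: offload overhead dominates, so it stays on the CPU
print(place(0.050))   # larger task: the lightly loaded on-die GPU wins
print(place(5.0))     # the 20x accelerator is 95% busy, so the GPU still wins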
Multi-threaded processors execute multiple threads concurrently in order to increase overall throughput. It is well documented that multi-threading affects per-thread performance but, more importantly, some threads are affected more than others. This is especially troublesome for multi-programmed workloads. Fairness metrics measure whether all threads are affected equally. However, defining equal treatment is not straightforward. Several fairness metrics for multi-threaded processors have been utilized in the literature, although there does not seem to be a consensus on which metric does the best job of measuring fairness. This paper reviews the prevalent fairness metrics and analyzes their main properties. Each metric strikes a different trade-off between fairness in the strict sense and throughput. We categorize the metrics with respect to this property. Based on experimental data for SMT processors, we suggest using the minimum fairness metric in order to balance fairness and throughput.
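As a concrete illustration of what such metrics compute, the sketch below takes per-thread normalized progress (the thread's IPC when co-running divided by its IPC when running alone, assumed inputs) and evaluates a few commonly used throughput and fairness measures, including a minimum-based one in the spirit of the metric the abstract recommends. The formulas are standard textbook ones, not necessarily the exact definitions used in this paper.

# Per-thread normalized progress NP_i = IPC_i(co-running) / IPC_i(alone).
# The example values are assumptions for illustration only.
normalized_progress = {"t0": 0.80, "t1": 0.55, "t2": 0.30}

vals = list(normalized_progress.values())
n = len(vals)

system_throughput = sum(vals)                  # STP / weighted speedup
hmean = n / sum(1.0 / v for v in vals)         # harmonic mean of normalized progress
minimum = min(vals)                            # minimum fairness: the worst-treated thread
min_max_ratio = min(vals) / max(vals)          # 1.0 means all threads are slowed down equally

print(f"STP={system_throughput:.2f}  Hmean={hmean:.2f}  "
      f"min={minimum:.2f}  min/max={min_max_ratio:.2f}")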
Multi-threading has been proposed as an execution model for massively parallel processors. Due to the large amount of potential parallelism, resource management is a critical issue in multi-threaded architectures. The challenge of multi-threading is to hide latency by switching among a set of ready threads and thus to improve processor utilization. Threads are dynamically scheduled for execution based on the availability of data. In this paper, two hybrid open queuing network models are proposed. Two sets of processors exist: synchronization processors and execution processors. Each processor is modeled either as a single server serving a single queue or as multiple servers serving a single queue. Performance measures such as response times, system throughput, and average queue lengths are evaluated for both hybrid models. The utilizations of the two models are derived and compared with each other. A mean value analysis is performed and different performance measures are plotted. Crown copyright (C) 2008 Published by Elsevier B.V. All rights reserved.
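For the single-queue stations mentioned above, the standard open-network building blocks are the M/M/1 (single server) and M/M/c (multiple servers) results for response time and queue length. The sketch below evaluates them for assumed arrival and service rates; it illustrates the kind of measures the paper derives, not the paper's specific hybrid models.

import math

def mm1(lam, mu):
    """M/M/1: utilization, mean response time, mean number in system."""
    rho = lam / mu
    assert rho < 1, "queue is unstable"
    R = 1.0 / (mu - lam)          # mean response time
    L = rho / (1.0 - rho)         # mean number in system (L = lam * R)
    return rho, R, L

def mmc(lam, mu, c):
    """M/M/c: Erlang-C waiting probability, mean response time, mean queue length."""
    rho = lam / (c * mu)
    assert rho < 1, "queue is unstable"
    a = lam / mu                  # offered load in Erlangs
    tail = a**c / (math.factorial(c) * (1.0 - rho))
    erlang_c = tail / (sum(a**k / math.factorial(k) for k in range(c)) + tail)
    Wq = erlang_c / (c * mu - lam)    # mean waiting time in queue
    R = Wq + 1.0 / mu                 # mean response time
    Lq = lam * Wq                     # mean queue length (Little's law)
    return rho, R, Lq

# e.g. synchronization processors as one M/M/1 station, execution processors as an M/M/4 station
print(mm1(lam=8.0, mu=10.0))
print(mmc(lam=30.0, mu=10.0, c=4))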
In this paper, a closed queuing network model with both single and multiple servers is proposed to model dataflow in a multi-threaded architecture. Multi-threading is useful in reducing latency by switching among a set of threads in order to improve processor utilization. Two sets of processors exist: synchronization processors and execution processors. Synchronization processors handle load/store operations, and execution processors handle arithmetic/logic and control operations. A closed queuing network model is suitable for a large number of job arrivals. The normalization constant is derived using a recursive algorithm for the given model. State diagrams are drawn from the hybrid closed queuing network model, and the steady-state balance equations are derived from them. Performance measures such as average response times and average system throughput are derived and plotted against the total number of processors in the closed queuing network model. Other important performance measures such as processor utilizations, average queue lengths, average waiting times, and relative utilizations are also derived. (c) 2005 Elsevier Ltd. All rights reserved.
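One standard recursive algorithm for the normalization constant of a product-form closed network is Buzen's convolution algorithm. The sketch below shows it for load-independent (single-server) stations with assumed relative utilizations, together with the throughput and utilization formulas that follow from G. The paper's model also contains multi-server stations, which would need the load-dependent generalization not shown here.

def buzen_G(demands, N):
    """Normalization constants G(0..N) for a closed product-form network with
    load-independent stations; demands[m] is station m's relative utilization
    (visit ratio * mean service time)."""
    g = [1.0] + [0.0] * N
    for d in demands:                 # fold in one station at a time
        for n in range(1, N + 1):     # G(n, m) = G(n, m-1) + d_m * G(n-1, m)
            g[n] += d * g[n - 1]
    return g

demands = [0.4, 0.3, 0.6]             # assumed relative utilizations of three stations
N = 10                                # number of circulating jobs (threads)
g = buzen_G(demands, N)

throughput = g[N - 1] / g[N]                        # system throughput: X(N) = G(N-1)/G(N)
utilizations = [d * throughput for d in demands]    # U_m = D_m * X(N)

print(f"G(N)={g[N]:.4g}  X={throughput:.3f}  U={['%.2f' % u for u in utilizations]}")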
As far as scheduling is concerned, there are two kinds of semaphores: weak and strong. When solving mutual exclusion problems, one typically assumes the existence of strong semaphores. A program is derived that demonstrates that strong semaphores can be implemented by weak ones. The techniques employed for the derivation are standard, and with the arguments used in the derivation it is straightforward to deduce a formal correctness proof, which, for the purpose of this analysis, is considered superfluous. In this program, as in earlier programs that implement strong semaphores, the order of two V-operations -- V(enter) and V(queue) -- turns out to be critical. However, the reason is apparent: it merely stems from a mutual exclusion problem.
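The construction below is one standard way to obtain a strong (FIFO) semaphore from weak primitives: each waiter blocks on a private semaphore, and V hands the permit to the longest-waiting thread. It is a sketch of the general idea in Python, not the paper's derived program, and it sidesteps the V(enter)/V(queue) ordering issue the abstract discusses by protecting all state with a single mutex.

import threading
from collections import deque

class StrongSemaphore:
    def __init__(self, initial):
        self._mutex = threading.Semaphore(1)   # weak binary semaphore guarding the state
        self._value = initial
        self._queue = deque()                  # FIFO queue of private semaphores, one per waiter

    def P(self):
        self._mutex.acquire()
        if self._value > 0:
            self._value -= 1
            self._mutex.release()
            return
        gate = threading.Semaphore(0)          # private weak semaphore for this waiter
        self._queue.append(gate)
        self._mutex.release()
        gate.acquire()                         # blocks until some V passes the permit to this waiter

    def V(self):
        self._mutex.acquire()
        if self._queue:
            gate = self._queue.popleft()       # wake the longest-waiting thread first (FIFO)
            self._mutex.release()
            gate.release()                     # hand the permit directly to that waiter
        else:
            self._value += 1
            self._mutex.release()

With this "passing the baton" scheme, a thread that repeatedly loops on P and V cannot overtake threads already queued, which is exactly the property that distinguishes a strong semaphore from a weak one.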
The SPS accelerator presents a considerable industrial control problem with the additional complication that the control procedures are never fixed. Right from the beginning it was decided to base the control system on a distributed network making use of an interpretive language for the control processes. The success of these decisions can be seen from the fact that over the last six years, the system has grown to a network of more than 50 computers spread over a ten square kilometer site, all the time controlling an ever-changing accelerator complex. This paper will discuss the major elements of the strategy used and explain the reason for their choice. Microprocessors have become very popular in the field of industrial control and the SPS control system is going to integrate this trend with little difficulty. The paper will show that the SPS approach is ideally suited to the construction of a real-time control network making use only of microprocessor based units.
The deadlock avoidance problem may be defined informally as the determination, from some a priori information about the processes, resources, operating system, etc., of the 'safe situations' which may be realized without endangering the smooth running of the system. When each process specifies its future needs by a flowchart of need-defined steps, a global approach to the phenomenon and its interpretation as a game between the operating system and the processes allows formalization of risk and safety concepts. The bipartite graph representation of this game may then be used to construct explicitly the set of safe states and to study their properties.
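The best-known concrete instance of such a safety test is the banker's-style check: a state is safe if some order exists in which every process can obtain its remaining needs and run to completion. The sketch below implements that classical check for illustration; the paper's game-theoretic, flowchart-based formalization is more general.

def is_safe(available, allocation, need):
    """Classical safe-state test. available is a per-resource vector;
    allocation and need have one per-resource row per process."""
    work = list(available)
    finished = [False] * len(allocation)
    progress = True
    while progress:
        progress = False
        for i, (alloc, nd) in enumerate(zip(allocation, need)):
            if not finished[i] and all(n <= w for n, w in zip(nd, work)):
                # process i can run to completion and release everything it holds
                work = [w + a for w, a in zip(work, alloc)]
                finished[i] = True
                progress = True
    return all(finished)

# Assumed example with one resource type and three processes.
print(is_safe(available=[3],
              allocation=[[5], [2], [2]],
              need=[[5], [2], [3]]))   # True: P1, then P2, then P0 can finish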
The Venus Operating System is an experimental multiprogramming system which supports five or six concurrent users on a small computer. The system was produced to test the effect of machine architecture on complexity o...