In this paper we analyse a very simple dynamic work-stealing algorithm. In the work-generation model, there are n generators which are arbitrarily distributed among a set of n processors. During each time-step, with probability λ, each generator generates a unit-time task which it inserts into the queue of its host processor. After the new tasks are generated, each processor removes one task from its queue and services it. Clearly, the work-generation model allows the load to grow more and more imbalanced, so, even when λ < 1, the system load can be unbounded. The natural work-stealing algorithm that we analyse works as follows. During each time step, each empty processor sends a request to a randomly selected other processor. Any non-empty processor having received at least one such request in turn decides (again randomly) in favour of one of the requests. The number of tasks which are transferred from the non-empty processor to the empty one is determined by the so-called work-stealing function f. We analyse the long-term behaviour of the system as a function of λ and f. We show that the system is stable for any constant generation rate λ < 1 and for a wide class of functions f. We give a quantitative description of the functions f which lead to stable systems. Furthermore, we give upper bounds on the average system load (as a function of f and n).
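One synchronous round of the model described above can be sketched as follows. This is an illustrative simulation, not the paper's analysis: the ordering of generation, stealing, and service within a step, and the helper name `work_stealing_step`, are assumptions for the sketch.

```python
import random

def work_stealing_step(queues, lam, f, rng=random):
    """One time step of the work-generation/work-stealing model (a sketch).

    queues -- list of per-processor task counts
    lam    -- generation rate λ (each generator adds a task w.p. λ)
    f      -- work-stealing function: victim load -> number of tasks stolen
    """
    n = len(queues)
    # Each generator inserts a unit-time task with probability λ.
    for i in range(n):
        if rng.random() < lam:
            queues[i] += 1
    # Each empty processor sends a request to a uniformly random *other* processor.
    requests = {}  # victim index -> list of requesters
    for i in range(n):
        if queues[i] == 0:
            victim = rng.randrange(n - 1)
            if victim >= i:
                victim += 1  # skip self
            requests.setdefault(victim, []).append(i)
    # A non-empty processor with requests grants one of them at random,
    # transferring f(load) tasks (capped by its current load).
    for victim, reqs in requests.items():
        if queues[victim] > 0:
            winner = rng.choice(reqs)
            k = min(f(queues[victim]), queues[victim])
            queues[victim] -= k
            queues[winner] += k
    # Finally, each non-empty processor removes and services one task.
    for i in range(n):
        if queues[i] > 0:
            queues[i] -= 1
    return queues
```

For example, with two processors, no generation, and f stealing half the victim's load, `work_stealing_step([4, 0], 0.0, lambda l: l // 2)` rebalances to `[2, 2]` and then services one task on each, leaving `[1, 1]`.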
The InfiniBand architecture (IBA) is becoming an industry standard both for communication between processing nodes and I/O devices and for interprocessor communication. It replaces the traditional I/O bus with a switch-based interconnect for connecting processing nodes and I/O devices. It is being developed by the InfiniBand℠ Trade Association (IBTA) to provide the levels of reliability, availability, performance, scalability, and quality of service (QoS) necessary for present and future server systems. To this end, IBA provides a series of mechanisms that are able to guarantee QoS to applications. Alfaro, Sanchez and Das (see Proceedings of the International Parallel and Distributed Processing Symposium, April 2002) proposed a strategy to compute the InfiniBand arbitration tables. In that work, the proposal was evaluated only for CBR traffic with fixed mean bandwidth requirements. In this paper, we evaluate our strategy for computing the InfiniBand arbitration tables with VBR traffic. Performance results show that this class of traffic also meets its QoS requirements.
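The flavour of such an arbitration table can be illustrated with a toy filler. This is not the Alfaro et al. algorithm: the function name, the proportional-share heuristic, and the simple interleaving are all assumptions; a real IBA table also carries per-entry weights and high/low-priority sub-tables.

```python
from itertools import zip_longest

def arbitration_entries(bw_requirements, table_size=32):
    """Toy sketch: distribute arbitration-table entries among virtual lanes
    (VLs) in proportion to each VL's mean reserved bandwidth.

    bw_requirements -- dict mapping VL id -> mean bandwidth requirement
    Returns an ordered list of VL ids (the table's service order).
    """
    total = sum(bw_requirements.values())
    # Number of table entries each VL gets, at least one per VL.
    per_vl = {vl: max(1, round(table_size * bw / total))
              for vl, bw in bw_requirements.items()}
    lanes = [[vl] * n for vl, n in sorted(per_vl.items())]
    # Interleave the per-VL entry lists so every VL is revisited regularly
    # instead of being served in one long burst.
    return [vl for group in zip_longest(*lanes) for vl in group
            if vl is not None]
```

For instance, `arbitration_entries({0: 3, 1: 1}, table_size=4)` gives VL 0 three entries and VL 1 one entry, interleaved as `[0, 1, 0, 0]`.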
ISBN:
(print) 9781665462723
The performance and energy costs of coordinating and performing data movement have led to proposals adding compute units and/or specialized access units to the memory hierarchy. However, current on-chip offload models are restricted to fixed compute and access pattern types, which limits software-driven optimizations and the applicability of such an offload interface to heterogeneous accelerator resources. This paper presents a computation offload interface for multi-core systems augmented with distributed on-chip accelerators. With energy efficiency as the primary goal, we define mechanisms to identify offload partitioning, create a low-overhead execution model to sequence these fine-grained operations, and evaluate a set of workloads to identify the complexity needed to achieve distributed near-data processing. We demonstrate that our model and interface, combining features of dataflow in parallel with near-data processing engines, can be profitably applied to memory hierarchies augmented with either specialized compute substrates or lightweight near-memory cores. We differentiate the benefits stemming from each of elevating data access semantics, near-data computation, inter-accelerator coordination, and compute/access logic specialization. Experimental results indicate a geometric mean (energy efficiency improvement; speedup; data movement reduction) of (3.3; 1.59; 2.4)×, (2.46; 1.43; 3.5)× and (1.46; 1.65; 1.48)× compared to an out-of-order processor, a monolithic accelerator with centralized accesses, and a monolithic accelerator with decentralized accesses, respectively. Evaluating both lightweight-core and CGRA-fabric implementations highlights the model's flexibility and quantifies the benefits of compute specialization for energy efficiency and speedup at 1.23× and 1.43×, respectively.
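The idea of pairing an access pattern with a compute op in a fine-grained offload can be sketched conceptually. The `Offload` type and `run_offloads` sequencer below are illustrative inventions, not the paper's actual interface: they only show the separation of "which data to touch" from "what to compute" that such an offload descriptor expresses.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Offload:
    """One fine-grained near-data operation (illustrative): an access
    pattern (indices to touch) paired with a reduction-style compute op."""
    indices: List[int]
    op: Callable[[float, float], float]
    init: float

def run_offloads(memory, offloads):
    """Toy sequencer: each offload reduces its slice of `memory` where the
    data lives, instead of streaming every element through the host core."""
    results = []
    for o in offloads:
        acc = o.init
        for i in o.indices:
            acc = o.op(acc, memory[i])
        results.append(acc)
    return results
```

For example, a sum over the first half of a buffer and a max over the second half are two independent offloads that a distributed substrate could service near separate banks.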
ISBN:
(print) 9781665433266
This work aims at the development of tools for supporting the modelling and analysis of timed systems by Stochastic Reward Nets (SRN). In a first approach, a formal reduction of SRN onto Timed Automata (TA) was proposed and experimented with in the context of the popular Uppaal toolbox. The reduction has the merit of allowing both exhaustive model checking of an SRN model, useful for the assessment of qualitative properties (e.g., absence of deadlocks, occurrence of particular event sequences, etc.), and quantitative analysis through the statistical model checker, which is based on simulations. However, although Uppaal enabled formal reasoning on the semantics of SRN, its practical usage suffers from scalability problems; that is, it can introduce severe limitations in time and space when studying complex models. To cope with this problem, this paper describes a Java implementation of the SRN operational core engine on top of the lock-free and efficient Theatre actor system, which permits the parallel simulation of large models. The realization can be used for functional property checking on an untimed version of a source SRN model, and for quantitative estimation of measures through simulations. The paper discusses the design and implementation of the core engine of SRN on top of Theatre, together with the supported intuitive configuration process of an SRN model, and reports some experimental results using a scalable grid computing model. The experiments confirm that Theatre/SRN is capable of exploiting the potential of modern multi-core machines and can deliver good execution performance on large models.
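The operational core of a stochastic Petri net simulation can be sketched in a few lines. This is a generic toy, not the Theatre/SRN engine: it assumes exponentially distributed firing delays and a race policy (the enabled transition sampling the smallest delay fires), which is the standard SRN semantics but simplified here to a single sequential step.

```python
import math
import random

def fire_once(marking, transitions, rng=random):
    """One firing step of a stochastic-Petri-net style model (toy sketch).

    marking     -- dict mapping place -> token count
    transitions -- list of (inputs, outputs, rate), where inputs/outputs
                   map place -> tokens consumed/produced and `rate` is the
                   exponential firing rate
    Returns (new marking, elapsed delay); delay is inf if nothing is enabled.
    """
    enabled = [t for t in transitions
               if all(marking.get(p, 0) >= k for p, k in t[0].items())]
    if not enabled:
        return marking, math.inf
    # Race policy: every enabled transition samples an exponential delay;
    # the minimum-delay transition fires.
    delays = [(rng.expovariate(rate), ins, outs) for ins, outs, rate in enabled]
    delay, ins, outs = min(delays, key=lambda d: d[0])
    new = dict(marking)
    for p, k in ins.items():
        new[p] -= k
    for p, k in outs.items():
        new[p] = new.get(p, 0) + k
    return new, delay
```

Iterating `fire_once` and accumulating the delays yields a trajectory from which reward measures can be estimated by simulation, which is the quantitative-analysis side described above.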
ISBN:
(print) 9781450319102
The modern parallel I/O stack consists of several software layers with complex inter-dependencies and performance characteristics. While each layer exposes tunable parameters, it is often unclear to users how different parameter settings interact with each other and affect overall I/O performance. As a result, users often resort to default system settings, which typically yield poor I/O bandwidth. In this research, we develop a benchmark-guided auto-tuning framework for tuning the HDF5, MPI-IO, and Lustre layers on production supercomputing facilities. Our framework consists of three main components. H5Tuner uses a control file to adjust I/O parameters without modifying or recompiling the application. H5PerfCapture records performance metrics for HDF5 and MPI-IO. H5Evolve uses a genetic algorithm to explore the parameter space to determine well-performing configurations. We demonstrate I/O performance results for three HDF5 application-based benchmarks on a Sun HPC system. All the benchmarks running on 512 MPI processes perform 3X to 5.5X faster with the auto-tuned I/O parameters compared to a configuration with default system parameters.
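The genetic-algorithm search over I/O parameter configurations can be sketched as follows. The function name `evolve`, the population/mutation settings, and the example parameter names (`stripe_count`, `stripe_size_mb`) are illustrative assumptions, not H5Evolve's actual interface; in the real framework the fitness of a configuration is the measured I/O bandwidth of a benchmark run.

```python
import random

def evolve(param_space, fitness, pop_size=8, generations=10, rng=random):
    """Minimal genetic-algorithm sketch for parameter auto-tuning.

    param_space -- dict mapping parameter name -> list of candidate values
    fitness     -- scores a configuration dict (higher is better,
                   e.g. measured I/O bandwidth)
    """
    def random_config():
        return {k: rng.choice(v) for k, v in param_space.items()}

    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half of the population as survivors.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # Crossover: each parameter comes from one of the two parents.
            child = {k: rng.choice((a[k], b[k])) for k in param_space}
            # Occasional mutation: re-randomize one parameter.
            if rng.random() < 0.2:
                k = rng.choice(list(param_space))
                child[k] = rng.choice(param_space[k])
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```

A usage sketch: with `param_space = {'stripe_count': [1, 2, 4, 8], 'stripe_size_mb': [1, 4, 16]}` and a fitness function that times a benchmark run, `evolve` returns the best configuration found within the evaluation budget.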