In this paper, we study how to minimize the latency of a message through a network that consists of a number of store-and-forward stages. This research is especially relevant for today's low overhead communication systems that employ dedicated processing elements for protocol processing. We develop an abstract pipeline model that reveals a crucial performance tradeoff involving the effects of the overhead of the bottleneck stage and the bandwidth of the remaining stages. We exploit this tradeoff to develop a suite of fragmentation algorithms designed to minimize message latency. We also provide an experimental methodology that enables the construction of customized pipeline algorithms that can adapt to the specific system characteristics and application workloads. By applying this methodology to the Myrinet-GAM system, we have improved its latency by up to 51%. Our theoretical framework is also applicable to pipelined systems beyond the context of high speed networks.
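A minimal sketch of the tradeoff described above, assuming n identical stages with per-fragment overhead o (seconds) and per-stage bandwidth B (bytes/s); the latency formula and the closed-form optimum below are our illustration, not the paper's algorithms:

```python
import math

def pipeline_latency(m, n, o, B, k):
    """Latency of an m-byte message split into k equal fragments and
    pushed through n identical store-and-forward stages."""
    per_fragment = o + m / (k * B)        # one stage's time per fragment
    return (n + k - 1) * per_fragment     # pipeline fill + drain

def best_fragment_count(m, n, o, B):
    """Closed-form optimum from dT/dk = 0, checked against neighbors."""
    k_star = math.sqrt((n - 1) * m / (o * B))
    candidates = {max(1, math.floor(k_star)), max(1, math.ceil(k_star))}
    return min(candidates, key=lambda k: pipeline_latency(m, n, o, B, k))

# Hypothetical numbers: 64 KB message, 4 stages, 5 us overhead, 100 MB/s.
m, n, o, B = 64 * 1024, 4, 5e-6, 100e6
k = best_fragment_count(m, n, o, B)
print(k, pipeline_latency(m, n, o, B, k), pipeline_latency(m, n, o, B, 1))
```

Fragmenting more improves pipelining but pays the per-fragment overhead more often; the square-root optimum is where the two effects balance.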
In this paper we systematically examine various performance issues involved in the coordinated allocation of processor and disk resources in large-scale parallel computer systems. Models are formulated to investigate the I/O and computation behavior of parallel programs and workloads, and to analyze parallel scheduling policies under such workloads. These models are parameterized by measurements of parallel programs, and they are solved via analytic methods and simulation. Our results provide important insights into the performance of parallel applications and resource management strategies when I/O demands are not negligible.
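As a toy illustration of the coordination question (not the paper's models), suppose each phase of a job performs C seconds of compute work and D seconds of I/O, each perfectly parallelizable over its resource; splitting a fixed budget between processors and disks then reduces to:

```python
def phase_time(C, D, p, d):
    """One compute+I/O phase under a deliberately crude model:
    C seconds of compute spread over p processors, D seconds of
    I/O spread over d disks, with no overlap between the two."""
    return C / p + D / d

def best_split(C, D, total):
    """Choose processors vs. disks from a fixed budget of resource units."""
    return min(((p, total - p) for p in range(1, total)),
               key=lambda pd: phase_time(C, D, *pd))

print(best_split(C=8.0, D=2.0, total=16))  # compute-heavy -> more processors
print(best_split(C=2.0, D=8.0, total=16))  # I/O-heavy -> more disks
```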
This paper reports on a set of experiments that measure the amount of CPU processing needed to decode MPEG-compressed video in software. These experiments were designed to discover indicators that could be used to predict how many cycles are required to decode a given frame. Such predictors can be used to perform more accurate CPU scheduling. We found that by considering both frame type and size, it is possible to construct a linear model of MPEG decoding with R² values of 0.97 and higher. Moreover, this model can be used to predict decoding times at both the frame and packet level that are almost always accurate to within 25% of the actual decode times. This is a surprising result given the large variability in MPEG decoding times, and it suggests that it is feasible to design systems that make quality-of-service guarantees for MPEG-encoded video.
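A sketch of the per-frame-type linear model the abstract describes; the fitting code and names below are ours, assuming (type, size, cycles) samples gathered from an instrumented decoder:

```python
import numpy as np

def fit_decode_model(samples):
    """Fit cycles ~ a + b*size separately for each frame type (I/P/B).
    `samples` is a list of (frame_type, size_bytes, decode_cycles)."""
    model = {}
    for ftype in {t for t, _, _ in samples}:
        sizes = np.array([s for t, s, _ in samples if t == ftype], float)
        cycles = np.array([c for t, _, c in samples if t == ftype], float)
        b, a = np.polyfit(sizes, cycles, 1)           # slope, intercept
        pred = a + b * sizes
        ss_res = float(np.sum((cycles - pred) ** 2))
        ss_tot = float(np.sum((cycles - cycles.mean()) ** 2))
        model[ftype] = (a, b, 1.0 - ss_res / ss_tot)  # intercept, slope, R^2
    return model

def predict_cycles(model, ftype, size):
    a, b, _ = model[ftype]
    return a + b * size
```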
In many real applications, for example those with frequent and irregular communication patterns or those using large messages, network contention and contention for message processing resources can be a significant part of the total execution time. This paper presents a new cost model, called LoGPC, that extends the LogP [9] and LogGP [4] models to account for the impact of network contention and network interface DMA behavior on the performance of message-passing programs. We validate LoGPC by analyzing three applications implemented with Active Messages [11, 18] on the MIT Alewife multiprocessor. Our analysis shows that network contention accounts for up to 50% of the total execution time. In addition, we show that the impact of communication locality on the communication costs is at most a factor of two on Alewife. Finally, we use the model to identify tradeoffs between synchronous and asynchronous message passing styles.
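For orientation, the LogGP cost of one k-byte message is usually written o + (k-1)G + L + o; below is a hedged sketch of layering a contention term on top. LoGPC derives that term from queueing models of the network and the interface's DMA engine, which we do not reproduce here; the additive composition is our simplification:

```python
def loggp_cost(k, L, o, G):
    """LogGP end-to-end cost of a k-byte message: send overhead,
    per-byte gap for the payload, wire latency, receive overhead
    (symmetric overhead o assumed)."""
    return o + (k - 1) * G + L + o

def logpc_cost(k, L, o, G, contention_stall):
    """LoGPC-style cost as a simple additive composition (our
    simplification): LogGP plus a caller-supplied stall estimate."""
    return loggp_cost(k, L, o, G) + contention_stall
```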
Performance-oriented studies typically rely on the assumption that the stochastic process modeling the phenomenon of interest is already in steady state. This assumption is, however, not valid if the life cycle of the phenomenon under study is not long enough, since a stochastic process generally cannot reach steady state unless time evolves towards infinity. Therefore, it is important to address performance issues in the transient state. Previous work in transient analysis of queueing systems usually focuses on Markov models. This paper, in contrast, presents an analysis of transient loss performance for a class of finite buffer queueing systems that are not necessarily Markovian. We obtain closed-form transient loss performance measures. Based on the loss measures, we compare transient loss performance against steady-state loss performance and examine how different assumptions on the arrival process affect transient loss behavior of the queueing system. We also discuss how to guarantee transient loss performance. The analysis is illustrated with numerical results.
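A small Markovian stand-in for the comparison made above (the paper's analysis is closed-form and covers non-Markovian systems): estimate the loss fraction of an M/M/1/K queue over a finite horizon, starting empty, against its steady-state loss probability:

```python
import random

def mm1k_transient_loss(lam, mu, K, horizon, runs=2000):
    """Fraction of arrivals dropped in [0, horizon], starting empty."""
    lost = arrived = 0
    for _ in range(runs):
        q, next_arr, next_dep = 0, random.expovariate(lam), float("inf")
        while min(next_arr, next_dep) <= horizon:
            if next_arr <= next_dep:                  # arrival event
                t, arrived = next_arr, arrived + 1
                if q == K:
                    lost += 1                         # buffer full: drop
                else:
                    q += 1
                    if q == 1:
                        next_dep = t + random.expovariate(mu)
                next_arr = t + random.expovariate(lam)
            else:                                     # departure event
                t, q = next_dep, q - 1
                next_dep = t + random.expovariate(mu) if q else float("inf")
    return lost / arrived

def mm1k_steady_loss(lam, mu, K):
    """Steady-state loss probability (by PASTA), for rho != 1."""
    rho = lam / mu
    return (1 - rho) * rho ** K / (1 - rho ** (K + 1))

print(mm1k_transient_loss(0.9, 1.0, K=5, horizon=20.0))
print(mm1k_steady_loss(0.9, 1.0, K=5))
```

Starting from an empty buffer, the finite-horizon loss fraction sits below the steady-state value, which is exactly the gap transient analysis is after.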
Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating processes. The principal mechanism involved is two-phase spin-blocking: a process waiting for a message response spins for some amount of time, and then relinquishes the processor if the response does not arrive. In this paper, we describe our experience implementing implicit coscheduling on a cluster of 16 UltraSPARC I workstations; this has led to contributions in three main areas. First, we more rigorously analyze the two-phase spin-block algorithm and show that spin time should be increased when a process is receiving messages. Second, we present performance measurements for a wide range of synthetic benchmarks and for seven Split-C parallel applications. Finally, we show how implicit coscheduling behaves under different job layouts and scaling, and discuss preliminary results for achieving fairness.
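A minimal sketch of two-phase spin-blocking, assuming the message layer signals a condition variable when the response arrives; the signaling path and any spin-time adaptation are ours to illustrate:

```python
import time
import threading

def spin_block_wait(arrived, cond, spin_time):
    """Phase 1: poll `arrived()` for up to spin_time seconds.
    Phase 2: relinquish the processor and sleep on `cond`, which the
    receive path must notify after making arrived() true."""
    deadline = time.monotonic() + spin_time
    while time.monotonic() < deadline:      # spin phase
        if arrived():
            return True                     # response came while spinning
    with cond:                              # block phase
        while not arrived():
            cond.wait()
    return False
```

Per the paper's first contribution, a caller would lengthen spin_time while messages are arriving, since incoming traffic suggests the peer is currently scheduled and a response is imminent.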
As hardware-coherent, distributed shared memory (DSM) multiprocessing becomes popular commercially, it is important to evaluate modern realizations to understand how they perform and scale for a range of interesting applications, and to identify the nature of the key bottlenecks. This paper evaluates the SGI Origin2000, perhaps the machine with the most aggressive communication architecture among the recent cache-coherent offerings, and in doing so articulates a sound methodology for evaluating real systems. We examine data access and synchronization microbenchmarks; speedups for different application classes, problem sizes and scaling models; detailed interactions and time breakdowns using performance tools; and the impact of special hardware support. We find that overall the Origin appears to deliver on the promise of cache-coherent shared address space multiprocessing, at least at the 32-processor scale we examine. The machine is quite easy to program for performance and has fewer organizational problems than previous systems we have examined. However, some important trouble spots are also identified, especially related to contention that is apparently caused by engineering decisions to share resources among processors.
In this paper, we present the Cello disk scheduling framework for meeting the diverse service requirements of applications. Cello employs a two-level disk scheduling architecture, consisting of a class-independent scheduler and a set of class-specific schedulers. The two levels of the framework allocate disk bandwidth at two time-scales: the class-independent scheduler governs the coarse-grain allocation of bandwidth to application classes, while the class-specific schedulers control the fine-grain interleaving of requests. The two levels of the architecture separate application-independent mechanisms from application-specific scheduling policies, and thereby facilitate the co-existence of multiple class-specific schedulers. We demonstrate that Cello is suitable for next-generation operating systems since: (i) it aligns the service provided with the application requirements, (ii) it protects application classes from one another, (iii) it is work-conserving and can adapt to changes in workload, (iv) it minimizes the seek time and rotational latency overhead incurred during access, and (v) it is computationally efficient.
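A compact sketch of the two-level structure, with round-based request quotas standing in for coarse-grain bandwidth allocation and two stand-in class-specific policies; Cello's actual schedulers and bandwidth accounting are richer:

```python
import heapq
from collections import deque

class TwoLevelScheduler:
    """Class-independent level: per-round request quotas per class.
    Class-specific level: FIFO for the real-time class, nearest-block-
    first for best-effort (both are illustrative choices, not Cello's)."""

    def __init__(self, shares):
        self.shares = shares                       # class -> quota per round
        self.queues = {c: deque() for c in shares}

    def submit(self, cls, block):
        self.queues[cls].append(block)

    def next_round(self, head=0):
        schedule = []
        for cls, quota in self.shares.items():
            q = self.queues[cls]
            if cls == "realtime":                  # FIFO within the class
                picks = [q.popleft() for _ in range(min(quota, len(q)))]
            else:                                  # seek-aware within the class
                picks = heapq.nsmallest(quota, q, key=lambda b: abs(b - head))
                for b in picks:
                    q.remove(b)
            schedule.extend(picks)
        return schedule
```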
Multithreaded programming is an effective way to exploit concurrency, but it is difficult to debug and tune a highly threaded program. This paper describes a performance tool called Tmon for monitoring, analyzing and tuning the performance of multithreaded programs. The performance tool has two novel features: it uses "thread waiting time" as a measure and constructs thread waiting graphs to show thread dependencies and thus performance bottlenecks, and it identifies "semi-busy-waiting" points where CPU cycles are wasted in condition checking and context switching. We have implemented the Tmon tool and, as a case study, we have used it to measure and tune a heavily threaded file system. We used four workloads to tune different aspects of the file system. We were able to improve the file system bandwidth and throughput significantly. In one case, we were able to improve the bandwidth by two orders of magnitude.
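A user-level sketch of the "thread waiting time" idea: wrap a lock so each acquisition records how long the waiter stalled and on which holder, yielding the edges of a waiting graph (hypothetical instrumentation; Tmon's own mechanism may differ):

```python
import threading
import time
from collections import defaultdict

class InstrumentedLock:
    wait_time = defaultdict(float)     # thread name -> total wait (s)
    wait_edges = defaultdict(float)    # (waiter, holder) -> wait (s)

    def __init__(self):
        self._lock = threading.Lock()
        self._holder = None

    def __enter__(self):
        me = threading.current_thread().name
        holder = self._holder          # racy snapshot; fine for a sketch
        t0 = time.monotonic()
        self._lock.acquire()
        waited = time.monotonic() - t0
        InstrumentedLock.wait_time[me] += waited
        if holder is not None:
            InstrumentedLock.wait_edges[(me, holder)] += waited
        self._holder = me

    def __exit__(self, *exc):
        self._holder = None
        self._lock.release()
```

Long-wait edges point at dependency chains, and many short waits punctuated by context switches resemble the "semi-busy-waiting" pattern the abstract describes.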
We consider the problem of task assignment in a distributed system (such as a distributed Web server) in which task sizes are drawn from a heavy-tailed distribution. Many task assignment algorithms are based on the heuristic that balancing the load at the server hosts will result in optimal performance. We show that this conventional wisdom is less true when the task size distribution is heavy-tailed (as is the case for Web file sizes). We introduce a new task assignment policy, called Size Interval Task Assignment with Variable Load (SITA-V). SITA-V purposely operates the server hosts at different loads, and directs smaller tasks to the lighter-loaded hosts. The result is that SITA-V provably decreases the mean task slowdown by significant factors (up to 1000 or more), with the improvement factor growing as the workload becomes more heavy-tailed. We evaluate the tradeoff between improvement in slowdown and increase in waiting time in a system using SITA-V, and show conditions under which SITA-V represents a particularly appealing policy.
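A sketch of size-interval routing with caller-chosen cutoffs; picking the cutoffs so that the small-task hosts run at deliberately lower load is the paper's contribution and is not reproduced here:

```python
import bisect

class SizeIntervalRouter:
    """Host i serves tasks whose size falls in [cutoffs[i-1], cutoffs[i])."""

    def __init__(self, cutoffs):
        self.cutoffs = sorted(cutoffs)             # interior boundaries
        self.load = [0.0] * (len(cutoffs) + 1)     # work routed per host

    def assign(self, size):
        host = bisect.bisect_right(self.cutoffs, size)
        self.load[host] += size
        return host

router = SizeIntervalRouter(cutoffs=[10, 1000])    # 3 hosts: S / M / L tasks
for task_size in [3, 7, 120, 5, 40000, 2, 900]:
    router.assign(task_size)
print(router.load)   # unequal by design under a heavy-tailed size mix
```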