The unmanageably large size of reference traces has spurred the development of sophisticated trace reduction techniques. In this paper we present two new algorithms for trace reduction: Safely Allowed Drop (SAD) and Optimal LRU Reduction (OLR). Both achieve high reduction factors and guarantee exact simulations for common replacement policies and for memories larger than a user-defined threshold. In particular, simulation on OLR-reduced traces is accurate for the LRU replacement algorithm, while simulation on SAD-reduced traces is accurate for the LRU and OPT algorithms. OLR also satisfies an optimality property: for a given trace and memory size it produces the shortest possible trace that has the same LRU behavior as the original for a memory of at least this size. Our approach has multiple applications, especially in simulating virtual memory systems; many page replacement algorithms are similar to LRU in that more recently referenced pages are likely to be resident. For several replacement algorithms in the literature, SAD- and OLR-reduced traces yield exact simulations. For many other algorithms, our trace reduction eliminates information that matters little: we present extensive measurements to show that the error for simulations of the CLOCK and SEGQ (segmented queue) replacement policies (the most common LRU approximations) is under 3% for the majority of memory sizes. In nearly all cases, the error is much smaller than that incurred by the well-known stack deletion technique. SAD and OLR have many desirable properties. In practice, they achieve reduction factors up to several orders of magnitude. The reduction translates to both storage savings and simulation speedups. Both techniques require little memory and perform a single forward traversal of the original trace, which makes them suitable for on-line trace reduction. Neither requires that the simulator be modified to accept the reduced trace.
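The abstract does not give the SAD or OLR algorithms themselves, but the stack deletion baseline it compares against is simple to illustrate: drop any reference that hits within the top D positions of the LRU stack, since such references cannot cause misses in sufficiently large LRU memories. A minimal sketch (illustrative code, not from the paper):

```python
def stack_deletion(trace, depth):
    """Stack deletion: filter a reference trace, dropping references
    that hit within the top `depth` positions of the LRU stack.
    Deeper hits and first-time references are kept."""
    stack = []        # LRU stack, most recently used at index 0
    reduced = []
    for page in trace:
        if page in stack:
            pos = stack.index(page)
            stack.pop(pos)
            if pos >= depth:            # deep hit: keep it
                reduced.append(page)
        else:
            reduced.append(page)        # cold reference: always keep
        stack.insert(0, page)           # promote to MRU position
    return reduced
```

For example, `stack_deletion([1, 2, 3, 1, 4, 1, 2], 2)` drops only the second reference to page 1 (a hit at stack depth 1), yielding `[1, 2, 3, 1, 4, 2]`. Unlike SAD/OLR, this reduction is lossy for exact simulation, which is why the paper reports its error separately.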
The majority of the Internet population accesses the World Wide Web via dial-up modem connections. Studies have shown that the limited modem bandwidth is the main contributor to the latency perceived by users. In this paper, we investigate one approach to reducing latency: prefetching between caching proxies and browsers. The approach relies on the proxy to predict which cached documents a user might reference next, and takes advantage of the idle time between user requests to push or pull the documents to the user. Using traces of modem Web accesses, we evaluate the potential of the technique for reducing client latency, examine the design of prediction algorithms, and investigate how their performance varies with parameters and implementation concerns. Our results show that prefetching combined with a large browser cache and delta-compression can reduce client latency by up to 23.4%. The reduction is achieved using the Prediction-by-Partial-Matching (PPM) algorithm, whose accuracy ranges from 40% to 73% depending on its parameters, and which generates 1% to 15% extra traffic on the modem links. A perfect predictor can increase the latency reduction to 28.5%, whereas without prefetching, a large browser cache and delta-compression alone reduce latency by only 14.4%. Depending on the desired properties of the algorithm, several configurations of PPM can be the best choice. Among several attractive simplifications of the scheme, some do more harm than others; in particular, it is important for the predictor to observe all accesses made by users, including browser cache hits.
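To make the prediction step concrete, here is a first-order Markov predictor, a deliberately simplified stand-in for the PPM scheme described above (real PPM blends predictions from several context orders; the class name and interface are illustrative):

```python
from collections import defaultdict, Counter

class MarkovPredictor:
    """First-order Markov next-document predictor: counts which URL
    follows which, and prefetches the most frequent successors of
    the most recently accessed URL."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # prev URL -> successor counts
        self.prev = None

    def record(self, url):
        """Observe one access (including browser cache hits, per the paper)."""
        if self.prev is not None:
            self.counts[self.prev][url] += 1
        self.prev = url

    def predict(self, n=1):
        """Top-n candidate documents to prefetch during idle time."""
        if self.prev is None:
            return []
        return [u for u, _ in self.counts[self.prev].most_common(n)]
```

After observing the access stream `a, b, a, b, a`, the predictor proposes `b` as the next document to push.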
Maximizing bandwidth efficiency in distributed continuous media streaming systems is key to delivering cost-effective multimedia services to distributed and heterogeneous receivers. We introduce a technique based on stream multiplexing to achieve the highest possible bandwidth efficiency while preserving stringent and deterministic quality-of-service guarantees. The technique accomplishes optimal multiplexing (i.e., the lowest possible bandwidth allocation) by exploiting both the temporal and the spatial structure among a group of continuous media streams. We present a family of optimal multiplexing schedules. The adverse per-stream effects of optimal multiplexing are studied, and a technique based on transmission rearrangement is proposed to mitigate these effects without sacrificing the achieved multiplexing optimality. The results presented in the paper provide fundamental criteria and limits for the design and evaluation of resource allocation, admission control, and stream scheduling policies for bandwidth-efficient continuous media streaming.
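The core multiplexing gain can be seen in a toy calculation (illustrative code, not the paper's schedules): when streams share one allocation, the provider must provision for the peak of the aggregate rate rather than the sum of the per-stream peaks, and the former is never larger:

```python
def multiplexed_allocation(schedules):
    """Compare bandwidth allocations for a group of streams.
    `schedules` is a list of per-stream transmission-rate schedules
    (one rate per time slot, all the same length). Returns
    (peak of aggregate rate, sum of per-stream peaks)."""
    aggregate = [sum(slot_rates) for slot_rates in zip(*schedules)]
    shared_peak = max(aggregate)                  # multiplexed allocation
    separate_peaks = sum(max(s) for s in schedules)  # per-stream allocation
    return shared_peak, separate_peaks
```

For two streams whose bursts are offset in time, `multiplexed_allocation([[4, 1], [1, 4]])` returns `(5, 8)`: multiplexing saves 3 units of bandwidth. The paper's transmission rearrangement goes further by deliberately shifting bursts to flatten the aggregate.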
Fault-based Markov prefetching for virtual memory pages (VMPs) is examined. Markov prediction based only on the sequence of program-issued faults is shown to achieve reasonably high accuracy for some scientific applications. Using the fault sequence to predict future faults makes speculative prefetching of VMPs feasible to implement. However, despite high prediction accuracy, it is difficult in practice to obtain speedup from one-fault-ahead speculative prefetching, as the approach requires a fault rate and fault latency low enough that I/O overhead is not already prohibitive.
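One way to measure the accuracy figure the abstract refers to is an online one-fault-ahead evaluation: train a first-order Markov table on the fault sequence as it arrives and score each prediction against the fault that actually occurs next. A minimal sketch, assuming faults are identified by page number:

```python
from collections import defaultdict, Counter

def one_ahead_accuracy(faults):
    """Online one-fault-ahead accuracy of a first-order Markov
    predictor over a sequence of faulting page numbers. A trial is
    counted only once the current context has been seen before."""
    table = defaultdict(Counter)  # previous fault -> successor counts
    hits = trials = 0
    prev = None
    for page in faults:
        if prev is not None:
            if table[prev]:                      # context seen: predict
                trials += 1
                predicted = table[prev].most_common(1)[0][0]
                hits += (predicted == page)
            table[prev][page] += 1               # then learn
        prev = page
    return hits / trials if trials else 0.0
```

A perfectly periodic fault stream such as `1, 2, 1, 2, 1, 2` scores 1.0; real scientific workloads land lower, and, as the abstract notes, even high accuracy does not guarantee speedup once fault latency is accounted for.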
Dynamic object replication schemes are studied, in conjunction with a mathematical model of user behavior, for continuous media servers that do not rely on knowledge of object access frequencies. Rather, these servers make adjustments in a threshold-based manner, based simply on the amount of resources currently available for servicing user requests for those objects. Further, analysis of the effects of various architectural and workload characteristics on dynamic replication policies (DRPs) shows that the mathematical model not only improves on the more conservative DRPs, but also yields significantly reduced sensitivity to system architecture, workload characteristics, skewness of data access patterns, and choice of threshold values.
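The threshold-based decision rule can be sketched as follows; the thresholds, names, and return values here are illustrative, not taken from the paper. The key property is that the decision uses only currently available capacity, never access-frequency estimates:

```python
def replication_action(free_streams, total_streams, replicas,
                       add_threshold=0.2, drop_threshold=0.6,
                       min_replicas=1):
    """Threshold-based replica adjustment for one object on a
    continuous media server. `free_streams` is the spare streaming
    capacity across the object's current replicas."""
    free_fraction = free_streams / total_streams
    if free_fraction < add_threshold:
        return "replicate"      # nearly saturated: add a replica early
    if free_fraction > drop_threshold and replicas > min_replicas:
        return "dereplicate"    # ample spare capacity: reclaim storage
    return "hold"
```

For example, with 1 of 10 stream slots free the rule replicates; with 8 of 10 free and more than one replica it dereplicates; otherwise it holds.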
Dynamic content generation (DCG) has become popular at many Web sites because it enables new services. As DCG normally places intensive I/O and CPU demands on a server, it becomes critical to cluster multiple servers to remove the server bottleneck. To this end, a resource management scheme for clustering Web servers with a master/slave architecture is developed, along with scheduling optimizations that consider the characteristics of Web workloads. The proposed techniques include separation of static and dynamic content processing, low-overhead remote CGI execution, and a reservation-based scheduler that considers both I/O and CPU utilization.
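The static/dynamic separation can be sketched as a master-node dispatch routine (a hypothetical simplification: the paper's scheduler also weighs per-node I/O and CPU utilization, while this sketch uses plain round-robin):

```python
def dispatch(request_path, slaves, rr_state):
    """Master-node dispatch: serve static content locally, ship
    dynamic (CGI) requests to slave nodes round-robin.
    `rr_state` is a one-element list holding the round-robin counter."""
    if request_path.endswith((".cgi", ".pl")):   # dynamic content
        slave = slaves[rr_state[0] % len(slaves)]
        rr_state[0] += 1
        return ("remote-cgi", slave)
    return ("local-static", None)                # static content
```

Keeping static requests on the master avoids paying remote-execution overhead for cheap work, while CGI requests, which dominate CPU and I/O cost, spread across the slaves.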
Trace sampling for a suite of publicly available desktop application traces for Windows NT on the Intel x86 platform is considered. Commonly used Windows NT applications can run for billions of instructions, and performing architectural studies on traces of billions of references is not feasible from either a time or a space perspective. Trace sampling is demonstrated as an efficient alternative that greatly reduces simulation time and, in many cases, only nominally reduces accuracy. Among the several trace sampling techniques examined, stitch most accurately predicts cache miss rates for these workloads. Furthermore, sampling can also be used to efficiently drive architectural studies and gather parameters for analytical cache models.
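The basic sampling step shared by these techniques is keeping short intervals spaced periodically through the trace; a minimal sketch (illustrative only; the stitch technique additionally carries simulated cache state from one sample into the next to approximate a warm cache):

```python
def sample_trace(trace, interval_len, period):
    """Systematic trace sampling: keep one interval of `interval_len`
    references out of every `period` references. The simulator is
    then driven with the concatenated samples instead of the full trace."""
    samples = []
    for start in range(0, len(trace), period):
        samples.append(trace[start:start + interval_len])
    return samples
```

For instance, sampling a 10-reference trace with 2-reference intervals every 5 references keeps references 0-1 and 5-6, a 60% reduction even at this toy scale.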
This paper examines NFS sensitivity to the performance characteristics of emerging networks. We adopt an unusual method of inserting controlled delays into live systems to measure sensitivity to basic network parameters. We develop a simple queuing model of an NFS server and show that it reasonably characterizes our two live systems running the SPECsfs benchmark. Using the techniques in this work, we can infer the structure of servers from published SPEC results. Our results show that NFS servers are most sensitive to processor overhead; it can be the limiting factor with even a modest number of disks. Continued reductions in processor overhead will be necessary to realize performance gains from future multi-gigabit networks. NFS can tolerate network latency in the regime of newer LANs and IP switches. Due to NFS's historically high mix of small metadata operations, NFS is quite insensitive to network bandwidth. Finally, we find that the protocol enhancements in NFS version 3 tolerate high latencies better than version 2 of the protocol.
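The abstract does not give the paper's queuing model, but the qualitative sensitivity to processor overhead already falls out of the simplest open queue. A minimal M/M/1 stand-in (an assumption of this sketch, not necessarily the model the authors use), with per-operation processor overhead as the service time:

```python
def mm1_response_time(arrival_rate, service_time):
    """M/M/1 mean response time: service_time / (1 - utilization).
    `arrival_rate` in ops/s, `service_time` in s/op. Latency grows
    without bound as utilization approaches 1, which is why shaving
    per-op processor overhead matters more than raw link bandwidth."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        return float("inf")   # server saturated
    return service_time / (1.0 - utilization)
```

At 50 ops/s with 10 ms of processor overhead per op (50% utilization), mean response time is already 20 ms; halving the overhead to 5 ms cuts it to about 6.7 ms, a better-than-linear improvement.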
In modern I/O architectures, multiple disk drives are attached to each I/O controller. A study of the performance of such architectures under I/O-intensive workloads has revealed a performance impairment that results from a previously unknown form of convoy behavior in disk I/O. In this paper, we describe measurements of the read performance of multiple disks that share a SCSI bus under a heavy workload, and develop and validate formulas that accurately characterize the observed performance (to within 12% on several platforms for I/O sizes in the range 16-128 KB). Two terms in the formula clearly characterize the lost performance seen in our experiments. We describe techniques to deal with the performance impairment, via user-level work-arounds that achieve greater overlap of bus transfers with disk seeks, and that increase the percentage of transfers that occur at the full bus bandwidth rather than at the lower bandwidth of a disk head. Experiments show bandwidth improvements of 10-20% when using these user-level techniques, but only in the case of large I/Os.