Heap-allocated objects play an important role in many modern programs. Various results have shown the overall performance of these programs can be improved by increasing the reference locality of heap-allocated object...
详细信息
Heap-allocated objects play an important role in many modern programs. Various results have shown the overall performance of these programs can be improved by increasing the reference locality of heap-allocated objects. In this paper we describe an approach that improves the virtual memory performance of allocation-intensive C programs by predicting the reference behavior and lifetime of heap objects as they are allocated, We further describe an implementation of our prediction algorithm and evaluate its performance on real programs. As part of our implementation, we present a low-overhead algorithm to minimize the cost of gathering run-time stack information, Finally, we show that an implementation of these algorithms has little overhead and can improve the virtual memory and TLB performance of programs substantially. Copyright (C) 2001 John Wiley & Sons, Ltd.
The allocation and disposal of memory is a ubiquitous operation in most programs. Rarely do programmers concern themselves with details of memory allocators;most assume that memory allocators provided by the system pe...
详细信息
The allocation and disposal of memory is a ubiquitous operation in most programs. Rarely do programmers concern themselves with details of memory allocators;most assume that memory allocators provided by the system perform well. Yet, in some applications, programmers use domain-specific knowledge in an attempt to improve the speed or memory utilization of memory allocators. In this paper, we describe a program (CustoMalloc) that synthesizes a memory allocator customized for a specific application. Our experiments show that the synthesized allocators are uniformly faster and more space efficient than the Berkeley UNIX allocator. Constructing a custom allocator requires little programmer effort, usually taking only a few minutes. Experience has shown that the synthesized allocators are not overly sensitive to properties of input sets and the resulting allocators are superior even to domain-specific allocators designed by programmers. Measurements show that synthesized allocators are from two to ten times faster than widely-used allocators.
Dynamically dispatched calls often limit the performance of object-oriented programs, since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these e...
详细信息
Dynamically dispatched calls often limit the performance of object-oriented programs, since object-oriented programming encourages factoring code into small, reusable units, thereby increasing the frequency of these expensive operations. Frequent calls not only slow down execution with the dispatch overhead per se, but more importantly they hinder optimization by limiting the range and effectiveness of standard global optimizations. In particular, dynamically dispatched calls prevent standard interprocedural optimizations that depend on the availability of a static call graph. The SELF implementation described here offers two novel approaches to optimization. Type feedback speculatively inlines dynamically dispatched calls based on profile information that predicts likely receiver classes. Adaptive optimization reconciles optimizing compilation with interactive performance by incrementally optimizing only the frequently executed parts of a program. When combined, these two techniques result in a system that can execute programs significantly faster than previous systems while retaining much of the interactiveness of an interpreted system.
We consider the problem of allocating students to project topics satisfying side constraints and taking into account students' preferences. Students rank projects according to their preferences for the topic and s...
详细信息
We consider the problem of allocating students to project topics satisfying side constraints and taking into account students' preferences. Students rank projects according to their preferences for the topic and side constraints limit the possibilities to team up students in the project topics. The goal is to find assignments that are fair and that maximize the collective satisfaction. Moreover, we consider issues of stability and envy from the students' viewpoint. This problem arises as a crucial activity in the organization of a first year course at the Faculty of Science of the University of Southern Denmark. We formalize the student-project allocation problem as a mixed integer linear programming problem and focus on different ways to model fairness and utilitarian principles. On the basis of real-world data, we compare empirically the quality of the allocations found by the different models and the computational effort to find solutions by means of a state-of-the-art commercial solver. We provide empirical evidence about the effects of these models on the distribution of the student assignments, which could be valuable input for policy makers in similar settings. Building on these results we propose novel combinations of the models that, for our case, attain feasible, stable, fair and collectively satisfactory solutions within a minute of computation. Since 2010, these solutions are used in practice at our institution.
Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lu, and R. Many of these languages are inefficient due to the overhead of being dynamically typed and interpreted...
详细信息
ISBN:
(纸本)9783030905392;9783030905385
Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lu, and R. Many of these languages are inefficient due to the overhead of being dynamically typed and interpreted. In this paper, we report the potential performance improvement of domain-specific interpreter specialization for data analysis workloads and evaluate how the characteristics of data analysis workloads affect the specialization, both positively and negatively. Assisted by compilers, we specialize the Lu and CPython interpreters at source-level using the script being interpreted and the data types during the interpretation as invariants for five common tasks from real data analysis workloads. Through experiments, we measure 9.0-39.6% performance improvement for Lu and 11.0-17.2% performance improvement for CPython for benchmarks that perform data loading, histogram computation, data filtering, data transformation, and dataset shuffle. This specialization does not include misspeculation checks of data types at possible type conversion code that may be necessary for other workloads. We report the details of our evaluation and present a semi-automatic method for specializing the interpreters.
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the fe...
详细信息
ISBN:
(纸本)9780897916608
Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch prediction reported small or minimal improvements. As wide-issue architectures become increasingly popular the importance of reducing branch costs will increase, and branch alignment is one mechanism which can effectively reduce these costs. In this paper, we propose an improved branch alignment algorithm that takes into consideration the architectural cost model and the branch prediction architecture when performing the basic block reordering. We show that branch alignment algorithms can improve a broad range of static and dynamic branch prediction architectures. We also show that a programs performance can be improved by approximately 5% even when using recently proposed, highly accurate branch prediction architectures. The programs are compiled by any existing compiler ad then transformed via binary transformations. When implementing these algorithms on a Alpha AXP 21604 up to a 16% reduction in total execution time is achieved.
暂无评论