Rapidly increasing design and manufacturing non-recurring engineering (NRE) costs are prompting a shift in electronic design from hardwired application specific integrated circuits (ASICs) to the use of software on pr...
详细信息
ISBN:
(纸本)9781581138061
Rapidly increasing design and manufacturing non-recurring engineering (NRE) costs are prompting a shift in electronic design from hardwired application specific integrated circuits (ASICs) to the use of software on programmable platforms. However, in order to minimize the power and performance overhead of such processors, we are seeing the introduction of domain or application specific processors such as network and communication processors. The design of such specialized processors requires software development tools such as simulators and compilers. In order to quickly develop these tools for multiple design points under consideration, it is highly desirable to have them synthesized from formal processor descriptions written in Architecture Description languages (ADLs). In this paper, we present the Mescal Architecture Description language (MADL). MADL features a two-layer structure, a core layer and an annotation layer. The core layer is based on a formal and flexible microprocessor model-the operation state machine (OSM), which enables MADL to express the concurrency at the operation execution level for a wide range of architectures. We address the challenges faced in designing the core layer to combine the OSM model with techniques for achieving compact processor descriptions. The annotation layer features a generic syntax that allows creating annotation schemes to specify implementation dependent or tool specific information. To show the effectiveness of MADL, we present an MADL-based simulator synthesis framework that has been used to generate efficient cycle accurate simulators and instruction set simulators with very low development effort. We also describe our annotation schemes that enable the extraction of architecture properties for use in instruction scheduling and integer-linear-programming based register allocation. Our experimental results demonstrate the efficacy of MADL as a practical and promising language for the development of programmable plat
When an individual task can be forcefully terminated at any time, cooperating tasks must communicate carefully. For example, if two tasks share an object, and if one task is terminated while it manipulates the object,...
详细信息
When an individual task can be forcefully terminated at any time, cooperating tasks must communicate carefully. For example, if two tasks share an object, and if one task is terminated while it manipulates the object, the object may remain in an inconsistent or frozen state that incapacitates the other task. To support communication among terminable tasks, language run-time systems (and operating systems) provide kill-safe abstractions for inter-task communication. No kill-safe guarantee is available, however, for abstractions that are implemented outside the run-time system. In this paper, we show how a run-time system can support new kill-safe abstractions without requiring modification to the run-time system, and without requiring the run-time system to trust any new code. Our design frees the run-time implementor to provide only a modest set of synchronization primitives in the trusted computing base, while still allowing tasks to communicate using sophisticated abstractions.
Region-based memory management offers several important potential advantages over garbage collection, including real-time performance, better data locality, and more efficient use of limited memory. Researchers have a...
详细信息
Region-based memory management offers several important potential advantages over garbage collection, including real-time performance, better data locality, and more efficient use of limited memory. Researchers have advocated the use of regions for functional imperative, and object-oriented languages. Lexically scoped regions are now a core feature of the Real-Time Specification for Java (RTSJ)[5]. Recent research in region-based programming for Java has focused on region checking, which requires manual effort to augment the program with region annotations. In this paper, we propose an automatic region inference system for a core subset of Java. To provide an inference method that is both precise and practical, we support classes and methods that are region-polymorphic, with region-polymorphic recursion for methods. One challenging aspect is to ensure region safety in the presence of features such as class subtyping, method overriding, and downcast operations. Our region inference rules can handle these object-oriented features safely without creating dangling references.
This paper presents the first scalable context-sensitive, inclusion-based pointer alias analysis for Java programs. Our approach to context sensitivity is to create a clone of a method for every context of interest. a...
详细信息
This paper presents the first scalable context-sensitive, inclusion-based pointer alias analysis for Java programs. Our approach to context sensitivity is to create a clone of a method for every context of interest. and run a context-insensitive algorithm over the expanded call graph to get context-sensitive results. For precision, we generate a clone for every acyclic path through a program's call graph, treating methods in a strongly connected component as a single node. Normally, this formulation is hopelessly intractable as a call graph often has 10(14) acyclic paths or more. We show that these exponential relations can be computed efficiently using binary decision diagrams (BDDs). Key to the scalability of the technique is a context numbering scheme that exposes the commonalities across contexts. We applied our algorithm to the most popular applications available on Sourceforge, and found that the largest programs, with hundreds of thousands of Java bytecodes, can be analyzed in under 20 minutes. This paper shows that pointer analysis, and many other queries and algorithms, can be described succinctly and declaratively using Datalog, a logic programminglanguage. We have developed a system called bddbddb that automatically translates Datalog programs into highly efficient BDD implementations. We used this approach to develop a variety of context-sensitive algorithms including side effect analysis, type analysis, and escape analysis.
Through the design and implementation of a JVM that supports Pluggable Verification Modules (PVMs), the idea of an extensible protection mechanism is entertained. Link-time bytecode verification becomes a pluggable se...
详细信息
ISBN:
(纸本)1581138318
Through the design and implementation of a JVM that supports Pluggable Verification Modules (PVMs), the idea of an extensible protection mechanism is entertained. Link-time bytecode verification becomes a pluggable service that can be readily replaced, reconfigured and augmented. Application-specific verification services can be safely introduced into the dynamic linking process of the JVM. This feature is enabled by the adoption of a previously proposed modular verification architecture, Proof Linking [23, 24], which decouples bytecode verification from the dynamic linking process, rendering the verifier a replaceable module. The PVM mechanism has been implemented in an open source JVM, the Aegis VM [21]. To evaluate the software engineering and security engineering benefits of this extensible protection mechanism, an augmented type system JAC (Java Access Control) [37] has been successfully implemented as a PVM.
This paper describes a type system that is capable of expressing and enforcing immutability constraints. The specific constraint expressed is that the abstract state of the object to which an immutable reference refer...
详细信息
ISBN:
(纸本)1581138318
This paper describes a type system that is capable of expressing and enforcing immutability constraints. The specific constraint expressed is that the abstract state of the object to which an immutable reference refers cannot be modified using that reference. The abstract state is (part of) the transitively reachable state: that is, the state of the object and all state reachable from it by following references. The type system permits explicitly excluding fields or objects from the abstract state of an object. For a statically type-safe language, the type system guarantees reference immutability. If the language is extended with immutability downcasts, then run-time checks enforce the reference immutability constraints. In order to better understand the usability and efficacy of the type system, we have implemented an extension to Java, called Javari, that includes all the features of our type system. Javari is interoperable with Java and existing JVMs. It can be viewed as a proposal for the semantics of the Java const keyword, though Javari's syntax uses readonly instead. This paper describes the design and implementation of Javari, including the type-checking rules for the language. This paper also discusses experience with 160,000 lines of Javari code. Javari was easy to use and provided a number of benefits, including detecting errors in well-tested code.
When vectorizing for SIMD architectures that are commonly employed by today's multimedia extensions, one of the new challenges that arise is the handling of memory alignment. Prior research has focused primarily o...
详细信息
When vectorizing for SIMD architectures that are commonly employed by today's multimedia extensions, one of the new challenges that arise is the handling of memory alignment. Prior research has focused primarily on vectorizing loops where all memory references are properly aligned. An important aspect of this problem, namely, how to vectorize misaligned memory references, still remains unaddressed. This paper presents a compilation scheme that systematically vectorizes loops in the presence of misaligned memory references. The core of our technique is to automatically reorganize data in registers to satisfy the alignment requirement imposed by the hardware. To reduce the data reorganization overhead, we propose several techniques to minimize the number of data reorganization operations generated. During the code generation, our algorithm also exploits temporal reuse when aligning references that access contiguous memory across loop iterations. Our code generation scheme guarantees to never load the same data associated with a single static access twice. Experimental results indicate near peak speedup factors, e.g., 3.71 for 4 data. per vector and 6.06 for 8 data per vector, respectively, for a set of loops where 75% or more of the static memory references are misaligned.
The emerging hardware support for thread-level speculation opens new opportunities to parallelize sequential programs beyond the traditional limits. By speculating that many data dependences are unlikely during runtim...
详细信息
The emerging hardware support for thread-level speculation opens new opportunities to parallelize sequential programs beyond the traditional limits. By speculating that many data dependences are unlikely during runtime, consecutive iterations of a sequential loop can be executed speculatively in parallel. Runtime parallelism is obtained when the speculation is correct. To take full advantage of this new execution model, a program needs to be programmed or compiled in such a way that it exhibits high degree of speculative thread-level parallelism. We propose a comprehensive cost-driven compilation framework to perform speculative parallelization. Based on a misspeculation cost model, the compiler aggressively transforms loops into optimal speculative parallel loops and selects only those loops whose speculative parallel execution is likely to improve program performance. The framework also supports and uses enabling techniques such as loop unrolling, software value prediction and dependence profiling to expose more speculative parallelism. The proposed framework was implemented on the ORC compiler. Our evaluation showed that the cost-driven speculative parallelization was effective. Our compiler was able to generate good speculative parallel loops in ten Spec2000Int benchmarks, which currently achieve an average 8% speedup. We anticipate an average 15.6% speedup when all enabling techniques are in place.
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely di...
详细信息
This paper describes a computer architecture, Spatial Computation (SC), which is based on the translation of high-level language programs directly into hardware structures. SC program implementations are completely distributed, with no centralized control. SC circuits are optimized for wires at the expense of computation units. In this paper we investigate a particular implementation of SC: ASH (Application-Specific Hardware). Under the assumption that computation is cheaper than communication, ASH replicates computation units to simplify interconnect, building a system which uses very simple, completely dedicated communication channels. As a consequence, communication on the datapath never requires arbitration;the only arbitration required is for accessing memory. ASH relies on very simple hardware primitives, using no associative structures, no multiported register files, no scheduling logic, no broadcast, and no clocks. As a consequence, ASH hardware is fast and extremely power efficient. In this work we demonstrate three features of ASH: (1) that such architectures can be built by automatic compilation of C programs;(2) that distributed computation is in some respects fundamentally different from monolithic superscalar processors;and (3) that ASIC implementations of ASH use three orders of magnitude less energy compared to high-end superscalar processors, while being on average only 33% slower in performance (3.5x worst-case).
Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on ...
详细信息
Cache miss stalls hurt performance because of the large gap between memory and processor speeds - for example, the popular server benchmark SPEC JBB2000 spends 45% of its cycles stalled waiting for memory requests on the Itanium(R) 2 processor. Traversing linked data structures causes a large portion of these stalls. Prefetching for linked data structures remains a major challenge because serial data dependencies between elements in a linked data structure preclude the timely materialization of prefetch addresses. This paper presents Mississippi Delta (MS Delta), a novel technique for prefetching linked data structures that closely integrates the hardware performance monitor (HPM), the garbage collector's global view of heap and object layout, the type-level metadata inherent in type-safe programs, and JIT compiler analysis. The garbage collector uses the HPM's data cache miss information to identify cache miss intensive traversal paths through linked data structures, and then discovers regular distances (deltas) between these linked objects. JIT compiler analysis injects prefetch instructions using deltas to materialize prefetch addresses. We have implemented MS Delta in a fully dynamic profile-guided optimization system: the StarJIT dynamic compiler [1] and the ORP Java virtual machine [9]. We demonstrate a 28-29% reduction in stall cycles attributable to the high-latency cache misses targeted by MS Delta and a speedup of 11-14% on the cache miss intensive SPEC JBB2000 benchmark.
暂无评论