ISBN (Print): 9781450323260
We report on the design and implementation of an extensible programming language and its intrinsic support for formal verification. Our language is targeted at low-level programming of infrastructure like operating systems and runtime systems. It is based on a cross-platform core combining characteristics of assembly languages and compiler intermediate languages. From this foundation, we take literally the saying that C is a "macro assembly language": we introduce an expressive notion of certified low-level macros, sufficient to build up the usual features of C and beyond as macros with no special support in the core. Furthermore, our macros have integrated support for strongest postcondition calculation and verification condition generation, so that we can provide a high-productivity formal verification environment within Coq for programs composed from any combination of macros. Our macro interface is expressive enough to support features that low-level programs usually only access through external tools with no formal guarantees, such as declarative parsing or SQL-inspired querying. The abstraction level of these macros imposes only a compile-time cost, via the execution of functional Coq programs that compute programs in our intermediate language; but the run-time cost is not substantially greater than for more conventional C code. We describe our experiences constructing a full C-like language stack using macros, with some experiments on the verifiability and performance of individual programs running on that stack.
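Bedrock's macros are functional Coq programs, so no Python appears in the paper itself; the sketch below only illustrates the central idea, that a macro emits low-level code together with a strongest-postcondition calculator, so verification conditions compose along with the program. The names (`assign`, `seq`) and the string-based predicate encoding are invented for illustration and are not Bedrock's interface.

```python
import itertools
import re

_fresh = itertools.count()

def assign(var, expr):
    """Toy macro for `var := expr`: emits one instruction and a strongest-
    postcondition transformer (sp(P, var := e) = exists v. P[var:=v] /\ var = e[var:=v])."""
    code = [f"{var} := {expr}"]
    def sp(pre):
        v = f"v{next(_fresh)}"                       # fresh name for the old value of var
        old = lambda s: re.sub(rf"\b{var}\b", v, s)  # substitute var by that old value
        return f"exists {v}. ({old(pre)}) /\\ {var} = ({old(expr)})"
    return code, sp

def seq(m1, m2):
    """Macro combinator: sequencing concatenates the emitted code and composes
    the postcondition calculators, so proof obligations follow the macro structure."""
    c1, sp1 = m1
    c2, sp2 = m2
    return c1 + c2, lambda pre: sp2(sp1(pre))

# Build `x := x + 1; y := x` and compute its strongest postcondition from `x = 0`.
prog, sp = seq(assign("x", "x + 1"), assign("y", "x"))
print("\n".join(prog))   # the generated low-level code
print(sp("x = 0"))       # the verification condition a prover would discharge
```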
ISBN (Print): 9781450320146
With the maturing of technology for model checking and constraint solving, there is an emerging opportunity to develop programming tools that can transform the way systems are specified. In this paper, we propose a new way to program distributed protocols using concolic snippets. Concolic snippets are sample execution fragments that contain both concrete and symbolic values. The proposed approach allows the programmer to describe the desired system partially, using the traditional model of communicating extended finite-state machines (EFSMs) along with high-level invariants and concrete execution fragments. Our synthesis engine completes an EFSM skeleton by inferring guards and updates from the given fragments; the completed machine is then automatically analyzed by a model checker with respect to the desired invariants. The counterexamples produced by the model checker can then be used by the programmer to add new concrete execution fragments that describe the correct behavior in the specific scenario corresponding to the counterexample. We describe TRANSIT, a language and prototype implementation of the proposed specification methodology for distributed protocols. Experimental evaluations of TRANSIT on cache coherence protocols show that (1) the algorithm for expression inference from concolic snippets can synthesize expressions of size 15 involving typical operators over commonly occurring types, (2) for a classical directory-based protocol, TRANSIT automatically generates, in a few seconds, a complete implementation from a specification consisting of the EFSM structure and a few concrete examples for every transition, and (3) a published partial description of the SGI Origin cache coherence protocol maps directly to symbolic examples and leads to a complete implementation in a few iterations, with the programmer correcting counterexamples resulting from underspecified transitions by adding concrete examples in each iteration.
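To make the expression-inference step concrete, here is a minimal sketch under assumed details: it enumerates arithmetic expressions over a tiny grammar and returns the first one consistent with every concrete snippet. The field names (`sharers`, `acks`), the snippets, and the grammar are invented for illustration; TRANSIT's actual engine handles much larger expressions (size 15) and richer types.

```python
from itertools import product

# Concrete snippets: observed (state before transition) -> updated counter value.
snippets = [
    ({"sharers": 2, "acks": 0}, 3),
    ({"sharers": 5, "acks": 1}, 6),
    ({"sharers": 0, "acks": 4}, 1),
]

VARS = ["sharers", "acks"]
CONSTS = [0, 1]

def terms(depth):
    """Enumerate expression trees of a given depth as (text, eval-function) pairs."""
    if depth == 0:
        for v in VARS:
            yield v, (lambda s, v=v: s[v])
        for c in CONSTS:
            yield str(c), (lambda s, c=c: c)
        return
    for (ta, fa), (tb, fb) in product(terms(depth - 1), repeat=2):
        yield f"({ta} + {tb})", (lambda s, fa=fa, fb=fb: fa(s) + fb(s))
        yield f"({ta} - {tb})", (lambda s, fa=fa, fb=fb: fa(s) - fb(s))

def infer(snippets, max_depth=2):
    """Return the first enumerated expression that agrees with every snippet."""
    for d in range(max_depth + 1):
        for text, f in terms(d):
            if all(f(state) == out for state, out in snippets):
                return text
    return None

print(infer(snippets))   # prints "(sharers + 1)" for the snippets above
```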
Live programming allows programmers to edit the code of a running program and immediately see the effect of the code changes. This tightening of the traditional edit-compile-run cycle reduces the cognitive gap between...
ISBN (Print): 9781450320146
Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions and stages with global or data-dependent access patterns. Because of their complex structure, the performance difference between a naive implementation of a pipeline and an optimized one is often an order of magnitude. Efficient implementations require optimization of both parallelism and locality, but due to the nature of stencils, there is a fundamental tension between parallelism, locality, and introducing redundant recomputation of shared values. We present a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule. Combining this compiler with stochastic search over the space of schedules enables terse, composable programs to achieve state-of-the-art performance on a wide range of real image processing pipelines, and across different hardware architectures, including multicores with SIMD and heterogeneous CPU+GPU execution. From simple Halide programs written in a few hours, we demonstrate performance up to 5x faster than hand-tuned C, intrinsics, and CUDA implementations optimized by experts over weeks or months, for image processing applications beyond the reach of past automatic compilers.
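As a rough illustration of the scheduling tradeoff the paper models, the toy cost model below scores three schedule choices for a two-stage blur: inlining the producer (redundant recomputation, perfect locality), computing it at the root (no recomputation, an image-sized intermediate), and tiling (a little recomputation at tile edges, a small working set), then picks the cheapest by a crude random search. All constants, weights, and the search itself are invented stand-ins for Halide's real tradeoff model and autotuner.

```python
import random

# Two-stage blur over a W x H image: a 3-tap horizontal producer feeding
# a 3-tap vertical consumer.
W, H = 1920, 1080
OPS_PER_PRODUCER_PIXEL = 3

def cost(schedule, tile_rows=32, footprint_weight=1.0):
    if schedule == "inline":      # recompute the producer at each of the 3 consumer taps
        ops = 3 * W * H * OPS_PER_PRODUCER_PIXEL
        footprint = 0             # no intermediate buffer: perfect locality
    elif schedule == "root":      # compute and store the whole producer up front
        ops = W * H * OPS_PER_PRODUCER_PIXEL
        footprint = W * H         # image-sized intermediate: poor locality
    else:                         # "tile": compute the producer per band of rows,
        overlap = 2               # redundantly recomputing 2 overlap rows per band
        ops = W * H * OPS_PER_PRODUCER_PIXEL * (tile_rows + overlap) / tile_rows
        footprint = W * (tile_rows + overlap)    # only a sliding window stays live
    return ops + footprint_weight * footprint    # footprint stands in for memory traffic

# A crude stochastic search over this tiny schedule space: sample a few
# candidates and keep the cheapest.
candidates = [("inline", 0), ("root", 0)] + [
    ("tile", random.choice([8, 16, 32, 64, 128])) for _ in range(5)
]
best = min(candidates, key=lambda c: cost(c[0], tile_rows=c[1] or 32))
print(best, cost(best[0], tile_rows=best[1] or 32))
```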
ISBN (Print): 9781450321037
The proceedings contain 12 papers. The topics discussed include: APE: accelerator processor extensions to optimize data-compute co-location; a study of data structures with a deep heap shape; introducing kernel-level page reuse for high performance computing; a low overhead method for recovering unused memory inside regions; software-controlled transparent management of heterogeneous memory resources in virtualized systems; program-centric cost models for locality; a new perspective on processing-in-memory architecture design; a coldness metric for cache optimization; all-window data liveness; software-level scheduling to exploit non-uniformly shared data cache on GPGPU; cache rationing for multicore; and analyzing locality of memory references in GPU architectures.
ISBN (Print): 9781450320146
We present CLAP, a new technique to reproduce concurrency bugs. CLAP has two key steps. First, it logs thread-local execution paths at runtime. Second, offline, it computes memory dependencies that accord with the logged execution and are able to reproduce the observed bug. The second step works by combining constraints from the thread paths with constraints based on a memory model, and computing an execution with a constraint solver. CLAP has four major advantages. First, logging purely local execution of each thread is substantially cheaper than logging memory interactions, which makes CLAP efficient compared to previous approaches. Second, our logging requires no synchronization and hence adds no memory barriers or fences; this minimizes perturbation and the missed bugs that arise when extra synchronization forecloses certain racy behaviors. Third, since it uses no synchronization, we extend CLAP to work on a range of relaxed memory models, such as TSO and PSO, in addition to sequential consistency. Fourth, CLAP can compute a much simpler execution than the original one, revealing the bug with minimal thread context switches. To mitigate scalability issues, we also present an approach to parallelize constraint solving, which theoretically scales our technique to programs with arbitrary execution length. Experimental results on a variety of multithreaded benchmarks and real-world concurrent applications validate these advantages by showing that our technique is effective in reproducing concurrency bugs even under relaxed memory models; furthermore, it is significantly more efficient than a state-of-the-art technique that records shared memory dependencies, reducing execution time overhead by 45% and log size by 88% on average.
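A minimal sketch of the offline step, assuming the z3-solver Python package and sequential consistency only (CLAP additionally encodes per-thread path constraints and relaxed models such as TSO and PSO): given per-thread logs of a lost-update race on x, the solver finds a global ordering of the four logged events that reproduces the observed values. The event names and scenario are invented for illustration.

```python
# pip install z3-solver
from z3 import Ints, Solver, Distinct, And, sat

# Logged events: each thread read x (observing 0) and then wrote x := read + 1;
# the buggy final value of x was 1 (a lost update). Initially x == 0.
r1, w1, r2, w2 = Ints("r1 w1 r2 w2")   # positions of the events in a global order
s = Solver()
s.add(Distinct(r1, w1, r2, w2))
s.add([And(0 <= e, e < 4) for e in (r1, w1, r2, w2)])
s.add(r1 < w1, r2 < w2)   # program order within each logged thread path
# Value constraints from the logs: both reads returned the initial value 0,
# so each read must be ordered before both writes of 1.
s.add(r1 < w2, r2 < w1)

assert s.check() == sat
m = s.model()
events = [("R1(x)=0", r1), ("W1(x:=1)", w1), ("R2(x)=0", r2), ("W2(x:=1)", w2)]
events.sort(key=lambda e: m[e[1]].as_long())
print(" -> ".join(name for name, _ in events))   # one interleaving reproducing the bug
```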
ISBN (Print): 9781450320146
A software product line (SPL) encodes a potentially large variety of software products as variants of some common code base. Up until now, re-using traditional static analyses for SPLs was virtually intractable, as it required programmers to generate and analyze all products individually. In this work, however, we show how an important class of existing inter-procedural static analyses can be transparently lifted to SPLs. Without requiring programmers to change a single line of code, our approach SPLLIFT automatically converts any analysis formulated for traditional programs within the popular IFDS framework for inter-procedural, finite, distributive, subset problems into an SPL-aware analysis formulated in the IDE framework, a well-known extension to IFDS. Using a full implementation based on Heros, Soot, CIDE and JavaBDD, we show that with SPLLIFT one can reuse IFDS-based analyses without changing a single line of code. Through experiments using three static analyses applied to four Java-based product lines, we show that our approach produces correct results and outperforms the traditional approach by several orders of magnitude.
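The lifting idea can be sketched without any IFDS/IDE machinery: instead of a boolean "is this fact reachable?", the lifted analysis tracks the feature constraint under which it is reachable. The toy below encodes constraints extensionally as sets of configurations (SPLLIFT uses JavaBDD-backed BDDs) and runs a hand-rolled taint propagation over an invented three-statement product line; it illustrates the concept, not the SPLLIFT implementation.

```python
from itertools import product

FEATURES = ["LOGGING", "ENCRYPT"]
ALL = [frozenset(f for f, on in zip(FEATURES, bits) if on)
       for bits in product([False, True], repeat=len(FEATURES))]

def cond(expr):
    """A feature constraint, as the set of configurations satisfying `expr`."""
    return frozenset(c for c in ALL if expr(c))

TRUE = cond(lambda c: True)

# A tiny SPL program: each statement is tagged with the feature condition
# under which it is compiled in (program and names are made up).
program = [
    (TRUE,                           ("assign", "pwd", "source()")),
    (cond(lambda c: "ENCRYPT" in c), ("assign", "pwd", "encrypt_const()")),  # kills the taint
    (cond(lambda c: "LOGGING" in c), ("sink",   "pwd", None)),
]

tainted = {}   # variable -> feature constraint under which it is tainted
leaks = []
for stmt_cond, (op, var, rhs) in program:
    if op == "assign":
        gen = stmt_cond if rhs == "source()" else frozenset()
        # taint survives where the statement is absent, or is (re)generated by it
        tainted[var] = (tainted.get(var, frozenset()) - stmt_cond) | gen
    elif op == "sink":
        leaks.append((var, tainted.get(var, frozenset()) & stmt_cond))

for var, constraint in leaks:
    print(f"{var} may leak in configurations: {sorted(map(sorted, constraint))}")
    # -> leaks only when LOGGING is enabled and ENCRYPT is disabled
```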
ISBN (Print): 9781450320146
Read-Modify-Write (RMW) instructions are widely used as the building blocks of a variety of higher-level synchronization constructs, including locks, barriers, and lock-free data structures. Unfortunately, they are expensive in architectures such as x86 and SPARC which enforce (variants of) Total-Store-Order (TSO). A key reason is that RMWs in these architectures are ordered like a memory barrier, incurring the cost of a write-buffer drain in the critical path. Such strong ordering semantics are dictated by the requirements of the strict atomicity definition (type-1) that existing TSO RMWs use. Programmers often do not need such strong semantics. Moreover, weakening the atomicity definition of TSO RMWs would also weaken their ordering, thereby enabling more efficient hardware implementations. In this paper we argue for TSO RMWs with weaker atomicity definitions; we consider two such definitions, type-2 and type-3, each with its own ordering relaxations. We formally specify how such weaker RMWs would be ordered, and show that type-2 RMWs, in particular, can seamlessly replace existing type-1 RMWs in common synchronization idioms, except in situations where a type-1 RMW is used as a memory barrier. Recent work has shown that the new C/C++11 concurrency model can be realized by generating conventional (type-1) RMWs for C/C++11 SC-atomic-writes and/or SC-atomic-reads. We formally prove that this remains valid using the proposed type-2 RMWs; type-3 RMWs, on the other hand, could be used for SC-atomic-reads (and optionally SC-atomic-writes). We further propose efficient microarchitectural implementations for type-2 (type-3) RMWs; simulation results show that our implementation reduces the cost of an RMW by up to 58.9% (64.3%), which translates into an overall performance improvement of up to 9.0% (9.2%) on a set of parallel programs, including those from the SPLASH-2, PARSEC, and STAMP benchmarks.
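The one idiom where the authors say a weaker RMW cannot replace a type-1 RMW is when the RMW serves purely as a memory barrier. The toy store-buffer model below illustrates why, using a Dekker-style handshake: it is a loose sketch of TSO write buffering, not the paper's formal type-1/2/3 semantics, and every name in it is invented.

```python
# Toy TSO store-buffer model: stores retire through a per-thread FIFO buffer,
# loads forward from the local buffer before reading memory.
class Thread:
    def __init__(self, mem):
        self.mem, self.buf = mem, []

    def store(self, addr, val):
        self.buf.append((addr, val))          # buffered, not yet globally visible

    def load(self, addr):
        for a, v in reversed(self.buf):       # store-to-load forwarding
            if a == addr:
                return v
        return self.mem.get(addr, 0)

    def rmw(self, addr, drains):
        if drains:                            # fence-like (type-1-style) RMW: drain first
            for a, v in self.buf:
                self.mem[a] = v
            self.buf.clear()
        old = self.load(addr)                 # weaker RMW here: no drain before the access
        self.mem[addr] = old + 1
        return old

def dekker(drains):
    mem = {}
    t1, t2 = Thread(mem), Thread(mem)
    t1.store("flag1", 1)                      # both flag stores sit in store buffers
    t2.store("flag2", 1)
    t1.rmw("scratch", drains)                 # RMW used purely as a memory barrier
    r1 = t1.load("flag2")
    t2.rmw("scratch", drains)
    r2 = t2.load("flag1")
    return r1, r2

print("draining RMW (fence-like):", dekker(True))    # (0, 0) cannot occur
print("non-draining RMW:         ", dekker(False))   # (0, 0): both threads proceed
```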