The use of driver models within advanced driver assistance systems (ADAS) makes it possible to anticipate the driving behavior of the vehicle and of all traffic participants in its close vicinity. This valuable information could considerably improve both the performance and the acceptance of ADAS. Consequently, complex driver models need to be integrated into embedded systems. This work first summarizes important driver models described in the literature. Based on this, a suitable approach to implement a driver model on an embedded system is derived. The model used focuses on the longitudinal driving and lane change behavior of drivers. The system architecture is derived and optimized for real-time execution. The driver model is analyzed in detailed simulations. Test drives in a small-scale naturalistic driving study are used to validate the driver model. This paper defines a standard driver model to be implemented as part of the DESERVE platform within the Artemis project “DESERVE”. The dSpace MicroAutoBox II is used as the embedded automotive hardware. The paper summarizes approaches and examples for using the generated prediction data in ADAS such as adaptive cruise control (ACC).
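The abstract does not reproduce the model's equations; as an illustration of what a longitudinal driver model computes, here is a sketch of the well-known Intelligent Driver Model (IDM), with textbook parameter values that are assumptions on our part, not the paper's calibration:

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=33.3, T=1.5, a_max=1.0, b=1.5, s0=2.0, delta=4):
    """Intelligent Driver Model (IDM) longitudinal acceleration [m/s^2].

    v      -- ego speed [m/s]
    v_lead -- lead vehicle speed [m/s]
    gap    -- bumper-to-bumper distance to the lead vehicle [m]
    Parameters are typical textbook values (hypothetical, not the paper's).
    """
    dv = v - v_lead                                   # closing speed
    # Desired dynamic gap: jam distance + time headway + braking term.
    s_star = s0 + max(0.0, v * T + v * dv / (2 * math.sqrt(a_max * b)))
    return a_max * (1 - (v / v0) ** delta - (s_star / gap) ** 2)

# Free road at low speed: the model accelerates toward the desired speed.
print(idm_acceleration(v=10.0, v_lead=10.0, gap=1000.0))
# Closing fast on a slower car at short range: the model brakes.
print(idm_acceleration(v=20.0, v_lead=10.0, gap=15.0))
```

Feeding such a model with the ego state and predicted lead-vehicle state is one way prediction data can be consumed by an ACC function.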
High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that provide a shared memory abstraction subject to non-uniform memory access (NUMA) costs. In order to keep the cores and the memory hierarchy simple, many-core embedded systems tend to employ simple, scratchpad-like memories rather than hardware-managed caches that require some form of cache coherence management. These “coherence-free” systems still require some means to synchronize memory accesses and guarantee memory consistency. Conventional lock-based approaches may be employed to accomplish the synchronization, but may lead to both usability and performance issues. Instead, speculative synchronization, such as hardware transactional memory, may be a more attractive approach. However, hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. The lack of a cache-coherence protocol adds new challenges to the design of hardware speculative support. In this paper, we present a new scheme for hardware transactional memory support within a cluster-based NUMA system that lacks an underlying cache-coherence protocol. To the best of our knowledge, this is the first design of speculative synchronization for this type of architecture. Through a set of benchmark experiments using our simulation platform, we show that our design achieves significant performance improvements over traditional lock-based schemes.
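The paper's mechanism is implemented in hardware, but the validate-and-commit principle behind speculative synchronization can be sketched in software. The versioned cell and commit lock below are illustrative stand-ins, not the paper's design: a transaction reads optimistically, computes outside any lock, then commits only if no conflicting write happened in the meantime, retrying otherwise.

```python
import threading

class VersionedCell:
    """A shared word paired with a version counter for optimistic reads."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()  # stands in for a commit arbiter

def speculative_increment(cell):
    """Optimistically read, compute, validate-and-commit; retry on conflict."""
    while True:
        v0, snapshot = cell.version, cell.value   # speculative read
        new_value = snapshot + 1                  # compute outside any lock
        with cell._commit_lock:                   # short commit phase
            if cell.version == v0:                # validate: no concurrent write
                cell.value = new_value
                cell.version += 1
                return
        # conflict detected -> abort the speculation and retry

cell = VersionedCell()
threads = [threading.Thread(target=lambda: [speculative_increment(cell)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)  # 4000: every increment commits exactly once
```

The appeal over coarse locking is that the computation itself runs lock-free; only the validation and commit are serialized.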
Tasks executing on general-purpose multiprocessor platforms exhibit variations in their execution times. As such, there is a need to explicitly consider robustness, i.e., tolerance to these fluctuations. This work aims to quantify the robustness of schedules of directed acyclic graphs (DAGs) on multiprocessors by defining probabilistic robustness metrics, and it presents a new approach to robustness analysis that obtains these metrics. Stochastic execution times of tasks are used to compute completion-time distributions, which are then used to compute the metrics. To overcome the difficulties involved with the max operation on distributions, a new curve-fitting approach is presented with which a distribution can be derived from a combination of analytical and limited simulation-based results. The approach has been validated on schedules of time-critical applications in ASML wafer scanners.
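One baseline way to obtain a completion-time distribution from stochastic task times, handling the troublesome max operation by simulation rather than analytically, is plain Monte Carlo sampling over the DAG. The graph and the task-time distributions below are hypothetical:

```python
import random

def completion_samples(dag, dist, n=10000, seed=42):
    """Monte Carlo sampling of the makespan of a DAG of tasks.

    dag  -- {task: [predecessor tasks]}, keys in topological order
    dist -- {task: callable(rng) returning one execution-time sample}
    Returns n sampled completion times of the whole graph.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        finish = {}
        for task, preds in dag.items():
            start = max((finish[p] for p in preds), default=0.0)
            finish[task] = start + dist[task](rng)
        samples.append(max(finish.values()))
    return samples

# Hypothetical 4-task fork-join graph with normally distributed times.
dag = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
dist = {t: (lambda rng, mu=mu: max(0.0, rng.gauss(mu, 0.1 * mu)))
        for t, mu in {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0}.items()}
s = completion_samples(dag, dist)
# A probabilistic robustness metric: chance of meeting a deadline of 6.0.
print(sum(x <= 6.0 for x in s) / len(s))
```

The paper's curve-fitting approach exists precisely to avoid needing so many simulation runs; this sketch shows only the purely simulation-based baseline.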
The SRAM memories used in embedded microprocessor devices consume a large portion of the system's power. The power dissipation of the instruction memory can be limited by using code compression methods, which may require the use of variable-length instruction formats in the processor. The power-efficient design of variable-length instruction fetch and decode is challenging for static multiple-issue processors, which aim for low power consumption on embedded platforms: the power saved by compression is easily lost through an inefficient processor design. We propose an implementation of instruction-template-based compression and two instruction fetch alternatives for variable-length instruction encoding on the Transport Triggered Architecture (TTA), a static multiple-issue exposed-datapath architecture. The compression approach reaches an average program size reduction of 44% at best. We show that the variable-length fetch designs are sufficiently low-power for the system to benefit from the code compression, which reduces the program memory size.
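The general idea of template-based code compression — replacing the most frequent instruction words with short dictionary indices and escaping the rest — can be sketched as follows. The 32-bit word size, the 1-bit escape flag, and the toy program are assumptions for illustration, not the paper's TTA encoding:

```python
from collections import Counter

def template_compress(program, n_templates=4):
    """Sketch of dictionary/template-based code compression.

    program -- list of instruction words (strings); the most frequent
    distinct words become templates addressed by a short index, all other
    words are emitted verbatim behind a 1-bit escape flag.
    Returns (compressed bits, original bits) assuming 32-bit instructions.
    """
    word_bits = 32
    index_bits = max(1, (n_templates - 1).bit_length())
    templates = {w for w, _ in Counter(program).most_common(n_templates)}
    compressed = sum(1 + index_bits if w in templates else 1 + word_bits
                     for w in program)
    return compressed, len(program) * word_bits

# Toy program dominated by a few recurring instruction words.
prog = ["nop"] * 6 + ["add r1,r2", "add r1,r2", "ld r3", "st r3"]
c, o = template_compress(prog)
print(c, o, round(1 - c / o, 2))  # compressed bits, original bits, reduction
```

Real schemes must also keep the fetch unit able to locate variable-length instructions cheaply, which is exactly the fetch-design challenge the abstract highlights.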
Powerful cache systems are required in multi- and many-core systems in order to provide suitable performance. Moreover, these caches must provide coherent access to shared data. With respect to embedded real-time systems, traditional write-invalidate and write-update coherence protocols hamper a static worst-case execution time analysis, since the content and state of a local cache can be modified by other cores. Hence, a meaningful cache analysis is no longer possible. In previous work, we introduced the On-Demand Coherent Cache (ODC²), which provides coherent access to shared data without any coherence transactions and without modifications of other caches. It allows a tight worst-case execution time (WCET) analysis at the expense of a decreased hit rate and an additional write-back procedure. In this work, we briefly describe two versions of the ODC² before presenting detailed evaluations of its performance and the cost of time predictability. We quantify the overhead of the write-back procedure of one of the ODC² variants and estimate the impact on the hit rate.
In today's multi-standard wireless basebands, convolutional codes (CC), Turbo codes, and LDPC codes are widely applied and need to be integrated within one FEC module. Since memory occupies half or even more of the decoder area, memory sharing techniques are valuable for saving area. In this work, several memory merging techniques are proposed, including a conflict-free access technique for the merged path metric buffer. The results show that 41% of the total memory bits are saved when integrating three different decoding schemes: CC (802.11a/g/n), LDPC (802.11n and 802.16e), and Turbo (3GPP-LTE). Synthesis results with a 65 nm process show that the merged memory blocks consume merely 1.06 mm² of chip area.
Data locality optimization is a well-known goal when handling programs that must run as fast as possible or use a minimum amount of energy. However, the usual techniques never address the significant impact of the numerous stalled processor cycles that may occur when consecutive load and store instructions access the same memory location. We show that two versions of the same program may exhibit similar memory performance while performing very differently in execution time, because of the stalled processor cycles generated by many pipeline hazards. We propose a new programming structure called “xfor”, enabling explicit control of the way data locality is optimized in a program and thus of the number of stalled processor cycles. We show the benefits of xfor regarding execution time and energy savings.
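Data locality is commonly quantified by reuse distance: the number of distinct addresses touched between two accesses to the same address. The sketch below (a generic illustration, not the paper's tooling) measures it for two orderings of the same accesses. Note the trade-off the abstract points at: distance-0 reuse is ideal for caches, yet back-to-back loads and stores to the same location are exactly the pattern that can stall the pipeline — which is what xfor is meant to let the programmer balance.

```python
def reuse_distances(trace):
    """Reuse distance of each access: number of distinct addresses touched
    since the previous access to the same address (None on first use).
    Small distances indicate good temporal locality."""
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            dists.append(len(set(trace[last_seen[addr] + 1:i])))
        else:
            dists.append(None)
        last_seen[addr] = i
    return dists

# Two traversals of the same four locations: separate passes vs a fused pass.
separate = ["a0", "a1", "a2", "a3", "a0", "a1", "a2", "a3"]  # two loops
fused    = ["a0", "a0", "a1", "a1", "a2", "a2", "a3", "a3"]  # one fused loop
print(reuse_distances(separate))  # every reuse at distance 3
print(reuse_distances(fused))     # every reuse at distance 0
```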
Heterogeneous multicore systems have gained momentum, especially for embedded applications, thanks to the performance and energy consumption trade-offs provided by in-order and out-of-order cores. Micro-architectural simulation models the behavior of pipeline structures and caches with configurable parameters. This level of abstraction is well known for being flexible enough to quickly evaluate the performance of new hardware implementations, such as future heterogeneous multicore platforms. However, there is currently no open-source micro-architectural simulator supporting both in-order and out-of-order ARM cores. This article describes the implementation and accuracy evaluation of a micro-architectural simulator of Cortex-A cores, supporting in-order and out-of-order pipelines and based on the open-source gem5 simulator. We explain how to simulate Cortex-A8 and Cortex-A9 cores in gem5, and compare the execution time of ten benchmarks with real hardware. Both models, with average absolute errors of only 7%, are more accurate than similar micro-architectural simulators, which show average absolute errors greater than 15%.
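The quoted accuracy figure (average absolute error of 7%) is typically computed per benchmark as the relative deviation of simulated from measured execution time, then averaged. A minimal sketch with hypothetical timings:

```python
def avg_abs_error(measured, simulated):
    """Average absolute relative error (%) between hardware measurements
    and simulator predictions -- the accuracy metric quoted in the text."""
    errs = [abs(s - m) / m * 100 for m, s in zip(measured, simulated)]
    return sum(errs) / len(errs)

# Hypothetical per-benchmark execution times (seconds): board vs gem5 model.
hw  = [1.00, 2.50, 0.80, 4.20]
sim = [1.05, 2.30, 0.85, 4.50]
print(round(avg_abs_error(hw, sim), 1))
```

Using the absolute value keeps over- and under-estimations from canceling out, which would otherwise flatter the simulator.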
Application-Specific Instruction Set Processors (ASIPs) seek an optimal performance/area/energy trade-off for a given algorithm. In all current design methodologies, an architectural model must first be created manually based on the designers' experience. These models are iteratively refined until the design constraints are met, through several time-consuming algorithm/architecture co-exploration iterations. This paper presents a novel performance estimation approach that shortens the design cycle of existing methodologies by providing an early assessment of the impact of customizations on the achievable performance. The approach does so by eliminating the need for a completely specified architecture, without limiting the designer's freedom and without simulating the application repeatedly. Overall, our approach reduces the number of necessary co-exploration iterations, thus increasing design productivity. We validate our approach via two different case studies: a butterfly-enabled ASIP for Fast Fourier Transform computation and a Connected Components Labeling ASIP for computer vision.
GPUs are much more power-efficient devices than CPUs, but due to several performance bottlenecks, the performance per watt of GPUs is often much lower than what could be achieved theoretically. To sustain and continue high-performance computing growth, new architectural and application techniques are required to create power-efficient computing systems. To find such techniques, however, it is necessary to study power consumption at a detailed level and understand the bottlenecks that cause low performance. Therefore, in this paper, we study GPU power consumption at the component level and investigate the bottlenecks that cause low performance and low energy efficiency. We divide the low-performance kernels into low-occupancy and full-occupancy categories. For the low-occupancy category, we study whether increasing the occupancy helps increase performance and energy efficiency. For the full-occupancy category, we investigate whether these kernels are limited by memory bandwidth, coalescing efficiency, or SIMD utilization.
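Occupancy — the fraction of a streaming multiprocessor's maximum resident threads that a kernel can actually keep resident — is bounded by whichever per-block resource (threads, registers, shared memory, or block slots) runs out first. A sketch with hypothetical round-number SM limits, not any specific GPU's datasheet values:

```python
def occupancy(threads_per_block, regs_per_thread, smem_per_block,
              max_threads=2048, max_regs=65536, max_smem=49152,
              max_blocks=32):
    """Theoretical SM occupancy for a kernel launch configuration.
    The four max_* limits are hypothetical SM resource caps."""
    blocks_by_threads = max_threads // threads_per_block
    blocks_by_regs = max_regs // (regs_per_thread * threads_per_block)
    blocks_by_smem = (max_smem // smem_per_block if smem_per_block
                      else max_blocks)
    # The tightest resource limit determines the resident block count.
    blocks = min(blocks_by_threads, blocks_by_regs, blocks_by_smem,
                 max_blocks)
    return blocks * threads_per_block / max_threads

# Register pressure limits this kernel to half occupancy: only 4 blocks
# of 256 threads fit in the register file, versus 8 allowed by threads.
print(occupancy(threads_per_block=256, regs_per_thread=64, smem_per_block=0))
# Halving register use restores full occupancy.
print(occupancy(threads_per_block=256, regs_per_thread=32, smem_per_block=0))
```

This is why increasing occupancy is the first lever the paper examines for the low-occupancy category: it is often a launch-configuration or register-allocation issue rather than an algorithmic one.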