ISBN (print): 9781479901043
Recently, many software runtime systems have been proposed that allow developers to efficiently map applications to contemporary consumer electronic devices and high-performance academic processing platforms. Most of these runtime systems employ advanced scheduling techniques for automatic task assignment to all available processing elements. However, they focus on a particular environment and architecture, and they are not easy to port to reconfigurable embedded MPSoCs. As a consequence, researchers in the embedded community implement hardwired, application-specific task schedulers that cannot be reused by other embedded MPSoCs. To address this problem, in this paper we propose a lightweight runtime software framework for reconfigurable shared-memory MPSoCs that integrate a master embedded processor connected to slave cores. Similar to many of the aforementioned runtime systems, we adopt a task-based programming model that uses simple, pragma-based annotations of the application software in order to dynamically resolve task dependencies. Our runtime system supports heterogeneity in the hardware resources and is low-overhead, to account for possible limitations in processing capability and available on-chip memory. To evaluate our proposal, we prototyped an MPSoC with seven slave cores on a Xilinx ML605 FPGA board. Three micro-benchmarks achieve performance speedups of 3.8x, 7x and 5.8x while consuming 27%, 14% and 18% of the energy, respectively, compared to a single-core baseline system with no runtime support.
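The pragma-based, dependency-resolving task model described above can be illustrated with OpenMP-style task annotations. This is a hypothetical stand-in: the paper's framework defines its own pragmas, but the underlying idea, in/out annotations from which the runtime builds and resolves a task graph, is the same.

```cpp
// OpenMP-style task annotations (a sketch, not the paper's actual pragmas):
// the runtime derives task dependencies from the depend clauses and
// dispatches ready tasks onto the available cores. Compiled without OpenMP
// support, the pragmas are ignored and the code runs sequentially with the
// same result.
int run_task_graph() {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                     // producer task
        #pragma omp task depend(out: b)
        b = 2;                     // independent producer, may run concurrently
        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                 // consumer: starts only after both producers
        #pragma omp taskwait
    }
    return c;
}
```

Note that run_task_graph() returns 3 whether or not OpenMP is enabled; only the degree of concurrency changes.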
ISBN (print): 9781479910717
As technology scaling allows multiple processing components to be integrated on a single chip, modern computing systems have evolved toward Multiprocessor System-on-Chip (MPSoC) and Chip Multiprocessor (CMP) designs. Networks-on-Chip (NoCs) have been proposed as a promising solution to the complex on-chip communication problems of these multicore platforms. To optimize the design of a NoC-based multicore system, it is essential to evaluate NoC performance across numerous configurations in a large design space. Taking traffic characteristics into account and using an appropriate latency model are therefore crucial for an accurate and fast evaluation. In this tutorial, we survey recent progress in these areas. We first review NoC workload modeling and traffic analysis techniques. Then, we discuss the mathematical formalisms for evaluating performance under a given traffic model, for both average and worst-case latency predictions. Finally, we discuss the advantages of combining analytical and simulation-based techniques and review new attempts at bridging the two approaches.
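As a sketch of the kind of average-latency formalism such tutorials cover (a generic queueing-theoretic decomposition, not necessarily the exact model surveyed here), the end-to-end packet latency over an H-hop route is commonly split into pipeline, serialization, and queueing terms, with the per-router waiting time approximated by the M/G/1 Pollaczek-Khinchine formula:

```latex
T_{\mathrm{avg}} \;=\; H\,t_r \;+\; \frac{L}{b} \;+\; \sum_{i=1}^{H} W_i,
\qquad
W_i \;\approx\; \frac{\lambda_i\,\overline{s_i^2}}{2\,\bigl(1-\rho_i\bigr)},
\quad \rho_i = \lambda_i\,\overline{s_i},
```

where t_r is the router pipeline delay, L the packet length, b the link bandwidth, \lambda_i the arrival rate at router i, and \overline{s_i}, \overline{s_i^2} the first two moments of its service time.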
Next generation cyber-physical systems (CPS) are expected to be deployed in domains which require scalability as well as performance under dynamic conditions. This scale and dynamicity will require that CPS communicat...
ISBN (print): 9783981537000
Modern systems-on-a-chip are equipped with power architectures that allow the consumption of individual components or subsystems to be controlled. These mechanisms are driven by a power-management policy, often implemented in the embedded software with hardware support. Today's circuits exhibit significant static power consumption, so low-power design requires techniques such as DVFS and power gating. Correct and efficient management of these mechanisms is therefore becoming non-trivial. Validating the effect of the power-management policy must be done very early in the design cycle, as part of the architecture-exploration activity, which requires high-level models of the hardware annotated with consumption information. Temperature must also be taken into account, since leakage current increases exponentially with it. Existing annotation techniques applied to loosely-timed or temporally-decoupled models would create simulation artifacts in the temperature profile (e.g., unrealistic peaks). This paper addresses the instrumentation of a timed transaction-level model of the hardware with information on the power consumption of the individual components. It can cope not only with power-state models but also with Joule-per-bit traffic models, and it avoids simulation artifacts when used in a functional/power/temperature co-simulation.
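A minimal sketch of the two consumption models mentioned above (a hypothetical API, not the paper's actual instrumentation): a component accumulates energy both from its power-state model (power times time spent in the state) and from a Joule-per-bit traffic model (energy per transferred bit).

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical component energy model combining a power-state model with a
// Joule-per-bit traffic model, in the spirit of the annotation scheme
// described above.
class PowerModel {
    std::map<std::string, double> state_power_w_;  // Watts per power state
    std::string state_;                            // current power state
    double joules_per_bit_;                        // traffic-model coefficient
    double energy_j_ = 0.0;                        // accumulated energy
public:
    PowerModel(std::map<std::string, double> p, std::string init, double jpb)
        : state_power_w_(std::move(p)), state_(std::move(init)),
          joules_per_bit_(jpb) {}

    // Advance simulated time while in the current power state.
    void elapse(double seconds) {
        energy_j_ += state_power_w_.at(state_) * seconds;
    }

    // Switch power state (e.g. RUN -> IDLE under a power-gating policy).
    void set_state(const std::string& s) { state_ = s; }

    // Account for a transaction using the Joule-per-bit traffic model.
    void transfer(double bits) { energy_j_ += joules_per_bit_ * bits; }

    double energy() const { return energy_j_; }
};
```

For example, 1 s in a 0.2 W RUN state, 2 s in a 0.01 W IDLE state, and 10^6 bits at 1 nJ/bit accumulate 0.2 + 0.02 + 0.001 = 0.221 J.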
ISBN (print): 9781479901043
Due to factors such as high transistor density, high frequency, and low voltage, today's processors are more than ever subject to hardware failures. These errors have various impacts depending on the location of the error and the type of processor. Because of the hierarchical structure of the compute units and of the work scheduling, a hardware failure on a GPU affects only part of the application. In this paper we present a new methodology to characterize the hardware failures of Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL. We also identify which hardware parts of the Tesla architecture are more sensitive to intermittent errors, which usually appear as the processor ages. We obtained these results by accelerating the aging process, running the processors at high temperature. We show that on GPUs the impact of intermittent errors is limited to a localized architectural tile. Finally, we propose a methodology to detect and record the locations of defective units so that they can be avoided, ensuring program correctness on such architectures and improving GPU fault tolerance and lifespan.
ISBN (print): 9781479901043
Reliability is emerging as an important design criterion in modern systems due to increasing transient fault rates. Hardware fault-tolerance techniques, commonly used to address this, introduce high design costs. As an alternative, software Signature-Monitoring (SM) schemes based on compiler assertions are an efficient method for control-flow error detection. Existing SM techniques do not consider application-specific information, causing unnecessary overheads. In this paper, compile-time Control-Flow-Graph (CFG) topology analysis is used to place the best-suited assertions at optimal locations in the assembly code to reduce overheads. Our evaluation with representative workloads shows increased fault coverage with overheads close to those of Assertion-based Control-Flow Correction (ACFC), the method with the lowest overhead. Compared to ACFC, our technique improves fault coverage by 17% on average, at the cost of 5% in performance overhead and 3% in power consumption, with equal code-size overhead.
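The flavor of assertion such schemes build on can be sketched as follows: a toy XOR-signature check in the spirit of classic signature monitoring. The paper's actual contribution, CFG-topology-driven selection and placement of the assertions, is not modeled by this toy; the block names and signature values are made up.

```cpp
#include <cstdint>

// Each basic block B_i gets a static signature s_i. At the entry of B_i the
// runtime signature G is XORed with the precomputed difference
// d = s_pred ^ s_i, so G equals s_i only if control arrived from a legal
// predecessor; any other path leaves a mismatch and flags a control-flow
// error (CFE).
static uint32_t G;          // runtime signature register
static bool cfe_detected;   // set when a control-flow error is caught

inline void enter_block(uint32_t d, uint32_t s) {
    G ^= d;
    if (G != s) cfe_detected = true;
}

// Simulate a three-block path 1 -> 2 -> 3; optionally inject a fault that
// jumps from block 1 straight into block 3, skipping block 2.
bool run(bool take_illegal_jump) {
    cfe_detected = false;
    const uint32_t s1 = 0xA, s2 = 0xB, s3 = 0xC;
    G = s1;                                  // block 1 (entry)
    if (!take_illegal_jump) {
        enter_block(s1 ^ s2, s2);            // legal edge 1 -> 2
        enter_block(s2 ^ s3, s3);            // legal edge 2 -> 3
    } else {
        enter_block(s2 ^ s3, s3);            // d assumes predecessor 2: mismatch
    }
    return cfe_detected;
}
```

On the legal path the signature updates cancel exactly; on the illegal edge G ends up as s1 ^ s2 ^ s3, which differs from s3, so the assertion fires.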
ISBN (print): 9781479901043
In this work we present the first design and implementation of a wait-free hash map. Our multiprocessor data structure allows a large number of threads to concurrently put, get, and remove information. Wait-freedom means that all threads make progress in a finite amount of time, an attribute that can be critical in real-time environments. This contrasts with traditional blocking implementations of shared data structures, which suffer from deadlock and related correctness and performance issues. Our design is portable because we use only atomic operations provided by the hardware; therefore, our hash map can be utilized by a variety of data-intensive applications, including those in the domains of embedded systems and supercomputers. The challenges of providing the wait-freedom guarantee make the design and implementation of wait-free objects difficult; as such, few wait-free data structures are described in the literature, and in particular there are no prior wait-free hash maps. It often becomes necessary to sacrifice performance in order to achieve wait-freedom. However, our experimental evaluation shows that our hash map design is, on average, 5 times faster than a traditional blocking design. Our solution outperforms the best available non-blocking alternatives in a large majority of cases, typically by a factor of 8 or higher.
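To make the "hardware atomics only" idea concrete, here is a deliberately simplified sketch of non-blocking insertion into a fixed-size open-addressed table using compare-and-swap. This toy is lock-free, not wait-free, and is not the paper's design: a genuinely wait-free map additionally needs helping/announcement machinery so that every thread completes in a bounded number of its own steps, which is the hard part the paper addresses. It also elides deletion and the ordering of the key/value publication.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Fixed-size open-addressed map with CAS-claimed key slots and linear
// probing. Key 0 is reserved as the empty marker.
constexpr int kSlots = 64;
constexpr uint64_t kEmpty = 0;

struct Map {
    std::atomic<uint64_t> keys[kSlots] = {};
    std::atomic<uint64_t> vals[kSlots] = {};

    bool put(uint64_t key, uint64_t val) {       // requires key != 0
        for (int i = 0; i < kSlots; ++i) {
            const std::size_t slot = (key + i) % kSlots;
            uint64_t expected = kEmpty;
            // Claim an empty slot atomically, or reuse the slot if the key
            // is already present (another thread lost the race to us, or we
            // are updating an existing entry).
            if (keys[slot].compare_exchange_strong(expected, key) ||
                expected == key) {
                vals[slot].store(val);
                return true;
            }
        }
        return false;                            // table full
    }

    bool get(uint64_t key, uint64_t* out) {
        for (int i = 0; i < kSlots; ++i) {
            const std::size_t slot = (key + i) % kSlots;
            const uint64_t k = keys[slot].load();
            if (k == key) { *out = vals[slot].load(); return true; }
            if (k == kEmpty) return false;       // probe chain ends: absent
        }
        return false;
    }
};
```

Because slots are only ever claimed, never freed, the probe chain logic stays simple; supporting remove() while preserving non-blocking progress is one of the places where real designs become substantially more involved.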
ISBN (print): 9781479901043
Support Vector Machines (SVMs) are considered a state-of-the-art classification algorithm, yielding high accuracy rates. However, SVMs often require processing a large number of support vectors, making the classification process computationally demanding, especially in embedded applications. Cascade SVMs have been proposed to speed up classification, but the improved performance comes at the cost of additional hardware resources. Consequently, in this paper we propose an optimized architecture for cascaded SVM processing, along with a hardware reduction method that lowers the overhead of implementing the additional cascade stages, leading to significant resource and power savings for embedded applications. The architecture was implemented on a Virtex 5 FPGA platform and evaluated using face detection as the target application on 640×480 images. It was compared against an implementation of the same cascade processing architecture without the reduction method, and against a single parallel SVM classifier. The proposed architecture achieves an average performance of 70 frames per second, a speed-up of 5× over the single parallel SVM classifier. Furthermore, the hardware reduction method results in 43% less hardware resource utilization and a 20% reduction in power, with only a 0.7% reduction in classification accuracy.
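The cascade principle behind the speed-up can be sketched in software (a hypothetical toy with a linear kernel and made-up weights, not the paper's hardware architecture): each stage evaluates an SVM decision function, and a window is rejected as soon as one stage votes negative, so the expensive final stage with many support vectors runs only on the few candidates that survive the cheap early stages.

```cpp
#include <cstddef>
#include <vector>

// One cascade stage: a linear-kernel SVM decision function
// f(x) = sum_i w_i . x + b, with the alpha_i * y_i coefficients folded
// into the stored support vectors for brevity.
struct Stage {
    std::vector<std::vector<double>> sv;  // support vectors (coeffs folded in)
    double bias;

    double decide(const std::vector<double>& x) const {
        double f = bias;
        for (const auto& v : sv)
            for (std::size_t d = 0; d < x.size(); ++d)
                f += v[d] * x[d];
        return f;
    }
};

// Early-rejecting cascade: cheap stages (few support vectors) come first,
// the expensive stage (many support vectors) runs last and only on windows
// that passed everything before it.
bool cascade_classify(const std::vector<Stage>& stages,
                      const std::vector<double>& x) {
    for (const auto& s : stages)
        if (s.decide(x) < 0.0) return false;  // early reject
    return true;                              // accepted by all stages
}
```

Since most windows in a face-detection scan are easy negatives, the average number of support-vector operations per window collapses toward the cost of the first stage.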
ISBN (print): 9781479901043
The declining robustness of transistors and their ever-denser integration threaten the dependability of future microprocessors. Classic multicores offer a simple solution to hardware defects: faulty processors can be disabled without affecting the rest of the system. However, this approach quickly becomes impractical at high fault rates. Recently, distributed computer architectures have been proposed to mitigate the effects of faulty transistors by utilizing fine-grained hardware reconfiguration managed by fully decoupled control logic. Unfortunately, such solutions trade execution locality for flexibility, and thus do not scale to large systems. To overcome this issue we propose Cobra, a distributed, scalable, highly parallel, reliable architecture. Cobra is a service-based architecture in which groups of dynamic instructions flow independently through the system, making use of the available hardware resources. Cobra organizes the system's units dynamically using a novel resource-assignment scheme that preserves locality and limits communication overhead. Our experiments show that Cobra is extremely dependable and outperforms classic multicores when subjected to 5 or more defects per 100 million transistors. We also show that Cobra operates 80% faster than previous state-of-the-art solutions on multi-programmed SPEC CPU2006 workloads and improves cache hit rate by up to 62%. Our runtime fault-detection techniques have a performance impact of only 3%.
ISBN (print): 9781479901043
This paper summarizes the design of a programmable processor with a transport triggered architecture (TTA) for decoding LDPC and turbo codes. The processor architecture is designed so that it can be programmed for either LDPC or turbo decoding, for the purpose of internetworking and roaming between different networks. The standard trellis-based maximum a posteriori (MAP) algorithm is used for turbo decoding. Unlike most other implementations, a supercode-based sum-product algorithm is used for the check-node message computation in LDPC decoding. This approach ensures the highest hardware utilization of the processor architecture across the two different algorithms. To the best of our knowledge, this is the first attempt to design a TTA processor for LDPC decoding. The processor is programmed in a high-level language to meet time-to-market requirements. The optimization techniques and the usage of the function units for both algorithms are explained in detail. The processor achieves a throughput of 22.64 Mbps for turbo decoding with a single iteration and 10.12 Mbps for LDPC decoding with five iterations at a clock frequency of 200 MHz.
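For context on the check-node computation the abstract refers to, here is the widely used min-sum approximation of that step. The paper itself uses a supercode-based sum-product variant; min-sum is shown only as the textbook baseline. For each edge, the outgoing message takes the product of the signs of all other incoming messages and the minimum of their magnitudes.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Min-sum check-node update for one LDPC check node. in[k] are the incoming
// LLR messages from the connected variable nodes; out[j] is the extrinsic
// message sent back on edge j, computed from all inputs except in[j].
std::vector<double> check_node_min_sum(const std::vector<double>& in) {
    std::vector<double> out(in.size());
    for (std::size_t j = 0; j < in.size(); ++j) {
        double sign = 1.0;
        double mag = INFINITY;
        for (std::size_t k = 0; k < in.size(); ++k) {
            if (k == j) continue;                // exclude the target edge
            if (in[k] < 0.0) sign = -sign;       // product of signs
            mag = std::min(mag, std::fabs(in[k]));  // min of magnitudes
        }
        out[j] = sign * mag;
    }
    return out;
}
```

For incoming messages {2.0, -1.0, 3.0} this yields {-1.0, 2.0, -1.0}: each output combines the two other inputs, never the edge's own message.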