ISBN: (print) 9781595931610
Complex systems-on-chip (SoCs) are designed by interconnecting pre-designed hardware (HW) and software (SW) components. During the design cycle, a global model of the SoC may be composed of HW and SW models at different abstraction levels. Designing the HW/SW interfaces that interconnect SoC components is a source of design bottlenecks. This paper describes a service-based model that enables systematic design and co-simulation of HW/SW interfaces for SoC design. This model, called the Service Dependency Graph (SDG), allows modeling of complex and application-specific interfaces. We also present a model generator that can automatically build HW/SW interfaces based on the service and resource requirements described by the SDG. This approach has been applied successfully to the design of an MPEG-4 encoder. Additionally, the SDG appears to be an excellent intermediate representation for the design automation of HW/SW interfaces.
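The abstract does not specify the SDG's concrete form. As a rough illustration of the idea only, a service dependency graph can be modeled as services that require other services, with the generator resolving the transitive closure of requirements; the service names (`fifo_write`, `bus_access`, `interrupt`) are invented for this sketch.

```python
from collections import defaultdict

class ServiceDependencyGraph:
    """Toy service dependency graph: each service names the services it requires."""
    def __init__(self):
        self.requires = defaultdict(list)

    def add_dependency(self, service, required):
        self.requires[service].append(required)

    def resolve(self, service, seen=None):
        """Return all services needed to provide `service`, dependencies first."""
        if seen is None:
            seen = []
        for dep in self.requires[service]:
            self.resolve(dep, seen)
        if service not in seen:
            seen.append(service)
        return seen

sdg = ServiceDependencyGraph()
sdg.add_dependency("fifo_write", "bus_access")   # a SW driver service needs a bus service
sdg.add_dependency("bus_access", "interrupt")    # the bus service needs interrupt handling
print(sdg.resolve("fifo_write"))                 # dependencies come out first
```

An interface generator in this spirit would walk the resolved list and instantiate the HW or SW component implementing each service.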
In this paper, we present a novel approach to accelerating loops by tightly coupling a coprocessor to an application-specific instruction-set processor (ASIP). Latency hiding is used to exploit the parallelism available in this architecture. To illustrate the advantages of this approach, we investigate a JPEG encoding algorithm and accelerate one of its loops by implementing it in a coprocessor. We contrast the acceleration obtained by implementing the critical segment as two different coprocessors and as a set of customized instructions. The two coprocessor approaches are a high-level synthesis (HLS) approach and a custom coprocessor approach; the HLS approach provides a faster method of generating coprocessors. Relative to the main processor alone, we show a loop performance improvement of 2.57x using the custom coprocessor approach, compared to 1.58x for the HLS approach and 1.33x for the customized-instruction approach. The respective energy savings within the loop are 57%, 28% and 19%.
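The loop-level speedups above translate into application-level gains via Amdahl's law. A minimal sketch, assuming a hypothetical loop fraction of 60% of JPEG encode time (the abstract does not state the actual fraction):

```python
def overall_speedup(loop_fraction, loop_speedup):
    """Amdahl's law: overall speedup when only a loop taking `loop_fraction`
    of the runtime is accelerated by `loop_speedup`."""
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup)

# Hypothetical: suppose the accelerated loop were 60% of JPEG encode time.
for name, s in [("custom coprocessor", 2.57), ("HLS", 1.58), ("custom instructions", 1.33)]:
    print(f"{name}: {overall_speedup(0.6, s):.2f}x overall")
```

The serial remainder caps the achievable gain, which is why tightly coupling the coprocessor (reducing communication overhead outside the loop) matters as much as the loop speedup itself.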
This paper describes the design of a low-cost, low-power smart imaging core that can be embedded in cameras. The core integrates an ARM9 processor, a camera interface and two application-specific hardware blocks for image processing: a smart imaging coprocessor and an enhanced motion estimator. Both coprocessors have been designed using high-level synthesis tools, taking the C programming language as a starting point. The resulting RTL code of each coprocessor has been synthesized and verified on an FPGA board. Two automotive and two mobile smart imaging applications are mapped onto the resulting smart imaging core. The process of mapping the original C++ applications onto the smart imaging core is also presented in this paper.
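Block-matching motion estimation of the general kind such a coprocessor accelerates can be sketched in a few lines. This is a generic full-search sum-of-absolute-differences (SAD) matcher, not the paper's enhanced estimator; the toy frames are constructed so the best motion vector is known.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
                          for a, b in zip(row_a, row_b))

def best_motion_vector(ref, cur, bx, by, n, search):
    """Full search: find the (dx, dy) within +/-search that best matches the
    n x n block of `cur` at (bx, by) against the reference frame `ref`."""
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if not (0 <= ry and ry + n <= len(ref) and 0 <= rx and rx + n <= len(ref[0])):
                continue  # candidate block falls outside the reference frame
            cost = sad([row[bx:bx + n] for row in cur[by:by + n]],
                       [row[rx:rx + n] for row in ref[ry:ry + n]])
            if cost < best_cost:
                best_cost, best = cost, (dx, dy)
    return best, best_cost

# Toy frames: `cur` is `ref` shifted left by one pixel, so the best motion
# vector for the 4x4 block at (2, 2) is (dx=1, dy=0) with zero SAD.
ref = [[(3 * x + 5 * y) % 17 for x in range(8)] for y in range(8)]
cur = [[ref[y][x + 1] if x + 1 < 8 else 0 for x in range(8)] for y in range(8)]
mv, cost = best_motion_vector(ref, cur, bx=2, by=2, n=4, search=2)
```

The nested independent SAD evaluations are exactly the kind of regular, data-parallel kernel that HLS tools map well to hardware.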
Arbitrary memory dependencies and variable-latency memory systems are major obstacles to the synthesis of large-scale ASIC systems in high-level synthesis. This paper presents SOMA, a synthesis framework for constructing Memory Access Network (MAN) architectures that inherently enforce memory consistency in the presence of dynamic memory access dependencies. A fundamental bottleneck in any such network is arbitrating between concurrent accesses to a shared memory resource. To alleviate this bottleneck, SOMA uses an application-specific concurrency analysis technique to predict the dynamic memory parallelism profile of the application. This is then used to customize the MAN architecture. Depending on the parallelism profile, the MAN may be optimized for latency, throughput or both. The optimized MAN is automatically synthesized into gate-level structural Verilog using a flexible library of network building blocks. SOMA has been successfully integrated into an automated C-to-hardware synthesis flow, which generates standard-cell circuits from unrestricted ANSI-C programs. Post-layout experiments demonstrate that application-specific MAN construction significantly improves power and performance.
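The arbitration bottleneck SOMA targets can be illustrated behaviorally. A minimal round-robin arbiter sketch follows; the actual MAN building blocks are synthesized Verilog structures, and this Python model only shows the fairness policy such an arbiter implements.

```python
def round_robin_arbitrate(requests, last_grant):
    """Grant one requesting port per cycle, scanning ports starting just
    after the last granted port so no requester starves."""
    n = len(requests)
    for i in range(1, n + 1):
        port = (last_grant + i) % n
        if requests[port]:
            return port
    return None  # no port is requesting this cycle

# Ports 0 and 2 request; port 1 was granted last, so port 2 wins this cycle.
print(round_robin_arbitrate([True, False, True], last_grant=1))
```

SOMA's concurrency analysis decides how many such arbitration points the network needs: a highly parallel access profile justifies a wider, more arbitrated MAN, while a mostly serial profile favors a low-latency one.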
LUT-based reconfigurability is being complemented by a new, emerging class of reconfigurable technology based on ALUs and heterogeneous architectures. Moreover, existing FPGA technology has seen recent advances that go beyond other implementation technologies such as DSPs, ASICs or ASSPs. Already deployed within the network infrastructure and in application areas where performance, power and flexibility are paramount, software-programmed hardware reconfigurability is an attractive candidate for migration to the mobile computing domain. Enabling this migration requires an advance in design automation whereby algorithms can be efficiently mapped to the computational fabric using software-programming techniques. In this paper we describe how digital circuits can be synthesized from software designs and formal executable specifications to reconfigurable architectures using higher-level languages and deterministic hardware compilation, or 'C-synthesis'. For context, we provide an introduction to and overview of reconfigurable architectures.
Memory is a scarce resource in many embedded systems. Increasing memory often increases packaging and cooling costs, size, and energy consumption. This paper presents CRAMES, an efficient software-based RAM compression technique for embedded systems. The goal of CRAMES is to dramatically increase effective memory capacity without hardware design changes, while maintaining high performance and low energy consumption. To achieve this goal, CRAMES takes advantage of an operating system's virtual memory infrastructure by storing swapped-out pages in compressed format. It dynamically adjusts the size of the compressed RAM area, protecting applications capable of running without it from performance or energy consumption penalties. In addition to compressing working data sets, CRAMES also enables efficient in-RAM filesystem compression, thereby further increasing RAM capacity. CRAMES was implemented as a loadable module for the Linux kernel and evaluated on a battery-powered embedded system. Experimental results indicate that CRAMES is capable of doubling the amount of RAM available to applications. Execution time and energy consumption for a broad range of examples increase only slightly, by averages of 0.35% and 4.79%, respectively. In addition, this work identifies the software-based compression algorithms that are most appropriate for low-power embedded systems.
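The core mechanism, keeping swapped-out pages in RAM in compressed form, can be sketched as follows. This toy uses zlib for illustration; the paper itself evaluates which compression algorithms suit low-power embedded systems, and a real implementation lives inside the kernel's swap path.

```python
import zlib

PAGE_SIZE = 4096  # typical Linux page size

class CompressedRamSwap:
    """Toy compressed-RAM swap area: pages swapped out by the VM layer are
    kept in RAM in compressed form instead of going to a storage device."""
    def __init__(self):
        self.store = {}  # page number -> compressed bytes

    def swap_out(self, page_no, data):
        assert len(data) == PAGE_SIZE
        self.store[page_no] = zlib.compress(data)

    def swap_in(self, page_no):
        return zlib.decompress(self.store.pop(page_no))

swap = CompressedRamSwap()
page = (b"hello embedded world " * 200)[:PAGE_SIZE]
swap.swap_out(7, page)
ratio = len(swap.store[7]) / PAGE_SIZE
assert swap.swap_in(7) == page          # pages round-trip losslessly
print(f"compressed to {ratio:.0%} of a page")
```

The effective capacity gain depends on how compressible the working set is, which is why CRAMES sizes the compressed area dynamically rather than reserving a fixed region.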
In this paper, we present a new architectural concept for network processors called FlexPath NP. The central idea behind FlexPath NP is to systematically map network processor (NP) application sub-functions onto both SW-programmable processor (CPU) resources and (re-)configurable HW building blocks, such that different packet flows are forwarded via different, optimized processing paths through the NP. Packets with well-understood, relatively simple processing requirements may even bypass the central CPU complex (AutoRoute). In consequence, CPU processing resources are used more effectively, and the overall NP performance and throughput are improved compared to conventional NP architectures. We present analytical performance estimations to quantify the performance advantage of FlexPath (expressed as the CPU instruction budget available for each packet traversing the CPUs) and introduce a platform-based System-on-Programmable-Chip (SoPC) architecture that implements the FlexPath NP concept.
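The cited metric, available CPU instructions per packet, can be estimated with a simple back-of-the-envelope model; the clock rate, IPC and packet rate below are invented for illustration and are not the paper's figures.

```python
def cpu_instructions_per_packet(clock_hz, ipc, pps, bypass_fraction):
    """Average CPU instruction budget per packet that actually traverses the
    CPUs; packets taking the AutoRoute bypass free that budget for the rest."""
    cpu_pps = pps * (1.0 - bypass_fraction)
    return clock_hz * ipc / cpu_pps

base = cpu_instructions_per_packet(200e6, 1.0, 1e6, bypass_fraction=0.0)
flex = cpu_instructions_per_packet(200e6, 1.0, 1e6, bypass_fraction=0.5)
print(base, flex)  # bypassing half the traffic doubles the per-packet budget
```

This captures the essence of the FlexPath argument: every flow that AutoRoute keeps out of the CPU complex raises the instruction budget for the flows that genuinely need software processing.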
In this paper, we present the analysis, design and implementation of an estimator for realizing large-bit-width unsigned integer multiplier units. Large multiplier units are required in cryptography and error-correction circuits for more secure and reliable transmission over highly insecure and/or noisy channels in networking and multimedia applications. The design space for these circuits is very large when integer multiplication on large operands is carried out hierarchically. In this paper, we explore automated synthesis of high-bit-width unsigned integer multiplier circuits by defining and validating an estimator function used in searching and analyzing the design space of such circuits. We focus on analysis of a hybrid hierarchical multiplier scheme that combines the throughput advantages of parallel multipliers with the resource cost-effectiveness of serial ones. We present an analytical model that rapidly predicts timing and resource usage for selected candidate designs. We evaluate the estimator model in the design of a practical application, a 256-bit elliptic curve adder implemented on a Xilinx FPGA fabric. We show that our estimator enables the implementation of fast, efficient circuits, with the resultant designs providing order-of-magnitude performance improvements compared with software implementations on a high-performance computing platform.
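A hierarchical timing/area estimator of the general kind described can be sketched recursively. The cost constants below are hypothetical, and the decomposition is a plain four-submultiplier schoolbook split with fully parallel subproducts, not the paper's hybrid serial/parallel scheme.

```python
def estimate(bits, leaf_bits, leaf_time, leaf_area):
    """Recursively estimate (time, area) for a `bits`-wide multiplier built
    from four half-width sub-multipliers (schoolbook split), assuming the
    four sub-products are computed in parallel and merged by one adder stage."""
    if bits <= leaf_bits:
        return leaf_time, leaf_area
    t, a = estimate(bits // 2, leaf_bits, leaf_time, leaf_area)
    add_time = 1.0          # hypothetical delay of the combining adder stage
    add_area = bits / 8.0   # hypothetical area of the combining adders
    return t + add_time, 4 * a + add_area

# 256-bit multiplier built from 32-bit leaf units of unit time and area.
t, a = estimate(256, 32, leaf_time=1.0, leaf_area=1.0)
print(t, a)
```

A real estimator of this shape lets a design-space search compare, in microseconds rather than synthesis runs, many (leaf size, serial/parallel mix) candidates before committing to FPGA implementation.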