Energy-efficient sensor nodes are among the rapidly expanding applications for embedded systems technology. Typically, the processing resources in sensor nodes are based on programmable microcontrollers and digital signal processors, and the same processing architecture is used regardless of the actual task of the node. This regularly results in at least an order of magnitude over-provisioning of resources, and in higher power consumption than would be needed by tightly application-specific processing solutions. Currently, experiments show that Flash FPGA technology enables implementing precisely provisioned processing for sensor nodes with energy efficiency that rivals off-the-shelf processor solutions. The expected competitiveness originates from savings in silicon real estate and lowered software overheads, as inherently parallel tasks can be offloaded to dedicated hardware accelerators on the same die with a microcontroller unit and radio baseband. The results pave the way for a novel type of self-powered sensor node whose processing resources are configured according to its tasks.
Summary form only given. Dynamic signal processing systems, where significant changes in functionality and computational structure must be achieved while applications are running, are becoming increasingly important as computational platforms become more powerful, and feature sets of DSP-powered products become more sophisticated. This talk covers two new, complementary dataflow models of computation that are being developed in the Maryland DSPCAD Research Group to help address the challenges of structured design, simulation, and synthesis of dynamic signal processing systems. The first of these models, called enable-invoke dataflow (EIDF), is aimed at improving the predictability of actor invocation and the efficiency with which dynamic scheduling techniques can be realized. The second model, called the dataflow schedule graph (DSG), provides a formal framework for representing and analyzing dataflow graph schedules that is rooted in formal dataflow semantics, and accommodates a wide range of schedule classes, including static, quasi-static, and dynamic schedules, as well as both sequential and parallel schedule formats. In this talk, I will present the EIDF and DSG models and discuss their potential to improve the processes by which dynamic signal processing systems are developed.
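The abstract does not spell out how EIDF works, so the following is only an illustrative sketch under stated assumptions: that each actor exposes a non-blocking enable check (are enough tokens present for the current mode?) and an invoke step that fires once and reports the next mode. The actor `SumActor`, its two modes, and the `run` loop are hypothetical names invented for this example, not part of the talk.

```python
from collections import deque


class SumActor:
    """Hypothetical two-mode actor (an assumption for illustration).

    Mode "read_length": consume one control token n.
    Mode "sum": consume n data tokens and emit their sum.
    """

    def __init__(self, ctrl, data, out):
        self.ctrl, self.data, self.out = ctrl, data, out
        self.mode = "read_length"
        self.length = 0

    def enable(self):
        # Side-effect-free check: can the current mode fire right now?
        if self.mode == "read_length":
            return len(self.ctrl) >= 1
        return len(self.data) >= self.length

    def invoke(self):
        # Fire once in the current mode, then report the next mode.
        if self.mode == "read_length":
            self.length = self.ctrl.popleft()
            self.mode = "sum"
        else:
            total = sum(self.data.popleft() for _ in range(self.length))
            self.out.append(total)
            self.mode = "read_length"
        return self.mode


def run(actor, max_firings=100):
    """Minimal dynamic scheduler: fire while the actor reports it is enabled."""
    firings = 0
    while firings < max_firings and actor.enable():
        actor.invoke()
        firings += 1
    return firings
```

Separating the check from the firing lets a scheduler decide whether to invoke an actor without ever blocking on a read, which is one plausible reading of the predictability goal the abstract mentions.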
Recently, three-dimensional integration technology has allowed researchers and designers to explore novel architectures for computing systems. Due to the memory-intensive nature of signal processing systems, DSPs can greatly benefit from 3D memory integration technology realized by vertically stacking high-density memory below processing cores. In this paper, we analyze the energy and performance impacts of 3D memory integration in DSP systems by exploring a wide variety of memory configurations that the technology enables. Our analysis demonstrates that a 3D memory hierarchy can increase performance by an average of 20.6% while decreasing energy by 6.7% when compared to our baseline 2D memory hierarchy, with a processor similar to a Texas Instruments C67x DSP. The proposed memory architecture also allows us to scale the voltage and frequency to decrease energy consumption by an average of 23% without decreasing performance.
In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators, where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry-standard Splash-2 and EEMBC MULTIBENCH benchmarks and demonstrate simulation speeds of up to 25,307 MIPS on a 32-core x86 host machine for as many as 2048 target processors whilst exhibiting minimal and near-constant overhead.
Functional simulators find widespread use as sub-systems within microarchitectural simulators. The speed of functional simulators is strongly influenced by the implementation style of the functional simulator, e.g. interpreted vs. binary-translated simulation. Speed is also strongly influenced by the level of detail of the interface the functional simulator presents to the rest of the timing simulator. This level of detail may change during design space exploration, requiring corresponding changes to the interface and the simulator. However, for many implementation styles, changing the interface is difficult. As a result, architects may choose either implementation styles which are more malleable or interfaces with more detail than is necessary. In either case, simulation speed is traded for simulator design time. We show that this tradeoff is unnecessary if an orthogonal-specification design principle is practiced: specify how a simulator is to be implemented separately from what it is implementing and then synthesize a simulator from the combined specifications. We show that the use of an Architectural Description Language (ADL) with constructs for implementation style specification makes it possible to synthesize interfaces with different implementation styles with reasonable effort.
An innovative high-throughput and scalable multi-transform architecture for H.264/AVC is presented in this paper. This structure can be used as a hardware accelerator in modern embedded systems to efficiently compute the 4×4 forward/inverse integer DCT, as well as the 2-D 4×4 / 2×2 Hadamard transforms. Moreover, its highly flexible design and hardware efficiency allow it to be easily scaled in terms of performance and hardware cost to meet the specific requirements of any given video coding application. Experimental results obtained using a Xilinx Virtex-4 FPGA demonstrate the superior performance and hardware efficiency levels provided by the proposed structure, which presents a throughput per unit of area at least 1.8× higher than other similar recently published designs. Furthermore, the results also show that this architecture can compute, in real time, all the above-mentioned H.264/AVC transforms for video sequences with resolutions up to UHDV.
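As a software reference point for the transforms named in this abstract, the sketch below computes the 4×4 forward integer transform core and the 4×4 Hadamard core as Y = C · X · Cᵀ, using the coefficient matrices commonly given for H.264/AVC. It covers only the matrix core: the scaling and quantization stages that the standard folds in afterwards are omitted, and it says nothing about how the paper's hardware organizes the computation. The function names are assumptions made for this example.

```python
# Coefficient matrices as commonly given for the H.264/AVC 4x4 transforms.
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def matmul(a, b):
    """Plain integer matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_core(block, coeff=CF):
    """Return coeff * block * coeff^T (no scaling or quantization)."""
    return matmul(matmul(coeff, block), transpose(coeff))


def hadamard_core(block):
    """4x4 Hadamard core, reusing the same two-sided product."""
    return forward_core(block, H4)
```

Because both transforms share the form C · X · Cᵀ and differ only in their coefficients, a single datapath can in principle serve several of them, which is consistent with the multi-transform idea the abstract describes.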
In this paper, a flexible HW architecture for video-based driver assistance applications is presented. It comprises a customizable and extensible processor template and several task-specific HW accelerators. The proposed heterogeneous architecture allows utilization of the programmable processor core for control and low data rate tasks. For the acceleration of computationally intensive tasks of the application, special functional units and custom instructions can be added to the processor template to form an application-specific instruction set processor (ASIP). Moreover, dedicated HW accelerators can be attached to the ASIP. To compare the diverse design options, a shape detection application for traffic sign detection is used as a case study. It is shown that individual tasks of a pure software implementation can be accelerated by a factor of up to 35 using special functional units and by a factor of up to 243 using HW accelerators. The proposed architecture has been mapped onto an FPGA, and it is shown that a real-time capable system can be realized.
With the increasing proliferation of heterogeneous and reconfigurable computing, it has become essential to have efficient prediction models to drive early HW-SW partitioning and co-design. In this paper, we present a high-level quantitative prediction modeling approach that accurately models the relation between hardware and software metrics, based on several statistical techniques. The proposed approach generates models that predict hardware performance indicators for reconfigurable components, such as the number of slices, the number of flip-flops, and the number of wires. It utilizes automatic model selection, artificial neural networks, (logistic) regression, and data transformations. These models take a high-level language description as input, enabling hardware prediction in the early design stages. We calibrate the models for two sets of tools targeting Xilinx and Altera FPGAs, where we report, for example, an error of 14% for the number of multipliers in the case of Xilinx and an error of only 18% for the number of wires in the case of Altera. To provide a realistic evaluation, we validate the approach using 181 kernels, contrary to the majority of the existing techniques, which use libraries of tens of kernels at most.
Future multi-core processors will necessitate exploitation of fine-grain, architecture-independent parallelism from applications to utilize many cores with relatively small local memories. We use c264, an end-to-end H.264 video encoder for the Cell processor based on x264, to show that exploiting fine-grain parallelism remains challenging and requires significant advancement in runtime support. Our implementation of c264 achieves speedup between 4.7× and 8.6× on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). We find that the programming effort associated with efficient parallelization of c264 at fine granularity is highly non-trivial. Hand optimizations may improve performance significantly but are limited eventually by the code restructuring they require. We assess the complexity of exploiting fine-grain parallelism in realistic applications, by identifying optimizations of c264 and the effort they require.
Pervasive computing refers to a seamless and invisible computing environment in which ubiquitous and connected computing devices gather information about the environment. Such computer-enabled artefacts represent a ne...