In this paper, an extension of the OVP-based MPSoC simulator MPSoCSim is presented. MPSoCSim extends the OVP simulator with a SystemC Network-on-Chip (NoC), allowing the modeling and evaluation of NoC-based Multiprocessor Systems-on-Chip (MPSoCs). The proposed version enables the modeling and evaluation of complex clustered MPSoCs and many-cores. The clusters are composed of several independent subgroups. Each subgroup includes an OVP processor connected by a local bus to its own local memory for code, stack, and heap. Since the subgroups are independent, the attached OVP processor model can differ from those of the other subgroups (ARM, MicroBlaze, MIPS, ...), allowing the simulation of heterogeneous platforms; each processor also executes its own code. Subgroups are connected to each other through a shared bus, allowing all the subgroups in a cluster to access a shared memory. Finally, clusters are connected through a SystemC NoC supporting a mesh topology with wormhole switching and different routing algorithms. The NoC is scalable, and the number of subgroups in each cluster is parameterizable. For dynamic execution, the OVP processor models support different operating systems (OS), and mechanisms are available to control the dynamic execution of applications on the platform. Different platforms and applications have been evaluated in terms of simulated execution time, simulation time on the host machine, and number of simulated instructions.
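The abstract mentions a mesh topology with wormhole switching and different routing algorithms. As a minimal illustrative sketch (not MPSoCSim's actual code), deterministic XY routing — a common deadlock-free routing algorithm for 2D-mesh NoCs of this kind — can be modeled as follows; the coordinate representation is an assumption for illustration:

```python
# Sketch of deterministic XY routing on a 2D mesh NoC: a packet is routed
# fully along the X dimension first, then along Y, which is deadlock-free
# on a mesh. Router coordinates are (x, y) tuples; this models only the
# path computation, not wormhole flit-level flow control.

def xy_route(src, dst):
    """Return the list of router coordinates visited from src to dst."""
    x, y = src
    dx, dy = dst
    path = [(x, y)]
    while x != dx:                      # resolve the X offset first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then resolve the Y offset
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Example: route between two clusters on a 3x2 mesh.
print(xy_route((0, 0), (2, 1)))
# → [(0, 0), (1, 0), (2, 0), (2, 1)]
```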
ISBN (print): 9781509030774
The use of a wide Single-Instruction-Multiple-Data (SIMD) architecture is a promising approach to building energy-efficient, high-performance embedded processors. In this paper, based on our design framework for low-power SIMD processors, we propose a multiply-accumulate (MAC) unit with a variable number of accumulator registers. The proposed MAC unit exploits the merits of both merged operations and register tiling. A Convolutional Neural Network (CNN) is a popular learning-based algorithm due to its flexibility and high accuracy. However, a CNN-based application is often computationally intensive, as it applies convolution operations extensively on a large data set. In this work, a CNN-based intelligent learning application is analyzed and mapped in the context of SIMD architectures. Experimental results show that the proposed architecture is efficient: in a 64-PE instance, the proposed SIMD processor with MAC4reg achieves an effective performance of 63.2 GOPS. Compared to the two baseline SIMD processors without MAC4reg, the proposed design brings 54.0% and 32.1% reductions in execution time, and 20.5% and 35.1% reductions in energy consumption, respectively.
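The register-tiling idea behind a MAC unit with several accumulator registers can be illustrated with a hedged behavioral sketch: each loaded input is multiplied once and accumulated into one of four independent accumulators, so several partial output sums stay in registers instead of making round trips through memory. The function below models only this dataflow for a 1-D sliding dot product; it is not the paper's MAC4reg hardware, and all names are assumptions.

```python
# Behavioral model of "register tiling" with 4 accumulator registers:
# NUM_ACC adjacent outputs are computed per pass, each partial sum held
# in its own accumulator until the kernel has been fully applied.

NUM_ACC = 4

def conv1d_mac4reg(signal, kernel):
    """1-D sliding dot product computing up to NUM_ACC outputs per pass."""
    out_len = len(signal) - len(kernel) + 1
    outputs = []
    for base in range(0, out_len, NUM_ACC):
        acc = [0] * NUM_ACC                          # the 4 accumulator registers
        for k, w in enumerate(kernel):               # reuse each weight across tiles
            for r in range(min(NUM_ACC, out_len - base)):
                acc[r] += w * signal[base + r + k]   # merged multiply-accumulate
        outputs.extend(acc[:out_len - base])         # write back once per tile
    return outputs

print(conv1d_mac4reg([1, 2, 3, 4, 5, 6], [1, 0, -1]))
# → [-2, -2, -2, -2]
```

The design point being modeled: without the extra accumulators, each partial sum would be written back and reloaded between kernel taps, which is exactly the traffic the tiled version avoids.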
Field programmable gate arrays (FPGAs) are fundamentally different from fixed processor architectures because their memory hierarchies can be tailored to the needs of an algorithm. FPGA compilers for high-level languages...
Image processing algorithms applied on programmable embedded systems very often do not meet the given constraints in terms of real-time capability. Mapping these algorithms to reconfigurable hardware solves this issue, but demands further specific knowledge in hardware development. The design process can be accelerated by code generation through high-level synthesis, which is very flexible. However, the quality of synthesis depends on many factors, such as the provided constraints, the code description, and the algorithmic complexity. Hence, optimizing these parameters may improve the generated results in terms of logic and memory utilization, as well as data throughput and synthesis duration. In this work, we aim to exploit domain-specific knowledge for a hybrid code description in order to benefit from rapid development through high-level synthesis in combination with throughput-optimized generic hardware descriptions. By utilizing code generation techniques, the entire design flow is accelerated. Our synthesis results show a resource utilization and achievable throughput similar to purely HDL-described hardware.
It is almost impossible to maintain the logical correctness of the architectural model of a complex system (e.g., a System of systems). Consequently, system engineers need a rigorous formal methodology to evaluate the...
Real-time systems not only require functional correctness, but also specific timing properties. Correct timing is especially challenging for hard real-time systems such as those in medicine, avionics, and the space industries, ...
The rapidly growing area of ubiquitous applications and location-based services has made indoor localization an interesting topic for research. Some indoor localization solutions for smartphones exploit radio information...
Within the real-time scheduling community, several methods aim at enhancing the performance of the control. Subtask scheduling is one of the convenient embedded methods that reduce the input-output latency in the control...
Implementation times for moderate to large designs targeting FPGAs can be formidable. When FPGA compile times exceed those of a typical software compile, virtual prototyping environments become increasingly attractive. Virtual prototyping environments, however, are limited in their ability to capture and operate on live data, sometimes exhibit behavior mismatches between the modeled and implemented domains, and are often constrained to sub-realtime performance. Rapid design assembly (RDA) is a technology that enables the fast creation of FPGA bitstreams, reducing compile times to those of software compiles. RDA is a rapid prototyping framework that targets real hardware, yet can compile an arbitrary modular design in seconds. RDA is neither a network-on-chip nor a slot-based partial reconfiguration flow, but a free-form modular assembly tool unlike anything presently available. This paper presents a framework that retains all of the benefits of a virtual prototyping environment, yet adds the capability of deploying the prototype into a real-time hardware/software system. The RDA environment targets contemporary Xilinx 7-Series and UltraScale FPGA families.
Fast and efficient design space exploration is a critical requirement for designing computer systems; however, the growing complexity of hardware/software systems and the significantly long run-times of detailed simulators often make it challenging. Machine learning (ML) models have been proposed as popular alternatives that enable fast exploratory studies. The accuracy of any ML model depends heavily on the representativeness of the applications used for training the predictive models. While prior studies have used standard benchmarks or hand-tuned micro-benchmarks to train their predictive models, in this paper we argue that this is often sub-optimal because of their limited coverage of the program state-space and their inability to be representative of the larger suite of real-world applications. In order to overcome challenges in creating representative training sets, we propose Genesys, an automatic workload generation methodology and framework, which builds upon key low-level application characteristics and enables systematic generation of applications covering a broad range of the program behavior state-space without increasing the training time. We demonstrate that the automatically generated training sets improve upon the state-space coverage provided by applications from popular benchmarking suites like SPEC-CPU2006, MiBench, MediaBench, and TPC-H by over 11x, and improve the accuracy of two machine-learning-based power and performance prediction systems by over 2.5x and 3.6x, respectively.
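The core idea of workload generation from low-level characteristics can be sketched as follows: pick target values for a few measurable knobs (here just arithmetic intensity and memory-access stride) and emit a tiny synthetic kernel matching them. This is an illustrative toy, not the Genesys framework's interface; the knob names and generated-code shape are assumptions.

```python
# Toy synthetic-workload generator: materialize a kernel whose ratio of
# arithmetic operations to memory accesses, and whose access stride, are
# chosen explicitly. Sweeping these knobs yields training points spread
# across a program-behavior space instead of relying on fixed benchmarks.

def generate_kernel(name, arith_ops_per_load, stride):
    """Emit Python source for a loop with the requested characteristics."""
    ops = " + ".join(f"x * {i + 1}" for i in range(arith_ops_per_load))
    return (
        f"def {name}(data):\n"
        f"    acc = 0\n"
        f"    for i in range(0, len(data), {stride}):\n"
        f"        x = data[i]\n"      # one load per iteration
        f"        acc += {ops}\n"     # arith_ops_per_load multiplies per load
        f"    return acc\n"
    )

# Materialize one point in the characteristic space and run it.
src = generate_kernel("synth0", 2, 2)
env = {}
exec(src, env)
print(env["synth0"](list(range(8))))
# → 36  (loads 0, 2, 4, 6; each contributes x*1 + x*2 = 3x)
```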