In this paper, we propose a new approach to making inter-core communication and the execution of tasks allocated to different cores of multicore architectures both predictable and optimal. Our approach is based on executing synchronous programs written in the ForeC programming language on deterministic architectures called PREcision Timed (PRET). The originality of the work lies in the time-triggered model of computation and communication, which allows very precise control over thread execution. Synchronization is done via configurable Time Division Multiple Access (TDMA) arbitration, where the optimal size and offset of the time slots are computed to reduce inter-core synchronization costs. We implemented a robotic application and simulated it using MORSE, a robotic simulation environment. Results show that the proposed model guarantees time-predictable inter-core communication and the absence of concurrent accesses (without relying on hardware mechanisms), and allows for optimized execution throughput.
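To make the TDMA idea concrete, the following is a minimal sketch of a round-robin slot schedule, not the arbitration scheme from the paper. The class `TdmaSchedule`, its fields, and the cycle-based time unit are all hypothetical names chosen for illustration; the sketch assumes equal-length slots, one slot per core per round, and a single global offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TdmaSchedule:
    """Hypothetical equal-slot TDMA schedule (illustrative only)."""
    slot_len: int    # length of each core's slot, in cycles
    num_cores: int   # one slot per core in every round
    offset: int = 0  # cycle at which core 0's first slot begins

    @property
    def period(self) -> int:
        # One full round gives every core exactly one slot.
        return self.slot_len * self.num_cores

    def owner(self, t: int) -> int:
        """Return the core that owns the shared resource at cycle t."""
        return ((t - self.offset) % self.period) // self.slot_len

    def wait(self, core: int, t: int) -> int:
        """Cycles the given core must wait at cycle t before its slot is active."""
        phase = (t - self.offset) % self.period
        start = core * self.slot_len
        if start <= phase < start + self.slot_len:
            return 0  # already inside its own slot
        return (start - phase) % self.period


sched = TdmaSchedule(slot_len=4, num_cores=3, offset=2)
worst_case_wait = max(sched.wait(0, t) for t in range(2 * sched.period))
print(sched.owner(6), sched.wait(0, 6), worst_case_wait)
```

Because slot ownership depends only on the current cycle, the worst-case wait is bounded by the period minus one slot length, which is the kind of bound that choosing slot sizes and offsets would aim to tighten.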
The ACM/IEEE International Symposium on Computer Architecture (ISCA) is one of the premier forums for presenting, debating, and advancing new ideas and experimental results in computer architecture. Accord...
ISBN (print): 9781538655641
Today, the demand for high-performance computing at low power cost can be met by executing an application in parallel on a large number of programmable cores. Emerging many-core architectures provide dense interconnection fabrics, leading to new communication requirements. In particular, the effective exploitation of synchronous and asynchronous channels for fast communication between internal cores and external devices is a key issue for these architectures. In this paper, we propose a methodology for clustering the sequential commands used to configure the parallel execution of tasks on a globally asynchronous locally synchronous multi-chip many-core neuromorphic platform. To reduce communication costs and maximise the use of the available communication bandwidth, we adapted the Multiple Sequence Alignment (MSA) algorithm to cluster the unicast streams of packets used to configure each core, so as to generate a coherent multicast stream that configures all cores at once. In preliminary experiments, we demonstrate that the proposed method can lead to a reduction of up to 97% in packet transmission, thus lowering the overall communication cost.
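The intuition behind merging unicast configuration streams can be sketched with a much simpler stand-in for MSA: a pairwise longest-common-subsequence (LCS) fold. This is not the adapted MSA algorithm from the paper. The functions `lcs` and `multicast_plan` and the single-letter commands are hypothetical, and the sketch ignores how the leftover unicast packets would be interleaved with the shared stream.

```python
def lcs(a, b):
    """Longest common subsequence of two command lists (dynamic programming)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def multicast_plan(streams):
    """Fold LCS over all streams to get a command sequence common to every core.

    Returns (shared, packets_before, packets_after). The fold yields a common
    subsequence, not necessarily the longest one across all streams.
    """
    shared = list(streams[0])
    for s in streams[1:]:
        shared = lcs(shared, s)
    before = sum(len(s) for s in streams)              # all-unicast packet count
    leftover = sum(len(s) - len(shared) for s in streams)
    after = len(shared) + leftover                     # one multicast copy + leftovers
    return shared, before, after


# Hypothetical per-core configuration streams.
streams = [list("ABCD"), list("AXCD"), list("ABCYD")]
shared, before, after = multicast_plan(streams)
print(shared, before, after)
```

In this toy input, the three cores share the commands `A`, `C`, `D`, so sending them once as multicast cuts the packet count from 13 to 7. The more the per-core streams overlap, the larger the saving.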
In this paper, we propose an approach for designing application-specific heterogeneous systems based on performance models that combine accelerator and processor core models. An application-specific program is profiled through its dynamic execution trace, which is used to construct a data flow model of the accelerator. Modeling of the processor is partitioned into an instruction set architecture (ISA) execution model and a micro-architecture-specific timing model. These models are implemented on FPGAs to take advantage of their parallelism and to speed up simulation as architecture complexity increases. This approach aims to ease the design of multi-core, multi-accelerator architectures and, by automating the design steps, helps explore the design space. A case study confirms that the presented design flow can model the accelerator starting from an algorithm and validate its integration in a simulation framework, allowing precise performance estimates. We also assess the performance of our RISC-V single-core and RISC-V-based heterogeneous architecture models.
ISBN (print): 9781450364942
The proceedings contain 33 papers. The topics discussed include: a performance evaluation of multi-FPGA architectures for computations of information transfer; massively parallel computation of linear recurrence equations with graphics processing units; a first-order approximation of microarchitecture energy-efficiency; delays and states in dataflow models of computation; communication-aware scheduling algorithms for synchronous dataflow graphs on multicore systems; towards power management verification of time-triggered systems using virtual platforms; architectural considerations for FPGA acceleration of machine learning applications in MapReduce; and fast parallel simulation of a manycore architecture with a flit-level on-chip network model.
ISBN (print): 9781728140704
A large number of applications are associated with different types of embedded systems, in which sensors play the key role in creating a particular view of the environment the system operates in. Embedded systems are often characterized as soft or hard real-time systems with high safety requirements, which impose strict requirements on the timing behavior and accuracy of sensors in order to ensure the determinism and dependability of the system. At an early stage of system design, analysis, or optimization, satisfaction of these requirements can be checked with models of sensors. The authors aim to investigate the timing performance and accuracy achieved during simulation of the same sensor model implemented in two different ways: as a software artifact and as a field-programmable gate array (FPGA) solution. This article constitutes a part of the research activities defined in [1].
ISBN (print): 9781538695623
Deep learning is rapidly becoming a strong boost to the already pervasive field of computer vision. State-of-the-art Convolutional Neural Networks reach accuracies comparable to human senses. However, their high computational load and low energy efficiency make them hard to implement on modern embedded systems. In this paper, several strategies for designing fast convolutional engines suitable for hardware acceleration of Convolutional Neural Networks are evaluated. When implemented within a complete embedded system based on a Zynq Ultrascale+ SoC device, two of the proposed architectures achieve a peak performance of 131.6 GMAC/s at a running frequency of 234 MHz, while occupying at most ~13% of the DSP slices available on chip. All the proposed engines outperform state-of-the-art competitors, exhibiting a performance/DSP utilization ratio up to 29.6 times higher.
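As a quick sanity check on the reported figures, the peak throughput and clock frequency together imply a number of multiply-accumulate operations per clock cycle. The short calculation below uses only the two values quoted in the abstract; the variable names are illustrative, and the result assumes the peak rate is sustained on every cycle.

```python
# Values quoted in the abstract.
peak_gmac_per_s = 131.6   # peak performance, GMAC/s
clock_mhz = 234.0         # running frequency, MHz

# MACs completed per clock cycle at peak = (MAC/s) / (cycles/s).
macs_per_cycle = (peak_gmac_per_s * 1e9) / (clock_mhz * 1e6)
print(round(macs_per_cycle, 1))
```

This works out to roughly 562 MACs per cycle at peak.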
ISBN (print): 9781538655641
In this paper, we evaluate a partitioning and placement technique for mapping concurrent applications onto a globally asynchronous locally synchronous (GALS) multi-core architecture designed for simulating a spiking neural network (SNN) in real time. We designed a task placement pipeline capable of analysing the network of neurons and producing a placement configuration that reduces communication between computational nodes. The neuron-to-core mapping problem has been formalised as a two-phase problem: Partitioning and Placement. The Partitioning phase aims at grouping together the most connected network components, maximising the number of self-connections within each identified group. For this purpose, we used a multilevel k-way graph partitioning strategy capable of generating network partitions. The Placement phase aims at placing groups of neurons over the chip mesh so as to minimise the communication between computational nodes. To implement this step, we designed three placement variants and evaluated their performance. In the results, we point out the importance of using a partitioning algorithm for the SNN graph. We were able to achieve an increase in self-connections of 19% and an improvement of the final overall post-placement synaptic elongation of 29% using the simulated annealing placement technique, compared to 22% obtained without partitioning.
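The Placement phase can be illustrated with a generic simulated-annealing sketch; it is not the authors' implementation. It assumes a 2D mesh, a hypothetical `traffic` dictionary giving the communication weight between pairs of neuron groups, Manhattan distance as a stand-in for synaptic elongation, and a linear cooling schedule. All function and parameter names are invented for the example.

```python
import math
import random


def placement_cost(pos, traffic):
    """Sum of traffic weight times Manhattan distance between placed groups."""
    return sum(w * (abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1]))
               for (a, b), w in traffic.items())


def anneal_placement(n_groups, width, height, traffic,
                     steps=2000, t0=5.0, seed=0):
    """Place n_groups on a width x height mesh; return (best_pos, best, initial)."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(width) for y in range(height)]
    if n_groups > len(cells):
        raise ValueError("more groups than mesh cells")
    rng.shuffle(cells)
    slots = cells                      # slots[i] for i < n_groups is group i's cell
    cost = placement_cost(slots, traffic)
    initial = cost
    best_pos, best_cost = slots[:n_groups], cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9   # linear cooling, never zero
        g = rng.randrange(n_groups)
        k = rng.randrange(len(slots))             # another group or a free cell
        slots[g], slots[k] = slots[k], slots[g]   # propose a swap or move
        new_cost = placement_cost(slots, traffic)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                       # accept the proposal
            if cost < best_cost:
                best_pos, best_cost = slots[:n_groups], cost
        else:
            slots[g], slots[k] = slots[k], slots[g]   # reject: undo the swap
    return best_pos, best_cost, initial


# Hypothetical example: six groups communicating in a ring, on a 3x3 mesh.
ring = {(i, (i + 1) % 6): 1 for i in range(6)}
best_pos, best_cost, initial_cost = anneal_placement(6, 3, 3, ring)
print(initial_cost, best_cost)
```

Tracking the best placement seen, rather than returning the final state, guarantees the result is never worse than the random starting placement. A partitioning step beforehand reduces the number of groups and the traffic between them, which is what makes this search tractable.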
ISBN (print): 9781538649756
In this paper we investigate the relation between the energy efficiency model and the type of workload executed on modern embedded architectures. From the energy efficiency model obtained in our previous work, we select a few configuration points to verify that the prediction, in terms of relative energy efficiency, holds across different workload scenarios. A configuration point is defined as a set of tunable platform metrics, such as DVFS point, DPM level, and utilization rate. As workloads, we use a combination of synthetic generators and real-world applications from the embedded domain. In our experiments we use two different architectures, which provide examples of real systems, to test the generality of the model. First, we compare the efficiency obtained by the two architecturally different chips (ARM and Intel) at different configuration points and in different workload scenarios. Second, we try to explain the different results through the thermal management performed by the two chips. Finally, we show that the results from the two architectures converge only for workloads composed largely of integer instructions, and we show the need for a specific model trained with integer operations.