Uniform memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network *** , with neural network architectures going deeper and wider, the limited memory capacity has become a constraint on deploying models on UNNA *** how to efficiently manage memory space and how to reduce workload footprints are urgently *** this paper, we propose Tetris: a heuristic static memory management framework for UNNA *** reconstructs execution flows and synchronization relationships among cores to analyze each tensor's liveness *** the memory management problem is converted to a sequence permutation *** uses a genetic algorithm to explore the permutation space to optimize the memory management strategy and reduce memory *** evaluate several typical neural networks and the experimental results demonstrate that Tetris outperforms the state-of-the-art memory allocation methods, and achieves an average memory reduction ratio of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
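As a rough illustration of the idea in the abstract above (casting static memory allocation as a search over tensor orderings), the following Python sketch places tensors by first-fit in a given order and uses a small genetic search over permutations to reduce the peak footprint. This is not the Tetris implementation: the tensor tuples `(size, first_use, last_use)`, the first-fit placement rule, the order crossover, the mutation rate, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical workload: (size, first_use, last_use) per tensor.
TENSORS = [(4, 0, 2), (2, 1, 3), (4, 3, 5), (1, 0, 5), (3, 4, 6)]


def place(order, tensors):
    """Place tensors in the given order by first-fit; return peak memory."""
    placed = []  # (offset, size, first_use, last_use)
    peak = 0
    for i in order:
        size, start, end = tensors[i]
        # Address ranges held by tensors whose lifetimes overlap this one.
        busy = sorted((o, o + s) for o, s, b, e in placed
                      if b <= end and start <= e)
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:  # the gap before this range is big enough
                break
            offset = max(offset, hi)
        placed.append((offset, size, start, end))
        peak = max(peak, offset + size)
    return peak


def order_crossover(a, b, rng):
    """Keep a slice of parent a in place; fill the rest in parent b's order."""
    n = len(a)
    i, j = sorted(rng.sample(range(n + 1), 2))
    middle = a[i:j]
    taken = set(middle)
    rest = [g for g in b if g not in taken]
    return rest[:i] + middle + rest[i:]


def search(tensors, pop_size=30, generations=60, seed=0):
    """Tiny genetic search over placement orders (assumed parameters)."""
    rng = random.Random(seed)
    n = len(tensors)
    pop = [rng.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: place(p, tensors))
        elite = pop[: pop_size // 2]  # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = order_crossover(a, b, rng)
            if rng.random() < 0.3:  # swap mutation
                x, y = rng.sample(range(n), 2)
                child[x], child[y] = child[y], child[x]
            children.append(child)
        pop = elite + children
    best = min(pop, key=lambda p: place(p, tensors))
    return best, place(best, tensors)


def live_lower_bound(tensors):
    """Largest total size of simultaneously live tensors."""
    times = {t for _, s, e in tensors for t in (s, e)}
    return max(sum(sz for sz, s, e in tensors if s <= t <= e) for t in times)


if __name__ == "__main__":
    best_order, peak = search(TENSORS)
    print("order:", best_order, "peak:", peak,
          "lower bound:", live_lower_bound(TENSORS))
```

The peak returned by `search` can never be smaller than `live_lower_bound`, which gives a quick sanity check on any ordering the search produces.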
With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architecture has a native advantage in achieving high power efficiency for scientific applications. However, state-of-the-art dataflow architectures fail to exploit high parallelism in loop processing. To address this issue, we propose a pipelining loop optimization method (PLO), which makes loop iterations flow through the processing element (PE) array of a dataflow accelerator. This method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller is designed to generate loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to solve the iteration dependency problem. With these two techniques, the average number of instructions ready to execute per cycle is increased to keep the floating-point units busy. Simulation results show that our proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of these two techniques is acceptable.
In recent years, the advent of emerging computing applications, such as cloud computing, artificial intelligence, and the Internet of Things, has led to three common requirements in computer system design: high utilization, high throughput, and low latency. Herein, these are referred to as the requirements of 'high-throughput computing (HTC)'. We further propose a new indicator called 'sysentropy' for measuring the degree of chaos and uncertainty within a computer system. We argue that, unlike the designs of traditional computing systems that pursue high performance and low power consumption, HTC should aim at achieving low sysentropy. However, from the perspective of computer architecture, HTC faces two major challenges: (1) fully exploiting the application's data parallelism and execution concurrency to achieve high throughput, and (2) achieving low latency even when severe contention occurs in data paths with high utilization. To overcome these two challenges, we introduce two techniques: on-chip data flow architecture and labeled von Neumann architecture. We build two prototypes that can achieve high throughput and low latency, thereby significantly reducing sysentropy.
Real-time transformation was important for the practical implementation of impedance flow *** major obstacle was the time-consuming step of translating raw data to cellular intrinsic electrical properties (e.g., specific membrane capacitance C_sm and cytoplasm conductivity σ_cyto). Although optimization strategies such as neural network-aided strategies were recently reported to provide an impressive boost to the translation process, simultaneously achieving high speed, accuracy, and generalization capability is still *** this end, we proposed a fast parallel physical fitting solver that could characterize single cells' C_sm and σ_cyto within 0.62 ms/cell without any data preacquisition or pretraining *** achieved the 27000-fold acceleration without loss of accuracy compared with the traditional *** on the solver, we implemented physics-informed real-time impedance flow cytometry (piRT-IFC), which was able to characterize up to 100,902 cells' C_sm and σ_cyto within 50 min in a real-time *** to the fully connected neural network (FCNN) predictor, the proposed real-time solver showed comparable processing speed but higher *** , we used a neutrophil degranulation cell model to represent tasks to test unfamiliar samples without data for *** being treated with cytochalasin B and N-Formyl-Met-Leu-Phe, HL-60 cells underwent dynamic degranulation processes, and we characterized cell's C_sm and σ_cyto using *** to the results from our solver, accuracy loss was observed in the results predicted by the FCNN, revealing the advantages of high speed, accuracy, and generalizability of the proposed piRT-IFC.
In modern multi-core chip architectures, the DRAM system is shared by more and more cores and high-bandwidth I/O devices. This trend makes the problems of request contention and unfairness more serious. Previous r...
Efficient resource utilization requires that emerging datacenter interconnects support both high-performance communication and efficient remote resource sharing. These goals require that the network be more tightly coupled with the CPU chips. Designing a new interconnection technology thus requires considering not only the interconnection itself, but also the design of the processors that will rely on it. In this paper, we study memory hierarchy implications for the design of high-speed datacenter interconnects, particularly as they affect remote memory access, and we use PCIe as the vehicle for our investigations. To that end, we build three complementary platforms: a PCIe-interconnected prototype server with which we measure and analyze current bottlenecks; a software simulator that lets us model microarchitectural and cache hierarchy changes; and an FPGA prototype system with a streamlined, switchless, customized protocol, Thunder, with which we study hardware optimizations outside the processor. We highlight several architectural modifications to better support remote memory access and communication, and quantify their impact and limitations.
Recently, selective encoding of scan slices was proposed to compress test data. This encoding technique, unlike many other compression techniques that encode all the bits, only encodes the target-symbol by specifying sing...
In the field of high-performance computing, some application scenarios make extensive use of bit manipulation. The RISC-V Foundation issued the B extension to reduce the number of instructions during static compilation. B...
A critical concern for post-silicon debug is the need to control the chip at the clock-cycle level. In a single-clock chip, run-stop control can be implemented by gating the clock signal using a stop signal. However, data...
Currently, large-scale vision and language models have significantly improved the performance of cross-modal retrieval tasks. However, large-scale models require a substantial amount of computing resources, so the exe...