ISBN (print): 9781581132441
Recently, a number of heuristic-based system-level synthesis algorithms have been proposed. Though these algorithms quickly generate good solutions, how close those solutions are to optimal is a question that is difficult to answer. Current exact techniques produce optimal results but fail to produce them in reasonable time. This paper presents a synthesis algorithm that produces solutions of guaranteed quality (optimal in most cases, or within a known bound) with practical synthesis times (a few seconds to minutes). It takes a unified look (the lack of which is one of the main sources of sub-optimality in the heuristic techniques) at different aspects of system synthesis such as pipelining, selection, allocation, scheduling, and FPGA reconfiguration. Our technique can handle both time-constrained and resource-constrained synthesis problems. We present results of our algorithm implemented as part of the Match project at Northwestern University.
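To make the contrast with heuristics concrete, the sketch below poses a toy resource-constrained scheduling problem as a time-indexed ILP, the style of exact formulation that yields optimal or provably bounded schedules. This is an illustration, not the paper's actual model: the tasks, durations, horizon, single resource type, and the use of the PuLP solver are all assumptions.

```python
# A sketch of exact, time-indexed resource-constrained scheduling with PuLP.
# x[t, s] == 1 iff task t starts at cycle s; minimize the schedule makespan.
import pulp

dur = {"a": 2, "b": 3, "c": 2}     # task -> latency in cycles (assumed)
preds = {"b": ["a"], "c": ["a"]}   # precedence edges (assumed)
H, R = 10, 1                       # scheduling horizon and resource count

prob = pulp.LpProblem("rc_scheduling", pulp.LpMinimize)
x = {(t, s): pulp.LpVariable(f"x_{t}_{s}", cat="Binary")
     for t in dur for s in range(H - dur[t] + 1)}
makespan = pulp.LpVariable("makespan", 0, H)
prob += makespan                   # objective: minimize makespan

def start(t):                      # linear expression for t's start time
    return pulp.lpSum(s * x[t, s] for s in range(H - dur[t] + 1))

for t in dur:                      # each task starts exactly once ...
    prob += pulp.lpSum(x[t, s] for s in range(H - dur[t] + 1)) == 1
    prob += makespan >= start(t) + dur[t]   # ... and finishes by makespan
for t, ps in preds.items():        # respect data dependences
    for p in ps:
        prob += start(t) >= start(p) + dur[p]
for u in range(H):                 # at most R tasks active in any cycle
    active = [x[t, s] for t in dur
              for s in range(max(0, u - dur[t] + 1), min(u, H - dur[t]) + 1)]
    prob += pulp.lpSum(active) <= R

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], {t: int(start(t).value()) for t in dur})
```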
ISBN (print): 9781450318679
The challenging aspect of building neuromorphic circuits in mature CMOS technology to match brain-like architectures is two-fold: scalability and connectivity. Scalability means that the circuits have to be expandable to match biological brains in terms of synaptic and neuronal densities. The challenge here is to implement 10^6 neurons and 10^10 synapses, with an average fanout of 10^4, in a square centimeter of CMOS. Connectivity means that the circuit has to offer both short- and long-range (by physical distance) connections between neurons. A large part of this challenge is how to implement a connectivity of 10^4 synapses per neuron. Unfortunately, even the exponential transistor density growth being experienced today is not sufficient to realize such massive connectivity and synaptic densities in a traditional CMOS process. Recent approaches to these challenges have integrated CMOS with nanotechnology in order to achieve the required synaptic densities. These solutions predominantly use crossbar architectures, but the connectivity challenge remains a daunting task for them. To meet these challenges, a novel synaptic time-multiplexing (STM) concept was developed along with a neural fabric design. This combination has the advantage of offering greater flexibility and long-range connectivity. It also provides a method to overcome the limitations of conventional CMOS technology in matching the synaptic density and connectivity requirements found in mammalian brains while maintaining nonlinear synapses and learning. In order to program neuromorphic hardware for any desired brain architecture, the topology would first have to be converted into a connectivity matrix or a graph representation. This matrix, along with statistics on the number of neurons and synapses, is provided as input to a neuromorphic compiler. The neuromorphic compiler compiles the neural network structure description into: 1) an assignment of the network'
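As a concrete illustration of the compiler's described input stage, the sketch below converts a randomly generated topology into a Boolean connectivity matrix plus the neuron/synapse statistics the abstract says are fed to the neuromorphic compiler. The network size and fanout are scaled-down assumptions, far below the 10^6-neuron, 10^4-fanout targets.

```python
# A sketch of building the compiler's input: a connectivity matrix + stats.
import numpy as np

rng = np.random.default_rng(0)
n_neurons = 1000        # far below the 10^6 target, for illustration
fanout = 100            # stand-in for the 10^4 biological fanout

# Boolean connectivity matrix: conn[i, j] is True if neuron i synapses onto j.
# Self-connections may occur here; acceptable for this illustration.
conn = np.zeros((n_neurons, n_neurons), dtype=bool)
for i in range(n_neurons):
    targets = rng.choice(n_neurons, size=fanout, replace=False)
    conn[i, targets] = True

stats = {
    "neurons": n_neurons,
    "synapses": int(conn.sum()),
    "mean_fanout": float(conn.sum(axis=1).mean()),
}
# 'conn' and 'stats' together form the neuromorphic compiler's input.
print(stats)
```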
ISBN (digital): 9798350350579
ISBN (print): 9798350350586
Deep Neural Networks (DNNs) have achieved tremendous success in the past few years. However, their training and inference demand exceptional computational and memory resources. Quantization has been shown to be an effective approach to mitigating this cost, with the mainstream data types reduced from FP32 to FP16/BF16 and, recently, FP4 in the latest NVIDIA B100 GPUs. With increasingly aggressive quantization, however, conventional floating-point formats suffer from limited precision in representing numbers around zero. Recently, NVIDIA demonstrated the potential of using a Logarithmic Number System (LNS) for the next generation of tensor cores. While LNS mitigates the hurdles in representing small numbers, in this work we observed a mismatch between LNS and emerging Large Language Models (LLMs), which exhibit significant outliers when the LNS format is adopted directly. In this paper, we present a data-format/architecture co-design to bridge this gap. On the format side, we propose a dynamic LNS format that flexibly represents outliers at higher precision, by exploiting asymmetry in the LNS representation and identifying outliers on a per-block basis. On the architecture side, for demonstration, we realize the dynamic LNS format in a systolic array, which can handle the irregularity of the outliers at runtime. We implement our approach on an Alveo U280 FPGA as a prototype. Experimental results show that our design can effectively handle the outliers and resolve the mismatch between LNS and LLMs, contributing to accuracy improvements of 15.4% and 16% over the floating-point and original LNS baselines, and up to 15.3% over state-of-the-art quantization methods across four LLM models. Our observation and design lay a solid foundation for the large-scale adoption of the LNS format in the next generation of deep learning hardware.
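The sketch below illustrates the per-block outlier idea under simplified assumptions: every value is quantized to a low-precision log2 code, except the few largest magnitudes in each block, which keep a finer log grid. The bit widths, block size, and largest-k outlier rule are illustrative guesses, not the paper's actual format parameters.

```python
# A sketch of a dynamic-LNS-style quantizer with per-block outlier handling.
import numpy as np

def lns_quantize(x, frac_bits):
    """Round log2|x| to a fixed-point grid with 2^-frac_bits resolution."""
    sign = np.sign(x)
    mag = np.abs(x) + 1e-30                      # avoid log2(0)
    step = 2.0 ** -frac_bits
    logq = np.round(np.log2(mag) / step) * step
    return sign * 2.0 ** logq

def dynamic_lns(x, block=16, lo_bits=2, hi_bits=6, n_outliers=2):
    """Per block: the n_outliers largest magnitudes get hi_bits of log
    fraction, the rest get lo_bits -- the 'dynamic' part of the format."""
    out = np.empty_like(x)
    for s in range(0, len(x), block):
        blk = x[s:s + block]
        order = np.argsort(np.abs(blk))
        inliers, outliers = order[:-n_outliers], order[-n_outliers:]
        q = np.empty_like(blk)
        q[inliers] = lns_quantize(blk[inliers], lo_bits)
        q[outliers] = lns_quantize(blk[outliers], hi_bits)
        out[s:s + block] = q
    return out

vals = np.random.default_rng(1).normal(size=64)
vals[7] *= 50.0                                  # inject an LLM-style outlier
err = np.abs(dynamic_lns(vals) - vals).max()
print(f"max abs error with per-block outlier handling: {err:.4f}")
```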
ISBN (digital): 9798350350579
ISBN (print): 9798350350586
As the scaling of memory density slows physically, a promising solution is to scale memory logically by enhancing the CPU's memory controller to encode and store data more densely in memory. This is known as hardware memory compression. Hardware memory compression decouples OS-managed physical memory from actual memory (i.e., DRAM); the memory controller spends a dynamically varying amount of DRAM on each physical page, depending on the compressibility of the page's content. The newly decoupled actual memory effectively forms a new layer of memory beyond the traditional layers of virtual, pseudo-physical, and physical memory. We note that, unlike these traditional memory layers, each with its own specialized allocation interface (e.g., malloc/mmap for virtual memory, page tables+MMU for physical memory), this new layer of memory introduced by hardware memory compression still awaits its own allocation interface; its absence makes the allocation of actual memory imprecise and, sometimes, even impossible. Imprecisely allocating less actual memory, and/or being unable to allocate more, can harm performance. Even imprecisely allocating more actual memory to some jobs can be harmful, as it can result in allocating less actual memory to other jobs in highly occupied memory systems, where compression is useful. To restore precise memory allocation, we design a new memory allocation interface specialized for this new layer of memory and, subsequently, architect a new MMU-like component in the memory controller, tackling the corresponding design challenges. We create a full-system FPGA prototype of a hardware-compressed memory system with precise memory allocation. Our evaluations using the prototype show that jobs perform stably under colocation. The performance variation is only 1%-2%; in comparison, it is 19%-89% under the prior art.
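The toy model below illustrates the decoupling the abstract describes and why a dedicated allocation interface makes actual-memory allocation precise: each stored page is charged its content-dependent compressed footprint against a per-job budget. The page/block sizes, budget policy, and class names are illustrative assumptions, not the paper's design.

```python
# A sketch of precise actual-memory allocation under hardware compression.
PAGE = 4096
BLOCK = 64                                    # DRAM allocation granularity

class ActualMemoryAllocator:
    def __init__(self, dram_bytes):
        self.free = dram_bytes
        self.budget = {}                      # job -> remaining actual bytes

    def set_budget(self, job, actual_bytes):  # the new allocation interface
        self.budget[job] = actual_bytes

    def store_page(self, job, compressed_size):
        # Charge the compressed footprint, rounded up to DRAM blocks,
        # against both the job's budget and the free actual memory.
        cost = -(-min(compressed_size, PAGE) // BLOCK) * BLOCK
        if self.budget.get(job, 0) < cost or self.free < cost:
            return False                      # precise: deny, never overrun
        self.budget[job] -= cost
        self.free -= cost
        return True

alloc = ActualMemoryAllocator(dram_bytes=1 << 20)
alloc.set_budget("job_a", 256 * 1024)
print(alloc.store_page("job_a", compressed_size=1200))   # compressible page
print(alloc.store_page("job_a", compressed_size=4096))   # incompressible page
```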
ISBN (digital): 9798350350128
ISBN (print): 9798350350135
Congestion Control (CC) plays a vital role in deploying lossless datacenter networks based on Remote Direct Memory Access (RDMA). A high-performance CC scheme should provide low-latency and precise feedback on congestion events. However, no existing CC scheme achieves both features simultaneously. In this paper, we propose LHCC, a Low-latency and Hi-precision Congestion Control scheme for RDMA datacenter networks. LHCC uses out-of-band signaling to report network status, so a packet sender can detect congestion events within an RTT. In addition, LHCC adjusts the packet sending rate by taking into consideration all queues along the entire path that a packet has traversed. Accordingly, it provides more precise congestion control than existing schemes, especially when there are multiple bottlenecks in the network. We build an LHCC prototype on a real testbed with NVIDIA Bluefield-3 NICs and AGM39D FPGAs. Both testbed experiments and extensive simulations show that LHCC can reduce the Flow Completion Time (FCT) slowdown and the buffer usage (i.e., the queue lengths) by up to 62.5% and 58%, respectively, compared with the state-of-the-art high-precision CC scheme, HPCC.
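As a rough illustration of the multi-bottleneck point, the sketch below scales the sending rate by the worst utilization seen across every queue on the path, rather than reacting to a single congestion signal. The utilization model, gains, and parameters are assumptions in the spirit of HPCC-style fine-grained CC, not LHCC's actual algorithm.

```python
# A sketch of path-wide, multi-bottleneck rate adjustment.
def adjust_rate(rate_gbps, hop_signals, target_util=0.95, max_rate_gbps=100.0):
    """hop_signals: per-hop (queue_bytes, tx_bytes_per_rtt, capacity_per_rtt)
    collected out-of-band for every queue the packet traversed."""
    worst = 0.0
    for qlen, txed, cap in hop_signals:
        # Effective utilization counts both standing queue and link load.
        worst = max(worst, (qlen + txed) / cap)
    if worst > target_util:
        return max(rate_gbps * target_util / worst, 0.1)   # multiplicative decrease
    return min(rate_gbps + 1.0, max_rate_gbps)             # additive increase

# Two bottlenecks on the path: the sender reacts to the more congested one.
path = [(40_000, 110_000, 125_000), (5_000, 60_000, 125_000)]
print(adjust_rate(50.0, path))
```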