ISBN (Digital): 9798350309270
ISBN (Print): 9798350309287
The utilization of hardware-designed approximate computing in Convolutional Neural Networks (CNNs) offers notable advantages, including accelerated performance, enhanced power efficiency, and a compact design footprint. Systolic Array (SA) architectures, optimized for matrix multiplication and convolution operations, have been extensively studied in the context of stand-alone image processing applications. However, their potential for CNN workloads has not been thoroughly assessed. SAs consist of an array of Processing Elements (PEs) structured to perform product operations and accumulations. Incorporating inexact computing units in the SA introduces deviations from precise results, posing a challenge for sustaining hardware accelerator designs in CNN workloads. This paper presents a strategy for the optimal placement of both positive and negative error-distributed multipliers as PE elements to create an error-diluted SA structure. The proposed strategy for structuring the SA is evaluated on the Prewitt filter and three other filters extracted from the first layer of ***. The paper also introduces an optimization framework for selecting the most suitable PEs from a pool of positive and negative error-distributed multipliers, aiming to strike a balance between hardware efficiency and image quality metrics. Furthermore, the framework and hardware design files are made available to the designer and researcher community for further use.
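For intuition, here is a minimal C++ sketch of the error-dilution idea (all names and error models are hypothetical, not taken from the paper): multipliers whose error distributions lean positive and negative alternate along a PE row, so individual deviations tend to cancel inside the accumulation.

```cpp
#include <cstdint>
#include <vector>
#include <cstdio>

// Hypothetical behavioural models of two approximate multipliers: one biased
// toward positive error, one toward negative error. A real design would plug
// in models of the chosen inexact multiplier circuits.
static int32_t mul_pos_err(int16_t a, int16_t b) { return int32_t(a) * b + 1; }
static int32_t mul_neg_err(int16_t a, int16_t b) { return int32_t(a) * b - 1; }

// One output pixel of a convolution mapped onto a row of PEs. Alternating
// positive- and negative-error multipliers "dilutes" the error: adjacent
// deviations tend to cancel in the running accumulation.
int32_t error_diluted_dot(const std::vector<int16_t>& w,
                          const std::vector<int16_t>& x) {
    int32_t acc = 0;
    for (size_t pe = 0; pe < w.size(); ++pe) {
        acc += (pe % 2 == 0) ? mul_pos_err(w[pe], x[pe])
                             : mul_neg_err(w[pe], x[pe]);
    }
    return acc;
}

int main() {
    // 3x3 Prewitt horizontal kernel flattened onto a PE row.
    std::vector<int16_t> w = {-1, 0, 1, -1, 0, 1, -1, 0, 1};
    std::vector<int16_t> x = {10, 20, 30, 12, 22, 32, 14, 24, 34};
    std::printf("approximate result = %d\n", error_diluted_dot(w, x));
}
```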
ISBN (Print): 9798400716447
Tracing garbage collectors are widely deployed in modern programming languages. But tracing an arbitrary heap shape incurs poor locality and may hinder scalability. In this paper, we explore an avenue for mitigating these inefficiencies at the expense of conservative, less accurate identification of live objects. We do this by proposing and studying an alternative to the Mark-Sweep tracing algorithm, called Linear-Mark. It turns out that although Linear-Mark improves locality and scalability, the accuracy of Mark-Sweep outweighs the achieved enhancements. We present the Linear-Mark garbage-collecting algorithm and provide an evaluation that highlights the trade-offs between the Linear-Mark and the Mark-Sweep approaches. Our hope is that this research will inspire further algorithmic improvements, ultimately leading to better garbage collection algorithms.
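As a point of reference, below is a minimal C++ sketch of the classic Mark-Sweep baseline (illustrative only; it is not the paper's Linear-Mark implementation). The comments mark where the locality problem arises and which phase already behaves like a cache-friendly linear scan.

```cpp
#include <vector>
#include <stack>

// Toy heap object for illustration only.
struct Obj {
    bool marked = false;
    std::vector<Obj*> refs;  // outgoing pointers
};

// Mark phase: a depth-first trace over an arbitrary heap graph. The pointer
// chasing here is exactly what causes poor locality -- consecutive pops may
// touch objects that are far apart in memory.
void mark(const std::vector<Obj*>& roots) {
    std::stack<Obj*> work;
    for (Obj* r : roots) work.push(r);
    while (!work.empty()) {
        Obj* o = work.top(); work.pop();
        if (o == nullptr || o->marked) continue;
        o->marked = true;
        for (Obj* child : o->refs) work.push(child);
    }
}

// Sweep phase: a linear pass over the allocation list reclaiming unmarked
// objects. A Linear-Mark-style collector trades marking accuracy for making
// more of the collection resemble this sequential, cache-friendly scan.
void sweep(std::vector<Obj*>& heap) {
    for (Obj*& o : heap) {
        if (o->marked) { o->marked = false; }
        else           { delete o; o = nullptr; }
    }
}
```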
Summary form only given. The vast majority of computer architects believe the future of the microprocessor is hundreds to thousands of processors ("cores") on a chip. Given such widespread agreement, it is surprising how much research remains to be done in algorithms, computer architecture, networks, operating systems, file systems, compilers, programming languages, applications, and so on to realize this vision. Fortunately, Moore's law has not only enabled dense multi-core chips, it has also enabled extremely dense FPGAs. Today, one to two dozen soft cores can be programmed into a single FPGA. With multiple FPGAs on a board and multiple boards in a system, 1000-processor designs can be economically and rapidly explored. To make this happen, however, requires a significant amount of infrastructure in hardware, software, and what we call "gateware", the register-transfer level models that fill the FPGAs. By using the Berkeley Emulation Engine boards that were created for other purposes, the hardware is already done. A group of architects plan to design the gateware, create this infrastructure, and share the results in an open-source fashion so that every institution can have its own. Such a system would not just invigorate multiprocessor research in the architecture community. Since processor cores can run at 100 to 200 MHz, a large-scale multiprocessor would be fast enough to run operating systems and large programs at speeds sufficient to support software research. Moreover, there is a new generation of FPGAs every 18 months with the capacity for twice as many cores running even faster, so future multiboard FPGA systems are even more attractive. Hence, we believe such a system would accelerate research across all the fields that touch multiple processors; thus the acronym RAMP, for Research Accelerator for Multiple Processors. RAMP has the potential to transform the parallel computing community in computer science from a simulation-driven to a prototype-driven discipline.
Virtual machine (VM) migration is a widely used technique in cloud computing systems to increase reliability. There are also many other reasons a VM may be migrated during its lifetime, such as reducing energy consumption, improving performance, and maintenance. During a live VM migration, the underlying VM continues running until all or part of its data has been transmitted from source to destination. The remaining data are transmitted in an off-line manner by suspending the corresponding VM. The longer the off-line transmission time, the worse the performance of the respective VM, because the VM service is down during the off-line data transmission. Because a running VM's memory is subject to changes, already transmitted data pages may get dirtied and thus require re-transmission. Deciding when to suspend the VM is therefore not a trivial task: suspending the VM too early may force a significant amount of data to be transmitted off-line, degrading the VM's performance, while waiting too long to suspend it may cause a huge amount of dirty data to be re-transmitted, wasting resources. In this paper, we tackle the joint problem of minimizing both the total VM migration time (reflecting the resources spent during a migration) and the VM downtime (reflecting the performance degradation). These objective functions are weighted according to the needs of the underlying cloud provider/user. To tackle the problem, we propose an online deterministic algorithm with a strong competitive ratio, as well as a randomized online algorithm that achieves significantly better results than the deterministic algorithm.
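The trade-off can be illustrated with a toy pre-copy simulation, sketched below in C++ under hypothetical parameters (bandwidth, dirty rate, weights); it is not the paper's competitive algorithm.

```cpp
#include <cstdio>

// Illustrative pre-copy loop, NOT the paper's online algorithm. The provider
// weights total migration time vs. downtime with alpha and beta, mirroring
// the weighted objective described in the abstract. All parameters are
// hypothetical.
double simulate_precopy(double vm_memory_mb, double bandwidth_mbps,
                        double dirty_rate_mbps, double alpha, double beta,
                        int max_rounds) {
    double to_send = vm_memory_mb;   // data still to transmit while live
    double total_time = 0.0;
    for (int round = 0; round < max_rounds; ++round) {
        double t = to_send / bandwidth_mbps;     // duration of this live round
        total_time += t;
        double dirtied = dirty_rate_mbps * t;    // pages re-dirtied meanwhile
        // Stop iterating once another round would not shrink the residual.
        if (dirtied >= to_send) { to_send = dirtied; break; }
        to_send = dirtied;
    }
    double downtime = to_send / bandwidth_mbps;  // final off-line transfer
    total_time += downtime;
    return alpha * total_time + beta * downtime; // weighted cost
}

int main() {
    std::printf("weighted cost = %.2f s\n",
                simulate_precopy(4096, 1000, 400, 1.0, 5.0, 30));
}
```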
ISBN (Print): 9781450323055
The desire to build a computer that operates in the same manner as our brains is as old as the computer itself. Although computer engineering has made great strides in hardware performance as a result of Dennard scaling, and even great advances in 'brain like' computation, the field still struggles to move beyond sequential, analytical computing architectures. Neuromorphic systems are being developed to transcend the barriers imposed by silicon power consumption, develop new algorithms that help machines achieve cognitive behaviors, and both exploit and enable further research in neuroscience. In this talk I will discuss a system implementing spiking neural networks. These systems hold the promise of an architecture that is event based, broad and shallow, and thus more power efficient than conventional computing solutions. This new approach to computation, based on modeling the brain and its simple but highly connected units, presents a host of new challenges. Hardware faces trade-offs such as density or lower power at the cost of high interconnection overhead. Consequently, software systems must face choices about new language design. Highly distributed hardware systems require complex place-and-route algorithms to distribute the execution of the neural network across a large number of highly interconnected processing units. Finally, the overall design, simulation, and testing process has to be entirely reimagined. We discuss these issues in the context of the Zeroth processor and how this approach compares to other neuromorphic systems that are becoming available.
ISBN (Digital): 9781728173832
ISBN (Print): 9781728173849
Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety-critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible later in a deterministic order. This allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations, unlike GPUDet. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
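A rough software analogue of the buffering idea is sketched below in C++ (the actual mechanism is implemented in GPU hardware; all names here are hypothetical): per-warp buffers are drained in a fixed order, so floating-point reductions produce the same result on every run even though warps finish in different orders.

```cpp
#include <map>
#include <vector>
#include <cstdio>

// One buffered atomic update: a target location and the value to add.
struct AtomicUpdate { int address; double value; };

// Drain the per-warp buffers in a fixed order (warp 0, then warp 1, ...),
// independent of the order in which the warps actually completed. Because
// floating-point addition is not associative, fixing this order is what
// makes the reduction result reproducible.
void drain_in_order(const std::vector<std::vector<AtomicUpdate>>& per_warp,
                    std::map<int, double>& memory) {
    for (const auto& buffer : per_warp)
        for (const auto& u : buffer)
            memory[u.address] += u.value;
}

int main() {
    std::vector<std::vector<AtomicUpdate>> buffers = {
        {{0, 3.0e16}, {1, 1.0}},   // updates issued by warp 0
        {{0, 1.0}},                // updates issued by warp 1
    };
    std::map<int, double> memory;
    drain_in_order(buffers, memory);
    std::printf("memory[0] = %.1f, memory[1] = %.1f\n", memory[0], memory[1]);
}
```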
ISBN (Digital): 9781728160955
ISBN (Print): 9781728196497
As the scale of high-performance computing (HPC) systems continues to grow, increasing levels of parallelism must be exploited to achieve optimal performance. As recent processors support wide vector extensions, vectorization has become much more important for exploiting the potential peak performance of the target architecture. Novel processor architectures, such as the Armv8-A architecture, introduce the Scalable Vector Extension (SVE), an optional separate architectural extension with a new set of A64 instruction encodings, which enables even greater parallelism. In this paper, we analyze the usage and performance of the SVE instructions in the Arm SVE vector Instruction Set Architecture (ISA), and utilize those instructions to improve memcpy and various local reduction operations. Furthermore, we propose new strategies to improve the performance of MPI operations, including datatype packing/unpacking and MPI reduction. With these optimizations, we not only provide higher parallelism on a single node, but also achieve a more efficient communication scheme for message exchanging. The resulting efforts have been implemented in the context of Open MPI, providing efficient and scalable SVE usage and extending the possible implementations of SVE to a more extensive range of programming and execution paradigms. The evaluation of the resulting software stack under different scenarios, with both a simulator and Fujitsu's A64FX processor, demonstrates that the solution is at the same time generic and efficient.
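As an illustration of the vector-length-agnostic style the abstract refers to, here is a small predicated memcpy written with ACLE SVE intrinsics (a sketch assuming an SVE-capable toolchain; it is not the Open MPI implementation).

```cpp
#include <arm_sve.h>   // Arm C Language Extensions for SVE
#include <cstdint>
#include <cstddef>

// Vector-length-agnostic memcpy using SVE predication. svwhilelt builds a
// predicate covering only the remaining bytes, so there is no scalar tail
// loop and the same binary runs on any SVE vector length.
void sve_memcpy(uint8_t* dst, const uint8_t* src, size_t n) {
    size_t i = 0;
    svbool_t pg = svwhilelt_b8_u64((uint64_t)i, (uint64_t)n);
    while (svptest_any(svptrue_b8(), pg)) {
        svuint8_t v = svld1_u8(pg, src + i);   // predicated load
        svst1_u8(pg, dst + i, v);              // predicated store
        i += svcntb();                         // advance by the vector length
        pg = svwhilelt_b8_u64((uint64_t)i, (uint64_t)n);
    }
}
// Build with e.g. -march=armv8-a+sve (assumes an SVE-capable compiler).
```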
ISBN (Print): 9781450371896
The success of Deep Learning (DL) algorithms in computer vision tasks has created an ongoing demand for dedicated hardware architectures that can keep up with their required computation and memory complexities. This task is particularly challenging when embedded smart camera platforms have constrained resources such as power consumption, Processing Elements (PEs), and communication. This article describes a heterogeneous system embedding an FPGA and a GPU for executing CNN inference for computer vision applications. The built system addresses some challenges of embedded CNNs such as task and data partitioning and workload balancing. The selected heterogeneous platform embeds an Nvidia® Jetson TX2 for the CPU-GPU side and an Intel Altera® Cyclone10GX for the FPGA side, interconnected by PCIe Gen2, with a MIPI-CSI camera for prototyping. This test environment will be used as a support for future work on a methodology for optimized model partitioning.
With the increasing number of scientific applications manipulating huge amounts of data, effective data management is an increasingly important problem. Unfortunately, so far the solutions to this data management problem either require a deep understanding of specific storage architectures and file layouts (as in high-performance file systems) or produce unsatisfactory I/O performance in exchange for ease of use and portability (as in relational DBMSs). In this paper we present a new environment which is built around an active meta-data management system (MDMS). The key components of our three-tiered architecture are the user application, the MDMS, and a hierarchical storage system (HSS). Our environment overcomes the performance problems of pure database-oriented solutions, while maintaining their advantages in terms of ease of use and portability. The high levels of performance are achieved by the MDMS, with the aid of user-specified directives. Our environment supports a simple, easy-to-use yet powerful user interface, leaving the task of choosing appropriate I/O techniques to the MDMS. We discuss the importance of an active MDMS and show how the three components, namely the application, the MDMS, and the HSS, fit together. We also report performance numbers from our initial implementation and illustrate that significant improvements are made possible without undue programming effort.
ISBN (Print): 9781510801011
Molecular Dynamics (MD) is a computational technique with applicability in fields as diverse as material science, biomolecules, and chemical physics. Assisted Model Building with Energy Refinement (AMBER) is an MD package that uses the Message Passing Interface (MPI) to scale in multi-core and cluster environments. In our earlier work [1], we modified one of AMBER's algorithms, the Generalized Born (GB) algorithm, to run optimally on the Xeon Phi co-processor. This improved performance by 277% on the co-processor; the same changes improved performance on the host server by 80%. In this paper, we extend our earlier work and implement a symmetric solution using both the host server and the co-processor. Since the calculations in the GB algorithm involve interactions between all possible atom combinations, it has been very difficult to scale the GB algorithm in distributed memory. We evaluate various alternative techniques using a combination of MPI and Open Multi-Processing (OpenMP) to obtain a scalable solution that utilizes the computing power of both the host server and the co-processor.
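A minimal hybrid MPI+OpenMP sketch of an all-pairs accumulation is given below (illustrative only, with hypothetical sizes; it is not AMBER's GB kernel): ranks, which may be placed symmetrically on the host and the co-processor, split the outer atom loop, threads split the work within a rank, and MPI_Allreduce combines the partial sums.

```cpp
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <algorithm>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_atoms = 1024;                  // hypothetical problem size
    std::vector<double> x(n_atoms, 1.0);       // placeholder coordinates

    // Static block decomposition of the outer atom loop across MPI ranks;
    // ranks may run on the host or the co-processor in symmetric mode.
    int chunk = (n_atoms + size - 1) / size;
    int begin = rank * chunk;
    int end   = std::min(n_atoms, begin + chunk);

    double local = 0.0;
    // Threads split the pair loop inside each rank (toy pairwise term).
    #pragma omp parallel for reduction(+:local) schedule(dynamic)
    for (int i = begin; i < end; ++i)
        for (int j = 0; j < n_atoms; ++j)
            if (i != j) local += 1.0 / (1.0 + (x[i] - x[j]) * (x[i] - x[j]));

    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0) std::printf("global sum = %f\n", global);
    MPI_Finalize();
}
```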