ISBN: 9781665483322 (print)
Block RAMs (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using Logic Blocks (LBs) and Digital Signal Processing (DSP) slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-In-Memory Blocks for FPGAs) RAMs. These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple programmable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute in any precision, which is extremely important for evolving applications like Deep Learning. Adding CoMeFa RAMs to FPGAs significantly increases their compute density. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Unlike existing proposals, CoMeFa RAMs do not require changes to the underlying SRAM technology, such as simultaneously activating multiple rows on the same port, and are practical to implement. CoMeFa RAMs are versatile blocks that find use in diverse parallel applications such as Deep Learning, signal processing, and databases. By augmenting an Intel Arria-10-like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55x (1.85x) across several representative benchmarks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of modern compute-intensive workloads.
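To illustrate what "single-bit bit-serial processing elements" can mean in practice, the following is a minimal software sketch of bit-serial addition, not the CoMeFa hardware design. It assumes a transposed data layout (each memory row holds one bit position of every operand, each column is one lane), which the abstract does not specify; the function names `to_bit_planes`, `from_bit_planes`, and `bit_serial_add` are hypothetical.

```python
def to_bit_planes(values, width):
    """Transpose integers into bit planes: planes[i][lane] is bit i of values[lane]."""
    return [[(v >> i) & 1 for v in values] for i in range(width)]


def from_bit_planes(planes):
    """Inverse of to_bit_planes: rebuild one integer per lane (LSB plane first)."""
    lanes = len(planes[0])
    return [sum(planes[i][lane] << i for i in range(len(planes)))
            for lane in range(lanes)]


def bit_serial_add(a_planes, b_planes):
    """Add two operand sets one bit plane per step.

    Each lane models a one-bit processing element (assumed here to be a
    full adder with a carry latch). All lanes advance in lockstep, so the
    number of steps depends on the operand width, not on the lane count.
    """
    lanes = len(a_planes[0])
    carry = [0] * lanes              # one carry latch per lane
    out = []
    for a_row, b_row in zip(a_planes, b_planes):   # LSB plane first
        s_row = []
        for lane in range(lanes):
            a, b, c = a_row[lane], b_row[lane], carry[lane]
            s_row.append(a ^ b ^ c)                     # sum bit
            carry[lane] = (a & b) | (c & (a ^ b))       # carry out
        out.append(s_row)
    out.append(carry[:])             # final carry becomes the top bit plane
    return out


a = to_bit_planes([3, 5, 7, 0], width=4)
b = to_bit_planes([1, 2, 9, 15], width=4)
print(from_bit_planes(bit_serial_add(a, b)))
```

Because the operand width only sets the number of loop iterations, the same one-bit element handles any precision, which is the property the abstract highlights.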
ISBN: 9781538679104 (print)
Augmented Reality (AR) applications are becoming increasingly popular, and smart devices are the most common platform for running them, for example as online games, travel guides, and personal assistants. However, these AR applications are usually interactive: they require fast response times and consume a great deal of power. They therefore need smart devices equipped with highly adaptable multi-core processors that apply an optimal low-power control technique. In this paper, the power consumption of AR application workloads is mathematically modeled, taking into account the dynamic voltage and frequency scaling (DVFS) of the multi-core central processing unit (CPU) and parallel execution across its cores. Based on the proposed model, the optimal core operating frequency and the minimized power consumption are derived. Experimental results show that the proposed scheme satisfies the interaction time limit with the lowest energy consumption.
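The abstract does not give the model itself, so the following is only a sketch of how such a derivation might look under a common textbook assumption: each core draws `k * f**3 + p_static` watts at frequency `f` (voltage scaling with frequency), and a workload of `cycles` clock cycles splits evenly over `cores`. Every name and parameter here is hypothetical and not taken from the paper.

```python
def energy(cycles, f, k, p_static):
    """Total energy in joules at frequency f (Hz).

    With n cores, runtime is cycles / (n * f) and total power is
    n * (k * f**3 + p_static), so n cancels out of the product.
    """
    return cycles * (k * f ** 2 + p_static / f)


def optimal_frequency(cycles, cores, deadline, k, p_static, f_max):
    """Lowest-energy frequency that still meets the response-time deadline."""
    f_deadline = cycles / (cores * deadline)     # slowest speed that is on time
    if f_deadline > f_max:
        raise ValueError("deadline cannot be met even at f_max")
    # Setting dE/df = 0 for E = cycles * (k f^2 + p_static / f) gives:
    f_energy = (p_static / (2.0 * k)) ** (1.0 / 3.0)
    # E(f) is convex for f > 0, so clamping to the feasible range is optimal.
    return min(max(f_deadline, f_energy), f_max)


f = optimal_frequency(cycles=4e9, cores=4, deadline=1.0,
                      k=1e-27, p_static=0.1, f_max=2e9)
print(f, energy(4e9, f, 1e-27, 0.1))
```

In this toy model a tight deadline forces the frequency up to `f_deadline`, while a loose deadline lets the cores settle at the frequency where dynamic and static energy balance.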
Computers of a non-dedicated cluster are often idle (users attend meetings, have lunch or coffee breaks) or lightly loaded (users carry out simple computations). These underutilized computers can be employed to execut...
The time-domain simulation of coplanar waveguide (CPW) elements for picosecond pulse applications is described. CPW discontinuities were simulated using the 3D transmission-line-matrix (TLM) method. We present a cost-effective approach for the time-domain simulation of coplanar circuit structures that uses distributed computing within a parallel software environment. The use of matched layers and skin-effect models is discussed. The application of the TLM method and distributed computing to the efficient analysis of coplanar structures is presented.
Polygon overlay is one of the complex operations in computational geometry. It is applied in many fields such as Geographic Information Systems (GIS), computer graphics and VLSI CAD. Sequential algorithms for this pro...
One approach for building the next generation of parallel computers is based on large aggregates of multiprocessor chips with support for hardware multithreading. An initial design for IBM's Blue Gene/C project ex...
Cross-subject electroencephalogram (EEG) drowsiness recognition is currently one of the most efficient methods. However, the traditional cross-subject approaches overlook the correlation between channel sub-features a...
In this paper we report new results on developing parallel multiprocessor scheduling algorithms that work in a cellular automata (CA)-based scheduler. We consider the simplest case when a multiprocessor system ...
Many performance problems observed in high end systems are actually caused by the runtime system and not the application code. Detecting these cases will require parallel performance tools to incorporate information a...
ISBN: 9781728109121 (print)
New kinds of applications have emerged that use many threads or irregular communication patterns and rely heavily on point-to-point MPI communications. They stress the MPI library with a potentially large number of simultaneous MPI requests for sending and receiving at the same time. When dealing with large numbers of simultaneous requests, the bottleneck lies in two main mechanisms: tag-matching (the algorithm that matches an incoming packet with a posted receive request) and the progression engine. In this paper, we propose algorithms and implementations that overcome these issues so as to scale up to thousands of requests if needed. In particular, our algorithms are able to perform constant-time tag-matching even with any-source and any-tag support. We have implemented these mechanisms in our New-Madeleine communication library. Through micro-benchmarks and computation kernel benchmarks, we demonstrate that our MPI library exhibits better performance than state-of-the-art MPI implementations in cases with many simultaneous requests.
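The abstract does not describe the matching algorithm, so the following is only an illustrative sketch of one way to get expected constant-time matching with wildcards, not the New-Madeleine implementation. It assumes posted receives are kept in hash-indexed FIFO queues keyed by `(source, tag)`, with `ANY` standing in for the any-source and any-tag wildcards, and uses a global sequence number so the earliest posted matching receive wins. The class and method names are hypothetical, and unexpected-message handling is left out.

```python
from collections import defaultdict, deque
from itertools import count

ANY = None  # hypothetical stand-in for the any-source / any-tag wildcards


class PostedReceives:
    """Posted-receive table with a fixed number of hash lookups per match."""

    def __init__(self):
        self._queues = defaultdict(deque)   # (source, tag) -> FIFO of (seq, request)
        self._seq = count()                 # global posting order

    def post(self, source, tag, request):
        """Register a receive; source and/or tag may be ANY."""
        self._queues[(source, tag)].append((next(self._seq), request))

    def match(self, source, tag):
        """Return the oldest posted receive matching a concrete (source, tag)."""
        # Only four patterns can match an incoming packet, so the work per
        # packet does not grow with the number of posted receives.
        candidates = [(source, tag), (ANY, tag), (source, ANY), (ANY, ANY)]
        best_key, best_seq = None, None
        for key in candidates:
            queue = self._queues.get(key)
            if queue and (best_seq is None or queue[0][0] < best_seq):
                best_key, best_seq = key, queue[0][0]
        if best_key is None:
            return None                     # no posted receive matches
        return self._queues[best_key].popleft()[1]


rx = PostedReceives()
rx.post(ANY, ANY, "wildcard receive")
rx.post(3, 42, "exact receive")
print(rx.match(3, 42))   # the wildcard was posted first, so it is matched first
```

Comparing the heads of the four candidate queues by sequence number preserves posting order across exact and wildcard receives, which a single hash table keyed only on exact pairs would not.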