ISBN: (Print) 9781450326711
Frequent item counting is one of the most important operations in time series data mining algorithms, and the space-saving algorithm is a widely used approach to this problem. With the rapid rise of data input speeds, the most challenging problem in frequent item counting is meeting the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework saves 50% of the lookup-table resource cost and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvement, which outperforms the previous work by up to 3.24 times on some data distributions.
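The space-saving algorithm the framework builds on is simple to state in software. A minimal Python sketch follows; it shows the counter-eviction idea only, not the paper's PE-ring hardware organization:

```python
def space_saving(stream, k):
    """Approximate frequent-item counts using at most k counters
    (the space-saving algorithm of Metwally et al.)."""
    counters = {}  # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the smallest count; the newcomer
            # inherits that count + 1, so its estimate may be an
            # overestimate by at most the evicted minimum.
            victim = min(counters, key=counters.get)
            m = counters.pop(victim)
            counters[item] = m + 1
    return counters
```

Note the invariant that the counter values always sum to the stream length, which is what makes the error bound of the algorithm easy to state.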
ISBN: (Print) 9781450333153
Geometric algebra (GA) is a powerful and versatile mathematical tool that helps to intuitively express and manipulate complex geometric relationships. It has recently been used in engineering problems such as computer graphics, machine vision, and robotics. The problem with GA in its numeric version is that it requires many arithmetic operations, and in a generic architecture operating over homogeneous elements the length of the input vectors is unknown until runtime. Few hardware architectures have been developed to improve the performance of GA applications. In this work, a hardware architecture of a unit for GA operations (the geometric product) on FPGA is presented. The main contribution is the use of parallel memory arrays with access-conflict avoidance to deal with the unknown length of input/output vectors, with the intention of reducing the memory wasted when storing them. In this first stage of the project, we have implemented only a single (fixed-length) access function in the memory array in order to test the geometric-product core. In future work we will implement a full set of access functions with different lengths and shapes. Only simulations are presented here; experimental results will follow in future work.
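For readers unfamiliar with the operation being accelerated, the geometric product over basis blades can be sketched in plain software. The blade representation (sorted tuples of basis indices) and the Euclidean metric e_i·e_i = +1 below are illustrative assumptions; this is a reference for the math, not the paper's memory-array architecture:

```python
def blade_product(a, b):
    """Geometric product of two basis blades in Euclidean R^n.
    A blade is a sorted tuple of basis-vector indices."""
    seq = list(a) + list(b)
    sign = 1
    # Bubble-sort the concatenated index sequence; each swap of two
    # unequal neighbours flips the sign (e_i e_j = -e_j e_i).
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                sign = -sign
                changed = True
    # Cancel equal adjacent pairs: e_i e_i = 1 in a Euclidean metric.
    out = []
    for idx in seq:
        if out and out[-1] == idx:
            out.pop()
        else:
            out.append(idx)
    return sign, tuple(out)

def geometric_product(x, y):
    """x, y: multivectors as dicts {blade tuple: coefficient}."""
    result = {}
    for ba, ca in x.items():
        for bb, cb in y.items():
            s, blade = blade_product(ba, bb)
            result[blade] = result.get(blade, 0) + s * ca * cb
    return {k: v for k, v in result.items() if v != 0}
```

The quadratic blowup of the double loop over blades is exactly the arithmetic cost the abstract refers to.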
ISBN: (Print) 9781450326711
OmpSs is an OpenMP-like directive-based programming model that includes heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies. Indeed, OmpSs has largely influenced the recently appeared OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA and benefits from DLP, ILP and TLP parallelism in order to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.
ISBN: (Print) 9781450325929
In this paper, we describe the challenges that place-and-route tools face in implementing user designs on modern FPGAs while meeting timing and power constraints.
ISBN: (Print) 9781450326711
Packing is a critical step in the CAD flow for cluster-based FPGA architectures, and has a significant impact on the quality of the final placement and routing results. One basic quality metric is routability. Traditionally, minimizing cut (the number of external signals) has been used as the main criterion in packing for routability optimization. This paper shows that minimizing cut is a sub-optimal criterion, and argues for using the Rent characteristic as the new criterion for FPGA packing. We further propose using a recursive bipartitioning-based k-way partitioner to optimize the Rent characteristic during packing. We developed a new packer, PPack2, based on this approach. Compared to T-VPack, PPack2 achieves 35.4%, 35.6%, and 11.2% reductions in wire length, minimum channel width, and critical path delay, respectively. These improvements show that PPack2 outperforms all previous leading packing tools (including iRAC, HDPack, and the original PPack) by a wide margin.
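The Rent characteristic referred to here comes from Rent's rule, T = t·g^p, which relates the terminal count T of a group of g blocks to its size via the Rent exponent p. A hedged sketch of estimating p from (group size, terminal count) samples, as might be gathered during recursive partitioning, is below; this is illustrative only and is not PPack2's actual objective function:

```python
import math

def rent_exponent(samples):
    """Least-squares fit of log T = log t + p * log g over a list of
    (group_size, terminal_count) samples; returns the slope p."""
    xs = [math.log(g) for g, _ in samples]
    ys = [math.log(t) for _, t in samples]
    n = len(samples)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope of the ordinary least-squares line through the log-log data.
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

A lower fitted p over the packed clusters indicates better locality, which is the intuition behind using it as a packing criterion instead of raw cut size.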
ISBN: (Print) 9781450326711
When are FPGAs more energy efficient than processors? This question is complicated by technology factors and the wide range of application characteristics that can be exploited to minimize energy. Using a wire-dominated energy model to estimate the absolute energy required for programmable computations, we determine when spatially organized programmable computations (FPGAs) require less energy than temporally organized programmable computations (processors). The point of crossover depends on the metal layers available, the locality, the SIMD wordwidth regularity, and the compactness of the instructions. When the Rent exponent, p, is less than 0.7, the spatial design is always more energy efficient. When p = 0.8, the technology offers 8 metal layers for routing, and data can be organized into 16b words and processed in tight loops of no more than 128 instructions, the temporal design uses less energy when the number of LUTs is greater than 64K. We further show that heterogeneous multicontext architectures can use even less energy than the p = 0.8, 16b word temporal case.
ISBN: (Print) 9781450326711
As the amount of memory in database systems grows, entire database tables, or even databases, are able to fit in the system's memory, making in-memory database operations more prevalent. This shift from disk-based to in-memory database systems has contributed to a move from row-wise to columnar data storage. Furthermore, common database workloads have grown beyond online transaction processing (OLTP) to include online analytical processing and data mining. These workloads analyze huge datasets that are often irregular and not indexed, making traditional database operations like joins much more expensive. In this paper we explore using dedicated hardware to accelerate in-memory database operations. We present hardware to accelerate the selection process of compacting a single column into a linear column of selected data, joining two sorted columns via merging, and sorting a column. Finally, we put these primitives together to accelerate an entire join operation. We implement a prototype of this system using FPGAs and show substantial improvements in both absolute throughput and utilization of memory bandwidth. Using the prototype as a guide, we explore how the hardware resources required by our design change with the desired throughput.
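The sorted-merge join primitive described above can be sketched in software as follows; this is a reference for the operation itself, not the FPGA design, and the (key, payload) column layout is an illustrative assumption:

```python
def merge_join(left, right):
    """Join two columns sorted on the key; each column is a list of
    (key, payload) pairs. Emits one output row per matching key pair."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            # Gather the run of equal keys on each side, then emit
            # the cross product of the two runs.
            i0 = i
            while i < len(left) and left[i][0] == key:
                i += 1
            j0 = j
            while j < len(right) and right[j][0] == key:
                j += 1
            for a in left[i0:i]:
                for b in right[j0:j]:
                    out.append((key, a[1], b[1]))
    return out
```

Because both cursors only advance, the access pattern is sequential on both columns, which is what makes this primitive a good fit for streaming hardware.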
ISBN: (Print) 9781450326711
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, referred to as Horner's rule, involves the least number of steps but can lead to increased latency due to serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but achieve this at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm, which reforms the evaluation process to include an increased number of squaring steps. By using a squarer design that is more efficient than general multiplication, this can result in polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex 6 FPGA. When applied in fixed-point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead in evaluating 5th-degree polynomials.
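To make the latency contrast concrete, here are software sketches of the two baselines the abstract compares against; the paper's own squaring-based reformulation is not reproduced here:

```python
def horner(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with Horner's rule: one
    multiply-add per coefficient, but fully serial (each step
    depends on the previous one)."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

def estrin(coeffs, x):
    """Estrin's method: combine adjacent coefficient pairs level by
    level, squaring x between levels. Each level's pair combinations
    are independent, so the dependency depth is logarithmic."""
    cs = list(coeffs)
    p = x
    while len(cs) > 1:
        # Combine pairs: c[2i] + c[2i+1] * p (odd tail passes through).
        cs = [cs[i] + (cs[i + 1] * p if i + 1 < len(cs) else 0.0)
              for i in range(0, len(cs), 2)]
        p = p * p
    return cs[0]
```

Horner needs n dependent multiply-adds for degree n, while Estrin needs about log2(n) dependent levels at the cost of extra multipliers, which is the area/latency trade-off the abstract refers to.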
ISBN: (Print) 9781450326711
Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than that of dense matrices, mostly because the compression formats required to efficiently store sparse matrices are a poor match for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that is scalable to efficiently utilize the available memory bandwidth and computing resources. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan) running the MKL and cuSPARSE sparse-BLAS libraries, respectively. In addition, the SpMxV FPGA kernel is able to achieve higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, with an overall 38-50x improvement in energy efficiency.
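The architectural mismatch mentioned above stems from compressed formats such as CSR, whose indirect, data-dependent accesses break the regular patterns dense kernels enjoy. A minimal CSR SpMxV sketch for reference (a software illustration of the operation, not the paper's FPGA kernel):

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in compressed sparse row (CSR) form:
    values/col_idx hold the nonzeros row by row, and the slice
    row_ptr[i]:row_ptr[i+1] delimits row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is gathered through col_idx: an indirect, irregular
            # access pattern that caches and SIMD units handle poorly.
            y[i] += values[k] * x[col_idx[k]]
    return y
```

The gather through `col_idx` is the irregular step that custom FPGA datapaths can pipeline more effectively than general-purpose memory hierarchies.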