DSP chips are gaining importance in ultrasound applications as the need for portability and low power grows. One of the more computationally demanding applications for ultrasound involves estimating blood flow charact...
Many modern computer architectures feature fused multiply-add (FMA) instructions, which offer potentially faster performance for numerical applications. For DSP transforms, compilers can only generate FMA code to a very limited extent because optimal use of FMAs requires modifying the chosen algorithm. In this paper we present a framework for automatically generating FMA code for every linear DSP transform, which we implemented as an extension to the SPIRAL code generation system. We show that for many transforms and transform sizes, our generated FMA code matches the best-known hand-derived FMA algorithms in terms of arithmetic cost. Further, we present actual runtime results that show the speed-up obtained by using FMA instructions.
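To illustrate why using FMAs well requires modifying the algorithm, consider a plane rotation, a building block of many DSP transforms. The sketch below is only an illustration, not necessarily the rewriting used in the paper. The function names are hypothetical, and plain Python arithmetic stands in for hardware FMA instructions. Factoring out cos(theta) turns 4 multiplications and 2 additions into 2 expressions of the form a + b*c (one FMA each) plus 2 scalings.

```python
import math

def rotate_plain(x, y, theta):
    """Standard rotation: 4 multiplications and 2 additions/subtractions."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_fma_form(x, y, theta):
    """Same rotation, factored so each inner expression has the shape a + b*c.

    On FMA hardware each inner expression maps to one fused instruction,
    giving 2 FMAs plus 2 multiplications. Assumes cos(theta) != 0.
    """
    c, t = math.cos(theta), math.tan(theta)
    u = x - t * y  # FMA-shaped: a + b*c with b = -t
    v = y + t * x  # FMA-shaped: a + b*c with b = t
    return c * u, c * v  # remaining scaling by cos(theta)
```

In a larger transform the remaining scaling by cos(theta) can often be absorbed into the constants of a later stage. That kind of algorithm-level change is one a compiler cannot be expected to make on its own.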
ISBN: (print) 9789082797060
Restoration of analytical chemistry data from degraded physical acquisitions is an important task for chemists to obtain accurate component analysis and sound interpretation. The high-dimensional nature of these signals and the large amount of data to be processed call for fast and efficient reconstruction methods. Existing works have primarily relied on optimization algorithms to solve a penalized formulation. Although very powerful, such methods can be computationally heavy, and hyperparameter tuning can be a tedious task for non-experts. A more recently explored family of approaches adopts deep learning to perform the signal recovery task in a supervised fashion. Although fast, thanks to formulations amenable to GPU implementations, these methods usually need large annotated databases and are not explainable. In this work, we combine the best of both worlds by proposing unfolded Majorization-Minimization (MM) algorithms, with the aim of obtaining fast and accurate methods for sparse spectroscopy signal restoration. Two state-of-the-art iterative MM algorithms are unfolded onto deep network architectures. This allows both the deployment of GPU-friendly tools for accelerated implementation and the introduction of a supervised learning strategy for automatically tuning the regularization parameter. The effectiveness of our approach is demonstrated on the restoration of a large dataset of realistic mass spectrometry data.
Architectures for low-density parity-check (LDPC) decoders are discussed, with methods to reduce their complexity. Serial implementations similar to traditional microprocessor datapaths are compared against implementations with multiple processing elements that exploit the inherent parallelism in the decoding algorithm. Several classes of LDPC codes, such as those based on irregular random graphs and geometric properties of finite fields, are evaluated in terms of their suitability for VLSI implementation and performance as measured by bit-error rate. Efficient realizations of low-density parity-check decoders under area, power, and throughput constraints are of particular interest in the design of communications receivers.
ISBN: (print) 9781479904020; 9781479904037
The paper describes FPGA technology and its ability to exploit spatial and temporal parallelism in order to implement hardware architectures for iterative algorithms. Developing hardware architectures with FPGA technology is a reliable solution for applications in which fast processing of iterative algorithms is mandatory. Two applications that use FPGA technology for processing are presented. On the one hand, automatic microarray grid alignment is performed using an FPGA-based hardware architecture; on the other hand, an FPGA-based LDPC decoder implementation is proposed in order to improve decoder throughput compared to state-of-the-art approaches.
This paper describes the implementation of the MPEG AVC CABAC entropy decoder using the RVC-CAL dataflow programming language. CABAC is the Context based Adaptive Binary Arithmetic Coding entropy decoder that is used ...
ISBN: (print) 9783642246494
Finding optimal phase durations for a controlled intersection is a computationally intensive task requiring O(N^3) operations. In this paper we introduce cost-optimal parallelization of a dynamic programming algorithm that reduces the complexity to O(N^2). Three implementations that span a wide range of parallel hardware are developed. The first is based on a shared-memory architecture, using the OpenMP programming model. The second is based on message passing, targeting massively parallel machines, including high-performance clusters and supercomputers. The third is based on the data-parallel programming model mapped onto Graphics Processing Units (GPUs). Key optimizations include loop reversal, communication pruning, load balancing, and efficient thread-to-processor assignment. Experiments have been conducted on an 8-core server, IBM BlueGene/L supercomputer 2-node boards with 128 processors, and an Nvidia GeForce GTX470 GPU with 448 cores. Results indicate practical scalability on all platforms, with the maximum speedup reaching 76x on the GTX470.
ISBN: (print) 9781424427604
This work presents a new performance improvement technique for hardware implementations of non-recursive, convolution-based image processing algorithms. It combines an advanced data flow technique (instruction reuse) proposed in modern microprocessor design with the value locality of image data to develop a method, window memoization, that increases throughput with minimal cost in area and accuracy. We implement window memoization as a 2-wide superscalar pipeline such that it consumes significantly less area than conventional 2-wide superscalar pipelines. As a case study, we have applied window memoization to the Kirsch edge detector. The average speedup factor was 1.76 with only 25% extra hardware.
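The idea behind window memoization can be sketched in software as follows. This is a minimal illustration and not the paper's hardware design. The function names, the unbounded dictionary cache, and the `drop_bits` quantization of the lookup key are all assumptions made for the example. Pixels are assumed to be non-negative integers.

```python
# Ring of positions around the centre of a 3x3 window, clockwise from top-left.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Kirsch weights for one compass direction; the other 7 masks are rotations.
WEIGHTS = [5, 5, 5, -3, -3, -3, -3, -3]

def kirsch(window):
    """Maximum response over the 8 Kirsch masks for a flat 9-tuple window."""
    ring = [window[r * 3 + c] for r, c in RING]
    return max(
        sum(WEIGHTS[(i - k) % 8] * ring[i] for i in range(8))
        for k in range(8)
    )

def kirsch_memoized(image, drop_bits=2):
    """Apply Kirsch to every 3x3 window, reusing results for similar windows.

    The cache key drops the low `drop_bits` bits of each pixel (an assumed
    quantization), so windows that differ only slightly share one result.
    Returns the output image and the number of cache hits.
    """
    cache, hits = {}, 0
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for r in range(h - 2):
        for c in range(w - 2):
            window = tuple(image[r + dr][c + dc]
                           for dr in range(3) for dc in range(3))
            key = tuple(p >> drop_bits for p in window)
            if key in cache:
                hits += 1  # computation skipped, result reused
            else:
                cache[key] = kirsch(window)
            out[r][c] = cache[key]
    return out, hits
```

Setting `drop_bits=0` makes the reuse exact. Larger values increase the hit rate and reduce accuracy, which mirrors the throughput-versus-accuracy trade-off described in the abstract.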
ISBN: (print) 9781467385268
Elliptic Curve Cryptography (ECC) has seen increasing success in security applications in recent years, thanks to advantages such as short key sizes combined with a high security level. Its popularity has led to various implementations in terms of algorithms, curves, coordinate systems, platforms, etc. The aim of this work is first to explore current trends in ECC-based implementations on different platforms through a review of a number of works, and second to identify and gather the criteria and corresponding metrics used to evaluate the performance of these implementations. We also propose a platform for design exploration and evaluation of ECC designs.
ISBN: (print) 078037147X
Sonar beamforming is an ideal application for reconfigurable computing due to its high available levels of parallelism, relatively low sample rates, and modest word sizes. In this paper we describe a family of beamforming algorithms and their implementation using configurable computing technology. These include algorithms for time-delay, frequency-domain, and matched-field beamforming. Configurable computing architectures appropriate for each are described, and the tradeoffs associated with mapping each to a concrete platform are discussed.
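For readers unfamiliar with time-delay beamforming, the simplest variant (delay-and-sum) can be sketched as below. This is a generic textbook illustration, not the paper's implementation. The function name and the restriction to non-negative integer sample delays are assumptions made for the example.

```python
def delay_and_sum(channels, delays):
    """Steer an array by delaying each sensor's samples and averaging.

    channels: list of equal-length sample lists, one per sensor.
    delays:   non-negative integer delay (in samples) per sensor,
              chosen for the steering direction (assumed given).
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        # Each sensor contributes independently, which is the
        # parallelism that maps well onto reconfigurable hardware.
        for t in range(d, n):
            out[t] += samples[t - d]
    return [v / len(channels) for v in out]
```

When the delays match the arrival-time differences of a wavefront, the sensor signals add coherently and the output peaks. For other directions the contributions are spread over time and the output is smaller.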