ISBN: 9781424414369 (print)
JPEG2000 provides excellent compression performance and fine granularity scalability but at the cost of high computational complexity. We propose two speed-up techniques and use the TI DSP optimization tools to accelerate the Tier1 module. We eliminate the unnecessary checking cycles by recording the NBC (Need-to-Be-Coded) samples on a list. Furthermore, the sample index is reordered to facilitate fast execution. In the DSP implementation of the proposed methods, we use code acceleration techniques, cache memory allocation, and TI DSP compiler-level optimization tools. Even when the original program is compiled with the same DSP optimization tools and proper cache assignment, our fast algorithm can still reduce the computation by 45%.
Advanced bit manipulation operations are not efficiently supported by commodity word-oriented microprocessors. Programming tricks are typically devised to shorten the long sequence of instructions needed to emulate these complicated bit operations. As these bit manipulation operations are relevant to applications that are becoming increasingly important, we propose direct support for them in microprocessors. In particular, we propose fast bit gather (or parallel extract), bit scatter (or parallel deposit) and bit permutation instructions (including group, butterfly and inverse butterfly). We show that all these instructions can be implemented efficiently using both the fast butterfly and inverse butterfly network datapaths. Specifically, we show that parallel deposit can be mapped onto a butterfly circuit and parallel extract can be mapped onto an inverse butterfly circuit. We define static, dynamic and loop-invariant versions of the instructions, with static versions utilizing a much simpler functional unit. We show how a hardware decoder can be implemented for the dynamic and loop-invariant versions to generate, dynamically, the control signals for the butterfly and inverse butterfly datapaths. The simplest functional unit we propose is smaller and faster than an ALU. We also show that these instructions yield significant speedups over a basic RISC architecture for a variety of application kernels taken from application domains including bioinformatics, steganography, coding, compression and random number generation.
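To make the semantics of parallel extract and parallel deposit concrete, the following is a minimal software sketch in Python. It shows only what the two operations compute, one mask bit per loop iteration, which is exactly the kind of long instruction sequence the abstract says hardware support would replace. The function names `pext` and `pdep` are illustrative choices, not taken from the paper, and the sketch says nothing about the butterfly datapaths themselves.

```python
def pext(x: int, mask: int) -> int:
    """Parallel extract (bit gather): pack the bits of x selected by mask
    into the low-order bits of the result, preserving their order."""
    result = 0
    out_pos = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if x & lowest:
            result |= 1 << out_pos
        out_pos += 1
        mask &= mask - 1           # clear that mask bit and continue
    return result


def pdep(x: int, mask: int) -> int:
    """Parallel deposit (bit scatter): place the low-order bits of x
    at the positions selected by mask, preserving their order."""
    result = 0
    in_pos = 0
    while mask:
        lowest = mask & -mask      # next destination position
        if (x >> in_pos) & 1:
            result |= lowest
        in_pos += 1
        mask &= mask - 1
    return result
```

For example, `pext(0b10110110, 0b11110000)` gathers the upper four bits into `0b1011`, and `pdep(0b1011, 0b11110000)` scatters them back to `0b10110000`.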
ISBN: 9781538646588 (print)
Accelerating iterative algorithms for solving inverse problems using neural networks has become a very popular strategy in recent years. In this work, we propose a theoretical analysis that may provide an explanation for its success. Our theory relies on the use of inexact projections with the projected gradient descent (PGD) method. It is demonstrated on various problems, including image super-resolution.
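As a reminder of the basic iteration the analysis builds on, here is a minimal Python sketch of projected gradient descent for a least-squares inverse problem. It is an illustration under stated assumptions, not the method of the paper: the constraint set is assumed to be a simple box so that the projection is exact and cheap, and the names `pgd` and `project_box` are hypothetical. The `project` argument is the place where an approximate (inexact) projection operator would be substituted.

```python
def project_box(v, lo=0.0, hi=1.0):
    """Exact Euclidean projection onto the box [lo, hi]^n (assumed constraint set)."""
    return [min(max(v_j, lo), hi) for v_j in v]


def pgd(A, y, project, step, iters=200):
    """Minimize 0.5 * ||A x - y||^2 subject to x in a set, given its projection.

    A is a list of rows, y a list of observations, project a callable that
    maps a vector back onto (or, for an inexact variant, near) the set.
    """
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # Residual r = A x - y
        r = [sum(a * x_j for a, x_j in zip(row, x)) - y_i
             for row, y_i in zip(A, y)]
        # Gradient g = A^T r
        g = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n)]
        # Gradient step followed by projection
        x = project([x_j - step * g_j for x_j, g_j in zip(x, g)])
    return x
```

With `A = [[2.0, 0.0], [0.0, 1.0]]`, `y = [1.0, 3.0]` and `step = 0.2`, the unconstrained minimizer is `[0.5, 3.0]`, and the box constraint clips the second coordinate to 1.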
ISBN: 9781728102139 (print)
Stereo depth estimation is used for many computer vision applications. Though many popular methods strive solely for depth quality, for real-time mobile applications (e.g. prosthetic glasses or micro-UAVs), speed and power efficiency are equally, if not more, important. Many real-world systems rely on Semi-Global Matching (SGM) to achieve a good accuracy vs. speed balance, but power efficiency is hard to achieve with conventional hardware, making the use of embedded devices such as FPGAs attractive for low-power applications. However, the full SGM algorithm is ill-suited to deployment on FPGAs, and so most FPGA variants of it are partial, at the expense of accuracy. In a non-FPGA context, the accuracy of SGM has been improved by More Global Matching (MGM), which also helps tackle the streaking artifacts that afflict SGM. In this paper, we propose a novel, resource-efficient method that is inspired by MGM's techniques for improving depth quality, but which can be implemented to run in real time on a low-power FPGA. Through evaluation on multiple datasets (KITTI and Middlebury), we show that in comparison to other real-time capable stereo approaches, we can achieve a state-of-the-art balance between accuracy, power efficiency and speed, making our approach highly desirable for use in real-time systems with limited power.
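For readers unfamiliar with SGM, the sketch below shows the standard SGM cost aggregation recurrence along a single left-to-right path of one scanline, in Python. It illustrates only the textbook SGM step, not the FPGA method proposed in the paper and not MGM. The function name `aggregate_path` and the penalty values `p1` and `p2` are assumptions chosen for illustration.

```python
def aggregate_path(cost, p1=1.0, p2=4.0):
    """Aggregate matching costs along one scanline, left to right.

    cost[x][d] is the matching cost of pixel x at disparity d.
    Returns agg[x][d] following the usual SGM recurrence:
      agg[x][d] = cost[x][d]
                  + min(agg[x-1][d],
                        agg[x-1][d-1] + p1,
                        agg[x-1][d+1] + p1,
                        min_k agg[x-1][k] + p2)
                  - min_k agg[x-1][k]
    """
    num_disp = len(cost[0])
    agg = [list(cost[0])]              # first pixel has no predecessor
    for x in range(1, len(cost)):
        prev = agg[-1]
        prev_min = min(prev)
        row = []
        for d in range(num_disp):
            best = prev[d]                              # same disparity
            if d > 0:
                best = min(best, prev[d - 1] + p1)      # small change
            if d + 1 < num_disp:
                best = min(best, prev[d + 1] + p1)      # small change
            best = min(best, prev_min + p2)             # large jump
            # Subtracting prev_min keeps values from growing along the path
            row.append(cost[x][d] + best - prev_min)
        agg.append(row)
    return agg
```

Full SGM sums such aggregated costs over several path directions before picking the lowest-cost disparity per pixel. Directions that need pixels not yet seen in raster order are what make the complete algorithm awkward for streaming hardware.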
ISBN: 9783319264288; 9783319264271 (print)
Machine learning algorithms are benefiting from the continuous improvement of programming models, including MPI, MapReduce and PGAS. The k-Nearest Neighbors (k-NN) algorithm is a widely used machine learning algorithm, applied to supervised learning tasks such as classification. Several parallel implementations of k-NN have been proposed in the literature and in practice. However, on high-performance computing systems with high-speed interconnects, it is important to further accelerate existing designs of the k-NN algorithm by taking advantage of scalable programming models. To improve the performance of k-NN in large-scale environments with InfiniBand networks, this paper proposes several alternative hybrid MPI+OpenSHMEM designs and performs a systematic evaluation and analysis on typical workloads. The hybrid designs leverage one-sided memory access to overlap communication with computation better than the existing pure MPI design, and introduce better schemes for efficient buffer management. The implementation, based on the k-NN program from the MaTEx toolkit with MVAPICH2-X (Unified MPI+PGAS Communication Runtime over InfiniBand), shows up to 9.0% time reduction for training the KDD Cup 2010 workload over 512 cores, and 27.6% time reduction for a small workload with balanced communication and computation. Experiments with varying numbers of cores show that our design maintains good scalability.
This thesis deals with the implementation and possible optimizations of the lattice-Boltzmann method. This method makes it possible to model fluid flow by simulating the movement of fictitious particles. The thesis focuses on possible improvements to the existing tool HemeLB, which specializes in simulating blood flow in the brain. Among other things, it examines vectorization and parallelization techniques whose implementation could benefit this tool. The thesis includes the implementation of an application that compares several selected algorithms for the lattice-Boltzmann method, including their possible optimizations. Also included are tests comparing these algorithms by achieved performance, cache utilization and overall memory consumption. The best performance achieved was 150 million lattice site updates per second.
ISBN: 9781538646595 (print)
Accelerating iterative algorithms for solving inverse problems using neural networks has become a very popular strategy in recent years. In this work, we propose a theoretical analysis that may provide an explanation for its success. Our theory relies on the use of inexact projections with the projected gradient descent (PGD) method. It is demonstrated on various problems, including image super-resolution.
ISBN: 9781479914180 (print)
High-speed computing and growing amounts of data are driving the quest for ever faster sorting algorithms. Sorting networks, which sort in parallel, and the dataflow computational paradigm are offered as a possible solution. In the presented experiments, the bitonic mergesort algorithm is implemented on an entry-level model of the Maxeler dataflow supercomputing system. Our results show that sorting small arrays on Maxeler achieves a speedup factor of 16 compared with the fastest sorting algorithm on a CPU. Using more advanced Maxeler systems, we expect to be able to sort larger arrays and achieve greater speedups.
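The structure of a bitonic sorting network can be sketched sequentially in Python as follows. This is a plain software illustration of the standard algorithm, not the Maxeler implementation described above. The function name `bitonic_sort` is an illustrative choice, and the input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic sorting network.

    All compare-exchange operations in one (k, j) stage touch disjoint pairs,
    so on parallel or dataflow hardware each stage can run in a single step.
    """
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2
    while k <= n:                  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:              # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:    # handle each pair once
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The sequence of comparisons depends only on the array length and never on the data, which is what makes sorting networks a natural fit for fixed dataflow pipelines.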