ISBN: (print) 9781509009428
Digital signal processing is at the heart of modern personal communication devices that support high-throughput media streams. Although DSPs are typically implemented using high-performance clocked processors, delay-insensitive asynchronous techniques are increasingly being applied because they are robust to the extreme variability of modern nano-scale fabrication processes. As an additional benefit, these systems do not exhibit the trade-off between latency and throughput that is characteristic of clocked Boolean systems. In this paper, a low-latency Null Convention Logic (NCL) based parallel 16×16 bit multiplier is used to illustrate the pipelining behaviour of an NCL system. We show that a 16×16 non-pipelined NCL multiplier achieves a throughput of nearly 260 Mops/sec, with a latency of around 1.20 ns. By including 15 pipeline stages, the multiplier throughput can be more than doubled to 609 Mops/sec while the latency increases to only 2.26 ns. This can be contrasted with the corresponding synchronous multiplier incorporating the same number of pipeline stages, which would exhibit a latency of around 25 ns at an equivalent throughput.
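The synchronous figure quoted above can be checked with simple pipeline arithmetic. The sketch below assumes an idealized clocked pipeline that accepts one operation per cycle, so that latency equals the number of stages times the clock period. That model and the function name are illustrative assumptions, not details taken from the paper.

```python
def sync_pipeline_latency_ns(throughput_mops: float, stages: int) -> float:
    """Latency of an idealized clocked pipeline that issues one op per cycle."""
    clock_period_ns = 1e3 / throughput_mops  # period of a (Mops/s) clock, in ns
    return stages * clock_period_ns


# Figures quoted in the abstract.
ncl_plain_mops = 260.0   # non-pipelined NCL multiplier
ncl_piped_mops = 609.0   # NCL multiplier with 15 pipeline stages

speedup = ncl_piped_mops / ncl_plain_mops                      # roughly 2.3x
sync_latency_ns = sync_pipeline_latency_ns(ncl_piped_mops, 15)  # a little under 25 ns
```

Under this assumption a 15-stage clocked design running at 609 Mops/sec needs about ten times the 2.26 ns latency reported for the pipelined NCL multiplier.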
Massive multiple-input multiple-output (MIMO) detection plays a prominent role in wireless communication. Within this trend, message passing detection (MPD) has attracted considerable attention for its high throughput and low complexity relative to other detection algorithms. Among the various MPD algorithms, Gaussian approximate interference belief propagation (GAI-BP) detection exhibits excellent convergence and performance across different scenarios. However, its hardware design needs to be more efficient and flexible. This paper proposes a novel semi-parallel hardware architecture for GAI-BP detection, aimed at improving its configurability using a layer scheduling scheme. Through this scheme, the proposed detector achieves a gain of 0.65 dB compared to the original algorithm, while simultaneously accommodating multi-antenna radios through the proposed hardware design.
ISBN: (print) 9781728143378; 9781728143385
The longstanding theory of “parallel processing” predicts that, except for a sign reversal, ON and OFF cells are driven by a similar pre-synaptic circuit and have similar visual field coverage, direction/orientation selectivity, visual acuity and other functional properties. However, recent experimental data challenge this view. Here we present an information-theory-based receptive field (RF) estimation method, quadratic mutual information (QMI), applied to multi-electrode array electrophysiological recordings from the mouse dorsal lateral geniculate nucleus (dLGN). This estimation method provides more accurate RF estimates than the commonly used Spike-Triggered Average (STA) method, particularly in the presence of spatially correlated inputs. This improved efficiency allowed a larger number of RFs (285 vs 189 cells) to be extracted from a previously published dataset. Fitting a spatial-temporal Difference-of-Gaussians (ST-DoG) model to the RFs revealed that while the structural RF properties of ON and OFF cells are largely symmetric, there were some asymmetries in the functional properties of the ON and OFF visual processing streams, with OFF cells preferring higher spatial and temporal frequencies on average and showing a greater degree of orientation selectivity.
In spite of their striking diversity, numerous tasks and architectures of intelligent systems, such as those permeating multivariable data analysis (e.g., time series, spatio-temporal, and spatial dependencies), decision-making processes along with their models, recommender systems and others, exhibit two evident commonalities. They promote human centricity and vigorously engage perceptions (rather than plain numeric entities) in the realization of the systems and their usage. Information granules play a pivotal role in such settings. In the sequel, Granular Computing delivers a cohesive framework supporting the formation of information granules and facilitating their processing. We exploit two essential concepts of Granular Computing. The first one, formed with the aid of a principle of justifiable granularity, deals with the construction of information granules. The second one, based on the idea of an optimal allocation of information granularity, helps endow constructs of intelligent systems with much-needed conceptual and modeling flexibility. The talk covers in detail two representative studies. The first one is concerned with a granular interpretation of temporal data, where the role of information granularity is clearly visible in supporting a human-centric description of relationships existing in data. In the second study, focused on the Analytic Hierarchy Process (AHP) used in decision-making, we show how an optimal allocation of granularity helps facilitate collaborative activities (e.g., consensus building) in group decision-making.
Describes two different approaches to optimizing the performance of SoC architectures in the architecture exploration phase. Both address the problem of mapping and scheduling a task graph on a target architecture, with special consideration of on-chip communications. A constructive algorithm is presented that extends previous work by taking into account potential future data transfers. The second approach is a recursive procedure based on local search techniques in a specially defined neighborhood of the critical path. Simulated annealing and tabu search are used as search algorithms. Both approaches find solutions with better performance than established methodologies. The recursive technique yields better results than the constructive approach but is limited to small and mid-sized problems, whereas the constructive algorithm has no such limitation.
Innovations in high-performance computing (HPC) architecture are enabling high-fidelity whole-core neutron transport simulations in reasonable time. In particular, the heterogeneous architectures now in widespread use keep the cost of such simulations low. The neutron distribution of a reactor core is governed by the Boltzmann neutron transport equation (BTE), viable solutions of which demand tremendous computing resources. Among the high-fidelity numerical methods, the discrete ordinates method (SN) is becoming popular in the reactor design community because it strikes a good balance between computational cost and accuracy. Recently, MT-3000, a multizone heterogeneous architecture with a peak double-precision performance of 11.6 TFLOPS, has been proposed. In this work, the BTE is solved by the SN method with heterogeneous Koch-Baker-Alcouffe (KBA) parallel algorithms based on the MT-3000 architecture. A communication mechanism has been established to efficiently transmit data among the acceleration cores and the CPU cores. The kernel computation procedure is largely accelerated by vectorization and instruction pipelining techniques. Numerical experiments show that our formulation can achieve 1.37 TFLOPS on a single MT-3000, which is 11.8% of its peak performance.
The tag sorting circuit in Weighted Fair Queuing (WFQ) is crucial to Quality of Service (QoS). In this paper, we present an optimized hardware architecture for fast tag sorting, which consists of one-hot encoding and leading zero counting. The architecture is parallel and pipelined. It is implemented using FPGA technology. In comparison with the traditional comparator-tree-based architecture, it improves the frequency by 15% and reduces the area by 22%.
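One way to see how one-hot encoding and leading zero counting can replace a comparator tree is the software sketch below. It assumes that each tag sets one bit in a bitmap, with tag 0 mapped to the most significant bit, so that the number of leading zeros equals the smallest tag present. The function name and bit ordering are hypothetical choices for illustration and are not taken from the paper's circuit.

```python
from typing import Sequence


def min_tag_by_one_hot(tags: Sequence[int], tag_bits: int) -> int:
    """Return the smallest tag using a one-hot bitmap and a leading-zero count."""
    if not tags:
        raise ValueError("no tags to sort")
    width = 1 << tag_bits  # one bitmap position per possible tag value
    bitmap = 0
    for tag in tags:
        if not 0 <= tag < width:
            raise ValueError(f"tag {tag} does not fit in {tag_bits} bits")
        # One-hot encode the tag: tag 0 maps to the most significant bit.
        bitmap |= 1 << (width - 1 - tag)
    # The count of leading zeros in the bitmap is the smallest tag present.
    return width - bitmap.bit_length()
```

In hardware the OR of the one-hot vectors and the leading-zero count can both be evaluated in parallel, which is the property this sketch is meant to illustrate; the loop here only stands in for that parallel OR.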
Fast Fourier Transforms (FFTs) are highly parallel in nature and consist of simple addition, subtraction, and complex rotation operators with phase factors (a.k.a. twiddle factors). With the advent of FPGAs and other reconfigurable seas-of-logic, it is now possible to construct a fully parallel FFT structure in which the phase factors are constants and therefore good targets for hardware optimization. By varying the fixed-point length of the phase factors, using phase angle error percentage as the control for the variable-length phase factor quantizer, the number of shifted adders required to implement the complex rotation operators can be reduced. Performance comparisons of fixed-length and variable-length phase factors, along with two quantizer rounding modes, are investigated.
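A variable-length quantizer of this kind might look like the sketch below. It picks, for each twiddle factor, the shortest fractional length whose quantized phasor stays within a phase error limit. The abstract does not define "phase angle error percentage", so this sketch assumes it means the angle error as a percentage of the spacing 2π/N between adjacent twiddle factors. That definition, the function names, and the round-to-nearest mode are all assumptions.

```python
import cmath
import math


def quantize(value: float, frac_bits: int) -> float:
    """Round to the nearest multiple of 2**-frac_bits (ties go to even)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def shortest_twiddle_length(k: int, n: int, max_error_pct: float,
                            max_bits: int = 24) -> int:
    """Smallest fractional length whose phase error stays within the limit."""
    exact = cmath.exp(-2j * math.pi * k / n)
    step = 2 * math.pi / n  # angular spacing between adjacent twiddle factors
    for bits in range(1, max_bits + 1):
        approx = complex(quantize(exact.real, bits), quantize(exact.imag, bits))
        if approx == 0:
            continue  # too coarse to carry any phase information
        # Angle between the exact and the quantized phasor.
        error = abs(cmath.phase(approx / exact))
        if 100.0 * error / step <= max_error_pct:
            return bits
    return max_bits  # limit not met; fall back to the longest length tried
```

Shorter coefficients generally have fewer non-zero bits, which is what reduces the number of shifted adders in a constant multiplier. Counting those adders is left out of this sketch.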
An optimized software implementation of a high-quality MPEG AAC-LC (low complexity) audio encoder is presented in this paper. The standard reference encoder is improved by several algorithmic optimizations (fast psychoacoustic model, new tonality estimation, new time-domain block switching, optimized quantizer and Huffman coder) and very careful code optimizations for PC CPU architectures with SIMD (single-instruction-multiple-data) instruction sets. The psychoacoustic model uses the MDCT filterbank for energy estimation and peak detection as a measure of tonality. The block size decision is based on local perceptual entropies as well as LPC analysis of the time signal. Algorithmic optimizations in the quantizer include a modified loop control module and an optimized Huffman search. Code optimization is based on parallel processing, replacing vector algebra and math functions with their optimized equivalents from the Intel® Signal Processing Library (SPL). The implemented codec outperforms consumer MP3 encoders at a 30% lower bitrate while achieving encoding speeds several times faster than real time.
3D graphics performance is increasing faster than that of any other computing application. Almost all PC systems now include 3D graphics accelerators for games, Computer Aided Design (CAD) or visualization applications. This paper investigates the suitability of Field Programmable Gate Array (FPGA) devices as a low-cost solution for implementing 3D affine transformations. A proposed solution based on large matrix multiplication has been implemented, for large 3D models, on the RC1000-PP Celoxica board-based development platform using Handel-C, a C-like language supporting parallelism, flexible data sizes and compilation of high-level programs directly into FPGA hardware.
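The link between affine transformations and large matrix multiplication can be sketched as follows: every vertex is written in homogeneous coordinates, the vertices are packed as columns of one 4×N matrix, and a single 4×4 transform is applied to all of them in one product. The code is a plain Python illustration of that formulation, not the paper's Handel-C design, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    return [[1.0, 0.0, 0.0, tx],
            [0.0, 1.0, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]


def rotation_z(theta):
    """4x4 homogeneous rotation about the z axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0],
            [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def transform(matrix, vertices):
    """Apply a 4x4 affine matrix to a list of (x, y, z) vertices."""
    # Pack all vertices as columns of one 4xN homogeneous matrix.
    packed = [[v[row] for v in vertices] for row in range(3)]
    packed.append([1.0] * len(vertices))
    out = matmul(matrix, packed)
    return [(out[0][j], out[1][j], out[2][j]) for j in range(len(vertices))]


# Rotate 90 degrees about z, then translate by 5 along z.
combined = matmul(translation(0.0, 0.0, 5.0), rotation_z(math.pi / 2))
moved = transform(combined, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

Each output column depends only on its own input column, so the columns can be computed independently, which is what makes the product a natural fit for parallel hardware.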