During the last decade, processor architectures have emerged with hundreds to thousands of high-speed processing cores on a single chip. These cores can work in parallel to share a workload for faster execution. This paper presents performance evaluations on such multicore and many-core devices by mapping a computationally expensive correlation kernel of a template matching process using various programming models. The work first establishes a baseline performance case with a sequential mapping of the algorithm on an Intel processor. In the second step, performance is enhanced by a parallel mapping of the kernel on a shared-memory multicore machine using the OpenMP programming model. Finally, the Normalized Cross-Correlation (NCC) kernel is scaled to a many-core K20 GPU using the CUDA programming model. At every step, correctness of the implementation is verified by comparing computed data with reference results from a high-level MATLAB implementation. Performance results are presented with various optimization techniques for the MATLAB, sequential, OpenMP, and CUDA-based implementations. The results show that the GPU-based implementation achieves 32x and 5x speed-ups over the baseline and multicore implementations, respectively. Moreover, using inter-block sub-sampling on an 8-bit 4000×4000 gray-scale reference image reduces the execution time to 2.8 s with an error growth of less than 20% for the selected templates of size 96×96.
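As context for the abstract above, the NCC measure at the heart of template matching can be sketched in a few lines of Python. This is a generic, unoptimized illustration of the correlation kernel, not the paper's MATLAB/OpenMP/CUDA implementation; the function names and the exhaustive-search strategy are ours:

```python
import numpy as np

def ncc(window, template):
    """Normalized cross-correlation between an image window and a
    template of the same shape; returns a score in [-1, 1]."""
    w = window - window.mean()
    t = template - template.mean()
    denom = np.sqrt((w * w).sum() * (t * t).sum())
    return float((w * t).sum() / denom) if denom else 0.0

def match(image, template):
    """Exhaustive search: slide the template over the image and
    return the top-left corner of the best-scoring window."""
    th, tw = template.shape
    best, best_pos = -2.0, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            score = ncc(image[r:r + th, c:c + tw], template)
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos, best
```

The doubly nested sliding-window loop is what makes the kernel computationally expensive and, at the same time, embarrassingly parallel: each window score is independent, which is what the OpenMP and CUDA mappings exploit.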
This work presents a high-speed implementation of the Pickup and Delivery Problem with Time Windows (PDPTW) on a GPU cluster. The problem represents a major class of logistics problems. The software is tested on an 8-node GPU cluster equipped with two Tesla M2050 (448-core) cards per node. The results show a speedup of nearly 7x for a small problem and 43x using 4 nodes for a large problem. The presentation discusses several factors that affect the performance.
ISBN:
(print) 0780379411
To speed up cloth simulation while achieving realistic simulation results, a local adaptive Catmull-Clark subdivision is adopted in this paper. During the growing phase of cloth simulation, new particles are added to the system when the cloth grids collide with rigid objects. The cloth grids are merged later when coarse grids are sufficient. A mass-spring model is developed for rectangular grids. Collision detection and response methods for local adaptive subdivision are also described. Exploiting the cache-control instruction set and SIMD instructions allows nearly 100% FPU utilization. The algorithm supports trade-offs between simulation time and realism.
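For readers unfamiliar with the mass-spring model mentioned above, a minimal explicit-Euler time step can be sketched as follows. This is a generic textbook formulation, not the paper's SIMD-optimized rectangular-grid implementation; all names and constants are illustrative:

```python
import numpy as np

def step(pos, vel, springs, rest, k, mass, dt, damping=0.98):
    """One explicit-Euler step of a mass-spring system.
    pos, vel : (N, 3) particle positions and velocities
    springs  : list of (i, j) particle-index pairs
    rest     : rest length of each spring
    k, mass  : spring stiffness and per-particle mass
    """
    force = np.zeros_like(pos)
    for (i, j), r0 in zip(springs, rest):
        d = pos[j] - pos[i]
        length = np.linalg.norm(d)
        if length > 0:
            # Hooke's law: pull the endpoints toward the rest length
            f = k * (length - r0) * (d / length)
            force[i] += f
            force[j] -= f
    vel = (vel + force / mass * dt) * damping  # damped velocity update
    return pos + vel * dt, vel
```

In an adaptive scheme like the one described, subdividing a grid adds particles and springs to these arrays, and merging removes them again; the per-spring loop is also the part that vectorizes well with SIMD.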
ISBN:
(print) 0780374886
Multiply and multiply-accumulate (MAC) instructions (see ARM DDI 0100E, ARM Architecture Reference Manual) are fundamental instructions in DSP applications. In an embedded digital signal processing (DSP) core and a high-performance enhanced-DSP-instruction processor core, an efficient implementation of multiply and MAC instructions is very important. An algorithm for the VLSI implementation of 32×32 multiply and MAC instructions using a 32×8 multiplier-accumulator is presented. The 32×32 multiplication is achieved by four 32×8 multiplications: the result of one 32×8 multiplication serves as a partial product of the next 32×8 operation, and when the results of four such multiplications have been accumulated, the 32×32 result is obtained. Only the 32×8 multiplication is implemented in hardware, as a Booth multiplier. This implementation of the multiply and MAC instructions offers a good trade-off between a serial multiplier and a parallel multiplier.
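The byte-sliced decomposition described above can be modeled behaviorally in software. The following Python sketch mirrors the datapath (four 32×8 partial products, each shifted and accumulated); it is a functional model only, not the hardware Booth multiplier:

```python
MASK32 = (1 << 32) - 1

def mul32x8(a, b8):
    """Behavioral model of the hardware 32x8 multiplier:
    a is a 32-bit unsigned operand, b8 an 8-bit multiplier slice."""
    return (a & MASK32) * (b8 & 0xFF)

def mul32x32(a, b):
    """32x32 unsigned multiplication built from four 32x8
    multiplications. Each partial product is shifted by 8*i bits
    and accumulated, so each step's accumulator result feeds the
    next, as in the described MAC datapath."""
    acc = 0
    for i in range(4):  # consume the multiplier one byte at a time
        byte = (b >> (8 * i)) & 0xFF
        acc += mul32x8(a, byte) << (8 * i)
    return acc
```

The model makes the area/latency trade-off concrete: one small 32×8 multiplier is reused over four cycles instead of instantiating a full 32×32 parallel array.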
Multiple-input multiple-output (MIMO) serves as a key technique for modern wireless communication systems but brings big challenges in data detection. Belief propagation (BP) detection is attractive for its near-optimal error rate performance and strong robustness. However, state-of-the-art BP detection algorithms suffer from either exponentially increasing computational complexity or sub-optimal error rate performance. To address this issue, this paper proposes an efficient variant of the real-domain GAI BP (RD-GAI-BP), called the improved RD-GAI-BP, which achieves better error rate performance at the expense of an acceptable complexity increase. Numerical results demonstrate that, in a MIMO scenario with N_r = 16, N_t = 8 and 64-QAM modulation, the proposed improved RD-GAI-BP earns more than 2 dB of SNR gain at a BER of 10^-3 compared with state-of-the-art RD-GAI-BP detection.
ISBN:
(digital) 9781665406949
ISBN:
(print) 9781665406956
To reduce the processing delay of sequentially running virtual network functions (VNFs) in a service function chain (SFC), network function parallelism (NFP) has been introduced, allowing VNFs of the SFC to run in parallel. Existing NFP solutions focus on improving parallelism benefits without paying much attention to resource utilization when deploying the VNFs of SFCs. We take advantage of the resource-delay dependency to propose a flexible and efficient parallelized SFC placement mechanism called FlexSFC, which determines the optimal SFC placement while reducing resource usage and meeting the end-to-end delay guarantees of the deployed SFCs. Initial results show that FlexSFC meets the end-to-end delay requirement with better resource utilization and SFC acceptance rate than state-of-the-art approaches.
In a real-time application, a transaction may be assigned a value reflecting the profit of completing it before its deadline. Satisfying both goals of maximizing the total obtained profit and minimizing the number of missed transactions at the same time is a challenge. The authors present an adaptive real-time scheduling policy named value-based processor allocation (VPA-k) for scheduling value-based transactions in a multiprocessor real-time database system. Under the VPA-k policy, transactions with higher values are given higher priorities to execute first, while at most k percent of the processors are dynamically allocated to execute urgent transactions. Through simulation experiments, VPA-k is shown to substantially outperform other scheduling policies in both maximizing the total obtained profit and minimizing the number of missed transactions under various system environments.
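The allocation idea behind VPA-k can be illustrated with a toy selection function. This is only a sketch of the split described in the abstract (a k% urgent quota plus value-ordered allocation of the rest); the paper's actual admission and urgency criteria may differ, and all field names here are hypothetical:

```python
def assign(transactions, n_procs, k):
    """Toy VPA-k-style allocation: at most k% of the processors go
    to the most urgent transactions (earliest deadlines); remaining
    processors go to the highest-value transactions."""
    urgent_quota = int(n_procs * k / 100)
    by_deadline = sorted(transactions, key=lambda t: t["deadline"])
    urgent = by_deadline[:urgent_quota]
    remaining = [t for t in transactions if t not in urgent]
    by_value = sorted(remaining, key=lambda t: -t["value"])
    return urgent + by_value[:n_procs - len(urgent)]
```

The quota caps how much capacity urgency can claim, so high-value transactions are never fully starved by a burst of low-value urgent ones, which is how the policy pursues both goals at once.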
Combining existing MIMO schemes with multiuser detectors for the uplink suffers from high computational complexity and channel ***. Therefore, in this paper we propose a MIMO multiuser detection (MUD) scheme that considerably reduces the system's computational complexity. The proposed algorithm adopts the inverse channel matrix for MIMO decoding, which is not sensitive to the coherency of ***. Because of the scattering characteristic of the MIMO channel, the inverse channel matrices are always nonsingular, so the receivers can obtain a stable spatial diversity gain. The MUD algorithm can be realized with a parallel modular *** based on a Minimum Mean Square Error (MMSE) ***. Results show that our MIMO-MUD performs much better than existing MIMO-MUD at the same order of complexity, even though the MIMO CDMA system has only two antennas at each base station and two antennas at each mobile station.
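The MMSE detection step the abstract refers to is commonly written as x̂ = (HᴴH + σ²I)⁻¹Hᴴy, where the σ²I regularization keeps the matrix to invert well conditioned. A minimal NumPy sketch of this textbook form (not the authors' parallel modular receiver) is:

```python
import numpy as np

def mmse_detect(H, y, noise_var):
    """Linear MMSE MIMO detection: solve
    (H^H H + sigma^2 I) x_hat = H^H y.
    The noise_var * I term makes the system nonsingular even for
    poorly conditioned channel matrices H."""
    n_tx = H.shape[1]
    G = H.conj().T @ H + noise_var * np.eye(n_tx)
    return np.linalg.solve(G, H.conj().T @ y)
```

As the noise variance goes to zero this reduces to the zero-forcing (channel-inversion) detector, which connects it to the inverse-channel-matrix decoding described above.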
ISBN:
(digital) 9798350371864
ISBN:
(print) 9798350371871
The article proposes a solution to restructure Convolutional Neural Network (CNN) architectures by integrating parameter quantization techniques with traditional CNN models that can be deployed on Field-Programmable Gate Array (FPGA) hardware, for evaluating the classification performance of two-dimensional data in real-world scenarios. The solution introduces an additional quantum layer before the final classification layer of the CNN to receive standardized outputs, followed by computations in the Hilbert vector space to generate probability values for assessing classification results. The quantization process helps the model quickly identify data features while exploiting the parallel computing capability of the FPGA hardware. The model is evaluated on the MNIST handwritten-digit dataset, revealing two advantages: processing time on the FPGA is four times faster than using only the Central Processing Unit (CPU) of the PYNQ-Z2 kit board, and classification accuracy is higher with the quantum layer than without it for the same number of training iterations. These results demonstrate the feasibility of hardware-accelerated AI algorithms combined with quantum algorithms in real applications.
ISBN:
(print) 9781479987962
Digital signal processors (DSPs) play an important role in signal processing, wireless communication, and many other fields. With the improvement of DSP computing performance, the memory architecture has become the bottleneck of overall DSP efficiency. A new memory architecture that can be accessed in parallel by two computation slots, the DMA controller, the debug module, and the Wishbone bus is presented in this paper. The data memory capacity is 1 MB and the instruction memory capacity is 256 KB. After synthesis, placement, and routing in a commercial 65 nm low-power process, the area of the data memory is about 8,600,600 μm² and the area of the instruction memory is about 2,140,000 μm². The delay of the data memory is 1.65 ns (SS corner), and the delay of the instruction memory is 1.78 ns (SS corner).