The Viterbi algorithm is an optimal convolutional decoding algorithm with superpolynomial time complexity. The basic principles of the Viterbi algorithm are shown in Figures 1, 2, and 3. To improve the algorithm's throughput, one has to apply parallelism. This can be done at different levels, e.g., the bit, word, or algorithm level. The paper discusses various approaches to the parallelisation of the decoding algorithm, some implemented in VLSI processing elements and others implemented by multiprocessor systems with general-purpose processors. A Viterbi decoder basically consists of the three main functional blocks shown in Figure 4. The Branch Metrics (BM) block calculates all branch weights in each time step. The Add-Compare-Select (ACS) unit calculates sums of weights and selects optimal survivor paths. The Survivor Memory (SM) analyses partial results from BM and ACS and outputs decoded data with a time delay D. Note that a data-dependent loop is present in the ACS unit, which limits the speed of the decoding procedure because the current branch weight has to be added to the accumulated weight of the survivor path at each time step. The performance of the Viterbi decoder can be improved at the bit level by breaking the data-dependent loop using carry-save addition and pipelining (see Figure 5). Further, several ACS units can be used in parallel at the word level. Finally, several independent decoders may work on different blocks of input data. After the decoding procedure, the final result is obtained by multiplexing the decoded segments. These principles can be implemented either in VLSI components connected in a ring topology or by several independent general-purpose or DSP processors (see Figures 6 and 7). The theoretical speedup attainable by parallel processing is estimated to be S = pN M / E, where pN represents the number of processors, E the length of the decoded block and M ≤ E the length of the uniquely decoded data in a block. Considering the performance of contemporary processing e
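The ACS step and the speedup estimate described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the trellis representation, and the choice of minimising the accumulated weight (as with Hamming-distance metrics) are assumptions made here.

```python
def acs_step(path_metrics, branch_metrics, predecessors):
    """One add-compare-select step over all trellis states.

    path_metrics[p]        accumulated weight of the survivor ending in state p
    branch_metrics[(p, s)] weight of the branch p -> s at this time step
    predecessors[s]        states that have a branch into state s
    Assumes smaller weights are better (hypothetical convention).
    """
    new_metrics = []
    decisions = []
    for s, preds in enumerate(predecessors):
        # Add: extend every incoming survivor by its branch weight.
        candidates = [(path_metrics[p] + branch_metrics[(p, s)], p) for p in preds]
        # Compare and select: keep the best candidate as the new survivor.
        best_metric, best_pred = min(candidates)
        new_metrics.append(best_metric)
        decisions.append(best_pred)
    return new_metrics, decisions


def theoretical_speedup(num_processors, block_len, unique_len):
    """S = pN * M / E from the abstract, with M <= E."""
    return num_processors * unique_len / block_len
```

The new metrics depend on the metrics of the previous step, which is the data-dependent loop the abstract refers to. The update for each state is independent of the other states within one step, which is what allows several ACS units to run in parallel.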
Task scheduling determines the performance of NOW computing to a large extent. However, the computer system architecture, computing capability and system load are rarely considered together. In this paper, a biggest heterogeneous scheduling algorithm is presented. It fully considers the system characteristics (from the application's point of view), structure and state, so it can always utilize all processing resources under a reasonable premise. Experimental results show that the algorithm can significantly shorten the response time of jobs.
A connected dominating set (CDS) has been proposed as a virtual backbone or spine of wireless ad hoc networks. Three distributed approximation algorithms have been proposed in the literature for the minimum CDS. We first rein...
Algorithms for 2D wavelet transform decomposition on clusters of workstations are described and analyzed. For the parallel algorithm employed, the computation of the transform is structured so that the exchange of intermediate transform coefficients is restricted to neighboring processors and the amount of data communicated is independent of the problem size. Results show that the performance of the parallel implementation improves with increasing data size, making the parallel algorithm particularly suitable for applications such as image processing, image coding and computer vision. Timings measured on a Myrinet-connected Beowulf cluster agree well with the theoretical analysis and indicate that the implementation is cost-optimal.
This work compares the results of a parallel FDTD algorithm with computations obtained using the Quick Wave programme published by QWED. The authors developed a parallel implementation of the standard FDTD algorithm based on the MPI communication library. The parallel algorithm was examined on a heterogeneous PC cluster.
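As an illustration of how a standard FDTD update can be split across processes, the sketch below advances a 1D grid both serially and as subdomains that exchange one boundary value per field per step. It is a simplified assumption, not the authors' code: the grid is 1D, the units are normalised with a hypothetical coefficient c, and plain Python lists stand in for the MPI messages.

```python
def step_serial(ez, hy, c=0.5):
    """Advance a 1D FDTD grid by one time step, in place."""
    n = len(ez)
    for k in range(1, n):        # E update; ez[0] is a fixed boundary value
        ez[k] += c * (hy[k - 1] - hy[k])
    for k in range(n - 1):       # H update; hy[n-1] is a fixed boundary value
        hy[k] += c * (ez[k] - ez[k + 1])


def step_decomposed(ez_parts, hy_parts, c=0.5):
    """Same update on a list of subdomains, one per (simulated) process."""
    p = len(ez_parts)
    # "Message" 1: each subdomain receives the last H value of its left neighbour.
    left_hy = [None] + [hy_parts[i - 1][-1] for i in range(1, p)]
    for i in range(p):
        ez, hy = ez_parts[i], hy_parts[i]
        for k in range(len(ez)):
            if k == 0 and left_hy[i] is None:
                continue         # global left boundary, never updated
            prev_h = left_hy[i] if k == 0 else hy[k - 1]
            ez[k] += c * (prev_h - hy[k])
    # "Message" 2: each subdomain receives the first updated E value of its right neighbour.
    right_ez = [ez_parts[i + 1][0] for i in range(p - 1)] + [None]
    for i in range(p):
        ez, hy = ez_parts[i], hy_parts[i]
        last = len(hy) - 1
        for k in range(len(hy)):
            if k == last and right_ez[i] is None:
                continue         # global right boundary, never updated
            next_e = right_ez[i] if k == last else ez[k + 1]
            hy[k] += c * (ez[k] - next_e)
```

Because each subdomain needs only one value from each neighbour per half step, the communication volume in this sketch does not grow with the subdomain size. In an MPI program the two list comprehensions would be replaced by point-to-point sends and receives between neighbouring ranks.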
The active integrated antenna array presented by S. Nogi et al. (see IEEE Microwave Theory Tech., vol.41, p.1827-37, 1993) is simulated using a parallel FDTD algorithm to reduce the computational requirement. A 4-unit array with dual modes and dual frequencies is found in simulation.
In this paper, an improved version of the BiCGStab method for the solution of large and sparse linear systems of equations with unsymmetric coefficient matrices is proposed. The method combines elements of numerical stability and parallel algorithm design without increasing the computational costs. The algorithm is derived such that all inner products of a single iteration step are independent and the communication time required for the inner products can be overlapped efficiently with the computation time of the vector updates. Therefore, the cost of global communication can be significantly reduced. The bulk synchronous parallel (BSP) model is used to design a fully efficient, scalable and portable parallel version of the proposed algorithm and to provide accurate performance prediction of the algorithm for a wide range of architectures, including the Cray T3D, the Parsytec, and a cluster of workstations connected by an Ethernet. This performance model gives useful insight into the time complexity of the method using only a few system-dependent parameters, based on simple and accurate cost modelling. The theoretical performance predictions are compared with some preliminary measured timing results of a numerical application from ocean flow simulation.
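For orientation, the standard BSP cost model charges each superstep w + h·g + l, where w is the maximum local computation, h the maximum number of words any processor sends or receives, g the per-word communication cost and l the synchronisation cost. The sketch below evaluates that generic formula. The abstract does not give the paper's own cost expressions, so the function name and every value passed to it are hypothetical.

```python
def bsp_cost(supersteps, g, l):
    """Total predicted cost of a BSP program.

    supersteps: iterable of (w, h) pairs, one per superstep
                w = max local work, h = max words sent/received by a processor
    g, l:       machine-dependent communication and synchronisation parameters
    """
    return sum(w + h * g + l for w, h in supersteps)
```

Only g and l change between machines in this model, which matches the abstract's remark that the prediction needs just a few system-dependent parameters. Grouping independent inner products into one superstep pays l once instead of once per inner product.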
In recent years we introduced a continuity operator, the "Superindividual", that allows for the inclusion of knowledge in the evolution of the genetic algorithm. Since we deal with very complex optimization problems, we developed a parallel genetic algorithm, with the Superindividual operator. The paper presents this parallel algorithm, which improves on the results of the conventional genetic algorithm. Two different models of parallel genetic algorithms are compared. The results are very encouraging.
A new parallel algorithm that solves a dynamic programming paradigm is proposed. It has a time complexity of O(n) and uses (n-1)n/2 processors. An MPI implementation is used to test the algorithm.
The paper introduces the application of the FCCN (fully connected cubic network) topology in massively parallel processing systems. Because of its simple self-routing algorithm, small diameter, small average internode distance, and fault tolerance, FCCN can act as a high-performance interconnection network in massively parallel processing systems. Moreover, the hypercube can be embedded naturally in FCCN, so FCCN can run all parallel algorithms developed for the hypercube easily and efficiently. A broadcasting algorithm for FCCN is also proposed.