The PIPO block cipher, a domestic lightweight block cipher, was announced at ICISC'20. PIPO's S-layer is implemented with the bitslicing technique. Because this part can be computed in parallel, we implemented the PIPO block cipher efficiently in parallel using AVX2 instructions, and we provide implementations for ECB and CTR modes. Compared to the existing PIPO implementation, we achieved a 7.345x performance improvement. In addition, we applied the AVX2-PIPO implementation to the round function of format-preserving encryption. When repeatedly encrypting 128-byte plaintexts, we achieved performance similar to that of the existing FF1-AES implementation. The FF1-AVX2-PIPO implementation successfully encrypted a database and enabled efficient database management in terms of memory space and speed. Finally, AVX2-PIPO-CTR and FF1-AVX2-PIPO were applied to image processing. In CTR mode, the encryption performance was better than in ECB mode. Partial encryption combining object detection with FF1-AVX2-PIPO was performed successfully, and it is expected to improve privacy protection in CCTV and image processing.
We describe how an aggregation/disaggregation method for finding quasi-stationary distributions of continuous-time Markov chains can be implemented on a massively parallel computer. The method is similar to algebraic multigrid, using restriction operators that depend on the current iterate of the solution and Jacobi smoothers at each level of the multigrid. The method is illustrated using a simple epidemic model, and its performance is compared to that of a sequential implementation as the size of the population increases. (C) 1997 Elsevier Science B.V.
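As a rough illustration of why Jacobi smoothers suit a massively parallel machine, the sketch below shows a plain Jacobi sweep for a linear system Ax = b. It is not the paper's aggregation/disaggregation method: the function names, the linear-system setting, and the small test matrix are all assumptions chosen for illustration. The point is only that each component of the new iterate depends on the previous iterate alone, so all components could be updated at the same time.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep for A x = b (illustrative; names are hypothetical).

    Every new component is computed from the OLD iterate x only, so the
    n updates are independent and could run on n processors at once.
    """
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def jacobi_solve(A, b, sweeps=100):
    """Repeat the sweep a fixed number of times, starting from zero."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        x = jacobi_sweep(A, b, x)
    return x


# Small diagonally dominant system, for which Jacobi converges.
A = [[4.0, 1.0], [2.0, 3.0]]
b = [1.0, 2.0]
x = jacobi_solve(A, b)
```

In a multigrid setting a few such sweeps would be applied per level rather than iterating to convergence; the fixed count of 100 here is only for the toy example.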
Pedestrian detection from visual images, which is used for driver assistance and video surveillance, is a recent and challenging problem. Co-occurrence histograms of oriented gradients (CoHOG) is a powerful feature descriptor for pedestrian detection and achieves the highest detection accuracy. However, its computational cost is too large for real-time calculation on state-of-the-art processors. In this paper, to obtain an optimal parallel implementation for an NVIDIA GPU, several kinds of parallelism in CoHOG-based detection are presented and evaluated for their suitability for implementation. The experimental results show that the optimized parallel implementation performs the detection process at 16.5 fps on QVGA images on an NVIDIA Tesla C1060. Our evaluation shows that the optimal parallelization strategy for an NVIDIA GPU differs from that for an FPGA. We discuss the reason and show the advantages of each device. To show the scalability and portability of the GPU implementation, the same object code is executed on other NVIDIA GPUs. The experimental results show that a GTX570 can perform CoHOG-based pedestrian detection at 21.3 fps on QVGA images.
In this paper, a parallel implementation of the Kalman filter is proposed to speed up computation using concurrent calculation techniques and factorisation methods, which help avoid numerical instability problems. The algorithm has been implemented on a measuring system based on a transputer network and a data acquisition board, and applied to measurements on asynchronous motors. Some experimental results obtained with the proposed system are also shown, and its performance is reported.
The K-means algorithm is widely used to find correlations between data in different application domains. However, given the massive amount of stored data, known as Big Data, the need for high-speed processing to analyze data has become even more critical, especially for real-time applications. One solution adopted to increase processing speed is the use of parallel implementations on FPGA, which have proved more efficient than sequential systems. Hence, this paper proposes a fully parallel implementation of the K-means algorithm on FPGA to optimize the system's processing time, thus enabling real-time applications. Unlike most implementations proposed in the literature, even parallel ones, this proposal has no sequential steps, which are a limiting factor for processing speed. Results for processing time (or throughput) and FPGA area occupancy (or hardware resources) were analyzed for different parameters, reaching throughputs higher than 53 million data points processed per second. Comparisons to the state of the art are also presented, showing speedups of more than over a partially serial implementation.
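For readers unfamiliar with the algorithm, the following is a minimal software sketch of one K-means (Lloyd) iteration. It is not the FPGA design from the paper; the function name and the sample data are hypothetical. It is included only to show where the parallelism lies: each point's distances to the centroids are independent of every other point's, and each centroid update depends only on its own members.

```python
def kmeans_step(points, centroids):
    """One Lloyd iteration: assign points, then recompute centroids.

    Illustrative sketch only (hypothetical name). Both loops below are
    written sequentially, but their iterations are mutually independent.
    """
    # Assignment step: nearest centroid by squared Euclidean distance.
    labels = []
    for p in points:
        dists = [sum((a - c) ** 2 for a, c in zip(p, cen)) for cen in centroids]
        labels.append(dists.index(min(dists)))

    # Update step: each centroid becomes the mean of its assigned points.
    new_centroids = []
    for k, cen in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
        else:
            # Keep an empty cluster's centroid unchanged.
            new_centroids.append(tuple(cen))
    return labels, new_centroids


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels, centroids = kmeans_step(points, [(1, 1), (9, 1)])
```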
The paper presents a three-dimensional cellular automaton model of the electrochemical oxidation of carbon. A sample of the electro-conductive carbon black "Ketjen-black EC-600JD", consisting of carbon granules, is simulated. The electrochemical oxidation of the carbon granules occurs through a few successive stages. A parallel implementation of the three-dimensional cellular automaton model of carbon corrosion is developed. The efficiency and speedup of the parallel code are analyzed. The fractions of surface carbon atoms and of atoms with different degrees of oxidation are computed by the parallel code. Based on the obtained atom fractions, the electrochemical capacity is calculated. The results of the computer simulation are compared with experimental data.
Author: Zarowski, C. J., Dept. of Electr. Eng., Queen's Univ., Kingston, Ont., Canada
The Berlekamp-Massey algorithm (BMA) is important in the decoding of Reed-Solomon (RS) and, more generally, Bose-Chaudhuri-Hocquenghem (BCH) block error-control codes. For a t-error-correcting code, the BMA has time complexity O(t^2) when implemented on a sequential computer. However, the BMA does not run efficiently on a parallel computer. The BMA can be mapped into the Schur BMA. This paper presents the implementation of the BMA and Schur BMA together on a linearly connected array of 2t processors. The resulting machine computes the error-locator polynomial with a time complexity of O(t).
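To make the sequential O(t^2) cost concrete, here is a minimal sketch of the textbook Berlekamp-Massey algorithm. As a simplifying assumption it works over GF(2) (binary sequences) rather than the GF(2^m) arithmetic an RS decoder would need, and it is the ordinary sequential BMA, not the Schur BMA or the processor-array mapping from the paper. The function name is hypothetical. The quadratic cost comes from the outer loop over the input symbols combined with the inner discrepancy and update loops.

```python
def berlekamp_massey_gf2(s):
    """Shortest LFSR for a binary sequence s (sequential textbook BMA).

    Returns (L, c), where c[0..L] are the connection-polynomial
    coefficients with c[0] == 1, so that for every i >= L:
        s[i] == c[1]&s[i-1] ^ c[2]&s[i-2] ^ ... ^ c[L]&s[i-L]
    Simplification: arithmetic is over GF(2), not GF(2^m).
    """
    n = len(s)
    c = [0] * (n + 1)  # current connection polynomial
    b = [0] * (n + 1)  # copy saved at the last length change
    c[0] = b[0] = 1
    L = 0              # current LFSR length
    m = -1             # index of the last length change

    for i in range(n):
        # Discrepancy between s[i] and the current LFSR's prediction.
        d = s[i]
        for j in range(1, L + 1):
            d ^= c[j] & s[i - j]
        if d:
            t = c[:]
            shift = i - m
            # c(x) <- c(x) + x^shift * b(x)
            for j in range(n + 1 - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L = i + 1 - L
                m = i
                b = t
    return L, c[: L + 1]


# Sequence generated by s[i] = s[i-1] ^ s[i-3] from the seed 1, 0, 0.
result = berlekamp_massey_gf2([1, 0, 0, 1, 1, 1, 0, 1, 0, 0])
```

For the example sequence the recovered register has length 3 with coefficients [1, 1, 0, 1], matching the recurrence used to generate it.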
Parallel implementations of the extended square-root covariance filter (ESRCF) for tracking applications are developed in this paper. The decoupling technique and special properties of the tracking Kalman filter (KF) are exploited to reduce computational requirements and to increase parallelism. Applying the decoupling technique to the ESRCF results in the time and measurement updates of m decoupled (n/m)-dimensional matrices instead of 1 coupled n-dimensional matrix, where m denotes the tracking dimension and n denotes the number of state elements. The updates of the m decoupled matrices are found to require approximately m times fewer processing elements and clock cycles than the updates of the 1 coupled matrix. The transformation of the Kalman gain that accounts for the decoupling technique is found to be straightforward to implement. The sparse nature of the measurement matrix and the sparse, banded nature of the transition matrix are exploited to simplify matrix multiplications.
The paper presents a parallel implementation of a 3D convex-hull algorithm on a Meiko Computing Surface using OCCAM and C. The parallel program is adapted from a serial divide-and-conquer version; an outline of the serial version is also given. Details of the practical problems involved in parallelizing such a geometric algorithm are reported. The performance of the parallel program is monitored for several different network sizes and compared with the performance of the serial version running on a Sun workstation. Experimental results are presented, and suggestions for further development of the implementation are discussed.
For computing the weights of deep neural networks (DNNs), the backpropagation (BP) method has been widely used as the de facto standard algorithm. Since the BP method is based on stochastic gradient descent using derivatives of objective functions, it has some difficulty finding appropriate parameters such as the learning rate. As another approach for computing weight matrices, we recently proposed an alternating optimization method using linear and nonlinear semi-nonnegative matrix factorizations (semi-NMFs). In this paper, we propose a parallel implementation of the nonlinear semi-NMF based method. The experimental results show that our nonlinear semi-NMF based method and its parallel implementation have competitive advantages over conventional DNNs trained with the BP method.