A quantum image histogram, produced as a preprocessing result in quantum image processing, contains the gray-level information of the image and plays an important role in subsequent image processing. To the best of our knowledge, there are only a few results on quantum image histograms, and quantum video histograms have not yet been studied. This paper therefore proposes a novel histogram statistics algorithm for quantum video based on the idea of parallel computing. To this end, a quantum version of the carry-lookahead full adder is first devised, and on top of it an entirely new hierarchical quantum adder for superposition states is constructed, which not only reduces the delays caused by the rippling carries of a classical adder but also lowers the complexity from $O(2^{m}\times n)$ to $O(m^{2})$. Subsequently, to enable parallel statistics over quantum video, an algorithm and circuit implementation for image stitching are also given. Finally, by combining the image-stitching results with Grover's search algorithm, the quantum video histogram statistics are ultimately realized in parallel.
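The quantum circuit itself is not reproduced here, but the carry-lookahead idea the adder builds on can be sketched classically: each bit position computes a generate bit (both inputs 1) and a propagate bit (exactly one input 1), so every carry is a function of generate/propagate signals rather than a ripple through all lower positions. A minimal sketch, with the function name our own:

```python
def carry_lookahead_add(a, b, m):
    """Add two m-bit integers using generate/propagate carry logic."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(m)]   # generate: a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(m)]   # propagate: a_i XOR b_i
    carry = [0] * (m + 1)
    for i in range(m):
        # c_{i+1} = g_i OR (p_i AND c_i); expanding this recurrence expresses
        # every carry directly in terms of g/p, which is what lets hardware
        # (and the quantum analogue) compute carries without a ripple chain
        carry[i + 1] = g[i] | (p[i] & carry[i])
    s = 0
    for i in range(m):
        s |= (p[i] ^ carry[i]) << i   # sum bit: p_i XOR c_i
    s |= carry[m] << m                # final carry-out
    return s
```

The loop above is written serially for clarity; the point of carry-lookahead is that the expanded carry expressions can all be evaluated in parallel.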
Deep neural networks (DNNs) have been widely used for learning various wireless communication policies. While DNNs have demonstrated the ability to reduce the time complexity of inference, their training often incurs a high computational cost. Since practical wireless systems require retraining due to operating in open and dynamic environments, it is crucial to analyze the factors affecting the training complexity, which can guide DNN architecture selection and hyper-parameter tuning for efficient policy learning. As a metric of time complexity, the number of floating-point operations (FLOPs) for inference has been analyzed in the literature. However, the time complexity of training DNNs for learning wireless communication policies has only been evaluated in terms of runtime. In this paper, we introduce the number of serial FLOPs (se-FLOPs) as a new metric of time complexity that accounts for the capability of parallel computing. The se-FLOPs metric is consistent with actual runtime, making it suitable for measuring the time complexity of training DNNs. Since graph neural networks (GNNs) can learn a multitude of wireless communication policies efficiently and their architectures depend on specific policies, no universal GNN architecture is available for analyzing complexities across different policies. Thus, we first use precoder learning as an example to demonstrate the derivation of the numbers of se-FLOPs required to train several DNNs. Then, we compare the results with the se-FLOPs for inference of the DNNs and for executing a popular numerical algorithm, and provide the scaling laws of these complexities with respect to the numbers of antennas and users. Finally, we extend the analyses to the learning of general wireless communication policies. We use simulations to validate the analyses and compare the time complexity of each DNN trained for achieving the best learning performance and achieving an expected performance.
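The distinction between total FLOPs and serial FLOPs can be illustrated with a toy model of our own (not the paper's derivation): for a fully connected layer, all multiplications can run simultaneously, and the additive reduction over each neuron's inputs takes a logarithmic number of parallel steps, so the critical path is far shorter than the total operation count.

```python
import math

def dense_flops(layers):
    """Total FLOPs for a forward pass through fully connected layers.
    `layers` is a list of (n_in, n_out) pairs; each output neuron needs
    n_in multiplications and n_in - 1 additions."""
    return sum(n_out * (2 * n_in - 1) for n_in, n_out in layers)

def dense_se_flops(layers):
    """Serial FLOPs (critical-path length) under idealized parallelism:
    all multiplications in a layer run at once (1 step), and the
    n_in-way additive reduction takes ceil(log2(n_in)) steps."""
    return sum(1 + math.ceil(math.log2(n_in)) for n_in, n_out in layers)
```

For a single 4-input, 2-output layer this gives 14 total FLOPs but only 3 serial steps, which is the kind of gap the se-FLOPs metric is designed to capture.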
Energy efficiency has become a major focus in optimizing hardware resource usage for Cloud servers. One approach widely employed to enhance the execution of parallel applications is thread-level parallelism (TLP) exploitation. This technique leverages multiple threads to improve computational efficiency and performance. However, the increasing heterogeneity of resources in cloud environments and the complexity of selecting the optimal configuration for each application pose a significant challenge to cloud users, owing to the massive number of possible configurations and the need to effectively harness TLP across diverse hardware setups to achieve optimal energy efficiency and performance. To address this challenge, we propose TLP-Allocator, an artificial neural network (ANN) optimization strategy that uses hardware and software metrics to build and train an ANN model. It predicts worker node and thread count combinations that provide optimal energy-delay product (EDP) results. In experiments using ten well-known applications on a private cloud with heterogeneous resources, we show that TLP-Allocator predicts combinations that yield EDP values close to the best achieved by an exhaustive search. It also improves the overall EDP by 38.2% compared to state-of-the-art workload scheduling in cloud environments.
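The selection objective is straightforward to state: among candidate (node, thread-count) configurations, pick the one minimizing EDP = energy × runtime. A minimal sketch with hypothetical measurements (the field names and values are our own, not from the paper):

```python
def best_edp(configs):
    """Pick the configuration minimizing the energy-delay product
    EDP = energy (J) * runtime (s). Lower is better: EDP penalizes
    configurations that save energy only by running much longer."""
    return min(configs, key=lambda c: c["energy_j"] * c["runtime_s"])

# Hypothetical profiling results for one application:
configs = [
    {"node": "A", "threads": 8,  "energy_j": 120.0, "runtime_s": 10.0},  # EDP 1200
    {"node": "A", "threads": 16, "energy_j": 150.0, "runtime_s": 6.0},   # EDP 900
    {"node": "B", "threads": 8,  "energy_j": 100.0, "runtime_s": 12.0},  # EDP 1200
]
```

Note that the 16-thread run wins despite drawing more energy, because its shorter runtime dominates the product; TLP-Allocator's contribution is predicting this winner without exhaustively profiling every combination.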
The discrete ordinates (SN) method with unstructured meshes is highly appropriate for high-fidelity modeling and simulation of radiation shielding problems with complicated geometries. However, the large number of unknowns resulting from discretization of the transport equation in the spatial, angular, and energy variables necessitates the use of parallel computing to achieve efficient solutions. In this work, a GPU-based SN transport sweep algorithm combined with a GPU-parallel multi-group Krylov subspace solver has been proposed and implemented in the STRAUM (SN Transport for Radiation Analysis with Unstructured Meshes) code. A group chunk decomposition method within the framework of the multi-group Krylov subspace solver has been applied to STRAUM to leverage multi-GPU parallel computing. For the Kobayashi-like and reactor pressure vessel problems, STRAUM typically runs faster by factors of 100~200 on a single NVIDIA GeForce RTX 4090 GPU and by factors of 70~120 on a single NVIDIA GeForce RTX 3080 Ti GPU than on a single AMD Ryzen 9 7900X CPU core. For the simulations on dual-GPU systems, the group chunk decomposition method achieves parallel computing efficiencies greater than 90% without degradation in convergence, except for cases using very coarse angular divisions. Besides, this method reduces per-GPU memory usage by more than 40% and enables STRAUM to effectively simulate problems with up to ten billion unknowns using two RTX 4090 GPUs.
ISBN (print): 9798350332865
This paper describes FFTpc, a novel algorithm for accurate and faster computation of the Fast Fourier Transform (FFT) using parallel computing. As opposed to the Cooley-Tukey FFT, the FFTpc uses only real-valued operations until the very last step. Filtering in parallel in the frequency domain is done on data subsets that are processed simultaneously, with no data interchange between processors through the main parts of the filtering process. In addition, if the user only requires the magnitude of the transform, the algorithm involves no complex-valued operations at all. Many other novel aspects of the FFTpc and both estimated and actual speedups are reported.
Parallel computing architectures are urgently needed to speed up the training process of artificial neural networks. This study proposes a novel approach to parallel computing using ion-modulated organic electrochemical transistors (OECTs). Thanks to the electrochemical doping and de-doping mechanism, the OECTs demonstrate long-term plasticity and exhibit distinguishable conductive states with high linearity. Moreover, our device array enables efficient weighted-sum and convolution operations for image feature extraction and performs effectively in simulating a hardware-based Faster R-CNN for object detection via transfer learning. The OECT array, with its separate read and write features and controllable conductive states, achieves the integration of forward inference and backward training, resulting in successful in-situ training of convolutional neural networks (CNNs). The CNNs based on OECTs achieve accuracies of 96.49% and 82.57% on the MNIST and Fashion-MNIST datasets, respectively, showcasing the potential of OECTs in edge computing for enhanced resource utilization and time efficiency.
The paper proposes a software and hardware model for organizing parallel computing on hybrid computing clusters, aimed at creating tools for converting parallel programs into a hybrid form. A method has been developed for organizing the execution of parallel programs on hybrid computing clusters using compatible fuzzy cognitive maps. The method exploits the advanced capabilities of the proposed hardware-software environment model and takes various hardware and software indicators into account, which reduces the share of exchange operations performed over slow network interfaces.
We develop an algorithm to calculate invariant distributions of large Markov chains whose state spaces are partitioned into "islands" and "ports". An island is a group of states (a cluster) with potentially many connections inside the island but a relatively small number of connections between islands. The states connecting different islands are called ports. Our algorithm is developed in the framework of the "state reduction approach", but the special structure of the state space allows the calculation of the invariant distribution to be done in parallel. Additional problems, such as computation of fundamental matrices and optimal stopping problems, are also analyzed for such Markov chains.
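A standard instance of the state-reduction approach is the Grassmann-Taksar-Heyman (GTH) algorithm, sketched below for a small dense chain; the paper's island/port decomposition and parallelization are not reproduced here, only the serial reduction idea it builds on.

```python
def gth_stationary(P):
    """Stationary distribution of an irreducible Markov chain via the
    GTH state-reduction algorithm: eliminate states one by one (censoring
    the chain on the remaining states), then back-substitute."""
    n = len(P)
    P = [row[:] for row in P]  # work on a copy
    # Reduction phase: eliminate states n-1, n-2, ..., 1.
    for k in range(n - 1, 0, -1):
        s = sum(P[k][j] for j in range(k))  # escape mass from state k
        for i in range(k):
            for j in range(k):
                # redirect transitions through the eliminated state k
                P[i][j] += P[i][k] * P[k][j] / s
    # Back-substitution phase: rows/columns >= k are untouched after
    # state k's elimination, so the needed quantities are still in P.
    pi = [0.0] * n
    pi[0] = 1.0
    for k in range(1, n):
        s = sum(P[k][j] for j in range(k))
        pi[k] = sum(pi[i] * P[i][k] for i in range(k)) / s
    total = sum(pi)
    return [x / total for x in pi]
```

GTH uses no subtractions, which makes it numerically stable; the island structure in the paper is what allows reductions inside different islands to proceed independently.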
Deep learning (DL) mainly uses various parallel computing libraries to optimize the speed of model training. The underlying computations of DL operators typically include essential functions such as reduction and prefix scan, the efficiency of which can be greatly improved using parallel acceleration devices. However, the acceleration of these computations is mainly supported by collective primitive libraries such as NVIDIA CUB and AMD hipCUB, which are only available on vendor-specific hardware accelerators due to the highly segregated computing ecosystems of different vendors. To address this issue, we propose an OpenCL parallel computing library called oclCUB that can run on different heterogeneous platforms. OclCUB abstracts the OpenCL execution environment, implements reusable common underlying computations of DL, and provides two types of interfaces targeting the operators' heterogeneous acceleration patterns, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on various hardware accelerators: NVIDIA Tesla V100s with OpenCL 1.2, AMD RADEON PRO V520 with OpenCL 2.0, MT-3000 with MOCL 3, and Kunpeng 920 with POCL 1.6. Our experiments show that the oclCUB-based operators achieve accurate computational results on various platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap with CUB and is comparable in performance to hipCUB.
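The prefix-scan primitive such libraries accelerate can be illustrated with the Hillis-Steele scheme; this is a generic sketch of the algorithmic pattern, not oclCUB's actual API.

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    """Hillis-Steele inclusive prefix scan: ceil(log2(n)) doubling steps,
    each combining element i with element i - offset. Written sequentially
    here; on a GPU, every iteration of the inner loop runs concurrently."""
    out = list(xs)
    offset = 1
    while offset < len(out):
        prev = out[:]  # double-buffer to mimic the synchronized parallel step
        for i in range(offset, len(out)):
            out[i] = op(prev[i - offset], prev[i])
        offset *= 2
    return out
```

Reduction is the same pattern keeping only the final element; wrapping such kernels behind one portable interface for OpenCL 1.2 through 2.0 devices is the portability problem oclCUB addresses.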
One key advantage of quantum algorithms rests on the assumption that arbitrary quantum states can be efficiently prepared. However, limited by the number of qubits and the decoherence time, existing preparation methods are ill-suited to preparing quantum states for high-dimensional data. In this paper, we propose a Trainable Parameterized Quantum Encoding (TPQE) method for realizing approximate encoding of arbitrary quantum states, and representation ability is proposed as the verification criterion of TPQE for arbitrary quantum states. To enhance the robustness of TPQE, we treat noise as part of the TPQE and absorb the noise into the parameters through training. Moreover, we utilize parallel computing across multiple quantum processors to achieve a speedup of TPQE. Finally, the representation ability of TPQE is validated on a publicly available breast cancer dataset, using amplitude encoding as a benchmark. Experiments demonstrate that our method shows good robustness.
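The amplitude-encoding benchmark the paper compares against maps a classical feature vector to the amplitudes of a quantum state. A minimal classical sketch of that mapping (the trainable TPQE circuit itself is not reproduced):

```python
import math

def amplitude_encode(x):
    """Map a real feature vector to state-vector amplitudes: pad to the
    next power of two (n features need ceil(log2(n)) qubits) and
    L2-normalize, so the squared amplitudes sum to 1 as |psi> requires."""
    n = 1
    while n < len(x):
        n *= 2
    padded = list(x) + [0.0] * (n - len(x))
    norm = math.sqrt(sum(v * v for v in padded))
    return [v / norm for v in padded]
```

Exactly loading these amplitudes generally needs circuits of depth exponential in the qubit count, which is the cost the trainable approximate encoding is meant to avoid.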