A quantum image histogram, produced as a preprocessing result in quantum image processing, contains the gray-level information of the image and plays an important role in subsequent image processing. To the best of our knowledge, there are only a few results on quantum image histograms, and quantum video histograms have not yet been studied. This paper therefore proposes a novel histogram statistics algorithm for quantum video based on the idea of parallel computing. To this end, a quantum version of the carry-lookahead full adder is first devised, and on top of it an entirely new hierarchical quantum adder for superposition states is constructed, which not only reduces the delays caused by the rippling carries of a classical adder but also lowers the complexity from $O(2^{m}\times n)$ to $O(m^{2})$. Subsequently, to enable parallel statistics over quantum video, an algorithm and circuit implementation for image stitching are also given. Finally, by combining the image-stitching results with Grover's search algorithm, the quantum video histogram statistics are ultimately realized in parallel.
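The quantum circuit itself is not reproduced here, but the carry-lookahead idea the adder builds on can be sketched classically: each bit position computes a generate bit (both inputs 1) and a propagate bit (exactly one input 1), so every carry is a function of generate/propagate signals rather than a ripple through all lower positions. A minimal sketch, with the function name our own:

```python
def carry_lookahead_add(a, b, m):
    """Add two m-bit integers using generate/propagate carry logic."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(m)]   # generate: a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(m)]   # propagate: a_i XOR b_i
    carry = [0] * (m + 1)
    for i in range(m):
        # c_{i+1} = g_i OR (p_i AND c_i); expanding this recurrence expresses
        # every carry directly in terms of g/p, which is what lets hardware
        # (and the quantum analogue) compute carries without a ripple chain
        carry[i + 1] = g[i] | (p[i] & carry[i])
    s = 0
    for i in range(m):
        s |= (p[i] ^ carry[i]) << i   # sum bit: p_i XOR c_i
    s |= carry[m] << m                # final carry-out
    return s
```

The loop above is written serially for clarity; the point of carry-lookahead is that the expanded carry expressions can all be evaluated in parallel.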
Deep neural networks (DNNs) have been widely used for learning various wireless communication policies. While DNNs have demonstrated the ability to reduce the time complexity of inference, their training often incurs a high computational cost. Since practical wireless systems require retraining due to operating in open and dynamic environments, it is crucial to analyze the factors affecting the training complexity, which can guide DNN architecture selection and hyper-parameter tuning for efficient policy learning. As a metric of time complexity, the number of floating-point operations (FLOPs) for inference has been analyzed in the literature. However, the time complexity of training DNNs for learning wireless communication policies has only been evaluated in terms of runtime. In this paper, we introduce the number of serial FLOPs (se-FLOPs) as a new metric of time complexity that accounts for the capability of parallel computing. The se-FLOPs metric is consistent with actual runtime, making it suitable for measuring the time complexity of training DNNs. Since graph neural networks (GNNs) can learn a multitude of wireless communication policies efficiently and their architectures depend on specific policies, no universal GNN architecture is available for analyzing complexities across different policies. Thus, we first use precoder learning as an example to demonstrate the derivation of the numbers of se-FLOPs required to train several DNNs. Then, we compare the results with the se-FLOPs for inference of the DNNs and for executing a popular numerical algorithm, and provide the scaling laws of these complexities with respect to the numbers of antennas and users. Finally, we extend the analyses to the learning of general wireless communication policies. We use simulations to validate the analyses and compare the time complexity of each DNN trained for achieving the best learning performance and achieving an expected performance.
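The distinction between total FLOPs and serial FLOPs can be illustrated with a toy model of our own (not the paper's derivation): for a fully connected layer, all multiplications can run simultaneously, and the additive reduction over each neuron's inputs takes a logarithmic number of parallel steps, so the critical path is far shorter than the total operation count.

```python
import math

def dense_flops(layers):
    """Total FLOPs for a forward pass through fully connected layers.
    `layers` is a list of (n_in, n_out) pairs; each output neuron needs
    n_in multiplications and n_in - 1 additions."""
    return sum(n_out * (2 * n_in - 1) for n_in, n_out in layers)

def dense_se_flops(layers):
    """Serial FLOPs (critical-path length) under idealized parallelism:
    all multiplications in a layer run at once (1 step), and the
    n_in-way additive reduction takes ceil(log2(n_in)) steps."""
    return sum(1 + math.ceil(math.log2(n_in)) for n_in, n_out in layers)
```

For a single 4-input, 2-output layer this gives 14 total FLOPs but only 3 serial steps, which is the kind of gap the se-FLOPs metric is designed to capture.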
Energy efficiency has become a major focus in optimizing hardware resource usage for Cloud servers. One approach widely employed to enhance the execution of parallel applications is thread-level parallelism (TLP) exploitation. This technique leverages multiple threads to improve computational efficiency and performance. However, the increasing heterogeneity of resources in cloud environments and the complexity of selecting the optimal configuration for each application pose a significant challenge to cloud users, owing to the massive number of possible configurations and the need to effectively harness TLP across diverse hardware setups to achieve optimal energy efficiency and performance. To address this challenge, we propose TLP-Allocator, an artificial neural network (ANN) optimization strategy that uses hardware and software metrics to build and train an ANN model. It predicts worker node and thread count combinations that provide optimal energy-delay product (EDP) results. In experiments using ten well-known applications on a private cloud with heterogeneous resources, we show that TLP-Allocator predicts combinations that yield EDP values close to the best achieved by an exhaustive search. It also improves the overall EDP by 38.2% compared to state-of-the-art workload scheduling in cloud environments.
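The selection objective is straightforward to state: among candidate (node, thread-count) configurations, pick the one minimizing EDP = energy × runtime. A minimal sketch with hypothetical measurements (the field names and values are our own, not from the paper):

```python
def best_edp(configs):
    """Pick the configuration minimizing the energy-delay product
    EDP = energy (J) * runtime (s). Lower is better: EDP penalizes
    configurations that save energy only by running much longer."""
    return min(configs, key=lambda c: c["energy_j"] * c["runtime_s"])

# Hypothetical profiling results for one application:
configs = [
    {"node": "A", "threads": 8,  "energy_j": 120.0, "runtime_s": 10.0},  # EDP 1200
    {"node": "A", "threads": 16, "energy_j": 150.0, "runtime_s": 6.0},   # EDP 900
    {"node": "B", "threads": 8,  "energy_j": 100.0, "runtime_s": 12.0},  # EDP 1200
]
```

Note that the 16-thread run wins despite drawing more energy, because its shorter runtime dominates the product; TLP-Allocator's contribution is predicting this winner without exhaustively profiling every combination.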
The discrete ordinates (SN) method with unstructured meshes is highly appropriate for high-fidelity modeling and simulation of radiation shielding problems with complicated geometries. However, the large number of unknowns resulting from discretization of the transport equation in the spatial, angular, and energy variables necessitates the use of parallel computing to achieve efficient solutions. In this work, a GPU-based SN transport sweep algorithm combined with a GPU-parallel multi-group Krylov subspace solver has been proposed and implemented in the STRAUM (SN Transport for Radiation Analysis with Unstructured Meshes) code. A group chunk decomposition method within the framework of the multi-group Krylov subspace solver has been applied to STRAUM to leverage multi-GPU parallel computing. For the Kobayashi-like and reactor pressure vessel problems, STRAUM typically runs faster by factors of 100~200 on a single NVIDIA GeForce RTX 4090 GPU and by factors of 70~120 on a single NVIDIA GeForce RTX 3080 Ti GPU than on a single AMD Ryzen 9 7900X CPU core. For the simulations on dual-GPU systems, the group chunk decomposition method achieves parallel computing efficiencies greater than 90% without degradation in convergence, except for cases using very coarse angular divisions. Besides, this method reduces per-GPU memory usage by more than 40% and enables STRAUM to effectively simulate problems with up to ten billion unknowns using two RTX 4090 GPUs.
ISBN (print): 9798350332865
This paper describes FFTpc, a novel algorithm for accurate and faster computation of the Fast Fourier Transform (FFT) using parallel computing. As opposed to the Cooley-Tukey FFT, the FFTpc uses only real-valued operations until the very last step. Filtering in parallel in the frequency domain is done on data subsets that are processed simultaneously, with no data interchange between processors through the main parts of the filtering process. In addition, if the user only requires the magnitude of the transform, the algorithm involves no complex-valued operations at all. Many other novel aspects of the FFTpc and both estimated and actual speedups are reported.
Parallel computing architectures are urgently needed to speed up the training process of artificial neural networks. This study proposes a novel approach to parallel computing using ion-modulated organic electrochemical transistors (OECTs). Thanks to the electrochemical doping and de-doping mechanism, the OECTs demonstrate long-term plasticity and exhibit distinguishable conductive states with high linearity. Moreover, our device array enables efficient weighted-sum and convolution operations for image feature extraction and performs effectively in simulating a hardware-based Faster R-CNN for object detection via transfer learning. The OECT array, with its separate read and write features and controllable conductive states, achieves the integration of forward inference and backward training, resulting in successful in-situ training of convolutional neural networks (CNNs). The CNNs based on OECTs achieve accuracies of 96.49% and 82.57% on the MNIST and Fashion-MNIST datasets, respectively, showcasing the potential of OECTs in edge computing for enhanced resource utilization and time efficiency.
The paper proposes a software and hardware model for organizing parallel computing on hybrid computing clusters, aimed at creating tools for converting parallel programs into a hybrid form. A method has been developed for organizing the execution of parallel programs on hybrid computing clusters using compatible fuzzy cognitive maps. The method exploits the advanced capabilities of the proposed hardware-software environment model and takes various hardware and software indicators into account, which reduces the share of exchange operations performed over slow network interfaces.
We develop an algorithm to calculate invariant distributions of large Markov chains whose state spaces are partitioned into "islands" and "ports". An island is a group of states (a cluster) with potentially many connections inside the island but a relatively small number of connections between islands. The states connecting different islands are called ports. Our algorithm is developed in the framework of the "state reduction approach", but the special structure of the state space allows the calculation of the invariant distribution to be done in parallel. Additional problems, such as computation of fundamental matrices and optimal stopping problems, are also analyzed for such Markov chains.
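A standard instance of the state-reduction approach is the Grassmann-Taksar-Heyman (GTH) algorithm, sketched below for a small dense chain; the paper's island/port decomposition and parallelization are not reproduced here, only the serial reduction idea it builds on.

```python
def gth_stationary(P):
    """Stationary distribution of an irreducible Markov chain via the
    GTH state-reduction algorithm: eliminate states one by one (censoring
    the chain on the remaining states), then back-substitute."""
    n = len(P)
    P = [row[:] for row in P]  # work on a copy
    # Reduction phase: eliminate states n-1, n-2, ..., 1.
    for k in range(n - 1, 0, -1):
        s = sum(P[k][j] for j in range(k))  # escape mass from state k
        for i in range(k):
            for j in range(k):
                # redirect transitions through the eliminated state k
                P[i][j] += P[i][k] * P[k][j] / s
    # Back-substitution phase: rows/columns >= k are untouched after
    # state k's elimination, so the needed quantities are still in P.
    pi = [0.0] * n
    pi[0] = 1.0
    for k in range(1, n):
        s = sum(P[k][j] for j in range(k))
        pi[k] = sum(pi[i] * P[i][k] for i in range(k)) / s
    total = sum(pi)
    return [x / total for x in pi]
```

GTH uses no subtractions, which makes it numerically stable; the island structure in the paper is what allows reductions inside different islands to proceed independently.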
Deep learning (DL) mainly uses various parallel computing libraries to optimize the speed of model training. The underlying computations of DL operators typically include essential functions such as reduction and prefix scan, the efficiency of which can be greatly improved using parallel acceleration devices. However, the acceleration of these computations is mainly supported by collective primitive libraries such as NVIDIA CUB and AMD hipCUB, which are only available on vendor-specific hardware accelerators due to the highly segregated computing ecosystems of different vendors. To address this issue, we propose an OpenCL parallel computing library called oclCUB that can run on different heterogeneous platforms. OclCUB abstracts the OpenCL execution environment, implements reusable common underlying computations of DL, and provides two types of interfaces targeting the operators' heterogeneous acceleration patterns, enabling users to design and optimize DL operators efficiently. We evaluate oclCUB on various hardware accelerators: NVIDIA Tesla V100s with OpenCL 1.2, AMD RADEON PRO V520 with OpenCL 2.0, MT-3000 with MOCL 3, and Kunpeng 920 with POCL 1.6. Our experiments show that the oclCUB-based operators achieve accurate computational results on various platforms. The results also demonstrate that oclCUB maintains a small, acceptable performance gap with CUB and is comparable in performance to hipCUB.
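The prefix-scan primitive such libraries accelerate can be illustrated with the Hillis-Steele scheme; this is a generic sketch of the algorithmic pattern, not oclCUB's actual API.

```python
def inclusive_scan(xs, op=lambda a, b: a + b):
    """Hillis-Steele inclusive prefix scan: ceil(log2(n)) doubling steps,
    each combining element i with element i - offset. Written sequentially
    here; on a GPU, every iteration of the inner loop runs concurrently."""
    out = list(xs)
    offset = 1
    while offset < len(out):
        prev = out[:]  # double-buffer to mimic the synchronized parallel step
        for i in range(offset, len(out)):
            out[i] = op(prev[i - offset], prev[i])
        offset *= 2
    return out
```

Reduction is the same pattern keeping only the final element; wrapping such kernels behind one portable interface for OpenCL 1.2 through 2.0 devices is the portability problem oclCUB addresses.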
One key advantage of quantum algorithms rests on the assumption that arbitrary quantum states can be efficiently prepared. However, limited by the number of qubits and the decoherence time, existing preparation methods are ill-suited to preparing quantum states for high-dimensional data. In this paper, we propose a Trainable Parameterized Quantum Encoding (TPQE) method for realizing approximate encoding of arbitrary quantum states, and representation ability is proposed as the verification criterion of TPQE for arbitrary quantum states. To enhance the robustness of TPQE, we treat noise as part of the TPQE and absorb the noise into the parameters through training. Moreover, we utilize parallel computing across multiple quantum processors to achieve a speedup of TPQE. Finally, the representation ability of TPQE is validated on a publicly available breast cancer dataset, using amplitude encoding as a benchmark. Experiments demonstrate that our method shows good robustness.
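The amplitude-encoding benchmark the paper compares against maps a classical feature vector to the amplitudes of a quantum state. A minimal classical sketch of that mapping (the trainable TPQE circuit itself is not reproduced):

```python
import math

def amplitude_encode(x):
    """Map a real feature vector to state-vector amplitudes: pad to the
    next power of two (n features need ceil(log2(n)) qubits) and
    L2-normalize, so the squared amplitudes sum to 1 as |psi> requires."""
    n = 1
    while n < len(x):
        n *= 2
    padded = list(x) + [0.0] * (n - len(x))
    norm = math.sqrt(sum(v * v for v in padded))
    return [v / norm for v in padded]
```

Exactly loading these amplitudes generally needs circuits of depth exponential in the qubit count, which is the cost the trainable approximate encoding is meant to avoid.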