ISBN (print): 9781510658394; 9781510658400
Iterative methods can provide high-quality image reconstruction for Fourier-domain optical coherence tomography (FD-OCT) by solving an inverse problem. Compared with regular IFFT-based reconstruction, a more accurate estimate can be obtained iteratively by integrating prior knowledge; however, this is often more time-consuming. To address this, we propose a fast iterative method for FD-OCT image reconstruction empowered by GPU acceleration. An iterative scheme is adopted, comprising a forward model and an inverse solver. Large-scale parallelism of OCT image reconstruction is performed on B-scans. We deployed the framework on an Nvidia GeForce RTX 3090 graphics card, which enables parallel processing. Using the widely used toolkit PyTorch, the inverse problem of OCT image reconstruction is solved with the stochastic gradient descent (SGD) algorithm. To validate the effectiveness of the proposed method, we compare its computational time and image quality with those of other iterative approaches, including the ADMM, AR, and RFIAA methods. The proposed method provides a speed enhancement of up to 1,500 times with image quality comparable to that of ADMM reconstruction. The result indicates the potential for high-quality real-time volumetric OCT image reconstruction via iterative algorithms.
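As a rough illustration of the iterative scheme described above, the following NumPy sketch recovers a depth profile from simulated FD-OCT spectral data by gradient descent on a least-squares data term. The paper itself uses PyTorch, SGD, and a GPU; the forward model, names, and sizes here are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Hypothetical sketch: reconstruct a depth profile x from spectral data
# y = F @ x by gradient descent on ||F @ x - y||^2, instead of a direct
# IFFT. F, x_true, and the step size are illustrative assumptions.
n = 64
F = np.fft.fft(np.eye(n)) / np.sqrt(n)        # unitary DFT as forward model
x_true = np.zeros(n)
x_true[[5, 20]] = 1.0                          # two sparse reflectors
y = F @ x_true                                 # simulated spectrum

x = np.zeros(n, dtype=complex)
lr = 0.5
for _ in range(200):
    grad = F.conj().T @ (F @ x - y)            # gradient of the data term
    x = x - lr * grad

print(np.abs(x - x_true).max())                # converges to the true profile
```

In practice the forward model would also encode priors and system nonidealities, which is where the iterative approach gains accuracy over a plain IFFT.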
ISBN (digital): 9798350368741
ISBN (print): 9798350368758
Due to dark environments, optical aberrations, etc., remote sensing images are often degraded by low contrast, which greatly hinders their practical application in agricultural management and other related tasks. The surface features of remote sensing images are often continuously distributed in space; thus, the size of a network's receptive fields and its ability to learn long-range dependencies are crucial for restoring low-light remote sensing images. Existing CNN-based methods provide limited receptive fields, while Transformer-based methods are constrained by their quadratic computational complexity. To cope with these issues, we propose a novel low-light remote sensing image enhancement network that combines multi-scale receptive fields with frequency-domain attention. Specifically, the network employs multiple parallel kernels of varying sizes to learn multi-scale local features in the spatial domain and complements them with frequency-domain information to learn global long-range correlations, achieving local-global feature extraction and further facilitating the enhancement of degraded images. Extensive experiments demonstrate that our network outperforms existing methods quantitatively and achieves exceptional visual performance, which fully highlights the effectiveness and superiority of our method in enhancing low-light remote sensing images.
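The abstract does not detail the frequency-domain attention mechanism, but the general idea of obtaining a global receptive field through the Fourier domain can be sketched as a learnable per-frequency filter (in the spirit of GFNet-style designs); all shapes and names below are assumptions for illustration.

```python
import numpy as np

# Minimal sketch of a frequency-domain global-mixing layer: transform the
# feature map to the spectrum, apply a learnable complex filter, and
# transform back. One multiply in frequency touches every spatial
# position, giving a global receptive field in O(hw log hw).
rng = np.random.default_rng(1)
h, w, c = 16, 16, 4
feat = rng.standard_normal((h, w, c))

# Learnable complex filter over the half-spectrum (rfft2 output shape).
filt = (rng.standard_normal((h, w // 2 + 1, c))
        + 1j * rng.standard_normal((h, w // 2 + 1, c)))

spec = np.fft.rfft2(feat, axes=(0, 1))
out = np.fft.irfft2(spec * filt, s=(h, w), axes=(0, 1))
print(out.shape)  # same shape as the input feature map
```

The multi-scale spatial branch described in the abstract would run parallel convolutions of different kernel sizes alongside this global branch and fuse the two.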
Nowadays, the throughput improvement in large clusters of computers recommends the development of malleable applications. Thus, during the execution of these applications in a job, the resource management system (RMS)...
Transformer-based methods have demonstrated remarkable performance on image super-resolution tasks. Due to high computational complexity, researchers have been working to achieve a balance between computation costs an...
ISBN (print): 9783031061561; 9783031061554
The Preconditioned Conjugate Gradient (PCG) method is one of the most widely used methods for solving sparse linear systems of equations. Pipelined PCG (PIPECG) attempts to eliminate the dependencies among the computations in the PCG algorithm and to overlap independent computations by reorganizing the traditional PCG code and using non-blocking allreduces. We have developed a novel pipelined PCG algorithm called PIPECG-OATI (One Allreduce per Two Iterations), which reduces the number of non-blocking allreduces to one per two iterations and provides a large overlap of global communication and computation at higher core counts on distributed-memory CPU systems. PIPECG-OATI gives up to 3x speedup over PCG and 1.73x speedup over PIPECG at large core counts. For GPU-accelerated heterogeneous architectures, we have developed three methods for efficient execution of the PIPECG algorithm. These methods achieve task and data parallelism. Our methods give considerable performance improvements over the PCG CPU and GPU implementations of the Paralution and PETSc libraries.
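The dependency structure that the pipelined variants reorganize is visible in a plain serial PCG sketch: the two inner products per iteration below are the global reductions that become non-blocking allreduces in PIPECG and PIPECG-OATI. This is the textbook algorithm with a Jacobi preconditioner, not the paper's implementation.

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=500):
    """Classical PCG with a diagonal (Jacobi) preconditioner M_inv.
    Each iteration has two dot products (global reductions) whose results
    gate the subsequent updates -- the dependency that pipelining hides."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)          # reduction 1
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv * r
        rz_new = r @ z                 # reduction 2
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test system (illustrative sizes).
rng = np.random.default_rng(0)
n = 100
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)            # symmetric positive definite
b = rng.standard_normal(n)
x = pcg(A, b, 1.0 / np.diag(A))
print(np.linalg.norm(A @ x - b))       # small residual
```

In a distributed run, each `@` product is local to a rank's rows while the two dot products require an allreduce; PIPECG-OATI amortizes that cost to one allreduce per two iterations.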
ISBN (print): 9781450397339
The proceedings contain 88 papers. The topics discussed include: HSP: hybrid synchronous parallelism for fast distributed deep learning; adaptive and efficient GPU time sharing for hyperparameter tuning in cloud; TCB: accelerating transformer inference services with request concatenation; EmbRace: accelerating sparse communication for distributed training of deep neural networks; FedHiSyn: a hierarchical synchronous federated learning framework for resource and data heterogeneity; parallel algorithms for masked sparse matrix-matrix products; tesseract: parallelize the tensor parallelism efficiently; automatic differentiation of parallel loops with formal methods; a single-tree algorithm to compute Euclidean minimum spanning tree on GPUs; efficient phase-functioned real-time character control in mobile games: a TVM enabled approach; and SHE: a generic framework for data stream mining over sliding windows.
Author:
Marzolla, Moreno
Center for Inter-Department Industrial Research ICT, Bologna, Italy
Mini-applications are widely used in parallel computing for testing and benchmarking purposes. However, many existing mini-applications are not suitable for teaching, since they require advanced knowledge of algebra, ...
ISBN (print): 9781713871088
Variational inequalities in general, and saddle point problems in particular, are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport, and robust optimization. With the increasing data and problem sizes necessary to train high-performing models across various applications, we need to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck, and this problem is exacerbated for high-dimensional and over-parameterized models. Due to these considerations, it is important to equip existing methods with strategies that reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Rand-k; MASHA1) and contractive (such as Top-k; MASHA2) compressors. The new algorithms support bidirectional compression and can also be modified for the stochastic setting with batches and for federated learning with partial participation of clients. We empirically validated our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.
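The two compressor families mentioned above are standard constructions and can be sketched briefly; these are illustrative implementations, not the paper's code. Rand-k is unbiased (its expectation equals the input) while Top-k is contractive (it never increases the approximation error relative to the input's norm).

```python
import numpy as np

rng = np.random.default_rng(0)

def topk(g, k):
    """Contractive compressor: keep the k largest-magnitude entries."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def randk(g, k):
    """Unbiased compressor: keep k uniformly random entries, scaled by
    d/k so that E[randk(g)] = g."""
    out = np.zeros_like(g)
    idx = rng.choice(g.size, size=k, replace=False)
    out[idx] = g[idx] * (g.size / k)
    return out

g = rng.standard_normal(1000)
c = topk(g, 100)                       # 10x fewer entries to transmit
print(np.count_nonzero(c))             # 100
```

Either compressor is applied to the messages exchanged between workers and server, cutting communication volume by roughly d/k per round.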
The goal of single-image defocus deblurring is to reconstruct a clear image from a defocused one. Although existing methods perform well in common blurry scenes, they still face the challenge of feature extraction whe...
ISBN (print): 9783031637537; 9783031637513
Computational electromagnetics plays a crucial role across diverse domains, notably in fields such as antenna design and radar signature prediction, owing to the omnipresence of electromagnetic phenomena. Numerical methods have replaced traditional experimental approaches, expediting design iterations and scenario characterization. The emergence of GPU accelerators offers an efficient implementation of numerical methods that can significantly enhance the computational capabilities of partial differential equation (PDE) solvers with specific boundary-value conditions. This paper explores parallelization strategies for implementing a Finite-Difference Time-Domain (FDTD) solver on GPUs, leveraging shared memory and optimizing memory access patterns to achieve performance gains. One notable innovation in this research is the use of strategies such as exploiting temporal locality and avoiding misaligned global memory accesses to improve data processing efficiency. Additionally, we break the computation down into multiple kernels, each computing different electromagnetic (EM) field components, to improve shared memory utilization and GPU cache efficiency. We implement crucial design optimizations to fully exploit the GPU's parallel processing capabilities. These include maintaining consistent block sizes, analyzing optimal configurations for field-updating kernels, and optimizing memory access patterns for CUDA threads within warps. Our experimental analysis verifies the effectiveness of these strategies, reducing execution time and increasing the GPU's effective memory bandwidth. Throughput evaluation demonstrates performance gains, with our CUDA implementation achieving up to 17 times higher throughput than CPU-based methods. Speedup gains and throughput comparisons illustrate the scalability and efficiency of our approach, showcasing its potential for developing large-scale electromagnetic simulations on GPUs.
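The stencil that an FDTD solver parallelizes can be seen in a minimal serial 1-D Yee update; this is the textbook scheme, not the paper's CUDA kernels, and all constants are illustrative. In the GPU version, each of the two sweeps below would map to its own kernel, matching the one-kernel-per-field-component decomposition described above, with one thread per grid cell.

```python
import numpy as np

# Minimal 1-D FDTD (Yee) loop: E and H live on staggered half-cells and
# are updated in alternation. Courant number 0.5 keeps the scheme stable.
n, steps = 200, 300
ez = np.zeros(n)        # electric field at integer grid points
hy = np.zeros(n - 1)    # magnetic field, offset by half a cell

for t in range(steps):
    hy += 0.5 * (ez[1:] - ez[:-1])            # H update sweep (kernel 1)
    ez[1:-1] += 0.5 * (hy[1:] - hy[:-1])      # E update sweep (kernel 2)
    ez[50] += np.exp(-((t - 30) ** 2) / 100)  # soft Gaussian source

print(float(np.abs(ez).max()))                # bounded: stable propagation
```

Each update reads only nearest neighbours, which is why tiling the fields into shared memory and coalescing global loads within warps, as the paper does, pays off directly.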