A parallel iterative layered-medium integral-equation solver is presented for fast and scalable network parameter extraction of electronic packages. The solver, which relies on a 2-D fast Fourier transform (FFT)-based...
详细信息
A parallel iterative layered-medium integral-equation solver is presented for fast and scalable network parameter extraction of electronic packages. The solver, which relies on a 2-D fast Fourier transform (FFT)-based algorithm and a sparse preconditioner to reduce computational complexity, is parallelized using three workload decomposition strategies, including a pencil decomposition that increases the scalability of the computationally dominant FFT-based multiplication stage. A set of increasingly difficult benchmark problems, which require network parameter computations for N-trace = 1 to 257 package-scale interconnects, are solved on a petaflop scale computer to quantify the solver's accuracy, efficiency, and scalability. The total serialized computation time is observed to scale asymptotically as Ntrace2.6logNtrace. For the largest problem, using similar to 1.14 million unknowns and 1536 processes, the solver requires a wall-clock time of similar to 0.05 s per iteration, similar to 1 minute per excitation, similar to 9 h per frequency, and similar to 424 hours to extract the 514-port network parameters at 40 sample frequencies between 1 to 40 GHz.
Although many NP-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the h...
详细信息
Although many NP-hard graph optimization problems can be solved in polynomial time on graphs of bounded tree-width, the adoption of these techniques into mainstream scientific computation has been limited due to the high memory requirements of the dynamic programming tables and excessive runtimes of sequential implementations. This work addresses both challenges by proposing a set of new parallel algorithms for all steps of a tree decomposition-based approach to solve the maximum weighted independent set problem. A hybrid OpenMP/MPI implementation includes a highly scalable parallel dynamic programming algorithm leveraging the MADNESS task based runtime, and computational results demonstrate scaling. This work enables a significant expansion of the scale of graphs on which exact solutions to maximum weighted independent set can be obtained, and forms a framework for solving additional graph optimization problems with similar techniques.
Iterative methods are commonly used for solving large, sparse systems of linear equations on parallel computers. Implementations of parallel iterative solvers contain kernels (e.g., parallel sparse matrix-vector produ...
详细信息
Iterative methods are commonly used for solving large, sparse systems of linear equations on parallel computers. Implementations of parallel iterative solvers contain kernels (e.g., parallel sparse matrix-vector products) in which parallel processes alternate between phases of computation and communication. Standard software packages use synchronous implementations where there are one or more synchronization points per iteration. These synchronization points occur during communication phases where each process sends data to other processes and idles until all data needed for the next iteration is received. Synchronization points scale poorly on massively parallel machines and may become the primary bottleneck for future exascale computers. This calls for research and development of asynchronous iterative methods, which is the subject of this dissertation. In asynchronous iterative methods there are no synchronization points. This means that, after a phase of computation, processes immediately proceed to the next phase of computation using whatever data is currently available. Since the late 1960s, research on asynchronous methods has primarily considered basic fixed-point methods, e.g., Jacobi, where proving asymptotic convergence bounds has been the focus. However, the practical behavior of asynchronous methods is not well understood, and asynchronous versions of certain fast-converging solvers have not been developed. This dissertation focuses on studying the practical behavior of asynchronous Jacobi, developing new communication-avoiding asynchronous iterative solvers, and introducing the first asynchronous versions of multigrid and Chebyshev. To better understand the practical behavior of asynchronous Jacobi, we examine a model of asynchronous Jacobi where communication delays are neglected. We call this model simplified asynchronous Jacobi. Simplified asynchronous Jacobi can be used to model asynchronous Jacobi implemented in shared memory or distributed memo
Edge computing aims at improving performance by storing and processing data closer to their source. The k Nearest-Neighbor (k-NN) query is a common spatial query in several applications. For example, this query can be...
详细信息
Edge computing aims at improving performance by storing and processing data closer to their source. The k Nearest-Neighbor (k-NN) query is a common spatial query in several applications. For example, this query can be used for distance classification of a group of points against a big reference dataset to derive the dominating feature class. Typically, GPU devices have much larger numbers of processing cores than CPUs and faster device memory than main memory accessed by CPUs, thus, providing higher computing power. However, since device and/or main memory may not be able to host an entire reference dataset, the use of secondary storage is inevitable. Solid State Disks (SSDs) could be used for storing such a dataset. In this paper, we propose an architecture of a distributed edge-computing environment where large-scale processing of the k-NN query can be accomplished by executing an efficient algorithm for processing the k-NN query on its (GPU and SSD enabled) edge nodes. We also propose a new algorithm for this purpose, a GPU-based partitioning algorithm for processing the k-NN query on big reference data stored on SSDs. We implement this algorithm in a GPU-enabled edge-computing device, hosting reference data on an SSD. Using synthetic datasets, we present an extensive experimental performance comparison of the new algorithm against two existing ones (working on memory-resident data) proposed by other researchers and two existing ones (working on SSD-resident data) recently proposed by us. The new algorithm excels in all the conducted experiments and outperforms its competitors. (C) 2021 Elsevier B.V. All rights reserved.
We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation and in the high-dimensional regime. We show...
详细信息
We study the question of whether parallelization in the exploration of the feasible set can be used to speed up convex optimization, in the local oracle model of computation and in the high-dimensional regime. We show that the answer is negative for both deterministic and randomized algorithms applied to essentially any of the interesting geometries and nonsmooth, weakly-smooth, or smooth objective functions. In particular, we show that it is not possible to obtain a polylogarithmic (in the sequential complexity of the problem) number of parallel rounds with a polynomial (in the dimension) number of queries per round. In the majority of these settings and when the dimension of the space is polynomial in the inverse target accuracy, our lower bounds match the oracle complexity of sequential convex optimization, up to at most a logarithmic factor in the dimension, which makes them (nearly) tight. Another conceptual contribution of our work is in providing a general and streamlined framework for proving lower bounds in the setting of parallel convex optimization. Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean (l2) and l∞ spaces.
This paper considers a variant of the Vehicle Routing Problem with Time Windows, with site dependencies, multiple depots and outsourcing costs. This problem is the basis for many technician routing problems. Having bo...
详细信息
This paper considers a variant of the Vehicle Routing Problem with Time Windows, with site dependencies, multiple depots and outsourcing costs. This problem is the basis for many technician routing problems. Having both site-dependency and time window constraints lresults in difficulties in finding feasible solutions and induces highly constrained instances. Matheuristics based on Mixed Integer Linear Programming compact formulations are firstly designed. Column Generation matheuristics are then described by using previous matheuristics and machine learning techniques to stabilize and speed up the convergence of the Column Generation algorithm. The computational experiments are analyzed on public instances with graduated difficulties in order to analyze the accuracy of algorithms for ensuring feasibility and the quality of solutions for weakly to highly constrained instances. The results emphasize the interest of the multiple types of hybridization between mathematical programming, machine learning and heuristics inside the Column Generation framework. This work offers perspectives for many extensions of technician routing problems.
IoT, being a field of great interest and importance for the coming generations, involves certain challenging and improving aspects for the IoT application developers and researchers to work upon. A wireless sensor mes...
详细信息
The paper covers the mathematical modelling of the main biogenic matter transformations in the production-destruction processes of phytoplankton populations in the Azov Sea, taking into account the influence of extern...
详细信息
This brief announcement presents parallel algorithms with a tradeoff between work and span for single source reachability and approximate shortest paths on directed graphs. Both algorithms have ∼O(mρ2 + nρ4) work a...
详细信息
In the analysis of methods of multicriteria optimization. The detailed implementation of the parallel algorithm of the simulated annealing method is reproduced by the example of the extension of a large-scale travelli...
详细信息
暂无评论