ISBN: (Print) 9798350371918; 9798350371901
Optimization of memristor-based ultrasonic transducers for mesoscopic characterization of biomaterials was presented at the IEEE 2022 ISAF. In parallel, the development of quaternionic Clifford-based transforms for improving deep learning algorithms in nondestructive testing (NDT) has recently been suggested. Practical implementation of this signal-processing tool was tested on time reversal based nonlinear elastic wave spectroscopy (TR-NEWS) experiments. For (2+1)D image data, processing with the quaternions H was suggested to detect anomalous scattering positions in media with hysteresis, while for (3+1)D image data an extension was necessary. The (3+1)D image processing was conducted using the Clifford algebra A_{3,1}, isomorphic to M_2(H), in which ultrasonic signals are represented in the biquaternionic space. The optimal weight function of paths associated with this biquaternionic signal processing was obtained by modifying the Echo State Network (ESN) method, and a comparison with TR-NEWS experiments was conducted on complex samples (biomaterials or NDT samples) with intrinsic hysteretic nonlinearities. The stability of the weight function of the ultrasonic (US) wave path in (3+1)D is checked by machine learning.
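For readers unfamiliar with the quaternionic representation, the core operation is the Hamilton product. The Python sketch below (numpy only; the packing of amplitude and (2+1)D coordinates into one quaternion is an illustrative choice, not the authors' exact encoding) shows the product applied to a single measurement sample.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) arrays."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

# Illustrative (hypothetical) encoding: signal amplitude plus the two
# spatial coordinates and time of a (2+1)D sample in one quaternion.
sample = np.array([0.8, 1.5, 2.0, 0.1])                  # (amplitude, x, y, t)
rotor = np.array([np.cos(0.3), np.sin(0.3), 0.0, 0.0])   # unit quaternion
print(hamilton_product(rotor, sample))
```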
ISBN: (Print) 9798350337662
Over the years, much research involving mobile computational entities has been performed. From modeling actual microscopic (and smaller) robots to modeling software processes on a network, many important problems have been studied in this context. Gathering is one such fundamental problem in this area. The problem of gathering k robots, initially placed arbitrarily on the nodes of an n-node graph, asks that these robots coordinate and communicate locally, as opposed to globally, to move around the graph, find each other, and settle down on a single node as fast as possible. A more difficult problem to solve is gathering with detection, where once the robots gather, they must subsequently realize that gathering has occurred and then terminate. In this paper, we propose a deterministic approach to solve gathering with detection on any arbitrary connected graph that is faster than existing deterministic solutions for even just gathering (without the requirement of detection) on arbitrary graphs. In contrast to earlier work on gathering, it leverages the fact that more robots are present in the system to achieve gathering with detection faster than previous papers that focused on just gathering. The state-of-the-art solution for deterministic gathering [Ta-Shma and Zwick, TALG, 2014] takes Õ(n^5 log ℓ) rounds, where ℓ is the smallest label among the robots and Õ hides a polylog factor. We design a deterministic algorithm for gathering with detection with the following trade-offs depending on how many robots are present: (i) when k >= n/2 + 1, the algorithm takes O(n^3) rounds; (ii) when k >= ⌈n/3⌉ + 1, the algorithm takes O(n^4 log n) rounds; and (iii) otherwise, the algorithm takes Õ(n^5) rounds. The algorithm is not required to know k, only n.
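The paper's deterministic algorithm is considerably more involved, but the problem statement itself can be simulated in a few lines. The sketch below is a toy illustration only: it gives the robots global knowledge of the graph and of each other's positions, which the local communication model of the paper explicitly forbids.

```python
from collections import deque

def bfs_dist(graph, src):
    """Shortest-path distances from src in an unweighted graph (adjacency dict)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def gather(graph, positions):
    """Toy gathering: each round, every robot steps one hop toward the
    smallest occupied node (global knowledge -- not the paper's model)."""
    rounds = 0
    while len(set(positions)) > 1:
        target = min(positions)
        dist = bfs_dist(graph, target)
        positions = [
            p if p == target else
            min((v for v in graph[p] if dist[v] < dist[p]), default=p)
            for p in positions
        ]
        rounds += 1
    return rounds

# 6-node path graph; robots on nodes 0 and 5 meet on node 0 in 5 rounds.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(gather(path, [0, 5]))  # -> 5
```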
K-means clustering is a popular unsupervised machine learning method widely used in various applications, such as data mining, image processing, and social sciences. However, clustering can be computationally expensiv...
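For reference, the basic Lloyd iteration that work in this area accelerates is only a few lines of numpy (a textbook sketch, not this paper's variant):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of its assigned points.
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, 2)
print(centers)
```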
ISBN: (Print) 9781665481069
The GPU programming model is primarily aimed at the development of applications that run on one GPU. However, this limits the scalability of GPU code to the capabilities of a single GPU in terms of compute power and memory capacity. To scale GPU applications further, a great engineering effort is typically required: work and data must be divided over multiple GPUs by hand, possibly across multiple nodes, and data must be manually spilled from GPU memory to higher-level memories. We present Lightning: a framework that follows the common GPU programming paradigm but enables scaling to large problems with ease. Lightning supports multi-GPU execution of GPU kernels, even across multiple nodes, and seamlessly spills data to higher-level memories (main memory and disk). Existing CUDA kernels can easily be adapted for use in Lightning, with data-access annotations on these kernels allowing Lightning to infer their data requirements and the dependencies between subsequent kernel launches. Lightning efficiently distributes the work and data across GPUs and maximizes efficiency by overlapping scheduling, data movement, and kernel execution where possible. We present the design and implementation of Lightning, as well as experimental results on up to 32 GPUs for eight benchmarks and one real-world application. The evaluation shows excellent performance and scalability, such as a speedup of 57.2x over the CPU using Lightning with 16 GPUs over 4 nodes and 80 GB of data, far beyond the memory capacity of one GPU.
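Lightning's contribution is easiest to appreciate against the manual baseline it replaces. The sketch below shows the by-hand division of work and data across devices that the abstract describes; it uses CuPy (assumed installed, with at least n_gpus visible devices) and, unlike Lightning, runs the devices sequentially with no overlap of transfers and compute, and no spilling.

```python
import numpy as np
import cupy as cp  # assumption: CuPy installed, >= n_gpus GPUs visible

def scale_on_gpus(host, n_gpus, factor):
    """By-hand multi-GPU map: split the data, run on each device, gather back."""
    bounds = np.linspace(0, len(host), n_gpus + 1, dtype=int)
    out = np.empty_like(host)
    for i in range(n_gpus):
        lo, hi = bounds[i], bounds[i + 1]
        with cp.cuda.Device(i):                  # select GPU i
            d = cp.asarray(host[lo:hi])          # host -> device copy
            out[lo:hi] = cp.asnumpy(factor * d)  # compute, device -> host copy
    return out

data = np.arange(1 << 20, dtype=np.float32)
print(scale_on_gpus(data, 2, 3.0)[:4])  # [0. 3. 6. 9.]
```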
ISBN: (Print) 9781665497473
The present work investigates the modeling of pre-exascale input/output (I/O) workloads of Adaptive Mesh Refinement (AMR) simulations through a simple proxy application. We collect data from the AMReX Castro framework running on the Summit supercomputer, for a wide range of scales and mesh partitions, for the hydrodynamic Sedov case as a baseline to provide sufficient coverage for the formulated proxy model. The nonlinear analysis data production rates are quantified as a function of a set of input parameters, such as output frequency, grid size, number of levels, and the Courant-Friedrichs-Lewy (CFL) condition number, for each rank, mesh level, and simulation time step. Linear regression is then applied to formulate a simple analytical model that translates AMReX inputs into MACSio proxy I/O application parameters, resulting in a simple "kernel" approximation for data production at each time step. Results show that MACSio can simulate actual AMReX nonlinear "static" I/O workloads to a certain degree of confidence on the Summit supercomputer using the present methodology. The goal is to provide an initial level of understanding of AMR I/O workloads via lightweight proxy-application models to facilitate auto-tuned data-management strategies in anticipation of exascale systems.
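As a rough illustration of the modeling step, the following numpy sketch fits a linear model from run parameters to a data-production rate; the feature columns and numbers are made up, and the actual AMReX-to-MACSio parameter mapping in the paper differs.

```python
import numpy as np

# Hypothetical design matrix: one row per run, with columns such as
# [output frequency, grid size, number of AMR levels, CFL number].
X = np.array([[10,  64, 2, 0.5],
              [20, 128, 3, 0.5],
              [10, 128, 2, 0.9],
              [40, 256, 4, 0.7]], dtype=float)
y = np.array([1.2, 5.8, 4.1, 22.5])  # made-up production rates (GB/step)

# Fit y ~ X @ w + b by ordinary least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(row):
    """Predicted data-production rate for a new run configuration."""
    return np.append(row, 1.0) @ coef

print(predict([20, 128, 3, 0.7]))
```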
ISBN: (Print) 9798350364613; 9798350364606
There is a growing need, for example in machine learning and analytics, to decompose applications into smaller schedulable units. Such decomposition can improve performance, reduce energy consumption, and increase resource utilization. Unfortunately, enabling fine-grained parallelism comes with significant overheads and requires improvements at all layers of the programming stack. We consider the challenges of supporting fine-grained parallelism in the increasingly popular Python-based programming libraries. Specifically, we focus on Parsl, a Python library that is widely used to parallelize the execution of fine-grained Python functions. Parsl's Python-based runtime supports a maximum throughput of around 1,200 tasks per second, insufficient to meet modern application needs. We perform a comprehensive analysis of Parsl and identify areas that prohibit it from achieving higher throughput. We first profile Parsl's components and identify that, with fine-grained tasks, workers are often not saturated. We find that tasks spend a majority of their time in the components between the scheduler and the workers; however, we also learned that the scheduler is capable of submitting thousands of tasks per second. We then focused on developing new optimizations and implementing crucial components in C to improve throughput. Our new implementation increases Parsl's throughput six-fold.
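For context, a minimal Parsl program with deliberately fine-grained tasks looks like the sketch below (standard Parsl API; the paper's benchmarks use heavier-weight executors such as the HighThroughputExecutor rather than the local thread pool shown here).

```python
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors.threads import ThreadPoolExecutor

# Minimal local configuration; real deployments use other executors.
parsl.load(Config(executors=[ThreadPoolExecutor(max_threads=4)]))

@python_app
def square(x):
    return x * x  # a deliberately tiny, fine-grained task

futures = [square(i) for i in range(1000)]  # submit 1,000 tasks
print(sum(f.result() for f in futures))     # block on and gather results
```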
ISBN: (Print) 9781665497473
We propose an efficient version of the PageRank algorithm for adjacency matrices that reduces the complexity by a factor of two. This method computes the A^T x operation on the transposed matrix A^T without having to explicitly normalize and transpose the matrix. We implement the method using standard row-major and column-major matrix storage formats. We perform experiments with parallel implementations in OpenMP, on synthetic data as well as on matrices extracted from large-scale graphs. The experiments are done on two different Intel processors from recent generations. The column-major storage format version of our method shows good scaling and outperforms standard PageRank in a majority of cases, even when the preprocessing burden of the latter is not taken into account.
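The idea of evaluating A^T x without forming the transpose can be seen in a pure-Python sketch: traverse the out-adjacency structure row by row and scatter each node's rank to its out-neighbours. This illustrates the access pattern only, not the paper's OpenMP implementation.

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-12):
    """PageRank over an out-adjacency list. The A^T x product is computed by
    scattering each node's rank to its out-neighbours, so the transposed,
    column-normalized matrix is never materialized."""
    n = len(adj)
    x = np.full(n, 1.0 / n)
    while True:
        y = np.full(n, (1.0 - d) / n)
        for u, nbrs in enumerate(adj):
            if nbrs:
                share = d * x[u] / len(nbrs)
                for v in nbrs:
                    y[v] += share      # row-major traversal, scattered writes
            else:
                y += d * x[u] / n      # dangling node: spread rank uniformly
        if np.abs(y - x).sum() < tol:
            return y
        x = y

# Tiny 4-node graph given as out-adjacency lists.
print(pagerank([[1, 2], [2], [0], [2]]))
```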
ISBN: (Print) 9798350364613; 9798350364606
A performance-portable application can run on a variety of different hardware platforms, achieving an acceptable level of performance without requiring significant rewriting for each platform. Several performance-portable programming models are now suitable for high-performance scientific application development, including OpenMP and Kokkos. Chapel is a parallel programming language that supports the productive development of high-performance scientific applications and has recently added support for GPU architectures through native code generation. Using three mini-apps (BabelStream, miniBUDE, and TeaLeaf), we evaluate the Chapel language's performance portability across various CPU and GPU platforms. In our evaluation, we replicate and build on previous studies of performance portability using mini-apps, comparing Chapel against OpenMP, Kokkos, and the vendor programming models CUDA and HIP. We find that Chapel achieves performance portability comparable to OpenMP and Kokkos, and we identify several implementation issues that limit Chapel's performance portability on certain platforms.
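For context on what these mini-apps measure, BabelStream's kernels do almost no compute per byte moved; its triad kernel is shown below as a numpy sketch (not the Chapel port evaluated in the paper).

```python
import time
import numpy as np

N = 1 << 24                # ~16M float64 elements per array
b = np.random.rand(N)
c = np.random.rand(N)
scalar = 0.4

t0 = time.perf_counter()
a = b + scalar * c         # BabelStream's "triad" kernel
dt = time.perf_counter() - t0

bytes_moved = 3 * N * 8    # read b, read c, write a
print(f"triad bandwidth: {bytes_moved / dt / 1e9:.1f} GB/s")
```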
With the rapid growth of cloud computing, outsourcing databases to cloud servers is becoming increasingly popular. Query integrity authentication is an effective technique to obtain reliable query results from untrust...
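The abstract is truncated here, but a classic building block for query-integrity authentication in outsourced databases is a Merkle hash tree, whose published root digest lets a client verify that returned rows are untampered; whether this particular paper uses one is not stated. A minimal root computation:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Merkle root over a list of byte strings, duplicating the last
    node whenever a level has odd length."""
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

rows = [b"alice,30", b"bob,25", b"carol,41"]
print(merkle_root(rows).hex())  # digest a client could verify against
```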
ISBN: (Print) 9781665481069
Hypergraphs offer flexible and robust data representations for many applications, but methods that work directly on hypergraphs are not readily available and tend to be prohibitively expensive. Much of the current analysis of hypergraphs relies on first performing a graph expansion, either based on the nodes (clique expansion) or on the hyperedges (line graph), and then running standard graph analytics on the resulting representative graph. However, this approach suffers from massive space complexity and high computational cost as hypergraph size increases. Here, we present efficient parallel algorithms to accelerate and reduce the memory footprint of higher-order graph expansions of hypergraphs. Our results focus on the hyperedge-based s-line graph expansion, but the methods we develop work for higher-order clique expansions as well. To the best of our knowledge, ours is the first framework to enable hypergraph spectral analysis of a large dataset on a single shared-memory machine. Our methods enable the analysis of datasets from many domains that previous graph-expansion-based models are unable to handle. The proposed s-line graph computation algorithms are orders of magnitude faster than state-of-the-art sparse general matrix-matrix multiplication methods, and obtain approximately 2-31x speedup over a prior state-of-the-art heuristic-based algorithm for s-line graph computation.
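The s-line graph itself has a compact definition: each hyperedge becomes a node, and two hyperedges are joined when they share at least s vertices. The naive pairwise sketch below illustrates the definition; the paper's parallel algorithms are precisely about avoiding this quadratic set-intersection pass at scale.

```python
from itertools import combinations

def s_line_graph(hyperedges, s):
    """Naive s-line graph: one node per hyperedge, with an edge joining
    every pair of hyperedges that share at least s vertices."""
    sets = [frozenset(e) for e in hyperedges]
    return [(i, j) for i, j in combinations(range(len(sets)), 2)
            if len(sets[i] & sets[j]) >= s]

H = [{1, 2, 3}, {2, 3, 4}, {4, 5}, {1, 5, 6}]
print(s_line_graph(H, 1))  # overlap of >= 1 vertex
print(s_line_graph(H, 2))  # stricter: >= 2 shared vertices -> [(0, 1)]
```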