The seismic reflection method is one of the most important methods in geophysical exploration. There are three stages in a seismic exploration survey: acquisition, processing, and interpretation. This paper focuses on a pre-processing tool, the Non-Local Means (NLM) filter algorithm, which is a powerful technique that can significantly suppress noise in seismic data. However, the domain of the NLM algorithm is the whole dataset, and with 3D seismic data being very large, often exceeding one terabyte (TB), it is impossible to store all the data in Random Access Memory (RAM). Furthermore, the NLM filter would require a considerably long runtime. These factors make a straightforward implementation of the NLM algorithm on real geophysical exploration data impractical. This paper redesigns and implements the NLM filter algorithm to fit the challenges of seismic processing. The optimized implementation of the NLM filter is capable of processing production-size seismic data on modern clusters and is 87 times faster than the straightforward implementation of NLM.
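To make the "straightforward implementation" concrete: the core of NLM replaces each sample by a weighted average of samples whose surrounding patches look similar. The following is a minimal, serial 2D sketch of that idea (not the paper's cluster-scale implementation); the function name and parameters `patch`, `search`, and `h` are our own illustrative choices.

```python
import numpy as np

def nlm_filter(data, patch=1, search=3, h=0.1):
    """Denoise a 2D section with a basic Non-Local Means filter.

    Each sample is replaced by a weighted average of neighbourhood samples;
    weights decay exponentially with the squared distance between the
    patches centred on the two samples.
    """
    pad = patch + search
    padded = np.pad(data, pad, mode="reflect")
    out = np.zeros_like(data, dtype=float)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            ci, cj = i + pad, j + pad
            ref = padded[ci - patch:ci + patch + 1, cj - patch:cj + patch + 1]
            weights, values = [], []
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    ni, nj = ci + di, cj + dj
                    cand = padded[ni - patch:ni + patch + 1,
                                  nj - patch:nj + patch + 1]
                    d2 = np.mean((ref - cand) ** 2)   # patch dissimilarity
                    weights.append(np.exp(-d2 / (h * h)))
                    values.append(padded[ni, nj])
            out[i, j] = np.dot(weights, values) / np.sum(weights)
    return out
```

The quadruple loop over every sample and its full search window is exactly why a naive NLM pass over terabyte-scale 3D volumes is impractical without the redesign the paper describes.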
OpenMP is the predominant standard for shared memory systems in high-performance computing (HPC), offering a tasking paradigm for parallelism. However, existing OpenMP implementations, like GCC and LLVM, face computat...
ISBN:
(print) 9783031365966; 9783031365973
The paper proposes a metric that evaluates the overhead introduced into parallel programs by the additional operations that parallelism implicitly imposes. We consider the case of multithreaded parallel programs that follow the SPMD (Single Program Multiple Data) model. Java programs were considered for this proposal, but the metric could easily be adapted to any imperative language that supports multithreading. The metric is defined as a combination of several atomic metrics covering the various synchronisation mechanisms that can be discovered through source code analysis. A theoretical validation of this metric is presented, along with an empirical evaluation on several use cases. Additionally, we propose an Artificial Intelligence-based strategy to refine the evaluation of the metric by obtaining approximations for the weights used to combine the considered atomic metrics. The approach is statistical, using multiple linear regression: the dependent variable is the execution time of different concrete use cases, and the independent variables are the corresponding overhead times introduced by the considered synchronisation mechanisms, approximated through the atomic metrics. The results indicated a high degree of correlation between the dependent and independent variables. The Root Mean Square Error obtained is 0.155186; being very small, it shows that the predicted and observed values are very close.
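The weight-fitting step described above can be sketched with ordinary least squares: regress measured execution times on per-mechanism overhead estimates and read the fitted coefficients as the metric's weights. The numbers below are synthetic placeholders, not the paper's data, and the three "atomic metric" columns are illustrative.

```python
import numpy as np

# Hypothetical atomic-metric values per use case: each column estimates
# overhead from one synchronisation mechanism (e.g. locks, barriers,
# atomic operations). Rows are use cases.
X = np.array([
    [120.0, 30.0, 15.0],
    [ 80.0, 55.0, 10.0],
    [200.0, 20.0, 40.0],
    [150.0, 60.0, 25.0],
    [ 90.0, 45.0, 30.0],
])
# Measured execution times (the dependent variable) per use case.
y = np.array([310.0, 265.0, 455.0, 420.0, 300.0])

# Fit y ~ X @ w + b by least squares (intercept via an appended ones column).
A = np.column_stack([X, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

pred = A @ coef
rmse = np.sqrt(np.mean((y - pred) ** 2))
print("weights:", w, "intercept:", b, "RMSE:", rmse)
```

A small RMSE relative to the spread of the measured times is what justifies using the fitted weights in the combined metric.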
Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: 1) manycore co-processors rely on simple hardware, putting significant demands on the software programmer, and 2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this article presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naive-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2-6x performance improvement when scaled to a future 2000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units.
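The decoupled access/execute idea can be illustrated in miniature with a sparse dot product: a fused kernel interleaves indirect loads with arithmetic, while a decoupled version performs an access phase (gather all operands into a dense buffer) before a separate execute phase. This is only a conceptual sketch in Python, not the article's software DAE implementation; on in-order hardware the benefit comes from overlapping the gathered loads' latencies.

```python
import numpy as np

def fused_spmv_row(values, cols, x):
    # Fused: each multiply waits on an indirect load of x[cols[k]],
    # so long-latency loads serialise with the arithmetic.
    return sum(values[k] * x[cols[k]] for k in range(len(values)))

def decoupled_spmv_row(values, cols, x):
    # Access phase: issue all indirect loads up front into a dense
    # buffer, letting their latencies overlap on real hardware.
    gathered = x[cols]
    # Execute phase: dense arithmetic only, no pointer chasing.
    return float(np.dot(values, gathered))
```

Both functions compute the same result; the restructuring only changes when the memory accesses are issued.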
The Discrete Cosine Transform (DCT) is commonly used for image and video coding, and very efficient implementations of the forward and inverse transforms are of great importance. The popular libjpeg-turbo library contains handwritten, highly-optimised assembly language DCT implementations utilizing SIMD instruction sets for a variety of architectures. We present an alternative approach, implementing the 8x8 IDCT and FDCT in the functional image processing language Halide. We show how fewer than 200 lines of Halide can replace over 20,000 lines of code in the libjpeg-turbo library to perform JPEG encoding and decoding. The Halide implementation is compared for ARMv8 NEON and x86-64 SIMD extensions and shows a 5-25 percent performance improvement over the SIMD code in libjpeg-turbo for decoding and a 10-40 percent improvement for encoding. The Halide code is significantly easier to maintain and port to new architectures than the existing code. (c) 2022 Elsevier Inc. All rights reserved.
Run to run variability in parallel programs caused by floating-point non-associativity has been known to significantly affect reproducibility in iterative algorithms, due to accumulating errors. Non-reproducibility ca...
ISBN:
(print) 9798350381603
The Dataflow model, where instructions or tasks are fired as soon as their input data is ready, has proven to be a good fit for parallel/distributed computation. Previous works have presented DFER (Dataflow Error Recovery model), which allows transient error detection and recovery in dataflow by adding special tasks and edges to the dataflow graph itself. However, permanent faults, or faults that cause a processing element (PE) to become unresponsive, are not addressed by DFER. For those cases it is necessary to adopt a checkpointing method. Since the whole purpose of Dataflow is to achieve high levels of parallelism and exploit the potential asynchronicity between PEs, the checkpointing method adopted must be uncoordinated and distributed. Current algorithms for distributed checkpointing rely solely on guaranteeing that causality between checkpoints can be tracked. In the context of Dataflow with static scheduling, i.e., when the dataflow graph is partitioned among the available PEs at compile time, causality trackability is not sufficient, as we will show. Since static scheduling of dataflow graphs is very important in various scenarios, it calls for a new algorithm for distributed checkpointing that can be adopted for the execution of statically scheduled dataflow graphs. In this paper we describe why the ability to track causality is not enough for statically scheduled dataflow and introduce a new algorithm for distributed checkpointing specifically tailored for such a model of execution.
The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs additionally require data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI convergence. Based on this study, the paper identifies the challenges of a new workflow platform to manage complex workflows. Finally, it proposes a development approach for such a workflow platform, addressing these challenges in two directions: first, by defining a software stack that provides the functionalities to manage these complex workflows; and second, by proposing the HPC Workflow as a Service (HPCWaaS) paradigm, which leverages the software stack to facilitate the reusability of complex workflows in federated HPC infrastructures. Proposals presented in this work are subject to study and development as part of the EuroHPC eFlows4HPC project. (C) 2022 Elsevier B.V. All rights reserved.
In-memory key-value stores have quickly become a key enabling technology to build high-performance applications that must cope with massively distributed workloads. In-memory key-value stores (also referred to as NoSQL) primarily aim to offer low-latency and high-throughput data access, which motivates the rapid adoption of modern network cards supporting Remote Direct Memory Access (RDMA). In this paper, we present the fundamental design principles for exploiting RDMA in modern NoSQL systems. Moreover, we describe a breakdown analysis of state-of-the-art RDMA-based in-memory NoSQL systems regarding indexing, data consistency, and the communication protocol. In addition, we compare traditional in-memory NoSQL systems with their RDMA-enabled counterparts. Finally, we present a comprehensive analysis and evaluation of the existing systems based on a wide range of configurations, such as the number of clients, real-world request distributions, and workload read-write ratios.
ISBN:
(纸本)9798350311990
The presentation of Peachy Parallel Assignments at parallel and distributed computing education workshops is an effort to promote the reuse of high-quality assignments, both saving precious faculty time and improving the quality of course assignments. These assignments must have been used in class and are selected for being easy to adopt by other instructors and for being "cool and inspirational" so that students spend time on them and talk about them with others. The assignments and their materials are also archived on the Peachy Parallel Assignments website. In this paper, we present two new assignments. The first has students compute the Mandelbrot set in Python, combining an interesting image with Python's ease of use. The second assignment is a substantial project to implement a programming contest judge. It requires that students use many parallel and distributed computing concepts, with the added benefit of solving a "real problem" and creating software with which students may have personally interacted.
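The Mandelbrot assignment's serial core fits in a few lines of Python, which is part of its appeal as a starting point for parallelisation: every pixel's escape-time count is independent. This vectorised sketch is our own illustration, not the assignment's reference solution.

```python
import numpy as np

def mandelbrot(width=80, height=40, max_iter=50):
    """Escape-time iteration counts over a grid in the complex plane."""
    x = np.linspace(-2.0, 1.0, width)
    y = np.linspace(-1.5, 1.5, height)
    c = x[None, :] + 1j * y[:, None]          # one complex c per pixel
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=int)
    for _ in range(max_iter):
        mask = np.abs(z) <= 2.0               # points not yet escaped
        z[mask] = z[mask] ** 2 + c[mask]      # iterate z <- z^2 + c
        counts[mask] += 1
    return counts
```

Because rows (or tiles) of the grid are independent, the same kernel parallelises naturally with `multiprocessing`, `mpi4py`, or GPU array libraries, which is what makes it an attractive teaching vehicle.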