The seismic reflection method is one of the most important methods in geophysical exploration. There are three stages in a seismic exploration survey: acquisition, processing, and interpretation. This paper focuses on a pre-processing tool, the Non-Local Means (NLM) filter algorithm, which is a powerful technique that can significantly suppress noise in seismic data. However, the domain of the NLM algorithm is the whole dataset, and with 3D seismic data being very large, often exceeding one terabyte (TB), it is impossible to store all the data in Random Access Memory (RAM). Furthermore, the NLM filter would require a considerably long runtime. These factors make a straightforward implementation of the NLM algorithm on real geophysical exploration data impractical. This paper redesigns and implements the NLM filter algorithm to fit the challenges of seismic processing. The optimized implementation of the NLM filter is capable of processing production-size seismic data on modern clusters and is 87 times faster than the straightforward implementation of NLM.
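To make the "straightforward implementation" concrete: the core of NLM replaces each sample by a weighted average of samples whose surrounding patches look similar. The following is a minimal, serial 2D sketch of that idea (not the paper's cluster-scale implementation); the function name and parameters `patch`, `search`, and `h` are our own illustrative choices.

```python
import numpy as np

def nlm_filter(data, patch=1, search=3, h=0.1):
    """Denoise a 2D section with a basic Non-Local Means filter.

    Each sample is replaced by a weighted average of neighbourhood samples;
    weights decay exponentially with the squared distance between the
    patches centred on the two samples.
    """
    pad = patch + search
    padded = np.pad(data, pad, mode="reflect")
    out = np.zeros_like(data, dtype=float)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            ci, cj = i + pad, j + pad
            ref = padded[ci - patch:ci + patch + 1, cj - patch:cj + patch + 1]
            weights, values = [], []
            for di in range(-search, search + 1):
                for dj in range(-search, search + 1):
                    ni, nj = ci + di, cj + dj
                    cand = padded[ni - patch:ni + patch + 1,
                                  nj - patch:nj + patch + 1]
                    d2 = np.mean((ref - cand) ** 2)   # patch dissimilarity
                    weights.append(np.exp(-d2 / (h * h)))
                    values.append(padded[ni, nj])
            out[i, j] = np.dot(weights, values) / np.sum(weights)
    return out
```

The quadruple loop over every sample and its full search window is exactly why a naive NLM pass over terabyte-scale 3D volumes is impractical without the redesign the paper describes.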
OpenMP is the predominant standard for shared memory systems in high-performance computing (HPC), offering a tasking paradigm for parallelism. However, existing OpenMP implementations, like GCC and LLVM, face computat...
ISBN:
(print) 9783031365966; 9783031365973
The paper proposes a metric that evaluates the overhead introduced into parallel programs by the additional operations that parallelism implicitly imposes. We consider the case of multithreaded parallel programs that follow the SPMD (Single Program Multiple Data) model. Java programs were considered for this proposal, but the metric could easily be adapted to any imperative language that supports multithreading. The metric is defined as a combination of several atomic metrics covering the various synchronisation mechanisms that can be discovered through source code analysis. A theoretical validation of this metric is presented, along with an empirical evaluation on several use cases. Additionally, we propose an Artificial Intelligence-based strategy to refine the evaluation of the metric by obtaining approximations for the weights used to combine the considered atomic metrics. The approach is statistical, using multiple linear regression: the dependent variable is the execution time of different concrete use cases, and the independent variables are the corresponding overhead times introduced by the considered synchronisation mechanisms, approximated through the atomic metrics. The results indicated a high degree of correlation between the dependent and independent variables. The Root Mean Square Error obtained is 0.155186; being very small, it shows that the predicted and observed values are very close.
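The weight-fitting step described above can be sketched with ordinary least squares: regress measured execution times on per-mechanism overhead estimates and read the fitted coefficients as the metric's weights. The numbers below are synthetic placeholders, not the paper's data, and the three "atomic metric" columns are illustrative.

```python
import numpy as np

# Hypothetical atomic-metric values per use case: each column estimates
# overhead from one synchronisation mechanism (e.g. locks, barriers,
# atomic operations). Rows are use cases.
X = np.array([
    [120.0, 30.0, 15.0],
    [ 80.0, 55.0, 10.0],
    [200.0, 20.0, 40.0],
    [150.0, 60.0, 25.0],
    [ 90.0, 45.0, 30.0],
])
# Measured execution times (the dependent variable) per use case.
y = np.array([310.0, 265.0, 455.0, 420.0, 300.0])

# Fit y ~ X @ w + b by least squares (intercept via an appended ones column).
A = np.column_stack([X, np.ones(len(y))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

pred = A @ coef
rmse = np.sqrt(np.mean((y - pred) ** 2))
print("weights:", w, "intercept:", b, "RMSE:", rmse)
```

A small RMSE relative to the spread of the measured times is what justifies using the fitted weights in the combined metric.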
Future CPU-manycore heterogeneous systems can provide high peak throughput by integrating thousands of simple, independent, energy-efficient cores in a single die. However, there are two key challenges to translating this high peak throughput into improved end-to-end workload performance: 1) manycore co-processors rely on simple hardware, putting significant demands on the software programmer, and 2) manycore co-processors use in-order cores that struggle to tolerate long memory latencies. To address the manycore programmability challenge, this article presents a dense and sparse tensor processing framework based on PyTorch that enables domain experts to easily accelerate off-the-shelf workloads on CPU-manycore heterogeneous systems. To address the manycore memory latency challenge, we use our extended PyTorch framework to explore the potential for decoupled access/execute (DAE) software and hardware mechanisms. More specifically, we propose two software-only techniques, naive-software DAE and systolic-software DAE, along with a lightweight hardware access accelerator to further improve area-normalized throughput. We evaluate our techniques using a combination of PyTorch operator microbenchmarking and real-world PyTorch workloads running on a detailed register-transfer-level model of a 128-core manycore architecture. Our evaluation on three real-world dense and sparse tensor workloads suggests these workloads can achieve approximately 2-6x performance improvement when scaled to a future 2000-core CPU-manycore heterogeneous system compared to an 18-core out-of-order CPU baseline, while potentially achieving higher area-normalized throughput and improved energy efficiency compared to general-purpose graphics processing units.
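The decoupled access/execute idea can be illustrated in miniature with a sparse dot product: a fused kernel interleaves indirect loads with arithmetic, while a decoupled version performs an access phase (gather all operands into a dense buffer) before a separate execute phase. This is only a conceptual sketch in Python, not the article's software DAE implementation; on in-order hardware the benefit comes from overlapping the gathered loads' latencies.

```python
import numpy as np

def fused_spmv_row(values, cols, x):
    # Fused: each multiply waits on an indirect load of x[cols[k]],
    # so long-latency loads serialise with the arithmetic.
    return sum(values[k] * x[cols[k]] for k in range(len(values)))

def decoupled_spmv_row(values, cols, x):
    # Access phase: issue all indirect loads up front into a dense
    # buffer, letting their latencies overlap on real hardware.
    gathered = x[cols]
    # Execute phase: dense arithmetic only, no pointer chasing.
    return float(np.dot(values, gathered))
```

Both functions compute the same result; the restructuring only changes when the memory accesses are issued.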
The Discrete Cosine Transform (DCT) is commonly used for image and video coding, and very efficient implementations of the forward and inverse transforms are of great importance. The popular libjpeg-turbo library contains handwritten, highly-optimised assembly language DCT implementations utilizing SIMD instruction sets for a variety of architectures. We present an alternative approach, implementing the 8x8 IDCT and FDCT in the functional image processing language Halide. We show how fewer than 200 lines of Halide can replace over 20,000 lines of code in the libjpeg-turbo library to perform JPEG encoding and decoding. The Halide implementation is compared for ARMv8 NEON and x86-64 SIMD extensions and shows a 5-25 percent performance improvement over the SIMD code in libjpeg-turbo for decoding and a 10-40 percent improvement for encoding. The Halide code is significantly easier to maintain and port to new architectures than the existing code. (c) 2022 Elsevier Inc. All rights reserved.
Run to run variability in parallel programs caused by floating-point non-associativity has been known to significantly affect reproducibility in iterative algorithms, due to accumulating errors. Non-reproducibility ca...
ISBN:
(print) 9798350381603
The Dataflow model, where instructions or tasks are fired as soon as their input data is ready, has proven to be a good fit for parallel/distributed computation. Previous works have presented DFER (Dataflow Error Recovery model), which allows transient error detection and recovery in dataflow by adding special tasks and edges to the dataflow graph itself. However, permanent faults, or faults that cause a processing element (PE) to become unresponsive, are not addressed by DFER. For those cases it is necessary to adopt a checkpointing method. Since the whole purpose of Dataflow is to achieve high levels of parallelism and exploit the potential asynchronicity between PEs, the checkpointing method adopted must be uncoordinated and distributed. Current algorithms for distributed checkpointing rely solely on guaranteeing that causality between checkpoints can be tracked. In the context of Dataflow with static scheduling, i.e., when the dataflow graph is partitioned among the available PEs at compile time, causality trackability is not sufficient, as we will show. Since static scheduling of dataflow graphs is very important in various scenarios, it calls for a new algorithm for distributed checkpointing that can be adopted for the execution of statically scheduled dataflow graphs. In this paper we describe why the ability to track causality is not enough for statically scheduled dataflow and introduce a new algorithm for distributed checkpointing specifically tailored for such a model of execution.
The evolution of High-Performance Computing (HPC) platforms enables the design and execution of progressively larger and more complex workflow applications in these systems. The complexity comes not only from the number of elements that compose the workflows but also from the type of computations they perform. While traditional HPC workflows target simulations and modelling of physical phenomena, current needs additionally require data analytics (DA) and artificial intelligence (AI) tasks. However, the development of these workflows is hampered by the lack of proper programming models and environments that support the integration of HPC, DA, and AI, as well as the lack of tools to easily deploy and execute the workflows in HPC systems. To progress in this direction, this paper presents use cases where complex workflows are required and investigates the main issues to be addressed for the HPC/DA/AI convergence. Based on this study, the paper identifies the challenges of a new workflow platform to manage complex workflows. Finally, it proposes a development approach for such a workflow platform, addressing these challenges in two directions: first, by defining a software stack that provides the functionalities to manage these complex workflows; and second, by proposing the HPC Workflow as a Service (HPCWaaS) paradigm, which leverages the software stack to facilitate the reusability of complex workflows in federated HPC infrastructures. Proposals presented in this work are subject to study and development as part of the EuroHPC eFlows4HPC project. (C) 2022 Elsevier B.V. All rights reserved.
In-memory key-value stores have quickly become a key enabling technology to build high-performance applications that must cope with massively distributed workloads. In-memory key-value stores (also referred to as NoSQL) primarily aim to offer low-latency and high-throughput data access, which motivates the rapid adoption of modern network cards supporting Remote Direct Memory Access (RDMA). In this paper, we present the fundamental design principles for exploiting RDMA in modern NoSQL systems. Moreover, we describe a breakdown analysis of state-of-the-art RDMA-based in-memory NoSQL systems regarding indexing, data consistency, and the communication protocol. In addition, we compare traditional in-memory NoSQL systems with their RDMA-enabled counterparts. Finally, we present a comprehensive analysis and evaluation of the existing systems based on a wide range of configurations, such as the number of clients, real-world request distributions, and workload read-write ratios.
ISBN:
(纸本)9798350311990
The presentation of Peachy Parallel Assignments at parallel and distributed computing education workshops is an effort to promote the reuse of high-quality assignments, both saving precious faculty time and improving the quality of course assignments. These assignments must have been used in class and are selected for being easy to adopt by other instructors and for being "cool and inspirational" so that students spend time on them and talk about them with others. The assignments and their materials are also archived on the Peachy Parallel Assignments website. In this paper, we present two new assignments. The first has students compute the Mandelbrot set in Python, combining an interesting image with Python's ease of use. The second assignment is a substantial project to implement a programming contest judge. It requires that students use many parallel and distributed computing concepts, with the added benefit of solving a "real problem" and creating software with which students may have personally interacted.
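The Mandelbrot assignment's serial core fits in a few lines of Python, which is part of its appeal as a starting point for parallelisation: every pixel's escape-time count is independent. This vectorised sketch is our own illustration, not the assignment's reference solution.

```python
import numpy as np

def mandelbrot(width=80, height=40, max_iter=50):
    """Escape-time iteration counts over a grid in the complex plane."""
    x = np.linspace(-2.0, 1.0, width)
    y = np.linspace(-1.5, 1.5, height)
    c = x[None, :] + 1j * y[:, None]          # one complex c per pixel
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=int)
    for _ in range(max_iter):
        mask = np.abs(z) <= 2.0               # points not yet escaped
        z[mask] = z[mask] ** 2 + c[mask]      # iterate z <- z^2 + c
        counts[mask] += 1
    return counts
```

Because rows (or tiles) of the grid are independent, the same kernel parallelises naturally with `multiprocessing`, `mpi4py`, or GPU array libraries, which is what makes it an attractive teaching vehicle.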