Copyright and Reprint Permissions: Abstracting is permitted with credit to the source. Libraries may photocopy beyond the limits of US copyright law, for private use of patrons, those articles in this volume that carry a code at the bottom of the first page, provided that the per-copy fee indicated in the code is paid through the Copyright Clearance Center. The papers in this book comprise the proceedings of the meeting mentioned on the cover and title page. They reflect the authors' opinions and, in the interests of timely dissemination, are published as presented and without change. Their inclusion in this publication does not necessarily constitute endorsement by the editors or the Institute of Electrical and Electronics Engineers, Inc.
Author:
Moreno Marzolla, Dept. of Computer Science and Engineering (DISI) and Center for Inter-Department Industrial Research ICT, University of Bologna, Bologna, Italy
ISBN: 9798331524937 (digital)
ISBN: 9798331524944 (print)
Mini-applications are widely used in parallel computing for testing and benchmarking purposes. However, many existing mini-applications are not suitable for teaching, since they require advanced knowledge of algebra, numerical analysis or physics to be fully understood, which might be beyond the reach of beginners. In this paper we describe a set of programming assignments, called parallel etudes, that have been used in recent years for teaching High Performance Computing at the undergraduate level. These applications are self-contained, self-documenting, and short. They are drawn from more familiar domains such as 3D rendering, simulation, image processing and simple physics models, to be more accessible to students without a strong mathematical background. The mini-applications target shared-memory, distributed-memory and GPU programming. The analysis of the students' feedback and final grades provides indirect support for the effectiveness of the etudes.
The design of sparse matrix storage formats is essential to achieve high-performance sparse kernels in modern parallel architectures. The bitmap-based bmSparse format was designed with the SpGEMM operation as its main focus, but it shows potential for other operations as well when the sparse matrix has a convenient structure. In this paper, we propose a new SpMV kernel that greatly improves the load balance of previous open-source implementations and utilizes the GPU resources more efficiently. The results show speedups of up to $100\times$ over other SpMV kernels for bmSparse. Finally, we present a hybrid implementation that combines the new and existing approaches, selecting the kernel that is likely to perform better based on the characteristics of each matrix.
Massive Open Online Courses (MOOCs) represent an accessible and user-friendly tool for disseminating innovative and cutting-edge topics to broad segments of civil society via online learning platforms, enabling users to learn at their own pace and on their own schedule. In this contribution, we describe the design and implementation of a Massive Open Online Course on Parallel Computing and High-Performance Computing, developed for Federica Web Learning, the University Center for innovation, experimentation, and dissemination of multimedia teaching at the University of Naples Federico II.
Modeling the performance of real-world applications at scale is essential for designing next-generation platforms and shaping the development of future algorithms. However, accurately capturing the complexity of application execution graphs and their interaction with large-scale hardware environments remains a significant challenge. In recent years, several frameworks have been developed to tackle this issue by providing tools to simulate and analyze complex workloads on distributed systems. This paper focuses on seismic wave propagation problems as a representative use case to explore the challenges of modeling at scale. We employ the SimGrid simulation toolkit, a versatile framework for simulating distributed systems, to analyze the performance of large-scale applications. Particular emphasis is placed on the role of critical networking characteristics, such as bandwidth and topology, in influencing overall scalability and performance.
Incomplete factorization methods are powerful algebraic preconditioners widely used to accelerate the convergence of linear solvers. The parallelization of ILU methods has been extensively studied, particularly for GPUs, which are ubiquitous parallel computing devices. In recent years, synchronization-free methods have become the mainstream approach for solving sparse triangular linear systems. Although the sparse triangular solver and ILU factorization are closely related, the application of synchronization-free strategies to ILU factorization has not been explored in the literature to the same extent as the triangular solver. In this work, we present synchronization-free implementations of the ILU-0 preconditioner on GPUs. Specifically, we propose three implementations that vary in how row updates are handled after each coefficient elimination, as well as an additional approach that leverages a prior level-set analysis to optimize the execution schedule.
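For readers unfamiliar with ILU-0, the following is a minimal sequential sketch of the textbook row-wise factorization, in which each coefficient elimination is followed by a row update restricted to the original nonzero pattern. It is not the synchronization-free GPU implementation described in the abstract. The function name `ilu0` and the dense list-of-lists storage are assumptions made for illustration only.

```python
def ilu0(a):
    """Sequential ILU(0) of a square matrix given as a list of lists.

    Returns one matrix holding L (strictly lower part, unit diagonal implied)
    and U (upper part). Updates that would create a nonzero outside the
    sparsity pattern of `a` are discarded, which is what makes the
    factorization "incomplete". Assumes nonzero pivots.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy
    pattern = [[a[i][j] != 0 for j in range(n)] for i in range(n)]
    for i in range(1, n):
        for k in range(i):
            if not pattern[i][k]:
                continue
            # Coefficient elimination: compute the multiplier l_ik.
            lu[i][k] /= lu[k][k]
            # Row update, restricted to positions already in the pattern.
            for j in range(k + 1, n):
                if pattern[i][j]:
                    lu[i][j] -= lu[i][k] * lu[k][j]
    return lu


if __name__ == "__main__":
    matrix = [[4.0, 1.0, 1.0],
              [1.0, 4.0, 0.0],
              [1.0, 0.0, 4.0]]
    for row in ilu0(matrix):
        print(row)
```

In this example a complete LU factorization would fill positions (1, 2) and (2, 1), whereas ILU-0 leaves them at zero. The dependency of row `i` on every earlier row `k` with a nonzero in column `k` is the same kind of dependency that a level-set analysis groups into levels.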
In the Cloud-Edge Continuum, dynamic infrastructure changes and variable workloads complicate efficient resource management. Centralized methods can struggle to adapt, whilst purely decentralized policies lack global oversight. This paper proposes a hybrid framework using Graph Neural Network (GNN) embeddings and collaborative multi-agent reinforcement learning (MARL). Local agents handle neighbourhood-level decisions, and a global orchestrator coordinates system-wide. This work contributes to decentralized application placement strategies with centralized oversight, GNN integration and collaborative MARL for efficient, adaptive and scalable resource management.
Removing noise in digital images is a fundamental operation that arises in many application domains. In this paper we consider the median filter, a filtering technique that replaces the color of each pixel with the median of those in a square neighborhood of fixed radius. For some use cases, the size of the neighborhood or the image depth may be large, making existing algorithms either too slow or not applicable at all due to excessive memory requirements. We describe architecture-specific optimizations that enable the computation of the median filter with arbitrary window size and image depth on multicore processors and GPUs. We report preliminary results that indicate that the parallel implementations are suitable for practical use, with the GPU version outperforming the CPU version.
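The definition of the median filter given above can be illustrated with a naive sequential sketch. This is only the baseline definition, not the optimized multicore or GPU algorithms of the paper. The function name `median_filter`, the grayscale list-of-lists image, the clipping of the window at the image borders, and the use of the lower median for even-sized windows are all assumptions made for illustration.

```python
from statistics import median_low


def median_filter(image, radius):
    """Naive median filter on a 2D list of grayscale values.

    Each output pixel is the (lower) median of the input pixels in the
    square window of the given radius centered on it. The window is
    clipped where it extends past the image borders.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = median_low(window)
    return out


if __name__ == "__main__":
    noisy = [[10, 10, 10],
             [10, 255, 10],
             [10, 10, 10]]
    print(median_filter(noisy, 1))  # the isolated 255 is replaced by 10
```

Each pixel costs time proportional to the window area, since the window is gathered and sorted from scratch. This is why large radii make the naive approach slow and motivate more careful algorithms.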
Parallel distributed applications running on large-scale high-performance computing systems depend on effective point-to-point and collective communication to meet performance goals. Beginning with version 4.0, the Message Passing Interface (MPI) introduced the partitioned communication API, providing tools for addressing communication bottlenecks raised by hybrid communication models. This API allows individual actors (CPU threads, GPU threads, etc.) to initiate communication on portions of complete buffers, enabling additional communication/computation overlap. Intuitively, partitioned communication could benefit from network-level support: if there are multiple paths between endpoints, an MPI-aware network could disperse partitions across these paths, avoiding the data serialization entailed by a dependency on a single path. The Cerio Rockport Ethernet Fabric has the ability to expose this capability to communication middleware. In this work we develop this capability to allow for user-level path selection for MPI partitioned communication and explore how this capability impacts point-to-point performance, collective design, and Allreduce efficiency in a Large Language Model task.