ISBN: (Print) 9798350364613; 9798350364606
It has been a decade since the ACM/IEEE CS2013 curriculum guidelines recommended that all CS students learn about parallel and distributed computing (PDC). But few textbooks for "core" CS courses, especially first-year courses, include coverage of PDC topics. To fill this gap, we have written free, online, beginner- and intermediate-level PDC textbooks containing interactive C/C++ OpenMP, MPI, mpi4py, CUDA, and OpenACC code examples that students can run and modify directly in the browser. The books address a serious challenge to teaching PDC concepts, namely, easy access to the powerful hardware needed for observing patterns and scalability. This paper describes the content of these textbooks and the underlying infrastructure that makes them possible. We believe the described textbooks fill a critical gap in PDC education and will be very useful for the community.
ISBN: (Print) 9783031856372
The proceedings contain 24 papers. The special focus in this conference is on Parallel and Distributed Processing Techniques. The topics include: Parallel N-Body Performance Comparison: Julia, Rust, and More; REFT: Resource-Efficient Federated Training Framework for Heterogeneous and Resource-Constrained Environments; An Efficient Data Provenance Collection Framework for HPC I/O Workloads; Using Minicasts for Efficient Asynchronous Causal Unicast and Byzantine Tolerance; A Comparative Study of Two Matrix Multiplication Algorithms Under Current Hardware Architectures; Is Manual Code Optimization Still Required to Mitigate GPU Thread Divergence? Applying a Flattening Technique to Observe Performance; Towards Automatic, Predictable and High-Performance Parallel Code Generation; Attack Graph Generation on HPC Clusters; Analyzing the Influence of File Formats on I/O Patterns in Deep Learning; Inference of Cell–Cell Interactions Through Spatial Transcriptomics Data Using Graph Convolutional Neural Networks; Natural Product-Like Compound Generation with Chemical Language Models; Improved Early–Modern Japanese Printed Character Recognition Rate with Generated Characters; Improved Method for Similar Music Recommendation Using Spotify API; Reconfigurable Virtual Accelerator (ReVA) for Large-Scale Acceleration Circuits; Building Simulation Environment of Reconfigurable Virtual Accelerator (ReVA); Vector Register Sharing Mechanism for High Performance Hardware Acceleration; Efficient Compute Resource Sharing of RISC-V Packed-SIMD Using Simultaneous Multi-threading; Introducing Competitive Mechanism to Differential Evolution for Numerical Optimization; Hyper-heuristic Differential Evolution with Novel Boundary Repair for Numerical Optimization; Jump Like a Frog: Optimization of Renewable Energy Prediction in Smart Grid Based on Ultra Long Term Network; Vision Transformer-Based Meta Loss Landscape Exploration with Actor-Critic Method; Fast Computation Method for Stopping Condition of Range Restricted
ISBN: (Print) 9798400704437
The 2024 DEBS Grand Challenge addresses the topic of hard-drive failure predictive maintenance through analysis of data streams that contain SMART readings reported by drives located in different groups of storage servers. This paper details the technical implementation of a solution that focuses primarily on parallelizing the data stream processing to obtain vertical scalability. When processing two queries on this topic under a maximum response-latency threshold of 16 ms, our solution obtained a throughput of about 57% of the maximum possible when no processing is performed on the data stream. We also describe an initial work-in-progress implementation of a distributed extension that relies on Apache Kafka, meant to further scale the throughput of the parallel solution and to address possible failure conditions when retrieving the input stream.
ISBN: (Print) 9798350383225
We propose SCoOL, a programming model and its corresponding parallel runtime systems for implementing optimization problem solvers. In SCoOL, users specify what task is performed for a point in a given search space and what global information should be maintained during the search. The resulting optimization program is then efficiently executed in BSP style on shared- or distributed-memory computers by a parallel runtime provided with the model. In the paper, we show details of our scalable runtime for distributed-memory clusters, including algorithms for work stealing and task rebalancing. To benchmark the platform, we implement solutions to several optimization problems and provide performance analysis for the Quadratic Assignment Problem, Parent Set Assignment, and Bayesian Network Structure Learning. Our solvers show strong scaling on a cluster with 1,280 cores, significantly outperforming the current state-of-the-art solvers in Bayesian network learning.
ISBN: (Print) 9798350386066; 9798350386059
The cloud-edge collaborative computing paradigm is a promising solution for high-resolution video analytics systems. The key lies in reducing redundant data and managing fluctuating inference workloads effectively. Previous work has focused on extracting regions of interest (RoIs) from videos and transmitting them to the cloud for processing. However, a naive Infrastructure-as-a-Service (IaaS) resource configuration falls short in handling highly fluctuating workloads, leading to violations of Service Level Objectives (SLOs) and inefficient resource utilization. Besides, these methods neglect the potential benefits of RoI batching to leverage parallel processing. In this work, we introduce Tangram, an efficient serverless cloud-edge video analytics system fully optimized for both communication and computation. Tangram adaptively aligns the RoIs into patches and transmits them to the scheduler in the cloud. The system employs a unique "stitching" method to batch patches of various sizes from the edge cameras. Additionally, we develop an online SLO-aware batching algorithm that judiciously determines the optimal invoking time of the serverless function. Experiments on our prototype reveal that Tangram can reduce bandwidth consumption and computation cost by up to 74.30% and 66.35%, respectively, while maintaining SLO violations within 5% and negligible accuracy loss.
ISBN: (Print) 9798350363074; 9798350363081
As the world becomes more connected and new digital services emerge at a fast pace, the amount of network traffic increases rapidly. Consequently, processing requirements become more varied and drive the need for flexible packet-processing designs, especially as in-network computing gains traction. Traditional approaches deploy hardware accelerators in a pipeline in the sequence that the associated tasks are supposed to be executed. Hence, they do not accommodate flows with different processing requirements and provide no possibility to remap flows to task sequences at runtime. To address these limitations, we propose FlexRoute, a fast, flexible and priority-aware packet-processing design that can process network traffic at a rate of over 100 Gbit/s on FPGAs. Our design consists of a reconfigurable parser and several processing engines arranged in a pipeline. The processing engines are equipped with processing units that execute specific tasks, flexible forwarding logic, and priority-aware queuing/scheduling logic. We implement a prototype of FlexRoute in Verilog and evaluate it via cycle-accurate register-transfer-level simulations. We also synthesize and implement our design on the Alveo U55C High Performance Compute Card and report its resource usage. The evaluation results demonstrate that FlexRoute can process packets of arbitrary size with different processing requirements at a traffic rate of about 70 Gbit/s, significantly faster than two state-of-the-art flexible packet-processing designs.
ISBN: (Print) 9798331541378
The increasing quality and availability of Quantum Processing Units (QPUs) is fueling a growing interest in quantum computing across many technological areas. The resulting increase in demand for QPU resources necessitates that Quantum-Computing-as-a-Service (QCaaS) providers support a high throughput of quantum workloads. A major runtime bottleneck in current QCaaS software stacks is the computationally intensive compilation step. To address this, Oxford Quantum Circuits has introduced distributed compilation, whereby quantum programs are compiled in parallel and stored until the QPU is available. This replaces our previous serial compilation approach, where each program was compiled immediately prior to execution. From experiments using our production compilers and a simulated backend representing the QPU, we show that distributed compilation has resulted in a 78% reduction in processing time as compared to serial compilation. This demonstrates that sizeable gains in program throughput are attainable through the introduction of distributed compilation into a QCaaS architecture. We posit that the usefulness of this feature will only grow given the increasing complexity of quantum programs and the growing popularity of quantum-classical hybrid algorithms.
With ever-increasing demand for power, there is a paradigm shift from integrated grids to microgrids to cater to small communities or islands. Microgrids are the need of the hour as they comprise distributed generati...
ISBN: (Print) 9798350364613; 9798350364606
In this paper, we share our experience in teaching parallel algorithms with the binary-forking model. With hardware advances, multicore computers are now ubiquitous. This has created a substantial demand in both research and industry to harness the capabilities of parallel computing. It is thus important to incorporate parallelism in computer science education, especially in the early stages of the curriculum. However, it is commonly believed that understanding and using parallelism requires a deep understanding of computer systems and architecture, which complicates introducing parallelism to young students and non-experts. We propose to use the binary-forking model, introduced in our previous research, in teaching parallel algorithms. This model is meant to capture the performance of algorithms on modern multicore shared-memory machines; it is a simple abstraction that isolates algorithm design ideas from system-level details. The abstraction allows for simple analysis based on the work-span model in theory, and can be directly implemented as parallel programs in practice. In this paper, we briefly overview some basic primitives in this model and provide a list of algorithms that we believe are well suited to parallel algorithms courses.
We develop a distributed-memory parallel algorithm for performing batch updates on streaming graphs, where vertices and edges are continuously added or removed. Our algorithm leverages distributed sparse matrices as t...