ISBN:
(print) 9781939133359
Recently, many applications have required the ability to perform dynamic graph analytical processing (GAP) tasks in real time on the datasets generated by relational OLTP. To meet the two key requirements of performance and freshness, this paper presents GART, an in-memory system that extends hybrid transactional/analytical processing (HTAP) systems to support GAP, resulting in hybrid transactional and graph analytical processing (HTGAP). GART fulfills two unique goals not encountered by HTAP systems. First, to flexibly adapt to rich workloads, GART proposes transparent data model conversion through graph extraction interfaces, which define rules for relational-graph mapping. Second, to ensure GAP performance, GART proposes an efficient dynamic graph storage with good locality that stems from key insights into HTGAP workloads, including (1) an efficient and mutable compressed sparse row (CSR) representation that guarantees the locality of edge scans, (2) a coarse-grained multi-version concurrency control (MVCC) scheme that reduces the temporal and spatial overhead of versioning, and (3) a flexible property storage for efficiently running different GAP workloads. Evaluations show that GART performs several orders of magnitude better than existing solutions in terms of freshness or performance. Meanwhile, for GAP workloads on the LDBC SNB dataset, GART outperforms the state-of-the-art general-purpose dynamic graph storage (i.e., LiveGraph) by up to 4.4x.
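The interplay of a mutable CSR and coarse-grained MVCC can be conveyed with a small sketch. The block below is a hypothetical Python model, not GART's implementation: per-vertex edge segments keep neighbor scans sequential, and a single epoch counter per ingest batch (rather than a version per edge) stands in for the coarse-grained versioning; all class and method names are invented for illustration.

```python
# Hypothetical sketch of a mutable CSR-like store with coarse-grained MVCC,
# loosely modeled on the ideas in the abstract above (not GART's actual code).

class EdgeBlock:
    """Contiguous per-vertex edge segment: appends preserve scan locality."""
    def __init__(self):
        self.dst = []           # destination vertex ids, append-only
        self.create_epoch = []  # epoch when each edge became visible
        self.delete_epoch = []  # epoch when each edge was deleted (None = live)

class MutableCSR:
    def __init__(self, num_vertices):
        self.blocks = [EdgeBlock() for _ in range(num_vertices)]
        self.epoch = 0          # one version per ingest batch, not per edge

    def apply_batch(self, inserts, deletes):
        """Install a batch of updates under a single new epoch (coarse MVCC)."""
        self.epoch += 1
        for src, dst in inserts:
            b = self.blocks[src]
            b.dst.append(dst)
            b.create_epoch.append(self.epoch)
            b.delete_epoch.append(None)
        for src, dst in deletes:
            b = self.blocks[src]
            for i in range(len(b.dst)):
                if b.dst[i] == dst and b.delete_epoch[i] is None:
                    b.delete_epoch[i] = self.epoch
                    break

    def neighbors(self, v, read_epoch):
        """Sequential scan of one block: a snapshot read at `read_epoch`."""
        b = self.blocks[v]
        return [b.dst[i] for i in range(len(b.dst))
                if b.create_epoch[i] <= read_epoch
                and (b.delete_epoch[i] is None or b.delete_epoch[i] > read_epoch)]
```

A reader that records `g.epoch` before new batches arrive can keep scanning that snapshot while writers append, which is the freshness/consistency trade the abstract describes.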
The ubiquity of networking infrastructure in modern life necessitates scrutiny into networking fundamentals to ensure the safety and security of that infrastructure. The formalization of concurrent algorithms, a corne...
In distributed stochastic optimization, where parallel and asynchronous methods are employed, we establish optimal time complexities under virtually any computation behavior of workers/devices/CPUs/GPUs, capturing pot...
ISBN:
(digital) 9798350355543
ISBN:
(print) 9798350355550
This paper addresses the challenges of optimizing task scheduling for a distributed, task-based execution model in OpenMP for cluster computing environments. Traditional OpenMP implementations are primarily designed for shared-memory parallelism and offer limited control over task scheduling. However, improved scheduling mechanisms are critical to achieving performance and portability in distributed and heterogeneous environments. OpenMP Cluster (OMPC) was introduced to overcome these limitations, extending OpenMP with the Heterogeneous Earliest Finish Time (HEFT) task scheduling algorithm tailored for large-scale systems. To improve scheduling and enable better system utilization, the runtime system must resolve challenges such as changes in application balance, the amount of parallelism, and varying communication. This work presents three key contributions: first, the refactoring of the OMPC runtime to unify task scheduling across devices and hosts; second, the optimization of the HEFT-based scheduling algorithm to ensure efficient task execution in distributed environments; and third, an extensive evaluation of Work Stealing and HEFT scheduling mechanisms on real-world clusters. While the HEFT implementation in OMPC is not fully optimized, this work provides a significant step toward improving distributed task scheduling in cluster computing, offering insights and incremental advancements that support the development of scalable and high-performance applications. Results show improvements of up to 24% in scheduling time while opening the door to further extensions of the scheduling methods.
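As context for the HEFT discussion, here is a minimal, self-contained sketch of the classic HEFT algorithm: rank tasks by upward rank, then place each task on the device with the earliest finish time. It is illustrative only, assumes a simple cost model with no insertion-based slot search, and is not the OMPC runtime's implementation; all names are hypothetical.

```python
# Minimal HEFT sketch (illustrative, not OMPC's implementation).
def heft(tasks, succ, cost, comm, devices):
    """tasks: list of ids; succ[t]: successor ids; cost[t][d]: run time of t
    on device d; comm[(t, s)]: transfer time if t and s land on different
    devices."""
    avg = {t: sum(cost[t][d] for d in devices) / len(devices) for t in tasks}

    rank = {}
    def upward_rank(t):  # longest path from t to an exit task
        if t not in rank:
            rank[t] = avg[t] + max((comm[(t, s)] + upward_rank(s)
                                    for s in succ[t]), default=0.0)
        return rank[t]

    device_free = {d: 0.0 for d in devices}   # next free time per device
    finish, placed = {}, {}
    for t in sorted(tasks, key=upward_rank, reverse=True):
        best = None
        for d in devices:
            # Data from predecessors on other devices pays a transfer cost.
            ready = max((finish[p] + (comm[(p, t)] if placed[p] != d else 0.0)
                         for p in tasks if t in succ[p] and p in finish),
                        default=0.0)
            eft = max(ready, device_free[d]) + cost[t][d]
            if best is None or eft < best[0]:
                best = (eft, d)
        finish[t], placed[t] = best
        device_free[best[1]] = best[0]
    return placed, finish

# Two tasks on two devices: t0 feeds t1 with transfer cost 3.
tasks = ["t0", "t1"]
succ = {"t0": ["t1"], "t1": []}
cost = {"t0": {"d0": 4, "d1": 6}, "t1": {"d0": 5, "d1": 2}}
comm = {("t0", "t1"): 3.0}
print(heft(tasks, succ, cost, comm, ["d0", "d1"]))
```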
A program is deterministic if multiple re-executions with the same inputs always lead to the same state. Even concurrent instances of a deterministic program should observe identical behavior, in real time, if ass...
ISBN:
(digital) 9798350369083
ISBN:
(print) 9798350369090
The work explores the demand for Big Data processing and delves into the functioning of large-scale data processing architectures, focusing on batch and real-time processing. The experiment conducted analyzes the execution time of the Apache Hadoop (AH), Apache Spark (AS), and Apache Flink (AF) tools. Results indicate that Spark outperformed Flink and Hadoop across all experiments, demonstrating notable speed advantages. In the first experiment with a 1 GB data source, Spark was 186% faster than Flink and 251% faster than Hadoop. Similarly, in the second experiment with a 3 GB data source, Spark surpassed both competitors, being 233% faster than Hadoop and 334% faster than Flink. Processing a 5 GB data source further highlighted Spark's superiority, with a 197% improvement over Hadoop and 316% over Flink. Despite the absence of parallelism and a distributed execution environment, the findings consistently show that Spark achieved superior performance in batch processing large datasets in a pseudo-distributed cluster setting.
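For readers unfamiliar with the workloads such comparisons time, a representative batch job is a word count over a large text file. The sketch below is a generic PySpark example with hypothetical paths and app name; the abstract does not specify the exact job used.

```python
# Illustrative PySpark word-count batch job of the kind such benchmarks time
# (the input/output paths and app name are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/input-1gb.txt")
counts = (lines.flatMap(lambda line: line.split())   # tokenize each line
               .map(lambda word: (word, 1))          # emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # sum counts per word
counts.saveAsTextFile("hdfs:///data/wordcount-out")
spark.stop()
```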
ISBN:
(digital) 9798350364606
ISBN:
(print) 9798350364613
Breaking from the general run of Laplacian solvers that depend on algebraic primitives, we present the first GPU implementation of a message-passing-based solver. Our solver, GPU-LSolve, implements a randomized algorithm that simulates a queueing network in which some nodes act as sources that generate messages and one node acts as a sink that removes messages from the network. The steady state of this network provides a solution to the Laplacian system of equations. We show how the simplicity of the primitives of this algorithm can be leveraged in a GPU setting to provide an efficient implementation that can solve Laplacian systems on million-scale graphs. Our solver takes advantage of GPU parallelism through sorting and key-value reduction. We provide an extensive experimental evaluation on real datasets against several recently developed solvers. The results show that the presented solver remains competitive in terms of both memory footprint and execution time.
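The queueing-network idea can be conveyed with a toy, CPU-only Monte Carlo simulation. The sketch below is our own illustration, not GPU-LSolve: sources inject messages each round, every message steps to a uniformly random neighbor, the sink absorbs arrivals, and the time-averaged occupancy divided by degree is read off as an (unnormalized) solution estimate. Convergence and normalization details are deliberately glossed over.

```python
# Toy Monte Carlo sketch of the queueing-network intuition (not GPU-LSolve).
import random
from collections import Counter

def simulate(adj, sources, sink, rounds=6000, burn_in=2000):
    """adj: node -> list of neighbors; sources: node -> messages injected per
    round; returns time-averaged occupancy / degree per node."""
    queue = Counter()   # messages currently held at each node
    acc = Counter()     # occupancy accumulator after burn-in
    for t in range(rounds):
        for u, rate in sources.items():   # sources generate messages
            queue[u] += rate
        nxt = Counter()
        for u, n in queue.items():
            for _ in range(n):
                v = random.choice(adj[u])
                if v != sink:             # the sink absorbs arriving messages
                    nxt[v] += 1
        queue = nxt
        if t >= burn_in:
            for u, n in queue.items():
                acc[u] += n
    T = rounds - burn_in
    return {u: acc[u] / (T * len(adj[u])) for u in adj}

# Example: path graph 0-1-2 with a source at node 0 and the sink at node 2.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(simulate(adj, sources={0: 5}, sink=2))
```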
Modern Machine Learning (ML) training on large-scale datasets is a very time-consuming workload. It relies on the optimization algorithm Stochastic Gradient Descent (SGD) due to its effectiveness, simplicity, and generalization performance (i.e., test performance on unseen data). Processor-centric architectures (e.g., CPUs, GPUs) commonly used for modern ML training workloads based on SGD are bottlenecked by data movement between the processor and memory units due to the poor data locality in accessing large training datasets. As a result, processor-centric architectures suffer from low performance and high energy consumption while executing ML training workloads. Processing-In-Memory (PIM) is a promising solution to alleviate the data movement bottleneck by placing the computation mechanisms inside or near memory. Several prior works propose PIM techniques to accelerate ML training; however, they either do not consider real-world PIM systems or evaluate algorithms that are not widely used in modern ML training. Our goal is to understand the capabilities and characteristics of popular distributed SGD algorithms on real-world PIM systems to accelerate data-intensive ML training workloads. To this end, we 1) implement several representative centralized parallel SGD algorithms, i.e., those based on a central node responsible for synchronization and orchestration, on the real-world general-purpose UPMEM PIM system, 2) rigorously evaluate these algorithms for ML training on large-scale datasets in terms of performance, accuracy, and scalability, 3) compare to conventional CPU and GPU baselines, and 4) discuss implications for future PIM hardware. We highlight the need for a shift to an algorithm-hardware codesign to enable decentralized parallel SGD algorithms in real-world PIM systems, which significantly reduces the communication cost and improves scalability. Our results demonstrate three major findings: 1) The general-purpose UPMEM PIM system can be a viable alternative…
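As a concrete reference point for "centralized parallel SGD" (a central node that averages worker gradients each step), here is a minimal NumPy sketch on a least-squares problem. It simulates the workers sequentially and does not model the UPMEM PIM offload, communication, or the paper's actual kernels; all names and hyperparameters are illustrative.

```python
# Hypothetical sketch of centralized synchronous parallel SGD on least squares.
import numpy as np

def parallel_sgd(X, y, workers=4, epochs=50, lr=0.1, batch=32, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    shards = np.array_split(rng.permutation(len(X)), workers)  # shard per worker
    for _ in range(epochs):
        grads = []
        for idx in shards:                       # each "worker": local gradient
            b = rng.choice(idx, size=min(batch, len(idx)), replace=False)
            err = X[b] @ w - y[b]
            grads.append(X[b].T @ err / len(b))
        w -= lr * np.mean(grads, axis=0)         # central node averages, updates
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X @ np.arange(1.0, 6.0)
print(parallel_sgd(X, y))   # should approach [1. 2. 3. 4. 5.]
```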
ISBN:
(digital) 9798350361612
ISBN:
(print) 9798350361629
Research shows that consumers want to know their appliances' energy consumption. Providing consumers with more information about their energy use, and giving them more control over it, can lead to choices that reduce energy consumption. This research proposes a sensor network of Smart Electricity Meters (SEMs) that uses a Data Distribution Service (DDS) publish/subscribe middleware (DPSM) for real-time, fine-grained sensing of electric loads. The SEMs are connected to household appliances to measure their power consumption and send it to a server for storage and data analysis. Our results indicate an average throughput of 302 bytes/second and an average latency of 25.44 ms. The proposed solution demonstrates promise for real-time monitoring of household appliances' power consumption in Smart Grid (SG) environments.
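To make the publish/subscribe data flow concrete, the sketch below shows the pattern in plain Python with an in-process broker. It is not the DDS (DPSM) API: real DDS middleware adds discovery, QoS policies, and network transport that this toy omits, and the topic name and sample format are invented.

```python
# Generic topic-based publish/subscribe sketch (not the DDS API itself).
import time
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, sample):
        for cb in self.subscribers[topic]:     # deliver to every subscriber
            cb(sample)

broker = Broker()
# The server-side subscriber that would store samples for analysis.
broker.subscribe("power/kitchen", lambda s: print("server stored:", s))

# A meter publishing one power-consumption sample (values are illustrative).
broker.publish("power/kitchen", {"watts": 812.5, "ts": time.time()})
```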