ISBN: (Print) 9781665473156
Graph clustering is an important technique for detecting community clusters in complex networks. SCAN (Structural Clustering Algorithm for Networks) is a well-studied graph clustering algorithm that has been widely applied over the years. However, the processing time of sequential SCAN and its variants is intolerable on large graphs. Existing parallel variants of SCAN focus on fully utilizing the computing capacity of multi-core architectures and on sophisticated optimization techniques for a single computing node. As the objects and relationships in cyberspace vary over time, the scale of graph data is increasing at a high rate. Graph clustering algorithms on a single node face challenges from limited computing resources, such as computing performance, memory size, and storage volume. A distributed algorithm is called for to process large graphs. This work presents a distributed structural graph clustering algorithm using Spark. Furthermore, edge pruning and adaptive checking are optimized to improve clustering efficiency, and label propagation clustering is simplified to reduce communication cost in the distributed clustering iterations. We also conduct extensive experiments on real-world datasets to verify the efficiency and scalability of the distributed algorithm. Experimental results show that efficient clustering performance is achieved and that the algorithm scales well under different settings.
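At the core of SCAN is the structural similarity between two adjacent vertices, computed over their closed neighborhoods; edges whose similarity falls below a threshold ε can be pruned, which is the idea behind the edge-pruning optimization mentioned above. The following is a minimal illustrative sketch of that measure on a toy graph, not the paper's Spark implementation; the function name and the example graph are our own.

```python
import math

def structural_similarity(adj, u, v):
    """SCAN structural similarity: |N[u] ∩ N[v]| / sqrt(|N[u]| * |N[v]|),
    where N[x] is the closed neighborhood of x (neighbors plus x itself)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Toy graph: a triangle {0, 1, 2} with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}

# Edges inside the triangle are structurally similar (similarity 1.0),
# while the pendant edge (2, 3) scores lower and is a pruning candidate
# under a sufficiently high threshold ε.
print(structural_similarity(adj, 0, 1))
print(structural_similarity(adj, 2, 3))
```

In a distributed setting, each similarity computation depends only on the two endpoint neighborhoods, which is what makes per-edge pruning amenable to Spark-style data parallelism.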
ISBN: (Print) 9798400718021
The proceedings contain 19 papers. The topics discussed include: structures and techniques for streaming dynamic graph processing on decentralized message-driven systems; interference-aware function inlining for code size reduction; the rewriting of the DataRaceBench benchmark for OpenCL program validations; support for post-quantum cryptography with SIMD Everywhere on RISC-V architectures; substitution of kernel functions based on pattern matching on schedule trees; fusing depthwise and pointwise convolutions for efficient inference on GPUs; design of a decentralized Web3 access interface; a distributed particle swarm optimization algorithm based on Apache Spark for asynchronous parallel training of deep neural networks; and graph federated learning with center moment constraints for node classification.
ISBN: (Print) 9798350364613; 9798350364606
Roulette wheel selection is a critical process in heuristic algorithms, enabling the probabilistic choice of items based on assigned fitness values: an item is selected with probability proportional to its fitness value. This technique is commonly employed in ant-colony algorithms to randomly determine the next city to visit when solving the traveling salesman problem. Our study focuses on parallel algorithms that select one of multiple processors, each associated with a fitness value, using roulette wheel selection. We propose a novel approach called logarithmic random bidding, which achieves an expected runtime logarithmic in the number of processors with non-zero fitness values, using the CRCW-PRAM model with a shared memory of constant size. Notably, logarithmic random bidding performs efficiently in scenarios where only a few processors are assigned non-zero fitness values.
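For reference, the sequential baseline that such parallel schemes compete with can be sketched as prefix sums plus binary search, giving O(log n) per draw after O(n) setup. This is a generic illustration of roulette wheel selection, not the paper's PRAM algorithm; the function name and fitness values are our own.

```python
import bisect
import itertools
import random

def roulette_select(fitness, rng):
    """Pick index i with probability fitness[i] / sum(fitness).
    Prefix sums plus binary search give O(log n) per draw."""
    prefix = list(itertools.accumulate(fitness))
    r = rng.uniform(0.0, prefix[-1])
    i = bisect.bisect_right(prefix, r)
    return min(i, len(fitness) - 1)   # guard the rare r == total edge case

rng = random.Random(42)
fitness = [1.0, 0.0, 3.0]
draws = [roulette_select(fitness, rng) for _ in range(10000)]
# Item 1 (zero fitness) is never chosen; item 2 is drawn roughly
# three times as often as item 0.
print(draws.count(0), draws.count(1), draws.count(2))
```

Note that zero-fitness items occupy zero-width intervals of the wheel and are never selected, which mirrors the paper's focus on the processors with non-zero fitness values.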
ISBN: (Print) 9798400710797
Transaction processing systems are the crux of modern datacenter applications, yet current multi-node systems are slow due to network overheads. This paper advocates for Compute Express Link (CXL) as a network alternative, enabling low-latency and cache-coherent shared memory accesses. However, directly adopting standard CXL primitives leads to performance degradation due to the high cost of maintaining cross-node cache coherence. To address these challenges, this paper introduces CtXnL, a software-hardware co-designed system that implements a novel hybrid coherence primitive tailored to the loosely coherent nature of transactional data. The core innovation of CtXnL is empowering transaction system developers with the ability to selectively achieve data coherence. Our evaluations on OLTP workloads demonstrate that CtXnL enhances performance, outperforming current network-based systems and achieving up to 2.08x greater throughput than vanilla CXL memory sharing architectures across universal transaction processing policies.
ISBN: (Print) 9798350364613; 9798350364606
Community detection refers to the identification of coherent partitions in networks. In this poster, we present a parallel dynamic Louvain algorithm that finds communities in rapidly evolving graphs. Given a batch update of edge deletions or insertions, our algorithm identifies an approximate set of affected vertices in the graph with minimal overhead and updates the community membership of each vertex. This process repeats until convergence. Our approach achieves a mean speedup of 7.3x compared to our parallel and optimized implementation of Delta-screening combined with Louvain, a recently proposed state-of-the-art approach.
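The key step, identifying which vertices a batch update can actually affect, can be illustrated in a few lines: endpoints of deleted intra-community edges and of inserted cross-community edges are the candidates whose membership may change. This is a simplified sketch of the general idea, not the poster's parallel implementation; the function name and the toy community assignment are our own.

```python
def affected_vertices(community, deletions, insertions):
    """Approximate set of vertices whose community membership may change
    after a batch update. Deleting an edge inside a community may split it;
    inserting an edge between communities may merge or shift membership.
    Edges that do not meet these conditions are skipped, keeping overhead low."""
    affected = set()
    for u, v in deletions:
        if community[u] == community[v]:   # intra-community deletion
            affected.update((u, v))
    for u, v in insertions:
        if community[u] != community[v]:   # cross-community insertion
            affected.update((u, v))
    return affected

community = {0: 'A', 1: 'A', 2: 'B', 3: 'B'}
print(affected_vertices(community, deletions=[(0, 1)], insertions=[(1, 2), (2, 3)]))
```

Only the marked vertices are re-evaluated by the subsequent Louvain passes, which is what keeps the per-batch overhead small relative to a full re-clustering.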
ISBN: (Print) 9798400706981
With the exponential growth of deep learning (DL), there arises an escalating need for scalability. Despite significant advancements in communication hardware capabilities, the time consumed by communication remains a bottleneck during training. Existing optimizations are coupled within parallel systems to implement specific computation-communication overlap; these approaches pose challenges in terms of performance, programmability, and generality. In this paper, we introduce Concerto, a compiler framework designed to address these challenges by automatically optimizing and scheduling communication. We formulate the scheduling problem as a resource-constrained project scheduling problem and use an off-the-shelf solver to obtain a near-optimal schedule, and we use auto-decomposition to create overlap opportunities for critical (synchronous) communication. Our evaluation shows Concerto can match or outperform state-of-the-art parallel frameworks, including Megatron-LM, JAX/XLA, DeepSpeed, and Alpa, all of which include extensive hand-crafted optimization. Unlike previous works, Concerto decouples the parallel approach from communication optimization and can therefore generalize to a wide variety of parallelisms without manual optimization.
ISBN: (Print) 9781665473156
The estimate that the mean time between failures will be measured in minutes on exascale supercomputers should be alarming for application developers. The system's inherent complexity, millions of components, and susceptibility to failures make checkpointing more relevant than ever. Since most high-performance scientific applications contain an in-house checkpoint-restart mechanism, their performance can be impacted by contention for parallel file system resources. A shift in checkpointing strategies is needed to thwart this behavior. With iCheck, we present a novel checkpointing framework that supports malleable multilevel application-level checkpointing. We employ an RDMA-enabled, configurable, multi-agent-based checkpoint transfer mechanism in which minimal application resources are utilized for checkpointing. The high-level API of iCheck facilitates easy integration and malleability. We have integrated the iCheck library into the ls1 mardyn application, achieving a performance improvement of up to five thousand times over the in-house checkpointing mechanism. LULESH, a Jacobi 2D heat simulation, and a synthetic application were also used for extensive analysis.
Smart contracts, software applications developed on blockchains, are extensively used in the Internet of Things, finance management, and other applications. However, development of smart contrac...
Medical Image AI systems can assist doctors in making diagnoses, thereby improving diagnostic accuracy. These systems are now widely used in hospitals. However, current AI diagnostic methods typically rely on various ...
ISBN: (Print) 9781665473156
Deep reinforcement learning has been successfully applied in various applications and has achieved impressive performance compared with traditional methods, but it suffers from high computation cost and long training time. MLPerf takes deep reinforcement learning as one of its benchmark tracks and provides a single-node training version of MiniGo as a reference. A key challenge is to achieve efficient MiniGo training on a large-scale computing system. Based on the training computation pattern of MiniGo and the characteristics of our large-scale heterogeneous computing system, we propose a multi-level parallel strategy, MLPs, comprising task-level parallelism between nodes, CPU-DSP heterogeneous parallelism, and DSP multi-core parallelism. The proposed method reduces the overall execution time from 43 hours to 16 hours while scaling from 1067 to 4139 nodes, with a scaling efficiency of 69.1%. According to our fitting method, the scaling efficiency is 46.5% when scaling to 8235 nodes. The experimental results show that the proposed method achieves efficient training of MiniGo on the large-scale heterogeneous computing system.