Transitive closure computation is a fundamental operation in graph theory with applications in various domains. However, the increasing size and complexity of real-world graphs make traditional algorithms inefficient,...
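The abstract above is truncated, but the baseline it implicitly references is well known. As a refresher, here is a minimal Python sketch of Warshall's O(n³) transitive-closure algorithm, the classical sequential baseline that scalable approaches improve on; the adjacency-matrix representation is an illustrative choice, not taken from the paper.

```python
def transitive_closure(adj):
    """Warshall's algorithm: adj is an n x n boolean adjacency matrix.

    After the triple loop, reach[i][j] is True iff j is reachable from i.
    """
    n = len(adj)
    reach = [row[:] for row in adj]  # copy so the input is not mutated
    for k in range(n):
        for i in range(n):
            if reach[i][k]:  # a path i -> k found so far...
                for j in range(n):
                    if reach[k][j]:  # ...combined with a path k -> j
                        reach[i][j] = True
    return reach

# Example: edges 0 -> 1 -> 2, so 0 can reach 2 transitively.
g = [[False, True, False],
     [False, False, True],
     [False, False, False]]
assert transitive_closure(g)[0][2]
```

The cubic time and quadratic memory of this baseline are exactly why it breaks down on the large real-world graphs the abstract mentions.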
Serverless computing has shown vast potential for big data analytics applications, especially involving machine learning algorithms. Nevertheless, little consideration has been given in the literature to cloud-agnosti...
ISBN (print): 9798350364613; 9798350364606
Community detection is a fundamental operation in graph mining; by uncovering hidden structures and patterns within complex systems, it helps solve fundamental problems pertaining to social networks, such as information diffusion, epidemics, and recommender systems. Scaling graph algorithms to massive networks is challenging on modern distributed-memory multi-GPU (Graphics Processing Unit) systems due to irregular memory access patterns, load imbalance, high communication-to-computation ratios, and limited cross-platform support. We present a novel algorithm, HiPDPL-GPU (distributed-parallel Louvain), to address these challenges. We conduct experiments with different partitioning techniques to optimize the performance of HiPDPL-GPU on two of the largest supercomputers: Frontier and Summit. Remarkably, HiPDPL-GPU processes a graph with 4.2 billion edges in less than 3 minutes using 1024 GPUs. Qualitatively, the performance of HiPDPL-GPU is similar to or better than that of other state-of-the-art CPU- and GPU-based implementations. While prior GPU implementations have predominantly employed CUDA, our first-of-its-kind implementation for community detection is cross-platform, accommodating both AMD and NVIDIA GPUs.
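The paper's HiPDPL-GPU source is not reproduced here. For orientation, below is a minimal single-threaded Python sketch of the Louvain local-moving phase that such distributed-parallel implementations scale out; the data layout and function names are illustrative assumptions, not the paper's API.

```python
from collections import defaultdict

def louvain_local_move(adj, m):
    """One greedy local-moving phase of Louvain.

    adj: {node: {neighbor: weight}} for an undirected weighted graph.
    m: total edge weight. Returns a {node: community} assignment.
    """
    community = {v: v for v in adj}                 # start with singletons
    degree = {v: sum(nb.values()) for v, nb in adj.items()}
    comm_degree = dict(degree)                      # degree sum per community

    moved = True
    while moved:
        moved = False
        for v in adj:
            old = community[v]
            comm_degree[old] -= degree[v]           # temporarily detach v
            links = defaultdict(float)              # weight from v to each community
            for u, w in adj[v].items():
                links[community[u]] += w
            # Modularity gain of joining community c:
            #   dQ = links[c]/m - degree[v] * comm_degree[c] / (2*m^2)
            base = degree[v] / (2 * m * m)
            best = old
            best_gain = links.get(old, 0.0) / m - base * comm_degree[old]
            for c, w in links.items():
                gain = w / m - base * comm_degree[c]
                if gain > best_gain:
                    best, best_gain = c, gain
            community[v] = best
            comm_degree[best] += degree[v]          # re-attach v
            moved |= best != old
    return community

# Two triangles bridged by a single edge; m is the total edge weight.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = defaultdict(dict)
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0
print(louvain_local_move(adj, m=float(len(edges))))  # two communities expected
```

The irregular neighbor scans and the read-modify-write updates to `comm_degree` in this loop are precisely the memory-access and load-balance hazards the abstract cites for multi-GPU scaling.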
ISBN (print): 9798400701559
The serverless computing model has been on the rise in recent years due to its lower barrier to entry and elastic scalability. However, our experimental evidence suggests that multiple serverless computing platforms suffer from serious performance inefficiencies when a high number of concurrent function instances are invoked, which is a desirable capability for parallel applications. To mitigate this challenge, this paper introduces ProPack, a novel solution that provides higher performance and yields cost savings for end users running applications with high concurrency. ProPack leverages insights from an experimental study to build a simple and effective analytical model that mitigates the scalability bottleneck. Our evaluation on multiple serverless platforms, including AWS Lambda and Google Cloud, confirms that ProPack can improve average performance by 85% and reduce cost by 66%. ProPack provides a significant improvement (over 50%) over state-of-the-art serverless workload managers such as Pywren, and is also effective at mitigating the concurrency bottleneck for FuncX, a recent on-premise serverless execution platform for parallel applications.
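The abstract does not spell out ProPack's analytical model, so the sketch below only illustrates the general idea of packing many small tasks into fewer, larger function invocations so that concurrency stays under a platform's measured bottleneck point; the function name, the knee value, and the uniform-packing policy are all assumptions made for illustration, not ProPack's actual formulation.

```python
import math

def pack_tasks(tasks, concurrency_knee):
    """Group tasks so that at most `concurrency_knee` function instances
    run at once, instead of invoking one instance per task.

    `concurrency_knee` stands for the platform-specific concurrency level
    beyond which invocation overheads grow sharply; a real system would
    derive it from measurements rather than hard-code it.
    """
    n = len(tasks)
    instances = min(n, concurrency_knee)
    per_instance = math.ceil(n / instances)
    return [tasks[i:i + per_instance] for i in range(0, n, per_instance)]

# 10,000 tasks with a measured knee of 500 concurrent instances become
# 500 invocations of 20 packed tasks each.
batches = pack_tasks(list(range(10_000)), concurrency_knee=500)
print(len(batches), len(batches[0]))  # -> 500 20
```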
Big data technology is increasingly penetrating various industries, bringing unprecedented opportunities to enterprises and society with its powerful data processing and analysis capabilities. At the same time, the ra...
Recent advances in imaging and computing technology generate tremendous image data daily. Searching image collections has been made easier with the introduction of some content-based image retrieval (CBIR) approaches....
In order to overcome the problems of high intrusion rate and low encryption depth of traditional cloud computing security research methods, this paper proposes a new cloud computing security research method based on V...
ISBN (digital): 9798400706318
ISBN (print): 9798400706318
Analog computing-in-memory (ACiM) technology has shown strong potential for neural network accelerators, addressing von-Neumann performance bottlenecks with in-memory data processing and computation. Understanding the ACiM design space, including its trade-offs and constraints, and systematically and effectively exploring it for optimal performance is essential to turn the promise into a viable product. Recent research demonstrated that multi-objective searches for ACiM architectures with heterogeneous tiles can simultaneously optimize power, performance, and area (PPA), outperforming existing tiled ACiM proposals. In this paper, we propose NavCim, a comprehensive ACiM design space exploration mechanism that advances the prior work in terms of search efficiency, search space coverage, and optimization metrics. NavCim introduces predictive modeling of ACiM hardware performance and uses the PPA prediction models instead of running simulators, significantly reducing search overheads. Faster searches enable NavCim to extend the architecture and model search spaces with an evolutionary search process to optimize architectures with more than two different tile sizes for multiple input models. With accuracy-aware searches, NavCim considers PPA and model accuracy together as optimization goals to achieve more balanced trade-offs. The experimental searches show that NavCim leverages predictive models to reduce search time by up to 7.3x without compromising the quality of search results. It also successfully identifies heterogeneous ACiM architectures that can efficiently execute multiple models on a single chip, improving accuracy by up to 19% over the prior work.
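NavCim's prediction models and search code are not shown in the abstract. As a rough illustration of the pattern it describes, the sketch below runs an evolutionary multi-objective search that queries a fast PPA predictor instead of a simulator; the design encoding, mutation rule, and toy predictor are invented for the example.

```python
import random

def mutate(design):
    """Perturb one tile count; tile names and counts are illustrative."""
    child = dict(design)
    tile = random.choice(list(child))
    child[tile] = max(0, child[tile] + random.choice([-1, 1]))
    return child

def dominates(a, b):
    """True if PPA tuple a is no worse than b in every objective and
    strictly better in at least one (all objectives minimized here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs, predict_ppa):
    """Keep only the non-dominated designs under the cheap predictor."""
    scored = [(d, predict_ppa(d)) for d in designs]
    return [d for d, s in scored
            if not any(dominates(t, s) for _, t in scored if t is not s)]

def surrogate_search(seeds, predict_ppa, generations=20, children=32):
    """Evolutionary loop in the spirit of NavCim: the fast predictor
    replaces simulator runs, so many more candidates can be evaluated."""
    population = list(seeds)
    for _ in range(generations):
        population += [mutate(random.choice(population)) for _ in range(children)]
        population = pareto_front(population, predict_ppa)
    return population

# Toy predictor: larger tiles trade area and power for throughput (made up).
def toy_ppa(d):
    small, large = d.get("tile64", 0), d.get("tile256", 0)
    power = small + 4 * large
    latency = 100.0 / (1 + small + 3 * large)
    area = small + 5 * large
    return (power, latency, area)

print(surrogate_search([{"tile64": 4, "tile256": 1}], toy_ppa))
```

Because every fitness query is a model evaluation rather than a simulation, widening the search space (more tile sizes, multiple input models, an accuracy objective) costs little, which is the lever the paper exploits.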
ISBN (print): 9781665473156
Graph clustering is an important technique to detect community clusters in complex networks. SCAN (Structural Clustering Algorithm for Networks) is a well-studied graph clustering algorithm that has been widely applied over the years. However, the processing time of sequential SCAN and its variants is intolerable on large graphs. Existing parallel variants of SCAN focus on fully utilizing the computing capacity of multi-core architectures and on sophisticated optimization techniques for a single computing node. As the objects and relationships in cyberspace vary over time, the scale of graph data is growing at a high rate. Graph clustering algorithms on a single node face challenges from limited computing resources, such as computing performance, memory size, and storage volume; a distributed processing algorithm is therefore needed for large graphs. This work presents a distributed structural graph clustering algorithm using Spark. Furthermore, the edge pruning technique and adaptive checking are optimized to improve clustering efficiency, and the label propagation clustering is simplified to reduce the communication cost in the distributed clustering iterations. We also conduct extensive experiments on real-world datasets to verify the efficiency and scalability of the distributed algorithm. Experimental results show that efficient clustering performance is achieved and that the algorithm scales well under different settings.
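For readers unfamiliar with SCAN, its core is the structural similarity between adjacent vertices and the eps/mu core test built on it. Below is a minimal sequential Python sketch of those two steps (the full algorithm then expands clusters from cores); the distributed Spark version in the paper parallelizes exactly these computations. Parameter values in the example are arbitrary.

```python
import math

def structural_similarity(adj, u, v):
    """SCAN's similarity on closed neighborhoods:
    sigma(u, v) = |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def core_vertices(adj, eps, mu):
    """A vertex is a core if at least `mu` vertices in its closed
    neighborhood are eps-similar to it; cores seed SCAN's clusters."""
    cores = set()
    for u in adj:
        eps_neighbors = sum(
            1 for v in adj[u] | {u}
            if structural_similarity(adj, u, v) >= eps)
        if eps_neighbors >= mu:
            cores.add(u)
    return cores

# adj maps each node to its neighbor set: a triangle plus a pendant node.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(core_vertices(adj, eps=0.6, mu=3))  # the triangle vertices qualify
```

The edge-pruning optimization mentioned in the abstract skips similarity computations whose outcome cannot affect cluster membership; the naive version above evaluates every adjacent pair.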
ISBN (print): 9798400706103
Embedding is a crucial step for deep neural networks. Datasets from different applications, with different structures, can all be processed through an embedding layer and transformed into a dense matrix. The transformation must minimize both the loss of information and the redundancy of the data, and extracting appropriate data features ensures the efficiency of the transformation. The co-occurrence matrix is an excellent way of representing the links between elements in a dataset. However, the dataset size becomes a problem in terms of computation power and memory footprint when using the co-occurrence matrix. In this paper, we propose a parallel and distributed approach to constructing the co-occurrence matrix in a scalable way. Our solution takes advantage of different features of boolean datasets to minimize the construction time of the co-occurrence matrix. Our experimental results show that our solution outperforms traditional approaches by up to 34x. We also demonstrate the efficacy of our approach with a cost model.
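The authors' construction code is not included in the abstract. As a baseline illustration, here is a minimal map-reduce-style Python sketch that counts co-occurring item pairs per partition and merges the partial counters; representing each boolean record by its set of true items, the chunking policy, and the worker count are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from multiprocessing import Pool

def partial_counts(transactions):
    """Count co-occurring item pairs within one partition. Each record of a
    boolean dataset is represented by the set of items that are true."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts

def cooccurrence_matrix(transactions, workers=4):
    """Parallel construction: workers count pairs in disjoint partitions,
    then the sparse partial counts are summed. The paper's optimizations
    exploit further properties of boolean datasets beyond this baseline."""
    chunks = [transactions[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(partial_counts, chunks)
    total = Counter()
    for p in partials:
        total.update(p)
    return total  # sparse upper-triangular co-occurrence counts

if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
    print(cooccurrence_matrix(data, workers=2))  # ('a', 'c') occurs twice
```

Storing the counts as a sparse `Counter` keyed by item pairs, rather than a dense matrix, is what keeps the memory footprint proportional to the number of observed pairs instead of the square of the vocabulary.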