Streaming graph processing needs to timely evaluate continuous queries. Prior systems suffer from massive redundant computations due to the irregular order of processing vertices influenced by updates. To address this...
详细信息
ISBN:
(纸本)9798350323481
Streaming graph processing needs to timely evaluate continuous queries. Prior systems suffer from massive redundant computations due to the irregular order of processing vertices influenced by updates. To address this issue, we propose ACGraph, a novel streaming graph processing approach for monotonic graph algorithms. It maintains dependence trees during runtime, and makes affected vertices processed in a top-to-bottom order in the hierarchy of the dependence trees, thus normalizing the state propagation order and coalescing of multiple propagation to the same vertices. Experimental results show that ACGraph reduces the number of updates by 50% on average, and achieves the speedup of 1.75~7.43× over state-of-the-art systems.
Segment Anything Model (SAM) has recently gained much attention for its outstanding generalization to unseen data and tasks. Despite its promising prospect, the vulnerabilities of SAM, especially to universal adversar...
In this paper, we propose efficient distributed algorithms for three holistic aggregation functions on random regular graphs that are good candidates for network topology in next-generation data *** three holistic agg...
详细信息
In this paper, we propose efficient distributed algorithms for three holistic aggregation functions on random regular graphs that are good candidates for network topology in next-generation data *** three holistic aggregation functions include SELECTION(select the k-th largest or smallest element),DISTINCT(query the count of distinct elements), MODE(query the most frequent element). We design three basic techniques — Pre-order Network Partition, Pairwise-independent Random Walk, and Random Permutation Delivery, and devise the algorithms based on the techniques. The round complexity of the distributed SELECTION is Θ(log N) which meets the lower bound where N is the number of nodes and each node holds a numeric element. The round complexity of the distributed DISTINCT and MODE algorithms are O(log3N/log log N) and O(log2N log log N) respectively. All of our results break the lower bounds obtained on general graphs and our distributed algorithms are all based on the CON GE S T model, which restricts each node to send only O(log N) bits on each edge in one round under synchronous communications.
A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to signif...
A growing number of applications are moving to serverless architectures for high elasticity and fine-grained billing. For stateful applications, however, the use of serverless architectures is likely to lead to significant performance degradation, as frequent data sharing between different execution stages involves time-consuming remote storage access. Current platforms leverage memory cache to speed up remote access. However, conventional caching strategies show limited performance improvement. We experimentally find that the reason is that current strategies overlook the stage-dependent access patterns of stateful serverless applications, i.e., data that are read multiple times across stages (denoted as multi-read data) are wrongly evicted by data that are read only once (denoted as read-once data), causing a high cache miss ***, we propose a new caching strategy, Duo, whose design principle is to cache multi-read data as long as possible. Specifically, Duo contains a large cache list and a small cache list, which act as Leader list and Wingman list, respectively. Leader list ignores the data that is read for the first time to prevent itself from being polluted by massive read-once data at each stage. Wingman list inspects the data that are ignored or evicted by Leader list, and pre-fetches the data that will probably be read again based on the observation that multi-read data usually appear periodically in groups. Compared to the state-of-the-art works, Duo improves hit ratio by 1.1×-2.1× and reduces the data sharing overhead by 25%-62%.
Optimal margin Distribution Machine (ODM) is a newly proposed statistical learning framework rooting in the latest margin theory, which demonstrates better generalization performance than the traditional large margin ...
详细信息
Graph mining aims to explore interesting structural information of a graph. Pattern-centric systems typically transform a generic-purpose graph mining problem into a series of subgraph matching problems for high perfo...
详细信息
ISBN:
(纸本)9781665442787
Graph mining aims to explore interesting structural information of a graph. Pattern-centric systems typically transform a generic-purpose graph mining problem into a series of subgraph matching problems for high performance. Existing pattern-centric mining systems reduce the substantial search space towards a single pattern by exploring a highly-optimized matching order, but inherent computational redundancies of such a matching order itself still suffer severely, leading to significant performance degradation. The key innovation of this work lies in a general redundancy criterion that characterizes computational redundancies arising in not only handing a single pattern but also matching multiple patterns simultaneously. In this paper, we present SumPA, a high-performance pattern-centric graph mining system that can sufficiently remove redundant computations for any complex graph mining problems. SumPA features three key designs: (1) a pattern abstraction technique that can simplify numerous complex patterns into a few simple abstract patterns based on pattern similarity, (2) abstraction-guided pattern matching that completely eliminates (totally and partially) redundant computations during subgraph enumeration, and (3) a suite of system optimizations to maximize storage and computation efficiency. Our evaluation on a wide variety of real-world graphs shows that SumPA outperforms the two state-of-the-art systems Peregrine and GraphPi by up to 61.89× and 8.94×, respectively. For many mining problems on large graphs, Peregrine takes hours or even days while SumPA finishes in only a few minutes.
Out-of-distribution (OOD) detection is crucial for developing trustworthy and reliable machine learning systems. Recent advances in training with auxiliary OOD data demonstrate efficacy in enhancing detection capabili...
详细信息
Genome graphs analysis has emerged as an effective means to enable mapping DNA fragments (known as reads) to the reference genome. It replaces the traditional linear reference with a graph-based representation to augm...
Genome graphs analysis has emerged as an effective means to enable mapping DNA fragments (known as reads) to the reference genome. It replaces the traditional linear reference with a graph-based representation to augment the genetic variations and diversity information, significantly improving the quality of genotyping. The in-depth characterization of genome graphs analysis uncovers that it is bottlenecked by the irregular seed index access and the intensive alignment operation, stressing both the memory system and computing *** on these observations, we propose MeG 2 , a lightweight, commodity DRAM-compliant, processing-in-memory architecture to accelerate genome graphs analysis. MeG 2 is specifically integrated with the capabilities of both near-memory processing and bitwise in-situ computation. Specifically, MeG 2 leverages the low access latency of near-memory processing with the index-centric offload mechanism to alleviate the irregular memory access in the seeding procedure, and harnesses the row-parallel capacity of in-situ computation with the distance-aware technique to exploit the intensive computational parallelism in the alignment process. Results show that MeG 2 outperforms the CPU-, GPU-, and ASIC-based genome graphs analysis solutions by 502× (30.2×), 272× (15.1× ), and 5.5× (8.3×) for short (long) reads, while reducing energy consumption by 1628× (85.6×), 1443× (77.1×), and 7.8× (11.7×), respectively. We also demonstrate that MeG 2 offers significant improvements over existing PIM-based genome sequence analysis accelerators.
Serverless platforms typically adopt an early-binding approach for function sizing, requiring developers to specify an immutable size for each function within a workflow beforehand. Accounting for potential runtime va...
详细信息
Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge of training and deploying the recommendation models is their high memory capacity and bandwidt...
详细信息
ISBN:
(数字)9798350326581
ISBN:
(纸本)9798350326598
Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge of training and deploying the recommendation models is their high memory capacity and bandwidth demands, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing for scalable, efficient recommendation model training. ReCXL features a unified, hardwareefficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL and enhancing internal bandwidth. To further improve the performance, ReCXL incorporates softwarehardware co-optimizations, including sophisticated dependencyfree prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and the naïve CXL memory by $7.1 \times \sim 10.6 \times(9.4 \times$ on average) and $12.7 \times \sim 31.3 \times(22.6 \times$ on average), respectively.
暂无评论