ISBN (digital): 9798350326581
ISBN (print): 9798350326598
Personalized recommendation systems have become one of the most important Internet services nowadays. A critical challenge in training and deploying recommendation models is their high memory capacity and bandwidth demands, with the embedding layers occupying hundreds of GBs to TBs of storage. The advent of memory disaggregation technology and Compute Express Link (CXL) provides a promising solution for memory capacity scaling. However, relocating memory-intensive embedding layers to CXL memory incurs noticeable performance degradation due to its limited transmission bandwidth, which is significantly lower than the host memory bandwidth. To address this, we introduce ReCXL, a CXL memory disaggregation system that utilizes near-memory processing (NMP) for scalable, efficient recommendation model training. ReCXL features a unified, hardware-efficient NMP architecture that processes the entire embedding training within CXL memory, minimizing data transfers over the bandwidth-limited CXL link and enhancing internal bandwidth utilization. To further improve performance, ReCXL incorporates software-hardware co-optimizations, including sophisticated dependency-free prefetching and fine-grained update scheduling, to maximize hardware utilization. Evaluation results show that ReCXL outperforms the CPU-GPU baseline and naïve CXL memory by $7.1\times$–$10.6\times$ ($9.4\times$ on average) and $12.7\times$–$31.3\times$ ($22.6\times$ on average), respectively.
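For readers unfamiliar with why embedding layers dominate memory traffic, the sketch below illustrates the sparse gather and scatter-update access pattern that ReCXL keeps inside CXL memory. The table size, batch shape, and plain-SGD update rule are illustrative assumptions, not ReCXL's actual design.

```python
import numpy as np

# Illustrative sketch of the embedding-training access pattern that near-memory
# processing targets: reads and writes touch only a scattered subset of rows.
NUM_ROWS, DIM = 100_000, 64                 # assumed table shape, for illustration
table = np.random.rand(NUM_ROWS, DIM).astype(np.float32)

def forward(indices):
    """Gather: irregular reads of only the looked-up embedding rows."""
    return table[indices]                   # shape: (batch, DIM)

def backward_sgd(indices, grad, lr=0.01):
    """Scatter-update: only the touched rows are written back.
    np.add.at accumulates correctly when an index repeats within a batch."""
    np.add.at(table, indices, -lr * grad)

batch = np.random.randint(0, NUM_ROWS, size=256)
emb = forward(batch)                        # 256 scattered row reads
backward_sgd(batch, np.ones_like(emb))      # 256 scattered row writes
```

Keeping both phases near the memory avoids shipping the scattered rows over the bandwidth-limited CXL link on every training step.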
Genome graphs analysis has emerged as an effective means to enable mapping DNA fragments (known as reads) to the reference genome. It replaces the traditional linear reference with a graph-based representation to augment the genetic variations and diversity information, significantly improving the quality of genotyping. An in-depth characterization of genome graphs analysis uncovers that it is bottlenecked by the irregular seed index accesses and the intensive alignment operations, stressing both the memory system and computing. Based on these observations, we propose MeG2, a lightweight, commodity-DRAM-compliant, processing-in-memory architecture to accelerate genome graphs analysis. MeG2 integrates the capabilities of both near-memory processing and bitwise in-situ computation. Specifically, MeG2 leverages the low access latency of near-memory processing with an index-centric offload mechanism to alleviate the irregular memory accesses in the seeding procedure, and harnesses the row-parallel capacity of in-situ computation with a distance-aware technique to exploit the intensive computational parallelism in the alignment process. Results show that MeG2 outperforms CPU-, GPU-, and ASIC-based genome graphs analysis solutions by 502× (30.2×), 272× (15.1×), and 5.5× (8.3×) for short (long) reads, while reducing energy consumption by 1628× (85.6×), 1443× (77.1×), and 7.8× (11.7×), respectively. We also demonstrate that MeG2 offers significant improvements over existing PIM-based genome sequence analysis accelerators.
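To make the seeding bottleneck concrete, here is a minimal sketch of a k-mer seed index: each query probes an effectively random location, which is the irregular access pattern MeG2 offloads to near-memory processing. The hash-table layout and toy sequences are assumptions; production genome-graph tools use more compact index structures.

```python
from collections import defaultdict

K = 5  # assumed k-mer length, for illustration

def build_index(reference: str) -> dict:
    """Map every k-mer of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - K + 1):
        index[reference[pos:pos + K]].append(pos)
    return index

def seed(read: str, index: dict) -> list:
    """Return (read_offset, ref_position) seed hits; each probe touches
    an unpredictable index location -- the pattern NMP helps with."""
    hits = []
    for off in range(len(read) - K + 1):
        for pos in index.get(read[off:off + K], []):
            hits.append((off, pos))
    return hits

index = build_index("ACGTACGTGACCTGAACGT")
print(seed("CGTGACC", index))   # [(0, 5), (1, 6), (2, 7)]
```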
Serverless platforms typically adopt an early-binding approach for function sizing, requiring developers to specify an immutable size for each function within a workflow beforehand. Accounting for potential runtime va...
Federated Learning (FL) is increasingly adopted in edge computing scenarios, where a large number of heterogeneous clients operate under constrained or sufficient resources. The iterative training process in conventio...
Recent years have witnessed significant progress in developing deep learning-based models for automated code completion. Examples of such models include CodeGPT and StarCoder. These models are typically trained on a large amount of source code collected from open-source communities such as GitHub. Although using source code from GitHub has been a common practice for training deep-learning-based models for code completion, it may induce legal and ethical issues such as copyright infringement. In this paper, we investigate the legal and ethical issues of current neural code completion models by answering the following question: Is my code used to train your neural code completion model? To this end, we tailor a membership inference approach (termed CodeMI), originally crafted for classification tasks, to the more challenging task of code completion. In particular, since the target code completion models perform as opaque black boxes, preventing access to their training data and parameters, we opt to train multiple shadow models to mimic their behavior. The posteriors acquired from these shadow models are subsequently employed to train a membership classifier. The membership classifier can then be used to deduce the membership status of a given code sample from the output of a target code completion model. We comprehensively evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models (i.e., LSTM-based, CodeGPT, CodeGen, and StarCoder). Experimental results reveal that the LSTM-based and CodeGPT models suffer from the membership leakage issue, which can be easily detected by our proposed membership inference approach with accuracies of 0.842 and 0.730, respectively. Interestingly, our experiments also show that the data membership of current large language models of code, e.g., CodeGen and StarCoder, is difficult to detect, leaving ample space for further improvement. Finally, we also …
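The shadow-model pipeline the abstract describes can be sketched end to end: featurize a model's posteriors over a code sample, train a binary attack classifier on shadow-model outputs with known membership, then query it against the black-box target. Everything below (the feature choice, the classifier, and the synthetic posteriors standing in for real shadow-model outputs) is an illustrative assumption, not the authors' exact CodeMI method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def posterior_features(probs: np.ndarray) -> np.ndarray:
    """Summarize per-token posteriors for one code sample. Training members
    tend to receive higher, peakier confidence than non-members."""
    return np.array([probs.mean(), probs.max(), -np.log(probs + 1e-9).mean()])

# Stand-in shadow-model outputs: members get slightly higher confidence.
member_probs = rng.beta(8, 2, size=(500, 32))       # 500 samples x 32 tokens
nonmember_probs = rng.beta(5, 3, size=(500, 32))

X = np.vstack([np.apply_along_axis(posterior_features, 1, member_probs),
               np.apply_along_axis(posterior_features, 1, nonmember_probs)])
y = np.array([1] * 500 + [0] * 500)                 # 1 = training member

attack = LogisticRegression().fit(X, y)             # membership classifier

# Attack time: query the black-box target, featurize its posteriors,
# and predict whether the code sample was in its training set.
query = np.apply_along_axis(posterior_features, 1, rng.beta(8, 2, (1, 32)))
print("member probability:", attack.predict_proba(query)[0, 1])
```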
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Transactional stream processing engines (TSPEs) have gained increasing attention due to their capability of processing real-time stream applications with transactional semantics. However, TSPEs remain susceptible to system failures and power outages. Existing TSPEs mainly focus on performance improvement but still face a significant challenge in guaranteeing fault tolerance while offering high-performance services. We revisit commonly used fault tolerance approaches in stream processing and database systems, and find that these approaches do not work well on TSPEs due to complex data dependencies. In this paper, we propose a novel TSPE called MorphStreamR to achieve fast failure recovery while guaranteeing low performance overhead at runtime. The key idea of MorphStreamR is to record intermediate results of resolved dependencies at runtime, thus eliminating data dependencies and improving task parallelism during failure recovery. MorphStreamR further mitigates the runtime overhead by selectively tracking data dependencies and incorporating workload-aware log commitment. Experimental results show that MorphStreamR can reduce the recovery time by up to 3.1× while experiencing much less performance slowdown at runtime, compared with other applicable fault tolerance approaches.
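A toy illustration of the key idea: if a transaction logs the resolved value it read at runtime, recovery can recompute every transaction independently instead of replaying them serially to re-derive cross-transaction dependencies. The state model and log format below are assumptions for illustration, not MorphStreamR's actual design.

```python
from concurrent.futures import ThreadPoolExecutor

state = {"x": 0}
resolved_log = []            # (txn_id, key, value_read), appended at runtime

def run_txn(txn_id, delta):
    v = state["x"]                          # this read depends on prior txns
    resolved_log.append((txn_id, "x", v))   # record the resolved input
    state["x"] = v + delta

deltas = [3, 5, 7]
for i, d in enumerate(deltas):
    run_txn(i, d)

# Recovery: inputs are already resolved in the log, so each transaction's
# output can be recomputed independently -- here, in parallel threads.
def replay(entry, delta):
    txn_id, key, v = entry
    return txn_id, v + delta

with ThreadPoolExecutor() as pool:
    outputs = list(pool.map(replay, resolved_log, deltas))
print(outputs)   # each txn recomputed without waiting on its predecessors
```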
ISBN (digital): 9798350383508
ISBN (print): 9798350383515
Graph convolutional networks (GCNs) have achieved enormous success in learning structural information from unstructured data. As graphs become increasingly large, distributed training for GCNs is severely prolonged by frequent cross-worker communications. Existing efforts to improve training efficiency often come at the expense of GCN performance, while the communication overhead persists. In this paper, we propose PSC-GCN, a holistic pipelined framework for distributed GCN training with communication-efficient sampling and inclusion-aware caching, to address the communication bottleneck while ensuring satisfactory model performance. Specifically, we devise an asynchronous pre-fetching scheme to retrieve stale statistics (features, embeddings, gradients) of boundary nodes in advance, such that embedding aggregation and model update are pipelined with statistics transmission. To reduce communication volume and mitigate the staleness effect, we introduce a variance-reduction-based sampling policy, which prioritizes inner nodes over boundary ones to reduce the access frequency to remote neighbors, thus mitigating cross-worker statistics exchange. Complementing graph sampling, a feature caching module is co-designed to buffer hot nodes with high inclusion probability, ensuring that frequently sampled nodes are available in local memory. Extensive evaluations on real-world datasets show the superiority of PSC-GCN over state-of-the-art methods: it reduces training time by 72%-80% without sacrificing model accuracy.
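The two co-designed ideas can be sketched in a few lines: bias neighbor sampling toward worker-local (inner) nodes to cut cross-worker traffic, and cache the remote nodes whose inclusion probability is highest. The weights, threshold, and probability estimates below are illustrative assumptions, not PSC-GCN's actual policy.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(neighbors, is_local, fanout=5, local_boost=3.0):
    """Sample `fanout` neighbors, up-weighting local ones so remote
    neighbors are fetched less often (but remain reachable)."""
    w = np.where(is_local, local_boost, 1.0)
    k = min(fanout, len(neighbors))
    return rng.choice(neighbors, size=k, replace=False, p=w / w.sum())

def build_cache(inclusion_prob, capacity):
    """Buffer the remote nodes most likely to be sampled again."""
    order = np.argsort(-inclusion_prob)
    return set(order[:capacity].tolist())

neighbors = np.arange(10)
is_local = np.array([True] * 6 + [False] * 4)
print(sample_neighbors(neighbors, is_local))

incl = rng.random(100)          # estimated per-node inclusion probability
hot_cache = build_cache(incl, capacity=16)
```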
ISBN (digital): 9798350317152
ISBN (print): 9798350317169
Directed Acyclic Graph (DAG)-based blockchains (a.k.a. distributed ledgers) have become prevalent for supporting highly concurrent applications. Their inherent parallel data structure accelerates block generation significantly, shifting the bottleneck from performance to storage scalability. An intuitive solution is to apply state sharding, which divides the entire ledger (i.e., transactions and states) into multiple shards. While each node then stores only a proportional share of transactions, it faces the challenges of storing cross-shard transactions and ensuring their processing consistency. In this paper, we propose SharDAG, a new mechanism that leverages adaptive sharding for DAG-based blockchains to achieve high performance and strong consistency. The key idea of SharDAG is to exploit a unique characteristic of DAG-based blockchains, silent assets, and to design a lightweight processing mechanism based on avatar account caching. Furthermore, we design a Byzantine-resilient cross-shard verification mechanism with a theoretically optimal number of participating nodes, which guarantees the consistency and security of avatar account aggregation. Our comprehensive evaluations on real-world workloads demonstrate that SharDAG achieves up to 3.8× throughput improvement compared to the state-of-the-art and reduces the storage overhead of cross-shard transactions.
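A toy sketch of the avatar-account idea: a shard executes a cross-shard transaction against a locally cached "avatar" of the remote account and defers aggregation back to the home shard. The data structures and aggregation rule are illustrative assumptions; in SharDAG the aggregation step is protected by the Byzantine-resilient verification mechanism described above.

```python
class Shard:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.accounts = {}      # accounts homed in this shard
        self.avatars = {}       # cached deltas for remote accounts

    def apply(self, account, delta, home_shard):
        if home_shard == self.shard_id:
            self.accounts[account] = self.accounts.get(account, 0) + delta
        else:
            # Cross-shard: record against the avatar, no sync round needed.
            self.avatars[account] = self.avatars.get(account, 0) + delta

    def aggregate_from(self, other):
        """Home shard folds in a verified avatar delta from another shard."""
        for account, delta in other.avatars.items():
            self.accounts[account] = self.accounts.get(account, 0) + delta
        other.avatars.clear()

s0, s1 = Shard(0), Shard(1)
s0.apply("alice", +100, home_shard=0)   # local transaction
s1.apply("alice", -30, home_shard=0)    # cross-shard, hits the avatar
s0.aggregate_from(s1)
print(s0.accounts)                      # {'alice': 70}
```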
Stream Learning (SL) requires models that can quickly adapt to continuously evolving data, posing significant challenges in both computational efficiency and learning accuracy. Effective data selection is critical in ...
Modern scientific applications predominantly run on large-scale computing platforms, necessitating collaboration between scientific domain experts and high-performance computing (HPC) experts. While domain experts are often skilled in customizing domain-specific scientific computing routines, which often involve various matrix computations, HPC experts are essential for achieving efficient execution of these computations on large-scale platforms. This process often involves utilizing complex parallel computing libraries tailored to specific matrix computation scenarios. However, the intricate programming procedure and the need for deep understanding of both application domains and HPC pose significant challenges to the widespread adoption of scientific computing. In this research, we observe that matrix computations can be transformed into equivalent graph representations, and that by utilizing graph processing engines, HPC experts can be freed from the burden of implementing efficient scientific computations. Based on this observation, we introduce a graph-engine-based scientific computing (Graph for Science) paradigm, which provides a unified graph programming interface, enabling domain experts to promptly implement various types of matrix computations. The proposed paradigm leverages the underlying graph processing engine to achieve efficient execution, eliminating the need for HPC expertise in programming large-scale scientific applications. We evaluate the performance of the developed graph compute engine on three typical scientific computing routines. Our results demonstrate that the graph-engine-based scientific computing paradigm achieves performance comparable to the best-performing implementations based on existing parallel computing libraries and bespoke implementations. Importantly, the paradigm greatly simplifies the development of scientific computations on large-scale platforms, reducing the programming difficulty for scientists and facilitating …
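The matrix-to-graph equivalence the paradigm builds on is easy to verify on a small case: sparse matrix-vector multiply y = A·x can be expressed as message passing, where each nonzero A[i, j] is an edge j → i carrying weight A[i, j]. The edge-list representation and the scatter/gather loop below are illustrative, not the engine's actual API.

```python
import numpy as np

edges = [            # (src j, dst i, weight A[i, j]) for each nonzero
    (0, 0, 2.0), (1, 0, 1.0),
    (0, 1, 4.0), (2, 1, 3.0),
    (1, 2, 5.0),
]
x = np.array([1.0, 2.0, 3.0])

y = np.zeros(3)
for src, dst, w in edges:        # scatter: one message per edge
    y[dst] += w * x[src]         # gather: sum incoming messages per vertex

A = np.array([[2.0, 1.0, 0.0],
              [4.0, 0.0, 3.0],
              [0.0, 5.0, 0.0]])
assert np.allclose(y, A @ x)     # graph formulation matches the SpMV
print(y)                         # [ 4. 13. 10.]
```

This is exactly the vertex-centric pattern graph engines optimize for, which is why a unified graph interface can cover many matrix routines.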