Modern scientific applications predominantly run on large-scale computing platforms, necessitating collaboration between scientific domain experts and high-performance computing (HPC) experts. While domain experts are often skilled in customizing domain-specific scientific computing routines, which typically involve various matrix computations, HPC experts are essential for achieving efficient execution of these computations on large-scale platforms. This process often involves utilizing complex parallel computing libraries tailored to specific matrix computation scenarios. However, the intricate programming procedure and the need for a deep understanding of both the application domain and HPC pose significant challenges to the widespread adoption of scientific computing. In this research, we observe that matrix computations can be transformed into equivalent graph representations, and that by utilizing graph processing engines, HPC experts can be freed from the burden of implementing efficient scientific computations. Based on this observation, we introduce a graph engine-based scientific computing (Graph for Science) paradigm, which provides a unified graph programming interface, enabling domain experts to promptly implement various types of matrix computations. The proposed paradigm leverages the underlying graph processing engine to achieve efficient execution, eliminating the need for HPC expertise in programming large-scale scientific applications. We evaluate the performance of the developed graph compute engine for three typical scientific computing routines. Our results demonstrate that the graph engine-based scientific computing paradigm achieves performance comparable to the best-performing implementations based on existing parallel computing libraries and bespoke implementations. Importantly, the paradigm greatly simplifies the development of scientific computations on large-scale platforms, reducing the programming difficulty for scientists and facilitating broader adoption of scientific computing.
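To make the matrix-to-graph mapping concrete, the sketch below (all names invented for illustration; this is not the paper's actual interface) shows how y = A·x becomes one propagation round of a gather-style vertex program: each nonzero A[i][j] is an edge j → i with weight A[i][j], and the product is computed by summing weighted neighbor values at each destination vertex.

```python
# A minimal sketch of the matrix-to-graph mapping. All names here are
# invented for illustration and are not the paper's actual interface.
from collections import defaultdict

def matrix_to_graph(A):
    """Each nonzero A[i][j] becomes a weighted edge j -> i."""
    in_edges = defaultdict(list)  # dst vertex -> [(src vertex, weight), ...]
    for i, row in enumerate(A):
        for j, a_ij in enumerate(row):
            if a_ij != 0:
                in_edges[i].append((j, a_ij))
    return in_edges

def spmv_round(in_edges, x, n):
    """One propagation round of a gather-style vertex program: every
    destination vertex sums the weighted values of its in-neighbors,
    which is exactly y = A @ x."""
    y = [0.0] * n
    for dst, edges in in_edges.items():
        y[dst] = sum(w * x[src] for src, w in edges)
    return y

A = [[4.0, 1.0],
     [0.0, 2.0]]
x = [1.0, 3.0]
print(spmv_round(matrix_to_graph(A), x, len(A)))  # [7.0, 6.0] == A @ x
```

Iterative routines such as Jacobi or PageRank-style solvers then reduce to repeating this round until convergence, which is precisely the execution pattern graph processing engines are optimized for.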
The emergence of large-scale dynamic sets in real applications poses severe challenges for approximate set representation structures. A dynamic set with changing cardinality requires an elastic capacity of the approxi...
Stream Learning (SL) requires models that can quickly adapt to continuously evolving data, posing significant challenges in both computational efficiency and learning accuracy. Effective data selection is critical in ...
Serverless computing has gained significant traction for machine learning inference applications, which are often deployed as serverless workflows consisting of multiple CPU and GPU functions with data dependencies. How...
Although cloud computing saves enterprises and individuals the cost of managing data in various applications on public cloud servers, it also raises the problem of leakage of critical personal information and sen...
The rapid development of large language models (LLMs) has transformed many industries, including healthcare. However, previous medical LLMs have largely focused on leveraging general medical knowledge to provide respo...
ISBN (digital): 9798350387117
ISBN (print): 9798350387124
To improve the throughput of Byzantine Fault Tolerance (BFT) consensus protocols, the Directed Acyclic Graph (DAG) topology has been introduced to parallelize data processing, leading to the development of DAG-based BFT consensus. However, existing DAG-based works heavily rely on Reliable Broadcast (RBC) protocols for block broadcasting, which introduces significant latency due to the three communication steps involved in each RBC. For instance, DAG-Rider, a representative DAG-based protocol, exhibits a best latency of 12 steps, considerably higher than non-DAG protocols like PBFT, which requires only 3 steps. To tackle this issue, we propose LightDAG, which replaces RBC with lightweight broadcasting protocols such as Consistent Broadcast (CBC) and Plain Broadcast (PBC). Since CBC and PBC can be implemented in two and one communication steps, respectively, LightDAG achieves low latency. In our proposal, we present two variants of LightDAG, namely LightDAG1 and LightDAG2, each providing a trade-off between the best latency and the expected worst latency. In LightDAG1, every block is broadcast using CBC, which exhibits a best latency of 5 steps and an expected worst latency of 14 steps. Since CBC cannot guarantee the totality property, we design a block retrieval mechanism in LightDAG1 to assist replicas in retrieving missing blocks. LightDAG2 utilizes a combination of PBC and CBC for block broadcasting, resulting in a best latency of 4 steps and an expected worst latency of 12(t+1) steps, where t represents the number of actual Byzantine replicas. Since a Byzantine replica may equivocate through PBC, LightDAG2 prohibits blocks from directly referencing contradictory blocks. To ensure liveness, we propose a mechanism to identify and exclude Byzantine replicas if they engage in equivocation attacks. Extensive experiments have been conducted to evaluate LightDAG, and the results demonstrate its feasibility and efficiency.
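The latency figures quoted above follow directly from the per-broadcast step counts. The toy sketch below makes the accounting explicit; the step counts and latency bounds are taken from the abstract, while the function names and structure are illustrative and not part of the LightDAG paper.

```python
# A back-of-the-envelope sketch of the latency accounting. Numbers come
# from the abstract above; names and structure are my own assumptions.

BROADCAST_STEPS = {"RBC": 3, "CBC": 2, "PBC": 1}  # communication steps per broadcast

# Best-case end-to-end latencies (in communication steps) as reported:
BEST_LATENCY = {"PBFT": 3, "DAG-Rider": 12, "LightDAG1": 5, "LightDAG2": 4}

def lightdag1_worst_latency():
    """Expected worst latency of LightDAG1 as stated: a flat 14 steps."""
    return 14

def lightdag2_worst_latency(t):
    """Expected worst latency of LightDAG2 as stated: 12(t+1) steps,
    where t is the number of actual Byzantine replicas."""
    return 12 * (t + 1)

if __name__ == "__main__":
    for proto, steps in BEST_LATENCY.items():
        print(f"{proto:10s} best latency: {steps} steps")
    # With t = 2 equivocating replicas, LightDAG2's worst case (36 steps)
    # already exceeds LightDAG1's bounded 14-step worst case.
    print("LightDAG1 worst:", lightdag1_worst_latency())
    print("LightDAG2 worst (t=2):", lightdag2_worst_latency(2))
```

The crossover illustrates the trade-off the abstract describes: LightDAG2 wins in the common case (4 vs. 5 best-case steps) but degrades linearly with the number of equivocating replicas, while LightDAG1 pays one extra step up front for a bounded worst case.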
The proliferation of deep-learning-based mobile and IoT applications has driven the increasing deployment of edge datacenters equipped with domain-specific accelerators. The unprecedented computing power offered by th...
Approximate nearest neighbor search (ANNS) has emerged as a crucial component of database and AI infrastructure. Ever-increasing vector datasets pose significant challenges in terms of performance, cost, and accuracy f...
ISBN (print): 9781665462723
Hypergraph processing has emerged as a powerful approach for analyzing complex multilateral relationships among multiple entities. Past research on building hypergraph systems suggests that changing the scheduling order of bipartite edge tasks can improve the overlap-induced data locality in hypergraph processing. However, due to the complex intertwined connections between vertices and hyperedges, it is almost impossible to find a locality-optimal scheduling order. Thus, these task-centric hypergraph systems often suffer from substantial off-chip communication. In this paper, we first propose a novel data-centric Load-Trigger-Reduce (LTR) execution model to fully exploit the locality in hypergraph processing. Unlike a task-centric model that loads the required data along with a task, our LTR model invokes tasks according to the data that has been loaded. Specifically, once the hypergraph data is loaded into the on-chip memory, all of its relevant computation tasks will be triggered simultaneously to output intermediate results, which are finally reduced to update the final results. Our LTR model enables all hypergraph data to be accessed only once in each iteration. To fully exploit the LTR performance potential, we further architect an LTR-driven hypergraph accelerator, XuLin, which features an adaptive data loading mechanism to minimize the loading cost via chunk merging at runtime. XuLin is also equipped with a priority-based differential data reduction scheme to reduce the impact of conflicting updates on performance. We have implemented XuLin both on a Xilinx Alveo U250 FPGA card and using a cycle-accurate simulator. The results show that XuLin outperforms the state-of-the-art hypergraph processing solutions Hygra and ChGraph by 20.47× and 8.77× on average, respectively.
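The LTR execution model can be illustrated with a short software sketch (invented names; a software analogy rather than XuLin's actual hardware design): each data chunk is loaded exactly once per iteration, all tasks that touch it are triggered, and their partial results are reduced into the final values.

```python
# A minimal, illustrative sketch of the data-centric Load-Trigger-Reduce
# (LTR) idea. Names and data layout are assumptions, not XuLin's design.
from collections import defaultdict

def ltr_iteration(chunks, tasks_by_chunk, reduce_fn, init):
    """chunks: {chunk_id: data}; tasks_by_chunk: {chunk_id: [task_fn, ...]}.
    Each task_fn(data) yields (key, partial) pairs; reduce_fn folds them."""
    results = defaultdict(lambda: init)
    for cid, data in chunks.items():          # Load: each chunk read once
        for task in tasks_by_chunk[cid]:      # Trigger: all tasks on this chunk
            for key, partial in task(data):
                results[key] = reduce_fn(results[key], partial)  # Reduce
    return dict(results)

# Toy usage: two chunks of hyperedge data; the task accumulates per-vertex
# contributions, so each chunk is consumed by every task that needs it
# while resident, instead of being re-fetched per task.
chunks = {0: [(1, 2), (2, 3)], 1: [(1, 5)]}   # (vertex, contribution) pairs
degree_task = lambda data: iter(data)
tasks = {0: [degree_task], 1: [degree_task]}
print(ltr_iteration(chunks, tasks, lambda a, b: a + b, 0))  # {1: 7, 2: 3}
```

The contrast with a task-centric model is in the loop order: iterating over data and triggering tasks (rather than iterating over tasks and fetching data) is what bounds off-chip traffic to one pass over the hypergraph per iteration.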