Author affiliations: BUPT, Sch Comp Sci, Natl Pilot Software Engn Sch, Beijing, Peoples R China; BUPT, Beijing Key Lab Intelligent Telecommun Software &, Beijing, Peoples R China; BUPT, Sch Econ & Management, Beijing, Peoples R China; Peking Univ, Dept Comp Sci & Technol, Beijing, Peoples R China; Peking Univ, Key Lab High Confidence Software Technol MOE, Beijing, Peoples R China; HKUST, Dept Comp Sci & Engn, Hong Kong, Peoples R China
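Publication: VLDB JOURNAL (International Journal on Very Large Data Bases)
Year, volume, issue: 2021, Vol. 30, No. 5
Pages: 769-797
Core indexing:
Subject classification: 08 [Engineering]; 0812 [Engineering - Computer Science and Technology (engineering or science degrees may be conferred)]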
Funding: National Key Research and Development Program of China [2018YFB1402600, 2018AAA0101100]; NSFC [U1936104, 61902037, 61832001]; CAAI-Huawei MindSpore Open Fund; Beijing Academy of Artificial Intelligence (BAAI); PKU-Baidu Fund [2019BD006]; Fundamental Research Funds for the Central Universities [2020RC25]; Hong Kong RGC GRF Project; CRF Project [C6030-18G, C1031-18G, C5026-18G]; AOE Project [AoE/E-603/18]; Theme-based project [TRS T41-603/20R]; China NSFC; Guangdong Basic and Applied Basic Research Foundation [2019B151530001]; Hong Kong ITC ITF grants [ITS/044/18FX, ITS/470/18FX]; Microsoft Research Asia Collaborative Research Grant; Didi-HKUST joint research lab project; Wechat Webank Research Grants
Subject: Random walk; Memory efficient; Graph algorithm; Large-scale
Abstract: The second-order random walk is an important technique for graph analysis. Many applications, including graph embedding, proximity measurement and community detection, use it to capture higher-order patterns in the graph and thereby improve model accuracy. However, the memory explosion problem of this technique prevents it from being applied to large graphs. When processing a billion-edge graph such as Twitter, existing solutions for the second-order random walk (e.g., the alias method) may take up 1796 TB of memory. This high memory consumption comes from memory-unaware strategies for node sampling during the random walk. In this paper, to clearly compare the efficiency of various node sampling methods, we first design a cost model and propose two new node sampling methods: one follows the acceptance-rejection paradigm to achieve a better balance between memory and time cost, and the other is optimized for fast sampling from the skewed probability distributions that exist in natural graphs. Second, to achieve high efficiency for the second-order random walk within arbitrary memory budgets, we propose a novel memory-aware framework built on the cost model. The framework applies a cost-based optimizer that assigns a suitable node sampling method to each node or edge in the graph, staying within a memory budget while minimizing the time cost of the random walk. Finally, the framework provides general programming interfaces that let users define new second-order random walk models easily. Empirical studies demonstrate that our memory-aware framework is robust with respect to memory and can achieve considerable efficiency while reducing memory cost by 90%.
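To make the acceptance-rejection idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: it assumes an unweighted, undirected graph and a node2vec-style second-order bias with hypothetical parameters `p` and `q`, and all function names (`make_graph`, `bias`, `next_node`, `walk`) are illustrative. The point it shows is that each step needs only the adjacency lists, with no per-edge sampling table of the kind an alias-based approach would precompute.

```python
import random


def make_graph(edges):
    """Build adjacency lists (for sampling) and sets (for membership tests)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    adj_sets = {u: set(ns) for u, ns in adj.items()}
    return adj, adj_sets


def bias(adj_sets, prev, cand, p, q):
    """node2vec-style weight of moving to `cand` given the previous node (assumed model)."""
    if cand == prev:
        return 1.0 / p          # return to the previous node
    if cand in adj_sets[prev]:
        return 1.0              # candidate is also a neighbor of prev
    return 1.0 / q              # candidate moves away from prev


def next_node(adj, adj_sets, prev, cur, p, q, rng):
    """Acceptance-rejection step: propose uniformly, accept with weight / upper bound."""
    upper = max(1.0, 1.0 / p, 1.0 / q)
    while True:
        cand = rng.choice(adj[cur])
        if rng.random() * upper < bias(adj_sets, prev, cand, p, q):
            return cand


def walk(adj, adj_sets, start, length, p=1.0, q=1.0, seed=0):
    """Generate a walk with `length` nodes (length >= 2) starting at `start`."""
    rng = random.Random(seed)
    # The first step has no previous node, so it is sampled uniformly.
    path = [start, rng.choice(adj[start])]
    while len(path) < length:
        path.append(next_node(adj, adj_sets, path[-2], path[-1], p, q, rng))
    return path


adj, adj_sets = make_graph([(0, 1), (1, 2), (2, 0), (2, 3)])
print(walk(adj, adj_sets, start=0, length=10, p=0.5, q=2.0, seed=1))
```

The trade-off in this sketch is the one the abstract describes: the extra memory per step is constant, but the expected number of proposals grows when the weights are far from the upper bound, which is why a cost model is needed to decide where such a sampler is worthwhile.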