Search Results - Inner Mongolia University Library

21st IEEE International Symposium on Parallel and Distributed Processing with Applications, 13th IEEE International Conference on Big Data and Cloud Computing, 16th IEEE International Conference on Social Computing and Networking and 13th International Conference on Sustainable Computing and Communications, ISPA/BDCloud/SocialCom/SustainCom 2023

Authors: Xu, Yuanchao; Qian, Ruyi; Wang, Yida; Huo, Qirun. Affiliations: Capital Normal University, College of Information Engineering, Beijing, China; SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China

ISBN: (print) 9798350329223

The amazing success of deep neural networks benefits from the rise of big data. As deep learning models grow larger than ever before, their requirements for memory bandwidth are growing at a tremendous pace. Some AI accelerators adopt a non-uniform memory access (NUMA) architecture to mitigate this issue, which in turn complicates device memory allocation. Although extensive studies have been conducted on how to mitigate resource contention and reduce latency, almost all of them target CPU-oriented NUMA systems rather than AI accelerators, where memory allocation precedes task scheduling. The current memory allocator generally adopts an interleaved memory allocation strategy, which is very easy to implement but far from *** tackle this issue, this paper proposes iNUMAlloc, an intelligent memory allocator specialized for AI accelerators with NUMA architecture that combines program behavior with predictable hardware resources. Preliminary evaluation shows that it can help to improve the accuracy and efficiency of memory allocation, thereby achieving stable execution time. © 2023 IEEE.

Keywords: Deep neural networks

EagerReuse: An Efficient Memory Reuse Approach for Complex Computational Graph

29th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2023

Authors: Qian, Ruyi; Cao, Bojun; Gao, Mengjuan; Shi, Qinwen; Wang, Yida; Xu, Yuanchao; Huo, Qirun; Qiu, Keni. Affiliations: Capital Normal University, College of Information Engineering, Beijing, China; Institute of Computing Technology, CAS, SKL of Computer Architecture, Beijing, China

ISBN: (print) 9798350330717

Memory reuse is a promising approach for reducing the memory consumption of deep neural networks (DNNs) because it does not introduce any additional runtime overhead. We observe that existing memory reuse algorithms consider only the effect of an individual data feature (either tensor size or tensor lifetime) on memory reuse and ignore the relative position relationship (RPR) among tensors. As computational graphs grow slightly more complex, these algorithms no longer mine memory reuse opportunities sufficiently. To address this issue, we propose a new memory reuse algorithm, EagerReuse, which can exploit more memory reuse opportunities by analyzing the RPR among tensors and reusing them as quickly as possible. We evaluated the algorithms with inference models in TensorFlow Model Garden, and the results show that EagerReuse outperforms the state-of-the-art algorithms in three out of seven cases. For more complex computational graphs, EagerReuse can achieve better memory usage with slightly higher but acceptable overhead. © 2023 IEEE.

Keywords: computational graph; memory optimization; memory reuse; memory usage
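
To make the idea of lifetime-based memory reuse concrete, here is a minimal Python sketch of a generic greedy planner that lets tensors with non-overlapping lifetimes share a buffer. It is not the EagerReuse algorithm and does not model the relative position relationship (RPR) described in the abstract; the `Tensor` fields and the `plan_reuse` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int    # bytes required
    start: int   # index of the op that produces the tensor
    end: int     # index of the last op that reads it

def plan_reuse(tensors):
    """Greedy lifetime-based buffer sharing (a generic baseline, not EagerReuse)."""
    buffers = []      # each entry: [capacity, last step at which the buffer is live]
    assignment = {}   # tensor name -> buffer index
    for t in sorted(tensors, key=lambda t: t.start):
        # Candidates: large enough and no longer live when t is produced.
        free = [i for i, (cap, busy_until) in enumerate(buffers)
                if cap >= t.size and busy_until < t.start]
        if free:
            idx = min(free, key=lambda i: buffers[i][0])  # best fit by capacity
            buffers[idx][1] = t.end
        else:
            buffers.append([t.size, t.end])
            idx = len(buffers) - 1
        assignment[t.name] = idx
    total = sum(cap for cap, _ in buffers)
    return assignment, total

if __name__ == "__main__":
    graph = [Tensor("a", 100, 0, 1), Tensor("b", 50, 1, 2),
             Tensor("c", 100, 2, 3), Tensor("d", 50, 3, 4)]
    print(plan_reuse(graph))
```

In this toy chain, tensors whose lifetimes do not overlap end up in the same buffer, so the planned footprint is smaller than the sum of all tensor sizes.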

HiStore: Rethinking Hybrid Index in RDMA-based Key-Value Store

arXiv, 2022

Authors: Han, Shukai; Zhang, Mi; Jiang, Dejun; Xiong, Jin. Affiliation: SKL Computer Architecture, ICT, CAS, China

RDMA (Remote Direct Memory Access) is widely exploited in building key-value stores to achieve ultra-low latency. In RDMA-based key-value stores, indexing takes a large fraction (up to 74%) of the overall operation latency because RDMA enables fast data accesses. However, the single index structure used in existing RDMA-based key-value stores, either hash-based or sorted index, fails to support range queries efficiently while achieving high performance for single-point operations. In this paper, we reconsider the adoption of a hybrid index in RDMA-based key-value stores to combine the benefits of a hash table and a sorted index. We propose HiStore, an RDMA-based key-value store that uses a hash table for single-point lookups and a skiplist for range queries. To maintain strong consistency in a lightweight and efficient way, HiStore introduces index groups, where a skiplist corresponds to a hash table, and asynchronously applies updates to the skiplist within a group. Guided by previous work on using RDMA for key-value services, HiStore deliberately chooses different RDMA primitives to optimize read and write performance. Furthermore, HiStore tolerates the failures of servers that maintain index structures by replicating indexes for high availability. Our evaluation results demonstrate that HiStore improves the performance of both GET and SCAN operations (by up to 2.03x) with the hybrid index. © 2022, CC BY.
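
The hybrid-index idea in the abstract can be illustrated with a small single-process Python sketch: a dictionary serves point lookups, while a sorted key list (standing in for the skiplist) serves range scans and is updated lazily. This is only an assumption-laden toy, with no RDMA, replication, or index groups; the class `HybridIndex` and its methods are hypothetical names, and applying pending updates before each scan is one simple way to keep scans consistent, not necessarily what HiStore does.

```python
import bisect

class HybridIndex:
    """Toy hybrid index: dict for GET, lazily maintained sorted key list for SCAN."""

    def __init__(self):
        self._hash = {}      # key -> value, updated synchronously on every put
        self._sorted = []    # sorted keys, stands in for the skiplist
        self._pending = []   # new keys not yet applied to the sorted index

    def put(self, key, value):
        if key not in self._hash:
            self._pending.append(key)   # defer the sorted-index update
        self._hash[key] = value

    def get(self, key):
        return self._hash.get(key)

    def _apply_pending(self):
        for key in self._pending:
            bisect.insort(self._sorted, key)
        self._pending.clear()

    def scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi in key order."""
        self._apply_pending()           # catch up before serving the range query
        start = bisect.bisect_left(self._sorted, lo)
        stop = bisect.bisect_right(self._sorted, hi)
        return [(k, self._hash[k]) for k in self._sorted[start:stop]]

if __name__ == "__main__":
    idx = HybridIndex()
    for k, v in [(5, "a"), (1, "b"), (3, "c"), (9, "d")]:
        idx.put(k, v)
    print(idx.get(3), idx.scan(2, 6))
```

The design choice shown is that writes only touch the hash table on the critical path, while the ordered structure absorbs updates later.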

CDFGNN: a Systematic Design of Cache-based Distributed Full-Batch Graph Neural Network Training with Communication Reduction

arXiv, 2024

Authors: Zhang, Shuai; Jiang, Zite; You, Haihang. Affiliations: Meituan, Beijing, China; SKL Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Graph neural network training is mainly categorized into mini-batch and full-batch training methods. The mini-batch training method samples subgraphs from the original graph in each iteration. This sampling operation introduces extra computation overhead and reduces the training accuracy. Meanwhile, the full-batch training method calculates the features and corresponding gradients of all vertices in each iteration, and therefore has higher convergence accuracy. However, in a distributed cluster, frequent remote accesses of vertex features and gradients lead to huge communication overhead, thus restricting the overall training efficiency. In this paper, we introduce the cache-based distributed full-batch graph neural network training framework (CDFGNN). We propose an adaptive cache mechanism that reduces remote vertex accesses by caching the historical features and gradients of neighbor vertices. In addition, we further reduce the communication overhead by quantizing the messages and designing a graph partition (GP) algorithm for the hierarchical communication architecture. Experiments show that the adaptive cache mechanism reduces remote vertex accesses by 63.14% on average. Combined with communication quantization and the hierarchical GP algorithm, CDFGNN outperforms the state-of-the-art distributed full-batch training frameworks by 30.39% in our experiments. Our results indicate that CDFGNN has great potential in accelerating distributed full-batch GNN training tasks. © 2024, CC BY.

Keywords: Graph neural networks

NEURAL PROGRAM SYNTHESIS WITH QUERY

10th International Conference on Learning Representations, ICLR 2022

Authors: Huang, Di; Zhang, Rui; Hu, Xing; Zhang, Xishan; Jin, Pengwei; Li, Nan; Du, Zidong; Guo, Qi; Chen, Yunji. Affiliations: SKL of Computer Architecture, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China; University of Science and Technology of China, China; Cambricon Technologies

Aiming to find a program satisfying the user intent given input-output examples, program synthesis has attracted increasing interest in the area of machine learning. Despite the promising performance of existing methods, most of their success comes from the privileged information of well-designed input-output examples. However, providing such input-output examples is unrealistic because it requires the users to have the ability to describe the underlying program with a few input-output examples under the training distribution. In this work, we propose a query-based framework that trains a query neural network to generate informative input-output examples automatically and interactively from a large query space. The quality of the query depends on the amount of mutual information between the query and the corresponding program, which can guide the optimization of the query framework. To estimate the mutual information more accurately, we introduce the functional space (F-space), which models the relevance between the input-output examples and the programs in a differentiable way. We evaluate the effectiveness and generalization of the proposed query-based framework on the Karel task and the list processing task. Experimental results show that the query-based framework can generate informative input-output examples that match and even outperform well-designed input-output examples. © 2022 ICLR 2022 - 10th International Conference on Learning Representations. All rights reserved.

PR-Sketch: Monitoring per-key aggregation of streaming data with nearly full accuracy

47th International Conference on Very Large Data Bases, VLDB 2021

Authors: Sheng, Siyuan; Huang, Qun; Wang, Sa; Bao, Yungang. Affiliations: University of Chinese Academy of Sciences; SKL of Computer Architecture, ICT, CAS, China; Peking University, China

Computing per-key aggregation is indispensable in streaming data analysis, which is formulated as two phases: an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information is useful in many applications such as anomaly detection, attack prevention, and online diagnosis. Even though many algorithms have been proposed for per-key aggregation in stream processing, their accuracy guarantees only cover a small portion of keys. In this paper, we aim to achieve nearly full accuracy with limited resource usage. We follow the line of sketch-based techniques. We observe that existing methods suffer from high errors for most keys. The reason is that they track keys with a complicated mechanism in the update phase and simply calculate per-key aggregation from some specific counter in the recovery phase. Therefore, we present PR-Sketch, a novel sketching design to address the two limitations. PR-Sketch builds linear equations between counter values and per-key aggregations to improve accuracy, and records keys in the recovery phase to reduce resource usage in the update phase. We also provide an extension called fast PR-Sketch to improve processing rate further. We derive space complexity, time complexity, and guaranteed error probability for both PR-Sketch and fast PR-Sketch. We conduct trace-driven experiments under 100K keys and 1M items to compare our algorithms with multiple state-of-the-art methods. Results demonstrate the resource efficiency and nearly full accuracy of our algorithms. © by the owner/author(s).

Keywords: Anomaly detection
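
The phrase "linear equations between counter values and per-key aggregations" can be illustrated with a small Python sketch. Each key adds its value to one counter per hash row during the update phase; in the recovery phase, every counter yields one equation (the counter equals the sum of the aggregations of the keys mapped to it), and the system is solved for the per-key totals. This is a toy under stated assumptions, not the PR-Sketch design: the hash functions in `HASHES`, the width `WIDTH`, and the function names are hypothetical, the key set is assumed to be known, and the system is assumed to have full column rank so that the exact solve succeeds.

```python
from fractions import Fraction

WIDTH = 3
# Toy hash rows chosen for illustration only (an assumption, not from the paper).
HASHES = [lambda k: k % WIDTH, lambda k: (k // 2) % WIDTH]

def update(counters, key, value):
    """Update phase: add value to one counter in every hash row."""
    for r, h in enumerate(HASHES):
        counters[r][h(key)] += value

def recover(counters, keys):
    """Recovery phase: solve the normal equations (A^T A) x = A^T b exactly."""
    cells = [(r, c) for r in range(len(HASHES)) for c in range(WIDTH)]
    A = [[1 if HASHES[r](k) == c else 0 for k in keys] for r, c in cells]
    b = [counters[r][c] for r, c in cells]
    n = len(keys)
    # Augmented matrix [A^T A | A^T b] with exact rational arithmetic.
    M = [[Fraction(sum(A[i][p] * A[i][q] for i in range(len(cells)))) for q in range(n)]
         + [Fraction(sum(A[i][p] * b[i] for i in range(len(cells))))] for p in range(n)]
    for col in range(n):                       # Gauss-Jordan elimination
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return {k: M[i][n] for i, k in enumerate(keys)}

if __name__ == "__main__":
    counters = [[0] * WIDTH for _ in HASHES]
    for key, value in [(0, 5), (1, 2), (3, 7), (4, 1), (2, 4), (0, 1)]:
        update(counters, key, value)
    print(recover(counters, [0, 1, 2, 3, 4]))
```

Even though five keys share six counters and collide in every row, the combined equations pin down each key's total in this example, which is the intuition behind using all counters together rather than reading a single one.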

A Transpose-free Three-dimensional FFT Algorithm on ARM CPUs

23rd IEEE International Conference on High Performance Computing and Communications, 7th IEEE International Conference on Data Science and Systems, 19th IEEE International Conference on Smart City and 7th IEEE International Conference on Dependability in Sensor, Cloud and Big Data Systems and Applications, HPCC-DSS-SmartCity-DependSys 2021

Authors: Chen, Tun; Jia, Haipeng; Li, Zhihao; Li, Chendi; Zhang, Yunquan. Affiliations: SKL of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Huawei Technologies Co. Ltd, Shenzhen, China

ISBN: (print) 9781665494571

In the traditional multi-dimensional FFT, the memory layouts of high-dimensional data are discontinuous. Transposition is introduced to keep high-dimensional data continuous in memory. However, transposition increases memory access and is a hot spot for multi-dimensional FFT. This paper proposes an optimization framework to eliminate explicit transpositions and optimize the three-dimensional (3D) FFT. The framework has three components: 1) it combines the width-first and breadth-first search to optimize the butterfly network of the one-dimensional (1D) FFT; 2) it adopts a column-order algorithm to eliminate data transposition; 3) it adopts a cache-aware blocking algorithm to make better use of the hardware resources of the ARM architecture. Based on this optimized framework, a multi-dimensional FFT library named MDFFT is implemented. The experiments demonstrate that MDFFT generally performs better than FFTW and ARMPL on ARM CPUs. © 2021 IEEE.

Keywords: Three-dimensional displays; Smart cities; Layout; Parallel processing; Libraries; Hardware; Optimization

Progressive Join Algorithms Considering User Preference

11th Annual Conference on Innovative Data Systems Research, CIDR 2021

Authors: Ding, Mengsu; Chen, Shimin; Makrynioti, Nantia; Manegold, Stefan. Affiliations: SKL of Computer Architecture, ICT, CAS; University of Chinese Academy of Sciences, China; CWI, Amsterdam, Netherlands

Progressive query processing is a new attractive paradigm for exploratory data analysis. This paper considers the case where users want to receive results ordered according to their preference, and specifically focuses on the design of join algorithms. We investigate the use of contour lines in progressive algorithms with user preferences, and propose ContourJoin to reduce sorting overhead of progressive preference-aware joins. Experimental results show that compared with the naïve blocking algorithm and the top-k RankJoin algorithm, ContourJoin has superior performance in both early result generation and total result computation. © CIDR *** rights reserved

Density-optimized Intersection-free Mapping and Matrix Multiplication for Join-Project Operations (extended version)

arXiv, 2022

Authors: Huang, Zichun; Chen, Shimin. Affiliations: SKL of Computer Architecture, ICT, CAS; University of Chinese Academy of Sciences, China

A Join-Project operation is a join operation followed by a duplicate-eliminating projection operation. It is used in a large variety of applications, including entity matching, set analytics, and graph analytics. Previous work proposes a hybrid design that exploits the classical solution (i.e., join and deduplication) and MM (matrix multiplication) to process the sparse and the dense portions of the input data, respectively. However, we observe three problems in the state-of-the-art solution: 1) the outputs of the sparse and dense portions overlap, requiring an extra deduplication step; 2) its table-to-matrix transformation makes an over-simplified assumption about the attribute values; and 3) there is a mismatch between the MM employed in BLAS packages and the characteristics of the Join-Project operation. In this paper, we propose DIM3, an optimized algorithm for the Join-Project operation. To address 1), we propose an intersection-free partition method to completely remove the final deduplication step. For 2), we develop an optimized design for mapping attribute values to natural numbers. For 3), we propose the DenseEC and SparseBMM algorithms to exploit the structure of Join-Project for better efficiency. Moreover, we extend DIM3 to consider partial result caching and support Join-op queries, including Join-Aggregate and MJP (Multi-way Joins with Projection). Experimental results using both real-world and synthetic data sets show that DIM3 outperforms previous Join-Project solutions by a factor of 2.3× to 18×. Compared to RDBMSs, DIM3 achieves orders of magnitude speedups. © 2022, CC BY.

Keywords: Mapping
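
The two ways of evaluating a Join-Project that the abstract contrasts can be sketched in a few lines of Python. The first function follows the classical route (hash join, then deduplication); the second maps attribute values to natural numbers, builds 0/1 matrices, and reads the result off a boolean matrix product. This is a plain illustration of the two baselines, not DIM3, DenseEC, or SparseBMM; the relation shapes `R(x, y)` and `S(y, z)` and the function names are assumptions made for the example.

```python
from collections import defaultdict

def join_project_classic(R, S):
    """Hash join R(x, y) with S(y, z) on y, then deduplicate (x, z) via a set."""
    by_y = defaultdict(list)
    for y, z in S:
        by_y[y].append(z)
    return {(x, z) for x, y in R for z in by_y.get(y, ())}

def join_project_matmul(R, S):
    """Same result through a boolean matrix product over index-mapped values."""
    xs = sorted({x for x, _ in R})
    ys = sorted({y for _, y in R} | {y for y, _ in S})
    zs = sorted({z for _, z in S})
    xi = {v: i for i, v in enumerate(xs)}   # attribute value -> natural number
    yi = {v: i for i, v in enumerate(ys)}
    zi = {v: i for i, v in enumerate(zs)}
    A = [[0] * len(ys) for _ in xs]         # A[x][y] = 1 iff (x, y) in R
    for x, y in R:
        A[xi[x]][yi[y]] = 1
    B = [[0] * len(zs) for _ in ys]         # B[y][z] = 1 iff (y, z) in S
    for y, z in S:
        B[yi[y]][zi[z]] = 1
    out = set()
    for i, row in enumerate(A):
        for k in range(len(zs)):
            # Entry (i, k) of the boolean product: any shared y joins x_i and z_k.
            if any(row[j] and B[j][k] for j in range(len(ys))):
                out.add((xs[i], zs[k]))
    return out

if __name__ == "__main__":
    R = [(1, "a"), (1, "b"), (2, "b"), (3, "c")]
    S = [("a", 10), ("b", 10), ("b", 20), ("d", 30)]
    print(sorted(join_project_classic(R, S)))
    print(sorted(join_project_matmul(R, S)))
```

The matrix route needs no explicit deduplication because each output pair corresponds to exactly one matrix cell, which is why it suits dense inputs, while the classical route avoids materializing large matrices for sparse inputs.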

EagerReuse: An Efficient Memory Reuse Approach for Complex Computational Graph

International Conference on Parallel and Distributed Systems (ICPADS)

Authors: Ruyi Qian; Bojun Cao; Mengjuan Gao; Qinwen Shi; Yida Wang; Yuanchao Xu; Qirun Huo; Keni Qiu. Affiliations: College of Information Engineering, Capital Normal University, Beijing, China; SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China

Memory reuse is a promising approach for reducing the memory consumption of deep neural networks (DNNs) because it does not introduce any additional runtime overhead. We observe that existing memory reuse algorithms consider only the effect of an individual data feature (either tensor size or tensor lifetime) on memory reuse and ignore the relative position relationship (RPR) among tensors. As computational graphs grow slightly more complex, these algorithms no longer mine memory reuse opportunities sufficiently. To address this issue, we propose a new memory reuse algorithm, EagerReuse, which can exploit more memory reuse opportunities by analyzing the RPR among tensors and reusing them as quickly as possible. We evaluated the algorithms with inference models in TensorFlow Model Garden, and the results show that EagerReuse outperforms the state-of-the-art algorithms in three out of seven cases. For more complex computational graphs, EagerReuse can achieve better memory usage with slightly higher but acceptable overhead.
