The ability of servers to execute tasks effectively within Cloud datacenters varies due to heterogeneous CPU and memory capacities, resource contention, network configurations, and operational age. Unexpectedly slow server nodes (node-level stragglers) cause their assigned tasks to become task-level stragglers, which dramatically impede parallel job execution. However, it is currently unknown how slow nodes correlate with task straggler manifestation. To address this knowledge gap, we propose a method for node performance modeling and ranking in Cloud datacenters based on analyzing parallel job execution tracelog data. Using a production Cloud system as a case study, we demonstrate that node execution performance is driven by temporal changes in node operation rather than by node hardware capacity. We filter different sample sets to evaluate the generality of our framework, and the analytic results demonstrate that a node's ability to execute parallel tasks tends to follow a 3-parameter log-logistic distribution. Statistical attributes such as confidence intervals, quantile values, and extreme-case probabilities can then be used to rank nodes and identify potential straggler nodes within the cluster. We apply a graph-based algorithm to partition server nodes into five levels, identifying 0.83% of nodes as node-level stragglers. Our work lays the foundation for enhancing scheduling algorithms by avoiding slow nodes, reducing task straggler occurrence, and improving parallel job performance.
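To make the distribution-fitting and ranking step concrete, here is a minimal Python sketch. It assumes per-node task execution times extracted from tracelogs (synthetic here); scipy's fisk distribution is the log-logistic family, and fitting it with a free loc parameter gives the 3-parameter form the abstract mentions. The node names, sample sizes, and the 95th-percentile ranking criterion are illustrative, not the paper's exact procedure.

```python
# Sketch: fit a 3-parameter log-logistic distribution to per-node task
# durations and rank nodes by a fitted quantile. `node_durations` is
# illustrative synthetic data, not the production tracelog in the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated task execution times (seconds) for a set of nodes.
node_durations = {
    f"node{i}": rng.lognormal(mean=3.0, sigma=0.4, size=500) for i in range(20)
}
node_durations["node19"] *= 2.5  # one deliberately slow node

scores = {}
for node, durations in node_durations.items():
    # scipy's `fisk` is the log-logistic family; fitting with a free
    # `loc` yields the 3-parameter form.
    c, loc, scale = stats.fisk.fit(durations)
    # Rank nodes by their fitted 95th-percentile execution time.
    scores[node] = stats.fisk.ppf(0.95, c, loc=loc, scale=scale)

ranked = sorted(scores.items(), key=lambda kv: kv[1])
print("slowest node by fitted 95th percentile:", ranked[-1])
```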
ISBN (Print): 9781665443326
With the increasing adoption of graph neural networks (GNNs) in the graph-based deep learning community, various graph programming frameworks and models have been developed to improve the productivity of GNNs. Current GNN frameworks rely on the GPU as an essential tool to accelerate GNN training. However, it is still challenging to train GNNs on large graphs with limited GPU memory. Unlike traditional neural networks, generating mini-batch data by sampling in GNNs requires complicated tasks such as traversing the graph to select neighboring nodes and gathering their features. This process takes up most of the training time, and we find that the main bottleneck is transferring node features from CPU to GPU over limited bandwidth. In this paper, we propose Reusing Batch Data, a method that addresses this data-transmission problem. The method exploits the similarity between adjacent mini-batches to reduce repeated data transmission from CPU to GPU. Furthermore, to reduce the overhead this method introduces, we design a fast GPU-based algorithm that detects repeated nodes' data with short additional computation time. Evaluations on three representative GNN models show that our method reduces transmission time by up to 60% and speeds up end-to-end GNN training by up to 1.79× over state-of-the-art baselines. In addition, Reusing Batch Data effectively reduces the GPU memory footprint by about 19% to 40% while still reducing training time compared to a static cache strategy.
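A rough sketch of the reuse idea, in plain NumPy: consecutive mini-batches share many nodes, so only the features of nodes absent from the previous batch need to cross the CPU-to-GPU link. The load_batch helper, the dictionary cache, and the batch sizes are hypothetical stand-ins; the paper's GPU-side detection algorithm is not reproduced here.

```python
# Sketch of the batch-reuse idea: keep the previous mini-batch's features
# resident on the device and transfer only features of nodes that did not
# appear in the previous batch. Pure NumPy stand-in; a real implementation
# would hold `device_cache` in GPU memory.
import numpy as np

features = np.random.rand(10_000, 128).astype(np.float32)  # host feature table

def load_batch(batch_ids, prev_ids, device_cache):
    reused_mask = np.isin(batch_ids, prev_ids)
    new_ids = batch_ids[~reused_mask]
    # Only the new nodes' rows cross the (simulated) CPU->GPU link.
    transferred = features[new_ids]
    for nid, row in zip(new_ids, transferred):
        device_cache[nid] = row
    batch_feats = np.stack([device_cache[nid] for nid in batch_ids])
    print(f"reused {reused_mask.sum()}/{len(batch_ids)} rows, "
          f"transferred {len(new_ids)}")
    return batch_feats

cache = {}
prev = np.array([], dtype=np.int64)
for _ in range(3):
    batch = np.random.choice(10_000, size=1024, replace=False)
    load_batch(batch, prev, cache)
    prev = batch
```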
Graph neural networks (GNNs) have become important tools for processing structured graph data and have been successfully applied in multiple graph-based application scenarios. Existing GNN systems adopt sample-based training on large-scale graphs over multiple GPUs. Although they support large-scale graph training, the data-loading overhead of transferring vertex features between CPUs and GPUs remains a bottleneck. In this work, we propose SCGraph, a method that supports high-speed GPU feature caching. SCGraph classifies graph vertices sorted by out-degree. For high out-degree vertices, SCGraph sets graded caches across different GPUs, increasing the overall cache capacity through high-speed NVLink data transmission between them. For low out-degree vertices, SCGraph expands training vertices' neighborhoods in advance to regenerate the cache. We evaluate SCGraph against two state-of-the-art industrial GNN frameworks, DGL and PaGraph, on various benchmarks. Experimental results show that SCGraph improves the GPU cache hit rate by up to 23.6% and achieves up to 1.71× performance speedup over the state-of-the-art baselines while keeping convergence almost unchanged.
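The graded-cache idea can be sketched as follows, with all capacities and the lookup path invented for illustration: vertices are sorted by out-degree, the hottest grade is cached on the local GPU, the next grade is spread over peer GPUs (served over NVLink in the real system), and everything else falls back to host memory.

```python
# Sketch of SCGraph-style graded caching. Capacities, the degree
# distribution, and the lookup path are illustrative assumptions.
import numpy as np

out_degree = np.random.zipf(1.5, size=100_000)   # skewed degree distribution
order = np.argsort(-out_degree)                  # hottest vertices first

LOCAL_CAP, PEER_CAP, NUM_PEERS = 5_000, 5_000, 3
local_cache = set(order[:LOCAL_CAP])
peer_caches = [
    set(order[LOCAL_CAP + i * PEER_CAP: LOCAL_CAP + (i + 1) * PEER_CAP])
    for i in range(NUM_PEERS)
]

def locate(vid):
    """Return where a vertex's features would be served from."""
    if vid in local_cache:
        return "local GPU cache"
    for i, cache in enumerate(peer_caches):
        if vid in cache:
            return f"peer GPU {i} via NVLink"
    return "host memory over PCIe"

for v in np.random.choice(100_000, size=5):
    print(v, "->", locate(v))
```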
With the continuous deepening of Artificial Neural Network (ANN) research, ANN model structures and functions are improving towards diversification and intelligence. However, models are mostly evaluated by the pros and cons of their problem-solving results, and evaluation from the biomimetic aspect of imitating neural networks is lacking. To address this problem, a new ANN model evaluation strategy is proposed in this paper from the perspective of bionics. First, four classical neural network models are illustrated: the Back Propagation (BP) network, Deep Belief Network (DBN), LeNet5 network, and olfactory bionic model (KIII model); the neuron transmission mode and equations, network structure, and weight-updating principle of each model are analyzed. The analysis shows that the KIII model comes closer to the actual biological nervous system than the other models, and the LeNet5 network simulates the nervous system to some extent. Then, evaluation indexes for ANNs are constructed from the perspective of bionics: small-world, synchronization, and chaos characteristics. Finally, the network models are quantitatively analyzed using these evaluation indexes. The experimental results show that the DBN, LeNet5, and BP networks have synchronization characteristics, and the DBN and LeNet5 networks have certain chaotic characteristics, but there is still a certain distance between these three classical neural networks and actual biological neural networks. The KIII model has certain small-world characteristics in structure, and its network also exhibits synchronization and chaotic characteristics. Compared with the DBN, LeNet5, and BP networks, the KIII model is closer to the real biological neural network.
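As an illustration of one of these bionic indexes, the sketch below computes a standard small-world coefficient, sigma = (C/C_rand)/(L/L_rand), with networkx; the Watts-Strogatz test graph merely stands in for a topology extracted from an ANN model, and the paper's exact index definitions may differ.

```python
# Sketch of a small-world index: compare a network's clustering C and
# characteristic path length L against an edge-matched random graph.
# The Watts-Strogatz graph is a placeholder for an ANN-derived topology.
import networkx as nx

G = nx.connected_watts_strogatz_graph(n=200, k=8, p=0.1, seed=1)
R = nx.gnm_random_graph(200, G.number_of_edges(), seed=1)
R = R.subgraph(max(nx.connected_components(R), key=len))  # ensure connectivity

C, C_rand = nx.average_clustering(G), nx.average_clustering(R)
L, L_rand = (nx.average_shortest_path_length(G),
             nx.average_shortest_path_length(R))

sigma = (C / C_rand) / (L / L_rand)
print(f"C={C:.3f}  L={L:.2f}  sigma={sigma:.2f}  (sigma > 1 suggests small-world)")
```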
In this paper, we introduce a generic model for the event matching problem of content-based publish/subscribe systems over structured P2P overlays. In this model, we show that there are three methods (event-oriented, subscription-oriented, and hybrid) by which every matched (event, subscription) pair can meet in a system. By theoretically analyzing the inherent problems of both the event-oriented and subscription-oriented methods, we propose PEM (Popularity-based Event Matching), a variant of the hybrid method. PEM achieves a better trade-off between the event processing load and the subscription storage load of a system. PEM has been verified through both mathematical and simulation-based evaluation.
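A toy sketch of how a popularity-driven hybrid choice could look is given below. The threshold, the popularity table, and the direction of the decision rule are assumptions made for illustration; PEM's actual decision rule and its load analysis are in the paper.

```python
# Toy sketch of a hybrid matching choice: every (event, subscription) pair
# must meet at some rendezvous node in the DHT, and the method picks per
# topic whether events travel to stored subscriptions or vice versa.
# The threshold and the rule's direction are illustrative assumptions.
POPULARITY_THRESHOLD = 0.05  # assumed knob; the paper derives the trade-off

subscription_popularity = {"sports": 0.30, "weather": 0.12, "rare-topic": 0.01}

def placement(topic):
    """Decide how matching for `topic` is organised in the overlay."""
    if subscription_popularity.get(topic, 0.0) >= POPULARITY_THRESHOLD:
        # Many subscribers: store subscriptions at the rendezvous node and
        # route each event there once (subscription-oriented).
        return "subscription-oriented"
    # Few subscribers: routing subscriptions toward events keeps the
    # scattered storage load down (event-oriented).
    return "event-oriented"

for t in subscription_popularity:
    print(t, "->", placement(t))
```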
ISBN (Print): 9780769519197
I/O performance remains a weakness of parallel computing systems today. While this weakness is partly attributed to rapid advances in other system components, the I/O interfaces available to programmers and the I/O methods supported by file systems have traditionally not matched well with the types of I/O operations that scientific applications perform, particularly noncontiguous accesses. The MPI-IO interface allows rich descriptions of the I/O patterns of scientific applications, and implementations such as ROMIO have taken advantage of this ability while remaining limited by underlying file system methods. A method of noncontiguous data access, list I/O, was recently implemented in the Parallel Virtual File System (PVFS). We implement support for this interface in the ROMIO MPI-IO implementation. Through a suite of noncontiguous I/O tests, we compare ROMIO list I/O to ROMIO's current methods of noncontiguous access and find that the list I/O interface provides performance benefits in many noncontiguous cases.
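To illustrate the kind of noncontiguous pattern at stake, here is a minimal mpi4py sketch that builds a strided file view from a derived datatype, so one collective call expresses many (offset, length) pieces. This shows the general MPI-IO interface, not ROMIO's list I/O path; the file name and block geometry are arbitrary.

```python
# Minimal mpi4py sketch of a noncontiguous MPI-IO access: a strided file
# view built from a derived datatype lets one collective write express
# many pieces. Run under mpiexec with mpi4py installed.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank writes 4 blocks of 8 doubles, strided so ranks interleave.
blocks, blocklen = 4, 8
filetype = MPI.DOUBLE.Create_vector(blocks, blocklen, blocklen * comm.Get_size())
filetype.Commit()

buf = np.full(blocks * blocklen, rank, dtype="d")
fh = MPI.File.Open(comm, "strided.out", MPI.MODE_CREATE | MPI.MODE_WRONLY)
# The byte displacement offsets each rank's blocks within the shared view.
fh.Set_view(rank * blocklen * MPI.DOUBLE.Get_size(), MPI.DOUBLE, filetype)
fh.Write_all(buf)
fh.Close()
filetype.Free()
```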
We present a high-performance and memory-efficient hardware implementation of matrix multiplication for dense matrices of any size on FPGA devices. By applying a series of transformations and optimizations to the original serial algorithm, we obtain an I/O- and memory-optimized block algorithm for matrix multiplication on FPGAs. A linear array of processing elements (PEs) is proposed to implement this block algorithm. We show a significant reduction in hardware resource consumption compared to related work while increasing the clock frequency. Moreover, the memory requirement is reduced from O(S^2) to O(S), where S is the block size. Therefore, more PEs can be integrated into the same FPGA device.
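The blocking itself can be shown in a few lines of Python. The sketch below accumulates the product block by block with S x S tiles; the O(S) on-chip memory claim concerns how the linear PE array streams these blocks through the FPGA, which software can only hint at. Matrix sizes and the block size are illustrative.

```python
# Sketch of the S x S block algorithm behind the FPGA design: the product
# is accumulated one block-product at a time, so each step touches only a
# few S-sized working sets. NumPy shows the blocking, not the PE array.
import numpy as np

def block_matmul(A, B, S):
    n = A.shape[0]                      # assume n is a multiple of S
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, S):
        for j in range(0, n, S):
            for k in range(0, n, S):
                # One block-product per step; on the FPGA this maps onto
                # the linear array of processing elements.
                C[i:i+S, j:j+S] += A[i:i+S, k:k+S] @ B[k:k+S, j:j+S]
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(block_matmul(A, B, S=32), A @ B)
```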
In this paper, we present an automatic synthesis framework that maps loop nests to processor arrays with local memories on FPGAs. An affine transformation approach is first proposed to address the space-time mapping problem. A data-driven architecture model is then introduced to enable automatic generation of processor arrays, by extracting this model from the transformed loop nests. Techniques for memory allocation, communication generation, and control generation are presented. Synthesizable RTL code can be generated directly from the architecture model built with these techniques. A preliminary synthesis tool is implemented on top of PLUTO, an automatic polyhedral source-to-source transformation and parallelization framework.
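A minimal sketch of the affine space-time mapping step, with an invented schedule: each iteration (i, j) of a 2-D loop nest is assigned a firing time and a processor by affine functions, here time = i + j and processor = j. The framework in the paper derives legal mappings from the dependences rather than assuming one.

```python
# Sketch of affine space-time mapping: iteration (i, j) -> (time, proc)
# via affine functions. The schedule here is illustrative, not derived.
from collections import defaultdict

N = 4
schedule = defaultdict(list)
for i in range(N):
    for j in range(N):
        time, proc = i + j, j            # affine schedule and allocation
        schedule[time].append((proc, (i, j)))

for t in sorted(schedule):
    fired = ", ".join(f"PE{p}:{it}" for p, it in sorted(schedule[t]))
    print(f"t={t}: {fired}")
```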
Grid computing presents a new trend to distributed computation and Internet applications, which can construct a virtual single image of heterogeneous resources, provide uniform application interface and integrate wide...
ISBN (Print): 9781479909735
In online social networks, the social influence of a user reflects his or her reputation or importance, either in the whole network or relative to a personalized user. Social influence analysis can be used in many real applications, such as link prediction, friend recommendation, and personalized search. Personalized PageRank, which ranks nodes according to the probabilities that a random walk starting from a personalized node stops at each node, is one of the most popular metrics for influence analysis. In this paper, we study the problem of inverse influence in online social networks. Different from Personalized PageRank, the inverse influence for a personalized node ranks nodes according to the probabilities that random walks starting from all nodes stop at the personalized node within a limited number of steps. We propose two computation models for inverse influence, one random walk based and one path based. Both models have high computational complexity and cannot be used on large graphs, so we propose a Monte Carlo based approximation algorithm. Experiments on synthetic and real-world datasets show that our algorithm has equal or even better accuracy than related approaches in link prediction, and thus can be used for friend recommendation in online social networks.
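A minimal Monte Carlo sketch of the inverse-influence estimate, under assumptions: for each node u, it approximates the probability that a random walk from u reaches the personalized target within a limited number of steps. The toy graph, walk length, and trial count are illustrative, and the paper's estimator may differ (e.g., in its per-step stopping rule).

```python
# Monte Carlo sketch of inverse influence: for every node u, estimate the
# probability that a random walk started at u hits the personalized
# target within MAX_STEPS steps, then rank nodes by that estimate.
import random

graph = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [4], 4: [0]}
TARGET, MAX_STEPS, TRIALS = 0, 5, 20_000

def hit_probability(start):
    hits = 0
    for _ in range(TRIALS):
        node = start
        for _ in range(MAX_STEPS):
            node = random.choice(graph[node])
            if node == TARGET:
                hits += 1
                break
    return hits / TRIALS

ranking = sorted(((hit_probability(u), u) for u in graph), reverse=True)
for p, u in ranking:
    print(f"node {u}: estimated inverse influence on {TARGET} = {p:.3f}")
```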