The scale of model parameters and the amount of training data are growing exponentially, and GPU memory requirements grow with them. Recomputation and swapping are the two main memory optimization methods that have been extensively studied, and there are also strategies that combine the two. However, most of them are based on heuristic search, which does not explore the complete solution space and cannot guarantee an optimal solution. An optimal search strategy with tensor-level recomputation and swapping is therefore desirable for large-scale model training. In this paper, we propose an optimal strategy-search algorithm combining tensor-based recomputation and swapping. Specifically, the memory swapping strategy is reformulated as an optimization problem that converts the memory constraints into a mixed integer program, from which the optimal memory optimization strategy is found. By leveraging the advantages of both recomputation and swapping, this approach minimizes computation overhead without exceeding the available memory. Experimental results show that our method reduces memory requirements during training by about 60%. Furthermore, it reduces overall training time beyond existing algorithms. Compared to Checkmate, our approach achieves about a 0.3–0.9% reduction in computation cost per iteration.
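To make the mixed-integer-programming view concrete, here is a minimal sketch of how a per-tensor keep/recompute/swap choice can be posed as a MIP with a memory-budget constraint, using the PuLP solver. The tensor sizes, costs, and budget are illustrative placeholders, and the formulation is a simplified stand-in for the paper's actual model, not its exact encoding.

```python
import pulp

# Illustrative per-tensor data: size in MB, recompute cost (ms), swap cost (ms).
tensors = {
    "act1": (512, 4.0, 9.0),
    "act2": (768, 6.0, 13.0),
    "act3": (256, 2.5, 4.5),
    "act4": (1024, 8.0, 17.0),
}
memory_budget_mb = 1500  # hypothetical budget for resident activations

prob = pulp.LpProblem("tensor_recompute_swap", pulp.LpMinimize)
keep = {t: pulp.LpVariable(f"keep_{t}", cat="Binary") for t in tensors}
reco = {t: pulp.LpVariable(f"recompute_{t}", cat="Binary") for t in tensors}
swap = {t: pulp.LpVariable(f"swap_{t}", cat="Binary") for t in tensors}

# Objective: minimize the extra time paid for recomputation and swapping.
prob += pulp.lpSum(tensors[t][1] * reco[t] + tensors[t][2] * swap[t]
                   for t in tensors)

# Each tensor is handled in exactly one way.
for t in tensors:
    prob += keep[t] + reco[t] + swap[t] == 1

# Tensors kept resident must fit in the memory budget.
prob += pulp.lpSum(tensors[t][0] * keep[t] for t in tensors) <= memory_budget_mb

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t in tensors:
    choice = max((keep[t], "keep"), (reco[t], "recompute"), (swap[t], "swap"),
                 key=lambda v: v[0].value())[1]
    print(t, "->", choice)
```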
When a large-scale distributed interactive simulation system runs on a WAN, the sites are usually geographically dispersed over a wide area, which makes it hard to accurately synchronize the simulation clock of each site with those of the other sites. The asynchronous clocks and the large transmission latency on the WAN make it difficult for large-scale simulations to preserve real-time causal order delivery of received events at each site. In this article, we first analyze an indirect way to compare the values of asynchronous simulation clocks, and then propose a novel scheme that selects reconstructible causal control information for each message so as to ensure the causal ordering of events in real time. Experiments demonstrate that the scheme can weaken the effect of network latency, reduce the transmission overhead of control information, and improve causal order consistency in asynchronous distributed simulations.
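For context, the sketch below shows the standard vector-clock delivery condition for causal ordering, which schemes like the one above aim to improve on; it is a baseline illustration with assumed class and method names, not the proposed reconstructible-control-information scheme.

```python
from collections import deque

class CausalReceiver:
    """Deliver a message from site j carrying vector clock V only when
    V[j] == local[j] + 1 and V[k] <= local[k] for every other site k."""

    def __init__(self, site_id, num_sites):
        self.site_id = site_id
        self.clock = [0] * num_sites
        self.pending = deque()

    def _deliverable(self, sender, vclock):
        if vclock[sender] != self.clock[sender] + 1:
            return False
        return all(vclock[k] <= self.clock[k]
                   for k in range(len(self.clock)) if k != sender)

    def receive(self, sender, vclock, payload):
        # Buffer the message, then deliver everything whose causal
        # predecessors have already been delivered.
        self.pending.append((sender, vclock, payload))
        delivered, progress = [], True
        while progress:
            progress = False
            for msg in list(self.pending):
                s, v, p = msg
                if self._deliverable(s, v):
                    self.clock[s] = v[s]
                    delivered.append(p)
                    self.pending.remove(msg)
                    progress = True
        return delivered
```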
Large-scale floating-point matrix multiplication is a fundamental kernel in many scientific and engineering applications. Most existing work focuses only on accelerating matrix multiplication on FPGA by adopting a linea...
Deep clustering aims to cluster unlabeled data by embedding it into a subspace with a deep model. The key challenge of deep clustering is to learn discriminative representations for high-dimensional input data. In this paper, we present a deep discriminative clustering network for clustering real-world images. We use a convolutional auto-encoder stacked with a softmax layer to predict clustering assignments. To learn discriminative representations, the proposed approach adds a discriminative loss as an embedded regularization based on relative entropy minimization. With the discriminative loss, the network can not only produce clustering assignments but also learn discriminative features by reducing intra-cluster distance and increasing inter-cluster distance. We evaluate the proposed method on three datasets: MNIST-full, YTF and FRGC-v2.0. We outperform state-of-the-art results on MNIST-full and FRGC-v2.0 and achieve a competitive result on YTF. The source code has been made publicly available at .
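As an illustration of a discriminative regularizer built on relative entropy over softmax cluster assignments, the sketch below shows one plausible form of such a loss (confident per-sample assignments, balanced clusters). The paper's exact loss may differ; the function name and inputs are assumptions.

```python
import numpy as np

def discriminative_loss(softmax_probs, eps=1e-12):
    """softmax_probs: (batch, n_clusters) soft assignments from the network."""
    p = np.clip(softmax_probs, eps, 1.0)
    # Conditional entropy: push each sample toward a single cluster
    # (smaller intra-cluster spread in prediction space).
    cond_entropy = -np.mean(np.sum(p * np.log(p), axis=1))
    # Marginal entropy: keep cluster usage balanced across the batch
    # (well-separated, non-collapsed clusters).
    marginal = np.clip(p.mean(axis=0), eps, 1.0)
    marg_entropy = -np.sum(marginal * np.log(marginal))
    # Minimizing (conditional - marginal) sharpens and balances assignments.
    return cond_entropy - marg_entropy

fake_assignments = np.random.dirichlet(np.ones(10), size=32)  # toy batch
print(discriminative_loss(fake_assignments))
```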
DSP processors can be used to solve high-performance computation problems, as they combine high computing performance with low power consumption. Matrix multiplication is the kernel of many scientific and engineering computations, so it is important in both theory and practice. Based on a general purpose DSP (GPDSP), a new parallel algorithm for matrix multiplication is proposed, and a peak performance model for matrix multiplication is built. From the peak performance model, an architecture of the GPDSP is derived, and the parameters of a Tflops-level GPDSP are given, including the number of pipelines, the number of SIMD registers, and the bandwidth and latency of the hierarchical memories.
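As a rough illustration of what a peak performance model involves, the sketch below estimates attainable GFLOPS from hypothetical GPDSP parameters and the arithmetic intensity of a blocked matrix multiply. All numbers and parameter names are illustrative assumptions, not the configuration derived in the paper.

```python
def peak_gflops(macs_per_lane, simd_lanes, num_cores, freq_ghz):
    # Each multiply-accumulate counts as two floating-point operations.
    return 2 * macs_per_lane * simd_lanes * num_cores * freq_ghz

def attainable_gflops(peak, mem_bandwidth_gbs, flops_per_byte):
    # The kernel becomes memory-bound when peak exceeds bandwidth * intensity.
    return min(peak, mem_bandwidth_gbs * flops_per_byte)

# Hypothetical GPDSP parameters.
peak = peak_gflops(macs_per_lane=4, simd_lanes=16, num_cores=8, freq_ghz=1.2)

# A blocked B x B x B matrix multiply does 2*B^3 flops on ~3*B^2 loaded
# elements, so arithmetic intensity grows with the tile size that fits on chip.
tile = 64
intensity = (2 * tile**3) / (3 * tile**2 * 4)   # flops per byte, 4-byte floats
print(peak, attainable_gflops(peak, mem_bandwidth_gbs=100.0,
                              flops_per_byte=intensity))
```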
In this paper, we present OpenMedIA, an open-source toolbox library containing a rich set of deep learning methods for medical image analysis under heterogeneous Artificial Intelligence (AI) computing platforms. Vario...
The document-level event extraction task has achieved significant progress with template generation methods. However, existing template-based generation methods impose no reasonable regulation or restriction on the output, which makes the generation results uncontrollable: in some scenarios the model generates entities that do not belong to the input text, or generates template content repeatedly. This stems from the differing natures of the extraction task and the generation task. To this end, we propose a controllable template generation model for event extraction. According to the characteristics of the template generation and event extraction tasks, the model devises a copy mechanism, an inhibition mechanism and a rejection mechanism under an appropriately constructed template. Our model achieves state-of-the-art results on the MUC-4 dataset, and experimental analysis demonstrates the effectiveness of each proposed mechanism.
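To illustrate the flavor of a copy mechanism that keeps generated entities inside the input text, here is a pointer-generator style sketch; it does not reproduce the paper's actual copy, inhibition, or rejection mechanisms, and all names and values are illustrative.

```python
import numpy as np

def copy_augmented_distribution(vocab_dist, attn_weights, src_token_ids, p_gen):
    """Mix the decoder's vocabulary distribution with a distribution that
    copies tokens from the source document via attention weights."""
    final = p_gen * vocab_dist.copy()
    for attn, tok in zip(attn_weights, src_token_ids):
        final[tok] += (1.0 - p_gen) * attn   # copy mass only for source tokens
    return final / final.sum()

vocab_size = 50
vocab_dist = np.random.dirichlet(np.ones(vocab_size))    # decoder output
src_ids = np.array([3, 7, 7, 12, 30])                    # tokens in the document
attn = np.random.dirichlet(np.ones(len(src_ids)))        # decoder attention
out = copy_augmented_distribution(vocab_dist, attn, src_ids, p_gen=0.4)
print(out.argmax(), round(out.sum(), 6))
```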
How to preserve causal and totally ordered event delivery is an important issue in real-time serverless DVEs (Distributed Virtual Environments). However, most related works are designed to maintain only causal order, or to maintain timestamped order with intensive computation and bandwidth overhead. In this paper, we propose a novel distributed algorithm to maintain the before-and-after relationship between events of a DVE, both causal and concurrent, at each individual node. Several simulation experiments are carried out to evaluate the performance of our algorithm, and the results demonstrate that it is effective in preserving causal and totally ordered event delivery and more efficient than previous algorithms.
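For comparison, the sketch below shows a classic baseline for totally ordered delivery using Lamport timestamps with a node-id tie-break: every node sorts events by (timestamp, sender), which also respects causality because a causally later event always carries a larger timestamp. It does not capture concurrency explicitly and is not the proposed algorithm; names are assumptions.

```python
import heapq

class TotalOrderNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.lamport = 0
        self.queue = []          # min-heap keyed by (timestamp, sender)

    def send_event(self, payload):
        # Local events advance the Lamport clock before being timestamped.
        self.lamport += 1
        return (self.lamport, self.node_id, payload)

    def receive(self, timestamp, sender, payload):
        # Receiving an event advances the clock past the event's timestamp.
        self.lamport = max(self.lamport, timestamp) + 1
        heapq.heappush(self.queue, (timestamp, sender, payload))

    def deliver_up_to(self, stable_timestamp):
        """Deliver events whose (timestamp, sender) key is stable, i.e. no
        node can still send an event that would sort before them."""
        out = []
        while self.queue and self.queue[0][0] <= stable_timestamp:
            out.append(heapq.heappop(self.queue)[2])
        return out
```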
We present Fast-Downsampling MobileNet (FD-MobileNet), an efficient and accurate network for very limited computational budgets (e.g., 10-140 MFLOPs). Our key idea is applying a fast downsampling strategy to the MobileNet framework. In FD-MobileNet, we perform 32× downsampling within 12 layers, only half the layers of the original MobileNet. This design brings three advantages: (i) it remarkably reduces the computational cost; (ii) it increases the information capacity and achieves significant performance improvements; (iii) it is engineering-friendly and provides fast actual inference speed. Experiments on the ILSVRC 2012 and PASCAL VOC datasets demonstrate that FD-MobileNet consistently outperforms MobileNet and achieves comparable results with ShuffleNet under different computational budgets, for instance surpassing MobileNet by 5.5% in ILSVRC 2012 top-1 accuracy and 8.3% in VOC 2007 mAP under a complexity of 12 MFLOPs. On an ARM-based device, FD-MobileNet achieves a 1.11× inference speedup over MobileNet and a 1.82× speedup over ShuffleNet under the same complexity.
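A back-of-the-envelope sketch of why fast downsampling saves computation: the multiply-add count of a depthwise-separable convolution scales with the spatial resolution of its input, so reaching a small resolution in fewer layers removes most of the cost. The channel counts below are illustrative, not FD-MobileNet's actual configuration.

```python
def depthwise_separable_madds(h, w, c_in, c_out, k=3):
    depthwise = h * w * c_in * k * k      # per-channel spatial filtering
    pointwise = h * w * c_in * c_out      # 1x1 channel mixing
    return depthwise + pointwise

# The same layer evaluated at 56x56 versus 28x28 input resolution:
# downsampling one stage earlier cuts its multiply-adds by about 4x.
print(depthwise_separable_madds(56, 56, 64, 128))
print(depthwise_separable_madds(28, 28, 64, 128))
```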
Due to the large message transmission latency in Distributed Virtual Environments (DVEs) on a Wide Area Network (WAN), the effectiveness of causality consistency control of message ordering is determined not only by the causal order of messages but also by their timeliness. If only causal order is considered, the real-time property of DVEs may not be ensured because of unlimited waiting for delayed messages; if only timeliness is emphasized, too many delayed messages may have to be discarded to maintain the quality of causal message ordering. Therefore, a trade-off between the quality of causal order delivery and timeliness is necessary for DVEs. In this article, a novel causality-based message ordering approach is presented that dynamically balances the demands of causal order delivery and real-time delivery. Experiment results demonstrate that the approach can enhance the quality of causality while simultaneously keeping the real-time property of DVEs.
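To illustrate the kind of trade-off described, the sketch below waits for a message's causal predecessors only up to a deadline that adapts to recent network latency, delivering (or giving up) once the budget expires instead of blocking. It is a hypothetical illustration with assumed names and thresholds, not the paper's control algorithm.

```python
import time

class DeadlineCausalBuffer:
    def __init__(self, base_wait_s=0.05, alpha=0.9):
        self.wait = base_wait_s       # current waiting budget for late messages
        self.alpha = alpha            # smoothing factor for the latency estimate
        self.latency_est = base_wait_s

    def update_latency(self, observed_latency_s):
        # Exponentially weighted estimate of one-way network latency.
        self.latency_est = (self.alpha * self.latency_est
                            + (1 - self.alpha) * observed_latency_s)
        # Waiting much longer than a couple of latencies hurts real-timeness.
        self.wait = min(2 * self.latency_est, 0.2)

    def should_deliver(self, arrival_time_s, predecessors_arrived):
        """Deliver in causal order when possible; otherwise deliver once the
        waiting budget has expired rather than block the simulation."""
        if predecessors_arrived:
            return True
        return time.monotonic() - arrival_time_s > self.wait
```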