ISBN (print): 9781450365536
Communication is a bottleneck when scaling the number of workers in distributed deep learning. One solution is to compress the exchanged gradients into a sparse format via gradient sparsification. We found that the send cost of the server, which is the aggregated size of the sparse gradients, can be reduced through the gradient selection performed by the workers. Following the observation that only a few gradients are significantly large, and only for a short period of time, we propose several gradient selection algorithms based on different metrics. Experiments showed that our proposed method reduces the aggregated size at the server, and the resulting reduction in time per iteration makes the convergence rate faster than that of traditional sparsification.
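The paper's own selection metrics are not reproduced in this listing; magnitude-based top-k selection is a common baseline for gradient sparsification, and a minimal sketch of that baseline (all names hypothetical) looks like this:

```python
import numpy as np

def topk_sparsify(grad, k):
    # Keep only the k largest-magnitude entries of the gradient.
    # This is a common sparsification baseline, not the paper's
    # own selection metric.
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def desparsify(idx, values, shape):
    # Reconstruct a dense gradient from the sparse (index, value) pairs,
    # e.g. on the server after aggregation.
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = values
    return flat.reshape(shape)
```

Only the `(idx, values)` pairs travel over the network, so the send cost scales with `k` rather than with the full gradient size.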
ISBN (digital): 9783030050573
ISBN (print): 9783030050573; 9783030050566
This paper presents new multi-objective scheduling strategies implemented in Docker SwarmKit, a container toolkit for orchestrating distributed systems at any scale. Currently, Docker SwarmKit has a single scheduling strategy, called Spread, which uses only one objective to select, from a set of cloud nodes, the node on which to execute a container. However, the containers that users submit to Docker SwarmKit are configured according to multiple criteria, such as the number of CPUs and the memory size. To better address this multi-objective configuration problem, we introduce the concept and implementation of new multi-objective scheduling strategies adapted to Cloud Computing environments and implemented in Docker SwarmKit. The principle of our multi-objective strategies is to select a node offering a good compromise between the criteria on which to execute a container. The proposed scheduling strategies combine the PROMETHEE and Kung multi-objective decision algorithms to place containers. The implementation in Docker SwarmKit and experiments with our new strategies demonstrate the potential of our approach under different scenarios.
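The core of Kung-style multi-objective selection is filtering the candidate nodes down to the non-dominated (Pareto) set before picking a compromise. A minimal sketch with hypothetical node attributes (`free_cpus`, `free_mem`; the paper's actual criteria and PROMETHEE ranking step are not reproduced here):

```python
def pareto_front(nodes):
    # nodes: list of (free_cpus, free_mem) tuples; larger is better.
    # Returns the non-dominated set via a simple O(n^2) filter.
    # Kung's divide-and-conquer algorithm achieves O(n log n)
    # for two objectives.
    front = []
    for a in nodes:
        dominated = any(
            b != a and b[0] >= a[0] and b[1] >= a[1]
            for b in nodes
        )
        if not dominated:
            front.append(a)
    return front
```

A single-objective strategy like Spread would pick one extreme; a multi-objective strategy instead chooses among the Pareto-optimal nodes, e.g. by a PROMETHEE-style outranking score.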
ISBN (print): 9781538647882
Caffe is a deep learning framework, originally developed at UC Berkeley and widely used in large-scale industrial applications such as vision, speech, and multimedia. It supports many types of deep learning architectures, such as CNNs (convolutional neural networks), geared towards image classification and image recognition. In this paper we develop a platform for the efficient deployment and acceleration of the Caffe framework on embedded systems based on the Zynq SoC. The most computationally intensive part of image classification is the processing of the convolution layers of the deep learning algorithms, and more specifically the GEMM (general matrix multiplication) function calls. In the proposed framework, a hardware accelerator has been implemented, validated, and optimized using the Xilinx SDSoC Development Environment to perform the GEMM function. The accelerator achieves up to a 98x speed-up compared with the plain ARM CPU implementation. The results show that mapping Caffe onto the FPGA-based Zynq takes advantage of the low-power, customizable, and programmable fabric, and ultimately reduces the time and power consumption of image classification.
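The reason GEMM dominates convolution layers is the standard im2col lowering: each receptive field is unrolled into a column, turning the whole convolution into one matrix multiply. A minimal NumPy sketch of that lowering (stride 1, no padding; this illustrates the technique, not Caffe's or the accelerator's actual code):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    # x: (C, H, W) input; w: (K, C, R, S) filters; stride 1, no padding.
    # im2col: each receptive field becomes one column of `cols`,
    # so the convolution reduces to a single GEMM call.
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, OH * OW))
    for i in range(OH):
        for j in range(OW):
            cols[:, i * OW + j] = x[:, i:i + R, j:j + S].ravel()
    # GEMM: (K, C*R*S) x (C*R*S, OH*OW) -> (K, OH*OW)
    out = w.reshape(K, -1) @ cols
    return out.reshape(K, OH, OW)
```

Offloading this single `@` (the GEMM) to the FPGA fabric is what yields the reported speed-up, since everything else in the layer is bookkeeping.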
ISBN (print): 9783030050511; 9783030050504
Visibility computing is a basic problem in computer graphics and is often the bottleneck in realistic rendering algorithms. Some of the most common applications include determining the objects visible from a viewpoint, virtual reality, real-time simulation, and 3D interactive design. As a technique to accelerate rendering, visibility computing has gained great attention in recent years. Traditional visibility computing on a single-processor machine can no longer handle increasingly large and complex scenes due to its lack of parallelism. However, designing parallel algorithms on a cluster faces many challenges: imbalanced workloads among compute nodes, complicated mathematical models, and differing domain knowledge. In this paper, we propose an efficient and highly scalable framework for visibility computing on the Tianhe-2 supercomputer. First, a new technique called hemispheric visibility computing is designed, which overcomes the visibility missed by traditional perspective algorithms. Second, a distributed parallel algorithm for visibility computing is implemented, based on a master-worker architecture. Finally, we discuss the granularity of visibility computing and some optimization strategies for improving overall performance. Experiments on the Tianhe-2 supercomputer show that our distributed parallel visibility computing framework reaches almost linear speedup using up to 7,680 CPU cores.
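The master-worker pattern with tunable granularity can be sketched in a few lines; this is a generic illustration with hypothetical names, not the paper's MPI-scale implementation, and the per-chunk function merely stands in for the actual visibility computation:

```python
from concurrent.futures import ThreadPoolExecutor

def visible_fraction(chunk):
    # Hypothetical per-chunk work, standing in for hemispheric
    # visibility computation over a subset of the scene.
    return sum(chunk) / len(chunk)

def master(tasks, workers=4, granularity=2):
    # Master-worker pattern: split the work into chunks whose size
    # (the granularity) trades scheduling overhead against load
    # balance, then farm the chunks out to the workers.
    chunks = [tasks[i:i + granularity]
              for i in range(0, len(tasks), granularity)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(visible_fraction, chunks))
```

Too fine a granularity inflates communication; too coarse a granularity leaves workers idle, which is exactly the trade-off the paper tunes on Tianhe-2.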
ISBN (print): 9781728103594
Neuromorphic hardware like SpiNNaker offers massive parallelism and efficient communication of small payloads to accelerate the simulation of spiking neurons in neural networks. In this paper, we demonstrate that this hardware is also beneficial for other applications that require massive parallelism and the large-scale exchange of small messages. More specifically, we study the scalability of PageRank on SpiNNaker and compare it to an implementation on traditional hardware. In our experiments, we show that PageRank on SpiNNaker scales better than on traditional multicore architectures.
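PageRank is a natural fit for small-message hardware because each iteration only exchanges one scalar per edge. A minimal power-iteration sketch on a dense adjacency matrix (illustrative only; the paper maps this message exchange onto SpiNNaker's cores rather than using NumPy):

```python
import numpy as np

def pagerank(adj, d=0.85, iters=50):
    # Power-iteration PageRank: each node repeatedly sends its
    # rank, divided by its out-degree, along its outgoing edges.
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1  # avoid division by zero for sink nodes
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - d) / n + d * (adj.T @ (rank / out_deg))
    return rank
```

The per-edge messages are tiny (a single float), which is why the workload matches SpiNNaker's strength in many small payloads rather than bulk transfers.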
Data movement is a major bottleneck for efficiency and energy consumption in the large-scale sparse matrix computations commonly used in linear solvers, eigensolvers, and graph analytics. We introduce a novel task-parallel sparse solver framework named DeepSparse, which adopts a fully integrated task-parallel approach. DeepSparse differs from existing work in that it takes a holistic approach targeting all computational steps in a sparse solver, rather than narrowing the problem down to small kernels (e.g., SpMM, SpMV). We present the implementation details of DeepSparse and demonstrate its merit on two popular eigensolvers, the LOBPCG and Lanczos algorithms. We observe that DeepSparse achieves 2x-16x fewer cache misses across the different cache layers (L1, L2, and L3) than implementations of the same solvers based on optimized library function calls. We also achieve a 2x-3.9x improvement in execution time when using DeepSparse over the same library versions.
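For reference, the kernel-centric baseline that DeepSparse argues against treats each SpMV as an isolated library call over a CSR matrix; a minimal sketch of that kernel (illustrative, not DeepSparse's code):

```python
import numpy as np

def spmv_csr(data, indices, indptr, x):
    # Sparse matrix-vector product y = A @ x with A in CSR form:
    # `data` holds nonzeros, `indices` their column ids, and
    # `indptr[row]:indptr[row+1]` delimits each row's entries.
    n = len(indptr) - 1
    y = np.zeros(n)
    for row in range(n):
        start, end = indptr[row], indptr[row + 1]
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y
```

When a solver chains many such calls, the matrix and vectors are streamed from memory anew for each kernel; DeepSparse's holistic task graph instead keeps data resident across steps, which is where the cache-miss reduction comes from.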
ISBN (print): 9783319754772; 9783319754765
In the context of commonsense reasoning, spreading activation is used to select relevant concepts in a graph of commonsense knowledge. When such a graph starts growing, however, the number of relevant concepts selected during spreading activation tends to diminish. The literature has addressed this issue in different ways, but two other important issues have remained under-researched, namely performance and scalability. Both are caused by the fact that many new nodes, i.e., natural language concepts, are continuously integrated into the graph, and both can be addressed by GPU-accelerated computing, which offers high performance by offloading compute-intensive portions of the application to the GPU while the remainder of the code runs on the CPU. To this end, we propose a GPU-friendly method, termed GpSense, designed for massively parallel architectures to accelerate the tasks of commonsense querying and reasoning via subgraph matching. We show that GpSense outperforms the state-of-the-art algorithms and efficiently answers subgraph queries on a large commonsense graph.
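The spreading-activation step itself can be sketched compactly; this is a generic CPU illustration of the technique (decay factor, threshold, and graph names are hypothetical), not GpSense's GPU kernel:

```python
def spread_activation(graph, seeds, decay=0.5, threshold=0.2):
    # Spreading activation over an adjacency-list graph: seed
    # concepts fire with activation 1.0, activation decays as it
    # propagates, and nodes falling below the threshold stop
    # spreading. Returns the selected concepts with their scores.
    activation = {s: 1.0 for s in seeds}
    frontier = list(seeds)
    while frontier:
        nxt = []
        for node in frontier:
            a = activation[node] * decay
            if a < threshold:
                continue
            for nb in graph.get(node, []):
                if a > activation.get(nb, 0.0):
                    activation[nb] = a
                    nxt.append(nb)
        frontier = nxt
    return activation
```

On a large graph every frontier node expands independently, which is exactly the data-parallel structure a GPU exploits.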
ISBN (digital): 9781728156866
ISBN (print): 9781728156873
The computation of Gaussian convolutions is required in several scientific fields, and efficient approximation methods based on Recursive Filters (RFs) have been developed for this purpose. Among them, Gaussian RFs are designed to approximate the Gaussian convolution very efficiently, and their accuracy, as is well known, can be improved by means of the so-called K-iterated Gaussian recursive filters, that is, by the repeated application of the basic RF. Since it is often necessary to handle one-dimensional input signals of large size, a parallel approach becomes mandatory. Recently, we proposed a parallel algorithm for the implementation of the K-iterated first-order Gaussian RF on multicore architectures. Here, using a similar parallelization strategy, based on a domain decomposition with overlapping, we propose a new implementation that exploits, in terms of both accuracy and performance, the capabilities of the GPU (Graphics Processing Unit) in the CUDA environment. Tests and experiments confirm the reliability and the efficiency of the proposed implementation.
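The structure of a K-iterated first-order recursive filter is a causal pass followed by an anti-causal pass, repeated K times. A minimal sketch of that structure (the paper's actual Gaussian coefficients are not reproduced here; `alpha` is an illustrative smoothing parameter):

```python
def recursive_smooth(x, alpha, k=1):
    # K-iterated first-order recursive filter: one causal
    # (left-to-right) pass and one anti-causal (right-to-left)
    # pass per iteration. With suitably chosen coefficients this
    # structure approximates Gaussian convolution in O(n) per pass,
    # independent of the Gaussian's width.
    y = list(x)
    for _ in range(k):
        for i in range(1, len(y)):            # causal pass
            y[i] = alpha * y[i] + (1 - alpha) * y[i - 1]
        for i in range(len(y) - 2, -1, -1):   # anti-causal pass
            y[i] = alpha * y[i] + (1 - alpha) * y[i + 1]
    return y
```

The recurrence makes each output depend on the previous one, which is what forces the overlapping domain decomposition when parallelizing across blocks of the signal.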
ISBN (print): 9781728146188
The striped variation of the Smith-Waterman algorithm is known to be extremely efficient and easily adaptable to SIMD architectures. However, the potential for improvement has not been exhausted yet. The popular Lazy-F loop heuristic requires additional memory access operations, and the worst-case performance of the loop can be as bad as the non-vectorized version. We demonstrate a progression of Lazy-F loop transformations that improve the loop's performance and ultimately eliminate the loop completely. Our algorithm achieves the best asymptotic performance of all scan-based SW algorithms, O(n/p + log(p)), and is very efficient in practice.
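For readers unfamiliar with the underlying recurrence, here is the scalar Smith-Waterman local-alignment score that the striped SIMD variants vectorize; the striping layout and the Lazy-F correction themselves are beyond this short sketch, and the scoring parameters are illustrative:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    # Scalar Smith-Waterman: H[i][j] is the best local-alignment
    # score ending at a[i-1], b[j-1]; the max(0, ...) clamp is what
    # makes the alignment local.
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

The vertical dependency `H[i][j-1] + gap` within a row is the one that striping breaks and the Lazy-F loop then repairs, which is why eliminating that loop matters.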
ISBN (print): 9783319959726; 9783319959719
To realize a human-computer interface, the architecture specification should be based not only on the functional aspects of the cognitive processes but also on an emotional evaluation, such as the inferences gained from the model's language processing. In this paper, we discuss the use of cognitive architectures to solve the problems that arise from rigid AI-based models.