ISBN: (Print) 9781450340922
Deep neural networks (DNNs) have recently achieved extraordinary results in domains such as computer vision and speech recognition. An essential element of this success has been the introduction of high performance computing (HPC) techniques in the critical step of training the neural network. This paper describes the implementation and analysis of a network-agnostic and convergence-invariant coarse-grain parallelization of the DNN training algorithm. The coarse-grain parallelization is achieved by exploiting batch-level parallelism. This strategy does not depend on specialized or optimized libraries, so the optimization is immediately available for accelerating DNN training. The proposal is compatible with multi-GPU execution without altering the algorithm's convergence rate. The parallelization has been implemented in Caffe, a state-of-the-art DNN framework. The paper describes the code transformations required for the parallelization and identifies the limiting performance factors of the approach. We show competitive performance results for two state-of-the-art computer vision datasets, MNIST and CIFAR-10. In particular, on a 16-core Xeon E5-2667v2 at 3.30GHz we observe speedups of 8x over the sequential execution, at performance levels similar to those obtained by the GPU-optimized Caffe version on an NVIDIA K40 GPU.
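The batch-level strategy described above can be pictured as splitting each mini-batch across workers, computing per-shard gradients in parallel, and applying a single averaged update, which leaves the sequence of weight updates, and hence convergence, unchanged. Below is a minimal Python sketch of this idea, using a linear least-squares model as a hypothetical stand-in for the network's backpropagation; it illustrates the technique only and is not the actual Caffe code transformation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def grad_linear(w, X, y):
    # Mean-squared-error gradient for a linear model; a hypothetical
    # stand-in for one backpropagation pass over a batch shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def parallel_sgd_step(w, X, y, lr=0.1, n_workers=4):
    # Batch-level parallelism: each worker computes the gradient of its
    # shard of the mini-batch; shard gradients are averaged (weighted by
    # shard size) so the update equals the sequential full-batch update.
    X_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        grads = list(pool.map(grad_linear, [w] * n_workers, X_shards, y_shards))
    sizes = [len(s) for s in y_shards]
    g = np.average(grads, axis=0, weights=sizes)
    return w - lr * g

# Usage sketch:
# w = parallel_sgd_step(np.zeros(8), np.random.randn(256, 8), np.random.randn(256))
```

Because the weighted average of shard gradients equals the full-batch gradient exactly, the parallel step is numerically identical to the sequential one, which is what makes this kind of parallelization convergence-invariant.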
ISBN: (Print) 9781450339643
We are on the cusp of the emergence of a new wave of nonvolatile memory technologies that are projected to become the dominant type of main memory in the near future. A key property of these new memory technologies is their asymmetric read-write costs: writes can cost an order of magnitude or more than reads in energy and latency, and deliver lower (per-module) bandwidth. This high cost for writes motivates a rethinking of algorithm design towards "write-efficient" algorithms and data structures that reduce their number of writes [1, 2, 3, 4, 5, 6]. Many popular techniques for sequential, distributed, and parallel algorithms are tuned to the setting where reads and writes cost the same, and hence need to be revisited. Prior work on reducing writes to contended cache lines in shared-memory algorithms can be useful here, but with the new technologies even writes to uncontended memory are costly. Moreover, the new technologies are unlikely to replace the fastest cache memory, motivating the study of a multi-level memory hierarchy comprising smaller symmetric level(s) and a larger asymmetric level. Lower bounds, too, need to be revisited in light of asymmetric costs. This talk provides background on these emerging memory technologies, highlights the progress to date on these exciting research questions, and touches on a few of the many open problems.
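One concrete way to reason about write-efficient design is to charge each write a factor ω more than a read and compare algorithm variants under that cost model. The toy Python sketch below is an illustration, not material from the talk; the cell-based interface and ω = 10 are assumptions. It instruments a memory with asymmetric costs and contrasts a write-heavy running maximum against a write-efficient variant that keeps the candidate in a local "register" and writes once.

```python
class AsymmetricMemory:
    """Charges 1 per read and omega per write, mimicking projected NVM costs."""
    def __init__(self, size, omega=10):
        self.cells = [0] * size
        self.omega = omega
        self.cost = 0

    def read(self, i):
        self.cost += 1
        return self.cells[i]

    def write(self, i, v):
        self.cost += self.omega
        self.cells[i] = v

def max_write_heavy(mem, n):
    # Updates the output cell (index n) on every new maximum:
    # up to n costly writes in the worst case.
    mem.write(n, mem.read(0))
    for i in range(1, n):
        if mem.read(i) > mem.read(n):
            mem.write(n, mem.read(i))

def max_write_efficient(mem, n):
    # Tracks the maximum in a local register and performs a single write.
    best = mem.read(0)
    for i in range(1, n):
        best = max(best, mem.read(i))
    mem.write(n, best)
```

Comparing `mem.cost` after each variant on the same input makes the asymmetry visible: with ω = 10, the write-heavy version can be an order of magnitude more expensive on adversarial (increasing) inputs.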
ISBN: (Print) 9781450313087
Parallelism in linear algebra libraries is a common approach to accelerating numerical and scientific applications. Matrix-matrix multiplication is one of the most widely used computations in scientific and numerical algorithms. Although a number of matrix multiplication algorithms exist for distributed-memory environments (e.g., Cannon, Fox, PUMMA, SUMMA), matrix-matrix multiplication algorithms for shared-memory and SMP architectures have not been extensively studied. In this paper, we present a fast matrix-matrix multiplication algorithm for multi-core and SMP architectures using the MapReduce framework. Memory-resident linear algebra algorithms suffer performance losses on modern multi-core architectures because of the widening performance gap between the CPU and main memory. To allow such compute-intensive algorithms to exploit the full potential of a program's inherent instruction-level parallelism, the adverse effect of the processor-memory performance gap must be minimized. We present a cache-sensitive MapReduce matrix multiplication algorithm that fully exploits memory bandwidth and minimizes cache misses and conflicts. Our experimental results show that the two algorithms outperform existing matrix multiplication algorithms for shared-memory architectures, such as those provided by the Phoenix, PLASMA, and LAPACK libraries.
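To make the map/reduce decomposition concrete, here is a short Python sketch of a cache-blocked, MapReduce-style matrix multiply: the map phase emits partial products of cache-sized blocks keyed by output-block coordinates, and the reduce phase sums the partial products sharing a key. This is an illustrative sketch under assumed parameters (the block size of 64 and the use of NumPy with a thread pool are my choices), not the paper's implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def mapreduce_matmul(A, B, block=64, n_workers=4):
    """Blocked matrix multiply in MapReduce style: map emits per-block
    partial products (blocks sized to fit in cache); reduce sums all
    partial products that share an output-block key."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    tasks = [(i, j, l)
             for i in range(0, n, block)
             for j in range(0, m, block)
             for l in range(0, k, block)]

    def map_task(t):
        i, j, l = t
        # key = output-block coordinates, value = one partial product
        return (i, j), A[i:i+block, l:l+block] @ B[l:l+block, j:j+block]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        emitted = list(pool.map(map_task, tasks))

    # Reduce phase: accumulate partial products per output block.
    C = np.zeros((n, m))
    for (i, j), part in emitted:
        C[i:i+part.shape[0], j:j+part.shape[1]] += part
    return C
```

The blocking is what makes the algorithm cache-sensitive: each map task touches only three block-sized tiles, so the working set stays cache-resident regardless of the overall matrix size. (NumPy's BLAS-backed `@` releases the GIL, so the thread pool provides genuine parallelism here.)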
ISBN: (Print) 9781450312455
Mutual Exclusion is a fundamental problem in distributed computing, and the problem of proving upper and lower bounds on the RMR complexity of this problem has been extensively studied. Here, we give matching lower and upper bounds on how RMR complexity trades off with space. Two implications of our results are that constant RMR complexity is impossible with subpolynomial space and subpolynomial RMR complexity is impossible with constant space for cache-coherent multiprocessors, regardless of how strong the hardware synchronization operations are. To prove these results we show that the complexity of mutual exclusion, which can be "messy" to analyze because of system details such as asynchrony and cache coherence, is captured precisely by a simple and purely combinatorial game that we design. We then derive lower and upper bounds for this game, thereby obtaining corresponding bounds for mutual exclusion. The lower bounds for the game are proved using potential functions.
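For intuition on the trade-off, recall that queue locks sit at one extreme: by giving every process its own spin location, an Anderson-style array lock achieves O(1) RMRs per passage on a cache-coherent machine at the price of Θ(n) space, exactly the regime the lower bound says is unavoidable for constant RMR complexity. The Python sketch below illustrates that endpoint only and is not from the paper; fetch-and-increment is simulated with `itertools.count`, and CPython's GIL stands in for hardware atomicity.

```python
import itertools

class ArrayQueueLock:
    """Anderson-style array lock: O(1) RMRs per acquire on a cache-coherent
    machine, using O(n) space -- one endpoint of the RMR/space trade-off.
    Supports at most n concurrent processes."""
    def __init__(self, n):
        self.n = n
        self.flags = [False] * n
        self.flags[0] = True             # slot 0 starts as the lock holder
        self.ticket = itertools.count()  # simulates atomic fetch-and-increment

    def acquire(self):
        slot = next(self.ticket) % self.n
        while not self.flags[slot]:      # spin on a slot no other waiter reads:
            pass                         # only O(1) remote memory references
        return slot

    def release(self, slot):
        self.flags[slot] = False
        self.flags[(slot + 1) % self.n] = True  # hand off to the next slot
```

A real implementation pads each flag to its own cache line to avoid false sharing; the paper's result shows that shrinking the flag array to subpolynomial size necessarily destroys this constant-RMR behavior.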
This paper presents two approaches to parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors, and each processor is responsible for perfusing its assigned parts of tissue into all vascular trees. Communication between processors is accomplished by message passing, so this algorithm is well suited to distributed-memory architectures. The second approach is designed for shared-memory machines. It parallelizes the perfusion process itself: individual processing units perform calculations concerning different vascular trees. Experiments performed on a computing cluster and on multicore machines show that both algorithms provide a significant speedup.
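A minimal sketch of the second, shared-memory approach might look like the following, where each worker perfuses a different vascular tree for the same newly added tissue part. The tree interface (`connect`) is hypothetical, standing in for the domain-specific bifurcation and radii calculations the paper performs.

```python
from concurrent.futures import ThreadPoolExecutor

def perfuse_tree(tree, new_part):
    # Hypothetical per-tree perfusion step: attach the new tissue part
    # to this tree (bifurcation placement, radii rebalancing, etc.).
    tree.connect(new_part)
    return tree

def parallel_perfusion(trees, new_part, n_workers=4):
    """Shared-memory variant: the vascular trees are independent, so
    workers can perfuse different trees for the same tissue part
    concurrently."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda t: perfuse_tree(t, new_part), trees))
```

The decomposition works because each tree's calculations touch only that tree's state, so no locking is needed across workers within one perfusion round.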
Despite the large number of Byzantine fault-tolerant algorithms for message-passing systems designed over the years, algorithms for coordinating processes subject to Byzantine failures using shared memory have appeared only recently. This paper presents a new computing model in which shared-memory objects are protected by fine-grained access policies, and a new shared-memory object, the Policy-Enforced Augmented Tuple Space (PEATS). We show the benefits of this model by providing simple and efficient consensus algorithms. These algorithms are much simpler, require fewer shared-memory operations, and use fewer memory bits than previous algorithms based on access control lists (ACLs) and sticky bits. We also prove that PEATS objects are universal, i.e., that they can be used to implement any other shared-memory object, and present lock-free and wait-free universal constructions.
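To illustrate how policy-protected tuple spaces can simplify consensus, here is a toy Python sketch; it is my construction, not the paper's algorithm. The access policy admits only the first 'DECIDE' tuple, so whichever proposer writes first fixes the decision and every process reads the same value. The operation names (`out`, `rdp`) follow classic tuple-space conventions, and the policy signature is an assumption; the real model also enforces policies against Byzantine processes, which this sketch does not capture.

```python
import threading

class PEATS:
    # Toy sketch of a Policy-Enforced Augmented Tuple Space: every
    # operation is checked against a fine-grained access policy
    # before it touches the space.
    def __init__(self, policy):
        self.tuples = []
        self.policy = policy          # policy(op, proc, tup, space) -> bool
        self.lock = threading.Lock()  # stands in for the object's atomicity

    def out(self, proc, tup):         # write a tuple, if the policy allows it
        with self.lock:
            if self.policy('out', proc, tup, self.tuples):
                self.tuples.append(tup)
                return True
            return False

    def rdp(self, proc, template):    # non-blocking read of a matching tuple
        with self.lock:
            for t in self.tuples:
                if t[0] == template[0] and self.policy('rd', proc, t, self.tuples):
                    return t
            return None

# Policy admitting only the first DECIDE tuple: later writers are rejected.
def single_decide_policy(op, proc, tup, space):
    if op == 'out' and tup[0] == 'DECIDE':
        return not any(t[0] == 'DECIDE' for t in space)
    return True

space = PEATS(single_decide_policy)

def propose(proc_id, value):
    space.out(proc_id, ('DECIDE', value))           # only the first out succeeds
    return space.rdp(proc_id, ('DECIDE', None))[1]  # all readers see one decision
```

Compared with ACL- or sticky-bit-based constructions, the policy check replaces per-process bookkeeping with a single predicate on the space's contents, which is what makes the consensus logic so short.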