An important inherent limitation of dynamic multiphase reactor flow simulations is their computational time requirement, which makes long-time statistics intractable. A parallel CFD model has therefore been developed, intended for the simulation of multiphase reactors. The present version of the model simulates 2D reactive flows in a fixed-bed reactor. The simulations are performed on two grids of different resolutions. The predicted profiles are in accordance with results reported in the literature. Parallelization and performance optimization of the model have been performed to reduce the computational time. Further reductions have been achieved by applying compiler optimization. The most expensive part of the numerical solution algorithm is the implicit solution of the Poisson equation for the pressure. To solve the Poisson equation, a TDMA algorithm with and without a global block correction procedure, several variations of the conjugate gradient algorithm, and a bi-orthogonal conjugate gradient algorithm were tested. The optimization work has shown that, compared to the serial non-optimized version of the code, the computational time spent solving the model is reduced by more than an order of magnitude when an optimized algorithm is combined with optimal compiler options. Further reductions in computational time have been achieved by parallelizing the program. With this type of performance optimization, multiphase reactive flow systems in chemical reactors are expected to be simulated within feasible time limits. (C) 2004 Elsevier Ltd. All rights reserved.
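As a minimal illustration of the TDMA the abstract names (the Thomas algorithm for tridiagonal systems, the kernel inside line-by-line Poisson solvers), the following Python sketch solves a 1D discretized Poisson problem. The problem setup and parameter values are illustrative, not taken from the paper:

```python
import numpy as np

def tdma(a, b, c, d):
    """Thomas algorithm: solve the tridiagonal system A x = d, where A has
    sub-diagonal a, diagonal b and super-diagonal c (a[0], c[-1] unused)."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 1D model Poisson problem -p'' = f with Dirichlet boundaries,
# discretized on n interior points with spacing h.
n = 50
h = 1.0 / (n + 1)
a = np.full(n, -1.0 / h**2)
b = np.full(n, 2.0 / h**2)
c = np.full(n, -1.0 / h**2)
p = tdma(a, b, c, np.ones(n))
```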
Recently, Deep Neural Networks (DNNs) have recorded significant success in handling medical and other complex classification tasks. However, as the sizes of DNN models and the available datasets increase, the training process becomes more complex and computationally intensive, usually taking longer to complete. In this work, we have proposed a generic, full end-to-end hybrid parallelization approach combining model and data parallelism for efficient, distributed and scalable training of DNN models. We have also proposed a Genetic Algorithm Based Heuristic Resources Allocation (GABRA) mechanism for optimal distribution of partitions over the available GPUs to optimize computing performance. We have applied our proposed approach to a real use case based on the 3D Residual Attention Deep Neural Network (3D-ResAttNet) for efficient Alzheimer's Disease (AD) diagnosis on multiple GPUs, and compared it with existing state-of-the-art parallel methods. The experimental evaluation shows that our proposed approach is on average 20% better than existing parallel methods in terms of training time, and achieves almost linear speedup with little or no difference in accuracy compared with existing non-parallel DNN models.
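The abstract does not spell out GABRA's operators, but a genetic algorithm for partition-to-GPU allocation can be sketched as follows: chromosomes map model partitions to GPUs, and fitness is the synchronous step time, i.e., the load of the slowest GPU. All costs and parameters below are hypothetical:

```python
import random

# Hypothetical per-partition costs (e.g., FLOPs) and relative GPU speeds;
# the real GABRA objective and operators are not given in the abstract.
partition_cost = [4.0, 2.5, 3.0, 1.5, 2.0, 3.5]
gpu_speed = [1.0, 1.0, 0.8]

def makespan(assign):
    # The slowest GPU dominates a synchronous training step.
    load = [0.0] * len(gpu_speed)
    for part, gpu in enumerate(assign):
        load[gpu] += partition_cost[part] / gpu_speed[gpu]
    return max(load)

def evolve(pop_size=40, generations=200, mut_rate=0.1):
    n, g = len(partition_cost), len(gpu_speed)
    pop = [[random.randrange(g) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)              # fitness = low makespan
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = random.sample(survivors, 2)
            cut = random.randrange(1, n)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(n):              # random mutation
                if random.random() < mut_rate:
                    child[i] = random.randrange(g)
            children.append(child)
        pop = survivors + children
    return min(pop, key=makespan)

best = evolve()
print(best, makespan(best))
```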
ISBN:
(print) 9783031160929; 9783031160912
Modern neural networks require long training to reach decent performance on massive datasets. One common approach to speeding up training is model parallelization, where large neural networks are split across multiple devices. However, different device placements of the same neural network lead to different training times. Most existing device placement solutions treat the problem as sequential decision-making, traversing neural network graphs and assigning their neurons to different devices. This work studies the impact of neural network graph traversal orders on device placement. In particular, we empirically study how different graph traversal orders of neural networks lead to different device placements, which in turn affect the training time of the neural network. Our experimental results show that the best graph traversal order depends on the type of neural network and the features of its computation graph. In this work, we also provide recommendations on choosing effective graph traversal orders in device placement for various neural network families to improve the training time in model parallelization.
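To make the studied effect concrete, this hedged sketch places a toy computation graph on two devices with a greedy least-loaded heuristic, once per traversal order; BFS and DFS orders yield different placements even on this small DAG. The graph, costs, and placement rule are illustrative, not the paper's:

```python
from collections import deque

# Toy computation graph: node -> successors (a small diamond DAG),
# with a hypothetical compute cost per node.
graph = {"in": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["out"], "out": []}
cost = {"in": 1.0, "a": 3.0, "b": 3.0, "c": 2.0, "out": 1.0}

def bfs_order(g, root):
    order, seen, q = [], {root}, deque([root])
    while q:
        u = q.popleft()
        order.append(u)
        for v in g[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    return order

def dfs_order(g, root):
    order, seen = [], set()
    def visit(u):
        if u in seen:
            return
        seen.add(u)
        order.append(u)
        for v in g[u]:
            visit(v)
    visit(root)
    return order

def greedy_place(order, n_devices=2):
    # Assign each node, in traversal order, to the least-loaded device.
    load = [0.0] * n_devices
    placement = {}
    for node in order:
        d = min(range(n_devices), key=lambda i: load[i])
        placement[node] = d
        load[d] += cost[node]
    return placement

print(greedy_place(bfs_order(graph, "in")))  # e.g. b lands on device 0
print(greedy_place(dfs_order(graph, "in")))  # e.g. b lands on device 1
```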
ISBN:
(print) 9781479927289
Seismic wave models provide a way to study the consequences of future earthquakes. When modeling a restricted region, these models require a boundary condition to absorb the energy that leaves the simulated domain. To parallelize these models, the domain is decomposed into a grid of smaller subdomains which are mapped to different tasks. Due to the boundary condition, this division gives rise to load imbalance between the tasks that simulate border regions and those assigned center subdomains. To deal with this imbalance, and thereby improve the simulation's performance, we propose the use of dynamic load balancing. To evaluate our solution, we ported a seismic wave simulator to Adaptive MPI to benefit from its load balancing framework. By using dynamic load balancers, we improved the performance of the application by 23.85% compared to the original MPI implementation. We also show that load balancers are able to adapt to the variation of load imbalance during the application's execution.
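The paper relies on Adaptive MPI's migration-based balancers; as a stand-in for the underlying idea, the sketch below applies a greedy longest-processing-time rebalancing over measured subdomain costs, where border subdomains carry the extra absorbing-boundary work. All numbers and names are illustrative:

```python
# Hypothetical measured cost per subdomain for one timestep: border
# subdomains pay extra for the absorbing boundary condition.
costs = {"sd0": 1.4, "sd1": 1.0, "sd2": 1.0, "sd3": 1.4,
         "sd4": 1.4, "sd5": 1.0, "sd6": 1.0, "sd7": 1.4}

def rebalance(costs, n_tasks):
    """Greedy longest-processing-time heuristic: repeatedly assign the
    heaviest remaining subdomain to the currently least-loaded task."""
    tasks = [{"load": 0.0, "subdomains": []} for _ in range(n_tasks)]
    for name, c in sorted(costs.items(), key=lambda kv: -kv[1]):
        t = min(tasks, key=lambda t: t["load"])
        t["subdomains"].append(name)
        t["load"] += c
    return tasks

for t in rebalance(costs, 4):
    print(t["subdomains"], round(t["load"], 2))
```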
ISBN:
(print) 9781450367356
As the number of layers and the amount of training data increase, the trend is to train deep neural networks in parallel across devices. In such scenarios, neural network training is increasingly bottlenecked by the high memory requirements posed by intermediate results, or feature maps, that are produced during the forward pass and consumed during the backward pass. We recognize that the best-performing device parallelization configurations should consider memory usage in addition to the canonical metric of computation time. Towards this end, we introduce MemFlow, an optimization framework for distributed deep learning that performs joint optimization over memory usage and computation time when searching for a parallelization strategy. MemFlow consists of: (i) a task graph with memory usage estimates; (ii) a memory-aware execution simulator; and (iii) a Markov Chain Monte Carlo search algorithm that considers various degrees of recomputation, i.e., discarding feature maps during the forward pass and recomputing them during the backward pass. Our experiments demonstrate that under memory constraints, MemFlow can readily locate valid and superior parallelization strategies unattainable with previous frameworks.
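A minimal sketch of the MCMC search the abstract describes, assuming a stand-in cost simulator: each layer gets a device and a recompute flag, and a Metropolis chain trades computation time against a memory budget. The simulator formulas are placeholders for MemFlow's task-graph estimates, not its actual model:

```python
import math
import random

# Hypothetical per-layer choices: device assignment and whether to
# recompute the layer's feature map during the backward pass.
N_LAYERS, N_DEVICES = 8, 4

def simulate(strategy):
    """Stand-in for a memory-aware execution simulator: returns an
    estimated (time, peak_memory) pair for a strategy."""
    time = mem = 0.0
    for dev, recompute in strategy:
        time += 1.0 + (1.0 if recompute else 0.0)  # recompute costs time
        mem += 0.5 if recompute else 2.0           # ...but saves memory
    time += len({d for d, _ in strategy})          # crude comm penalty
    return time, mem

def mcmc_search(steps=5000, mem_budget=10.0, temp=0.5):
    state = [(random.randrange(N_DEVICES), False) for _ in range(N_LAYERS)]
    cur_t, cur_m = simulate(state)
    best, best_t = None, float("inf")
    for _ in range(steps):
        cand = list(state)
        i = random.randrange(N_LAYERS)            # mutate one layer
        cand[i] = (random.randrange(N_DEVICES), random.random() < 0.5)
        t, m = simulate(cand)
        # Penalize memory-budget violations instead of rejecting outright,
        # so the chain can cross infeasible regions of the search space.
        cost_cur = cur_t + max(0.0, cur_m - mem_budget) * 100
        cost_new = t + max(0.0, m - mem_budget) * 100
        if (cost_new < cost_cur
                or random.random() < math.exp((cost_cur - cost_new) / temp)):
            state, cur_t, cur_m = cand, t, m
        if cur_m <= mem_budget and cur_t < best_t:
            best, best_t = list(state), cur_t
    return best, best_t

strategy, est_time = mcmc_search()
print(strategy, est_time)
```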
Deep Neural Networks (DNNs) have established themselves as a fundamental tool in numerous computational modeling applications, overcoming the challenge of defining use-case-specific feature extraction processing by incorporating this stage into unified end-to-end trainable models. Despite their modeling capabilities, training large-scale DNN models is a very computation-intensive task that most single machines are often incapable of accomplishing. To address this issue, different parallelization schemes have been proposed. Nevertheless, network overheads as well as optimal resource allocation pose major challenges, since network communication is generally slower than intra-machine communication, while some layers are more computationally expensive than others. In this work, we consider a novel multimodal DNN based on the Convolutional Neural Network architecture and explore several ways to optimize its performance when training is executed on an Apache Spark cluster. We evaluate the performance of different architectures via the metrics of network traffic and processing power, considering the case of land cover classification from remote sensing observations. Furthermore, we compare our architectures with an identical DNN architecture modeled after a data parallelization approach, using the metrics of classification accuracy and inference execution time. The experiments show that the way a model is parallelized has a tremendous effect on resource allocation, and that hyperparameter tuning can reduce network overheads. Experimental results also demonstrate that the proposed model parallelization schemes achieve more efficient resource use and more accurate predictions compared to data parallelization approaches.
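The network-traffic trade-off the abstract evaluates can be approximated with back-of-the-envelope arithmetic: data parallelism all-reduces the full gradient every step, while model parallelism ships only the activations crossing each split point. The sizes below are assumed for illustration:

```python
# Rough per-step network traffic comparison; all numbers are illustrative.
PARAMS = 25_000_000        # model parameters
BATCH = 64                 # samples per step
ACTIVATION_DIM = 4096      # width of the tensor crossing a model split
N_WORKERS = 4
BYTES = 4                  # float32

# Data parallelism: a ring all-reduce of the full gradient moves about
# 2 * (N - 1) / N * params per worker per step.
data_parallel = 2 * (N_WORKERS - 1) / N_WORKERS * PARAMS * BYTES

# Model parallelism: each step ships activations (forward) and their
# gradients (backward) across each of the N - 1 split points.
model_parallel = 2 * (N_WORKERS - 1) * BATCH * ACTIVATION_DIM * BYTES

print(f"data-parallel traffic per worker/step: {data_parallel / 1e6:.1f} MB")
print(f"model-parallel traffic per step:       {model_parallel / 1e6:.1f} MB")
```

Under these assumptions the gradient all-reduce moves roughly 150 MB per worker per step, versus about 6 MB of activations for the model-parallel splits, which is consistent with the abstract's observation that how a model is parallelized strongly affects network overheads.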