Parallelism and optimization are two disciplines that are used together in numerous applications. Solving complex optimization problems often means facing complex search landscapes, which require time-consuming opera...
ISBN (print): 9781665497473
Cumulative performance profiling is a fast and lightweight method for gaining summary information about where and how communication time in parallel MPI applications is spent. MPI provides mechanisms for implementing such profilers that can be used transparently with applications. Existing profilers typically profile on a per-process basis and record the frequency, total time, and volume of MPI operations per process. This can lead to grossly misleading cumulative information for applications that use MPI features for partitioning the processes into different communicators. We present a novel MPI profiler, mpisee, for communicator-centric profiling that separates and records collective and point-to-point communication information per communicator in the application. We discuss the implementation of mpisee, which makes significant use of the MPI attribute mechanism. We evaluate our tool by measuring its overhead and profiling a number of standard applications. Our measurements with thirteen MPI applications show that the overhead of mpisee is less than 3%. Moreover, using mpisee, we investigate two MPI applications, SPLATT and GROMACS, in detail to obtain information on the various MPI operations for the different communicators of these applications. Such information is not available from other, state-of-the-art profilers. We use the communicator-centric information to improve the performance of SPLATT, resulting in a significant runtime decrease when run with 1024 processes.
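The per-communicator bookkeeping the abstract describes can be illustrated with a small sketch (a toy Python stand-in, not mpisee's actual implementation; the communicator and operation names below are made up):

```python
from collections import defaultdict

class CommProfiler:
    """Toy communicator-centric bookkeeping: every operation is charged
    to the communicator it ran on, so per-communicator breakdowns stay
    accurate even when the application splits MPI_COMM_WORLD."""
    def __init__(self):
        # (communicator name, operation) -> [calls, total seconds, total bytes]
        self.stats = defaultdict(lambda: [0, 0.0, 0])

    def record(self, comm, op, seconds, nbytes):
        entry = self.stats[(comm, op)]
        entry[0] += 1
        entry[1] += seconds
        entry[2] += nbytes

    def report(self, comm):
        # summary for one communicator only
        return {op: tuple(v) for (c, op), v in self.stats.items() if c == comm}

prof = CommProfiler()
prof.record("MPI_COMM_WORLD", "Allreduce", 0.004, 1024)
prof.record("row_comm", "Bcast", 0.001, 256)  # same op, sub-communicator
prof.record("row_comm", "Bcast", 0.002, 256)
```

A process-centric profiler would fold the two communicators together; keying the table on the communicator is what keeps the breakdown separate.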
ISBN (print): 9781538637906
In recent years, the global data center industry has entered a large-scale planning and construction stage, bringing huge power consumption and high operating costs. In this paper, a multi-tenant resource allocation optimization model for geo-distributed data centers is proposed to reduce the energy consumption and operating costs of cloud providers. First, the resource allocation model is constructed by considering the heterogeneous electricity prices of geo-distributed data centers. Then an immune-based algorithm is proposed to solve the allocation optimization problem under variable user demand. The biological immune system is a highly parallel, distributed, self-adaptive, and self-organizing system with a strong ability to learn, recognize, and memorize. It searches for the optimal resource allocation scheme with low computational complexity. Extensive simulations driven by the large-scale Parallel Workloads Archive demonstrate the feasibility and performance of our immune-based optimization algorithm.
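The abstract does not spell out the immune algorithm; the following is a hedged toy sketch of one common variant, clonal selection, applied to a made-up price/demand instance (the cost model and every constant here are illustrative assumptions, not the paper's model):

```python
import random

def clonal_selection(prices, demand, pop_size=30, generations=200, seed=0):
    """Toy clonal-selection loop: candidate allocations are cloned,
    mutated, and re-selected by fitness (here, total electricity cost
    of splitting `demand` across sites with heterogeneous `prices`)."""
    rng = random.Random(seed)
    n = len(prices)

    def normalise(x):                       # keep total allocation == demand
        s = sum(x)
        return [demand * v / s for v in x]

    def cost(x):                            # fitness: cheaper is better
        return sum(p * v for p, v in zip(prices, x))

    pop = [normalise([rng.random() for _ in range(n)]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        clones = []
        for ind in pop[: pop_size // 3]:    # clone the fittest third
            for _ in range(3):              # mutate each clone slightly
                mutant = [max(1e-9, v + rng.gauss(0, 0.1 * demand / n))
                          for v in ind]
                clones.append(normalise(mutant))
        pop = sorted(pop + clones, key=cost)[:pop_size]
    best = pop[0]
    return best, cost(best)

# hypothetical instance: three sites with different electricity prices
alloc, c = clonal_selection(prices=[0.12, 0.08, 0.15], demand=100.0)
```

The search should shift most of the load toward the cheapest site (price 0.08), whose all-in cost of 8.0 is the lower bound for this instance.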
ISBN (print): 9781509021406
Tightly-coupled parallel applications in cloud systems may suffer significant performance degradation because of resource over-commitment. In this paper, we propose a dynamic approach based on adaptive control of the time slice for virtual clusters, in order to mitigate the performance degradation of parallel applications in the cloud while effectively avoiding negative impact on other, non-parallel applications. The key idea is to reduce the synchronization overhead inside and across virtual machines (VMs) in cloud systems by dynamically adjusting the time slices of VMs according to the spinlock latency at runtime. This design is motivated by our experimental finding that a VM's time slice is a key factor determining the synchronization overhead as well as parallel execution performance. We perform the evaluation on a real cluster environment deployed with Xen, using five well-known benchmarks with 10+ applications. Experiments show that our approach obtains a 1.5-10x performance gain for parallel applications over other state-of-the-art solutions (including the Credit scheduler of Xen and well-known methods such as Co-Scheduling and Balance Scheduling), with nearly no impact on the performance of non-parallel applications.
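The adaptive time-slice control could look roughly like the following toy controller (the control law and all constants are illustrative guesses, not the paper's actual policy):

```python
def adjust_time_slice(current_ms, spin_latency_us,
                      target_us=50.0, min_ms=1.0, max_ms=30.0):
    """Toy proportional control in the spirit of the idea: when observed
    spinlock latency exceeds a target, shorten the VM's time slice so a
    preempted lock holder is rescheduled sooner; when latency is low,
    grow the slice again to cut context-switch overhead."""
    error = spin_latency_us / target_us       # > 1 means too much spinning
    if error > 1:
        proposed = current_ms / error         # shrink proportionally
    else:
        proposed = current_ms * 1.1           # relax back toward the maximum
    return max(min_ms, min(max_ms, proposed))

slice_ms = 30.0
for latency in [40, 200, 400, 45]:            # simulated runtime measurements
    slice_ms = adjust_time_slice(slice_ms, latency)
```

After the two high-latency samples the slice collapses to the 1 ms floor, then recovers slowly once spinning subsides, which mirrors the paper's finding that the slice length governs synchronization overhead.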
ISBN (print): 9781665435741
The correctness and robustness of a neural network model are usually proportional to its depth and width. Currently, neural network models are becoming deeper and wider to cope with complex applications, which leads to high memory and compute capacity requirements in the training process. Multi-accelerator parallelism, which deploys multiple accelerators in parallel to train neural networks, is a promising answer to both challenges. Among the parallel schemes, pipeline parallelism has a great advantage in training speed, but its memory capacity requirements are relatively higher than those of other schemes. To address this challenge of pipeline parallelism, we propose a data transfer mechanism that effectively reduces the peak memory usage of the training process through real-time data transfers. In our experiments, we implement our design and apply it to PipeDream, a mature pipeline parallel scheme. The memory requirement of the training process is reduced by up to 48.5%, and the speed loss is kept within a reasonable range.
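Why transferring stashed activations off the accelerator lowers peak memory can be sketched abstractly (a toy model with invented sizes and a FIFO eviction policy; the paper's actual mechanism is not specified at this level of detail):

```python
class ActivationStore:
    """Toy model of the transfer idea: a fixed "device" budget for
    stashed activations; when a new stash would exceed it, the oldest
    stash is moved to "host" memory and fetched back on demand, as the
    backward pass of a pipeline stage needs it.  Sizes are abstract."""
    def __init__(self, device_budget):
        self.device_budget = device_budget
        self.device = {}   # micro-batch id -> activation size (insertion order)
        self.host = {}

    def stash(self, mb_id, size):
        # evict oldest stashes to host until the new one fits
        while self.device and sum(self.device.values()) + size > self.device_budget:
            oldest = next(iter(self.device))
            self.host[oldest] = self.device.pop(oldest)
        self.device[mb_id] = size

    def fetch(self, mb_id):
        # transfer back from host if it was evicted, then consume it
        if mb_id in self.host:
            return self.host.pop(mb_id)
        return self.device.pop(mb_id)

store = ActivationStore(device_budget=100)
for mb in range(4):
    store.stash(mb, 40)       # 4 stashes of 40 units cannot all stay on device
peak = sum(store.device.values())
```

Without the transfers, the device would have to hold all 160 units at once; with them, the device-side peak stays within the 100-unit budget (here, 80).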
ISBN (print): 9780769547374; 9781467320016
In this paper, we propose a new architecture for a parallel and distributed processing framework, "Jobcast", which enables data processing on a cloud-style KVS database. Nowadays, many KVS (Key-Value Store) systems exist that achieve high scalability of the data space across a huge number of computers. Some KVS implementations use a consistent-hashing algorithm to identify the backend data node that stores a key-value pair. Jobcast also uses consistent hashing as its distribution strategy and, as a KVS system, can store key-value pairs across a huge number of computers. Furthermore, Jobcast also distributes "jobs" to the data nodes for parallel and distributed processing. In this paper, we introduce the basic architecture of Jobcast and evaluate its data processing performance on a typical example.
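A minimal consistent-hash ring of the kind the abstract refers to (node names invented; real systems such as Jobcast add replication and failure handling on top):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes.  A system that
    ships jobs to data can use the same lookup to place both a key's
    value and the job that processes it on the same node."""
    def __init__(self, nodes, vnodes=100):
        # each node owns many points on the ring to even out the load
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring point clockwise of the key's hash owns the key
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # a job on this key would be sent here too
```

The defining property: removing a node only remaps the keys that node owned, so the rest of the cluster is undisturbed.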
ISBN (print): 9781424437511
Nowadays, common systems in the area of high-performance computing exhibit highly hierarchical architectures. As a result, achieving satisfactory application performance demands an adaptation of the respective parallel algorithm to such systems. This, in turn, requires knowledge about the actual hardware structure even at the application level. However, the prevalent Message Passing Interface (MPI) standard (at least in its current version 2.1) intentionally hides heterogeneity from the application programmer in order to assure portability. In this paper, we introduce the MPIXternal library, which tries to circumvent this obvious semantic gap within the current MPI standard. For this purpose, the library offers the programmer additional features that should help to adapt applications to today's hierarchical systems in a convenient and portable way.
ISBN (print): 9781728168760
There are relatively few studies of distributed GPU graph analytics systems in the literature, and they are limited in scope since they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of graph analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling of smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU graph analytics framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.
ISBN (print): 9780769552071
GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways; e.g., for transformations, filtering, aggregation, partitioning, or other "Big Data"-style processing. Yet experience indicates that it is difficult, and often error-prone, to write GPGPU programs that efficiently process data that does not fit in GPU memory, partly because of the intricacies of GPU hardware architecture and programming models, and partly because of the limited bandwidth available between GPUs and CPUs. In this paper, we propose BigKernel, a scheme that provides pseudo-virtual memory to GPU applications and is implemented using a 4-stage pipeline with automated prefetching to (i) optimize CPU-GPU communication and (ii) optimize GPU memory accesses. BigKernel simplifies the programming model by allowing programmers to write kernels using arbitrarily large data structures that can be partitioned into segments, where each segment is operated on independently; these kernels are transformed into BigKernel using straightforward compiler transformations. Our evaluation on six data-intensive benchmarks shows that BigKernel achieves an average speedup of 1.7 over state-of-the-art double-buffering techniques and an average speedup of 3.0 over corresponding multi-threaded CPU implementations.
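The compute/transfer overlap that BigKernel's 4-stage pipeline automates can be illustrated with a two-stage toy in Python (a worker thread stands in for the CPU-side prefetch stage; this is not BigKernel's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def process_in_chunks(fetch_chunk, compute, n_chunks):
    """Toy overlap of fetching and computing: while chunk i is being
    computed on, chunk i+1 is already being fetched by a worker thread.
    `fetch_chunk(i)` and `compute(chunk)` are caller-supplied."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)              # prime the pipeline
        for i in range(n_chunks):
            chunk = pending.result()                       # wait for chunk i
            if i + 1 < n_chunks:
                pending = pool.submit(fetch_chunk, i + 1)  # prefetch chunk i+1
            results.append(compute(chunk))                 # overlap with fetch
    return results

# made-up "large" data set split into 4 independent segments
data = [list(range(i, i + 4)) for i in range(0, 16, 4)]
out = process_in_chunks(lambda i: data[i], sum, len(data))
# out == [6, 22, 38, 54]
```

The segments being operated on independently is what makes this legal; BigKernel derives such a pipeline from an ordinary-looking kernel instead of requiring the programmer to write it by hand.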