ISBN: 9781538610428 (print)
Parallel and distributed processing is employed to accelerate training for many deep-learning applications with large models and inputs. Because it reduces synchronization and communication overhead by tolerating stale gradient updates, asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is widely used. Recent theoretical analyses show that ASGD converges with linear asymptotic speedup over SGD. These analyses often gloss over communication overhead and practical learning rates, both of which are critical to the performance of ASGD. After analyzing the communication performance and convergence behavior of ASGD using the Downpour algorithm as an example, we demonstrate the challenges ASGD faces in achieving good practical speedup over SGD. We propose a distributed, bulk-synchronous stochastic gradient descent algorithm that allows for sparse gradient aggregation from individual learners. The communication cost is amortized explicitly by a gradient aggregation interval, and global reductions are used instead of a parameter server for gradient aggregation. We prove its convergence and show that it has superior communication performance and convergence behavior over popular ASGD implementations such as Downpour and EAMSGD for deep-learning applications.
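The idea of amortizing communication over a gradient aggregation interval can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the algorithm from the paper: the toy objective, the function names (`noisy_grad`, `allreduce_mean`, `bulk_sync_sgd`), and the choice to average the accumulated gradients are all hypothetical, and the global reduction is simulated with a plain Python average rather than a real collective.

```python
import random


def noisy_grad(w, target, rng, noise):
    # Noisy gradient of the toy objective 0.5 * (w - target) ** 2 (assumed objective).
    return (w - target) + rng.gauss(0.0, noise)


def allreduce_mean(values):
    # Stand-in for a global reduction across learners (no parameter server).
    return sum(values) / len(values)


def bulk_sync_sgd(num_learners=4, interval=5, rounds=40, lr=0.1,
                  target=3.0, noise=0.1, seed=0):
    rng = random.Random(seed)
    w_global = 0.0
    reductions = 0
    for _ in range(rounds):
        accumulated = []
        for _ in range(num_learners):
            # Each learner starts from the shared weights and works locally.
            w_local = w_global
            acc = 0.0
            for _ in range(interval):
                g = noisy_grad(w_local, target, rng, noise)
                w_local -= lr * g
                acc += g  # accumulate gradients between reductions
            accumulated.append(acc)
        # One global reduction per `interval` local steps.
        w_global -= lr * allreduce_mean(accumulated)
        reductions += 1
    return w_global, reductions


if __name__ == "__main__":
    w, n = bulk_sync_sgd()
    print(f"final weight {w:.3f} after {n} reductions")
```

With `interval=5`, each learner takes five local steps per reduction, so the number of communication rounds is one fifth of the number of gradient evaluations per learner.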
ISBN: 9781538637906 (print)
Cloud computing is one of the most popular technologies today because of its wide utility and the benefits it brings to IT companies all over the world. However, faced with increasing user requests for computing services, cloud providers are encouraged to deploy large data centers, which consume very large amounts of energy and contribute to high operational costs. Among the effects, the carbon dioxide emission rate grows each day due to the huge power consumption. Energy efficiency is an important issue in cloud computing, mainly because of the electrical power required to run these systems and to cool them. Energy consumption has therefore become a major concern for the widespread deployment of Cloud data centers. The growing importance of parallel applications in the Cloud introduces significant challenges in reducing the energy consumption of hosted servers. This paper addresses the problem of placing independent applications on the physical servers (hosts) of a Cloud infrastructure. We propose a novel heuristic that allocates applications so that total energy consumption is reduced. Our proposal respects various constraints, e.g., machine availability, machine capability, and the duplication of applications. Experiments are presented to validate the potential of our approach.
ISBN: 1932415262 (print)
Task migration and load sharing algorithms are two load balancing strategies that are essential in distributed-memory multiprocessors as well as in multi-computer environments. Dynamic load balancing is more suitable in heterogeneous systems. Various load sharing and global centralized algorithms have been proposed in the literature. These algorithms demand careful investigation of their suitability for different applications. In this research paper we focus on the performance evaluation of two algorithms implemented on the SPMD model, based on their controlling parameters. A network of workstations has been chosen and PVM libraries have been used for implementation. Matrix multiplication has been selected as the application. The two algorithms investigated are variable granularity (guided self scheduling) and a global centralized task migration algorithm.
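As a rough illustration of variable granularity, the textbook guided self-scheduling rule hands each requesting worker a chunk of ceil(remaining / P) iterations, so chunks start large and shrink toward the end. The sketch below assumes that standard rule; the function name `gss_chunks` is hypothetical, and the variant evaluated in the paper may differ in its controlling parameters.

```python
def gss_chunks(total_iterations, num_workers):
    """Return the successive chunk sizes handed out by guided self-scheduling.

    Assumes the standard rule: each request receives ceil(remaining / P)
    iterations, where P is the number of workers.
    """
    if total_iterations < 0 or num_workers <= 0:
        raise ValueError("need total_iterations >= 0 and num_workers > 0")
    remaining = total_iterations
    chunks = []
    while remaining > 0:
        # Integer ceiling division avoids floating-point rounding.
        size = (remaining + num_workers - 1) // num_workers
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    # For example, rows of a matrix product distributed over 4 workers.
    print(gss_chunks(100, 4))
```

Large early chunks keep scheduling overhead low, while the small final chunks help even out finishing times across workers of different speeds.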
ISBN: 1892512416 (print)
Redundant arrays of independent disks (RAID) have been widely used to provide mass storage with high performance and reliability. Among RAID architectures, RAID-1 and RAID-5 are the most popular. However, RAID-1 incurs excessive redundancy, and RAID-5 shows poor write performance. Recently, SMDA (Stripped Mirroring Disk Array) was proposed to overcome the small-write problem of disk arrays. SMDA stores the original data in two ways: one copy on a single disk and the other striped across a plurality of disks as in RAID-0 [2]. In this paper, we propose a new disk array architecture, called distributed Sparing-Stripped Mirroring Disk Array (ds-SMDA), that adds distributed on-line spares to SMDA. With ds-SMDA, we can increase the parallelism of small read and write operations in the normal state. ds-SMDA also reduces seek time during recovery. Moreover, it can recover from any double disk failure.
ISBN: 1601320841 (print)
The concept of global virtual time (GVT) has become an essential element of optimistic time management algorithms (TMA) that provide synchronization in a parallel and distributed computing environment. The performance of this optimistic TMA is optimal since it gives an accurate GVT approximation. However, this accurate GVT approximation comes at the expense of a slower execution rate, which results in high GVT latency. Since the resulting GVT latency is not only high but also widely varied, the multiple processors involved in communication remain idle during that period. This paper examines the potential use of tree and butterfly barriers with Mattern's optimistic TMA [1] using a ring structure. Our simulation and numerical results verify that the use of tree barriers with Mattern's GVT structure can significantly improve latency and thus increase the overall throughput of parallel and scalable distributed simulation systems.
ISBN: 9780769535449 (print)
The efficient scheduling of large mixed parallel applications is challenging. Most existing algorithms use scheduling heuristics and approximation algorithms to determine a good schedule as a basis for efficient execution in large-scale scientific computing. This paper concentrates on the scheduling of mixed parallel applications represented by task graphs with parallel tasks and precedence constraints between them. Layer-based scheduling algorithms for homogeneous target platforms are improved by adding a move-blocks phase that further reduces the resulting parallel runtime. The layer-based scheduling approach is described and the move-blocks algorithm is introduced in detail. The move-blocks extension provides better scheduling results for small as well as large problems, with only a small increase in runtime. This is shown by a comparison of the modified and the original algorithms over a wide range of test cases.
ISBN: 9780769552071 (print)
Emerging applications in areas such as bioinformatics, data analytics, semantic databases and knowledge discovery employ datasets from tens to hundreds of terabytes. Currently, only distributed memory clusters have enough aggregate space to enable in-memory processing of datasets of this size. However, in addition to their large size, the data structures used by these new application classes are usually characterized by unpredictable and fine-grained accesses, i.e., they exhibit irregular behavior. Traditional commodity clusters, by contrast, rely on cache-based processors and high-bandwidth networks optimized for locality, regular computation and bulk communication. For these reasons, irregular applications are inefficient on these systems and require custom, hand-coded optimizations to scale in both performance and size. Two techniques are key to speeding up large-scale irregular applications on commodity clusters: lightweight software multithreading, which tolerates data access latencies by overlapping network communication with computation, and aggregation, which reduces overheads and increases bandwidth utilization by coalescing fine-grained network messages. In this paper we describe GMT (Global Memory and Threading), a runtime system library that couples software multithreading and message aggregation with a Partitioned Global Address Space (PGAS) data model to enable higher performance and scaling of irregular applications on multi-node systems. We present the architecture of the runtime, explaining how it is designed around these two critical techniques. We show that irregular applications written using our runtime can outperform, even by orders of magnitude, the corresponding applications written using other programming models that do not exploit these techniques.
ISBN: 9781538637906 (print)
With the development of information technology, real-time data stream processing (RTDSP) has become a popular research topic. The first step of RTDSP is collecting data, which requires a data collector to receive data from the source and send it to the sink. Apache Flume, a distributed and reliable framework used for this purpose, has some limitations and drawbacks in load balancing and storage. In this paper, we aim to improve performance and availability when collecting unstable real-time big data streams. We propose a new load balancing strategy based on free memory size, and a storage strategy that integrates the memory channel with the multi-file channel to reduce disk and network overhead. The experimental results show that availability and performance are improved under conditions of a poor network, high availability requirements, intense competition for memory resources, and large data size. Specifically, availability is higher than 99.999%, and performance is improved by 10%-50% under different conditions.
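One simple way to read "load balancing based on free memory size" is to route each incoming event to the collector that currently reports the most free memory. The sketch below assumes exactly that greedy rule; it does not use any Apache Flume API, and the names `pick_collector` and `dispatch` as well as the fixed per-event size are hypothetical. The strategy in the paper may weigh free memory differently.

```python
def pick_collector(free_memory_bytes):
    """Return the name of the collector with the most free memory."""
    if not free_memory_bytes:
        raise ValueError("no collectors available")
    return max(free_memory_bytes, key=free_memory_bytes.get)


def dispatch(events, free_memory_bytes, event_size):
    """Greedily assign events, charging event_size bytes to the chosen collector."""
    free = dict(free_memory_bytes)  # work on a copy, leave the caller's view intact
    assignment = {}
    for event in events:
        target = pick_collector(free)
        assignment[event] = target
        free[target] -= event_size  # assumed memory cost of buffering one event
    return assignment


if __name__ == "__main__":
    plan = dispatch(["e1", "e2", "e3"], {"a": 100, "b": 250}, event_size=100)
    print(plan)
```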
ISBN: 1892512416 (print)
Reliability analysis consists of taking the times to failure and modeling them with a statistical regression model. The Weibull distribution model is one example. This model yields estimates of the mean time to failure and the percentage of failures, which are important for evaluating the reliability of products, i.e., for improving their quality. A sequential simulator based on this model was developed. Depending on the input parameters, the simulation time becomes prohibitive. This motivated the creation of a distributed version of the simulator in order to decrease the simulation time. The goal of this paper is to introduce this preliminary version of the simulator and to show its speedup for 5000 replications (the number of times the main loop runs), 300 iterations of the Bootstrap routine, and a maximum of 50 iterations (Newton-Raphson).
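For a two-parameter Weibull distribution with shape β and scale η, the mean time to failure has the closed form η·Γ(1 + 1/β), and the time by which a fraction p of units has failed is η·(−ln(1 − p))^(1/β). The sketch below evaluates only these standard formulas for given parameters; it is not the simulator from the paper, which estimates parameters with Bootstrap and Newton-Raphson iterations, and the function names are hypothetical.

```python
import math


def weibull_mttf(shape, scale):
    """Mean time to failure of a two-parameter Weibull: scale * Gamma(1 + 1/shape)."""
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return scale * math.gamma(1.0 + 1.0 / shape)


def weibull_percentile(p, shape, scale):
    """Time by which a fraction p (0 < p < 1) of units is expected to have failed."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


if __name__ == "__main__":
    # Illustrative parameter values only, not taken from the paper.
    print(weibull_mttf(shape=2.0, scale=100.0))
    print(weibull_percentile(0.10, shape=2.0, scale=100.0))
```

With shape 1 the Weibull reduces to the exponential distribution, so the mean time to failure equals the scale parameter, which is a convenient sanity check.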
ISBN: 9781538637906 (print)
Anomaly diagnosis for distributed services plays an important role in communication network information systems. Log analysis is the main method of anomaly detection. To reduce manual detection effort, we propose an anomaly detection method based on a time-weighted control flow graph model. The border is split by a discrete-degree strategy based on analysis of the time-interval distribution, and the time weight is selected using k-means. Experiments show that our algorithm has good precision and recall in anomaly diagnosis. In real-world scenarios, it achieves a precision of 80% and a recall of 65% on average.