ISBN (Print): 9781450397339
Distributed data-parallel training has been widely adopted for deep neural network (DNN) models. Although current deep learning (DL) frameworks scale well for dense models such as image classification models, we find that these DL frameworks have relatively low scalability for sparse models such as natural language processing (NLP) models with highly sparse embedding tables. Most existing works overlook the sparsity of model parameters and thus suffer from significant but unnecessary communication overhead. In this paper, we propose EmbRace, an efficient communication framework to accelerate communication in distributed training of sparse models. EmbRace introduces Sparsity-aware Hybrid Communication, which integrates AlltoAll and model parallelism into data-parallel training to reduce the communication overhead of highly sparse parameters. To effectively overlap sparse communication with both backward and forward computation, EmbRace further designs a 2D Communication Scheduling approach that optimizes the model computation procedure, relaxes the dependencies of embeddings, and schedules the sparse communication of each embedding row with a priority queue. We have implemented a prototype of EmbRace on top of PyTorch and Horovod and conducted comprehensive evaluations with four representative NLP models. Experimental results show that EmbRace achieves up to 2.41x speedup over state-of-the-art distributed training baselines.
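The priority-queue row scheduling mentioned in the abstract can be illustrated with a small sketch. This is not EmbRace's actual implementation; the row IDs and priority values are hypothetical, and Python's `heapq` stands in for whatever queue structure the framework uses:

```python
import heapq

def schedule_rows(row_priorities):
    """Emit embedding rows in priority order (lower value = more urgent),
    e.g. rows that the next forward pass needs earliest are sent first.
    row_priorities: dict mapping row id -> priority value (hypothetical)."""
    heap = [(prio, rid) for rid, prio in row_priorities.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, rid = heapq.heappop(heap)
        order.append(rid)
    return order
```

For example, `schedule_rows({"row_a": 2, "row_b": 0, "row_c": 1})` yields the rows in ascending priority order, so "row_b" is communicated first.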
Given an n x n binary image of white and black pixels, we present an optimal parallel algorithm for computing the distance transform and the nearest feature transform under the Euclidean metric. The algorithm employs systolic computation to achieve O(n) running time on a linear array of n processors.
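For reference, the two transforms can be stated concretely with a brute-force sequential sketch. This only illustrates what is computed, not the paper's O(n) systolic algorithm, and it assumes the image contains at least one black (feature) pixel:

```python
import math

def euclidean_transforms(image):
    """For each pixel, return (distance transform, nearest feature transform):
    the Euclidean distance to the nearest black pixel (value 1), and that
    pixel's coordinates. Brute force: O(n^4) for an n x n image."""
    n = len(image)
    features = [(i, j) for i in range(n) for j in range(n) if image[i][j] == 1]
    dt = [[0.0] * n for _ in range(n)]
    nft = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # nearest feature pixel by squared Euclidean distance
            best = min(features, key=lambda f: (f[0] - i) ** 2 + (f[1] - j) ** 2)
            nft[i][j] = best
            dt[i][j] = math.hypot(best[0] - i, best[1] - j)
    return dt, nft
```

The systolic version in the paper reaches the same result in O(n) time by streaming rows through a linear processor array.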
ISBN (Print): 9781665435741
Domain adaptation for semantic segmentation is of vital significance since it enables effective knowledge transfer from a labeled source domain (i.e., synthetic data) to an unlabeled target domain (i.e., real images), where no effort is devoted to annotating target samples. Prior domain adaptation methods are mainly based on image-to-image translation models that minimize differences in image conditions between the source and target domains. However, there is no guarantee that feature representations from different classes in the target domain are well separated, resulting in poor discriminative representations. In this paper, we propose a unified learning pipeline, called Image Translation and Representation Alignment (ITRA), for domain-adaptive segmentation. Specifically, it first aligns an image in the source domain with a reference image in the target domain using an image style transfer technique (e.g., CycleGAN), and then a novel pixel-centroid triplet loss is designed to explicitly minimize the intra-class feature variance and maximize the inter-class feature margin. Once the former step completes the style transfer, the latter becomes easier to learn and further reduces the domain shift. Extensive experiments demonstrate that the proposed pipeline facilitates both image translation and representation alignment and significantly outperforms previous methods in both the GTA5 -> Cityscapes and SYNTHIA -> Cityscapes scenarios.
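The pixel-centroid triplet loss described above can be sketched as follows. This is a simplified, hypothetical formulation: the margin value, the Euclidean distance choice, and the per-pixel loop are assumptions, and the paper's exact loss may differ:

```python
import numpy as np

def pixel_centroid_triplet_loss(features, labels, centroids, margin=1.0):
    """features: (N, D) pixel embeddings; labels: (N,) class ids;
    centroids: (C, D) per-class feature centroids.
    Pulls each pixel toward its own class centroid (shrinking intra-class
    variance) and pushes it away from the nearest other-class centroid
    (widening the inter-class margin)."""
    total = 0.0
    for f, c in zip(features, labels):
        d = np.linalg.norm(centroids - f, axis=1)  # distance to every centroid
        pos = d[c]                                 # own-class centroid (positive)
        neg = np.min(np.delete(d, c))              # closest other centroid (negative)
        total += max(0.0, pos - neg + margin)      # standard triplet hinge
    return total / len(features)
```

When every pixel already sits on its class centroid and the centroids are farther apart than the margin, the loss is zero, matching the intuition that the representation is already well separated.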
ISBN (Print): 9798350363074; 9798350363081
Several parallel and distributed data mining algorithms have been proposed in the literature to perform large-scale data analysis, overcoming the bottleneck of traditional single-machine methods. However, although the master-worker approach greatly simplifies the synchronization of all nodes, since only the master is in charge of it, it also presents several problems for large-scale data analysis tasks involving thousands or millions of nodes. This paper presents a hierarchical (or multi-level) master-worker framework for iterative parallel data analysis algorithms that overcomes the scalability issues affecting classic master-worker solutions. Specifically, the framework is composed of multiple merger and worker nodes organized in a k-ary tree, with workers at the leaves and mergers at the root and the internal nodes of the tree.
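The reduction pattern of the merger tree can be sketched minimally, assuming each merger simply combines the partial results of its at most k children; real message passing and the iterative broadcast back down the tree are omitted:

```python
def hierarchical_merge(partials, k, combine):
    """Reduce worker partial results up a k-ary merger tree.
    partials: per-worker partial results (the leaves).
    k: maximum children per merger node.
    combine: function merging a list of partials into one.
    Each while-iteration models one level of mergers; the last
    surviving value is the root's global result."""
    level = list(partials)
    while len(level) > 1:
        level = [combine(level[i:i + k]) for i in range(0, len(level), k)]
    return level[0]
```

With 9 workers and k=3, the first level produces 3 merger results and the root combines those, so no single node ever synchronizes more than k children.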
ISBN (Print): 9781467355636; 9781467355629
Segmentation algorithms are widely used in image processing. These methods have different complexity values, and the range of practical methods narrows on large images. On large medical images in particular, some methods may take days to perform segmentation. Parallel implementation, however, can mitigate this drawback to some extent. In this study, we propose to implement segmentation algorithms in parallel on a Graphics Processing Unit (GPU). With the proposed implementation, the computation times of the K-centers, K-means and DBSCAN algorithms were reduced by factors of 87, 642 and 2, respectively.
ISBN (Print): 0769516807
Different parallelization methods vary in their system requirements, programming styles, efficiency of exploiting parallelism, and the application characteristics they can handle. Different applications can exhibit totally different performance gains depending on the parallelization method used. This paper compares OpenMP, MPI, and Strings (a distributed shared memory system) for parallelizing a complicated tribology problem. The problem size and computing infrastructure are varied, and their impacts on the parallelization methods are studied. All of the methods studied exhibit good performance improvements, demonstrating the benefits of applying parallelization techniques to applications in this field.
Advancements in satellite imaging and sensor technologies result in the capture of large amounts of spatial data. Many parallel processing techniques based on data or control parallelism have been attempted during the past two decades to improve the performance of image processing applications such as urban sprawl analysis, weather prediction and crop estimation. These techniques have used block-based distributed file processing or the more modern MapReduce-based programming, both of which still fall short of optimal processing in terms of resource scheduling, data distribution and ease of programming. In this paper, we present a layered framework for parallel data processing to improve the storage, retrieval and processing performance of spatial data on an underlying distributed file system. The paper presents a data placement strategy across a distributed HDFS cluster designed to optimize spatial data retrieval and processing. Keeping neighborhood pixels local to the processing node in a distributed environment reduces network latencies and improves the efficiency of applications such as object recognition, change detection and site selection. We evaluate the data placement strategy on a four-node HDFS cluster and show that it delivers good performance benefits, reading blocks of data almost 10-12 times faster than the default placement, which improves the efficiency of the various applications that use region-growing methods.
ISBN (Print): 0769521983
Some datasets and computing environments are inherently distributed. For example, image data may be gathered and stored at different locations. Although data parallelism is a well-known computational model, there are few programming systems that are both easy to program (for simple applications) and able to work across administrative domains. We have designed and implemented a simple programming system, called Trellis-SDP, that facilitates the rapid development of data-intensive applications. Trellis-SDP is layered on top of the Trellis infrastructure, a software system for creating overlay metacomputers: user-level aggregations of computer systems. Trellis-SDP provides a master-worker programming framework in which the worker components can run self-contained, new or existing binary applications. We describe two interface functions, namely trellis_scano() and trellis_gather(), and show how easy it is to get reasonable performance with simple data-parallel applications, such as Content-Based Image Retrieval (CBIR) and Parallel Sorting by Regular Sampling (PSRS).
ISBN (Print): 9798350363074; 9798350363081
3D surface reconstruction is critical for various applications, demanding efficient computational approaches. Traditional Radial Basis Function (RBF) methods are limited by growing numbers of data points, leading to slower execution times. Addressing this, our study presents an experimental parallelization effort using Julia, a language well known for high-performance scientific computing. We developed an initial sequential RBF algorithm in Julia, then expanded it into a parallel model, exploiting multi-threading to enhance execution speed while maintaining accuracy. This initial exploration of Julia's parallel computing capabilities shows marked performance gains in 3D surface reconstruction, offering promising directions for future research. Our findings affirm Julia's potential in computationally intensive tasks, with test results confirming the expected time-efficiency improvements.
ISBN (Print): 9789898533388
Aiming at TB-scale time-varying scientific datasets, this paper presents a novel static load balancing scheme based on information entropy to enhance the efficiency of a parallel adaptive volume rendering algorithm. An information-theoretic model is first proposed; the information entropy is then calculated for each data patch as an estimate of its ray-sampling workload. According to these estimated workloads, the data patches are distributed evenly across the processing cores, which reduces load imbalance in parallel rendering. Compared with existing methods such as random assignment and ray estimation, the proposed entropy-based load balancing scheme achieves a rendering speedup of 1.23 to 2.84. Its speedup performance and view independence make it the best choice for interactive volume rendering.
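The estimate-then-distribute idea can be sketched minimally, under the assumptions that a per-patch value histogram is available and that a greedy least-loaded placement stands in for the paper's actual distribution step:

```python
import math

def patch_entropy(histogram):
    """Shannon entropy (bits) of a patch's value histogram, used as a
    proxy for its ray-sampling cost: uniform patches are cheap, patches
    with many distinct values are expensive."""
    total = sum(histogram)
    return -sum((c / total) * math.log2(c / total) for c in histogram if c)

def balance(costs, num_cores):
    """Greedy static assignment: visit patches in decreasing estimated
    cost, always giving the next patch to the least-loaded core.
    Returns (patch ids per core, total load per core)."""
    loads = [0.0] * num_cores
    assignment = [[] for _ in range(num_cores)]
    for pid, cost in sorted(enumerate(costs), key=lambda x: -x[1]):
        core = loads.index(min(loads))
        loads[core] += cost
        assignment[core].append(pid)
    return assignment, loads
```

Because the entropy depends only on the data, not on the camera, the resulting assignment is view-independent, which is the property the abstract highlights for interactive rendering.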