ISBN: 9781538655559 (print)
Barrier synchronization is a fundamental concurrency issue encountered in a large number of concurrent and parallel applications in which parallel processes cooperatively solve complex problems. Barrier synchronization essentially forces these parallel processes to wait until each one of them has reached a certain point in execution. It has therefore been considered an inevitable synchronization construct for most parallel applications, and barrier synchronization or one of its variants has become an integral component of the most widely used parallel programming models, such as OpenMP, MPI, MapReduce, and the bulk synchronous parallel (BSP) model. In [2], the authors state that the performance of the BSP model depends on four parameters: the number of nodes, the speed of each node, the communication cost, and the synchronization cost. Among these, the synchronization cost is considered critical in improving the performance of any BSP implementation [2]. From the perspective of parallel algorithm design and implementation, we consider the synchronization component to be the most important one, and its cost to be the most critical performance factor for almost all parallel programming models, including BSP. This paper presents a class of simple barrier synchronization algorithms for shared memory systems. It includes general, efficient, and universal algorithms that are appealing from both practical and theoretical points of view. The correctness of the algorithms is proved. The algorithms are briefly analyzed to expose their strengths and weaknesses.
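To make the idea of "waiting until each process has reached a certain point" concrete, here is a minimal sketch of a reusable counting barrier for shared-memory threads. It is a generic textbook construction, not one of the algorithms from the paper; the names `SimpleBarrier` and `run_phases` are hypothetical and chosen only for illustration.

```python
import threading


class SimpleBarrier:
    """Reusable counting barrier for a fixed number of threads (illustrative only)."""

    def __init__(self, parties):
        self.parties = parties
        self.waiting = 0      # threads that have arrived in the current phase
        self.generation = 0   # phase number, bumped each time the barrier opens
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            arrival_generation = self.generation
            self.waiting += 1
            if self.waiting == self.parties:
                # Last arrival: reset the count and release every waiter.
                self.waiting = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                # Earlier arrivals sleep until the phase number changes.
                while arrival_generation == self.generation:
                    self.cond.wait()


def run_phases(num_threads=4, num_phases=3):
    """Run workers through several phases and return the (phase, thread) log."""
    barrier = SimpleBarrier(num_threads)
    log = []
    log_lock = threading.Lock()

    def worker(tid):
        for phase in range(num_phases):
            with log_lock:
                log.append((phase, tid))
            barrier.wait()  # nobody starts phase + 1 until all have finished phase

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Because every thread records its phase before calling `wait()`, the log returned by `run_phases` never shows an entry for a later phase ahead of an entry for an earlier one. The generation counter is what makes the barrier safe to reuse across phases.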
ISBN: 9781538637906 (print)
MrBayes is a popular bioinformatics software package that is widely used in phylogenetic analysis. The core algorithm of MrBayes is Metropolis Coupled Markov Chain Monte Carlo (MC3). However, when dealing with large data sets, the MC3 algorithm is too slow to meet researchers' requirements. Although several parallelizations have been proposed for MrBayes, such as MPI (Message Passing Interface) based MrBayes and GPU (Graphics Processing Unit) based MrBayes, there is still no efficient parallel algorithm that fully utilizes the computing power of modern CPUs and computer architectures. This paper (a) presents a new three-level hybrid parallel algorithm, combining data-level parallelism (DLP), thread-level parallelism (TLP), and process-level parallelism (PLP), which can be used on most modern multi-core computers; and (b) compares the performance of different combinations of parallel strategies on real-world protein data sets. The experimental results show that this hybrid parallel algorithm does convert more computing power into higher speedup. Furthermore, the proposed algorithm's speedup is close to the speedup of one GPU on the same data sets. The algorithm is suitable for practical use in phylogenetic inference.
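For readers unfamiliar with MC3, the step that couples the chains is a state swap between two chains running at different temperatures. The sketch below shows only the standard Metropolis swap rule, assuming chain k samples from the posterior raised to the power `betas[k]`. It is not MrBayes code, and the function names are hypothetical.

```python
import math
import random


def swap_log_ratio(beta_i, beta_j, logp_i, logp_j):
    """Log acceptance ratio for exchanging the states of chains i and j.

    Chain k targets posterior**beta_k, so swapping the states changes the
    joint log density by (beta_i - beta_j) * (logp_j - logp_i).
    """
    return (beta_i - beta_j) * (logp_j - logp_i)


def attempt_swap(states, log_posts, betas, i, j, rng):
    """Swap states i and j in place with the Metropolis probability."""
    log_r = swap_log_ratio(betas[i], betas[j], log_posts[i], log_posts[j])
    if log_r >= 0 or rng.random() < math.exp(log_r):
        states[i], states[j] = states[j], states[i]
        log_posts[i], log_posts[j] = log_posts[j], log_posts[i]
        return True
    return False


# Example: the heated chain (beta = 0.5) has found a better tree than the
# cold chain (beta = 1.0), so the swap is always accepted.
states = ["tree_A", "tree_B"]          # placeholder states
log_posts = [-10.0, -5.0]              # unnormalized log posteriors
accepted = attempt_swap(states, log_posts, [1.0, 0.5], 0, 1, random.Random(0))
```

The swap needs only the two log-posterior values, so it is cheap compared with the likelihood evaluations inside each chain. Those evaluations are the part that data-, thread-, and process-level parallelism target.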
ISBN: 9781424437511 (print)
Automating the execution of applications in grid computing environments is a complicated task due to the heterogeneity of computing resources, resource usage policies, and application requirements. Applications differ in memory usage, performance, scalability, and storage usage. Knowledge of this information can aid in matching jobs to resources and in selecting appropriate configuration parameters, such as the number of processors to run on and the memory requirements for those resources. This paper presents an application memory usage model that can be used to aid in selecting appropriate job configurations for different resources. The model can be used to represent how memory scales with the number of processors, the memory usage of different types of processes, and changes in memory usage during execution. It builds on a previously developed information model used for describing resources, resource usage policies, and limited information on applications. An analysis of the memory usage model, illustrating its use towards automating job execution in grid computing environments, is also presented.
ISBN: 9781509036820 (print)
FFT has been a classic computation engine for numerous applications. The bandwidth-intensive nature of FFT has capped its performance on off-the-shelf parallel machines that are bandwidth-limited, and has forced application researchers to seek easier-to-speed-up alternatives to FFT, even when those alternatives are inferior to FFT. But what if effective support of FFT is feasible? Using FFT as an example, we examine the impact that adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster.
ISBN: 9780769546766 (print)
Analyzing parallel programs has become increasingly difficult due to the immense amount of information collected on large systems. In this scenario, cluster analysis has proved to be a useful technique for reducing the amount of data to analyze. A good example is the use of the density-based clustering algorithm DBSCAN to identify similar single program multiple data (SPMD) computing phases in message-passing applications. This structure detection simplifies the analyst's work, as all the available information is reduced to a small set of clusters. However, DBSCAN presents two major problems: it is very sensitive to its parametrization, and it is not capable of correctly detecting clusters when the data set has different densities across the data space. In this paper, we introduce Aggregative Cluster Refinement, an iterative algorithm that produces more accurate structure detections of SPMD phases than DBSCAN. In addition, it is able to detect clusters with different densities.
ISBN: 9781424437511 (print)
As the number of cores per machine increases, memory architectures are being redesigned to avoid bus contention and sustain higher throughput. The emergence of Non-Uniform Memory Access (NUMA) constraints has caused affinities between threads and buffers to become an important decision criterion for schedulers. Memory migration dynamically enables the joint distribution of work and data across the machine, but it requires high-performance data transfers as well as a convenient programming interface. We present improvements to the Linux migration primitives and the implementation of a Next-touch policy in the kernel to provide multithreaded applications with an easy way to dynamically maintain thread-data affinity. Microbenchmarks show that our work enables high-performance, synchronous, and lazy memory migration within multithreaded applications. A threaded LU factorization then reveals the large improvement that our Next-touch policy model may bring to applications with complex access patterns.
ISBN: 9781728168760 (print)
Container technologies are seeing wider use at advanced computing facilities for managing highly complex applications that must execute at multiple sites. However, in a distributed high-throughput computing setting, the unrestricted use of containers can result in the container explosion problem. If a new container image is generated for each variation of a job dispatched to a site, shared storage is soon exceeded. On the other hand, if a single large container image is used to meet multiple needs, the size of that container may become a problem for storage and transport. To address this problem, we observe that many containers have an internal structure generated by a structured package manager, and that this information could be used to strategically combine and share container images. We develop LANDLORD to exploit this property and evaluate its performance through a combination of simulation studies and empirical measurement of high energy physics applications.
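One way to picture "strategically combining and sharing" images is to treat each image as a set of packages and to merge a new request into an existing image when the two sets are similar enough. The sketch below is an assumption about how such a policy could look, not LANDLORD's actual algorithm. The Jaccard distance, the threshold `alpha`, the function `place_request`, and the package names are all hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; returns 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def place_request(images, request, alpha):
    """Return the index of the image that will serve `request`.

    `images` is a list of package sets and is modified in place:
    - reuse an image that already contains every requested package;
    - otherwise merge into the closest image if its distance is <= alpha;
    - otherwise create a new image for the request.
    """
    request = set(request)
    for idx, image in enumerate(images):
        if request <= image:
            return idx
    best_idx, best_dist = None, None
    for idx, image in enumerate(images):
        dist = jaccard_distance(image, request)
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_idx is not None and best_dist <= alpha:
        images[best_idx] |= request
        return best_idx
    images.append(request)
    return len(images) - 1


# Hypothetical package names, for illustration only.
images = []
first = place_request(images, {"root", "numpy"}, alpha=0.5)
second = place_request(images, {"root"}, alpha=0.5)
third = place_request(images, {"root", "numpy", "scipy"}, alpha=0.5)
fourth = place_request(images, {"geant4"}, alpha=0.5)
```

In this sketch, a small `alpha` keeps images specialized at the cost of storing many of them. A large `alpha` reduces the image count but grows each image. That is the storage trade-off the abstract describes.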
ISBN: 9781538637906 (print)
With the development of information technology, real-time data stream processing (RTDSP) has become a popular research topic. The first step of RTDSP is collecting data, which requires a data collector to receive data from the source and send it to the sink. Apache Flume, a distributed and reliable framework used for this purpose, has some limitations and drawbacks in load balancing and storage. In this paper, we aim to improve the performance and availability of collecting unstable real-time big data streams. We propose a new load balancing strategy based on free memory size, and a storage strategy that integrates the memory channel with the multi-file channel to reduce disk and network overhead. The experimental results show that availability and performance are improved under conditions of a poor network, high availability requirements, intense competition for memory resources, and large data size. Specifically, the availability is higher than 99.999%, and the performance can be improved by 10%-50% under different conditions.
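The two strategies in this abstract can be illustrated with a toy model: route each batch to the collector reporting the most free memory, and buffer events in memory until a capacity limit is reached, then spill to a slower overflow store. This is a guess at the general shape only and does not use the Flume API. `pick_collector` and `HybridChannel` are hypothetical names, and the overflow queue merely stands in for a file-backed channel.

```python
from collections import deque


def pick_collector(free_memory_by_node):
    """Choose the node with the largest reported free memory (in any unit)."""
    return max(free_memory_by_node, key=free_memory_by_node.get)


class HybridChannel:
    """FIFO buffer that prefers memory and spills to an overflow store."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = deque()    # fast in-memory buffer
        self.overflow = deque()  # stand-in for a file-backed channel

    def put(self, event):
        # Use memory only while nothing is waiting in overflow, so that
        # events in memory are always older than events in overflow.
        if not self.overflow and len(self.memory) < self.memory_capacity:
            self.memory.append(event)
        else:
            self.overflow.append(event)

    def take(self):
        if self.memory:
            return self.memory.popleft()
        if self.overflow:
            return self.overflow.popleft()
        return None


chosen = pick_collector({"node-a": 512, "node-b": 2048, "node-c": 1024})
channel = HybridChannel(memory_capacity=2)
for event in range(5):
    channel.put(event)
```

After the five `put` calls, two events sit in memory and three in overflow. Draining the channel with `take()` returns them in arrival order.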
ISBN: 9781538637906 (print)
Anomaly diagnosis for distributed services plays an important role in communication network information systems. Log analysis is the main method of anomaly detection. To reduce manual detection effort, we propose an anomaly detection method based on a time-weighted control flow graph model. The border is split by a discrete degree strategy based on analyzing the time interval distribution, and the time weight is selected using k-means. Experiments show that our algorithm achieves good precision and recall in anomaly diagnosis. In real-world scenarios, it has a precision of 80% and a recall of 65% on average.
ISBN: 9781538637906 (print)
In this paper, we present the clustering boundary cutting trie algorithm to address the large time consumption of existing trie-based algorithms. The proposed solution has two stages. The first stage is the density-based rule clustering process. The rules are represented as ranges between 0 and 1 according to the prefixes of the packet fields. When the number of rules in a range reaches a certain density, the corresponding rules are grouped into a cluster. The second stage is the trie construction process based on these clusters. Compared with traditional packet classification algorithms, the searching time of our algorithm increases by 47.05%-73.76% while keeping a high accuracy of 69.83%-93.17%. The experiments demonstrate that our algorithm can maintain high accuracy as well as stable high throughput, and that it is suitable for actual deployment.
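The mapping from a prefix to a range between 0 and 1 can be sketched by reading the prefix bits as a binary fraction: a prefix of length n covers an interval of width 1/2^n. The code below shows that mapping, followed by a deliberately simple density check that counts rules per fixed-width bucket. The bucket scheme, `num_buckets`, and `min_rules` are assumptions made for illustration and are not the paper's clustering procedure.

```python
from fractions import Fraction


def prefix_to_range(prefix_bits):
    """Map a bit-string prefix to the half-open interval [low, high) in [0, 1).

    For example, "10" covers every value whose binary expansion starts
    with 0.10..., which is the interval [1/2, 3/4).
    """
    length = len(prefix_bits)
    width = Fraction(1, 2 ** length)
    low = Fraction(int(prefix_bits, 2), 2 ** length) if prefix_bits else Fraction(0)
    return low, low + width


def dense_buckets(prefixes, num_buckets, min_rules):
    """Group prefixes by the bucket holding their lower bound; keep dense ones."""
    buckets = {}
    for prefix in prefixes:
        low, _ = prefix_to_range(prefix)
        index = int(low * num_buckets)  # low < 1, so index < num_buckets
        buckets.setdefault(index, []).append(prefix)
    return {i: rules for i, rules in buckets.items() if len(rules) >= min_rules}


clusters = dense_buckets(["100", "101", "1011", "00"], num_buckets=4, min_rules=2)
```

Exact fractions are used so that interval boundaries are compared without floating-point rounding. In this example only the bucket covering [1/2, 3/4) holds enough rules to form a cluster.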