ISBN: 9781424437511 (print)
Analyzing parallel programs has become increasingly difficult due to the immense amount of information collected on large systems. The use of clustering techniques has been proposed to analyze applications. However, while previous work focuses on identifying groups of processes with similar characteristics, we target a much finer granularity of application behavior. In this paper, we present a tool that automatically characterizes the different computation regions between communication primitives in message-passing applications. This study shows that some of the clustering algorithms that may be applicable at a coarse grain are no longer adequate at this level. Density-based clustering algorithms applied to the performance counters offered by modern processors are more appropriate in this context. The tool automatically generates accurate displays of the structure of the application as well as detailed reports on a broad range of metrics for each individual region detected.
ISBN: 0818684038 (print)
Vector prefix and reduction are collective communication primitives in which all processors must cooperate. We present two parallel algorithms, the direct algorithm and the split algorithm, for vector prefix and reduction computation on coarse-grained, distributed-memory parallel machines. Our algorithms are relatively architecture independent and can be used effectively in many applications such as Pack/Unpack, Array Prefix/Reduction Functions, and Array Combining Scatter Functions, which are defined in Fortran 90 and in High Performance Fortran. Experimental results on the CM-5 are presented.
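As a rough illustration of what these primitives compute, the sketch below simulates them sequentially. It assumes that each processor holds one equal-length vector and that the prefix result on processor i is the elementwise combination of the vectors held by processors 0 through i. It is not the direct or split algorithm from the abstract, and the function names are hypothetical.

```python
from itertools import accumulate
from operator import add


def vector_prefix(vectors, op=add):
    """Sequentially simulate a vector prefix.

    vectors[i] stands for the vector held by processor i (an assumption
    made for illustration). Entry i of the result is the elementwise
    combination of vectors[0..i] under the binary operator `op`.
    """
    def combine(a, b):
        return [op(x, y) for x, y in zip(a, b)]

    # Copy each vector so the result never aliases the caller's lists.
    return list(accumulate((list(v) for v in vectors), combine))


def vector_reduction(vectors, op=add):
    """Elementwise combination of all vectors (the last prefix entry)."""
    prefixes = vector_prefix(vectors, op)
    return prefixes[-1] if prefixes else []


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6]]  # three simulated processors
    print(vector_prefix(data))       # [[1, 2], [4, 6], [9, 12]]
    print(vector_reduction(data))    # [9, 12]
```

A real distributed implementation would exchange partial vectors between processors; this sketch only pins down the expected results.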
ISBN: 9781665435741 (print)
Data stream processing (DSP) applications, which generate real-time analytics on continuous data flows, have recently become prevalent. Task placement is an essential part of deploying DSP applications. Because determining the optimal task placement is an NP-hard problem, several efficient heuristics have been designed, and Deep Reinforcement Learning (DRL) has been used to train the scheduling agent. Current DRL-based approaches assume that all resources, including CPU, memory, and networking, are homogeneous. However, the available computation and network resources are heterogeneous in many scenarios. To address this, we devise a general DRL-based resource-aware framework, which models resources using graph embedding and an attention mechanism to predict the placement. Furthermore, to accelerate the training process and improve the throughput, we propose an efficient throughput estimation tool, which can estimate the throughput with high accuracy. We integrated our scheduling heuristic framework into Apache Flink and conducted comprehensive tests using multiple synthetic and real DSP applications. The experimental results show that our framework increases the throughput by 64%, 42%, and 29% on average compared with three state-of-the-art strategies, respectively.
ISBN: 9781538655559 (print)
The importance of high-performance graph processing to solve big data problems targeting high-impact applications is greater than ever before. Recent graph processing frameworks target different hardware platforms (e.g., shared memory systems, accelerators such as GPUs, and distributed systems) and differ with respect to the programming model they adopt (e.g., based on linear algebra formulations of graph algorithms or enabling direct access to the graph structure). To better understand the impact of these choices, this paper presents a comparative study of five state-of-the-art graph processing frameworks: two CPU-only frameworks (GraphMat and Galois), two GPU-based frameworks (Nvgraph and Gunrock), and Totem, a hybrid (CPU+GPU) framework. We use three popular graph algorithms (PageRank, Single Source Shortest Path, and Breadth-First Search) and massive-scale graphs with up to billions of edges. Our evaluation focuses on three performance metrics: (i) execution time, (ii) scalability, and (iii) energy consumption.
ISBN: 9781479986484 (print)
Graph coloring is used to identify subsets of independent tasks in parallel scientific computing applications. Traditional coloring heuristics aim to reduce the number of colors used as that number also corresponds to the number of parallel steps in the application. However, if the color classes produced have a skew in their sizes, utilization of hardware resources becomes inefficient, especially for the smaller color classes. Equitable coloring is a theoretical formulation of coloring that guarantees a perfect balance among color classes, and its practical relaxation is referred to as balanced coloring. In this paper, we revisit the problem of balanced coloring in the context of parallel computing. The goal is to achieve a balanced coloring of an input graph without increasing the number of colors that an algorithm oblivious to balance would have used. We propose and study multiple heuristics that aim to achieve such a balanced coloring, present parallelization approaches for multi-core and manycore architectures, and cross-evaluate their effectiveness with respect to the quality of balance achieved and performance. Furthermore, we study the impact of the proposed balanced coloring heuristics on a concrete application - viz. parallel community detection, which is an example of an irregular application. The thorough treatment of balanced coloring presented in this paper from algorithms to application is expected to serve as a valuable resource to parallel application developers who seek to improve parallel performance of their applications using coloring.
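To make the skew problem concrete, the sketch below contrasts a plain first-fit greedy coloring with a simple balance-aware variant that picks the least-used permissible color while staying within a given color budget whenever it can. Both functions and their names are illustrative assumptions; they are not the heuristics proposed in the paper, and the variant may still open an extra color on some inputs.

```python
def first_fit_coloring(adj):
    """Greedy coloring: give each vertex the smallest color unused by its neighbors."""
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color


def least_used_coloring(adj, num_colors):
    """Balance-aware greedy variant (illustrative, not the paper's method).

    Among the colors 0..num_colors-1 not used by a neighbor, pick the one
    with the smallest class so far; open a new color only if none fits.
    """
    color = {}
    sizes = [0] * num_colors
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        allowed = [c for c in range(len(sizes)) if c not in used]
        if allowed:
            c = min(allowed, key=lambda k: sizes[k])  # ties go to the lowest index
        else:
            c = len(sizes)
            sizes.append(0)
        color[v] = c
        sizes[c] += 1
    return color


def class_sizes(color):
    """Number of vertices in each color class, indexed by color."""
    sizes = [0] * (max(color.values()) + 1)
    for c in color.values():
        sizes[c] += 1
    return sizes


if __name__ == "__main__":
    # Path 0-1-2-3 plus two isolated vertices (adjacency lists).
    adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [], 5: []}
    ff = first_fit_coloring(adj)
    lu = least_used_coloring(adj, num_colors=len(class_sizes(ff)))
    print(class_sizes(ff))  # [4, 2]
    print(class_sizes(lu))  # [3, 3]
```

On this small graph both colorings use two colors, but first-fit crowds the isolated vertices into color 0, whereas the least-used rule spreads them evenly.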
Authors: Wang, Yan; Wang, Xin
Affiliation: Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai 200433, Peoples R China
ISBN: 9780769546766 (print)
Distributed storage systems (DSS) play an important role in data storage applications, since they provide high reliability for huge data storage requirements. Because node failures are frequent in a large distributed storage system, the performance of node-failure repair has attracted considerable research interest. In this paper, we propose a distributed storage code that minimizes the coding complexity of the repair process at the cost of larger redundancy. Our code construction is based on regular graphs and exploits simple look-up repair. We analyze the performance of the proposed code and compare it with existing distributed storage codes. Analytical results show that the proposed code outperforms the others in terms of repair complexity and disk I/O overhead.
ISBN: 9781665497992 (digital)
ISBN: 9781665497992 (print)
In recent years, high computational power has been required for computer platforms to support complex systems such as self-driving systems. Clustered many-core processors and directed acyclic graphs (DAGs), which can represent dependencies and parallelism of task processing, have attracted much attention as solutions to this problem. Previous studies on scheduling DAGs on multi-core processors have attempted to reduce the makespan (i.e., the time it takes for a task to complete) by increasing the number of processes that can be executed in parallel. However, in self-driving systems, such as those utilizing clustered many-core processors, it is impossible to sufficiently increase the utilization of processor cores due to high-load processing. In this paper, a scheduling method is proposed to improve the utilization of processor cores by executing high-load processes in parallel across multiple cores. The proposed method can reduce the makespan of DAGs performing high-load processing on clustered many-core processors.
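For readers unfamiliar with makespan on a DAG, the sketch below computes the critical-path length, which is the makespan a DAG would have if every node could start as soon as its predecessors finish (that is, with unlimited cores). It is only a baseline illustration, not the scheduling method proposed in the paper; the function name, node names, and durations are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_makespan(durations, preds):
    """Makespan of a DAG assuming unlimited cores (illustrative baseline).

    durations: dict mapping node -> execution time
    preds:     dict mapping node -> iterable of predecessor nodes
    """
    finish = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] for p in preds.get(node, ())), default=0)
        finish[node] = start + durations[node]
    return max(finish.values(), default=0)


if __name__ == "__main__":
    durations = {"a": 2, "b": 3, "c": 1, "d": 4}
    preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    print(critical_path_makespan(durations, preds))  # 9 (path a -> b -> d)
```

With a limited number of cores the achievable makespan can only be equal to or larger than this value, which is why splitting a single high-load node across several cores, as the abstract describes, is one way to shorten the critical path itself.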
ISBN: 9780769546766 (print)
OLAP (online analytical processing) applications are based on a variety of aggregate queries on large-scale data. Because aggregation is always performed on columns, traditional row-oriented storage, in which all the columns of a data row are stored together, seriously restricts aggregation performance. This paper proposes a dimension-oriented storage model based on HBase and a new parallel aggregation technique that accomplishes aggregation operations with parallel MapReduce jobs. Finally, in a comparison with Hive on the standard TPC-H data set, our technique is shown to significantly improve the performance of core aggregate operations.
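The general map-and-reduce pattern behind such parallel aggregation can be sketched in plain Python as follows. This is a single-process simulation under the assumption that the data is already split into partitions; it uses no HBase or Hadoop API, does not reproduce the paper's storage model, and all names and sample values are hypothetical.

```python
from collections import defaultdict


def map_partial_sums(rows, key_col, value_col):
    """Map step: aggregate one partition locally into per-key partial sums."""
    partial = defaultdict(int)
    for row in rows:
        partial[row[key_col]] += row[value_col]
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: merge the partial sums produced by all partitions."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


if __name__ == "__main__":
    # Two partitions of hypothetical order rows: (region, quantity).
    partitions = [
        [{"region": "ASIA", "qty": 5}, {"region": "EUROPE", "qty": 2}],
        [{"region": "ASIA", "qty": 1}, {"region": "EUROPE", "qty": 7}],
    ]
    partials = [map_partial_sums(p, "region", "qty") for p in partitions]
    print(reduce_partials(partials))  # {'ASIA': 6, 'EUROPE': 9}
```

Because each map call reads only the grouping column and the aggregated column, a layout that stores those columns together lets each task avoid reading the rest of every row.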
ISBN: 9780769561493 (print)
Sorting has been one of the most challenging and widely studied problems across scientific research. Although many techniques and algorithms have been proposed for efficient parallel sorting, achieving the desired performance on different types of architectures with large numbers of processors is still a challenging issue. The level of parallelism in applications can be maximized by minimizing overheads due to load imbalance and waiting time due to memory latencies. In this paper, we present a distributed sorting algorithm implemented in PGX.D, a fast distributed graph processing system, which outperforms Spark's distributed sorting implementation by around 2x-3x by hiding communication latencies and minimizing unnecessary overheads. Furthermore, we show that the proposed PGX.D sorting method efficiently handles datasets containing many duplicated data entries and always keeps workloads balanced for different input data distribution types.
ISBN: 9780769546766 (print)
Emerging digital television applications and conventional MPSoC architectures face drastically increasing performance and flexibility requirements. To display high-quality images on display devices, several image processing algorithms have to be performed. However, these algorithms are nonstandard and change case by case. It is difficult to achieve real-time processing with a general-purpose processor or DSP. In this paper, we present a reconfigurable Application Specific Instruction-set Processor (ASIP) which can perform several image processing algorithms using the same datapath. It can complete several 1D filtering operations within 8 cycles/pixel, achieving 16 times higher performance than a conventional RISC processor. The performance of this ASIP meets the requirements of FullHD (1920x1080) applications.