ISBN:
(Print) 9781467368322
In this paper, we propose and investigate a task mapping algorithm for big data applications. As a critical resource, data are produced faster than ever before, and parallel programs that process these data on massively parallel systems are widely adopted. The task mapping algorithm, however, has not been well optimized for these applications. We explore the characteristics of big data applications on a shared cache/memory multicore processor, and analyse the latencies of the cache and memory sub-systems. The proposed algorithm is designed to optimize cache/memory latency as well as intra-application latency. We introduce an efficient greedy algorithm that computes the mapping based on the congregate degree of nodes. Different numbers of search spaces are discussed and evaluated. Experiments are conducted using synthetic simulation and by running real applications in a full-system simulation environment. The results confirm the effectiveness of the proposed algorithm: the average execution time of five selected big data applications is reduced by 8% compared with the first-fit algorithm.
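The abstract does not define "congregate degree" or the exact mapping procedure, so the following is a minimal sketch under assumptions: "congregate degree" is taken to be a task's total communication weight toward already-mapped neighbours, cores form a mesh, and latency grows with Manhattan distance. All names here are illustrative, not the paper's.

```python
# Hypothetical sketch of a congregate-degree-driven greedy mapper.
# Assumed (not from the paper): congregate degree = a task's total
# communication weight with already-placed tasks; cores form a mesh
# where cost grows with Manhattan distance.
from itertools import product

def greedy_map(edges, mesh_w, mesh_h):
    """edges: {(task_a, task_b): comm_weight}; returns {task: (x, y) core}."""
    tasks = sorted({t for e in edges for t in e})
    free = set(product(range(mesh_w), range(mesh_h)))
    # Seed with the task having the largest total communication volume.
    vol = {t: sum(w for e, w in edges.items() if t in e) for t in tasks}
    seed = max(tasks, key=lambda t: vol[t])
    placed = {seed: (0, 0)}
    free.discard((0, 0))
    while len(placed) < len(tasks):
        # Congregate degree: communication weight toward placed tasks.
        def cdeg(t):
            return sum(w for (a, b), w in edges.items()
                       if (a == t and b in placed) or (b == t and a in placed))
        nxt = max((t for t in tasks if t not in placed), key=cdeg)
        # Place on the free core minimising weighted Manhattan distance
        # to the already-placed communication partners.
        def cost(core):
            return sum(w * (abs(core[0] - placed[o][0]) +
                            abs(core[1] - placed[o][1]))
                       for (a, b), w in edges.items()
                       for o in ((b if a == nxt else a),)
                       if nxt in (a, b) and o in placed)
        core = min(free, key=cost)
        placed[nxt] = core
        free.discard(core)
    return placed
```

The greedy order ensures that heavily-communicating tasks are placed early and near each other, which is the intuition behind optimizing intra-application latency.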
With the wide diffusion of parallel architectures, parallelism has become an indispensable factor in application design. However, the cost of the parallelization process of existing applications is still too high i...
Extending expertise in parallel computing is critical to all those using high performance computing to gain insights into science and engineering problems. Many campuses do not offer such a course because of course lo...
Graphics Processing Units (GPUs) are becoming increasingly popular not only across various scientific communities, but also as integrated data-parallel accelerators on existing multicore processors. Support for massive fine-grained parallelism in contemporary GPUs provides a tremendous amount of computing power. GPUs support thousands of lightweight threads to deliver high computational throughput. The popularity of GPUs is facilitated by easy-to-adopt programming models such as CUDA and OpenCL that aim to ease programmers' efforts while developing parallel GPU applications. However, designing and implementing correct and efficient GPU programs is still challenging, since programmers must consider the interaction between thousands of parallel threads. Therefore, addressing these challenges is essential for improving programmers' productivity as well as software reliability. Towards this end, this dissertation proposes mechanisms for improving the programmability of irregular applications and ensuring the correctness of compute kernels. Some applications possess abundant data-level parallelism, but are unable to take advantage of the GPU's parallelism because they exhibit irregular memory access patterns to shared data structures. Programming such applications on GPUs requires synchronization mechanisms such as locks, which significantly increase programming complexity. Coarse-grained locking, where a single lock controls all shared resources, reduces programming effort but can substantially serialize GPU threads. On the other hand, fine-grained locking, where each data element is protected by an independent lock, facilitates maximum parallelism but requires significant programming effort. To overcome these challenges, we propose transactional memory (TM) on the GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming effort. Transactional execution can incur runtime overheads due to activities such as detecting confl...
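The coarse- versus fine-grained trade-off described above can be illustrated with a toy sketch (CPU threads here, not the dissertation's GPU TM): one table serializes every update behind a single lock, the other holds one lock per element so only conflicting threads serialize. Both yield the same final state; they differ only in available parallelism and programming effort.

```python
# Illustrative sketch contrasting coarse- and fine-grained locking on a
# shared table of counters. Not the dissertation's implementation.
import threading

class CoarseTable:
    def __init__(self, n):
        self.data = [0] * n
        self.lock = threading.Lock()          # one lock serialises everything
    def add(self, i, v):
        with self.lock:
            self.data[i] += v

class FineTable:
    def __init__(self, n):
        self.data = [0] * n
        self.locks = [threading.Lock() for _ in range(n)]  # lock per element
    def add(self, i, v):
        with self.locks[i]:                   # only conflicting threads serialise
            self.data[i] += v

def hammer(table, threads=4, per_thread=1000):
    """Concurrently increment counters round-robin from several threads."""
    def worker():
        for k in range(per_thread):
            table.add(k % len(table.data), 1)
    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts: t.start()
    for t in ts: t.join()
```

A TM system aims for the concurrency profile of `FineTable` with the single-line programming effort of `CoarseTable`, at the cost of runtime conflict detection.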
ISBN:
(Print) 9781450335867
We present a scalable parallel solver for numerical constraint satisfaction problems (NCSPs). Our parallelization scheme consists of homogeneous worker solvers, each of which runs on an available core and communicates with the others via the global load balancing (GLB) method. The search tree of the branch and prune algorithm is split and distributed through the two phases of GLB: a random workload stealing phase, and a workload distribution and termination phase based on a hypercube-shaped graph called a lifeline. The parallel solver is simply implemented with X10, which provides an implementation of GLB as a library. In experiments, NCSPs from the literature were solved, attaining up to a 516-fold speedup using 600 cores of the TSUBAME2.5 supercomputer. Optimal GLB configurations are analyzed.
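The lifeline topology mentioned above can be sketched as follows, under the common assumption that the lifeline graph is a hypercube: worker i is linked to every worker whose id differs from i in exactly one bit, so each of n workers has about log2(n) lifelines. This is a sketch of the topology only; the stealing protocol and X10 machinery are elided.

```python
# Sketch of a hypercube lifeline graph for GLB-style load balancing:
# idle workers first steal from random victims, then wait on their
# "lifeline" neighbours. Worker i's lifelines are the workers whose
# ids differ from i in exactly one bit.
def lifeline_neighbors(worker, n_workers):
    """Hypercube lifeline edges of `worker` among workers 0..n_workers-1."""
    out, bit = [], 1
    while bit < n_workers:
        peer = worker ^ bit          # flip one bit of the worker id
        if peer < n_workers:         # skip ids outside the worker range
            out.append(peer)
        bit <<= 1
    return out
```

Because every worker has only O(log n) lifelines, an idle worker that fails random stealing blocks on few peers, yet any new work can reach all workers in O(log n) hops.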
Teaching supercomputing technologies is a very important and complex task. The wide variety of computer systems and technologies complicates the training process. Educational and research systems can help. We are presenting tw...
ISBN:
(Print) 9781450332057
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA), where accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) today's programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This paper presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer's insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB, but more importantly, optimized programs are portable across different NUMA architectures and preserve data locality even when composed with other parallel computations.
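The task-affinity idea can be modelled with a toy scheduler (this is an assumption-laden sketch, not the TBB-NUMA API): each NUMA node owns a task queue, tasks carry an affinity hint, and a worker prefers its local queue before stealing from remote nodes, so data tends to stay where the hint placed it.

```python
# Toy model of affinity-aware scheduling: per-node queues, local-first
# dispatch, stealing as a fallback. Illustrative only; not TBB-NUMA.
from collections import deque

class AffinityScheduler:
    def __init__(self, n_nodes):
        self.queues = [deque() for _ in range(n_nodes)]

    def submit(self, task, affinity):
        # Honour the programmer's placement hint.
        self.queues[affinity].append(task)

    def next_task(self, node):
        if self.queues[node]:            # local first: data stays local
            return self.queues[node].popleft()
        for q in self.queues:            # fall back to stealing remotely
            if q:
                return q.popleft()
        return None
```

The composability point in the abstract corresponds to the fallback path: a computation never deadlocks waiting for "its" node, it merely loses locality when it must steal.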
ISBN:
(Print) 9781450332057
We present a new, concurrent, lock-free priority queue that relaxes the delete-min operation to allow deletion of any of the ρ+1 smallest keys instead of only a minimal one, where ρ is a parameter that can be configured at runtime. It is built from a logarithmic number of sorted arrays, similar to log-structured merge-trees (LSM). For keys added and removed by the same thread the behavior is identical to a non-relaxed priority queue. We compare to state-of-the-art lock-free priority queues with both relaxed and non-relaxed semantics, showing high performance and good scalability of our approach.
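The relaxed delete-min semantics can be modelled in a few lines (a sketch of the semantics only: the log-structured sorted arrays and lock-freedom that make the real queue scalable are elided, and the class name is invented here):

```python
# Model of relaxed delete-min: deletion may return ANY of the rho+1
# smallest keys, where rho is configured at construction time.
# Sequential sketch; the lock-free LSM machinery is elided.
import heapq
import random

class RelaxedPQ:
    def __init__(self, rho):
        self.rho = rho
        self.keys = []

    def insert(self, k):
        heapq.heappush(self.keys, k)

    def delete_min(self):
        # Pick uniformly among the rho+1 smallest keys; a concurrent
        # implementation exploits this freedom to avoid contention on
        # the single minimum.
        k = min(len(self.keys), self.rho + 1)
        candidates = heapq.nsmallest(k, self.keys)
        pick = random.choice(candidates)
        self.keys.remove(pick)
        heapq.heapify(self.keys)
        return pick
```

Setting `rho = 0` recovers exact delete-min, matching the abstract's claim that a single thread inserting and removing its own keys observes non-relaxed behavior.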
Nowadays, more and more program analysis tools adopt profiling approaches in order to obtain data dependences, because of their ability to track dynamically allocated memory, pointers, and array indices. However, ...
ISBN:
(Print) 9781450332385
We present Consequence, a deterministic multi-threading library. Consequence achieves deterministic execution via store buffering and strict ordering of synchronization operations. To ensure high performance under a wide variety of conditions, the ordering of synchronization operations is based on a deterministic clock [25], and store buffering is implemented using version-controlled memory [23]. Recent work on deterministic concurrency [14, 19] has proposed relaxing the consistency model beyond total store order (TSO). Through novel optimizations, Consequence achieves the same or better performance on the Phoenix, PARSEC and SPLASH-2 benchmark suites, while retaining TSO memory consistency. Across 19 benchmark programs, Consequence incurs a worst-case slowdown of 3.9× vs. pthreads, with 14 out of 19 programs at or below 2.5×. We believe this performance improvement takes parallel programming one step closer to "determinism by default."
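The two ingredients named in the abstract, store buffering and a deterministic logical clock, can be illustrated with a toy (this is not Consequence itself; the function and data layout are invented for illustration): threads buffer stores locally and publish them to shared memory only at deterministic clock ticks, so the final state is independent of the physical interleaving.

```python
# Toy illustration of deterministic execution via store buffering plus a
# deterministic logical clock. Each "thread" is a list of buffered
# stores; commits happen round-robin, one store per thread per tick,
# regardless of how the OS would actually schedule the threads.
def run_deterministic(programs):
    """programs: list of per-thread [(addr, value), ...] store lists."""
    shared = {}
    buffers = [list(p) for p in programs]   # private store buffers
    while any(buffers):
        # Deterministic clock tick: commit one buffered store per
        # thread, in fixed thread order.
        for buf in buffers:
            if buf:
                addr, val = buf.pop(0)
                shared[addr] = val
    return shared
```

Conflicting stores to the same address are resolved by the clock order, never by a race, which is why repeated runs of the same program produce identical memory states.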