Branch-and-Bound (B&B) algorithms are well-known tree-based exploratory methods for solving NP-hard discrete optimization problems to optimality. The construction of the B&B tree and its exploration are performed using four operators: branching, bounding, selection, and pruning. Such algorithms are irregular, which makes their parallel design and implementation on GPU accelerators challenging. Among the few existing related works, we have recently revisited the bounding operator on GPU. The reported results show that speedups of up to ×100 can be obtained on recent GPU cards. In this paper, we address the GPU-based design and implementation of B&B algorithms considering the branching and pruning operators as well as the bounding one. The proposed template transforms the unpredictable and irregular workload associated with the explored B&B tree into regular data-parallel kernels optimized for the SIMD-based execution model of GPUs. Thread divergence and uncoalesced memory accesses are addressed in the optimization process. The proposed approach has been evaluated on the Flow-Shop scheduling problem and compared to another GPU-based strategy and to a cluster-of-workstations (COW) based approach. The reported results demonstrate the efficiency of the proposed approach over the two other ones. Speedups of up to ×160 are obtained for large problem instances on an Nvidia Tesla C2050 hardware configuration.
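To fix the vocabulary of the four operators, here is a minimal serial B&B sketch in C on a toy 0/1 knapsack instance. All data values are made up, and the bound is deliberately weak; the paper itself targets Flow-Shop and runs these operators as data-parallel GPU kernels over whole pools of tree nodes, which this sketch does not attempt.

/* Minimal serial B&B skeleton illustrating the four operators
 * (branching, bounding, selection, pruning) on a toy 0/1 knapsack.
 * All problem data below are made-up example values. */
#include <stdio.h>

#define N 4
static const int value[N]  = {10, 7, 25, 24};
static const int weight[N] = { 2, 1,  6,  5};
#define CAP 7

typedef struct { int depth, weight, value; } Node;

static int best = 0;                      /* incumbent solution */

/* Bounding: optimistic upper bound = current value plus all
 * remaining item values (weak but valid for pruning). */
static int bound(const Node *n) {
    int ub = n->value;
    for (int i = n->depth; i < N; i++) ub += value[i];
    return ub;
}

static void explore(Node n) {             /* selection: DFS order */
    if (n.depth == N) {
        if (n.value > best) best = n.value;
        return;
    }
    if (bound(&n) <= best) return;        /* pruning */
    /* Branching: decide on item n.depth (skip it / take it). */
    Node skip = { n.depth + 1, n.weight, n.value };
    explore(skip);
    if (n.weight + weight[n.depth] <= CAP) {
        Node take = { n.depth + 1, n.weight + weight[n.depth],
                      n.value + value[n.depth] };
        explore(take);
    }
}

int main(void) {
    Node root = {0, 0, 0};
    explore(root);
    printf("best value = %d\n", best);    /* 34 for this toy data */
    return 0;
}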
ISBN (digital): 9783642311253
ISBN (print): 9783642311246; 9783642311253
Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this problem. Our approach is based on a construct called the conflict graph, which (i) provides insight into the amount of parallelism that can be extracted from an irregular algorithm, and (ii) can be used to address the processor allocation problem for irregular algorithms. We show that this problem is related to a generalization of the unfriendly seating problem and, by extending Turán's theorem, we obtain a worst-case class of problems for optimistic parallelization, which we use to derive a lower bound on the exploitable parallelism. Finally, using some theoretically derived properties and some experimental facts, we design a quick and stable control strategy for solving the processor allocation problem heuristically.
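As a rough illustration of the conflict-graph construct, the following C sketch (with a made-up conflict matrix) runs a greedy maximal independent set over the tasks: the tasks it picks can execute in one optimistic round without aborting each other, so the set size is a quick estimate of the exploitable parallelism. This is only the intuition, not the paper's control strategy.

/* Conflict-graph sketch: tasks are vertices, an edge joins two
 * tasks whose footprints interfere. A greedy maximal independent
 * set estimates how many tasks one round can run conflict-free. */
#include <stdio.h>

#define T 6                               /* number of tasks */

int main(void) {
    /* conflict[i][j] = 1 if tasks i and j would abort each other
     * (symmetric, made-up example data) */
    static const int conflict[T][T] = {
        {0,1,0,0,1,0}, {1,0,1,0,0,0}, {0,1,0,1,0,0},
        {0,0,1,0,1,0}, {1,0,0,1,0,1}, {0,0,0,0,1,0},
    };
    int picked[T] = {0}, blocked[T] = {0}, width = 0;

    for (int i = 0; i < T; i++) {         /* greedy independent set */
        if (blocked[i]) continue;
        picked[i] = 1; width++;
        for (int j = 0; j < T; j++)
            if (conflict[i][j]) blocked[j] = 1;
    }
    printf("estimated parallel width: %d of %d tasks\n", width, T);
    for (int i = 0; i < T; i++)
        if (picked[i]) printf("run task %d this round\n", i);
    return 0;
}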
ISBN (print): 9781450307437
Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during execution. Therefore, determining how many processors should be allocated for execution (the processor allocation problem) is very difficult for irregular algorithms. In this work, we outline the first systematic strategy for addressing this problem.
ISBN (print): 9781450306980
Irregular or pointer-based structures such as graphs and trees are commonly used in algorithms dealing with sparse data. Given their reliance on pointers, these algorithms are difficult to analyze, and the structure of their memory accesses is obfuscated, which makes the extraction of parallelism difficult. In this work, we present a framework that is capable of reasoning about the semantics of the dynamic data footprints of operations to determine their potential overlap. We leverage knowledge that the programmer has about the algorithm's access patterns but is currently unable to express. This knowledge allows our runtime either to make a parallelization decision or to throttle concurrency to improve performance in Software Transactional Memories (STMs) [6]. Our framework relies on programmer-supplied predicates that are evaluated at runtime and used to probabilistically assert certain properties about data footprints. We present simple abstractions and a low-overhead runtime to support our framework. We demonstrate our work by parallelizing a graph-coloring benchmark and by improving the transactional performance of benchmarks from the STAMP suite.
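A minimal sketch of the predicate idea, assuming a graph-coloring setting where two recoloring operations can interfere only if their vertices lie within two hops of each other. The graph, the distance-2 rule, and all names below are illustrative assumptions, not the paper's API.

/* Programmer-supplied footprint predicate: before running two
 * operations concurrently, the runtime evaluates a cheap and
 * possibly conservative predicate over their starting points to
 * judge whether their dynamic footprints may overlap. */
#include <stdio.h>

typedef int (*overlap_pred)(int a, int b);

#define V 5                               /* toy path graph 0-1-2-3-4 */
static const int adj[V][V] = {
    {0,1,0,0,0},{1,0,1,0,0},{0,1,0,1,0},{0,0,1,0,1},{0,0,0,1,0}};

/* For graph coloring, two recolorings may overlap only if their
 * vertices are within distance 2 of each other. */
static int within_two_hops(int a, int b) {
    if (a == b || adj[a][b]) return 1;
    for (int m = 0; m < V; m++)
        if (adj[a][m] && adj[m][b]) return 1;
    return 0;
}

/* Runtime decision: run in parallel, or throttle and serialize. */
static void schedule(int a, int b, overlap_pred may_overlap) {
    if (may_overlap(a, b))
        printf("ops on %d and %d: may overlap -> serialize\n", a, b);
    else
        printf("ops on %d and %d: disjoint -> run concurrently\n", a, b);
}

int main(void) {
    schedule(0, 2, within_two_hops);      /* distance 2 -> serialize */
    schedule(0, 4, within_two_hops);      /* distance 4 -> concurrent */
    return 0;
}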
ISBN (print): 9781605587080
Many large-scale parallel programs follow a bulk synchronous parallel (BSP) structure with distinct computation and communication phases. Although the communication phase in such programs may involve all (or large numbers) of the participating processes, the actual communication operations are usually sparse in nature. As a result, communication phases are typically expressed explicitly using point-to-point communication operations or collective operations. We define the dynamic sparse data-exchange (DSDE) problem and derive bounds in the well-known LogGP model. While current approaches work well with static applications, they run into limitations as modern applications grow in scale and the problems being solved become increasingly irregular and dynamic. To enable the compact and efficient expression of the communication phase, we develop suitable sparse communication protocols for irregular applications at large scale. We discuss different irregular applications and show the sparsity in the communication for real-world input data. We discuss the time and memory complexity of commonly used protocols for the DSDE problem and develop NBX, a novel fast algorithm with constant memory overhead for solving it. Algorithm NBX improves the runtime of a sparse data exchange among 8,192 processors on BlueGene/P by a factor of 5.6. In an application study, we show improvements of up to a factor of 28.9 for a parallel breadth-first search on 8,192 BlueGene/P processors.
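The NBX protocol combines synchronous-mode nonblocking sends with a nonblocking barrier: once a process's own sends have completed (and have therefore been matched by receivers), it enters the barrier and keeps receiving until the barrier completes, at which point no message can still be in flight. A minimal MPI-3 sketch of that structure follows; the ring pattern and payload are made up and error handling is omitted.

/* NBX-style dynamic sparse data exchange: Issend + Ibarrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* toy sparse pattern: each rank sends one int to (rank+1)%size */
    int dest = (rank + 1) % size, payload = rank;
    MPI_Request send_req, barrier_req;
    MPI_Issend(&payload, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &send_req);

    int barrier_active = 0, done = 0;
    while (!done) {
        int flag;
        MPI_Status st;
        MPI_Iprobe(MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &flag, &st);
        if (flag) {                        /* drain incoming messages */
            int data;
            MPI_Recv(&data, 1, MPI_INT, st.MPI_SOURCE, 0,
                     MPI_COMM_WORLD, &st);
            printf("rank %d got %d from %d\n", rank, data, st.MPI_SOURCE);
        }
        if (!barrier_active) {
            /* Issend completes only once matched, so local send
             * completion lets this rank enter the barrier. */
            MPI_Test(&send_req, &flag, MPI_STATUS_IGNORE);
            if (flag) {
                MPI_Ibarrier(MPI_COMM_WORLD, &barrier_req);
                barrier_active = 1;
            }
        } else {
            MPI_Test(&barrier_req, &done, MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}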
Clusters of symmetric multiprocessors (SMPs) are popular platforms for parallel programming since they provide large computational power for a reasonable price. For irregular application programs with dynamically changing computation and data access behavior, a flexible programming model is needed to achieve efficiency. In this paper, we propose Task Pool Teams as a hybrid parallel programming environment for realizing irregular algorithms on clusters of SMPs. Task Pool Teams combine task pools on single cluster nodes with an explicit message-passing layer. They offer load balancing together with multi-threaded, asynchronous communication. Appropriate communication protocols and task pool implementations are provided and made accessible through an easy-to-use application programmer interface. As application examples, we present a branch-and-bound algorithm and the hierarchical radiosity algorithm. Copyright (c) 2006 John Wiley & Sons, Ltd.
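The node-local half of a Task Pool Team is an ordinary shared task pool drained by worker threads. A minimal pthreads sketch of such a pool follows; the explicit message-passing layer and the communication thread that couple pools across cluster nodes are omitted, and task payloads are plain integers.

/* Node-local shared task pool: worker threads drain a common
 * mutex-protected queue. In a Task Pool Team, a communication
 * thread (omitted here) would push remote tasks into this pool. */
#include <pthread.h>
#include <stdio.h>

#define POOL_CAP 64
#define WORKERS 4

static int tasks[POOL_CAP], head = 0, tail = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static int pool_get(int *task) {          /* returns 0 if pool empty */
    pthread_mutex_lock(&lock);
    int ok = head < tail;
    if (ok) *task = tasks[head++];
    pthread_mutex_unlock(&lock);
    return ok;
}

static void *worker(void *arg) {
    long id = (long)arg;
    int t;
    while (pool_get(&t))                  /* drain until empty */
        printf("worker %ld processes task %d\n", id, t);
    return NULL;
}

int main(void) {
    for (tail = 0; tail < 16; tail++) tasks[tail] = tail; /* seed pool */
    pthread_t th[WORKERS];
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}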
ISBN (print): 9781424416936
Task pools can be used to achieve the dynamic load balancing that is required for an efficient parallel implementation of irregular applications. However, the performance strongly depends on a task pool implementation that is well suited for the specific application. This paper introduces an adaptive task pool implementation that enables a stepwise transition between the common strategies of central and distributed task pools. The influence of the task size on the parallel performance is investigated, and it is shown that the adaptive implementation provides the flexibility to adapt to different situations. Performance results from benchmark programs and from an irregular application for anomalous diffusion simulation are presented to demonstrate the need for an adaptive strategy. It is shown that profiling information about the overhead of the task pool implementation can be used to determine an optimal task pool strategy.
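One way to picture the stepwise transition is as a single group-count knob g over P worker threads: the threads in a group share one queue, so g = 1 recovers the central pool and g = P the fully distributed one, with intermediate settings in between. The sketch below only illustrates that mapping; the modulo scheme and all names are assumptions, not the paper's implementation.

/* Adaptive pool as a group-count knob: P threads, g queues,
 * thread i uses queue i % g. g = 1 -> central pool (best balance,
 * most contention); g = P -> distributed pools (no contention,
 * no implicit balance). Profiled pool overhead would drive g. */
#include <stdio.h>

#define P 8

static int queue_of(int thread, int g) { return thread % g; }

int main(void) {
    int settings[] = {1, 2, 4, P};        /* stepwise transition */
    for (int s = 0; s < 4; s++) {
        int g = settings[s];
        printf("g = %d:", g);
        for (int i = 0; i < P; i++)
            printf(" t%d->q%d", i, queue_of(i, g));
        printf("\n");
    }
    return 0;
}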
Since a static work distribution does not allow for satisfactory speed-ups of parallel irregular algorithms, there is a need for a dynamic distribution of work and data that can be adapted to the runtime behavior of the algorithm. Task pools are data structures which can distribute tasks dynamically to different processors where each task specifies computations to be performed and provides the data for these computations. This paper discusses the characteristics of task-based algorithms and describes the implementation of selected types of task pools for shared-memory multiprocessors. Several task pools have been implemented in C with POSIX threads and in Java. The task pools differ in the data structures to store the tasks, the mechanism to achieve load balance, and the memory manager used to store the tasks. Runtime experiments have been performed on three different shared-memory systems using a synthetic algorithm, the hierarchical radiosity method, and a volume rendering algorithm. Copyright (C) 2004 John Wiley Sons, Ltd.
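A distributed task pool achieves load balance by stealing: every thread owns a queue and, when it runs dry, takes tasks from another thread's queue. A compact pthreads sketch of that mechanism follows (integer tasks, deliberately imbalanced seed data); the paper's implementations use richer task data structures and dedicated memory managers.

/* Per-thread queues with stealing: each worker drains its own
 * queue first, then sweeps the other queues for leftover work. */
#include <pthread.h>
#include <stdio.h>

#define WORKERS 4
#define PER_Q 8

typedef struct {
    int tasks[PER_Q], head;
    pthread_mutex_t lock;
} Queue;

static Queue q[WORKERS];

static int take(int w, int *task) {       /* pop from queue w */
    pthread_mutex_lock(&q[w].lock);
    int ok = q[w].head < PER_Q;
    if (ok) *task = q[w].tasks[q[w].head++];
    pthread_mutex_unlock(&q[w].lock);
    return ok;
}

static void *worker(void *arg) {
    long id = (long)arg;
    int t;
    while (take((int)id, &t))              /* own queue first */
        printf("worker %ld runs task %d\n", id, t);
    for (int v = 0; v < WORKERS; v++) {    /* then steal elsewhere */
        if (v == (int)id) continue;
        while (take(v, &t))
            printf("worker %ld steals task %d from queue %d\n", id, t, v);
    }
    return NULL;
}

int main(void) {
    for (int w = 0; w < WORKERS; w++) {
        pthread_mutex_init(&q[w].lock, NULL);
        q[w].head = (w == 0) ? 0 : PER_Q;  /* imbalance: only q0 full */
        for (int i = 0; i < PER_Q; i++) q[w].tasks[i] = w * PER_Q + i;
    }
    pthread_t th[WORKERS];
    for (long i = 0; i < WORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(th[i], NULL);
    return 0;
}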