Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for genera...
详细信息
ISBN:
(纸本)9781467355247
Problem domains are commonly decomposed hierarchically to fully utilize parallel resources in modern microprocessors. Such decompositions can be provided as library routines, written by experienced experts, for general algorithmic patterns. But such APIs tend to be constrained to certain architectures or data sizes. Integrating them with application code is often an unnecessarily daunting task, especially when these routines need to be closely coupled with user code to achieve better performance. This paper contributes HiDP, a high-level hierarchical data parallel language. The purpose of HiDP is to improve the coding productivity of integrating hierarchical data parallelism without significant loss of performance. HiDP is a source-to-source compiler that converts a very concise data parallel language into CUDA C++ source code. Internally, it performs necessary analysis to compose user code with efficient and architecture-aware code snippets. This paper discusses various aspects of HiDP systematically: the language, the compiler and the run-time system with built-in tuning capabilities. They enable HiDP users to express algorithms in less code than low-level SDKs require for native platforms. HiDP also exposes abundant computing resources of modern parallelarchitectures. Improved coding productivity tends to come with a sacrifice in performance. Yet, experimental results show that the generated code delivers performance very close to handcrafted native GPU code.
Subgraph isomorphism is a fundamental property of graphs that requires checking whether the network structure of one graph can be found (embedded) within another graph. It has numerous applications and is a computatio...
详细信息
ISBN:
(纸本)9781450361842
Subgraph isomorphism is a fundamental property of graphs that requires checking whether the network structure of one graph can be found (embedded) within another graph. It has numerous applications and is a computationally challenging problem: it is NP-complete and known algorithms explore an exponentially large search space. Even though it has been studied extensively, relatively little is known about whether subgraph isomorphism accepts a theoretically and practically efficient parallel solution. In this paper, we present our ongoing work on designing a parallel algorithm for the subgraph (and graph) isomorphism problem, which addresses challenges commonly faced when attempting to obtain a parallel algorithm for isomorphism. Our algorithm appears to scale well up to the 70 cores of our empirical machine.
A practical implementation of high performance instruction level parallelarchitectures is constrained by the difficulty to build a large monolithic multi-ported register file (RF). A solution is to partition the RF i...
详细信息
A practical implementation of high performance instruction level parallelarchitectures is constrained by the difficulty to build a large monolithic multi-ported register file (RF). A solution is to partition the RF into smaller RFs while keeping the total number of registers and ports equal. This paper applies RF partitioning to transport triggered architectures (TTAs); these architectures are of the VLIW type. One may expect that partitioning increases the number of executed cycles because it constrains the number of ports per RF. It is shown that these performance losses are small; e.g. partitioning an RF with 24 registers and four read and four write ports into four RFs with 6 registers and one read and one write port gives a performance loss of only 5.8%. Partitioned RFs consume less area than monolithic RFs with the same number of ports and registers. Experiments show that, if the area saved by partitioning is spent on extra registers, partitioning does, on average, not reduce the performance; it may even result in a small performance gain.
The Lovasz Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. Th...
详细信息
ISBN:
(纸本)9781510836358
The Lovasz Local Lemma (LLL) is a cornerstone principle in the probabilistic method of combinatorics, and a seminal algorithm of Moser & Tardos (2010) provides an efficient randomized algorithm to implement it. This algorithm can be parallelized to give an algorithm that uses polynomially many processors and runs in O(log~3 n) time, stemming from O(log n) adaptive computations of a maximal independent set (MIS). Chung et al. (2014) developed faster local and parallelalgorithms, potentially running in time O(log~2 n), but these algorithms work under significantly more stringent conditions than the LLL. We give a new parallel algorithm that works under essentially the same conditions as the original algorithm of Moser & Tardos but uses only a single MIS computation, thus running in O(log~2 n) time. This conceptually new algorithm also gives a clean combinatorial description of a satisfying assignment which might be of independent interest. Our techniques extend to the deterministic LLL algorithm given by Chandrasekaran et al. (2013) leading to an NC-algorithm running in time O(log~2 n) as well. We also provide improved bounds on the run-times of the sequential and parallel resampling-based algorithms originally developed by Moser & Tardos. Our bounds extend to any problem instance in which the tighter Shearer LLL criterion is satisfied. We also improve on the analysis of Kolipaka & Szegedy (2011) to give tighter concentration results.
We present an optimal parallel selection algorithm on the EREW PRAM. This algorithm runs in O(log n) time with n/log n processors. This complexity matches the known lower bound for parallel selection on the EREW PRAM ...
详细信息
ISBN:
(纸本)9780898715385
We present an optimal parallel selection algorithm on the EREW PRAM. This algorithm runs in O(log n) time with n/log n processors. This complexity matches the known lower bound for parallel selection on the EREW PRAM model. We therefore close this problem which has been open for more than a decade.
A dominant cost for query evaluation in modern massively distributed systems is the number of communication rounds. For this reason, there is a growing interest in single-round multiway join algorithms where data are ...
详细信息
A dominant cost for query evaluation in modern massively distributed systems is the number of communication rounds. For this reason, there is a growing interest in single-round multiway join algorithms where data are first reshuffled over many servers and then evaluated in a parallel but communication-free way. The reshuffling itself is specified as a distribution policy. We introduce a correctness condition, called parallel-correctness, for the evaluation of queries w.r.t. a distribution policy. We study the complexity of parallel-correctness for conjunctive queries as well as transferability of parallel-correctness between queries. We also investigate the complexity of transferability for certain families of distribution policies, including the Hyper-cube distribution policies.
暂无评论