The latest direction in cache-aware/cache-efficient algorithms is to use cache-oblivious algorithms, which are based on the cache-oblivious model, a refinement of the external-memory model. The cache-oblivious model exploits memory hierarchies without knowing the memory parameters in advance, since algorithms in this model tune themselves automatically to the actual parameters. As a result, cache-oblivious algorithms are particularly well suited to multi-level caches with varying parameters and to environments in which the amount of memory available to an algorithm can fluctuate. This paper surveys the state of the art in cache-oblivious algorithms and data structures, giving for each its complexity in terms of cache misses, known as its cache complexity. In addition, the paper introduces an extension that minimizes the cache complexity of neural networks by applying an appropriate cache-oblivious approach to them.
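As an illustration of the cache-oblivious idea (not drawn from the paper itself), the divide-and-conquer matrix transpose below never references a cache size or line size, yet its recursion eventually reaches subproblems that fit in every level of the hierarchy; under the usual tall-cache assumption it incurs O(nm/B) cache misses. The function names and the base-case cutoff are illustrative choices.

```cpp
#include <cstddef>
#include <vector>

// Cache-oblivious out-of-place transpose: B[j][i] = A[i][j].
// The recursion splits the larger dimension until the subproblem is tiny,
// so at some recursion level every subproblem fits in cache -- without the
// code ever knowing the cache size or line size.
void co_transpose(const double* A, double* B, std::size_t n, std::size_t m,
                  std::size_t rowA, std::size_t colA,
                  std::size_t rows, std::size_t cols) {
    const std::size_t CUTOFF = 16;  // illustrative base-case size
    if (rows <= CUTOFF && cols <= CUTOFF) {
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                B[(colA + j) * n + (rowA + i)] = A[(rowA + i) * m + (colA + j)];
    } else if (rows >= cols) {
        co_transpose(A, B, n, m, rowA, colA, rows / 2, cols);
        co_transpose(A, B, n, m, rowA + rows / 2, colA, rows - rows / 2, cols);
    } else {
        co_transpose(A, B, n, m, rowA, colA, rows, cols / 2);
        co_transpose(A, B, n, m, rowA, colA + cols / 2, rows, cols - cols / 2);
    }
}

// Usage: A is n x m (row-major), B must be sized m x n (row-major).
void transpose(const std::vector<double>& A, std::vector<double>& B,
               std::size_t n, std::size_t m) {
    co_transpose(A.data(), B.data(), n, m, 0, 0, n, m);
}
```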
ISBN (Print): 9781450307437
For nested-parallel computations with low depth (span, critical path length), analyzing the work, depth, and sequential cache complexity suffices to attain reasonably strong bounds on the parallel runtime and cache complexity on machine models with either shared or private caches. These bounds, however, do not extend to general hierarchical caches, due to limitations in (i) the cache-oblivious (CO) model used to analyze cache complexity and (ii) the schedulers used to map computation tasks to processors. This paper presents the parallel cache-oblivious (PCO) model, a relatively simple modification to the CO model that can be used to account for costs on a broad range of cache hierarchies. The first change is to avoid capturing artificial data sharing among parallel threads, and the second is to account for parallelism-memory imbalances within tasks. Despite the more restrictive nature of PCO compared to CO, many algorithms have the same asymptotic cache complexity bounds. The paper then describes a new scheduler for hierarchical caches, which extends recent work on "space-bounded schedulers" to allow for computations with arbitrary work imbalance among parallel subtasks. This scheduler attains provably good cache performance and runtime on parallel machine models with hierarchical caches for nested-parallel computations analyzed using the PCO model. We show that under reasonable assumptions our scheduler is "work efficient" in the sense that the costs of the cache misses are evenly balanced across the processors; that is, the runtime can be determined within a constant factor by taking the total cost of the cache misses analyzed for a computation and dividing it by the number of processors. In contrast, to further support our model, we show that no scheduler can achieve such bounds (optimizing for both cache misses and runtime) if work, depth, and sequential cache complexity are the only parameters used to analyze a computation.
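The nested-parallel setting analyzed here is fork-join computation. A minimal sketch (not from the paper) is the divide-and-conquer reduction below, which has O(n) work, O(log n) depth, and O(n/B) sequential cache complexity, i.e. exactly the three parameters the paper argues are insufficient on their own for hierarchical caches. The grain size and the use of std::async are illustrative assumptions.

```cpp
#include <cstddef>
#include <future>
#include <numeric>

// Nested-parallel (fork-join) divide-and-conquer reduction.
// Work O(n), depth O(log n), sequential cache complexity O(n/B):
// the kind of computation a work/depth/cache-complexity analysis targets.
double parallel_sum(const double* a, std::size_t n) {
    const std::size_t GRAIN = 4096;  // illustrative grain size
    if (n <= GRAIN)
        return std::accumulate(a, a + n, 0.0);
    std::size_t half = n / 2;
    // Fork: the left half runs asynchronously, the right half in this task.
    auto left = std::async(std::launch::async, parallel_sum, a, half);
    double right = parallel_sum(a + half, n - half);
    return left.get() + right;  // Join.
}
```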
The objective of high performance computing (HPC) is to ensure that the computational power of hardware resources is well utilized to solve a problem. Various techniques are usually employed to achieve this goal. Improving an algorithm to reduce the number of arithmetic operations, modifying data-access patterns or rearranging data to reduce memory traffic, optimizing code at all levels, and designing parallel algorithms with smaller span or reduced overhead are some of the attractive areas that HPC researchers are working on. In this thesis, we investigate HPC techniques for the implementation of basic routines in computer algebra, targeting hardware acceleration technologies. We start with a sorting algorithm and its application to sparse matrix-vector multiplication, for which we focus on cache complexity issues. Since basic routines in computer algebra often expose a lot of fine-grained parallelism, we then turn our attention to many-core architectures, on which we consider dense polynomial and matrix operations ranging from plain to fast arithmetic. Most of these operations are combined within a bivariate system solver running entirely on a graphics processing unit (GPU).
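As context for the sparse matrix-vector multiplication mentioned above, the sketch below is a plain CSR kernel (an assumed layout, not the thesis's code): the indirect accesses x[col_idx[k]] are what make its cache complexity depend on how the nonzeros are ordered, which is where a cache-efficient sorting routine comes into play.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR (compressed sparse row) representation; names are illustrative.
struct CSRMatrix {
    std::size_t nrows;
    std::vector<std::size_t> row_ptr;  // size nrows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// y = A * x. The irregular reads x[col_idx[k]] make the cache behaviour of
// SpMV sensitive to the ordering of the nonzero entries, which is why
// sorting/reordering them can reduce memory traffic.
std::vector<double> spmv(const CSRMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.nrows, 0.0);
    for (std::size_t i = 0; i < A.nrows; ++i) {
        double acc = 0.0;
        for (std::size_t k = A.row_ptr[i]; k < A.row_ptr[i + 1]; ++k)
            acc += A.values[k] * x[A.col_idx[k]];
        y[i] = acc;
    }
    return y;
}
```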
ISBN (Print): 9781509057078
We propose a new algorithm for multiplying dense polynomials with integer coefficients in a parallel fashion, targeting multi-core processor architectures. Complexity estimates and experimental comparisons demonstrate the advantage of this new approach.
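The paper's algorithm itself is not reproduced here; as a reference point only, the sketch below is the plain (schoolbook) integer-coefficient product, parallelized over disjoint ranges of output coefficients, i.e. the kind of baseline a multi-core polynomial multiplication algorithm is measured against. The chunking scheme and the use of std::async are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <thread>
#include <vector>

// Baseline schoolbook product c = a * b over 64-bit integer coefficients,
// parallelized over disjoint ranges of output coefficients; each c[k] is
// written by exactly one task, so no synchronization is needed.
// (Reference baseline only, not the algorithm proposed in the paper.)
std::vector<std::int64_t> poly_mul(const std::vector<std::int64_t>& a,
                                   const std::vector<std::int64_t>& b) {
    if (a.empty() || b.empty()) return {};
    std::size_t out = a.size() + b.size() - 1;
    std::vector<std::int64_t> c(out, 0);
    std::size_t nworkers = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (out + nworkers - 1) / nworkers;
    std::vector<std::future<void>> tasks;
    for (std::size_t lo = 0; lo < out; lo += chunk) {
        std::size_t hi = std::min(out, lo + chunk);
        tasks.push_back(std::async(std::launch::async, [&, lo, hi] {
            for (std::size_t k = lo; k < hi; ++k)
                for (std::size_t i = (k < b.size() ? 0 : k - b.size() + 1);
                     i < a.size() && i <= k; ++i)
                    c[k] += a[i] * b[k - i];
        }));
    }
    for (auto& t : tasks) t.get();
    return c;
}
```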