ISBN (print): 9781728152684
Near-data accelerators play an important role in satisfying the ever-growing demand for compute resources. However, the efficient integration of near-data computing resources into applications requires a flexible programming model and suitable abstractions at the operating-system level. This paper presents Metal FS, a framework that enables users and applications to orchestrate computations on an NVMe+FPGA near-data computing device through standard shell syntax, including the pipe operator. A user-space NVMe file system interface exposes the storage resources of the NVMe+FPGA accelerator. Computation pipelines expressed on the shell are mapped to pre-defined functional elements of a coarse-grained FPGA overlay, enabling data transformations to be performed in proximity to the data source. Overall, Metal FS greatly increases developer productivity for applications targeting near-data computing accelerators.
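To make the pipe-to-overlay mapping concrete, here is a minimal Python sketch of the idea: each stage of a shell-style pipeline stands in for a pre-synthesized operator on the FPGA overlay. All names are hypothetical; the actual Metal FS streams data through the overlay hardware via its file system interface, not through host-side code.

```python
# Toy model of composing near-data operators the way Metal FS composes
# them with the shell pipe: each "operator" stands in for a pre-defined
# functional element of the FPGA overlay. Names are hypothetical.

def change_case(stream):          # stand-in for one overlay operator
    for chunk in stream:
        yield chunk.upper()

def xor_obfuscate(stream):        # stand-in; the real operator runs on FPGA
    for chunk in stream:
        yield bytes(b ^ 0x5A for b in chunk.encode()).hex()

def pipeline(source, *operators):
    """Mimic `cat file | op1 | op2`: chain each pipe stage onto the next
    pre-defined overlay element."""
    stream = iter(source)
    for op in operators:
        stream = op(stream)
    return stream

# Equivalent in spirit to: cat data.txt | change_case | xor_obfuscate
for out in pipeline(["hello", "near-data"], change_case, xor_obfuscate):
    print(out)
```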
ISBN (print): 9783981926323
Although not a new technique, Processing-in-Memory (PIM) has been revived by the advent of 3D-stacked technologies, which integrate large memories with logic circuitry capable of computing over large amounts of data. PIM increases performance while reducing energy consumption when dealing with large amounts of data. Although several PIM designs are available in the literature, their effective use still burdens the programmer. Moreover, multiple PIM instances are required to take advantage of the internal 3D-stacked memories, which further increases the challenges programmers face. To address this, this work presents the Processing-In-Memory cOmpiler (PRIMO). Our compiler efficiently exploits large vector units on a PIM architecture, directly from the original code. PRIMO automatically selects suitable PIM operations, allowing their automatic offloading. Moreover, PRIMO is aware of the several PIM instances, selecting the most suitable instance while reducing internal communication between different PIM units. The compilation results for different benchmarks show how PRIMO exploits large vectors, achieving near-optimal performance compared to the ideal execution for the case-study PIM. PRIMO achieves a speedup of up to 38x for specific kernels, and 11.8x on average for a set of benchmarks from the PolyBench suite.
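The instance-selection problem PRIMO addresses can be illustrated with a toy cost model: given a vector operation whose operands reside in different PIM units, pick the unit that minimizes internal data movement. The cost model and names below are illustrative assumptions, not PRIMO's actual heuristics.

```python
# Sketch of PIM instance selection: choose the unit (e.g., a 3D-stacked
# vault) on which a vector op moves the fewest bytes between units.
# Element size and vector length are illustrative assumptions.

def bytes_moved(operand_vaults, candidate_vault, elem_bytes, vlen):
    """Bytes that must cross vaults if the op executes on candidate_vault."""
    return sum(elem_bytes * vlen
               for v in operand_vaults if v != candidate_vault)

def select_pim_instance(operand_vaults, num_vaults, elem_bytes=8, vlen=256):
    return min(range(num_vaults),
               key=lambda v: bytes_moved(operand_vaults, v, elem_bytes, vlen))

# a[] and b[] live in vault 2, the result c[] in vault 5: running the
# vector add in vault 2 moves only the result, not both inputs.
print(select_pim_instance([2, 2, 5], num_vaults=8))  # -> 2
```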
ISBN (digital): 9783319751788
ISBN (print): 9783319751788; 9783319751771
The convergence between computing- and data-centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper we show the need to enhance system schedulers so they differentiate between compute- and data-oriented applications, in order to minimise interference between storage and application traffic. This interference can be especially harmful in systems featuring fully distributed storage together with unified interconnects, such as our custom-made architecture ExaNeSt. We analyse several data-aware allocation strategies and find that such strategies are essential to maintain performance in distributed storage systems.
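A rough sketch of what such a data-aware strategy might look like, assuming a simple rack-based hop metric and a job descriptor that flags compute- versus data-oriented work (both assumptions for illustration; the paper evaluates its strategies on the ExaNeSt architecture):

```python
# Toy data-aware allocator: data-intensive jobs are pinned close to the
# node holding their data so storage flows stay local on the unified
# interconnect, while compute-bound jobs are pushed elsewhere.
# Topology, hop metric, and job fields are illustrative assumptions.

def hops(a, b, nodes_per_rack=4):
    """Crude distance: 0 same node, 1 same rack, 2 otherwise."""
    if a == b:
        return 0
    return 1 if a // nodes_per_rack == b // nodes_per_rack else 2

def allocate(job, free_nodes):
    if job["kind"] == "data":
        # place next to the data to keep storage traffic off the spine
        return min(free_nodes, key=lambda n: hops(n, job["data_node"]))
    # compute-bound: keep it away from the storage-heavy rack if possible
    return max(free_nodes, key=lambda n: hops(n, job.get("data_node", 0)))

free = [0, 1, 5, 6, 7]
print(allocate({"kind": "data", "data_node": 4}, free))     # -> 5 (same rack)
print(allocate({"kind": "compute", "data_node": 4}, free))  # -> 0 (far rack)
```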
The convergence between computing- and data-centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper, we investigate alternatives for the storage subsystem of a novel exascale-capable system with special emphasis on how allocation strategies would affect the overall performance. We consider several aspects of data-aware allocation such as the effect of spatial and temporal locality, the affinity of data to storage sources, and the network-level traffic prioritization for different types of flows. In our experimental set-up, temporal locality can have a substantial effect on application runtime (up to a 10% reduction), whereas spatial locality can be even more significant (up to one order of magnitude faster with perfect locality). The use of structured access patterns to the data and the allocation of bandwidth at the network level can also have a significant impact (up to 20% and 17% reduction of runtime, respectively). These results suggest that scheduling policies exposing data-locality information can be essential for the appropriate utilization of future large-scale systems. Finally, we found that the distributed storage system we are implementing can outperform traditional SAN architectures, even with a much smaller (in terms of I/O servers) back-end.
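The two levers the paper measures, locality-aware placement and network-level bandwidth allocation, can be sketched as follows. The weights, fractions, and function names are illustrative assumptions rather than the paper's experimental configuration.

```python
# (1) a locality score used to place a task near its data, weighting
# spatial affinity above temporal reuse, and (2) a static bandwidth
# split that prioritises flows on each link. All numbers illustrative.

def locality_score(node, data_node, recently_used,
                   w_spatial=10.0, w_temporal=1.0):
    spatial = 1.0 if node == data_node else 0.0        # data affinity
    temporal = 1.0 if node in recently_used else 0.0   # warm state
    return w_spatial * spatial + w_temporal * temporal

def place(task_data_node, free_nodes, recently_used):
    return max(free_nodes,
               key=lambda n: locality_score(n, task_data_node, recently_used))

def share_link(link_gbps, storage_fraction=0.3):
    """Reserve a slice of each link for storage vs. application flows."""
    return {"storage": link_gbps * storage_fraction,
            "application": link_gbps * (1 - storage_fraction)}

print(place(3, [1, 2, 3], recently_used={2}))  # -> 3: spatial beats temporal
print(share_link(100))                          # -> {'storage': 30.0, ...}
```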
Traditionally, researchers have attempted to address the memory wall by building a deep memory hierarchy. Another solution is to move computation closer to memory, which is often referred to as processing in memory (PIM). Past PIM solutions tried to move computing logic near memory by integrating DRAM with a logic die using 3D stacking. This helps reduce data-movement energy and increase bandwidth; however, the functionality and design of memory itself remain unchanged. An even more exciting technology is one that dissolves the line that distinguishes memory from computational units. Nearly three-fourths of the silicon in processor and main-memory dies serves simply to store and access data. Harnessing this silicon area by repurposing it to perform computation can lead to massively parallel computational processing. Furthermore, we naturally save the vast amounts of energy spent shuffling data back and forth between computational and storage units, and memory bandwidth becomes a meaningless metric.
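The argument can be made tangible with a toy model: if every memory row can operate on its own contents, a bitwise operation over N rows costs roughly one row-level step instead of N loads through the CPU. The sketch below models rows as wide bitvectors; it is a conceptual illustration, not any specific in-memory-compute design.

```python
# Conceptual sketch of compute-capable memory: rows modeled as wide
# bitvectors (Python ints). A bitwise op over all rows happens "inside
# the array"; the host sees only results, never the operand traffic.

ROW_BITS = 8192  # one DRAM row modeled as an 8192-bit vector

def in_memory_and(rows_a, rows_b):
    # Conceptually all rows compute in parallel inside the array.
    return [a & b for a, b in zip(rows_a, rows_b)]

mask = (1 << ROW_BITS) - 1
rows_a = [mask, 0b1010, 0]
rows_b = [0b1100, 0b0110, mask]
print([bin(r) for r in in_memory_and(rows_a, rows_b)])
```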
ISBN (print): 9781450349529
Data access costs dominate the execution times of most parallel applications, and they are expected to be even more important in the future. To address this, recent research has focused on Near-Data Processing (NDP) as a new paradigm that tries to bring computation to data, instead of bringing data to computation (the norm in conventional computing). This paper explores the potential of compiler support for exploiting NDP in the context of emerging manycore systems. To that end, we propose a novel compiler algorithm that partitions the computations in a given loop nest into subcomputations and schedules the resulting subcomputations on different cores with the goal of reducing the distance-to-data on the on-chip network. An important characteristic of our approach is that it exploits NDP while taking advantage of data locality. Our experiments with 12 multithreaded applications running on a state-of-the-art commercial manycore system indicate that the proposed compiler-based approach significantly reduces data movement on the on-chip network by taking advantage of NDP, and these benefits lead to an average execution-time improvement of 18.4%.
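The flavor of the proposed partitioning-and-scheduling step can be sketched as follows: the iteration space is split into tiles, and each resulting subcomputation is mapped to the mesh core nearest the data it touches. The mesh layout and tile-to-bank mapping are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: partition a loop nest into tile-sized subcomputations, then
# schedule each tile on the core with minimal NoC distance-to-data.

MESH = 4  # 4x4 mesh of cores

def manhattan(c1, c2):
    return abs(c1[0] - c2[0]) + abs(c1[1] - c2[1])

def home_core_of_tile(tile):
    # assume tiles are interleaved across the mesh row-major
    ti, tj = tile
    return (ti % MESH, tj % MESH)

def schedule(tiles):
    """Map each subcomputation (tile) to the core closest to its data."""
    cores = [(x, y) for x in range(MESH) for y in range(MESH)]
    return {t: min(cores, key=lambda c: manhattan(c, home_core_of_tile(t)))
            for t in tiles}

# an 8x8 iteration space partitioned into 2x2-iteration tiles -> 4x4 tiles
tiles = [(i, j) for i in range(4) for j in range(4)]
print(schedule(tiles)[(2, 3)])  # -> (2, 3): run where the data lives
```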
ISBN (print): 9781450347556
Diverse areas of science and engineering are increasingly driven by high-throughput automated data capture and analysis. Modern acquisition technologies, used in many scientific applications (e.g., astronomy, physics, materials science, geology, biology, and engineering) and often running at gigabyte-per-second data rates, quickly generate terabyte- to petabyte-scale datasets that must be stored, shared, processed, and analyzed at similar rates. The largest datasets are often multidimensional, such as volumetric and time-series data derived from various types of image capture. Cost-effective and timely processing of these data requires system and software architectures that incorporate on-the-fly processing to minimize I/O traffic and avoid latency limitations. In this paper we present the Virtual Volume File System, a new approach to on-demand processing with file system semantics, combining these principles into a versatile and powerful data pipeline for dealing with some of the largest 3D volumetric datasets. We give an example of how we have started to use this approach in our work with massive electron microscopy image stacks. We end with a short discussion of current and future challenges.
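A minimal sketch of on-demand processing with file system semantics, assuming a per-chunk pure transform (the class and transform below are hypothetical; the real system exposes derived volumes through an actual file system interface over massive 3D data):

```python
# Sketch: a virtual file whose bytes are computed from a source volume
# only when read, so derived datasets are never materialized on disk.

class VirtualFile:
    def __init__(self, source: bytes, transform):
        self.source = source
        self.transform = transform      # per-chunk, pure function

    def read(self, offset: int, size: int) -> bytes:
        # compute only the requested window -- no full materialization
        window = self.source[offset:offset + size]
        return self.transform(window)

invert = lambda chunk: bytes(255 - b for b in chunk)   # e.g. image negation
vf = VirtualFile(bytes(range(16)), invert)
print(vf.read(4, 4))  # -> b'\xfb\xfa\xf9\xf8' computed on the fly
```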
ISBN (print): 9781467383202
Despite the ability of modern processors to execute a variety of algorithms efficiently through instructions based on registers with ever-increasing widths, some applications perform poorly due to the limited interconnection bandwidth between main memory and processing units. Near-data processing has started to gain acceptance as an acceleration approach due to technology constraints and the high costs associated with data transfer. However, previous approaches to near-data computing either do not provide general-purpose processing or require large amounts of logic and do not fully use the potential of the DRAM devices; these issues have limited its wide adoption. In this paper, we present the Memory Vector Extensions (MVX), which implement vector instructions directly inside the DRAM devices, thereby avoiding data movement between memory and processing units while requiring less logic than previous approaches. MVX obtains up to a 211x increase in performance for application kernels with high spatial locality and low temporal locality. Compared to an embedded processor with 8 cores and 2 memory channels that supports AVX-512 instructions, MVX performs 24x faster on average for three well-known algorithms.
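The core saving MVX targets, operand traffic over the memory channel, can be sketched with a toy traffic-accounting model. The interface and byte counts below are illustrative assumptions, not MVX's actual ISA or measurements.

```python
# Toy model: execute a vector op next to the DRAM instead of streaming
# both operands over the memory channel. The "device" is a plain list;
# the point is the channel-traffic accounting, not the arithmetic.

def host_vadd(a, b):
    # conventional path: both operands cross the channel, plus the result
    traffic = (len(a) + len(b) + len(a)) * 8           # 8-byte elements
    return [x + y for x, y in zip(a, b)], traffic

def mvx_vadd(a, b):
    # near-data path: the add happens inside the device; only one
    # command packet crosses the channel (the result stays in memory)
    traffic = 64
    return [x + y for x, y in zip(a, b)], traffic

a, b = list(range(1024)), list(range(1024))
(_, t_host), (_, t_mvx) = host_vadd(a, b), mvx_vadd(a, b)
print(f"channel bytes: host={t_host}, mvx={t_mvx}")    # movement avoided
```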