ISBN (digital): 9783319751788
ISBN (print): 9783319751788; 9783319751771
The convergence between computing- and data-centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper we show the need to enhance system schedulers to differentiate between compute- and data-oriented applications in order to minimise interference between storage and application traffic. This interference can be especially harmful in systems featuring fully distributed storage together with a unified interconnect, such as our custom-made architecture ExaNeSt. We analyse several data-aware allocation strategies and find that they are essential to maintaining performance in distributed storage systems.
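As a rough illustration of the scheduler differentiation this abstract argues for, the sketch below places data-oriented jobs near their storage targets while placing compute-oriented jobs by free capacity. All names (`Node`, `Job`, `hops_to_storage`, `allocate`) and the distance metric are hypothetical; this is not the ExaNeSt scheduler.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    hops_to_storage: dict   # storage_id -> network distance (assumed metric)
    free_cores: int

@dataclass
class Job:
    job_id: int
    cores: int
    kind: str               # "compute" or "data"
    storage_id: int = -1    # data source, meaningful only for "data" jobs

def allocate(job, nodes):
    """Pick a node for `job`, keeping data-oriented jobs near their data."""
    candidates = [n for n in nodes if n.free_cores >= job.cores]
    if not candidates:
        return None
    if job.kind == "data":
        # Keep storage traffic local: the candidate with the fewest hops
        # to the job's data source minimises cross-traffic interference.
        best = min(candidates, key=lambda n: n.hops_to_storage[job.storage_id])
    else:
        # Compute-oriented jobs only need free cores.
        best = max(candidates, key=lambda n: n.free_cores)
    best.free_cores -= job.cores
    return best

nodes = [Node(0, {0: 1, 1: 3}, 8), Node(1, {0: 3, 1: 1}, 8)]
print(allocate(Job(42, 4, "data", storage_id=1), nodes).node_id)  # -> 1
```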
ISBN (print): 9783981926323
Although not a new technique, Processing-in-Memory (PIM) has been revived by the advent of 3D-stacked technologies, which integrate large memories with logic circuitry capable of computing on large amounts of data. PIM increases performance while reducing energy consumption when dealing with large volumes of data. Although several PIM designs are available in the literature, their effective use still burdens the programmer. Moreover, multiple PIM instances are required to take advantage of the internal 3D-stacked memories, which further increases the challenges programmers face. This work therefore presents the Processing-In-Memory cOmpiler (PRIMO). Our compiler efficiently exploits large vector units on a PIM architecture directly from the original code. PRIMO automatically selects suitable PIM operations, enabling their automatic offloading. Moreover, PRIMO is aware of the several PIM instances, selecting the most suitable instance while reducing internal communication between different PIM units. Compilation results for different benchmarks show how PRIMO exploits large vectors while achieving near-optimal performance compared to the ideal execution on the case-study PIM. PRIMO achieves a speedup of 38x for specific kernels, and 11.8x on average for a set of benchmarks from the PolyBench suite.
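One plausible core of such an instance-selection pass is a greedy placement that assigns each vector operation to the PIM unit already holding most of its operands, so results stay local for later consumers. The sketch below is purely illustrative; `assign_ops`, `operand_home`, and the vault model are assumptions, not PRIMO's published internals.

```python
def assign_ops(ops, operand_home):
    """ops: list of (result_name, [operand names]);
    operand_home: operand name -> PIM instance (vault) currently holding it."""
    placement = {}
    for name, operands in ops:
        # Vote by operand location and pick the majority vault, so most
        # inputs are already local and inter-vault transfers are few.
        votes = {}
        for o in operands:
            v = operand_home[o]
            votes[v] = votes.get(v, 0) + 1
        vault = max(votes, key=votes.get)
        placement[name] = vault
        # The result stays in the chosen vault for later consumers.
        operand_home[name] = vault
    return placement

ops = [("t0", ["a", "b"]), ("t1", ["t0", "c"])]
print(assign_ops(ops, {"a": 0, "b": 0, "c": 1}))  # {'t0': 0, 't1': 0}
```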
ISBN (print): 9781728182544
Modern remote sensing (RS) image application systems often distribute image processing tasks among multiple data centers and then gather the processed images from each center to efficiently synthesize the final product. In this paper, we exploit the edge-cloud architecture to design and implement a novel RS image service system, called RS-pCloud, which leverages the Peer-to-Peer (P2P) model to integrate multiple data centers and their associated edge networks. The data center as cloud platform is responsible for the storage and processing of original RS images, as well as the storage of partial processed images while the edge network is mainly for caching and sharing the processed images. With this design, RS-pCloud not only achieves the load sharing of processing works but also attains the data efficiency among the edges at the same time, which in turn improves the performance of the image processing and reduce the cost of the data transmission as well. RS-pCloud is designed to be used in a transparent way where it receives a query task from the user through a certain cloud platform, split the task into different sub-tasks, according to the location of the data they required, and then distribute the sub-tasks to corresponding clouds for near-data processing, the returned results from each cloud are first cached in specific edge for further sharing and then gathered at the client to synthesize the final product. We implemented and deployed RS-pCloud on three clusters in conjunction with an edge network to show its performance advantages over traditional single-cluster systems.
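A minimal sketch of the dispatch idea described above: split a query into per-cloud sub-tasks by data location, then gather the results. The names (`catalog`, `submit`, tile ids) are hypothetical and do not reflect the RS-pCloud API; edge caching is only noted in comments.

```python
from collections import defaultdict

def split_by_location(tiles, catalog):
    """tiles: requested image tile ids; catalog: tile id -> data center id."""
    subtasks = defaultdict(list)
    for t in tiles:
        subtasks[catalog[t]].append(t)   # near-data: send work to the data
    return dict(subtasks)

def run_query(tiles, catalog, submit):
    """submit(center, tiles) processes tiles at `center` and returns results;
    in the described design they would be cached at that center's edge
    before being gathered here for client-side synthesis."""
    results = []
    for center, ts in split_by_location(tiles, catalog).items():
        results.extend(submit(center, ts))
    return results

catalog = {"t1": "dc-a", "t2": "dc-b", "t3": "dc-a"}
print(split_by_location(["t1", "t2", "t3"], catalog))
# -> {'dc-a': ['t1', 't3'], 'dc-b': ['t2']}
```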
ISBN (print): 9781467383202
Despite the ability of modern processors to execute a variety of algorithms efficiently through instructions based on registers of ever-increasing width, some applications perform poorly due to the limited interconnection bandwidth between main memory and processing units. Near-data processing has started to gain acceptance as an accelerator approach due to the technology constraints and high costs associated with data transfer. However, previous approaches to near-data computing either do not provide general-purpose processing or require large amounts of logic and fail to fully use the potential of the DRAM devices; these issues have limited their wide adoption. In this paper, we present the Memory Vector Extensions (MVX), which implement vector instructions directly inside the DRAM devices, thereby avoiding data movement between memory and processing units while requiring less logic than previous approaches. MVX obtains up to a 211x increase in performance for application kernels with high spatial locality and low temporal locality. Compared to an embedded processor with 8 cores and 2 memory channels that supports AVX-512 instructions, MVX performs 24x faster on average for three well-known algorithms.
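To make the bandwidth argument concrete, here is a back-of-the-envelope model (assumed numbers, not figures from the paper) contrasting the bytes that cross the memory bus for a streaming vector add executed host-side versus in-DRAM.

```python
def bytes_moved(n_elems, elem_bytes=8, in_memory=False):
    """Bytes crossing the memory bus for C = A + B over n_elems elements."""
    if in_memory:
        return 0                      # operands and result stay in DRAM
    return 3 * n_elems * elem_bytes   # A in, B in, C out

n = 1 << 20  # one million elements: high spatial, low temporal locality
print(bytes_moved(n), "bytes on the bus for a host-side vector add")
print(bytes_moved(n, in_memory=True), "bytes for an MVX-style in-DRAM add")
```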
ISBN (print): 9781450380751
This work defines the concept of collective affinity. It is claimed that collective affinity has more potential than single core-centric affinity for data-locality optimization in manycores, because collective affinity captures the potential benefit of transferring computations originally assigned to one core to other cores. Building upon the collective-affinity concept and a cache-content estimation strategy, the work then presents a computation-to-core mapping strategy specifically tuned for exploiting near-data computing by reducing distance-to-data.
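A hypothetical illustration of the mapping idea: given an estimate of which core's cache holds each data block, assign a computation to the core minimising total distance-to-data over all blocks it touches (a collective view), rather than pinning it to its original owner core. The mesh model and names below are assumptions, not the paper's strategy.

```python
def distance(core_a, core_b, mesh_width=4):
    """Manhattan hop count between two cores on a 2D mesh NoC."""
    ax, ay = core_a % mesh_width, core_a // mesh_width
    bx, by = core_b % mesh_width, core_b // mesh_width
    return abs(ax - bx) + abs(ay - by)

def best_core(blocks, cache_home, n_cores=16):
    """blocks: data blocks a computation accesses;
    cache_home: block -> core estimated to cache it (cache-content estimate)."""
    cost = lambda c: sum(distance(c, cache_home[b]) for b in blocks)
    return min(range(n_cores), key=cost)

# A computation originally assigned to core 0 migrates toward the mesh
# region where its data is estimated to be cached.
print(best_core(["b0", "b1"], {"b0": 5, "b1": 6}))  # -> 5
```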
ISBN (print): 9781450385572
Automata processing is an efficient computation model for regular expressions and other forms of sophisticated pattern matching. The demand for high-throughput and real-time pattern matching in many applications, including network intrusion detection and spam filters, has motivated several in-memory architectures for automata processing. Existing in-memory architectures focus on accelerating the pattern-matching kernel, but either fail to support a practical reporting solution or optimistically assume that the reporting stage is not the performance bottleneck. However, gathering and processing the reports can be the major bottleneck, especially when the reporting frequency is high. Moreover, all existing in-memory architectures work at a fixed processing rate (mostly 8 bits/cycle) and do not adjust the input consumption rate to the properties of the application, which can lead to throughput and capacity loss. To address these issues, we present Sunder, an in-SRAM pattern-matching architecture that processes a reconfigurable number of nibbles (4-bit symbols) in parallel instead of at a fixed rate, adopting an algorithm/architecture methodology to perform hardware-aware transformations. Inspired by prior work, we transform the commonly used 8-bit processing into nibble (4-bit) processing to reduce hardware requirements exponentially and achieve higher information density. This frees up space for storing reporting data in place, which largely eliminates host communication and reporting overhead. Our proposed reporting architecture supports in-place report summarization and provides an easy mechanism for reading the reporting data. As a result, Sunder enables a low-overhead, high-performance, and flexible in-memory pattern-matching and reporting solution. Our results confirm that the Sunder reporting architecture has zero performance overhead for 95% of the applications and incurs only 2% additional hardware overhead.
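The byte-to-nibble transformation can be sketched as follows: a state matching a set of bytes becomes two chained states matching high and low nibbles, halving symbol width at the cost of extra states. This structure is inferred from the description above, not from Sunder's actual design, and the simple split shown is exact only when the byte class is a cross product of nibble sets.

```python
def byte_to_nibble_states(byte_class):
    """byte_class: set of accepted byte values for one automaton state.
    Returns (high_nibbles, low_nibbles) pairs representing two chained
    4-bit states; exact only for cross-product byte classes."""
    highs = {b >> 4 for b in byte_class}
    lows = {b & 0xF for b in byte_class}
    return [(highs, lows)]

def nibble_match(pairs, byte):
    """Check a byte against the chained nibble states."""
    hi, lo = byte >> 4, byte & 0xF
    return any(hi in highs and lo in lows for highs, lows in pairs)

pairs = byte_to_nibble_states({0x41, 0x42})          # match 'A' or 'B'
print(nibble_match(pairs, 0x41), nibble_match(pairs, 0x43))  # True False
```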
Dynamic graph traversals (DGTs) are currently widely used in many important application domains, especially in this big-data era that urgently demands high-performance graph processing and analysis. Unlike static graph traversals, DGTs in real-world application scenarios require not only fast traversal acceleration itself but also, more importantly, a runtime strategy that can effectively accommodate the ever-evolving nature of graph-structure updates followed by a diverse range of graph traversal algorithms. Because of these special features, state-of-the-art designs on conventional compute-centric architectures (e.g., CPU and GPU) struggle to provide sufficient acceleration for DGT processing, owing to the dominating irregular memory-access patterns in graph traversal algorithms and inefficient platform-specific update mechanisms. In this article, we explore the algorithmic features and runtime requirements of real-world DGTs and identify their unique opportunities for acceleration on the recent Micron Automata Processor (AP), an in-situ memory-centric pattern-matching architecture. These features include the natural mapping between traversal algorithms' path-exploration patterns and classic non-deterministic finite automata processing, the AP's architectural and compilation support for DGTs' evolving traversal operations, and its inherent hardware fitness. Despite these benefits, however, enabling highly efficient DGT execution on the AP is non-trivial and faces several major challenges. To tackle them, we propose DynamAP, the first AP framework design that enables fast processing for general DGTs. DynamAP is oblivious to periodic traversal-algorithm changes and can address the significant overhead caused by frequent graph updates and AP recompilation through our novel hybrid macro designs and associated efficient updating strategies. We evaluate DynamAP against current DGT designs on a CPU, GPU, and AP with a range of widely adopted DGT algorithms and real-world graphs.
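A conceptual sketch (not DynamAP itself) of the mapping the article exploits: treat each vertex as an NFA state, so one non-deterministic "step" on the input activates every successor of every active state, which is exactly a BFS frontier expansion.

```python
def nfa_step(active, adjacency):
    """One non-deterministic step: activate all successors of active states,
    mirroring how the AP expands a traversal frontier in parallel."""
    nxt = set()
    for v in active:
        nxt.update(adjacency.get(v, ()))
    return nxt

adjacency = {0: [1, 2], 1: [3], 2: [3]}
frontier = {0}
for depth in range(3):
    frontier = nfa_step(frontier, adjacency)
    print(depth + 1, sorted(frontier))   # BFS levels: [1, 2], [3], []
```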
The convergence between computing- and data-centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper, we investigate alternatives for the storage subsystem of a novel exascale-capable system, with special emphasis on how allocation strategies affect overall performance. We consider several aspects of data-aware allocation, such as the effect of spatial and temporal locality, the affinity of data to storage sources, and network-level traffic prioritization for different types of flows. In our experimental set-up, temporal locality can have a substantial effect on application runtime (up to a 10% reduction), whereas spatial locality can be even more significant (up to one order of magnitude faster with perfect locality). The use of structured access patterns to the data and the allocation of bandwidth at the network level can also have a significant impact (up to 20% and 17% runtime reductions, respectively). These results suggest that scheduling policies exposing data-locality information can be essential for the appropriate utilization of future large-scale systems. Finally, we found that the distributed storage system we are implementing can outperform traditional SAN architectures, even with a much smaller back-end (in terms of I/O servers).
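A toy scoring sketch reflecting the factors studied above: rank candidate nodes for a task by spatial locality (replicas held locally) and temporal locality (how recently its blocks were accessed). The weights and data structures are arbitrary assumptions, not the paper's calibration.

```python
import time

def placement_score(node, task, now=None, w_spatial=10.0, w_temporal=1.0):
    """node: {'replicas': set of block ids, 'last_access': block -> timestamp};
    task: {'blocks': blocks it will read}. Higher score = better placement."""
    now = time.time() if now is None else now
    # Spatial locality: blocks already replicated on this node.
    local = sum(1 for b in task["blocks"] if b in node["replicas"])
    # Temporal locality: recently touched blocks are likely still cached.
    recency = sum(
        1.0 / (1.0 + now - node["last_access"].get(b, 0.0))
        for b in task["blocks"]
    )
    return w_spatial * local + w_temporal * recency

node = {"replicas": {"b1"}, "last_access": {"b1": 100.0}}
print(placement_score(node, {"blocks": ["b1", "b2"]}, now=101.0))
```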