Current microprocessors contain SIMD execution units (also called multimedia or vector extensions) that allow the data-parallel execution of operations on several subwords packed in 64-bit or 128-bit registers. They can accelerate not only typical multimedia applications but also many other algorithms based on vector and matrix operations. In this paper, the results of a detailed experimental study of the suitability of such units for the fast simulation of neural networks are presented. It is shown that a speedup in the range of 2.0 to 8.6 compared to sequential implementations can be achieved. A performance counter analysis is provided that explains several effects by features of the processor architecture. (C) 2003 Elsevier Science B.V. All rights reserved.
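The subword parallelism the abstract describes can be illustrated with the classic SWAR ("SIMD within a register") trick in plain Python. This is a hypothetical sketch, not code from the paper: four 16-bit lanes are packed into one 64-bit word and added lane-wise in a handful of integer operations, with carries prevented from crossing lane boundaries.

```python
# SWAR sketch: four 16-bit lanes packed into one 64-bit word.
LANES = 4
BITS = 16
MASK = (1 << BITS) - 1

def pack(values):
    """Pack four 16-bit unsigned values into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]

def simd_add(a, b):
    """Lane-wise add with no carry propagation across lanes:
    add the low 15 bits of each lane, then fix up the top bit."""
    low_mask = 0x7FFF7FFF7FFF7FFF    # low 15 bits of every lane
    high_mask = 0x8000800080008000   # top bit of every lane
    low = (a & low_mask) + (b & low_mask)
    return low ^ ((a ^ b) & high_mask)
```

On real hardware a single SIMD instruction performs all lanes at once, which is the source of the reported speedups; here the effect is only emulated.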
ISBN:
(Print) 9781450341974
Coarse-grained operators such as map and reduce have been widely used for large-scale data processing. While they are easy to master, over-simplified APIs sometimes hinder programmers from exercising fine-grained control over how computation is performed, and hence from designing more efficient algorithms. On the other hand, resorting to domain-specific languages (DSLs) is also not a practical solution, since programmers may need to learn how to use many systems that can be very different from each other, and the use of low-level tools may even result in bug-prone programming. In [7], we proposed Husky, which provides a highly expressive API to solve the above dilemma. It allows developers to program in a variety of patterns, such as MapReduce, GAS, vertex-centric programs, and even asynchronous machine learning. While the Husky C++ engine provides great performance, in this demo proposal we introduce PyHusky and ScHusky, which allow users (e.g., data scientists) without system knowledge and low-level programming skills to leverage the performance of Husky and build high-level applications with ease using Python and Scala.
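The coarse-grained style the abstract contrasts against can be sketched in plain Python; this is a generic map/reduce word count, not the actual PyHusky or ScHusky API.

```python
# Generic coarse-grained word count: a "map" phase producing (word, 1)
# pairs, then a "reduce" phase merging them by key. Illustrative only;
# not the PyHusky/ScHusky API.
from functools import reduce

def word_count(lines):
    # map: each line becomes (word, 1) pairs
    pairs = [(w, 1) for line in lines for w in line.split()]

    # reduce: merge pairs by key into final counts
    def merge(acc, pair):
        word, n = pair
        acc[word] = acc.get(word, 0) + n
        return acc

    return reduce(merge, pairs, {})
```

The point of the abstract is precisely that such operators hide the execution strategy: the programmer cannot express, for example, partial aggregation order or asynchrony within this interface.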
ISBN:
(Print) 9781450394871
Function-as-a-Service (FaaS) serverless computing enables a simple programming model with almost unbounded elasticity. Unfortunately, current FaaS platforms achieve this flexibility at the cost of lower performance for data-intensive applications compared to a serverful deployment. The ability to have computation close to data is a key missing feature. We introduce Palette load balancing, which offers FaaS applications a simple mechanism to express locality to the platform, through hints we term "colors". Palette maintains the serverless nature of the service - users are still not allocating resources - while allowing the platform to place successive invocations related to each other on the same executing node. We compare a prototype of the Palette load balancer to a state-of-the-art locality-oblivious load balancer on representative examples of three applications. For a serverless web application with a local cache, Palette improves the hit ratio by 6x. For a serverless version of Dask, Palette improves run times by 46% and 40% on Task Bench and TPC-H, respectively. On a serverless version of NumS, Palette improves run times by 37%. These improvements largely bridge the gap to a serverful implementation of the same systems.
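The core idea of color hints can be sketched as a load balancer that hashes a color to a node, so related invocations land together. This is a minimal illustration under assumed semantics, not Palette's actual implementation (class and method names here are hypothetical).

```python
import hashlib

# Sketch of a color-aware balancer: invocations carrying the same color
# hint are routed to the same node; colorless invocations fall back to
# round-robin. Hypothetical API, not the Palette prototype.
class ColorBalancer:
    def __init__(self, nodes):
        self.nodes = nodes
        self._rr = 0  # round-robin cursor for colorless requests

    def pick(self, color=None):
        if color is None:
            node = self.nodes[self._rr % len(self.nodes)]
            self._rr += 1
            return node
        # Stable hash: the same color always maps to the same node.
        h = int(hashlib.sha256(color.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]
```

A production balancer would also need to handle node churn (e.g., with consistent hashing) and load skew, which this sketch ignores.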
ISBN:
(Print) 9783030483401; 9783030483395
Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.
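The "learn which jobs share resources well" loop can be sketched as a toy epsilon-greedy bandit over job-group pairings. This is an assumed simplification for illustration, not Hugo's actual algorithm or API.

```python
import random

# Toy epsilon-greedy co-location scheduler (not Hugo itself): it keeps an
# average "sharing score" per (running group, candidate group) pair and
# prefers co-locating pairs that historically shared resources well,
# while occasionally exploring at random.
class CoLocationScheduler:
    def __init__(self, groups, epsilon=0.1, seed=0):
        self.groups = groups
        self.epsilon = epsilon
        self.scores = {}  # (running, candidate) -> (reward_sum, count)
        self.rand = random.Random(seed)

    def choose(self, running_group, candidates):
        if self.rand.random() < self.epsilon:
            return self.rand.choice(candidates)  # explore

        def avg(c):
            total, count = self.scores.get((running_group, c), (0.0, 0))
            return total / count if count else 0.0

        return max(candidates, key=avg)          # exploit best average

    def feedback(self, running_group, chosen, reward):
        total, count = self.scores.get((running_group, chosen), (0.0, 0))
        self.scores[(running_group, chosen)] = (total + reward, count + 1)
```

Here the reward would come from monitored utilization and interference metrics; the offline grouping step that Hugo performs before this online phase is not modeled.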
ISBN:
(Print) 9781728166773
In this paper we investigate the effects of wide vector instructions on modern processor caches. On the one hand, contemporary processors have large, highly associative caches, which greatly benefit applications that can exploit spatial or temporal data locality. On the other hand, vector instructions operate on wide lists of operands, and moving this data through the cache hierarchy can fill it up quickly. We use a selection of mini-apps representative of a range of scientific application classes to investigate the behaviour of caches in two state-of-the-art Arm-based processors, the Marvell ThunderX2 and the Fujitsu A64FX. We compile the applications to target the Arm Scalable Vector Extension (SVE) and we model the caches of these two processors using a newly developed cache simulator. We then vary a number of cache parameters and show how these choices influence application behaviour at a range of SVE widths between 128 and 2048 bits. We observed a correlation between higher cache associativity and lower miss rate. For the first cache level, at higher line sizes an increase in associativity was necessary to decrease the miss rate compared to a cache with the same total size but smaller line size; for the second level, higher associativity did not always result in better performance with long cache lines. As the SVE width was scaled, data was evicted from the cache more quickly, an effect which was more noticeable at smaller line sizes. Larger cache lines also allowed non-contiguous requests to be fulfilled with fewer loads, because each cache line covers more memory space.
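The kind of cache model the abstract describes can be sketched compactly: a set-associative LRU cache that counts misses over a trace of byte addresses. This is a minimal sketch in the spirit of such a simulator, not the paper's actual code.

```python
from collections import OrderedDict

# Minimal set-associative LRU cache model: size, line size, and
# associativity are the parameters varied in studies like this one.
class Cache:
    def __init__(self, size, line_size, ways):
        self.line_size = line_size
        self.sets = size // (line_size * ways)
        self.ways = ways
        # One OrderedDict per set; insertion order tracks recency.
        self.data = [OrderedDict() for _ in range(self.sets)]
        self.misses = 0
        self.accesses = 0

    def access(self, addr):
        """Simulate one byte-address access; return True on hit."""
        self.accesses += 1
        line = addr // self.line_size
        s = self.data[line % self.sets]
        if line in s:
            s.move_to_end(line)    # LRU update on hit
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict least recently used line
        s[line] = True
        return False
```

Feeding such a model the memory trace of an SVE-vectorized kernel at different vector widths is how the line-size and associativity trade-offs above can be measured.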
ISBN:
(Digital) 9781665451550
ISBN:
(Print) 9781665451550
Unlike standard accelerators, the performance of Near-Data Processing (NDP) devices highly depends on the operation of the surrounding system, namely, the Central Processing Unit (CPU) and the memory hierarchy. Therefore, to accurately evaluate the gain provided by such devices, the entire processing system must be considered. Recent proposals redesigned existing architectural simulators to estimate the performance of NDP devices. However, the conclusions that can be drawn from using these frameworks are limited, and they fail to provide full support to simulate these devices (e.g., most simulators do not allow simultaneous operation of the CPU and the NDP device). In this paper, a novel framework (called gem5-ndp) based on the gem5 architectural simulator is proposed, providing full support to the development, validation, and evaluation of novel NDP architectures. To illustrate the process of developing and integrating an NDP device with a processing system using the proposed framework, as well as to demonstrate its viability and benefits, two case studies are also proposed and thoroughly discussed. gem5-ndp significantly improves the performance evaluation confidence of NDP devices, with results showing that classical approaches lead to a deviation of up to 54.9% when compared with results obtained with gem5-ndp.
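Why simultaneous CPU/NDP operation matters for accuracy can be shown with a toy timing model; this is an assumed illustration, not gem5-ndp. Serializing the devices (as a "classical" one-device-at-a-time model effectively does) overestimates total runtime whenever the CPU and the NDP device could have overlapped.

```python
# Toy timing model contrasting overlapped vs. serialized simulation of a
# CPU and an NDP device. Hypothetical, purely for illustration.
def simulate(tasks, overlap=True):
    """tasks: list of (device, duration). Returns total simulated time."""
    if not overlap:
        # Classical model: devices take turns, durations simply add up.
        return sum(duration for _, duration in tasks)
    # Overlapped model: each device has its own timeline; total time is
    # the latest finishing device.
    busy_until = {}
    for device, duration in tasks:
        start = busy_until.get(device, 0)
        busy_until[device] = start + duration
    return max(busy_until.values())
```

Real simulators must of course also model the interactions (coherence, memory contention) between the two timelines, which is where gem5-ndp's deviation figures come from; this sketch only shows the overlap effect.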