Current microprocessors contain SIMD execution units (also called multimedia or vector extensions) that allow the data-parallel execution of operations on several subwords packed in 64-bit or 128-bit registers. They can accelerate not only typical multimedia applications but also many other algorithms based on vector and matrix operations. In this paper, the results of a detailed experimental study of the suitability of such units for the fast simulation of neural networks are presented. It is shown that a speedup in the range of 2.0 to 8.6 compared to sequential implementations can be achieved. A performance counter analysis is provided that explains several effects by features of the processor architecture. (C) 2003 Elsevier Science B.V. All rights reserved.
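The subword parallelism the abstract describes can be illustrated with the classic SWAR ("SIMD within a register") trick in plain Python. This is a hypothetical sketch, not code from the paper: four 16-bit lanes are packed into one 64-bit word and added lane-wise in a handful of integer operations, with carries prevented from crossing lane boundaries.

```python
# SWAR sketch: four 16-bit lanes packed into one 64-bit word.
LANES = 4
BITS = 16
MASK = (1 << BITS) - 1

def pack(values):
    """Pack four 16-bit unsigned values into one 64-bit word."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & MASK) << (i * BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * BITS)) & MASK for i in range(LANES)]

def simd_add(a, b):
    """Lane-wise add with no carry propagation across lanes:
    add the low 15 bits of each lane, then fix up the top bit."""
    low_mask = 0x7FFF7FFF7FFF7FFF    # low 15 bits of every lane
    high_mask = 0x8000800080008000   # top bit of every lane
    low = (a & low_mask) + (b & low_mask)
    return low ^ ((a ^ b) & high_mask)
```

On real hardware a single SIMD instruction performs all lanes at once, which is the source of the reported speedups; here the effect is only emulated.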
ISBN:
(Print) 9781450341974
Coarse-grained operators such as map and reduce have been widely used for large-scale data processing. While they are easy to master, over-simplified APIs sometimes hinder programmers from exercising fine-grained control over how computation is performed, and hence from designing more efficient algorithms. On the other hand, resorting to domain-specific languages (DSLs) is also not a practical solution, since programmers may need to learn how to use many systems that can be very different from each other, and the use of low-level tools may even result in bug-prone programming. In [7], we proposed Husky, which provides a highly expressive API to solve the above dilemma. It allows developers to program in a variety of patterns, such as MapReduce, GAS, vertex-centric programs, and even asynchronous machine learning. While the Husky C++ engine provides great performance, in this demo proposal we introduce PyHusky and ScHusky, which allow users (e.g., data scientists) without system knowledge and low-level programming skills to leverage the performance of Husky and build high-level applications with ease using Python and Scala.
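The coarse-grained style the abstract contrasts against can be sketched in plain Python; this is a generic map/reduce word count, not the actual PyHusky or ScHusky API.

```python
# Generic coarse-grained word count: a "map" phase producing (word, 1)
# pairs, then a "reduce" phase merging them by key. Illustrative only;
# not the PyHusky/ScHusky API.
from functools import reduce

def word_count(lines):
    # map: each line becomes (word, 1) pairs
    pairs = [(w, 1) for line in lines for w in line.split()]

    # reduce: merge pairs by key into final counts
    def merge(acc, pair):
        word, n = pair
        acc[word] = acc.get(word, 0) + n
        return acc

    return reduce(merge, pairs, {})
```

The point of the abstract is precisely that such operators hide the execution strategy: the programmer cannot express, for example, partial aggregation order or asynchrony within this interface.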
ISBN:
(Print) 9781450394871
Function-as-a-Service (FaaS) serverless computing enables a simple programming model with almost unbounded elasticity. Unfortunately, current FaaS platforms achieve this flexibility at the cost of lower performance for data-intensive applications compared to a serverful deployment. The ability to have computation close to data is a key missing feature. We introduce Palette load balancing, which offers FaaS applications a simple mechanism to express locality to the platform, through hints we term "colors". Palette maintains the serverless nature of the service - users are still not allocating resources - while allowing the platform to place successive invocations related to each other on the same executing node. We compare a prototype of the Palette load balancer to a state-of-the-art locality-oblivious load balancer on representative examples of three applications. For a serverless web application with a local cache, Palette improves the hit ratio by 6x. For a serverless version of Dask, Palette improves run times by 46% and 40% on Task Bench and TPC-H, respectively. On a serverless version of NumS, Palette improves run times by 37%. These improvements largely bridge the gap to a serverful implementation of the same systems.
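The core idea of color hints can be sketched as a load balancer that hashes a color to a node, so related invocations land together. This is a minimal illustration under assumed semantics, not Palette's actual implementation (class and method names here are hypothetical).

```python
import hashlib

# Sketch of a color-aware balancer: invocations carrying the same color
# hint are routed to the same node; colorless invocations fall back to
# round-robin. Hypothetical API, not the Palette prototype.
class ColorBalancer:
    def __init__(self, nodes):
        self.nodes = nodes
        self._rr = 0  # round-robin cursor for colorless requests

    def pick(self, color=None):
        if color is None:
            node = self.nodes[self._rr % len(self.nodes)]
            self._rr += 1
            return node
        # Stable hash: the same color always maps to the same node.
        h = int(hashlib.sha256(color.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]
```

A production balancer would also need to handle node churn (e.g., with consistent hashing) and load skew, which this sketch ignores.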
ISBN:
(Print) 9783030483401; 9783030483395
Distributed data processing systems like MapReduce, Spark, and Flink are popular tools for analysis of large datasets with cluster resources. Yet, users often overprovision resources for their data processing jobs, while the resource usage of these jobs also typically fluctuates considerably. Therefore, multiple jobs usually get scheduled onto the same shared resources to increase the resource utilization and throughput of clusters. However, job runtimes and the utilization of shared resources can vary significantly depending on the specific combinations of co-located jobs. This paper presents Hugo, a cluster scheduler that continuously learns how efficiently jobs share resources, considering metrics for the resource utilization and interference among co-located jobs. The scheduler combines offline grouping of jobs with online reinforcement learning to provide a scheduling mechanism that efficiently generalizes from specific monitored job combinations yet also adapts to changes in workloads. Our evaluation of a prototype shows that the approach can reduce the runtimes of exemplary Spark jobs on a YARN cluster by up to 12.5%, while resource utilization is increased and waiting times can be bounded.
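The "learn which jobs share resources well" loop can be sketched as a toy epsilon-greedy bandit over job-group pairings. This is an assumed simplification for illustration, not Hugo's actual algorithm or API.

```python
import random

# Toy epsilon-greedy co-location scheduler (not Hugo itself): it keeps an
# average "sharing score" per (running group, candidate group) pair and
# prefers co-locating pairs that historically shared resources well,
# while occasionally exploring at random.
class CoLocationScheduler:
    def __init__(self, groups, epsilon=0.1, seed=0):
        self.groups = groups
        self.epsilon = epsilon
        self.scores = {}  # (running, candidate) -> (reward_sum, count)
        self.rand = random.Random(seed)

    def choose(self, running_group, candidates):
        if self.rand.random() < self.epsilon:
            return self.rand.choice(candidates)  # explore

        def avg(c):
            total, count = self.scores.get((running_group, c), (0.0, 0))
            return total / count if count else 0.0

        return max(candidates, key=avg)          # exploit best average

    def feedback(self, running_group, chosen, reward):
        total, count = self.scores.get((running_group, chosen), (0.0, 0))
        self.scores[(running_group, chosen)] = (total + reward, count + 1)
```

Here the reward would come from monitored utilization and interference metrics; the offline grouping step that Hugo performs before this online phase is not modeled.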
ISBN:
(Print) 9781728166773
In this paper we investigate the effects of wide vector instructions on modern processor caches. On the one hand, contemporary processors have large, highly associative caches, which greatly benefit applications that can exploit spatial or temporal data locality. On the other hand, vector instructions operate on wide lists of operands, and moving this data through the cache hierarchy can fill it up quickly. We use a selection of mini-apps representative of a range of scientific application classes to investigate the behaviour of caches in two state-of-the-art Arm-based processors, the Marvell ThunderX2 and the Fujitsu A64FX. We compile the applications to target the Arm Scalable Vector Extension (SVE) and we model the caches of these two processors using a newly developed cache simulator. We then vary a number of cache parameters and show how these choices influence application behaviour at a range of SVE widths between 128 and 2048 bits. We observed a correlation between higher cache associativity and lower miss rate. For the first cache level, at higher line sizes an increase in associativity was necessary to decrease the miss rate compared to a cache with the same total size but smaller line size; for the second level, higher associativity did not always result in better performance with long cache lines. As the SVE width was scaled, data was evicted from the cache more quickly, an effect which was more noticeable at smaller line sizes. Larger cache lines also allowed non-contiguous requests to be fulfilled with fewer loads, because each cache line covers more memory space.
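The kind of cache model the abstract describes can be sketched compactly: a set-associative LRU cache that counts misses over a trace of byte addresses. This is a minimal sketch in the spirit of such a simulator, not the paper's actual code.

```python
from collections import OrderedDict

# Minimal set-associative LRU cache model: size, line size, and
# associativity are the parameters varied in studies like this one.
class Cache:
    def __init__(self, size, line_size, ways):
        self.line_size = line_size
        self.sets = size // (line_size * ways)
        self.ways = ways
        # One OrderedDict per set; insertion order tracks recency.
        self.data = [OrderedDict() for _ in range(self.sets)]
        self.misses = 0
        self.accesses = 0

    def access(self, addr):
        """Simulate one byte-address access; return True on hit."""
        self.accesses += 1
        line = addr // self.line_size
        s = self.data[line % self.sets]
        if line in s:
            s.move_to_end(line)    # LRU update on hit
            return True
        self.misses += 1
        if len(s) >= self.ways:
            s.popitem(last=False)  # evict least recently used line
        s[line] = True
        return False
```

Feeding such a model the memory trace of an SVE-vectorized kernel at different vector widths is how the line-size and associativity trade-offs above can be measured.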
ISBN:
(Digital) 9781665451550
ISBN:
(Print) 9781665451550
Unlike standard accelerators, the performance of Near-Data Processing (NDP) devices highly depends on the operation of the surrounding system, namely, the Central Processing Unit (CPU) and the memory hierarchy. Therefore, to accurately evaluate the gain provided by such devices, the entire processing system must be considered. Recent proposals redesigned existing architectural simulators to estimate the performance of NDP devices. However, the conclusions that can be drawn from using these frameworks are limited, and they fail to provide full support to simulate these devices (e.g., most simulators do not allow simultaneous operation of the CPU and the NDP device). In this paper, a novel framework (called gem5-ndp) based on the gem5 architectural simulator is proposed, providing full support to the development, validation, and evaluation of novel NDP architectures. To illustrate the process of developing and integrating an NDP device with a processing system using the proposed framework, as well as to demonstrate its viability and benefits, two case studies are also proposed and thoroughly discussed. gem5-ndp significantly improves the performance evaluation confidence of NDP devices, with results showing that classical approaches lead to a deviation of up to 54.9% when compared with results obtained with gem5-ndp.
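Why simultaneous CPU/NDP operation matters for accuracy can be shown with a toy timing model; this is an assumed illustration, not gem5-ndp. Serializing the devices (as a "classical" one-device-at-a-time model effectively does) overestimates total runtime whenever the CPU and the NDP device could have overlapped.

```python
# Toy timing model contrasting overlapped vs. serialized simulation of a
# CPU and an NDP device. Hypothetical, purely for illustration.
def simulate(tasks, overlap=True):
    """tasks: list of (device, duration). Returns total simulated time."""
    if not overlap:
        # Classical model: devices take turns, durations simply add up.
        return sum(duration for _, duration in tasks)
    # Overlapped model: each device has its own timeline; total time is
    # the latest finishing device.
    busy_until = {}
    for device, duration in tasks:
        start = busy_until.get(device, 0)
        busy_until[device] = start + duration
    return max(busy_until.values())
```

Real simulators must of course also model the interactions (coherence, memory contention) between the two timelines, which is where gem5-ndp's deviation figures come from; this sketch only shows the overlap effect.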