Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers betwee...
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full precision models, while also constraining activations to 8 and 4 bits. Using this method, we demonstrate minimal loss in classifi...
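A rough sketch of what such per-group ternarization can look like is given below; the 0.7·mean(|w|) threshold is a common heuristic from the ternary-weight literature rather than necessarily the paper's exact FGQ rule, and the group size and function names are illustrative assumptions.

import numpy as np

def ternarize_group(w):
    """Map one group of full-precision weights to {-alpha, 0, +alpha}."""
    t = 0.7 * np.mean(np.abs(w))                    # per-group threshold (common heuristic)
    mask = np.abs(w) > t                            # weights large enough to keep
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0   # per-group scaling factor
    return alpha * np.sign(w) * mask

def ternarize(weights, group_size=4):
    """Fine-grained ternarization: an independent scale for each small group of weights."""
    flat = weights.reshape(-1, group_size)          # assumes weights.size is divisible by group_size
    return np.stack([ternarize_group(g) for g in flat]).reshape(weights.shape)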
Reduced precision computation for deep neural networks is one of the key areas addressing the widening 'compute gap' driven by an exponential growth in model size. In recent years, deep learning training has l...
ISBN (print): 9781538678800
In general, one of the complexities of large simulations is related to the usage of the heterogeneous computational resources needed to execute them. The definition of workflows, usually linked to concrete orchestration solutions, has reduced most of that complexity. These solutions are oriented to High Performance Computing (HPC) or deal only with remotely managed services. This paper presents a novel solution for running simulations in a hybrid HPC and Cloud infrastructure, exploiting the performance and power of HPC systems while benefiting from the fast and flexible provisioning of Cloud resources. We provide our vision of typical simulation workflows and of the kind of computational resources that fits best in each phase. In line with this vision, we describe the research done to enable the definition of such workflows by extending the TOSCA standard (originally focused on Cloud solutions), which is used by our orchestrator and other solutions. We propose several extensions (types, relationships, compute properties and job properties), compatible with the standard definition, so that Cloud and HPC tasks can be processed as expected. The paper also shows a use case implemented with the proposed approach, highlighting some of the benefits found so far.
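As an illustration only, a hybrid workflow mixing a Cloud node and an HPC batch job might be described along the following lines; the type name hpc.nodes.BatchJob and the job properties shown are hypothetical, not the paper's actual TOSCA extensions, and the template is written here as a Python dict for brevity where a real TOSCA template would be YAML.

# Hypothetical sketch of an extended TOSCA-style template mixing Cloud and HPC tasks.
hybrid_workflow = {
    "node_templates": {
        "preprocessing_vm": {
            "type": "tosca.nodes.Compute",        # standard TOSCA Cloud compute node
            "capabilities": {
                "host": {"properties": {"num_cpus": 4, "mem_size": "8 GB"}},
            },
        },
        "simulation_job": {
            "type": "hpc.nodes.BatchJob",         # hypothetical HPC extension type
            "properties": {                       # hypothetical "job properties"
                "partition": "compute",
                "nodes": 64,
                "tasks_per_node": 24,
                "walltime": "04:00:00",
                "command": "./run_simulation",
            },
            "requirements": [{"host": "hpc_cluster"}],
        },
    },
}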
ISBN (print): 9781467308052
Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing, among others. There has been a push for HPC machines to be rated not just in Petaflops but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high-latency data communication with large volumes of transfers across nodes, leading to inefficiency in performance and energy consumption. In this work, we show that we can overcome these constraints using a combination of efficient, low-overhead data compression techniques that reduce transfer volumes, together with latency-hiding techniques. Using an optimized single-node graph traversal algorithm [1], our novel cluster optimizations result in over 6.6X performance improvements over state-of-the-art data transfer techniques, and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120-core Intel® Endeavor cluster with Intel® Xeon® processors E5-2670, which matches the second-ranked result in the November 2011 Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations incur only a 1.8X overhead relative to the performance of the optimized single-node implementation and allow for near-linear scaling with the number of nodes. On 1024 nodes of Intel® Xeon® processor X5670-based systems (with lower per-node performance), our algorithm attained 195 GigaTEPS on a large multi-Terabyte graph, demonstrating its high scalability. Our per-node performance is the highest in the top 10 of the November 2011 Graph500 list.
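As a minimal sketch of the kind of low-overhead compression the abstract alludes to (the frontier-bitmap idea and the function names here are illustrative assumptions, not the paper's implementation), the BFS frontier can be packed into one bit per vertex before it is exchanged between nodes, so a dense frontier costs roughly n/8 bytes instead of 4-8 bytes per vertex.

def pack_frontier(frontier_ids, num_vertices):
    """Pack a list of frontier vertex ids into a bytearray bitmap (1 bit per vertex)."""
    bitmap = bytearray((num_vertices + 7) // 8)
    for v in frontier_ids:
        bitmap[v >> 3] |= 1 << (v & 7)
    return bitmap

def unpack_frontier(bitmap):
    """Recover the vertex ids that are set in a received bitmap."""
    return [8 * i + b for i, byte in enumerate(bitmap)
            for b in range(8) if (byte >> b) & 1]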
While HPC systems have considerably improved their raw compute performance and scalability over the last two decades - first with the adoption of compute clusters and then with throughput computing - good communication performance and overall scalability are still very difficult to achieve for irregular, communication-intensive sparse linear algebra and graph applications. Recent advances in chip packaging, such as the Embedded Multi-die Interconnect Bridge (EMIB), novel optical I/O photonics modules that can be integrated directly on chip, and aggressive system designs that eliminate the software stack and natively support a Distributed Global Address Space, can be used to improve overall performance and potentially break the scalability wall of state-of-the-art computing systems.
ISBN (print): 9780769533520
Sparse Matrix-Vector Multiplication (SpMV) is an important computational kernel in scientific applications, and the CSR storage algorithm often performs poorly on modern computer systems. The register-level blocking algorithm, by contrast, can optimize memory-hierarchy accesses, reduce memory access time, and thus improve performance. RAM(h) is a computation model with an h-level memory hierarchy; it indicates that different implementation forms of the same algorithm can have different memory access complexity. In this paper, we analyze the memory access complexity of two implementation forms of SpMV (the CSR storage algorithm and the register-level blocking algorithm) and predict the performance of SpMV by combining the memory access complexity analysis with an analysis of the ratio of data movement to floating-point operations. Performance data for the two forms, together with statistics on L1, L2 and TLB miss counts on a Pentium IV platform, are reported. The model's analytical results match the experimental results well.
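For reference, a small sketch of the two SpMV forms being compared is shown below; the array layouts are the usual textbook CSR and 2x2 BCSR conventions and are not taken from the paper's implementation.

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR format: one indirect access to x per nonzero."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y

def spmv_bcsr_2x2(brow_ptr, bcol_idx, bvals, x):
    """y = A @ x with A stored as dense 2x2 blocks (register-level blocking):
    each block index is amortized over 4 values and entries of x are reused."""
    nb = len(brow_ptr) - 1
    y = [0.0] * (2 * nb)
    for ib in range(nb):
        y0 = y1 = 0.0
        for k in range(brow_ptr[ib], brow_ptr[ib + 1]):
            b = bvals[4 * k:4 * k + 4]            # row-major 2x2 block
            x0, x1 = x[2 * bcol_idx[k]], x[2 * bcol_idx[k] + 1]
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * ib], y[2 * ib + 1] = y0, y1
    return y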
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of...
As the effort to scale up existing quantum hardware proceeds, it becomes necessary to schedule quantum gates in a way that minimizes the number of operations. There are three constraints that have to be satisfied: the...