Most multi-core and some many-core processors implement cache coherency protocols that heavily complicate the design of optimal parallel algorithms. Communication is performed implicitly by cache line transfers betwee...
We propose a novel fine-grained quantization (FGQ) method to ternarize pre-trained full precision models, while also constraining activations to 8 and 4 bits. Using this method, we demonstrate minimal loss in classifi...
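A rough sketch of what such per-group ternarization can look like is given below; the 0.7·mean(|w|) threshold is a common heuristic from the ternary-weight literature rather than necessarily the paper's exact FGQ rule, and the group size and function names are illustrative assumptions.

import numpy as np

def ternarize_group(w):
    """Map one group of full-precision weights to {-alpha, 0, +alpha}."""
    t = 0.7 * np.mean(np.abs(w))                    # per-group threshold (common heuristic)
    mask = np.abs(w) > t                            # weights large enough to keep
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0   # per-group scaling factor
    return alpha * np.sign(w) * mask

def ternarize(weights, group_size=4):
    """Fine-grained ternarization: an independent scale for each small group of weights."""
    flat = weights.reshape(-1, group_size)          # assumes weights.size is divisible by group_size
    return np.stack([ternarize_group(g) for g in flat]).reshape(weights.shape)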
Reduced precision computation for deep neural networks is one of the key areas addressing the widening 'compute gap' driven by an exponential growth in model size. In recent years, deep learning training has l...
ISBN (print): 9781538678800
In general, one of the complexities of large simulations is related to the usage of the heterogeneous computational resources needed to execute them. The definition of workflows, usually linked to concrete orchestration solutions, has reduced most of that complexity. These solutions are oriented to High Performance Computing (HPC) or deal only with remotely managed services. This paper presents a novel solution for running simulations in a hybrid HPC and Cloud infrastructure, exploiting the performance and power of HPC systems while benefiting from the fast and flexible provisioning of Cloud resources. We provide our vision of typical simulation workflows and of the kind of computational resources that fits best in each phase. In line with this vision, we describe the research done to enable the definition of such workflows by extending the TOSCA standard (originally focused on Cloud solutions), which is used by our orchestrator and other solutions. We propose several extensions (types, relationships, compute properties and job properties), compatible with the standard definition, so that Cloud and HPC tasks can be processed as expected. The paper also shows a use case implemented with the proposed approach, highlighting some of the benefits found so far.
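As an illustration only, a hybrid workflow mixing a Cloud node and an HPC batch job might be described along the following lines; the type name hpc.nodes.BatchJob and the job properties shown are hypothetical, not the paper's actual TOSCA extensions, and the template is written here as a Python dict for brevity where a real TOSCA template would be YAML.

# Hypothetical sketch of an extended TOSCA-style template mixing Cloud and HPC tasks.
hybrid_workflow = {
    "node_templates": {
        "preprocessing_vm": {
            "type": "tosca.nodes.Compute",        # standard TOSCA Cloud compute node
            "capabilities": {
                "host": {"properties": {"num_cpus": 4, "mem_size": "8 GB"}},
            },
        },
        "simulation_job": {
            "type": "hpc.nodes.BatchJob",         # hypothetical HPC extension type
            "properties": {                       # hypothetical "job properties"
                "partition": "compute",
                "nodes": 64,
                "tasks_per_node": 24,
                "walltime": "04:00:00",
                "command": "./run_simulation",
            },
            "requirements": [{"host": "hpc_cluster"}],
        },
    },
}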
ISBN (print): 9781467308052
Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing, among others. There has been a push for HPC machines to be rated not just in Petaflops but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures. However, current cluster implementations suffer from high-latency data communication with large volumes of transfers across nodes, leading to inefficiency in performance and energy consumption. In this work, we show that we can overcome these constraints using a combination of efficient, low-overhead data compression techniques that reduce transfer volumes, together with latency-hiding techniques. Using an optimized single-node graph traversal algorithm [1], our novel cluster optimizations result in over 6.6X performance improvements over state-of-the-art data transfer techniques, and almost an order of magnitude in energy savings. Our resulting implementation of the Graph500 benchmark achieves 115 GigaTEPS on a 320-node/5120-core Intel® Endeavor cluster with Intel® Xeon® processors E5-2670, which matches the second-ranked result in the November 2011 Graph500 list [2] with about 5.6X fewer nodes. Our cluster optimizations incur only a 1.8X overhead relative to the performance of the optimized single-node implementation and allow for near-linear scaling with the number of nodes. On 1024 nodes of Intel® Xeon® processor X5670-based systems (with lower per-node performance), our algorithm attained 195 GigaTEPS on a large multi-Terabyte graph, demonstrating its high scalability. Our per-node performance is the highest in the top 10 of the November 2011 Graph500 list.
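As a minimal sketch of the kind of low-overhead compression the abstract alludes to (the frontier-bitmap idea and the function names here are illustrative assumptions, not the paper's implementation), the BFS frontier can be packed into one bit per vertex before it is exchanged between nodes, so a dense frontier costs roughly n/8 bytes instead of 4-8 bytes per vertex.

def pack_frontier(frontier_ids, num_vertices):
    """Pack a list of frontier vertex ids into a bytearray bitmap (1 bit per vertex)."""
    bitmap = bytearray((num_vertices + 7) // 8)
    for v in frontier_ids:
        bitmap[v >> 3] |= 1 << (v & 7)
    return bitmap

def unpack_frontier(bitmap):
    """Recover the vertex ids that are set in a received bitmap."""
    return [8 * i + b for i, byte in enumerate(bitmap)
            for b in range(8) if (byte >> b) & 1]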
While HPC systems have considerably improved their raw compute performance and scalability over the last two decades - first with the adoption of compute clusters and then with throughput computing - good communication performance and overall scalability are still very difficult to achieve for irregular, communication-intensive sparse linear algebra and graph applications. Recent advances in chip packaging, such as the Embedded Multi-die Interconnect Bridge (EMIB), novel optical I/O photonics modules that can be integrated directly on chip, and aggressive system designs that eliminate the software stack and natively support a Distributed Global Address Space, can be used to improve overall performance and potentially break the scalability wall of state-of-the-art computing systems.
ISBN (print): 9780769533520
Sparse Matrix-Vector Multiplication (SpMV) is an important computational kernel in scientific applications, and the CSR storage algorithm often performs poorly on modern computer systems. The register-level blocking algorithm, by contrast, can optimize memory-hierarchy accesses, reduce memory access time, and thus improve performance. RAM(h) is a computation model with an h-level memory hierarchy; it indicates that different implementation forms of the same algorithm can have different memory access complexity. In this paper, we analyze the memory access complexity of two implementation forms of SpMV (the CSR storage algorithm and the register-level blocking algorithm) and predict the performance of SpMV by combining the memory access complexity analysis with an analysis of the ratio of data movement to floating-point operations. Performance data for the two forms, together with statistics on L1, L2 and TLB miss counts on a Pentium IV platform, are reported. The model's analytical results match the experimental results well.
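For reference, a small sketch of the two SpMV forms being compared is shown below; the array layouts are the usual textbook CSR and 2x2 BCSR conventions and are not taken from the paper's implementation.

def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR format: one indirect access to x per nonzero."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[k] * x[col_idx[k]]
        y[i] = s
    return y

def spmv_bcsr_2x2(brow_ptr, bcol_idx, bvals, x):
    """y = A @ x with A stored as dense 2x2 blocks (register-level blocking):
    each block index is amortized over 4 values and entries of x are reused."""
    nb = len(brow_ptr) - 1
    y = [0.0] * (2 * nb)
    for ib in range(nb):
        y0 = y1 = 0.0
        for k in range(brow_ptr[ib], brow_ptr[ib + 1]):
            b = bvals[4 * k:4 * k + 4]            # row-major 2x2 block
            x0, x1 = x[2 * bcol_idx[k]], x[2 * bcol_idx[k] + 1]
            y0 += b[0] * x0 + b[1] * x1
            y1 += b[2] * x0 + b[3] * x1
        y[2 * ib], y[2 * ib + 1] = y0, y1
    return y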
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of...
As the effort to scale up existing quantum hardware proceeds, it becomes necessary to schedule quantum gates in a way that minimizes the number of operations. There are three constraints that have to be satisfied: the...