ISBN (digital): 9781450351140
ISBN (print): 9781450351140
Near-term quantum computers will soon reach sizes that are challenging to simulate directly, even on the most powerful supercomputers. Yet the ability to simulate these early devices using classical computers is crucial for calibration, validation, and benchmarking. To exploit the full potential of systems featuring multi- and many-core processors, we use automatic code generation and optimization of compute kernels, which also enables performance portability. We apply a scheduling algorithm to quantum supremacy circuits in order to reduce the required communication, and simulate a 45-qubit circuit on the Cori II supercomputer using 8,192 nodes and 0.5 petabytes of memory. To our knowledge, this constitutes the largest quantum circuit simulation to date. Our highly tuned kernels, combined with the reduced communication requirements, improve time-to-solution over state-of-the-art simulations by more than an order of magnitude at every scale.
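As a rough illustration of the core operation such simulators spend their time in, the following minimal Python/NumPy sketch applies a single-qubit gate to a state vector. The function and variable names are our own, and the paper's actual kernels are auto-generated, vectorized native code; the sketch only shows the strided access pattern whose communication cost the scheduling algorithm reduces.

import numpy as np

def apply_single_qubit_gate(state, gate, target, n_qubits):
    # Apply a 2x2 unitary `gate` to qubit `target` of an n-qubit state.
    # Illustrative only; real simulators use tuned, vectorized kernels.
    stride = 1 << target                      # distance between paired amplitudes
    for base in range(0, 1 << n_qubits, stride << 1):
        for offset in range(stride):
            i0 = base + offset                # amplitude with target bit = 0
            i1 = i0 + stride                  # amplitude with target bit = 1
            a0, a1 = state[i0], state[i1]
            state[i0] = gate[0, 0] * a0 + gate[0, 1] * a1
            state[i1] = gate[1, 0] * a0 + gate[1, 1] * a1

# Example: Hadamard on qubit 0 of a 3-qubit |000> state.
h = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
psi = np.zeros(8, dtype=complex)
psi[0] = 1.0
apply_single_qubit_gate(psi, h, target=0, n_qubits=3)
print(psi)                                    # 1/sqrt(2) at indices 0 and 1

For qubits whose stride exceeds the locally stored slice of the state vector, the paired amplitudes live on different nodes; that is the communication the paper's gate scheduling minimizes.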
ISBN (print): 9781450350303
Just-in-time (JIT) compilation during program execution and ahead-of-time (AOT) compilation during software installation are alternative techniques used by managed language virtual machines (VMs) to generate optimized native code while simultaneously achieving binary code portability and high execution performance. Profile data collected by JIT compilers at run time can enable profile-guided optimizations (PGO) that customize the generated native code to different program inputs. AOT compilation removes the speed and energy overhead of online profile collection and dynamic compilation, but may not be able to achieve the quality and performance of customized native code. The goal of this work is to investigate and quantify the implications of the AOT compilation model for the quality of the generated native code in current VMs. First, we quantify the quality of native code generated by the two compilation models for a state-of-the-art Java VM (HotSpot). Second, we determine how the amount of profile data collected affects the quality of generated code. Third, we develop a mechanism to determine the accuracy or similarity of different profile data for a given program run, and investigate how the accuracy of profile data affects its ability to effectively guide PGOs. Finally, we categorize the profile data types in our VM and explore the contribution of each category to performance.
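To make the role of profile data concrete, here is a toy Python sketch of one common kind of profile, receiver types observed at a virtual call site, and the PGO decision it can drive. The class and threshold are illustrative assumptions, not HotSpot's actual data structures.

from collections import Counter

class CallSiteProfile:
    # Toy stand-in for the receiver-type profile a JIT collects at a
    # virtual call site; real VMs gather this in interpreted/baseline code.
    def __init__(self):
        self.receiver_types = Counter()

    def record(self, receiver):
        self.receiver_types[type(receiver).__name__] += 1

    def is_monomorphic(self, threshold=0.95):
        # PGO decision: devirtualize/inline if one type dominates.
        total = sum(self.receiver_types.values())
        if total == 0:
            return False
        dominant = self.receiver_types.most_common(1)[0][1]
        return dominant / total >= threshold

profile = CallSiteProfile()
for x in [1] * 99 + ["s"]:          # 99% ints, 1% strings
    profile.record(x)
print(profile.is_monomorphic())     # True -> the int path can be specialized

An AOT compiler must make the same decision without (or with stale) counters, which is exactly the code-quality gap the paper quantifies.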
ISBN (print): 9783319579726; 9783319579719
We present a study of matrix-vector product operations on the Maxwell GPU generation using the PyCUDA Python library. Through this lens, a broad analysis is performed over different memory management schemes. We identify the approaches that yield the highest performance on current GPU generations when using dense matrices. The resulting guidelines are then applied to the implementation of the sparse matrix-vector product, covering structured (DIA) and unstructured (CSR) sparse matrix formats. Our experimental study on different datasets reveals that there is little room for improvement given the current state of the memory hierarchy, and that the upcoming Pascal GPU generation should benefit substantially from our techniques.
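For concreteness, a minimal PyCUDA version of the unstructured (CSR) sparse matrix-vector product could look like the following sketch, with one thread per row; the kernel body and launch configuration are our own illustrative choices, not the paper's tuned implementation.

import numpy as np
import pycuda.autoinit            # noqa: F401  (creates a CUDA context)
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void csr_spmv(const int n_rows,
                         const int *row_ptr, const int *col_idx,
                         const double *values, const double *x,
                         double *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        double acc = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            acc += values[j] * x[col_idx[j]];
        y[row] = acc;
    }
}
""")
csr_spmv = mod.get_function("csr_spmv")

# Tiny example: [[1, 2], [0, 3]] @ [1, 1] = [3, 3]
row_ptr = gpuarray.to_gpu(np.array([0, 2, 3], dtype=np.int32))
col_idx = gpuarray.to_gpu(np.array([0, 1, 1], dtype=np.int32))
values  = gpuarray.to_gpu(np.array([1.0, 2.0, 3.0]))
x       = gpuarray.to_gpu(np.array([1.0, 1.0]))
y       = gpuarray.zeros(2, dtype=np.float64)
csr_spmv(np.int32(2), row_ptr.gpudata, col_idx.gpudata, values.gpudata,
         x.gpudata, y.gpudata, block=(32, 1, 1), grid=(1, 1))
print(y.get())                    # [3. 3.]

One thread per row is the simplest CSR scheme; its irregular memory-access behavior is what makes the choice of memory management scheme matter in studies like this one.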
Cloud storage services such as Dropbox have been widely used for file collaboration among multiple users. However, this desirable functionality is still restricted to the 'walled garden' of each service. At pres...
Snapper (Lutjanus sp.) is an economically important fish for local fishermen in the Banyuasin coastal waters of South Sumatra. However, the current and historical stock of this species is still unknown. This study aimed to estimate the stock status of Lutjanus sp. in the Banyuasin coastal waters. Annual catch and effort data from 2008 to 2016 were analyzed. Different surplus production models were tested to obtain the best-fitted model based on the sign suitability test, model performance test, and multiple-criteria analysis. The results indicated that the best-fitted model for Lutjanus sp. was the Fox model. The model had the best values for the coefficient of determination (R2 = 97.2%), Nash-Sutcliffe Efficiency (-0.277), Mean Absolute Deviation (29.198), Mean Square Error (1,190.522), Root Mean Square Error (34.504), and RMSE-observations Standard Deviation Ratio (1.13), whereas the Mean Absolute Percentage Error (0.05) was the second best. The optimum effort (Eopt), maximum sustainable catch (CMSY), and total allowable catch were 22.236 trips/year, 623 tons, and 498 tons/year, respectively. Plotting the effort and exploitation levels (141%; 102%) in 2016 indicated a depleting stock under high fishing pressure, which could lead to overfishing in the future.
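To illustrate the Fox-model estimation behind such numbers, the sketch below (Python/NumPy) performs the standard linearized fit ln(CPUE) = a + b*E and derives Eopt = -1/b and CMSY = (-1/b)*exp(a - 1); the catch and effort values are invented for illustration and are not the paper's 2008-2016 data.

import numpy as np

# Invented example data (NOT the paper's records).
effort = np.array([8000., 10000., 12000., 15000., 18000.])  # trips/year
catch_ = np.array([400., 470., 520., 560., 570.])            # tons/year
cpue = catch_ / effort                                       # catch per unit effort

# Fox model: ln(U) = a + b*E, fitted by ordinary least squares.
b, a = np.polyfit(effort, np.log(cpue), 1)
e_opt = -1.0 / b                       # effort that maximizes sustainable yield
c_msy = e_opt * np.exp(a - 1.0)        # maximum sustainable yield
print(f"E_opt ~ {e_opt:,.0f} trips/year, C_MSY ~ {c_msy:,.0f} tons/year")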
ISBN (print): 9783319586670
This study compares the performance of high-order discontinuous Galerkin finite elements on modern hardware. The main computational kernel is the matrix-free evaluation of differential operators by sum factorization, exemplified on the symmetric interior penalty discretization of the Laplacian as a proxy for a complex application code in fluid dynamics. State-of-the-art implementations of these kernels stress both arithmetic and memory transfer. The implementations of SIMD vectorization and shared-memory parallelization are detailed. Computational results are presented for dual-socket Intel Haswell CPUs with 28 cores, a 64-core Intel Knights Landing, and a 16-core IBM Power8 processor. Up to polynomial degree six, Knights Landing is approximately twice as fast as Haswell. Power8 performs similarly to Haswell, compensating for narrower SIMD units with a higher clock frequency. The performance comparison shows that simple ways of expressing parallelism through for loops perform better at medium and high core counts than a more elaborate task-based parallelization with dynamic scheduling according to dependency graphs, despite the latter algorithm's lower memory transfer.
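A minimal sketch of the sum-factorization idea, for a 2D tensor-product element: the operator is applied as two small 1D matrix products instead of one assembled matrix, cutting the work from O(n^4) to O(n^3) per element for n = k + 1 points per direction. The NumPy code below is illustrative; the paper's kernels do this in C++ with explicit SIMD over several elements at once.

import numpy as np

k = 6                               # polynomial degree
n = k + 1
S = np.random.rand(n, n)            # 1D shape-value (or derivative) matrix
u = np.random.rand(n, n)            # coefficients of one 2D element

# Sum factorization: two n^3 matrix products ...
v_fast = S @ u @ S.T                # apply S along dim 0, then dim 1

# ... versus one n^4 product with the assembled operator kron(S, S).
v_full = (np.kron(S, S) @ u.ravel()).reshape(n, n)
assert np.allclose(v_fast, v_full)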
ISBN (digital): 9783319723082
ISBN (print): 9783319723082; 9783319723075
The GNU Multi-Precision library is a widely used, safety-critical library for arbitrary-precision arithmetic. Its source code is written in C and assembly, and includes intricate state-of-the-art algorithms for the sake of high performance. Formally verifying the functional behavior of such highly optimized code, not designed with verification in mind, is challenging. We present a fully verified library designed using the Why3 program verifier. The use of a dedicated memory model makes it possible for the Why3 code to be very similar to the original GMP code. This library is extracted to C and is compatible and performance-competitive with GMP.
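As an illustration, in Python rather than WhyML, of the kind of specification such a verification establishes, the sketch below implements GMP-style limb addition and checks the natural postcondition relating result, carry, and operands; the function is modeled on GMP's mpn_add_n but is our own toy code.

B = 1 << 64                          # limb base: 64-bit little-endian limbs

def mpn_add_n(x, y):
    # Toy model of GMP's n-limb addition; returns (result limbs, carry).
    assert len(x) == len(y)
    r, carry = [], 0
    for xi, yi in zip(x, y):
        s = xi + yi + carry
        r.append(s % B)
        carry = s // B
    return r, carry

def value(limbs):
    # The integer a limb array denotes.
    return sum(l * B**i for i, l in enumerate(limbs))

x = [B - 1, B - 1]                   # 2^128 - 1
y = [1, 0]
r, c = mpn_add_n(x, y)
# Postcondition of the form proved (for the real code) in Why3:
assert value(r) + c * B**len(x) == value(x) + value(y)
print(r, c)                          # [0, 0] 1

Why3 proves such a postcondition once and for all over a memory model close to C; the Python assert merely spot-checks one input.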
ISBN (print): 9781538625880
Emerging non-volatile main memory (NVMM) technologies can provide both data persistence and high performance at the memory level. File systems designed for NVMM have to handle data durability between the CPU cache and NVMM. However, most NVMM-aware file systems cannot meet the strong data consistency requirements of applications built on data structures such as B-Trees. Traditional techniques for delivering data consistency, such as copy-on-write and journaling, suffer from write amplification and data copying, respectively. In this paper, we explore SNFS, a log-structured file system for non-volatile main memory with optimized data consistency, providing high performance for applications with small writes. Specifically, SNFS adopts a small data-log mechanism to journal fine-grained data writes. It also uses in-place writes to minimize the memory footprint of small data updates, and accelerates data block lookup with a hashing strategy. Finally, we evaluate SNFS's performance with several write-intensive workloads; experimental results show that SNFS improves system throughput by up to 23 times compared to state-of-the-art file systems and reduces execution time by up to 65.5%.
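To make the small data-log idea concrete, here is a toy Python sketch, entirely our own and not SNFS's actual on-NVMM layout: fine-grained writes are appended as self-describing records and replayed in order, avoiding whole-block copy-on-write for small updates.

import struct

# Hypothetical record header: file offset (u64) + payload length (u32).
RECORD_HDR = struct.Struct("<QI")

def append_small_write(log, offset, payload):
    # Journal a fine-grained write instead of copying a whole block.
    return log + RECORD_HDR.pack(offset, len(payload)) + payload

def replay(log, data):
    # Apply journaled writes in log order to reconstruct consistent data.
    pos = 0
    while pos < len(log):
        offset, length = RECORD_HDR.unpack_from(log, pos)
        pos += RECORD_HDR.size
        data[offset:offset + length] = log[pos:pos + length]
        pos += length
    return data

log = b""
log = append_small_write(log, 0, b"hello")
log = append_small_write(log, 5, b" nvmm")
print(replay(log, bytearray(10)))     # bytearray(b'hello nvmm')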
ISBN (print): 9781509038435
Wireless communications is one of the fastest growing technology fields, driving numerous other innovations in electronics. One challenging research area within the wireless field is achieving higher transmission speeds. Today it is an open question how we can realize a wireless system at a speed of 100 Gb/s or beyond. If we intend to use such systems in a mobile environment, we can only afford to spend approximately 1-10 pJ/bit for the end-to-end communication, including all processing and protocol steps. A special priority project within the German research community was set up to investigate new paradigms for achieving the 100 Gb/s wireless transmission goal. Within 11 coordinated projects, researchers from all over Germany are looking at several relevant issues, ranging from antennas and the RF front-end, through baseband processing and error correction, to protocol processing. One of the big challenges is to find the correct balance between analog and digital signal processing to achieve extremely high performance at very low energy consumption. Another challenge is to find a good balance between bandwidth and bandwidth efficiency to achieve the 100 Gb/s goal. Finally, protocol processing will need new approaches to decouple the central processor of a computer from high-end input/output operations. Here we report on work in progress and initial results of selected projects. One interesting finding is that FEC at speeds of up to 120 Gb/s can be realized in a very energy-efficient way with small area and power consumption.
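Reading the stated budget as picojoules per bit (energy per bit is what an end-to-end power budget divides into), a quick arithmetic check shows what it implies at the target line rate:

# At 100 Gb/s, an end-to-end budget of 1-10 pJ/bit caps total power
# at 0.1-1 W for all processing and protocol steps combined.
rate = 100e9                              # bits per second
for energy_per_bit in (1e-12, 10e-12):    # joules per bit
    print(f"{energy_per_bit * 1e12:.0f} pJ/bit -> {rate * energy_per_bit:.1f} W")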
There are numerous domains of science that have been using high-performance computing (HPC) systems for decades. Historically, when new HPC resources are introduced, specific variations may require researchers to make...