In a general-purpose computing system, several parallel applications run simultaneously on the same platform. Even if each application is highly tuned for that specific platform, additional performance issues arise in such a dynamic environment, in which multiple applications compete for resources. Different scheduling and resource management techniques have been proposed, at either the operating system or the user level, to improve the performance of concurrent workloads. In this paper, we propose a task-based strategy called "Steal Locally, Share Globally", implemented in the runtime of our parallel programming model GPRM (Glasgow Parallel Reduction Machine). We have chosen a state-of-the-art manycore parallel machine, the Intel Xeon Phi, to compare GPRM with some well-known parallel programming models, OpenMP, Intel Cilk Plus and Intel TBB, in both single-programming and multiprogramming scenarios. We show that GPRM not only performs well for single workloads, but also outperforms the other models for multiprogramming workloads. There are three considerations regarding our task-based scheme: (i) it is implemented inside the parallel framework, not as a separate layer; (ii) it improves performance without the need to change the number of threads for each application; and (iii) it can be further tuned and improved, not only for GPRM applications but also for other, equivalent parallel programming models.
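The "Steal Locally, Share Globally" policy can be pictured as a two-level task pool: each worker first drains its own deque, then steals from a sibling ("locally"), and only then pulls from a shared pool ("globally"). The sketch below is a hypothetical serial simulation of that policy, not GPRM's actual runtime code; the `Worker` class and `run` function are illustrative names.

```python
import collections

class Worker:
    """One worker with a private deque of tasks (an illustrative model;
    GPRM's real runtime is not reproduced here)."""
    def __init__(self):
        self.deque = collections.deque()

def run(workers, global_pool):
    """Drive all tasks to completion: each worker runs its own work first,
    steals from a sibling's deque next ('steal locally'), and only then
    takes work from the shared pool ('share globally')."""
    done = []
    while True:
        progressed = False
        for w in workers:
            if w.deque:                      # own work first
                done.append(w.deque.popleft()())
                progressed = True
            else:                            # steal locally from a sibling
                victim = next((v for v in workers
                               if v is not w and v.deque), None)
                if victim is not None:
                    done.append(victim.deque.pop()())
                    progressed = True
                elif global_pool:            # fall back to the global pool
                    done.append(global_pool.popleft()())
                    progressed = True
        if not progressed:
            return done
```

A real implementation would run the workers on separate threads with lock-free deques; the serial loop above only shows the order in which the three task sources are consulted.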
Ever since discussions about a possible quantum computer arose, quantum simulation has been at the forefront of prospective applications, being a task that promises quantum advantage. Recent advances have made it feasible to simulate complex molecules using Variational Quantum Eigensolvers or to study the dynamics of many-body spin Hamiltonians. These simulations have the potential to yield valuable outcomes through the application of error mitigation techniques. Simulating smaller models is also of great importance and, in the current Noisy Intermediate Scale Quantum era, is more feasible since it is less prone to errors. The objective of this work is to examine the theoretical background and the circuit implementation of a quantum tunneling simulation, with an emphasis on hardware considerations. This study presents the theoretical background required for such an implementation and highlights the main stages of its development. Building on classic approaches to quantum tunneling simulation, it aims to improve the results of such simulations by employing two error mitigation techniques, Zero Noise Extrapolation and Readout Error Mitigation, in conjunction with multiprogramming of the quantum chip, a technique used to address the hardware under-utilization problem that arises in such contexts. With a focus on hardware run considerations for superconducting architectures, various circuit implementation alternatives are clarified. The role of the compiler, the need for hardware-aware design, the different error mitigation techniques and multiprogramming are discussed, yielding a final workflow tailored to the Noisy Intermediate Scale Quantum era.
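The two mitigation techniques named in the abstract can each be sketched in a few lines of classical post-processing. Zero Noise Extrapolation fits the expectation values measured at several amplified noise levels and extrapolates to the zero-noise limit; Readout Error Mitigation applies the inverse of a measured confusion matrix to the raw outcome probabilities. This is a minimal sketch assuming the noise-scaled expectation values and the confusion matrix have already been measured on hardware; it is not the paper's implementation.

```python
import numpy as np

def zero_noise_extrapolate(scale_factors, expectations, degree=1):
    """Fit a polynomial of the given degree to (noise scale, expectation)
    pairs and evaluate it at scale 0, the zero-noise limit."""
    coeffs = np.polyfit(scale_factors, expectations, degree)
    return float(np.polyval(coeffs, 0.0))

def mitigate_readout(noisy_probs, confusion):
    """Undo readout errors: solve M * p_ideal = p_noisy for the measured
    confusion matrix M, then clip and renormalize to a valid distribution."""
    p = np.linalg.solve(confusion, noisy_probs)
    p = np.clip(p, 0.0, None)
    return p / p.sum()
```

For example, expectation values 0.9, 0.8, 0.7 measured at noise scales 1, 2, 3 extrapolate linearly to 1.0 at scale 0; in practice one would choose the fit degree against the observed noise model.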
Noisy intermediate-scale quantum (NISQ) computers offered by quantum cloud providers are widely used for quantum computing (QC). Among them, superconducting quantum computers, with their high scalability and mature processing technology based on traditional silicon chips, have become the preferred platform for most commercial companies and research institutions developing QC. However, superconducting quantum computers suffer from fluctuations due to noisy environments. To maintain reliability across executions, calibration of the quantum processor is critically important. During the long procedure of calibrating physical quantum bits (qubits), quantum processors must be taken offline. In this work, we propose a real-time calibration framework (RCF) that executes quantum program tasks and calibrates in-demand qubits simultaneously, without interrupting the quantum processor. Across a widely used NISQ evaluation benchmark suite, QASMBench, RCF achieves up to 18% reliability improvement for applications. For reliability on different physical qubits, RCF achieves an average gain of 15.7% (up to 36.7%). For cloud quantum machines, throughput can be improved to up to 9.5 tasks per minute (6.5 on average), based on the baseline calibration time. In conclusion, RCF offers a reliable solution for large-scale, long-serving quantum machines.
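The core idea, running calibration on qubits that pending program tasks do not occupy so the processor never goes fully offline, can be illustrated with a greedy allocator. Everything below (the job tuples, the calibration queue, the `schedule` function) is a hypothetical toy model for illustration, not the framework's actual interface.

```python
def schedule(num_qubits, jobs, calib_queue):
    """Greedily place each (job_id, width) job on free physical qubits,
    then let calibration run on queued qubits no placed job is using,
    so execution and calibration proceed side by side."""
    busy = set()
    placed = []
    for job_id, width in jobs:
        free = [q for q in range(num_qubits) if q not in busy]
        if len(free) >= width:
            alloc = free[:width]
            busy.update(alloc)
            placed.append((job_id, alloc))
    calibrating = [q for q in calib_queue if q not in busy]
    return placed, calibrating
```

A production framework would additionally weigh qubit quality, connectivity, and calibration urgency when choosing which qubits to hand to jobs versus calibration.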
In the noisy intermediate-scale quantum (NISQ) era, the idea of quantum multiprogramming, running multiple quantum circuits (QCs) simultaneously on the same hardware, helps to improve the throughput of quantum computation. However, the crosstalk, unwanted interference between qubits on NISQ processors, may cause performance degradation when using multiprogramming. To address this challenge, we introduce palloq (parallel allocation of QCs), a novel compilation protocol. Palloq improves the performance of quantum multiprogramming on NISQ processors, while paying attention to 1) the combination of QCs chosen for parallel execution and 2) the assignment of program qubit variables to physical qubits, to reduce unwanted interference among the active set of QCs. We also propose a software-based crosstalk detection protocol using a new combination of randomized benchmarking methods. Our method successfully characterizes the suitability of hardware for multiprogramming with relatively low detection costs. We found a tradeoff between the success rate and execution time of the multiprogramming. Our results will be of value when device throughput becomes a significant bottleneck. Until service providers have enough quantum processors available to more than meet demand, this approach will be attractive to the service providers and users who want to optimize job management and throughput of the processor.
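Palloq's second concern, assigning concurrent circuits to physical-qubit regions so that crosstalk among the active regions is minimized, can be illustrated with a brute-force allocator over a pre-characterized crosstalk matrix. This is a toy stand-in for the paper's protocol: the region granularity, the symmetric crosstalk matrix, and the exhaustive search are all simplifying assumptions.

```python
from itertools import permutations

def allocate(circuits, regions, crosstalk):
    """Try every assignment of circuits to distinct regions and keep the
    one whose active regions have the least total pairwise crosstalk
    (feasible only for small region counts; shown for clarity)."""
    best, best_cost = None, float("inf")
    for perm in permutations(regions, len(circuits)):
        active = list(perm)
        cost = sum(crosstalk[a][b]
                   for i, a in enumerate(active)
                   for b in active[i + 1:])
        if cost < best_cost:
            best, best_cost = dict(zip(circuits, perm)), cost
    return best, best_cost
```

On real devices the crosstalk entries would come from a characterization pass such as the randomized-benchmarking-based detection the abstract describes, and a heuristic search would replace the exhaustive one.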
The working set model for program behavior was invented in 1965. It has stood the test of time in virtual memory management for over 50 years. It is considered the ideal for managing memory in operating systems and caches. Its superior performance was based on the principle of locality, which was discovered at the same time; locality is the observed tendency of programs to use distinct subsets of their pages over extended periods of time. This tutorial traces the development of working set theory from its origins to the present day. We will discuss the principle of locality and its experimental verification. We will show why working set memory management resists thrashing and generates near-optimal system throughput. We will present the powerful, linear-time algorithms for computing working set statistics and applying them to the design of memory systems. We will debunk several myths about locality and the performance of memory systems. We will conclude with a discussion of the application of the working set model in parallel systems, modern shared CPU caches, network edge caches, and inventory and logistics management.
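The working set W(t, τ) is the set of distinct pages referenced in the window of the last τ references ending at time t, and its size can be tracked in a single linear pass by noting when a page's last reference drops out of the window. A minimal sketch of that idea (the tutorial's own algorithms are more general):

```python
def working_set_sizes(trace, tau):
    """Return |W(t, tau)| for each time t in one pass over the reference
    trace, using each page's last-reference time: a page leaves the
    working set exactly when its last reference falls out of the window."""
    last_ref = {}
    size = 0
    sizes = []
    for t, page in enumerate(trace):
        old_t = t - tau
        if old_t >= 0:
            old_page = trace[old_t]
            # the page expires only if it was not referenced again since
            if last_ref.get(old_page) == old_t:
                size -= 1
                del last_ref[old_page]
        if page not in last_ref:
            size += 1
        last_ref[page] = t
        sizes.append(size)
    return sizes
```

For the trace a, a, b, c, a with τ = 3, the sizes are 1, 1, 2, 3, 3: the second reference to `a` does not grow the set, and at the final step the window holds b, c, a.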
The exact response time analysis for fixed priority scheduling (FPS) in lowest-priority-first-based feasibility tests is commonly required as part of system design tools. This letter proposes an efficient method for this, which we name the incremental lower bound (ILB) calculation method. Compared to the best previously known algorithm, the incremental calculation method, ILB reduces the feasibility-test iterations and run times by more than 38% and 20%, respectively, regardless of utilization and the number of tasks in the task sets.
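The exact analysis this letter accelerates is the classic fixed-point recurrence R_i = C_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ · C_j. The sketch below implements the plain iteration (the baseline, not the proposed ILB refinement, whose details are in the letter), assuming implicit deadlines equal to periods.

```python
import math

def response_time(tasks, i):
    """Worst-case response time of task i under fixed-priority scheduling.
    tasks is a list of (C, T) pairs sorted from highest to lowest priority;
    returns None if the response time exceeds the period (infeasible)."""
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        # interference from all higher-priority tasks released within r
        interference = sum(math.ceil(r / t_j) * c_j
                           for c_j, t_j in tasks[:i])
        nxt = c_i + interference
        if nxt > t_i:       # deadline (= period) missed
            return None
        if nxt == r:        # fixed point reached
            return r
        r = nxt
```

For the task set (C, T) = (1, 4), (2, 6), (3, 12), the iteration converges to response times 1, 3 and 10, so every task meets its deadline.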
In the current approach to automotive electronic system design, multicore processors have prevailed as the way to achieve high computing performance at low thermal dissipation. Multicore processors offer functional parallelism that helps in meeting the safety-critical requirements of vehicles. The number of Electronic Control Units (ECUs) in high-end cars can be reduced by consolidating more functions into a multicore ECU. The AUTOSAR stack has been designed to support applications developed for multicore ECUs. The real challenge lies in adopting new design methods while developing sophisticated applications under multicore constraints. It is imperative to make the most of the multicore computational capability to enhance the overall performance of ECUs. In this context, the scheduling of real-time multitasking software components by the operating system is one of the challenging issues to be addressed. Here, the state-of-the-art scheduling algorithm is reviewed and its merits and limitations are identified. A hybrid scheduler is proposed, tested and compared with the state-of-the-art algorithm, and offers better performance in terms of CPU utilization, average response time and deadline miss rate in both normal and high-load conditions.
The growing need for extracting information from large graphs has been pushing the development of parallel graph algorithms. However, the highly irregular structure of real-world graphs limits the performance and energy improvements of graph applications. In this paper, we show that, in most cases, using all the available cores of the multiprocessor is not the best option in terms of the aforementioned non-functional requirements. Based on that, we propose GraphKat, a framework that enables the simultaneous processing of several algorithms/graphs instead of executing them serially (i.e., one after another), increasing efficiency in terms of performance and energy. GraphKat works in two steps: (i) it characterizes the graph applications with a specific number of threads based on their efficiency levels; and (ii) it defines the execution order of all graph applications on the target system. Experimental results on three multicore processors (Intel and AMD) show that GraphKat improves the overall system efficiency in terms of performance (up to 434.26×) and energy saving (up to 245.21×), and reduces the graph applications' execution time (up to 17.70×) and energy consumption (up to 6.64×) compared to the default execution of parallel applications on HPC systems.
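The characterization step, picking a per-application thread count so that the remaining cores can host another graph application concurrently, can be illustrated with a parallel-efficiency cutoff. The profiles dictionary, the threshold value and the `efficient_threads` function are illustrative assumptions, a simplified model of GraphKat's heuristic rather than the paper's exact method.

```python
def efficient_threads(profiles, threshold=0.7):
    """For each application, choose the largest thread count whose
    parallel efficiency (speedup / threads) still meets the threshold,
    leaving the remaining cores free for a co-scheduled application."""
    choice = {}
    for app, speedups in profiles.items():   # speedups: {threads: speedup}
        ok = [t for t, s in speedups.items() if s / t >= threshold]
        choice[app] = max(ok) if ok else 1   # fall back to a single thread
    return choice
```

For instance, an application whose speedup flattens at 2.4× on 4 cores would be capped at 2 threads under a 0.7 efficiency threshold, freeing the other cores for a better-scaling co-runner.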
In the present-day scenario, cloud computing is an attractive subject for IT and non-IT personnel. It is a service-oriented, pay-per-use computational model. Cloud has working models with service-oriented delivery mecha...
详细信息