Details
ISBN:
(Print) 9783031396977; 9783031396984
Looking closely at the Top500 list of high-performance computers (HPC) in the world, it becomes clear that computing power is not the only number that has been growing in the last three decades. The amount of power required to operate such massive computing machines has been steadily increasing, earning HPC users a higher than usual carbon footprint. While the problem is well known in academia, the exact energy requirements of hardware and software, and how to optimize them, are hard to quantify. To tackle this issue, we need tools to understand the software and its relationship with power consumption in today's high-performance computers. With that in mind, we present perun, a Python package and command line interface to measure energy consumption based on hardware performance counters and selected physical measurement sensors. This enables accurate energy measurements on various scales of computing, from a single laptop to an MPI-distributed HPC application. We include an analysis of the discrepancies between these sensor readings and hardware performance counters, with particular focus on the power draw of usually overlooked non-compute components such as memory. One of our major insights is their significant share of the total energy consumption. We have equally analyzed the runtime and energy overhead perun generates when monitoring common HPC applications, and found it to be minimal. Finally, an analysis of the accuracy of different measuring methodologies when applied at large scales is presented.
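As an illustration of the counter-based measurement principle described above, the following is a minimal Python sketch that samples an Intel RAPL energy counter around a workload on a Linux host. It is not perun's API; the file paths assume a system exposing /sys/class/powercap, and the helper names are ours.

# Minimal sketch: sample an Intel RAPL energy counter around a workload.
# Assumes a Linux host exposing /sys/class/powercap/intel-rapl:0/energy_uj;
# this only illustrates the counter-based principle, not perun's own API.
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"
MAX_RANGE = "/sys/class/powercap/intel-rapl:0/max_energy_range_uj"

def read_uj(path):
    with open(path) as f:
        return int(f.read())

def measure(workload):
    start = read_uj(RAPL)
    t0 = time.time()
    workload()
    elapsed = time.time() - t0
    end = read_uj(RAPL)
    if end < start:                      # counter wrapped around
        end += read_uj(MAX_RANGE)
    joules = (end - start) / 1e6
    return joules, joules / elapsed      # energy [J], mean power [W]

if __name__ == "__main__":
    energy, power = measure(lambda: sum(i * i for i in range(10_000_000)))
    print(f"energy = {energy:.2f} J, mean power = {power:.1f} W")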
Details
ISBN:
(Print) 9783031396977; 9783031396984
Cholesky factorization is a method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is mathematical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example where this is the case, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where mathematical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our performance measurements show performance that is on par with or better than Intel MKL (up to ~26% better on a single CPU socket) for a wide range of matrix bandwidths on two different Intel CPU systems.
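To make the task structure concrete, here is a small NumPy sketch of a blocked right-looking Cholesky factorization, showing the per-block factorize/solve/update steps whose dependencies a task-based runtime such as OpenMP tasks can exploit. It operates on a dense SPD matrix and does not reproduce the paper's banded OpenMP implementation; the block size and names are illustrative.

# Sketch: blocked right-looking Cholesky in NumPy, exposing the block
# dependencies (factorize / panel solve / trailing update) that a
# task-based banded variant exploits. Illustrative only.
import numpy as np

def blocked_cholesky(A, nb=64):
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        ke = min(k + nb, n)
        # factorize the diagonal block (potrf-like task)
        A[k:ke, k:ke] = np.linalg.cholesky(A[k:ke, k:ke])
        L_kk = A[k:ke, k:ke]
        if ke < n:
            # solve the panel below the diagonal block (trsm-like tasks)
            A[ke:, k:ke] = np.linalg.solve(L_kk, A[ke:, k:ke].T).T
            # update the trailing matrix (syrk/gemm-like tasks); for a banded
            # matrix only blocks inside the band would be touched here
            A[ke:, ke:] -= A[ke:, k:ke] @ A[ke:, k:ke].T
    return np.tril(A)

A = np.random.rand(512, 512)
A = A @ A.T + 512 * np.eye(512)          # symmetric positive definite
L = blocked_cholesky(A)
print(np.allclose(L @ L.T, A))           # True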
Details
ISBN:
(Print) 9781665440660
GPUs are readily available in cloud computing and personal devices, but their use for data processing acceleration has been slowed down by their limited integration with common programming languages such as Python or Java. Moreover, using GPUs to their full capabilities requires expert knowledge of asynchronous programming. In this work, we present a novel GPU runtime scheduler for multi-task GPU computations that transparently provides asynchronous execution, space-sharing, and transfer-computation overlap without requiring any information about the program dependency structure in advance. We leverage the GrCUDA polyglot API to integrate our scheduler with multiple high-level languages and provide a platform for fast prototyping and easy GPU acceleration. We validate our work on 6 benchmarks created to evaluate task parallelism and show an average 44% speedup against synchronous execution, with no execution time slowdown compared to hand-optimized host code written using the C++ CUDA Graphs API.
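The scheduler's key ingredient is inferring dependencies between GPU tasks without user annotations. The sketch below shows one way to derive a dependency DAG from the arrays each kernel reads and writes; the data structures and task names are hypothetical, and this is not the GrCUDA scheduler itself.

# Sketch: inferring a dependency DAG for GPU tasks from the arrays each
# kernel reads and writes, so independent tasks can run on separate streams.
# Hypothetical data structures; not the GrCUDA runtime.
from collections import defaultdict

class Task:
    def __init__(self, name, reads, writes):
        self.name, self.reads, self.writes = name, set(reads), set(writes)
        self.deps = set()

def build_dag(tasks):
    last_writer = {}                 # array -> task that last wrote it
    readers = defaultdict(list)      # array -> tasks that read it since then
    for t in tasks:
        for a in t.reads | t.writes:             # read/write-after-write
            if a in last_writer:
                t.deps.add(last_writer[a])
        for a in t.writes:                       # write-after-read
            t.deps.update(readers[a])
            last_writer[a] = t
            readers[a] = []
        for a in t.reads:
            readers[a].append(t)
    return tasks

tasks = build_dag([
    Task("k1", reads=["x"], writes=["y"]),
    Task("k2", reads=["x"], writes=["z"]),       # independent of k1
    Task("k3", reads=["y", "z"], writes=["out"]),
])
for t in tasks:
    print(t.name, "depends on", sorted(d.name for d in t.deps))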
Details
ISBN:
(Print) 9781665435741
Traditional studies on jamming effectiveness and propagation over the wireless channel assume ideal theoretical models, such as Friis and Rician. However, the cited models have hardly been validated by on-field assessments in real jamming scenarios. To the best of our knowledge, we are the first to fill the highlighted gap. In particular, our objective is to provide a realistic jamming propagation model, taking into account heterogeneous operating frequencies and technologies. Our findings, supported by an extensive experimental campaign on outdoor jamming propagation, show that, independently of the communication frequency, the jamming power received at a given distance from the jamming source (fast fading) can be best modelled through a t location-scale distribution, while the power of the received jamming decays with increasing distance from the jamming source (slow fading) following a power law. As reference applications of the derived experimental model, we describe and demonstrate its usage in two different use cases, i.e., jamming source localization and dead-reckoning navigation, showing that our model outperforms traditional and state-of-the-art propagation models when dealing with real jamming scenarios. All the acquired data have been released as open source, to foster experimental research activities on jamming propagation models and their applications.
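As a hedged illustration of the reported model, the following snippet fits a t location-scale distribution (Student's t with location and scale) to fast-fading samples with SciPy and estimates a power-law path-loss exponent for slow fading via a log-log linear fit. The data here are synthetic and the parameter values are not the paper's measurements.

# Sketch: t location-scale fit for fast fading, power-law fit for slow fading.
# Synthetic data; all numbers are illustrative, not the measured model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# fast fading: fluctuation of received jamming power at a fixed distance
fluct_db = stats.t.rvs(df=4, loc=0.0, scale=2.0, size=2000, random_state=rng)
df, loc, scale = stats.t.fit(fluct_db)
print(f"t location-scale fit: df={df:.2f}, loc={loc:.2f} dB, scale={scale:.2f} dB")

# slow fading: mean received power decays with distance as P(d) = P0 * d**(-n);
# the exponent n is estimated by linear regression in log-log space
d = np.linspace(5, 200, 40)
p = 1e-3 * d ** -2.7 * rng.lognormal(0, 0.05, size=d.size)
slope, intercept = np.polyfit(np.log(d), np.log(p), 1)
print(f"path-loss exponent n = {-slope:.2f}")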
Details
ISBN:
(Print) 9781665440660
Python has become a widely used programming language for research, not only for small one-off analyses, but also for complex application pipelines running at supercomputer scale. Modern parallel programming frameworks for Python present users with a more granular unit of management than traditional Unix processes and batch submissions: the Python function. We review the challenges involved in running native Python functions at scale, and present techniques for dynamically determining a minimal set of dependencies and for assembling a lightweight function monitor (LFM) that captures the software environment and manages resources at the granularity of single functions. We evaluate these techniques in a range of environments, from campus cluster to supercomputer, and show that our advanced dependency management planning and dynamic resource management methods provide superior performance and utilization relative to coarser-grained management approaches, achieving a several-fold decrease in execution time for several large Python applications.
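A minimal sketch of per-function dependency discovery, in the spirit of the analysis described above: it inspects the global names a Python function references and reports the top-level modules involved. The real system is far more thorough (closures, transitive imports, data files); the helper name is ours.

# Sketch: discover the modules a single Python function actually touches.
# Stdlib only; illustrative, not the paper's dependency-management system.
import sys
from types import ModuleType

def function_dependencies(func):
    """Top-level modules referenced by a function's global names."""
    mods = set()
    for name in func.__code__.co_names:          # names the bytecode looks up
        obj = func.__globals__.get(name)
        if isinstance(obj, ModuleType):
            mods.add(obj.__name__)
        elif obj is not None and getattr(obj, "__module__", None):
            mods.add(obj.__module__)
    return sorted({m.split(".")[0] for m in mods})

import json, math

def analyze(x):
    return json.dumps({"root": math.sqrt(x)})

print(function_dependencies(analyze))            # -> ['json', 'math']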
Details
ISBN:
(Print) 9781665435741
With the explosive increase of various user equipments, access latency has become a paramount QoS metric in multi-access edge computing (MEC). At the same time, cost expenditure affects and restrains the reduction of latency. To cope with these issues, a two-stage replica management mechanism (TRMM) for latency-aware applications in MEC is proposed. First, we design the system architecture of TRMM in the MEC environment and construct a novel mathematical model that describes the replica placement decision problem as a dual-objective problem with both latency and cost constraints. Subsequently, in the replica recommendation stage, we present a file prospective popularity model based on user mobility and a replica recommendation algorithm; in the replica placement rule learning stage, we construct a Q-Learning model, in which a new reward function is defined in terms of data access latency and replica placement cost, and the replica placement rule is defined in terms of a 0-1 matrix. Finally, numerical results demonstrate that TRMM outperforms other replica placement schemes.
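To illustrate the rule-learning stage, the sketch below runs tabular Q-Learning with a reward that trades off access latency against replica placement cost and a binary place/do-not-place action. The environment, state space and weights are toy values, not the paper's model.

# Sketch: tabular Q-Learning with a latency/cost reward for replica placement.
# Purely illustrative environment; not TRMM itself.
import numpy as np

N_STATES, N_ACTIONS = 16, 2           # action 1 = place replica, 0 = do not
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
W_LAT, W_COST = 0.7, 0.3              # weights of the dual objective

Q = np.zeros((N_STATES, N_ACTIONS))
rng = np.random.default_rng(1)

def reward(latency, cost):
    # higher reward for lower latency and lower cost (both normalised to [0, 1])
    return -(W_LAT * latency + W_COST * cost)

def step(state, action):
    # toy environment: placing a replica reduces latency but incurs storage cost
    latency = rng.uniform(0.1, 0.3) if action == 1 else rng.uniform(0.5, 1.0)
    cost = 0.4 if action == 1 else 0.0
    return rng.integers(N_STATES), reward(latency, cost)

state = 0
for _ in range(5000):
    action = rng.integers(N_ACTIONS) if rng.random() < EPS else int(Q[state].argmax())
    next_state, r = step(state, action)
    Q[state, action] += ALPHA * (r + GAMMA * Q[next_state].max() - Q[state, action])
    state = next_state

print("learned placement rule (0/1 per state):", Q.argmax(axis=1))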
Details
ISBN:
(Print) 9781665435772
Use of Deep Learning (DL) in commercial applications such as image classification, sentiment analysis and speech recognition is increasing. When training DL models with a large number of parameters and/or large datasets, the cost and speed of training can become prohibitive. Distributed DL training solutions that split a training job into subtasks and execute them over multiple nodes can decrease training time. However, the cost of current solutions, built predominantly for cluster computing systems, can still be an issue. In contrast to cluster computing systems, Volunteer Computing (VC) systems can lower the cost of computing, but applications running on VC systems have to handle fault tolerance, variable network latency and heterogeneity of compute nodes, and current solutions are not designed to do so. We design a distributed solution that can run DL training on a VC system by using a data-parallel approach. We implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems. In contrast to traditional VC systems that lower cost by using untrustworthy volunteer devices, we lower cost by leveraging preemptible computing instances on commercial cloud platforms. By using preemptible instances that require applications to be fault tolerant, we lower cost by 70-90% and improve data security.
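A minimal sketch of the data-parallel pattern that asynchronous schemes such as VC-ASGD build on: workers pull parameters, compute gradients on their own data, and push updates to a shared parameter server without synchronising with each other. The objective and the threading setup are toy choices, and none of the paper's fault-tolerance or preemption handling is shown.

# Sketch: asynchronous SGD with a shared parameter server and threaded workers.
# Toy quadratic objective; not the VC-ASGD algorithm.
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        with self.lock:                     # apply updates as they arrive
            self.w -= self.lr * grad

def worker(ps, data, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        w = ps.pull()
        x = data[rng.integers(len(data))]
        ps.push(2 * (w - x))                # gradient of ||w - x||^2

target = np.array([3.0, -1.0])
data = target + np.random.default_rng(0).normal(0, 0.1, size=(256, 2))
ps = ParameterServer(dim=2)
threads = [threading.Thread(target=worker, args=(ps, data)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("estimated parameters:", ps.w.round(2))   # close to [3.0, -1.0]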
Details
ISBN:
(Print) 9781665435741
Sleep staging is an important method to diagnose and treat insomnia, sleep apnea, and other sleep disorders. Compared with multi-channel automatic sleep staging systems, the single-channel EEG signal contains less information, and traditional single-domain feature extraction algorithms cannot meet the accuracy requirements of sleep stage classification. To solve this problem, we propose an automatic sleep staging method based on the combination of time-domain and frequency-domain features of single-channel EEG signals. Empirical mode decomposition is used to decompose the EEG signal in the time domain to obtain decomposed signals at different time scales, and multiple local features are extracted from each decomposed signal. The frequency-domain features of the EEG signal are obtained by decomposing it into its various rhythms in the frequency domain. The time-domain and frequency-domain features are combined into feature vectors and selected for sleep staging. The experimental results show that the proposed sleep staging method, using time-frequency domain features of single-channel EEG signals, can approach the accuracy of multi-channel sleep staging on the same data set and is superior to sleep staging methods using the same single-channel EEG signals.
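As an illustration of the frequency-domain side of such a pipeline, the sketch below extracts band powers of the classical EEG rhythms from one single-channel epoch with Welch's method. The sampling rate, epoch and band edges are assumptions, and the paper's EMD-based time-domain features are not reproduced.

# Sketch: frequency-domain features (rhythm band powers) for one EEG epoch.
# Sampling rate, band edges and the toy signal are illustrative assumptions.
import numpy as np
from scipy.signal import welch

FS = 100                                   # assumed sampling rate in Hz
BANDS = {"delta": (0.5, 4), "theta": (4, 8),
         "alpha": (8, 13), "beta": (13, 30)}

def band_powers(epoch, fs=FS):
    freqs, psd = welch(epoch, fs=fs, nperseg=fs * 4)
    df = freqs[1] - freqs[0]
    feats = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        feats[name] = float(np.sum(psd[mask]) * df)   # power in the band
    return feats

# 30-second toy epoch: a 10 Hz alpha rhythm plus noise
t = np.arange(0, 30, 1 / FS)
epoch = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(0).normal(size=t.size)
print(band_powers(epoch))                  # alpha should dominate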
Details
ISBN:
(Digital) 9798331542856
ISBN:
(Print) 9798331542863
This paper presents an innovative analog neuron circuit design, subtly implementing the ReLU activation function. This analog neuron mainly consists of two parts. The first part handles linear processing and is responsible for the weighted summation of the input signals; in our design, the hardware circuit makes it possible to assign the desired weight to signals from different inputs. The other part is the activation function circuit, for which we designed four different circuits. Proteus simulation results indicate that all four circuits can implement the ReLU function well. With the precision rectifier circuit selected, the output voltage has a strong ReLU relationship with the input, while the other circuits have their own advantages and disadvantages. Additionally, the diode clamp circuit approximately realizes the ELU function, which provides a new idea for the design of analog neurons.
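A behavioural reference for the circuit, assuming ideal components: weighted summation of the inputs followed by an ideal ReLU, which can be compared against the simulated transfer curve. The weights and input values are illustrative only.

# Behavioural sketch of the analog neuron: weighted summation followed by an
# ideal ReLU, as a numerical reference for the circuit's transfer curve.
def relu(v):
    return v if v > 0 else 0.0

def neuron(inputs, weights, bias=0.0):
    s = sum(w * x for w, x in zip(weights, inputs))   # weighted summation stage
    return relu(s + bias)                             # activation stage

print(neuron(inputs=[0.2, -0.5, 1.0], weights=[1.5, 0.8, 0.3]))   # -> 0.2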