ISBN (digital): 9798331527211
ISBN (print): 9798331527228
The growth of parallelism, together with increasing data volumes, has made more efficient memory management necessary, especially on shared-memory architectures. The goal is to present several improvements to memory management on shared-memory computing architectures, using Parallel Discrete Event Simulation (PDES) platforms as a case study. The presented solutions span from memory-hierarchy awareness in terms of cache/NUMA locality, to an incremental state saving mechanism exploiting write-protection, up to a prompt and memory-aware output collection mechanism, with results reported to validate our findings.
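Of the mechanisms listed above, write-protection-based incremental state saving lends itself to a compact illustration. The sketch below is a minimal, assumed POSIX implementation of the general technique (mprotect plus a SIGSEGV handler that logs and unprotects dirty pages), not the platform's actual code; the buffer size and the fixed page-log capacity are illustrative, and a production version would need more care around async-signal safety and multithreading.

```cpp
// Sketch: incremental state saving via page write-protection (POSIX).
// At the start of each event the state is write-protected; the first write to
// a page faults, and the handler saves the page's pre-image, marks it dirty,
// and unprotects it so the write can proceed. Only dirty pages are logged.
#include <csignal>
#include <cstdio>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

static char*  g_state      = nullptr;     // simulation state buffer
static size_t g_state_size = 0;
static long   g_page       = 0;

static const int MAX_PAGES = 64;          // illustrative fixed log capacity
static int   g_dirty[MAX_PAGES];          // indices of pages written this event
static int   g_ndirty = 0;
static char* g_copies = nullptr;          // pre-image of each dirty page

static void segv_handler(int, siginfo_t* si, void*) {
    char* addr = static_cast<char*>(si->si_addr);
    if (addr < g_state || addr >= g_state + g_state_size) _exit(1);  // unrelated fault
    int page = static_cast<int>((addr - g_state) / g_page);
    std::memcpy(g_copies + g_ndirty * g_page, g_state + page * g_page, g_page); // save pre-image
    g_dirty[g_ndirty++] = page;
    mprotect(g_state + page * g_page, g_page, PROT_READ | PROT_WRITE);  // let the write proceed
}

static void begin_event() {               // re-protect the whole state, reset the log
    g_ndirty = 0;
    mprotect(g_state, g_state_size, PROT_READ);
}

int main() {
    g_page = sysconf(_SC_PAGESIZE);
    g_state_size = 4 * static_cast<size_t>(g_page);
    g_state = static_cast<char*>(mmap(nullptr, g_state_size, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    g_copies = new char[MAX_PAGES * g_page];

    struct sigaction sa {};
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = segv_handler;
    sigaction(SIGSEGV, &sa, nullptr);

    begin_event();
    g_state[10] = 'a';            // first write to page 0 -> trapped, logged, unprotected
    g_state[11] = 'b';            // page 0 already writable -> no trap
    g_state[g_page + 3] = 'c';    // first write to page 1 -> trapped, logged
    std::printf("dirty pages this event: %d of %zu\n", g_ndirty, g_state_size / g_page);
    return 0;
}
```

At rollback, only the pages recorded in the dirty log would need to be restored from their saved pre-images.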
ISBN (print): 9798350364613; 9798350364606
The Edmonds Blossom algorithm is implemented here using depth-first search, which is intrinsically serial. By streamlining the code, our serial implementation is consistently three to five times faster than the previously fastest general graph matching code. By extracting parallelism across iterations of the algorithm, with coarse-grain locking, we are able to further reduce the run time on random regular graphs fourfold and obtain a two-fold reduction of run time on real-world graphs with similar topology. Solving very sparse graphs (average degree less than four) exhibiting community structure with eight threads led to a three-fold slowdown, but this slowdown turns into a marginal speedup once the average degree exceeds four. We conclude that our parallel coarse-grain locking implementation performs well when extracting parallelism from this augmenting-path-based algorithm and may work well for similar algorithms.
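The coarse-grain locking idea can be illustrated independently of the blossom machinery. In the hedged sketch below, worker threads search for augmenting opportunities concurrently but commit every change to the matching under one global mutex; for brevity the "search" only finds trivial length-one augmenting paths (a greedy matching), so the sketch shows the locking pattern rather than the Edmonds Blossom algorithm itself, and the graph data is made up.

```cpp
// Sketch of the coarse-grain locking pattern: concurrent searches, but each
// commit to the shared matching happens under a single global lock, with the
// preconditions re-checked under that lock.
#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    // Tiny undirected graph as an adjacency list (assumed example data).
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3, 5}, {4}};
    std::vector<int> match(adj.size(), -1);   // match[v] == -1 -> v is free
    std::mutex commit_lock;                   // the coarse-grain lock
    std::atomic<int> next{0};                 // shared work counter: next vertex to try

    auto worker = [&]() {
        for (int v; (v = next.fetch_add(1)) < (int)adj.size(); ) {
            for (int u : adj[v]) {
                std::lock_guard<std::mutex> g(commit_lock);
                if (match[v] == -1 && match[u] == -1) {   // re-check under the lock
                    match[v] = u;
                    match[u] = v;
                    break;
                }
                if (match[v] != -1) break;                // v got matched by another thread
            }
        }
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();

    for (size_t v = 0; v < match.size(); ++v)
        if (match[v] > (int)v) std::printf("matched edge (%zu, %d)\n", v, match[v]);
    return 0;
}
```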
High-speed photonic reservoir computing (RC) has garnered significant interest in neuromorphic computing. However, existing reservoir layer (RL) architectures mostly rely on time-delayed feedback loops and use analog-to-digital converters for offline digital processing in the implementation of the readout layer, posing inherent limitations on their speed and capabilities. In this paper, we propose a non-feedback method that utilizes the pulse broadening effect induced by optical dispersion to implement a RL. By combining the multiplication of the modulator with the summation of the pulse temporal integration of the distributed-feedback laser diode, we successfully achieve the linear regression operation of the optoelectronic analog readout layer. Our proposed fully-analog feed-forward photonic RC (FF-PhRC) system is experimentally demonstrated to be effective in chaotic signal prediction, spoken digit recognition, and MNIST classification. Additionally, using wavelength-division multiplexing, our system manages to complete parallel tasks and improve processing capability up to 10 GHz per wavelength. The present work highlights the potential of FF-PhRC as a high-performance, high-speed computing tool for real-time neuromorphic computing. (c) 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
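As a purely digital point of reference for what the analog readout computes, the sketch below fits a linear readout y = w·x to toy reservoir state vectors by gradient descent. In the paper this multiply-and-accumulate is carried out in the optical/electrical domain (modulator multiplication plus pulse integration) rather than in software; the data, dimensions, and learning rate here are illustrative assumptions.

```cpp
// Sketch: training a linear readout for a reservoir computer in software.
// X holds toy reservoir state vectors, y the target outputs; the readout
// weights w are fitted by least-squares gradient descent.
#include <cstdio>
#include <vector>

int main() {
    const int n_samples = 6, n_features = 3;
    double X[n_samples][n_features] = {          // assumed reservoir states
        {0.1, 0.7, 0.3}, {0.9, 0.2, 0.5}, {0.4, 0.4, 0.8},
        {0.6, 0.1, 0.2}, {0.3, 0.9, 0.6}, {0.8, 0.5, 0.1}};
    double y[n_samples] = {0.5, 1.2, 0.9, 0.7, 0.8, 1.1};  // assumed targets

    std::vector<double> w(n_features, 0.0);      // readout weights
    const double lr = 0.1;
    for (int epoch = 0; epoch < 2000; ++epoch)   // least-squares fit by SGD
        for (int s = 0; s < n_samples; ++s) {
            double pred = 0.0;
            for (int f = 0; f < n_features; ++f) pred += w[f] * X[s][f];
            double err = pred - y[s];
            for (int f = 0; f < n_features; ++f) w[f] -= lr * err * X[s][f];
        }

    std::printf("readout weights: %.3f %.3f %.3f\n", w[0], w[1], w[2]);
    return 0;
}
```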
Graphs are ubiquitous, and they can model unique characteristics and complex relations of real-life systems. Although using machine learning (ML) on graphs is promising, their raw representation is not suitable for ML algorithms. Graph embedding represents each node of a graph as a d-dimensional vector which is more suitable for ML tasks. However, the embedding process is expensive, and CPU-based tools do not scale to real-world graphs. In this work, we present GOSH, a GPU-based tool for embedding large-scale graphs with minimum hardware constraints. GOSH employs a novel graph coarsening algorithm to enhance the impact of updates and minimize the work for embedding. It also incorporates a decomposition schema that enables any arbitrarily large graph to be embedded with a single GPU. As a result, GOSH sets a new state-of-the-art in link prediction both in accuracy and speed, and delivers high-quality embeddings for node classification at a fraction of the time compared to the state-of-the-art. For instance, it can embed a graph with over 65 million vertices and 1.8 billion edges in less than 30 minutes on a single GPU.
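A single level of graph coarsening of the general kind used by embedding tools can be sketched as follows. This is a generic neighbor-collapsing step for illustration, not GOSH's actual coarsening algorithm, and the example graph is made up.

```cpp
// Sketch: one level of graph coarsening by neighbor collapsing. Each vertex is
// merged with the first still-unmerged neighbor encountered; edges are then
// projected onto the resulting super-vertices.
#include <algorithm>
#include <cstdio>
#include <set>
#include <vector>

int main() {
    int n = 6;
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,0},{1,4}};
    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : edges) { adj[u].push_back(v); adj[v].push_back(u); }

    // 1) Greedily merge each unmerged vertex with one unmerged neighbor.
    std::vector<int> leader(n, -1);
    for (int v = 0; v < n; ++v) {
        if (leader[v] != -1) continue;
        leader[v] = v;
        for (int u : adj[v])
            if (leader[u] == -1) { leader[u] = v; break; }
    }

    // 2) Renumber the leaders as super-vertices of the coarse graph.
    std::vector<int> id(n, -1);
    int coarse_n = 0;
    for (int v = 0; v < n; ++v)
        if (leader[v] == v) id[v] = coarse_n++;

    // 3) Project edges onto super-vertices, dropping self-loops and duplicates.
    std::set<std::pair<int, int>> coarse_edges;
    for (auto [u, v] : edges) {
        int cu = id[leader[u]], cv = id[leader[v]];
        if (cu != cv) coarse_edges.insert({std::min(cu, cv), std::max(cu, cv)});
    }
    std::printf("coarse graph: %d vertices, %zu edges\n", coarse_n, coarse_edges.size());
    return 0;
}
```

Repeating such a step produces the hierarchy of progressively smaller graphs on which embeddings are trained and then projected back to finer levels.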
With ever more complex functionalities being implemented in emerging real-time applications, multi-core systems are required to deliver high performance, with directed acyclic graphs (DAGs) being used to model functional dependencies. For a single DAG task, our previous work presented a concurrent provider and consumer (CPC) model that captures node-level dependency and parallelism, the two key factors of a DAG. Based on the CPC, scheduling and analysis methods were constructed to reduce the makespan and tighten the analytical bound of the task. However, the CPC-based methods cannot support multiple DAGs, as the interference between DAGs (i.e., inter-task interference) is not taken into account. To address this limitation, this article proposes a novel multi-DAG scheduling approach which specifies the number of cores a DAG can utilise so that it does not incur inter-task interference. This is achieved by modelling and understanding the workload distribution of the DAG and the system. By avoiding inter-task interference, the constructed schedule provides full compatibility for the CPC-based methods to be applied on each DAG and reduces the pessimism of the existing analysis. Experimental results show that the proposed multi-DAG method achieves an improvement of up to 80% in schedulability against the original work that it extends, and outperforms the existing multi-DAG methods by up to 60% in tightening the interference analysis.
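For context, a well-known baseline for dedicating cores to a DAG so that it suffers no inter-task interference is the federated-scheduling bound m = ceil((C - L) / (D - L)), where C is the DAG's total work, L its critical-path length, and D its deadline. The sketch below computes that classic bound with made-up parameters; the paper instead derives core counts from the DAG's workload distribution, so this is a reference point, not the proposed method.

```cpp
// Sketch: classic federated-style core allocation for a DAG task.
#include <cmath>
#include <cstdio>

// Minimum number of dedicated cores so a DAG with work C, critical-path
// length L, and deadline D meets its deadline under any work-conserving
// scheduler (heavy tasks); light tasks (C <= D) fit on one core.
int dedicated_cores(double C, double L, double D) {
    if (L > D) return -1;                    // infeasible: critical path exceeds deadline
    if (C <= D) return 1;                    // light task: runs sequentially in time
    return (int)std::ceil((C - L) / (D - L));
}

int main() {
    // Hypothetical DAG parameters (e.g., microseconds): C = 900, L = 200, D = 400.
    std::printf("cores needed: %d\n", dedicated_cores(900.0, 200.0, 400.0));  // prints 4
    return 0;
}
```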
ISBN (digital): 9783031617638
ISBN (print): 9783031617621; 9783031617638
DLA-Future implements an efficient GPU-enabled distributed eigenvalue solver using a software architecture based on the C++ std::execution concurrency proposal. The state-of-the-art linear algebra implementations LAPACK and ScaLAPACK were designed for legacy systems and employ fork-join parallelism, which can perform inefficiently on modern architectures. The benefits of task-based linear algebra implementations are significant. The reduction of synchronization points and the ease of overlapping computation with communication are two of the main benefits that lead to improved performance. In specific cases, the ability to schedule multiple algorithms concurrently yields a noticeable reduction of time-to-solution. We present the implementation of DLA-Future and the results on different types of systems starting from Piz Daint multicore and GPU partitions, moving to more recent architectures available in ALPS. The benchmark results are divided into two categories. The first contains a comparison of DLA-Future against widely used eigensolver implementations. The second category showcases the performance of the eigensolver in real applications. We present results generated with CP2K, where DLA-Future support was easily added thanks to the provided C API, which is compatible with the ScaLAPACK interface.
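A minimal taste of the sender/receiver style of composition is sketched below using the open-source stdexec reference implementation of std::execution; the header and namespace names are an assumption about the reader's setup, and DLA-Future itself is built on a separate P2300-style runtime. Two independent tasks run on a thread pool and a dependent task is chained after both, with no global fork-join barrier in between.

```cpp
// Sketch: task composition with P2300 senders via the stdexec reference
// implementation (assumed headers/namespaces). Two independent "panel" tasks
// run on a thread pool; an "update" task is chained after both complete.
#include <cstdio>
#include <utility>
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>

namespace ex = stdexec;

int main() {
    exec::static_thread_pool pool(4);
    auto sched = pool.get_scheduler();

    // Two independent tasks (stand-ins for, e.g., factorizing two panels).
    auto panel0 = ex::schedule(sched) | ex::then([] { return 10; });
    auto panel1 = ex::schedule(sched) | ex::then([] { return 32; });

    // A task that depends on both results (stand-in for a trailing update).
    auto update = ex::when_all(std::move(panel0), std::move(panel1))
                | ex::then([](int a, int b) { return a + b; });

    auto [sum] = ex::sync_wait(std::move(update)).value();
    std::printf("update consumed %d\n", sum);
    return 0;
}
```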
The Single Source Shortest Path (SSSP) problem is a classic graph theory problem that arises frequently in various practical scenarios; hence, many parallel algorithms have been developed to solve it. However, these algorithms operate on static graphs, whereas many real-world problems are best modeled as dynamic networks, where the structure of the network changes with time. This gap between the dynamic graph modeling and the assumed static graph model in the conventional SSSP algorithms motivates this work. We present a novel parallel algorithmic framework for updating the SSSP in large-scale dynamic networks and implement it on the shared-memory and GPU platforms. The basic idea is to identify the portion of the network affected by the changes and update the information in a rooted tree data structure that stores the edges of the network that are most relevant to the analysis. Extensive experimental evaluations on real-world and synthetic networks demonstrate that our proposed parallel updating algorithm is scalable and, in most cases, requires significantly less execution time than the state-of-the-art recomputing-from-scratch algorithms.
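The affected-region idea can be sketched compactly for edge insertions: only vertices whose distance actually improves are queued and relaxed, so the update cost is proportional to the changed part of the SSSP tree. The code below is a minimal sequential illustration with made-up data; edge deletions, which can lengthen distances and require marking an affected subtree, and the parallel/GPU aspects are omitted.

```cpp
// Sketch: incremental SSSP update after inserting an edge (u, v, w).
// dist[] and parent[] hold the current SSSP tree from a fixed source.
#include <cstdio>
#include <queue>
#include <vector>

struct Edge { int to; double w; };

void insert_and_update(std::vector<std::vector<Edge>>& adj,
                       std::vector<double>& dist, std::vector<int>& parent,
                       int u, int v, double w) {
    adj[u].push_back({v, w});                    // add the new edge
    if (dist[u] + w >= dist[v]) return;          // no shortest path changes
    dist[v] = dist[u] + w;                       // v roots the affected region
    parent[v] = u;
    std::queue<int> q;                           // propagate the improvement
    q.push(v);
    while (!q.empty()) {
        int x = q.front(); q.pop();
        for (const Edge& e : adj[x])
            if (dist[x] + e.w < dist[e.to]) {
                dist[e.to] = dist[x] + e.w;
                parent[e.to] = x;
                q.push(e.to);
            }
    }
}

int main() {
    // Small directed graph with a precomputed SSSP tree from source 0
    // (assumed example data): edges 0->1 (5), 0->2 (1), 1->3 (1), 2->1 (3).
    std::vector<std::vector<Edge>> adj = {{{1, 5}, {2, 1}}, {{3, 1}}, {{1, 3}}, {}};
    std::vector<double> dist = {0, 4, 1, 5};
    std::vector<int> parent = {-1, 2, 0, 1};
    insert_and_update(adj, dist, parent, 0, 3, 2.0);   // new edge 0->3 of weight 2
    for (size_t v = 0; v < dist.size(); ++v)
        std::printf("dist[%zu] = %.0f (parent %d)\n", v, dist[v], parent[v]);
    return 0;
}
```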
Parallel connection of multiple inverters is an important means of addressing the expansion, reserve, and protection of distributed power generation. In view of the shortcomings of traditional droop control methods, such as weak anti-interference ability, low tracking accuracy of the inverter output voltage, and serious circulating currents, a finite control set model predictive control (FCS-MPC) strategy for microgrid multi-inverter parallel systems based on Mixed Logical Dynamical (MLD) modeling is proposed. First, the MLD modeling method introduces logical variables, combining discrete events and continuous events into an overall differential equation, which makes the modeling more accurate. Then a predictive controller is designed based on the model, and constraints are added to the objective function. This not only handles real-time changes of the control system through online optimization, but also achieves higher tracking accuracy of the inverter output voltage and a lower total harmonic distortion (THD), and suppresses the circulating current between the inverters, yielding good dynamic performance. Finally, simulations are carried out in MATLAB/Simulink to verify the correctness of the model and the rationality of the proposed strategy. This paper aims to provide guidance for the design and optimal control of multi-inverter parallel systems.
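The core FCS-MPC loop (enumerate the finite set of switching states, predict one step ahead for each, apply the state with the lowest cost) can be illustrated with a generic single-leg current controller. The parameters below are assumed values and the plant is a simple RL model, not the MLD multi-inverter model used in the paper.

```cpp
// Sketch: finite-control-set MPC for current tracking on an RL load.
// Each step, the admissible switching levels are enumerated, the current is
// predicted one step ahead for each, and the level with the smallest
// tracking-error cost is applied.
#include <cmath>
#include <cstdio>

int main() {
    const double PI = 3.141592653589793;
    const double Ts = 50e-6, L = 10e-3, R = 0.5, Vdc = 400.0;  // assumed parameters
    const double levels[] = {-1.0, 0.0, 1.0};                  // finite control set
    double i = 0.0;                                            // load current

    for (int k = 0; k < 400; ++k) {
        double i_ref = 10.0 * std::sin(2 * PI * 50 * k * Ts);  // 50 Hz current reference
        double best_u = 0.0, best_cost = 1e30;
        for (double u : levels) {                              // enumerate switching states
            double i_pred = i + Ts / L * (u * Vdc - R * i);    // one-step prediction
            double cost = (i_ref - i_pred) * (i_ref - i_pred); // tracking-error cost
            if (cost < best_cost) { best_cost = cost; best_u = u; }
        }
        i = i + Ts / L * (best_u * Vdc - R * i);               // apply best state (model as plant)
        if (k % 100 == 0) std::printf("k=%3d  i=%6.2f A  ref=%6.2f A\n", k, i, i_ref);
    }
    return 0;
}
```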
ISBN (print): 9798400716836
This work explores "reverse engineering" organizational patterns from distributed machine control system (DMCS) patterns. The authors analyzed four core DMCS patterns (Separate real-time, Isolate Functionalities, Variable Manager, and Notifications), utilizing the similarity between communication structures within any organization and the architectural structures in the software architecture patterns. As a result, four new corresponding organizational patterns were written. In this paper, these patterns and an outline of the ideation process are presented.
The ability to accurately estimate job runtime properties allows a scheduler to schedule jobs effectively. State-of-the-art online cluster job schedulers use history-based learning, which uses past job execution information to estimate the runtime properties of newly arrived jobs. However, with fast-paced development in cluster technology (in both hardware and software) and changing user inputs, job runtime properties can change over time, which leads to inaccurate predictions. In this article, we explore the potential and limitations of real-time learning of job runtime properties, by proactively sampling and scheduling a small fraction of the tasks of each job. Such a task-sampling-based approach exploits the similarity among runtime properties of the tasks of the same job and is inherently immune to changing job behavior. Our analytical and experimental analysis of three production traces with different skew and job distributions shows that learning in space can be substantially more accurate. Our simulation and testbed evaluation on Azure, with the two learning approaches anchored in a generic job scheduler and driven by three production cluster job traces, shows that despite its online overhead, learning in space reduces the average job completion time (JCT) by 1.28x, 1.56x, and 1.32x compared to the prior-art history-based predictor. We further analyze the experimental results to give intuitive explanations of why learning in space outperforms learning in time in these experiments. Finally, we show how sampling-based learning can be extended to schedule DAG jobs and achieve similar speedups over the prior-art history-based predictor.
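The "learning in space" idea reduces to a simple estimator: run a small pilot subset of a job's tasks and use their observed durations to predict the rest. The sketch below shows that estimator with made-up task durations and an assumed 5% sampling ratio; the scheduling policy built on top of it is omitted.

```cpp
// Sketch: task-sampling-based runtime estimation. A pilot fraction of a job's
// tasks is run first; their mean duration estimates the per-task runtime of
// the remaining tasks, instead of relying on the history of earlier jobs.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> task_runtimes(200);
    for (size_t t = 0; t < task_runtimes.size(); ++t)
        task_runtimes[t] = 8.0 + 0.02 * (t % 7);                     // stand-in durations (s)

    size_t pilot = std::max<size_t>(1, task_runtimes.size() / 20);   // sample ~5% of tasks
    double pilot_sum = std::accumulate(task_runtimes.begin(),
                                       task_runtimes.begin() + pilot, 0.0);
    double est_task = pilot_sum / pilot;                             // estimated per-task runtime

    double true_sum = std::accumulate(task_runtimes.begin(), task_runtimes.end(), 0.0);
    std::printf("estimated per-task runtime: %.2f s (true mean %.2f s)\n",
                est_task, true_sum / task_runtimes.size());
    return 0;
}
```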