ISBN: (print) 9783031226977
The proceedings contain 12 papers. The special focus in this conference is on Job Scheduling Strategies for Parallel Processing. The topics include: Optimization of Execution Parameters of Moldable Ultrasound Workflows Under Incomplete Performance Data; Scheduling of Elastic Message Passing Applications on HPC Systems; Preface; On the Feasibility of Simulation-Driven Portfolio Scheduling for Cyberinfrastructure Runtime Systems; Improving Accuracy of Walltime Estimates in PBS Professional Using Soft Walltimes; Re-making the Movie-Making Machine; Using Kubernetes in Academic Environment: Problems and Approaches; AI-Job Scheduling on Systems with Renewable Power Sources; Toward Building a Digital Twin of Job Scheduling and Power Management on an HPC System; Encoding for Reinforcement Learning Driven Scheduling.
In this article we present PARSIR (parallel SImulation Runner), a package that enables the effective exploitation of shared-memory multi-processor machines for running discrete event simulation models. PARSIR is a com...
ISBN: (print) 9781665497473
Memory caching has long been used to bridge the performance gap between processor and disk and so reduce the data access time of data-intensive computations. Previous studies on caching mostly focus on optimizing the hit rate of a single machine. In this paper, we argue that the caching decisions of a distributed memory system should be made cooperatively for parallel data analytic applications, which are commonly used by emerging technologies such as Big Data and AI (Artificial Intelligence) to perform data mining and sophisticated analytics on larger data volumes in less time. A parallel data analytic job consists of multiple parallel tasks. Hence, the completion time of a job is bounded by its slowest task, meaning that the job cannot benefit from caching until the inputs of all its tasks are cached. To address this problem, we propose a cooperative caching design that periodically rearranges the cache placement among nodes according to the data access pattern while taking task dependency and network locality into account. Our approach is evaluated with a trace-driven simulator using both synthetic workloads and real-world traces. The results show that we can reduce the average completion time by up to 33% compared to non-collaborative caching policies and 25% compared to other state-of-the-art collaborative caching policies.
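The straggler argument in the abstract above can be illustrated with a minimal sketch. The read costs, block names, and the `job_completion_time` helper are hypothetical and are not taken from the paper; the sketch assumes only that a cached input is read faster than an uncached one and that a job finishes when its slowest task finishes.

```python
# Hypothetical per-task input read times in seconds (illustrative values only).
DISK_READ = 10.0   # input served from disk
CACHE_READ = 1.0   # input served from the memory cache


def job_completion_time(task_inputs, cached):
    """Return the completion time of a job whose tasks run in parallel.

    task_inputs: list of input block names, one per task.
    cached: set of block names currently held in cache.
    The job finishes only when its slowest task finishes.
    """
    return max(CACHE_READ if block in cached else DISK_READ
               for block in task_inputs)


inputs = ["a", "b", "c", "d"]
none_cached = job_completion_time(inputs, cached=set())
partly_cached = job_completion_time(inputs, cached={"a", "b", "c"})
all_cached = job_completion_time(inputs, cached=set(inputs))
print(none_cached, partly_cached, all_cached)
```

Caching three of the four inputs leaves the job as slow as caching none, which is why a per-node hit-rate policy can spend cache space without shortening any job.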
ISBN: (digital) 9781665497992
ISBN: (print) 9781665497992
In recent years, high computational power has been required for computer platforms to support complex systems such as self-driving systems. Clustered many-core processors and directed acyclic graphs (DAGs), which can represent dependencies and parallelism of task processing, have attracted much attention as solutions to this problem. Previous studies on scheduling DAGs on multi-core processors have attempted to reduce the makespan (i.e., the time it takes for a task to complete) by increasing the number of processes that can be executed in parallel. However, in self-driving systems, such as those utilizing clustered many-core processors, it is impossible to sufficiently increase the utilization of processor cores due to high-load processing. In this paper, a scheduling method is proposed to improve the utilization of processor cores by executing high-load processes in parallel across multiple cores. The proposed method can reduce the makespan of DAGs performing high-load processing on clustered many-core processors.
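To see why spreading one high-load process over several cores can shorten the makespan, consider the following rough sketch. It is not the scheduling method of the paper: it approximates the makespan by the DAG's critical path length (which assumes enough free cores), and it assumes an idealized linear speedup when a node is split. The node names and execution times are hypothetical.

```python
from functools import lru_cache


def critical_path_length(exec_time, successors):
    """Length of the longest path through a DAG, weighted by node execution time.

    exec_time: dict mapping node -> execution time.
    successors: dict mapping node -> list of successor nodes.
    With enough cores, this is the time at which the last node finishes.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        tails = [longest_from(s) for s in successors.get(node, [])]
        return exec_time[node] + (max(tails) if tails else 0.0)

    return max(longest_from(n) for n in exec_time)


# Hypothetical DAG in which "perceive" is the high-load node.
successors = {
    "sense": ["perceive", "localize"],
    "perceive": ["plan"],
    "localize": ["plan"],
    "plan": [],
}
exec_time = {"sense": 1.0, "perceive": 8.0, "localize": 2.0, "plan": 1.0}

before = critical_path_length(exec_time, successors)

# Idealized assumption: running the high-load node on 4 cores divides its time by 4.
split = dict(exec_time, perceive=exec_time["perceive"] / 4)
after = critical_path_length(split, successors)
print(before, after)
```

In this toy graph the high-load node dominates the critical path, so adding more independent nodes would not help, while splitting that one node does.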
Non-Uniform Memory Access (NUMA) systems are prevalent in HPC, where optimal thread and page placement are crucial for enhancing performance and minimizing energy usage [1]-[3]. Moreover, considering that NUMA syste...
ISBN: (print) 9798350311990
Quadratic Unconstrained Binary Optimization (QUBO) is a combinatorial optimization problem: find an optimal binary solution vector that minimizes the energy value defined by a quadratic formula of the binary variables in the vector. As many NP-hard problems can be reduced to QUBO problems, considerable research has gone into developing QUBO solvers running on various computing platforms such as quantum devices, ASICs, FPGAs, GPUs, and optical fibers. This paper presents a framework called Diverse Adaptive Bulk Search (DABS), which has the potential to find optimal solutions of many types of QUBO problems. Our DABS solver employs a genetic algorithm-based search algorithm featuring three diverse strategies: multiple search algorithms, multiple genetic operations, and multiple solution pools. During the execution of the solver, search algorithms and genetic operations that succeeded in finding good solutions are automatically selected to obtain better solutions. Moreover, search algorithms traverse between different solution pools to find good solutions. We have implemented our DABS solver to run on multiple GPUs. Experimental evaluations using eight NVIDIA A100 GPUs confirm that our DABS solver succeeds in finding optimal or potentially optimal solutions for three types of QUBO problems.
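For readers unfamiliar with QUBO, the energy being minimized can be written as E(x) = sum over i, j of W[i][j] * x[i] * x[j] with each x[i] in {0, 1}. The sketch below evaluates that energy and finds the minimum by exhaustive search on a tiny instance. It is not DABS: exhaustive search is swapped in purely for illustration and is feasible only for very small n. The matrix `W` and the function names are hypothetical.

```python
from itertools import product


def qubo_energy(W, x):
    """Energy of binary vector x under weight matrix W: sum of W[i][j]*x[i]*x[j]."""
    n = len(x)
    return sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def brute_force_qubo(W):
    """Try all 2**n binary vectors and return one with minimum energy."""
    n = len(W)
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(W, x))


# Hypothetical 3-variable instance: each variable is rewarded on its own,
# but neighbouring variables are penalized for being set together.
W = [[-1, 2, 0],
     [0, -1, 2],
     [0, 0, -1]]

best = brute_force_qubo(W)
best_energy = qubo_energy(W, best)
print(best, best_energy)
```

The search space doubles with every added variable, which is the reason heuristic solvers such as the one described in the abstract are needed for realistic problem sizes.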
ISBN: (print) 9781665497473
The concept of memory disaggregation has recently been gaining traction in research. With memory disaggregation, data center compute nodes can directly access memory on adjacent nodes and are therefore able to overcome local memory restrictions, introducing a new data management paradigm for distributed computing. This paper proposes and demonstrates a memory-disaggregated in-memory object store framework for big data applications by leveraging the newly introduced ThymesisFlow memory disaggregation system. The framework extends the functionality of the pre-existing Apache Arrow Plasma object store framework to distributed systems by enabling clients to easily and efficiently produce and consume data objects across multiple compute nodes. This allows big data applications to increasingly leverage parallel processing at reduced development costs. In addition, the paper includes latency and throughput measurements that indicate only a modest performance penalty is incurred for remote disaggregated memory access as opposed to local (~6.5 vs. ~5.75 GiB/s). The results can be used to guide the design of future systems that leverage memory disaggregation as well as the newly presented framework. This work is open-source and publicly accessible at https://***/10.5281/zenodo.6368998.
ISBN: (print) 9798350330991; 9798350331004
Computation of inner products is frequently used in machine learning (ML) algorithms, in addition to signal processing and communication applications. Distributed arithmetic (DA) has been frequently employed for area-time efficient inner-product implementations. In conventional DA-based architectures, one of the vectors is constant and known a priori. Hence, the traditional DA architectures are not suitable when both vectors are variable. However, computing the inner product of a pair of variable vectors is frequently needed for matrix multiplication of various forms and for convolutional neural networks. In this paper, we present a novel DA-based architecture for computing the inner product of variable vectors. To derive the proposed architecture, the inner product of any given length is decomposed into a set of short-length inner products, such that the inner product can be computed by successive accumulation of the results of the short-length inner products. We have designed a DA-based architecture for the computation of the short-length inner product of variable vectors and used it in successive clock cycles to compute the whole inner product by successive accumulation. The post-layout synthesis results using Cadence Innovus with a GPDK 90nm technology library show that the proposed DA-based parallel architecture offers significant advantages in area-delay product and energy consumption over the bit-serial DA architecture.
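The decomposition step described above can be sketched in software. The sketch models only the arithmetic identity (a long inner product equals the accumulated sum of short-length inner products); it does not model the bit-level distributed arithmetic or the hardware architecture of the paper. The chunk length of 4 and the function names are assumptions made for illustration.

```python
def short_inner_product(a, b):
    """Inner product of two short vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def chunked_inner_product(a, b, chunk=4):
    """Inner product computed by accumulating short-length inner products.

    Each loop iteration stands in for one step (e.g. one clock cycle)
    that processes `chunk` element pairs and adds the result to an accumulator.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    acc = 0
    for start in range(0, len(a), chunk):
        acc += short_inner_product(a[start:start + chunk],
                                   b[start:start + chunk])
    return acc


a = list(range(10))        # 0, 1, ..., 9
b = list(range(10, 20))    # 10, 11, ..., 19
direct = sum(x * y for x, y in zip(a, b))
chunked = chunked_inner_product(a, b, chunk=4)
print(direct, chunked)
```

The final chunk may be shorter than the others when the vector length is not a multiple of the chunk length; slicing handles that case without padding.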
ISBN: (print) 9798350337662
Similar to local file system checkers such as e2fsck for Ext4, a parallel file system (PFS) checker ensures the file system's correctness. The basic idea of file system checkers is straightforward: important metadata are stored redundantly in separate places for cross-checking; inconsistent metadata will be repaired or overwritten by its 'more correct' counterpart, which is defined by the developers. Unfortunately, implementing the idea for PFSes is non-trivial due to the system complexity. Although many popular parallel file systems already contain dedicated checkers (e.g., LFSCK for Lustre, BeeGFS-FSCK for BeeGFS, mmfsck for GPFS), the existing checkers often cannot detect or repair inconsistencies accurately due to one fundamental limitation: they rely on a fixed set of consistency rules predefined by developers, which cannot cover the various failure scenarios that may occur in practice. In this study, we propose a new graph-based method to build PFS checkers. Specifically, we model important PFS metadata as graphs, then generalize the logic of cross-checking and repairing into graph analytic tasks. We design a new graph algorithm, FaultyRank, to quantitatively calculate the correctness of each metadata object. By leveraging the calculated correctness, we are able to recommend the most promising repairs to users. Based on this idea, we implement a prototype of FaultyRank on Lustre, one of the most widely used parallel file systems, and compare it with Lustre's default file system checker, LFSCK. Our experiments show that FaultyRank can achieve the same checking and repairing logic as LFSCK. Moreover, it is capable of detecting and repairing complicated PFS consistency issues that LFSCK cannot handle. We also show the performance advantage of FaultyRank compared with LFSCK. Through this study, we believe FaultyRank opens a new opportunity for building PFS checkers effectively and efficiently.
ISBN: (print) 9781665481069
Finding the connected components of an undirected graph is one of the most fundamental graph problems. Connected components are used in a wide spectrum of applications including VLSI design, machine learning and image analysis. Sequentially, one can easily find all connected components in linear time using breadth-first traversal. However, in a massively distributed setting, finding connected components in a scalable way becomes much harder due to data irregularities and the overhead associated with the increased need for communication. In this work, we present a communication-efficient distributed graph algorithm for finding connected components that scales to massively parallel machines. Our algorithm is based on a recent linear-work shared-memory parallel algorithm by Blelloch et al. [1] and refines it for a distributed memory setting. This includes a communication-efficient graph contraction procedure, as well as a distributed variant of the low diameter decomposition by Miller et al. [2]. We tackle the data irregularities introduced by high-degree vertices by using an efficient procedure for distributing their incident edges. Our experimental evaluation on up to 16 384 cores indicates good weak scaling behavior that outperforms current state-of-the-art algorithms.
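The sequential baseline mentioned in the abstract, linear-time connected components via breadth-first traversal, can be sketched as follows. This is only the textbook sequential approach, not the distributed algorithm of the paper; the function name and the vertex numbering from 0 to n-1 are assumptions made for illustration.

```python
from collections import deque


def connected_components(n, edges):
    """Label the connected components of an undirected graph.

    n: number of vertices, numbered 0..n-1.
    edges: iterable of (u, v) pairs.
    Returns a list where label[v] is the component index of vertex v.
    Runs in O(n + m) time for m edges.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = [-1] * n          # -1 marks an unvisited vertex
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue
        # Breadth-first traversal from an unvisited vertex labels one component.
        label[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] == -1:
                    label[v] = count
                    queue.append(v)
        count += 1
    return label


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)
```

Each vertex and edge is touched a constant number of times, but the traversal follows a single frontier, which is the part that does not carry over directly to a distributed-memory machine.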