Processing large-scale graphs is challenging due to the nature of the computation, which causes irregular memory access patterns. Managing such irregular accesses may cause significant performance degradation on both CPUs and GPUs. Thus, recent research trends propose graph processing acceleration with Field-Programmable Gate Arrays (FPGAs). FPGAs are programmable hardware devices that can be fully customised to perform specific tasks in a highly parallel and efficient manner. However, FPGAs have a limited amount of on-chip memory that cannot fit the entire graph. Due to the limited device memory size, data needs to be repeatedly transferred to and from the FPGA on-chip memory, which makes data transfer time dominate over the computation time. A possible way to overcome the FPGA accelerators' resource limitation is to employ a multi-FPGA distributed architecture with an efficient partitioning scheme. Such a scheme aims to increase data locality and minimise communication between different partitions. This work proposes an FPGA processing engine that overlaps, hides and customises all data transfers so that the FPGA accelerator is fully utilised. This engine is integrated into a framework for using FPGA clusters and is able to use an offline partitioning method to facilitate the distribution of large-scale graphs. The proposed framework uses Hadoop at a higher level to map a graph to the underlying hardware platform. The higher layer of computation is responsible for gathering the blocks of data that have been pre-processed and stored on the host's file system and distributing them to a lower layer of computation made of FPGAs. We show how graph partitioning combined with an FPGA architecture leads to high performance, even when the graph has millions of vertices and billions of edges. In the case of the PageRank algorithm, widely used for ranking the importance of nodes in a graph, our implementation is the fastest compared to state-of-the-art CPU and GPU solutions.
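As a rough illustration of the block-partitioned computation described above (not the paper's FPGA engine), the following sketch iterates PageRank over pre-partitioned edge blocks, processing one block at a time the way a partition could be streamed to an accelerator. The block layout and function names are assumptions made for the example.

```python
# Hypothetical sketch: PageRank over pre-partitioned edge blocks, mimicking how
# blocks could be shipped to an accelerator one partition at a time.
import numpy as np

def pagerank_blocked(edge_blocks, num_vertices, d=0.85, iters=20):
    """edge_blocks: list of (src_array, dst_array) partitions, one per device/worker."""
    rank = np.full(num_vertices, 1.0 / num_vertices)
    out_deg = np.zeros(num_vertices)
    for src, _ in edge_blocks:
        np.add.at(out_deg, src, 1.0)
    out_deg[out_deg == 0] = 1.0                      # avoid division by zero for sink vertices
    for _ in range(iters):
        contrib = np.zeros(num_vertices)
        for src, dst in edge_blocks:                 # each block would be streamed to one device
            np.add.at(contrib, dst, rank[src] / out_deg[src])
        rank = (1.0 - d) / num_vertices + d * contrib
    return rank

# Toy usage: a 4-vertex cycle split into two edge partitions.
blocks = [(np.array([0, 1]), np.array([1, 2])), (np.array([2, 3]), np.array([3, 0]))]
print(pagerank_blocked(blocks, num_vertices=4))
```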
In distributed computing such as grid computing, online users submit their tasks anytime and anywhere to dynamic resources. Task arrival and execution processes are stochastic. How to adapt to the consequent uncertainties, as well as to scheduling overhead and response time, is the main concern in dynamic scheduling. Based on decision theory, scheduling is formulated as a Markov decision process (MDP). To address this problem, a machine learning approach is used to learn task arrival and execution patterns online. The proposed algorithm can automatically acquire such knowledge without any modeling in advance, and proactively allocates tasks taking into account the forthcoming tasks and their execution dynamics. Compared with four classic algorithms, namely Min-Min, Min-Max, Suffrage, and ECT, the proposed algorithm has much less scheduling overhead. Experiments over both synthetic and practical environments reveal that the proposed algorithm outperforms the other algorithms in terms of average response time. The smaller variance of the average response time further validates the robustness of our algorithm. (C) 2014 Elsevier Inc. All rights reserved.
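The paper's own learning algorithm is not reproduced here, but as an illustrative sketch of framing dynamic scheduling as an MDP, the snippet below uses tabular Q-learning where the state summarizes resource queue lengths and the action picks a resource; all names and the bucketing heuristic are assumptions.

```python
# Illustrative only: a Q-learning task scheduler over an assumed queue-length state.
import random
from collections import defaultdict

class QScheduler:
    def __init__(self, num_resources, alpha=0.1, gamma=0.9, eps=0.1):
        self.n, self.alpha, self.gamma, self.eps = num_resources, alpha, gamma, eps
        self.q = defaultdict(float)                      # Q[(state, action)]

    def state(self, queues):
        return tuple(min(q, 5) for q in queues)          # bucket queue lengths to keep the table small

    def choose(self, queues):
        s = self.state(queues)
        if random.random() < self.eps:                   # occasionally explore
            return random.randrange(self.n)
        return max(range(self.n), key=lambda a: self.q[(s, a)])

    def update(self, queues, action, reward, next_queues):
        s, s2 = self.state(queues), self.state(next_queues)
        best_next = max(self.q[(s2, a)] for a in range(self.n))
        self.q[(s, action)] += self.alpha * (reward + self.gamma * best_next - self.q[(s, action)])

# Usage: the reward could be the negative response time of the scheduled task.
sched = QScheduler(num_resources=3)
queues = [2, 0, 4]
a = sched.choose(queues)
sched.update(queues, a, reward=-1.7, next_queues=[2, 1, 4])
```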
With the wide application of deep learning, the amount of data required to train deep learning models is becoming increasingly large, resulting in longer training times and higher requirements for computing resources. To improve the throughput of a distributed learning system, both task scheduling and resource scheduling are required. This article proposes to combine ARIMA and GRU models to predict the future task volume. For task scheduling, multi-priority task queues are used to divide tasks into different queues according to their priorities, ensuring that high-priority tasks can be completed in advance. For resource scheduling, a reinforcement learning method is adopted to manage limited computing resources. The reward function of reinforcement learning is constructed based on the resources occupied by the task, the training time, and the accuracy of the model. When a distributed learning model tends to converge, the computing resources of the task are gradually reduced so that they can be allocated to other learning tasks. Experimental results demonstrate that RLPTO tends to use more computing nodes when facing tasks with large data scale and has good scalability. The distributed learning system reward experiment shows that RLPTO enables the computing cluster to obtain the largest reward.
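A minimal sketch under assumed weights (the paper's exact reward function is not specified here): a reward trading off resource usage, training time, and model accuracy, plus multi-priority queues that always dequeue from the highest non-empty priority.

```python
# Assumed weights and class names; not the RLPTO implementation.
from collections import deque

def reward(resources_used, train_time, accuracy, w_res=0.3, w_time=0.3, w_acc=0.4):
    # Lower resource usage and training time are better; higher accuracy is better.
    return -w_res * resources_used - w_time * train_time + w_acc * accuracy

class PriorityQueues:
    def __init__(self, levels=3):
        self.queues = [deque() for _ in range(levels)]   # index 0 = highest priority

    def submit(self, task, priority):
        self.queues[priority].append(task)

    def next_task(self):
        for q in self.queues:                            # scan from highest to lowest priority
            if q:
                return q.popleft()
        return None

pq = PriorityQueues()
pq.submit("train_resnet", priority=0)
pq.submit("train_lstm", priority=2)
print(pq.next_task(), reward(resources_used=4, train_time=2.5, accuracy=0.92))
```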
Communication technologies are of primary importance for today's activities, since they contribute to delivering a wide range of services for humans, from simple phone calls to advanced multimedia services, banking activities, and healthcare, to cite a few [1]. They are also the key factor of many measurement applications in the contexts of Industry 4.0, transportation, environmental monitoring, telemetering, building automation, and the emerging applications of the Internet of Things (IoT) and Industrial Internet of Things (IIoT) [2]-[4]. In other words, communication technologies enable the concept of "Networking for measurements", which explains the crucial role of networks for measurement applications.
Process descriptions are the backbones for creating products and delivering services automatically. Computing the alignments between process descriptions (such as process models) and process behavior is one of the fundamental tasks leading to better processes and services. The reason is that the computed results can be directly used in checking compliance, diagnosing deviations, and analyzing bottlenecks for processes. Although various alignment techniques have been proposed in recent years, their performance is still challenged by large logs and models. In this work, we introduce an efficient approach to accelerate the computation of alignments. Specifically, we focus on the computation of optimal alignments, and try to improve the performance of the state-of-the-art A*-based method through Petri net decomposition. We present the details of our designs and also show that our approach can be easily implemented in a distributed environment using the Spark platform. Using datasets with large event logs and process models, we experimentally demonstrate that our approach can indeed accelerate current A*-based implementations in general. (c) 2022 Elsevier Inc. All rights reserved.
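A rough sketch of the distribution idea only, not the authors' implementation: per-fragment alignment jobs are fanned out with PySpark after a model has been decomposed into fragments. The alignment itself is stood in for by a plain edit distance between a projected trace and a fragment's reference sequence; a real system would run A* on each sub-net instead, and all fragment and trace data below are invented.

```python
from pyspark import SparkContext

def edit_distance(a, b):
    # Standard Levenshtein distance with a rolling row.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def align_cost(trace, fragment_seq):
    # Project the trace onto the fragment's alphabet, then use edit distance as a
    # crude stand-in for the optimal alignment cost on that sub-net.
    alphabet = set(fragment_seq)
    return edit_distance([e for e in trace if e in alphabet], fragment_seq)

if __name__ == "__main__":
    sc = SparkContext(appName="decomposed-alignments")
    fragments = [("f1", ["a", "b"]), ("f2", ["c", "d", "e"])]     # (fragment_id, reference sequence)
    traces = [["a", "b", "c", "e"], ["a", "c", "d", "e"]]
    jobs = sc.parallelize([(t, f) for t in traces for f in fragments])
    costs = (jobs.map(lambda tf: (tf[1][0], align_cost(tf[0], tf[1][1])))
                 .reduceByKey(lambda x, y: x + y)                 # total cost per fragment
                 .collect())
    print(costs)
    sc.stop()
```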
We give a protocol for Asynchronous Distributed Key Generation (A-DKG) that is optimally resilient (can withstand f < n/3 faulty parties), has a constant expected number of rounds, has O(λn³) expected communication complexity, and assumes only the existence of a PKI. Prior to our work, the best A-DKG protocols required Ω(n) expected number of rounds and Ω(n⁴) expected communication. Our A-DKG protocol relies on several building blocks that are of independent interest. We define and design a Proposal Election (PE) protocol that allows parties to retrospectively agree on a valid proposal after enough proposals have been sent from different parties. With constant probability the elected proposal was proposed by a nonfaulty party. In building our PE protocol, we design a Verifiable Gather protocol which allows parties to communicate which proposals they have and have not seen in a verifiable manner. The final building block of our A-DKG is a Validated Asynchronous Byzantine Agreement (VABA) protocol. We use our PE protocol to construct a VABA protocol that does not require leaders or an asynchronous DKG setup. Our VABA protocol can be used more generally when it is not possible to use threshold signatures.
A new technique for distribution of GEANT4 processes is introduced to simplify running a simulation in a parallel environment such as a tightly coupled computer cluster. Using a new C++ class derived from the GEANT4 toolkit, multiple runs forming a single simulation are managed across a local network of computers with a simple inter-node communication protocol. The class is integrated with the GEANT4 toolkit and is designed to scale from a single symmetric multiprocessing (SMP) machine to compact clusters ranging in size from tens to thousands of nodes. User-designed 'work tickets' are distributed to clients using a client-server workflow model to specify the parameters for each individual run of the simulation. The new g4distributedRunmanager class was developed and well tested in the course of our Neutron Stimulated Emission Computed Tomography (NSECT) experiments. It will be useful for anyone running GEANT4 on large discrete data sets, such as covering a range of angles in computed tomography, calculating dose delivery with multiple fractions, or simply speeding up the throughput of a single model. (C) 2014 Elsevier B.V. All rights reserved.
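Conceptual sketch only (the actual class is C++ inside the GEANT4 toolkit): a tiny "work ticket" server that hands out per-run parameters to worker nodes over TCP, mirroring the client-server workflow described above. The ticket fields, port, and protocol below are invented for illustration.

```python
import json, socket, threading

# Hypothetical tickets: one run per angle, each with a fixed event count.
TICKETS = [{"run_id": i, "angle_deg": i * 15, "events": 100000} for i in range(8)]
LOCK = threading.Lock()

def serve(srv):
    # Hand one ticket (or None when exhausted) to each client that connects.
    while True:
        conn, _ = srv.accept()
        with LOCK:
            ticket = TICKETS.pop(0) if TICKETS else None
        conn.sendall(json.dumps(ticket).encode())
        conn.close()

def worker(host="127.0.0.1", port=5555):
    # A worker keeps requesting tickets until the server replies with None.
    while True:
        with socket.create_connection((host, port)) as conn:
            ticket = json.loads(conn.recv(4096).decode())
        if ticket is None:
            break
        print("would launch a GEANT4 run with parameters:", ticket)

srv = socket.socket()
srv.bind(("127.0.0.1", 5555))
srv.listen()
threading.Thread(target=serve, args=(srv,), daemon=True).start()
worker()
```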
With the arrival of the current digital era and the advancement of information transmission technologies, there has been an unprecedented rise in data. Efficient extraction of useful information from these volumes of data has garnered growing interest from academia and industry. Data mining research focuses on finding utility patterns in large datasets. However, inherent complications such as frequent scans and the creation of substantial candidate sets plague the mining process for large datasets. Distributed-architecture-based approaches also prove ineffective due to high communication overhead over iterations, and the high cost of exchanging data both locally and remotely further aggravates the situation. We propose a Communication Cost Effective Utility-based Pattern Mining (CEUPM) algorithm based on the Spark framework to address this issue. Spark accelerates iterative scanning by storing scanned datasets in a memory abstraction called resilient distributed datasets (RDDs). RDD operations require a redistribution (shuffle) of data among cluster nodes during processing, which incurs communication overhead. To minimize this cost, we adopt a search space division strategy based on data parallelism for fair and effective task allocation across cluster nodes. Experimental results on four real datasets demonstrate that CEUPM considerably reduces shuffling overhead and outperforms existing methods in terms of memory usage, communication cost, execution time, and scalability.
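A hedged illustration of the general idea of search-space division on Spark, not the CEUPM algorithm itself: each promising item becomes the owner of one slice of the search space, transactions are projected and shuffled once to the node owning that slice, and each worker then mines its share locally. The data, threshold, and single-item "utility" measure are assumptions for the example.

```python
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="search-space-division")
    # Each transaction: list of (item, utility) pairs (toy data).
    transactions = [
        [("a", 5), ("b", 2), ("c", 1)],
        [("a", 3), ("c", 4)],
        [("b", 6), ("c", 2)],
    ]
    rdd = sc.parallelize(transactions, numSlices=2)

    # Step 1: per-item utility totals, combined with a single reduce.
    item_utils = rdd.flatMap(lambda t: t).reduceByKey(lambda x, y: x + y)
    promising = set(item_utils.filter(lambda kv: kv[1] >= 5).keys().collect())

    # Step 2: project each transaction onto every promising item it contains and
    # partition by item, so each worker owns one slice of the search space.
    projected = (rdd.flatMap(lambda t: [(i, [p for p in t if p[0] in promising])
                                        for i, _ in t if i in promising])
                    .partitionBy(2))
    # Each worker would now run local mining over its projections; here we just count them.
    print(sorted(projected.countByKey().items()))
    sc.stop()
```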
This paper presents an efficient real-time person re-identification (ReID) and pedestrian tracking solution optimized for resource-constrained edge devices in multi-camera surveillance. Our key contribution is a hybrid distributed architecture that offloads lightweight detection tasks (using YOLOv10n) to edge devices, while a centralized server handles advanced feature extraction (OSNet) and robust identity tracking (ByteTrack). To improve efficiency, we integrate adaptive frame skipping on edge devices and parallel batch processing on the server. Semantic-enhanced embeddings and a memory-based retrieval mechanism improve ReID performance in crowded scenes. Additionally, we employ Apache Kafka for efficient load balancing and video stream management. Experimental results on CUHK03 and Penn-Fudan demonstrated high accuracy while maintaining real-time performance on limited-resource hardware (2 vCPU, 4 GB RAM, and Jetson Nano). These results make our approach a practical solution for real-world surveillance applications in crowded environments. Our code is available at: https://***/2uanDM/reid-pipeline.
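A simplified sketch of the edge side of such a pipeline (the library choice, topic name, and skipping heuristic are assumptions, not the authors' exact design): adaptive frame skipping on an edge device, publishing only the frames that pass the skip filter to Kafka for the central server to run feature extraction and tracking on.

```python
import json
from kafka import KafkaProducer   # pip install kafka-python

class AdaptiveSkipper:
    def __init__(self, base_skip=2, max_skip=8):
        self.skip, self.max_skip, self.counter = base_skip, max_skip, 0

    def should_process(self, num_detections_last_frame):
        # Crowded scenes -> process every frame; sparse scenes -> skip more frames.
        self.skip = 1 if num_detections_last_frame >= 5 else min(
            self.max_skip, 2 + 6 // max(1, num_detections_last_frame))
        self.counter += 1
        return self.counter % self.skip == 0

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode())
skipper = AdaptiveSkipper()
for frame_id in range(100):
    detections = 3                          # stand-in for the detector's output on this frame
    if skipper.should_process(detections):
        producer.send("reid-frames", {"camera": "cam01", "frame": frame_id, "boxes": []})
producer.flush()
```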
Deep learning's widespread adoption in various fields has made distributed training across multiple computing nodes essential. However, frequent communication between nodes can significantly slow down training speed, creating a bottleneck in distributed training. To address this issue, researchers are focusing on communication optimization algorithms for distributed deep learning systems. In this paper, we propose a standard that systematically classifies all communication optimization algorithms based on mathematical modeling, which is not achieved by existing surveys in the field. We categorize existing works into four categories based on the optimization strategies of communication: communication masking, communication compression, communication frequency reduction, and hybrid optimization. Finally, we discuss potential future challenges and research directions in the field of communication optimization algorithms for distributed deep learning systems.
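As a toy example of one of the categories above, communication compression (not code from the survey itself): top-k gradient sparsification with local error feedback, so each worker sends only the k largest-magnitude gradient entries per step and accumulates the rest locally.

```python
import numpy as np

class TopKCompressor:
    def __init__(self, shape, k):
        self.k = k
        self.residual = np.zeros(shape)        # error feedback: accumulates what was not sent

    def compress(self, grad):
        corrected = grad + self.residual
        idx = np.argpartition(np.abs(corrected.ravel()), -self.k)[-self.k:]
        values = corrected.ravel()[idx]
        self.residual = corrected.copy()
        self.residual.ravel()[idx] = 0.0       # entries that were sent leave the residual
        return idx, values                     # this sparse pair is what goes on the wire

def decompress(idx, values, shape):
    out = np.zeros(shape)
    out.ravel()[idx] = values
    return out

comp = TopKCompressor(shape=(4, 4), k=3)
grad = np.random.randn(4, 4)
idx, vals = comp.compress(grad)
print(decompress(idx, vals, (4, 4)))
```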