ISBN (print): 9781450380751
Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) they require intra-thread-block synchronization, which has a nontrivial cost, 2) they must choose between small tiles that require more overlapped computation or large tiles that increase shared memory accesses (and lower occupancy), and 3) their autoscheduling algorithms use simplified GPU models that can result in inefficient global memory accesses. We present a new approach for executing image processing pipelines on GPUs that addresses these limitations as follows. 1) We fuse loops to form overlapped tiles that fit in a single warp, which allows us to use lightweight warp synchronization. 2) We introduce hybrid tiling, which stores overlapped regions in a combination of thread-local registers and shared memory. Hybrid tiling thus either increases occupancy by decreasing shared memory usage or decreases overlapped computation by enabling larger tiles. 3) We present an automatic loop fusion algorithm that considers several factors affecting the performance of GPU kernels. We implement these techniques in PolyMage-GPU, a new GPU backend for PolyMage. Our approach produces code that is faster than Halide's manual schedules: 1.65x faster on an NVIDIA GTX 1080Ti and 1.33x faster on an NVIDIA Tesla V100.
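The small-tile/large-tile tension described above can be made concrete with a toy cost model. The sketch below is illustrative only (it is not PolyMage-GPU's model): for a pipeline of fused stencil stages, the input tile must be enlarged by a halo on each side, so the fraction of redundant (overlapped) work grows as output tiles shrink. The function names and the 1-D setting are assumptions for illustration.

```python
# Toy model of overlapped tiling for a 1-D pipeline of fused stencil stages.
# Fusing `stages` stages of radius `radius` means each output tile of width
# `tile` must read a halo of `stages * radius` extra pixels on each side.

def overlapped_tile_width(tile, stages, radius):
    """Input pixels needed to produce `tile` output pixels after fusion."""
    return tile + 2 * stages * radius

def redundancy(tile, stages, radius):
    """Fraction of the loaded input tile that is halo (overlapped) work."""
    full = overlapped_tile_width(tile, stages, radius)
    return (full - tile) / full

# A warp-sized tile pays proportionally more overlapped computation
# than a large tile, which is exactly the trade-off hybrid tiling targets.
small = redundancy(tile=32, stages=3, radius=1)
large = redundancy(tile=256, stages=3, radius=1)
assert small > large
```

Hybrid tiling sidesteps this trade-off by splitting the overlapped region between registers and shared memory, rather than forcing a single choice of tile size.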
High-Performance Computing (HPC) is a fundamental tool for improving the performance of many algorithms in terms of time, especially for large-scale problems. In recent years, various HPC architectures have been dev...
ISBN (print): 9789897584244
Edge computing extends cloud computing capabilities to the edge of the network, allowing, for instance, Internet-of-Things (IoT) applications to process computation more locally and thus more efficiently. We aim to minimize latency and delay in edge architectures. We focus on an advanced architectural setting that takes communication and processing delays into account in addition to the actual request execution time in a performance engineering scenario. Our architecture is based on a multi-cluster edge layer with local, independent edge node clusters. We argue that particle swarm optimization, as a bio-inspired optimization approach, is an ideal candidate for distributed load processing in semi-autonomous edge clusters for IoT management. By designing a controller and using a particle swarm optimization algorithm, we demonstrate that processing and propagation delay, as well as end-to-end latency (i.e., total response time), can be optimized.
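To illustrate the optimization approach the abstract names, here is a minimal particle swarm optimizer applied to a toy end-to-end latency function. This is a generic textbook PSO, not the paper's controller; the latency model (a quadratic in the load split across two hypothetical edge clusters) and all parameter values are assumptions for illustration.

```python
import random

def pso(cost, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimization: each particle is pulled toward
    its personal best and the global best position found so far."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_cost[i])
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return gbest, gbest_cost

# Toy delay model: splitting load fraction x[0] between two edge clusters
# with different processing + propagation costs (illustrative only).
def latency(x):
    return 2.0 * x[0] ** 2 + 3.0 * (1.0 - x[0]) ** 2

best, best_cost = pso(latency, dim=1)
```

On this convex toy model the analytic optimum is a 0.6/0.4 split with latency 1.2, which the swarm converges to within a few dozen iterations.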
ISBN (print): 9783030602482; 9783030602475
Tangle is a novel directed acyclic graph (DAG)-based distributed ledger preferred over traditional linear ledgers in blockchain applications because of better transaction throughput. Earlier techniques have mostly focused on comparing the performance of graph chains over linear chains and incorporating the Markov Chain Monte Carlo process in probabilistic traversals to detect unverified transactions in DAG chains. In this paper, we present a parallel detection method for unverified transactions. Experimental evaluation of the proposed parallel technique demonstrates a significant, scalable average speed-up of close to 70%, and a peak speed-up of approximately 73% for a large number of transactions.
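The core task the abstract describes, detecting unverified transactions in a DAG ledger, can be sketched as a data-parallel scan: a transaction is unverified (a "tip") if no other transaction approves it. The sketch below is a simplified stand-in for the paper's technique; the tangle representation (a dict mapping each transaction to the transactions it approves) and the thread-based partitioning are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def find_unverified(approvals, workers=4):
    """approvals: dict tx -> list of transactions that tx approves.
    Partition the transaction set across workers; each worker collects the
    set of approved transactions in its chunk, and the union of those sets
    identifies the unverified tips by complement."""
    txs = list(approvals)
    chunks = [txs[i::workers] for i in range(workers)]

    def scan(chunk):
        seen = set()
        for tx in chunk:
            seen.update(approvals[tx])
        return seen

    approved = set()
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for s in ex.map(scan, chunks):
            approved |= s
    return {tx for tx in approvals if tx not in approved}

# Tiny tangle: genesis g; c and d approve earlier txs but are themselves
# unapproved, so they are the tips.
tangle = {"g": [], "a": ["g"], "b": ["g"], "c": ["a", "b"], "d": ["a"]}
tips = find_unverified(tangle)
```

The scan over each chunk is independent, which is what makes the detection step embarrassingly parallel; in a real setting process-based workers or GPU kernels would replace the thread pool.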
ISBN (print): 9783030602451; 9783030602444
Vehicle Routing Problems (VRPs) are well-known combinatorial optimization problems used to design an optimal route for a fleet of vehicles to service a set of customers under a number of constraints. Due to their NP-hard complexity, a number of purely computational techniques have been proposed in recent years to solve them. Among these techniques, nature-inspired algorithms have proven their effectiveness in terms of accuracy and convergence speed. Some of these methods are also designed to decompose the basic problem into a number of sub-problems which are subsequently solved in parallel computing environments. The purpose of this paper is therefore to review the recent literature on the main approaches proposed over the past few years to solve combinatorial optimization problems in general and, in particular, the VRP and its different variants. Bibliometric and review studies are conducted, with special attention paid to metaheuristic strategies involving procedures with parallel architectures. The obtained results show an expansion in the use of parallel algorithms for solving various VRPs. Nevertheless, the regression in the number of citations in this framework suggests that the interest of the research community has declined somewhat in recent years. This decline may be explained by the lack of rigorous mathematical results and of practical interfaces in well-known computational software.
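The decomposition strategy mentioned above, splitting a VRP into sub-problems solved in parallel, can be illustrated with a classic cluster-first, route-second scheme. This is a generic sweep-style heuristic, not any specific method from the surveyed literature; the function names and the greedy nearest-neighbour routing are assumptions for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def nearest_neighbour_route(depot, customers):
    """Greedy sub-problem solver: repeatedly visit the closest
    unvisited customer, starting from the depot."""
    route, remaining, cur = [], list(customers), depot
    while remaining:
        nxt = min(remaining, key=lambda c: math.dist(cur, c))
        route.append(nxt)
        remaining.remove(nxt)
        cur = nxt
    return route

def sweep_and_route(depot, customers, vehicles=2):
    """Cluster-first, route-second: sort customers by polar angle around
    the depot, cut the sweep into contiguous sectors (one per vehicle),
    then route each sector as an independent, parallel sub-problem."""
    by_angle = sorted(customers,
                      key=lambda c: math.atan2(c[1] - depot[1], c[0] - depot[0]))
    size = -(-len(by_angle) // vehicles)  # ceiling division
    clusters = [by_angle[i:i + size] for i in range(0, len(by_angle), size)]
    with ThreadPoolExecutor() as ex:
        return list(ex.map(lambda cl: nearest_neighbour_route(depot, cl),
                           clusters))

routes = sweep_and_route((0, 0), [(1, 1), (2, 0), (0, 2), (-1, 1)])
```

Each sector's routing is independent of the others, which is what makes the decomposition amenable to parallel architectures; real solvers replace the greedy router with a metaheuristic per sub-problem.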
ISBN (print): 9783030715939; 9783030715922
Scientific workflows are increasingly important for complex scientific applications. Recently, Function as a Service (FaaS) has emerged as a platform for processing non-interactive tasks. FaaS offerings (such as AWS Lambda and Google Cloud Functions) can play an important role in processing scientific workflows, and a number of works have demonstrated their ability to do so. However, some issues arise when workflows are executed on cloud functions due to their limits (e.g., stateless behaviour). A major issue is the additional data transfer between object storage and the FaaS invocation environment during execution, which leads to increased communication costs. DEWE v3 is a Workflow Management System (WMS) that already has foundations for processing workflows with cloud functions. In this paper, we modify the job dispatch algorithm of DEWE v3 in a function environment to reduce data dependency transfers. Our modified algorithm schedules jobs with precedence constraints to be executed in a single function invocation, so that later jobs can utilise output files generated by their predecessor jobs in the same invocation. This reduces the makespan of workflow execution. We have evaluated the improved scheduling algorithm and the original with small- and large-scale Montage workflows. The experimental results show that our algorithm reduces the overall makespan by about 10% compared to the original DEWE v3.
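The dispatch idea described above, co-locating precedence-constrained jobs in one invocation so intermediate files never round-trip through object storage, can be sketched as chain grouping over the workflow DAG. This is an illustrative reconstruction, not DEWE v3's actual algorithm; the DAG encoding (job -> list of predecessors) and the linearity condition are assumptions.

```python
def group_chains(deps):
    """deps: dict job -> list of predecessor jobs.
    Merge maximal linear chains (each job has exactly one successor, and
    that successor has exactly one predecessor) into a single group, so
    one function invocation can run the whole chain and pass intermediate
    files through local storage instead of object storage."""
    successors = {j: [] for j in deps}
    for j, preds in deps.items():
        for p in preds:
            successors[p].append(j)

    def is_head(j):
        # A job starts a chain unless it is the sole, linear successor
        # of its single predecessor (then it belongs to that chain).
        return not (len(deps[j]) == 1 and len(successors[deps[j][0]]) == 1)

    groups = []
    for j in deps:
        if not is_head(j):
            continue
        chain = [j]
        while len(successors[chain[-1]]) == 1:
            nxt = successors[chain[-1]][0]
            if len(deps[nxt]) != 1:
                break  # fan-in: nxt needs another invocation's output
            chain.append(nxt)
        groups.append(chain)
    return groups

# Diamond with a tail: a -> (b, c) -> d -> e.
# Only d -> e is a linear chain, so d and e share one invocation.
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
groups = group_chains(deps)
```

Fan-out and fan-in points still cross invocation boundaries (their data must go through object storage), but every purely linear segment collapses into one invocation, which is where the transfer savings come from.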
ISBN (print): 9783030389611; 9783030389604
Parallel data platforms are recognized as a key solution for processing analytical queries running on extremely large data warehouses (DWs). Deploying a DW on such platforms requires efficient data partitioning and allocation techniques. Most of these techniques assume a priori knowledge of the workload. To deal with workload evolution, reactive strategies are mainly used. The BI 2.0 requirements have put large batch and ad-hoc user queries at the center. Consequently, reactive solutions for deploying a DW on parallel platforms are no longer sufficient. Autonomic computing has emerged as a paradigm that allows digital objects to manage themselves in accordance with high-level guidance by means of proactive approaches. Inspired by this paradigm, we propose in this paper a proactive approach, based on a query clustering model, for deploying a DW over a parallel platform. The query clustering triggers the partitioning and allocation processes by considering only evolved query groups. Intensive experiments were conducted to show the efficiency of our proposal.
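A query clustering model of the kind the abstract relies on can be sketched with a simple greedy scheme: represent each query by the set of attributes it accesses, and group queries whose attribute sets are similar, so that repartitioning is triggered only when a group evolves. This is an illustrative stand-in for the paper's model; the Jaccard similarity, the threshold, and the workload are assumptions.

```python
def jaccard(a, b):
    """Set similarity in [0, 1]; 1.0 for two empty sets by convention."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_queries(queries, threshold=0.5):
    """queries: dict query_id -> set of attributes the query accesses.
    Greedy clustering: a query joins the first cluster whose representative
    attribute set is similar enough, otherwise it seeds a new cluster.
    A newly seeded cluster is an 'evolved group' that would trigger the
    partitioning/allocation process for just that group."""
    clusters = []  # list of (representative_attr_set, member_query_ids)
    for qid, attrs in queries.items():
        for rep, members in clusters:
            if jaccard(rep, attrs) >= threshold:
                members.append(qid)
                break
        else:
            clusters.append((set(attrs), [qid]))
    return clusters

# Toy star-schema workload: two date/store queries cluster together,
# while the customer/city query forms its own (new) group.
workload = {
    "q1": {"date", "store"},
    "q2": {"date", "store", "product"},
    "q3": {"customer", "city"},
}
groups = cluster_queries(workload)
```

Because only the clusters that change need their fragments repartitioned and reallocated, the proactive deployment avoids re-running the full partitioning pipeline on every workload shift.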
Most existing optimization methods for neural architecture search (NAS), including evolutionary algorithms, reinforcement learning and gradient-based approaches, have not employed memory strategies explicitly, which m...
Population-based search algorithms, such as the Differential Evolution approach, evolve a pool of candidate solutions during the optimization process and are suitable for massively parallel architectures promoted by t...
Graphics processing units (GPUs) are widely used in the area of scientific computing. While GPUs provide much higher peak performance, efficient implementation of real applications on GPU architectures is still a non-trivial task, and it is crucial to devise efficient solution algorithms that can better utilize these architectures. This paper presents our efforts in parallelizing and optimizing LESAP, a CFD application for scramjet combustion simulation, on NVIDIA GPUs. The GPU parallelization is based on the CUDA programming model, with a data-parallel implicit time-marching method that is efficient on the GPU architecture. Furthermore, shared memory and redundant calculation are employed to reduce memory access overhead during GPU computation, and data transfer between CPU and GPU is optimized by packing the data to be transferred. The experimental results show that the GPU version, when run on four V100 GPUs, achieves a speedup of 11.26x over the CPU version running on two 24-core Intel Skylake Gold 6240R CPUs. Excellent parallel scalability across multiple GPUs is also observed.
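The transfer-packing optimization mentioned above rests on a simple observation: every host-device copy pays a fixed launch latency on top of its per-byte cost, so one large packed copy beats many small ones. The sketch below is a toy cost model, not LESAP's implementation; the latency and bandwidth figures are invented for illustration.

```python
def transfer_cost(sizes, latency_us=10.0, us_per_kb=0.1):
    """Model each host<->device transfer as a fixed launch latency plus a
    per-byte cost; `sizes` lists the byte count of each separate transfer."""
    return sum(latency_us + s / 1024 * us_per_kb for s in sizes)

# Four CFD field arrays copied one at a time pay the fixed latency four
# times; packing them into one contiguous buffer pays it once.
fields = [4096, 8192, 2048, 1024]
unpacked = transfer_cost(fields)           # one transfer per field
packed = transfer_cost([sum(fields)])      # single packed transfer
assert packed < unpacked
```

The per-byte cost is identical in both cases; the entire saving comes from amortizing the fixed per-transfer latency, which is why packing pays off most when the individual arrays are small.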