ISBN:
(Print) 9781728181042
Modern multi-core servers are powerful enough to process multi-gigabit live packet streams on the network data plane. However, in most cases network programmers must build their applications from scratch, implementing both the interfaces towards the lower hardware level and the proper mechanisms for parallel programming. Data Stream Processing (DaSP) frameworks have recently emerged as promising approaches to overcome these issues and let programmers focus solely on the logic of the application to develop. However, DaSP platforms are generally not designed for the networking domain, in terms of both performance and functionality. In this paper, we selected the WindFlow DaSP framework and built suitable extensions to attach multiple (accelerated) packet data sources to it. We then implemented a simple monitoring application on top of WindFlow and carried out stress tests with synthetic and real traffic. The results show that performance scales linearly with the number of processing cores, so that the application was able to process the entire live data stream at rates up to nearly 20 Gbps.
ISBN:
(Print) 9781665404761
The article discusses a way to improve the efficiency of complex query execution in modern DBMSs. The method is based on the use of tree structures, key hash codes, and optimization based on partitioning. As an additional aspect of optimization, the parallel operation of the proposed method is described.
ISBN:
(Print) 9781450382984
Inspired by earlier work on Augur, Vate is a probabilistic programming language for the construction of JVM-based probabilistic models with an object-oriented interface. As a compiled language, it is able to examine the dependency graph of the model to produce optimised code that can be dynamically targeted to different platforms. Using Gibbs sampling, Metropolis-Hastings, and variable marginalisation, it can handle a range of model types and is able to efficiently infer values, estimate probabilities, and execute models.
ISBN:
(Print) 9783030869601; 9783030869595
Despite the widespread use of agent-based models in ecological modeling and several other areas, modelers have been concerned about how time-consuming these models are. This paper presents a strategy to parallelize an agent-based model of the spatial distribution of biological species, operating in a multi-stage synchronous distributed-memory mode, as a way to gain performance while reducing the need for synchronization. A multiprocessing implementation divides the environment (a rectangular grid corresponding to the study area) into stage-subsets, according to the number of defined or available processes. To ensure that there is no information loss, each stage-subset is extended with an overlapping section from each of its neighbouring stage-subsets. The effect of the size of this overlap on the quality of the simulations is studied. The results indicate that it is possible to establish an optimal trade-off between the level of redundancy and the synchronization frequency. The reported parallelization method was tested on a standalone multicore machine but should scale seamlessly to a computation cluster.
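The overlapping stage-subset scheme the abstract describes can be sketched roughly as follows; the row-wise split, grid size, process count, and overlap width here are illustrative assumptions, not parameters taken from the paper.

```python
def split_with_overlap(n_rows, n_parts, overlap):
    """Return (lo, hi) row ranges for each stage-subset of an n_rows-tall
    grid, each range extended by `overlap` rows into its neighbouring
    subsets so that boundary agents are simulated without information loss."""
    step = n_rows // n_parts
    bounds = []
    for i in range(n_parts):
        lo = max(0, i * step - overlap)            # extend into left/upper neighbour
        hi = min(n_rows, (i + 1) * step + overlap)  # extend into right/lower neighbour
        bounds.append((lo, hi))
    return bounds

# A 12-row study area split among 3 processes with a 2-row overlap:
print(split_with_overlap(12, 3, 2))  # [(0, 6), (2, 10), (6, 12)]
```

A larger `overlap` increases redundancy (rows simulated by two processes) but lets processes run longer between synchronizations, which is exactly the trade-off the paper studies.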
ISBN:
(Print) 9781665423694
Emphasis on static timing analysis (STA) has shifted from graph-based analysis (GBA) to path-based analysis (PBA) for reducing unwanted slack pessimism. However, it is extremely time-consuming for a PBA engine to analyze a large set of critical paths. Recent years have seen many parallel PBA applications, but most of them are limited to CPU parallelism and do not scale beyond a few threads. To overcome this challenge, we propose in this paper a high-performance graphics processing unit (GPU)-accelerated PBA framework that efficiently analyzes the timing of a generated critical path set. We represent the path set in three dimensions (timing test, critical path, and pin) to structure the granularity of parallelism, making it scalable to arbitrary problem sizes. In addition, we leverage task-based parallelism to decompose the PBA workload into dependent CPU-GPU tasks in which kernel computation and data processing overlap efficiently. Experimental results show that our framework, applied to an important PBA application, can speed up the state-of-the-art baseline by up to 10x on a million-gate design.
Many graph optimization problems, such as the Maximum Weighted Independent Set problem, are NP-hard. For large-scale graphs that have billions of edges or vertices, these problems are hard to compute directly even u...
ISBN:
(Print) 9781665435550
FPGA-based accelerators have received increasing attention recently. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large workloads. To achieve that, FPGAs need to be interconnected in a Multi-FPGA architecture. However, programming such an architecture is a challenging endeavor. This paper extends the OpenMP task-based computation offloading model to enable several FPGAs to work as a single Multi-FPGA architecture. Experimental results, for a set of OpenMP stencil applications running on a Multi-FPGA platform, show close-to-linear speedups as the number of FPGAs and IP-cores per FPGA increases.
ISBN:
(Print) 9781450384414
Coherence-induced cache misses are an important factor limiting the scalability of shared-memory parallel programs. Many coherence misses are avoidable, namely misses due to false sharing, when different threads write to different memory addresses contained within the same cache block, causing unnecessary invalidations. Past work has proposed numerous ways to mitigate false sharing, from coherence protocols optimized for certain sharing patterns to software tools for false-sharing detection and repair. Our work leverages approximate computing and store-value similarity in error-tolerant multi-threaded applications. We introduce a novel cache coherence protocol that implements an approximate store instruction and coherence states allowing some limited incoherence within approximatable shared data, mitigating both coherence misses and coherence traffic for various sharing patterns. For applications from the Phoenix and AxBench suites, we see dynamic energy improvements within the NoC and memory hierarchy of up to 50.1% and speedups of up to 37.3%, with low output error for approximate applications that exhibit false sharing.
ISBN:
(Print) 9781665404761
The article considers an algorithm for the optimal distribution of "objects" of arbitrary nature among "storages", where the meaning of both is determined by the subject area. Several subject areas for which the optimal distribution problem is relevant are reviewed. In particular, the authors consider the problem of accelerating the Join operation. In parallel processing of big data, the Join operation requires uniform distribution of data among the cluster processors: a parallel implementation of Join is effective only when the computational complexities of its execution over the individual database fragments differ minimally from each other. The optimality criterion should therefore ensure uniform distribution of the data. A detailed description of the heuristic optimal distribution algorithm is given, and objective functions for the problems under consideration are proposed. Experiments assessing the quality of the heuristic greedy distribution algorithm are described; they yield the dependence of the algorithm's execution time on the number of distributed objects, and of the distribution quality (the difference between the maximum and minimum storage capacity) on the number of storages and on the interval of object weights. The algorithm is quite simple and can easily be implemented in any programming language. Its running time, even for big data, is small, which allows it to be used effectively when preparing data for the parallel solution of problems with high computational complexity. The algorithm shows good results when distributing tables-operands across data warehouses: the largest storage capacity differs from the smallest by only a small amount.
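The abstract does not spell out the heuristic itself, but a common greedy scheme consistent with its description (place the heaviest remaining object into the currently least-loaded storage, then report the max-min capacity spread) can be sketched as follows; the function name and sample weights are illustrative assumptions.

```python
import heapq

def greedy_distribute(weights, n_storages):
    """Greedy heuristic: assign each object, heaviest first, to the
    currently least-loaded storage, keeping storage capacities nearly
    uniform. Returns the per-storage object lists and the spread
    (difference between the largest and smallest storage capacity)."""
    heap = [(0, i, []) for i in range(n_storages)]  # (load, storage id, objects)
    heapq.heapify(heap)
    for w in sorted(weights, reverse=True):
        load, i, objs = heapq.heappop(heap)  # least-loaded storage
        objs.append(w)
        heapq.heappush(heap, (load + w, i, objs))
    loads = sorted(load for load, _, _ in heap)
    return [objs for _, _, objs in heap], loads[-1] - loads[0]

# Six objects of total weight 24 distributed among 3 storages:
_, spread = greedy_distribute([7, 5, 4, 3, 3, 2], 3)
print(spread)  # 2 (capacities 9, 8, 7)
```

Each object is placed with one heap pop and push, so the running time is O(n log m) for n objects and m storages, matching the abstract's claim that the algorithm stays fast even on big data.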
ISBN:
(Print) 9781665411103
Graphics Processing Units (GPUs) have become a key technology for accelerating node performance in supercomputers, including the US Department of Energy's forthcoming exascale systems. Since the execution model for GPUs differs from that for conventional processors, applications need to be rewritten to exploit GPU parallelism. Performance tools are needed for such GPU-accelerated systems to help developers assess how well applications offload computation onto GPUs. In this paper, we describe extensions to Rice University's HPCToolkit performance tools that support measurement and analysis of Intel's DPC++ programming model for GPU-accelerated systems atop an implementation of the industry-standard OpenCL framework for heterogeneous parallelism on Intel GPUs. HPCToolkit supports three techniques for performance analysis of programs atop OpenCL on Intel GPUs. First, HPCToolkit supports profiling and tracing of OpenCL kernels. Second, HPCToolkit supports CPU-GPU blame shifting for OpenCL kernel executions, a profiling technique that can identify code that executes on one or more CPUs while GPUs are idle. Third, HPCToolkit supports fine-grained measurement, analysis, and attribution of performance metrics to OpenCL GPU kernels, including instruction counts, execution latency, and SIMD waste. The paper describes these capabilities and then illustrates their application in case studies with two applications that offload computations onto Intel GPUs.