ISBN:
(Print) 9780769535449
The efficient scheduling of large mixed parallel applications is challenging. Most existing algorithms utilize scheduling heuristics and approximation algorithms to determine a good schedule as a basis for an efficient execution in large-scale scientific computing. This paper concentrates on the scheduling of mixed parallel applications represented by task graphs with parallel tasks and precedence constraints between them. Layer-based scheduling algorithms for homogeneous target platforms are improved by adding a move-blocks phase that further reduces the resulting parallel runtime. The layer-based scheduling approach is described and the move-blocks algorithm is introduced in detail. The move-blocks extension provides better scheduling results for small as well as for large problems, with only a small increase in scheduling time. This is shown by a comparison of the modified and the original algorithms over a wide range of test cases.
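For illustration, the sketch below shows only the layer decomposition that such layer-based schedulers start from, i.e., grouping tasks whose precedence constraints are already satisfied into layers of independent parallel tasks; the move-blocks phase itself is the paper's contribution and is not reproduced here. Function and variable names are illustrative assumptions.

```python
# Hypothetical sketch: grouping a task graph into layers of independent
# parallel tasks, the starting point of layer-based scheduling.
def build_layers(tasks, deps):
    """tasks: iterable of task ids; deps: dict task -> set of predecessor tasks."""
    remaining = set(tasks)
    done = set()
    layers = []
    while remaining:
        # A task is ready when all of its predecessors are already scheduled.
        layer = {t for t in remaining if deps.get(t, set()) <= done}
        if not layer:
            raise ValueError("cycle in task graph")
        layers.append(sorted(layer))
        done |= layer
        remaining -= layer
    return layers

if __name__ == "__main__":
    deps = {"C": {"A", "B"}, "D": {"A"}, "E": {"C", "D"}}
    print(build_layers(["A", "B", "C", "D", "E"], deps))
    # [['A', 'B'], ['C', 'D'], ['E']]
```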
ISBN:
(Print) 9780769549392; 9781467353212
Collision checking takes most of the time in sampling-based path planning algorithms. When the scene gets crowded, more samples are needed and the probability of finding a collision-free sample decreases. Broad-phase algorithms are designed to eliminate obviously collision-free samples, so that narrow-phase algorithms can concentrate on the fewer samples suspected to be in collision. In this study, we compare the performance of two broad-phase algorithms implemented on both CPU and GPU. A novel technique is proposed to provide load balancing and efficient cache utilization in the Bounding Sphere Collision Detection algorithm. Furthermore, the Thrust library is extensively utilized in the Sweep and Prune (SAP) algorithm. Our experimental results indicate speedups of up to 103x for the GPU-based SAP algorithm and 134x for the GPU-based Bounding Sphere algorithm, compared to the CPU implementations. This may allow using sampling-based path planning algorithms for scenes with many robots.
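As background, a minimal single-axis Sweep and Prune pass is sketched below on the CPU; the paper's GPU implementations with Thrust and the bounding-sphere load-balancing technique are not reproduced, and all names and data are illustrative.

```python
# Minimal single-axis Sweep and Prune (SAP) broad phase, CPU-only sketch.
def sweep_and_prune(boxes):
    """boxes: list of (min_x, max_x) intervals; returns candidate overlapping pairs."""
    # Sort objects by interval start along the sweep axis.
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][0])
    active, pairs = [], []
    for i in order:
        lo, _ = boxes[i]
        # Drop intervals that end before this one starts: they cannot overlap.
        active = [j for j in active if boxes[j][1] >= lo]
        # Every interval still active overlaps the current one on this axis.
        pairs.extend((j, i) for j in active)
        active.append(i)
    return pairs

print(sweep_and_prune([(0, 2), (1, 3), (5, 6)]))  # [(0, 1)]
```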
ISBN:
(Print) 9781728165820
Network Function Virtualization enables operators to schedule diverse network-processing workloads on a general-purpose hardware infrastructure. However, short-lived processing peaks make an efficient dimensioning of processing resources under stringent tail latency constraints challenging. To reduce dimensioning overheads, several load balancing approaches, which adaptively steer network traffic either to a group of servers or to their internal CPU cores, have been investigated separately. In this paper, we present Inter-Server RSS (isRSS), a hardware mechanism built on top of Receive Side Scaling in the network interface card, which combines intra- and inter-server load balancing. In a first step, isRSS targets a balanced utilization of processing resources by steering packet bursts to CPU cores based on per-core load feedback. If all local CPU cores are highly loaded, isRSS avoids high queueing delays by redirecting newly arriving packet bursts to other servers that execute the same network functions, exploiting the fact that processing peaks are unlikely to occur at all servers at the same time. Our evaluation based on real-world network traces shows that, compared to Receive Side Scaling, the joint intra- and inter-server load balancing approach is able to reduce the processing capacity dimensioned for network function execution by up to 38.95% and limit packet reordering to 0.0589% while maintaining tail latencies.
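A rough sketch of the two-level steering decision described above follows; it is an illustrative model, not the isRSS hardware logic, and the threshold, load values, and server names are assumptions.

```python
# Illustrative sketch: steer a packet burst to the least-loaded local core,
# and only redirect to a remote server when all local cores exceed a load
# threshold. Threshold and load values are hypothetical.
def steer_burst(core_loads, remote_servers, threshold=0.8):
    least_loaded = min(range(len(core_loads)), key=lambda c: core_loads[c])
    if core_loads[least_loaded] < threshold:
        return ("local_core", least_loaded)
    # All local cores are highly loaded: redirect to another server that runs
    # the same network functions, assuming its processing peak does not coincide.
    target = min(remote_servers, key=lambda s: s["load"])
    return ("remote_server", target["name"])

print(steer_burst([0.9, 0.85, 0.95], [{"name": "nf-2", "load": 0.4}]))
# ('remote_server', 'nf-2')
```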
ISBN:
(Print) 9780769543284
Future many-core chips are envisioned to feature up to a thousand cores on a chip. With an increasing number of cores on a chip, the problem of distributing load becomes more prevalent. Even if a piece of software is designed to exploit parallelism, it is not easy to place parallel tasks on the cores so as to achieve maximum performance. This paper proposes a connectivity-sensitive algorithm for static task placement onto a 2D mesh of interconnected cores. The decreased feature sizes of future VLSI chips will increase the number of permanent and transient faults. To accommodate partially faulty hardware, the algorithm is designed to allow placement on irregular core structures, in particular meshes with faulty nodes and links. The quality of the placement is measured by comparing the results to two baseline algorithms in terms of communication efficiency.
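As an illustration of how communication efficiency can be judged on a degraded mesh, the sketch below scores a given placement by traffic-weighted hop distance computed over healthy cores only (faulty links could be handled analogously); the connectivity-sensitive placement heuristic itself is not reproduced, and the mesh size, traffic, and faulty set are assumed values.

```python
# Hypothetical cost model for a placement on a mesh with faulty nodes:
# total traffic-weighted hop distance, computed by BFS over healthy cores.
from collections import deque

def hop_distance(src, dst, size, faulty):
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        (x, y), d = frontier.popleft()
        if (x, y) == dst:
            return d
        for n in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= n[0] < size and 0 <= n[1] < size and n not in faulty and n not in seen:
                seen.add(n)
                frontier.append((n, d + 1))
    return float("inf")  # destination unreachable on the degraded mesh

def placement_cost(traffic, placement, size, faulty):
    """traffic: dict (task_a, task_b) -> volume; placement: task -> (x, y)."""
    return sum(v * hop_distance(placement[a], placement[b], size, faulty)
               for (a, b), v in traffic.items())

placement = {"t0": (0, 0), "t1": (2, 0)}
print(placement_cost({("t0", "t1"): 10}, placement, size=3, faulty={(1, 1)}))  # 20
```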
ISBN:
(Print) 9781467387767
This paper focuses on speeding up an accurate analysis of fault trees using stochastic logic on GPGPUs. Specifically, probability models of dynamic gates and new accurate models for different combinations of cold spare gates, e.g., two cold spare gates with a shared spare and a cold spare gate with more than one spare input, are developed in this paper. Experimental results show that, on average, the proposed analysis method is 235 times faster than the CPU simulation. Moreover, the new stochastic models offer accuracy and simplicity as additional advantages of the proposed method.
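For background only, the sketch below shows the basic stochastic-logic idea of encoding failure probabilities as random bit streams and combining them with ordinary gates (here an AND gate for independent basic events); the paper's cold-spare gate models and GPGPU implementation are not reproduced, and stream length is an example value.

```python
# Stochastic logic sketch: probabilities as random bit streams; an AND gate
# estimates the product of independent failure probabilities. Longer streams
# trade runtime for accuracy.
import random

def stream(p, n):
    return [random.random() < p for _ in range(n)]

def and_gate(a, b):
    return [x and y for x, y in zip(a, b)]

def probability(bits):
    return sum(bits) / len(bits)

random.seed(0)
n = 100_000
pa, pb = 0.2, 0.1
est = probability(and_gate(stream(pa, n), stream(pb, n)))
print(f"estimated {est:.4f} vs exact {pa * pb:.4f}")
```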
ISBN:
(Print) 9780769539393
This paper proposes a strategy to organize metric-space query processing in multi-core search nodes, as understood in the context of search engines running on clusters of computers. The strategy is applied in each search node to process all active queries visiting the node as part of their solution, which, in general, is computed for each query from the contributions of the individual search nodes. When query traffic is high enough, the proposed strategy assigns one thread to each query and lets the threads work in a fully asynchronous manner. When query traffic is moderate or low, some threads start to idle, so they are put to work on queries being processed by other threads. The strategy solves the associated synchronization problem among threads by switching query processing into a bulk-synchronous mode of operation. This simplifies the dynamic re-organization of threads; the overheads are very small, with the advantage that the overall workload is evenly distributed across all threads.
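A simplified sketch of the mode switch described above is given below; the threshold rule and the way work is partitioned in a superstep are assumptions made for illustration, not the paper's exact scheme.

```python
# Illustrative mode selection: with enough in-flight queries each thread owns
# one query; otherwise processing flips to a bulk-synchronous mode in which all
# threads cooperate on the pending queries in supersteps (barriers omitted).
def choose_mode(active_queries, num_threads):
    return "async" if active_queries >= num_threads else "bulk_synchronous"

def bulk_synchronous_round(queries, num_threads):
    # Superstep: spread the work of the few active queries across all threads.
    assignments = {t: [] for t in range(num_threads)}
    for i, q in enumerate(queries):
        assignments[i % num_threads].append(q)
    return assignments

print(choose_mode(active_queries=64, num_threads=32))       # async
print(bulk_synchronous_round(["q1", "q2"], num_threads=4))  # threads 0 and 1 get work
```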
ISBN:
(Print) 9781509060580
Though parallel processing of applications related to video processing and rendering is widespread, parallel processing of audio synthesis software (also known as soft-synths) is not much researched. This paper addresses our research and experiments in parallelizing digital audio synthesizers on a commodity multicore platform, where these synthesizers are most commonly used. Soft real-time requirements and the overheads of parallelization are two of the competing forces in this research. As a case study, the ALSA (Advanced Linux Sound Architecture) Modular Audio Synthesizer (AMS) is evaluated. AMS employs a modular approach to digital music synthesis, is often part of a standard Linux installation package, has a GUI for user interactions, and like other audio synthesizers it has a soft real-time requirement. The main intention of parallelization is to enhance throughput and hence stability, whereby more complex and higher-quality audio can be generated. The GUI-based interactive approach adds the extra challenge of a dynamic call graph that can change on the fly. The paper compares the pros and cons of the different techniques adopted and highlights the advantages of parallelization. The lessons learnt can also be used in parallelizing other existing audio synthesizers and in designing new parallel synthesizers from scratch.
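As a back-of-the-envelope illustration of the soft real-time requirement mentioned above, the snippet below computes the per-block deadline a parallelized synthesizer must meet; the buffer size and sample rate are example values, not figures from the paper.

```python
# One audio buffer must be synthesized before the previous one finishes playing.
def block_deadline_ms(buffer_frames=256, sample_rate=48_000):
    return 1000.0 * buffer_frames / sample_rate

def meets_deadline(synthesis_ms, buffer_frames=256, sample_rate=48_000):
    # Parallelization pays off only if per-block synthesis time (including its
    # scheduling overhead) stays below this budget.
    return synthesis_ms < block_deadline_ms(buffer_frames, sample_rate)

print(f"budget per block: {block_deadline_ms():.2f} ms")  # ~5.33 ms
print(meets_deadline(synthesis_ms=4.1))                   # True
```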
ISBN:
(Print) 9780769530895
This paper describes xENoC, an automatic, component-reuse HW-SW environment to build simulatable and synthesizable Network-on-Chip-based MPSoC architectures. xENoC is based on a tool named NoCWizard, which uses an eXtensible Markup Language (XML) specification and a set of modularized components and templates to generate many types of NoC instances in Verilog HDL. These NoC models can be customized in terms of topology, tile location/mapping, RNI generation, router type, and FIFO and packet/flit sizes, simply by modifying the XML specifications. Furthermore, xENoC also comprises software components, i.e., RNI drivers and a parallel programming model, embedded Message Passing Interface (eMPI), which lets us carry out a complete HW-SW co-design methodology for designing parallel applications on distributed-memory NoC-based MPSoCs. Through xENoC, different distributed-memory NoC-based MPSoC designs have been created, simulated, and prototyped on physical platforms (e.g., FPGA boards), and several parallel multiprocessor test-traffic applications run there as system-level demonstrators.
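To give a feel for the XML-driven customization described above, the snippet below parses a hypothetical specification; the element and attribute names are invented for illustration and are not NoCWizard's actual schema.

```python
# Parse an invented NoC specification to show the kinds of parameters
# (topology, routers, flit sizes, tile mapping) such a flow can expose.
import xml.etree.ElementTree as ET

spec = """
<noc topology="mesh" rows="3" cols="3">
  <router type="wormhole" fifo_depth="4"/>
  <packet flit_size="32" flits_per_packet="8"/>
  <tile id="0" x="0" y="0" core="cpu"/>
  <tile id="1" x="1" y="0" core="dsp"/>
</noc>
"""

root = ET.fromstring(spec)
print(root.get("topology"), root.get("rows"), "x", root.get("cols"))
for tile in root.findall("tile"):
    print("tile", tile.get("id"), "at", (tile.get("x"), tile.get("y")), "->", tile.get("core"))
```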
ISBN:
(Digital) 9781728165820
ISBN:
(Print) 9781728165820
The main data-driven techniques for detecting cybersecurity attacks are based on the analysis of network traffic data and/or of application/system logs (stored in a host or in some other kind of device). A wide range of machine-learning techniques (and possible alternative configurations of them) have been proposed in the literature for this purpose, but none of them has been proven to consistently outperform the others across different datasets. In order to ensure better accuracy and stability, the ensemble paradigm can be exploited as an effective solution for combining such techniques. However, as attack detection problems are hard to cope with and usually entail the analysis of large and fast streams of data, different types of ensembles (and of the base algorithms composing them) should be experimented with, exploiting distributed architectures to suitably reduce the high execution times necessary to run them. In order to handle all these issues, a P2P environment to validate ensemble-based approaches in the cybersecurity domain is proposed in this paper. Two case studies are analyzed by using this framework, concerning the detection of intrusions in network-traffic data and of deviant process instances. Preliminary scalability results demonstrate that the framework is a viable solution for this challenging kind of problem.
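As a minimal illustration of the ensemble paradigm mentioned above, the sketch below combines a few placeholder base detectors by majority voting; the detectors and thresholds are assumptions and do not correspond to the algorithms evaluated in the paper.

```python
# Majority-vote ensemble over heterogeneous base detectors.
def majority_vote(detectors, event):
    votes = sum(1 for detect in detectors if detect(event))
    return votes > len(detectors) / 2  # flag as attack if most detectors agree

base_detectors = [
    lambda e: e["bytes"] > 10_000,        # volume-based rule
    lambda e: e["failed_logins"] >= 3,    # brute-force heuristic
    lambda e: e["port"] in {23, 2323},    # suspicious service
]

event = {"bytes": 20_000, "failed_logins": 5, "port": 443}
print(majority_vote(base_detectors, event))  # True: two of three detectors fire
```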
ISBN:
(Print) 9781728165820
Context-based Adaptive Binary Arithmetic Coding (CABAC) is the only compute-intensive task in the High Efficiency Video Coding (HEVC) standard that does not contain significant data-level parallelism. As a result, it is often a throughput bottleneck for the overall decoding process, especially for high-quality videos. Consequently, the use of high-level parallelization techniques is inevitable to reach throughput requirements for CABAC decoding. Multiple high-level parallelization tools are specified in HEVC, among which Wavefront Parallel Processing (WPP) has only small losses in coding efficiency. However, it lacks parallel efficiency due to the ramp-up and ramp-down of active parallel threads within a frame. This is a serious problem for systems that cannot process multiple frames at the same time due to performance or memory constraints (e.g., mobile devices), and also for low-delay applications such as video conferencing. To address this issue, we present three improved WPP implementations for HEVC CABAC decoding. They differ in the granularity at which dependency checks are performed. The improvement comes from the increased parallel efficiency of the WPP implementation while using the same number of threads as conventional WPP. The proposed implementations allow speedups of up to 1.83x with very little implementation overhead.
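For context, the sketch below encodes the standard WPP dependency rule (a CTU row may proceed only while it stays at least two CTUs behind the row above); the finer-grained dependency checks proposed in the paper are not reproduced here.

```python
# Standard WPP readiness check: CTU (row, col) may start once the row above
# has completed at least col + 2 CTUs (its top-right neighbour is done).
def ctu_ready(row, col, progress):
    """progress[r] = number of CTUs already decoded in row r."""
    if row == 0:
        return True
    return progress[row - 1] >= col + 2

progress = [5, 3, 0]              # row 0: 5 CTUs done, row 1: 3, row 2: none
print(ctu_ready(1, 3, progress))  # True: row 0 has 5 >= 3 + 2
print(ctu_ready(2, 2, progress))  # False: row 1 has 3 < 2 + 2
```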