ISBN: 9781479965687 (print)
The proceedings contain 61 papers. The topics discussed include: multi-objective hardware-software co-optimization for the SNIPER multi-core simulator; consistency checking of safety arguments in the goal structuring notation standard; interleaving ontology-based reasoning and natural language processing for character identification in folktales; nonparametric weighted MMC feature extraction method for HRRP recognition; real-time pedestrian detection in urban scenarios; vision algorithms and embedded solution for pedestrian detection with far infrared camera; aggregate road surface based environment representation using digital elevation maps; UDCT complex coefficient statistics based rotation invariant texture characterization; industrial AGVs: toward a pervasive diffusion in modern factory warehouses; protecting cache memories through data scrambling technique; parallel object-oriented implementation of the TestU01 statistical test suites; and parallel implementation of the matrix rank test for randomness assessment.
Interest in parallel architectures applied to real-time selections is growing in High Energy Physics (HEP) experiments. In this paper we describe performance measurements of Graphics Processing Units (GPUs) and the Intel Many Integrated Core (MIC) architecture when applied to a typical HEP online task: the selection of events based on the trajectories of charged particles. As a benchmark we use a scaled-up version of the SVT algorithm, which was used at the CDF experiment at the Tevatron for online track reconstruction, as a realistic test case for low-latency trigger systems using new computing architectures for LHC experiments. We examine the complexity/performance trade-off in porting existing serial algorithms to many-core devices. Measurements of both data processing and data transfer latency are shown, considering different I/O strategies to and from the parallel devices.
ISBN: 9783642551956 (digital); 9783642551956 (print)
Running BWA in multithreaded mode on a multi-socket server results in poor scaling behaviour. This is because the current parallelisation strategy does not take into account the load imbalance that is inherent to the properties of the data being aligned, e.g. varying read lengths and numbers of mutations. Additional load imbalance is caused by the BWA code not anticipating certain hardware characteristics of multi-socket multicores, such as the non-uniform memory access time of the different cores. We show that rewriting the parallel section using Cilk removes the load imbalance, resulting in a factor-of-two performance improvement over the original BWA.
ISBN: 9781479976157 (print)
We present a simulated annealing based partitioning technique for mapping task graphs onto heterogeneous processing architectures. Partitioning tasks onto homogeneous architectures to minimize the makespan of a task graph is a known NP-hard problem. Heterogeneity greatly complicates this partitioning problem, making heuristic solutions essential. A number of heuristic approaches have been proposed, some using simulated annealing. We propose a simulated annealing method with a novel NEXT STATE function that explores different regions of the global search space when the annealing temperature is high and makes the search more local as the temperature drops. The novelty of our approach is twofold: (1) we go a step further than the existing scientific literature by considering heterogeneity at the levels of task parallelism, data parallelism and communication; (2) we present a novel algorithm that uses simulated annealing to find better partitions in the presence of heterogeneous architectures, data-parallel execution units, and significant data communication costs. We conduct a statistical analysis of the performance of the proposed method, which shows that our approach clearly outperforms the existing simulated annealing method.
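The idea of a temperature-dependent neighbour move can be illustrated with a minimal sketch. It is not the authors' NEXT STATE function: it assumes independent tasks (no precedence edges or communication costs), a hypothetical cost table `cost[task][proc]` giving each task's run time on each processor, and a simple rule that reassigns many tasks when the temperature is high and a single task when it is low. All function and parameter names are illustrative.

```python
import math
import random


def makespan(assign, cost):
    """Finish time of the busiest processor (tasks assumed independent)."""
    loads = [0.0] * len(cost[0])
    for task, proc in enumerate(assign):
        loads[proc] += cost[task][proc]
    return max(loads)


def next_state(assign, n_procs, temp, t0, rng):
    """Reassign many tasks when hot (global jump), few when cool (local move)."""
    new = list(assign)
    n_moves = max(1, round(len(assign) * temp / t0))
    for task in rng.sample(range(len(assign)), n_moves):
        new[task] = rng.randrange(n_procs)
    return new


def anneal(cost, t0=10.0, alpha=0.95, steps=2000, seed=0):
    """Return (best assignment, best makespan) found by simulated annealing."""
    rng = random.Random(seed)
    n_procs = len(cost[0])
    cur = [rng.randrange(n_procs) for _ in cost]
    cur_cost = makespan(cur, cost)
    best, best_cost = cur, cur_cost
    temp = t0
    for _ in range(steps):
        cand = next_state(cur, n_procs, temp, t0, rng)
        cand_cost = makespan(cand, cost)
        delta = cand_cost - cur_cost
        # Always accept improvements; accept worse states with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        temp = max(temp * alpha, 1e-6)  # geometric cooling with a small floor
    return best, best_cost


if __name__ == "__main__":
    # Two processor types: tasks 0-1 run faster on processor 0, tasks 2-3 on processor 1.
    table = [[4.0, 8.0], [4.0, 8.0], [6.0, 3.0], [6.0, 3.0]]
    print(anneal(table))
```

Extending the sketch toward the setting in the abstract would mean replacing `makespan` with a schedule-length evaluation over the task graph that accounts for precedence constraints and data transfer costs.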
ISBN: 9781479933037 (print)
In this paper we consider the cognitive process as a set of different tasks, in particular the tasks of clustering, classification and association search. We describe the parameters by which these tasks are similar and hypothesize that a unified methodology for cognitive systems can be created. We present a possible original architecture for the system, together with its description, command and data formats, and principles of operation. We also describe the implementation of a model of the system for the GPU.
ISBN: 9783642551956 (digital); 9783642551956 (print)
Due to the increasing number of cores of current parallel machines, the question arises to which cores parallel tasks should be mapped. Thus, parallel task scheduling is now more relevant than ever, especially under the moldable task model, in which tasks are allocated a fixed number of processors before execution. Scheduling algorithms commonly assume that the speedup function of moldable tasks is either non-decreasing, sub-linear or concave. In practice, however, the resulting speedup of parallel programs on current hardware with deep memory hierarchies is most often neither non-decreasing nor concave. We present a new algorithm for the problem of scheduling moldable tasks with precedence constraints for the makespan objective and for arbitrary speedup functions. We show through simulation that the algorithm not only creates competitive schedules for moldable tasks with arbitrary speedup functions, but also outperforms other published heuristics and approximation algorithms for non-decreasing speedup functions.
ISBN: 9783642552243 (print)
We derive a new parallel communication-avoiding matrix powers algorithm for matrices of the form A = D + USV^H, where D is sparse and USV^H has low rank and is possibly dense. We demonstrate that, with respect to the cost of computing k sparse matrix-vector multiplications, our algorithm asymptotically reduces the parallel latency by a factor of O(k) for small additional bandwidth and computation costs. Using problems from real-world applications, our performance model predicts up to 13x speedups on petascale machines.
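The structure that such an algorithm can exploit is visible when a single product with A is written out. The block below is a sketch of that splitting only, not of the communication-avoiding scheme itself. The dimensions are assumptions made for illustration: A and D are n × n, U and V are n × r, and S is r × r with r much smaller than n.

```latex
\[
  A x \;=\; D x \;+\; U \bigl( S \, ( V^{H} x ) \bigr),
  \qquad
  x_{j+1} = A x_{j}, \quad j = 0, 1, \dots, k-1 ,
\]
\[
  \text{so that the matrix powers kernel produces }
  \bigl[\, x_{0},\; A x_{0},\; A^{2} x_{0},\; \dots,\; A^{k} x_{0} \,\bigr].
\]
```

Under these assumed shapes, evaluating the low-rank term from right to left costs O(nr + r^2) operations per product, so the possibly dense n × n matrix USV^H never has to be formed explicitly.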
ISBN: 9789897580390 (print)
In this paper, we propose a method of enhancing Multi-Objective Genetic Algorithms (MOGAs) for document clustering with parallel programming. Document clustering using MOGAs shows better performance than other clustering algorithms. However, the overall computation time of the MOGAs grows considerably as the number of documents increases. To address this problem, we implement the MOGAs with General-Purpose computing on Graphics Processing Units (GPGPU) to compute the document similarities for the clustering. Furthermore, we introduce two thread architectures (Term-threads and Document-threads) in the CUDA (Compute Unified Device Architecture) language. The experimental results show that the parallel MOGAs with CUDA are much faster than the general MOGAs.
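To make the parallelized workload concrete, here is a minimal sequential sketch of a pairwise document similarity computation. The abstract does not name the similarity measure, so cosine similarity over raw term counts is an assumption, as are the whitespace tokenization and all function names. Each matrix entry is computed independently of the others, which is the property a GPU mapping can exploit.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def similarity_matrix(docs):
    """Symmetric matrix of pairwise similarities; each entry is independent work."""
    vecs = [term_vector(d) for d in docs]
    n = len(vecs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sim[i][j] = sim[j][i] = cosine(vecs[i], vecs[j])
    return sim


if __name__ == "__main__":
    corpus = ["gpu cuda kernel", "gpu cuda kernel", "genetic algorithm"]
    for row in similarity_matrix(corpus):
        print(row)
```

The double loop performs on the order of n^2 similarity evaluations for n documents, which is consistent with the abstract's observation that computation time grows quickly with the number of documents.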
ISBN: 9783319111971 (digital); 9783319111971, 9783319111964 (print)
Halftoning is an important process that converts a gray scale image into a binary image with black and white pixels. Clipping-free DBS (Direct Binary Search)-based halftoning is one of the halftoning methods that can generate high quality binary images. However, its computing time makes it unrealistic for most applications, such as printing. The main contribution of this paper is a new GPU implementation of clipping-free DBS-based halftoning. We have considered the programming issues of the GPU architecture in implementing the method on the GPU. The experimental results show that our GPU implementation on an NVIDIA GeForce GTX 780 Ti processes a 4096 × 3072 gray scale image in 7.240 seconds, while the CPU implementation runs in 346.6 seconds. Thus, our GPU implementation attains a speed-up factor of 47.82.
ISBN: 9781479934805 (print)
The efficient processing of large collections of patterns expressed as Boolean expressions over event streams plays a central role in major data-intensive applications ranging from user-centric processing and personalization to real-time data analysis. On the one hand, emerging user-centric applications, including computational advertising and selective information dissemination, demand determining and presenting to an end-user the relevant content as it is published. On the other hand, applications in real-time data analysis, including push-based multi-query optimization, computational finance and intrusion detection, demand meeting stringent subsecond processing requirements and providing high-frequency event processing. We meet these event processing requirements by exploiting the shift towards multi-core architectures: we propose a novel adaptive parallel compressed event matching algorithm (A-PCM) and an online event stream re-ordering technique (OSR) that unleash an unprecedented degree of parallelism for highly parallel event processing. In our comprehensive evaluation, we demonstrate the efficiency of our proposed techniques. We show that the adaptive parallel compressed event matching algorithm can sustain an event rate of up to 233,863 events/second, while state-of-the-art sequential event matching algorithms sustain only 36 events/second when processing up to five million Boolean expressions.
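The matching problem itself can be stated in a few lines. The sketch below is a naive sequential baseline, not A-PCM or OSR: it assumes each Boolean expression is a conjunction of (attribute, operator, constant) predicates and that an event is a mapping from attribute names to numeric values. The representation and all names are hypothetical.

```python
import operator
from typing import Dict, List, Tuple

Predicate = Tuple[str, str, float]  # (attribute, operator symbol, constant)
Expression = List[Predicate]        # conjunction of predicates (assumed form)

OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}


def matches(expr: Expression, event: Dict[str, float]) -> bool:
    """True if the event carries every attribute and satisfies every predicate."""
    return all(
        attr in event and OPS[op](event[attr], const)
        for attr, op, const in expr
    )


def match_all(exprs: List[Expression], event: Dict[str, float]) -> List[int]:
    """Indices of all expressions matched by the event (linear scan)."""
    return [i for i, expr in enumerate(exprs) if matches(expr, event)]


if __name__ == "__main__":
    subscriptions = [
        [("price", "<", 100.0)],
        [("price", ">=", 100.0), ("volume", ">", 10.0)],
        [("age", "=", 30.0)],
    ]
    print(match_all(subscriptions, {"price": 120.0, "volume": 50.0}))
```

A linear scan like this evaluates every expression for every event, which illustrates why throughput drops sharply as the number of expressions reaches the millions.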