ISBN: 9781479948970 (print)
This paper presents the implementation of ray-tracing-based algorithms for multi-objective geospatial optimization targeting various many-core processing technologies such as graphics processing units, x86 multi-cores, and ARM processors. High performance is achieved through highly parallel core algorithms, executed on multiple compute devices across a heterogeneous architecture using low-level OpenCL kernels. Algorithms for calculating line-of-sight ballistic threat, visual observability, ground plane extraction, and Markov chain Monte Carlo optimization provide augmented geospatial intelligence and situational awareness in three-dimensional urban environments.
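As a rough illustration of the line-of-sight idea mentioned in this abstract, the sketch below marches a ray between two cells of a height grid and reports whether any sample is blocked. It is a minimal single-threaded sketch, not the paper's OpenCL kernels; the function name `line_of_sight`, the `eye_height` default, and the fixed sample count are all assumptions made for the example.

```python
def line_of_sight(heights, observer, target, eye_height=1.8, samples=200):
    """Return True if the target cell is visible from the observer cell.

    heights: 2D list of terrain/building heights, indexed heights[row][col].
    observer, target: (row, col) grid cells.
    All names and defaults here are hypothetical, chosen for illustration.
    """
    (r0, c0), (r1, c1) = observer, target
    z0 = heights[r0][c0] + eye_height  # ray start height
    z1 = heights[r1][c1] + eye_height  # ray end height
    for i in range(1, samples):
        t = i / samples
        # Interpolate the ray position and height at parameter t.
        r = r0 + t * (r1 - r0)
        c = c0 + t * (c1 - c0)
        ray_z = z0 + t * (z1 - z0)
        # The ray is blocked if the surface in this cell rises above it.
        if heights[round(r)][round(c)] > ray_z:
            return False
    return True


flat = [[0.0] * 5 for _ in range(5)]
walled = [row[:] for row in flat]
walled[2][2] = 10.0  # a tall obstacle on the diagonal

print(line_of_sight(flat, (0, 0), (4, 4)))
print(line_of_sight(walled, (0, 0), (4, 4)))
```

Each observer-target pair is independent of the others, which is what makes this kind of test a natural fit for the many-core devices the abstract targets.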
ISBN: 9783642330773 (print)
The proceedings contain 73 papers. The topics discussed include: accelerating the dynamic programming for the optimal polygon triangulation on the GPU; security computing for the resiliency of protecting from internal attacks in distributed wireless sensor networks; optimization of a short-range proximity effect correction algorithm in e-beam lithography using GPGPUs; vectorized algorithms for Quadtree construction and descent; an optimal parallel prefix-sums algorithm on the memory machine models for GPUs; enhancing the performance of a distributed mobile computing environment by topology construction; maintaining consistency in software transactional memory through dynamic versioning tuning; a new low latency parallel turbo decoder employing parallel phase decoding method; high-performance matrix multiply on a massively multithreaded Fiteng1000 processor; and on construction of Cloud IaaS for VM live migration using KVM and OpenNebula.
ISBN: 9781479907298 (print)
For high-performance computation, memory access is a major issue. Whether it is a supercomputer, a GPGPU device, or an Application Specific Instruction set Processor (ASIP) for Digital Signal Processing (DSP), parallel execution is a necessity. A high rate of computation puts pressure on the memory access, and it is often non-trivial to maximize the data rate to the execution units. Many algorithms that from a computational point of view can be implemented efficiently on parallel architectures fail to achieve significant speed-ups. The reason is very often that the available execution units are poorly utilized due to inefficient data access, so the possible speed-up is not reached. This paper shows a method for improving the access time for sequences of data that are completely static, at the cost of extra memory. This is done by resolving memory conflicts by using padding. The method can be applied automatically, and it is shown to significantly reduce the data access time for sorting and FFTs. The execution time for the FFT is improved by up to a factor of 3.4 and for sorting by up to a factor of 8.
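The effect of padding on memory conflicts can be sketched with a toy model. The example below assumes a memory of interleaved banks in which word address `a` lives in bank `a % num_banks`; that mapping, the 8x8 matrix, and the function names are assumptions for illustration and are not taken from the paper, whose automatic method is not reproduced here.

```python
from collections import Counter


def max_bank_conflict(addresses, num_banks):
    """Largest number of simultaneous accesses that land in the same bank."""
    banks = Counter(addr % num_banks for addr in addresses)
    return max(banks.values())


def column_addresses(column, rows, row_stride):
    """Word addresses of one column of a row-major matrix."""
    return [r * row_stride + column for r in range(rows)]


NUM_BANKS = 8  # assumed interleaved banks: bank = address % NUM_BANKS
ROWS = COLS = 8

# Without padding, the row stride equals the bank count, so a whole
# column maps to a single bank.
unpadded = column_addresses(0, ROWS, row_stride=COLS)
# With one padding word per row, consecutive rows shift by one bank.
padded = column_addresses(0, ROWS, row_stride=COLS + 1)

print(max_bank_conflict(unpadded, NUM_BANKS))  # 8
print(max_bank_conflict(padded, NUM_BANKS))    # 1
```

The padded layout spends one extra word per row, which mirrors the trade-off stated in the abstract: faster access at the cost of extra memory.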
ISBN: 9781479924813 (print)
Fractal organizations are a class of bio-inspired distributed hierarchical architectures in which control and feedback information are allowed to flow independently of the position the participating nodes have in the system hierarchy. In this paper we discuss the adoption of a fractal organization in a class of socio-technical systems characterized by a centralized architecture. We present the key architectural traits of the resulting Fractal Social Organization and put forward our conjecture that services based on the presented solution may exhibit significant improvements, e.g., in terms of scalability and performance. In order to provide elements to justify our conjecture we describe how we envision the use of the new organization in two different cases: a framework for semantic service description-and-matching and a low-cost telemonitoring service.
ISBN: 9780769549392; 9781467353212 (print)
Decision Support System (DSS) workloads are known to be one of the most time-consuming database workloads that process large data sets. Traditionally, DSS queries have been accelerated using large-scale multiprocessors. In this work we exploit the benefits of using future many-core architectures, more specifically on-chip clustered many-core architectures. To achieve this goal we propose different representative data-parallel versions of the original database scan and join algorithms. We also study the impact on performance when on-chip memory, shared among all cores, is used as a prefetching buffer. For our experiments we study the behaviour of three queries from the standard DSS benchmark TPC-H executing on the Intel Single-chip Cloud Computer experimental processor (Intel SCC). Our results show that parallelism can be well exploited by such architectures and how important it is to have a balance between computation and data intensity. Moreover, our experimental results show performance improvements of 5x and 10x for the corresponding query implementation without data prefetching. Finally, we show how the system can be used efficiently in order to achieve high power-performance efficiency when using the proposed prefetching buffer.
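A data-parallel scan of the kind this abstract refers to can be sketched by splitting a table into chunks and filtering each chunk independently. The code below is only a generic illustration using Python threads; the names `scan_chunk` and `parallel_scan`, the chunking rule, and the sample table are assumptions, and it does not model the Intel SCC or the prefetching buffer. In CPython the threads mainly show the structure rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(rows, predicate):
    """Sequential scan of one chunk: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def parallel_scan(table, predicate, workers=4):
    """Split the table into contiguous chunks and scan them concurrently."""
    chunk = max(1, (len(table) + workers - 1) // workers)
    parts = [table[i:i + chunk] for i in range(0, len(table), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the output keeps the table order.
        results = list(pool.map(lambda part: scan_chunk(part, predicate), parts))
    return [row for part in results for row in part]


# Hypothetical table of (order_id, quantity) rows.
table = [(i, i % 10) for i in range(1000)]
selected = parallel_scan(table, lambda row: row[1] > 5)
print(len(selected))
```

Because each chunk is scanned without touching the others, the only coordination needed is the final concatenation of the partial results.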
ISBN: 9780769550886 (print)
The performance of parallel distributed data management systems becomes increasingly important with the rise of Big Data. Parallel joins have been widely studied both in the parallel processing and the database communities. Nevertheless, most of the algorithms developed so far do not consider data skew, which naturally exists in various applications. State-of-the-art methods designed to handle this problem are based on extensions to either of the two prevalent conventional approaches to parallel joins: the hash-based and duplication-based frameworks. In this paper, we introduce a novel parallel join framework, query-based distributed join (QbDJ), for handling data skew on distributed architectures. Further, we present an efficient implementation of the method based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate the performance of our approach on a cluster of 192 cores (16 nodes) and datasets of 1 billion tuples with different skews. The results show that the method is scalable, and also runs faster with less network communication compared to the state-of-the-art PRPD approach in [1] under high data skew.
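To see why data skew hurts the conventional hash-based framework named in this abstract, the sketch below partitions tuples across nodes by join key and measures the resulting load imbalance. It illustrates only the baseline problem, not QbDJ or PRPD; the function names, the four-node setup, and the synthetic datasets are assumptions for the example.

```python
def hash_partition(tuples, num_nodes):
    """Assign each (key, payload) tuple to a node by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for key, payload in tuples:
        nodes[hash(key) % num_nodes].append((key, payload))
    return nodes


def load_imbalance(nodes):
    """Ratio of the busiest node's load to the average load (1.0 is ideal)."""
    loads = [len(node) for node in nodes]
    return max(loads) / (sum(loads) / len(loads))


NUM_NODES = 4  # hypothetical cluster size

# Uniform keys: every key appears once.
uniform = [(k, "row") for k in range(1000)]
# Skewed keys: 90% of the tuples share the single key 7.
skewed = [(7, "row")] * 900 + [(k, "row") for k in range(100)]

print(load_imbalance(hash_partition(uniform, NUM_NODES)))
print(load_imbalance(hash_partition(skewed, NUM_NODES)))
```

All tuples with the same key must meet on one node for a hash join, so a single hot key pins most of the work to that node no matter how good the hash function is.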
How to map IP cores onto NoC architectures (application mapping) is a significant issue in multi-core system design. Many mapping algorithms which aim at optimizing cost metrics (e.g. energy consumption) in the mapping...
ISBN: 9780769549675 (print)
Discrete Stochastic Arithmetic (DSA) estimates round-off error propagation in a program. It is based on a synchronous execution of several instances of the program to be controlled, each using a random rounding mode. In this paper we show how we can take advantage of multicore processors, which are nowadays widespread, to reduce the cost of DSA in terms of execution time. Several processes execute different instances of the program in parallel and exchange data when necessary. Several strategies are compared for the estimation of the result accuracy and the detection of numerical instabilities. With our parallel implementation, the cost of DSA is reduced by a factor of about 2 compared with the sequential approach. Our parallel implementation of DSA has been used successfully for the numerical validation of a real-life application.
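The idea of running several randomly rounded instances and comparing them can be sketched as follows. Python exposes no control over the floating-point rounding mode, so this example crudely emulates random rounding by nudging each intermediate result one unit in the last place up or down; that emulation, the function names, and the cap of 15 digits are assumptions for illustration. The digit estimate uses a Student's t interval over the samples and is not the paper's parallel implementation.

```python
import math
import random
import statistics

# Two-sided 95% Student's t values, keyed by sample count N (N - 1 degrees of freedom).
STUDENT_T_95 = {2: 12.706, 3: 4.303, 4: 3.182}


def random_rounded_sum(values, rng):
    """Sum values, perturbing each partial sum by one ulp in a random direction.

    This is a crude stand-in for a hardware random rounding mode.
    """
    acc = 0.0
    for v in values:
        acc += v
        direction = math.inf if rng.random() < 0.5 else -math.inf
        acc = math.nextafter(acc, direction)
    return acc


def significant_digits(samples):
    """Estimate the number of decimal digits shared by the samples."""
    n = len(samples)
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    if spread == 0.0:
        return 15.0  # assumed cap when all instances agree exactly
    if mean == 0.0:
        return 0.0
    return math.log10(math.sqrt(n) * abs(mean) / (spread * STUDENT_T_95[n]))


rng = random.Random(0)
runs = [random_rounded_sum([0.1] * 1000, rng) for _ in range(3)]
print(runs)
print(significant_digits(runs))
```

The three runs here execute one after another; the abstract's contribution is to run such instances as parallel processes that exchange data when needed.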
ISBN: 9781479901036 (print)
Embedded real-time algorithms are often realized with dedicated hardware, exhibiting high production costs and low programming flexibility thereafter. For instance, semi-global matching for stereo image processing, including complex data flows, traditionally runs on customized hardware modules. Combining the processing and memory capabilities of multiple individual cores, emerging embedded multi-core technologies address these problems. However, considering concurrency issues (e.g., data races and lock contention), parallel programming requires experienced programmers and technology-specific techniques (e.g., synchronization libraries) and tools (e.g., parallel profilers), which are often not available on embedded platforms. In this work, we introduce a parallel version of a semi-global matching algorithm and demonstrate within this case study the runtime optimizations necessary to meet real-time requirements. We also show structured steps of the applied parallelization workflow, illustrating an efficient migration strategy to multi-core platforms using runtime information (e.g., profiles and hardware counters). Finally, to evaluate the resulting performance characteristics, we compare the runtime behavior of the parallel version running on a Freescale P4080 processor with reference values taken on an Intel i7, a field-programmable logic device, an extended general purpose processor and a GPU.
ISBN: 9783642400476 (print)
Using the Barnes-Hut algorithm as an example, we deal with the design of parallel algorithms that are able to exploit multicore CPUs and GPUs conjointly. Specifically, we demonstrate how to modularize a parallel application according to specific aspects of parallel execution. This allows for a flexible assignment of individual modules to the two parallel architectures based on their actual performance characteristics. Furthermore, we discuss a hybrid module for the most time-consuming part of the algorithm that utilizes CPU and GPU simultaneously, employing a novel load balancing heuristic. Our experimental evaluation shows that our method greatly increases overall efficiency by making it possible to deploy the optimal configuration of modules for each individual computer system.