ISBN:
(Print) 9781538672600; 9781538672594
Weather radar is a system that utilizes advanced radio wave engineering to detect precipitation in the atmosphere. One of the wave generation techniques used in weather radar is frequency-modulated continuous wave (FMCW), with dual polarization to differentiate detected precipitation types by shape and size. Weather radar signal processing is usually performed using digital signal processors and field-programmable gate arrays (FPGAs), which perform well but make system development and deployment difficult. Software implementation of weather radar signal processing enables easier and faster development and deployment, at the cost of performance when done serially. Parallel implementation using general-purpose graphics processing units (GP-GPUs) may provide the best of both worlds: easier development and deployment than hardware-based solutions, with better performance than serial CPU implementations. In this paper, the implementation of various optimization strategies for weather radar signal processing in a GP-GPU environment on the Nvidia CUDA platform is shown. Performance measurements show that, among the optimization strategies implemented, only the utilization of multiple CUDA streams gives a significant performance gain. This paper contributes to the effort to build a full weather radar signal processing stack on the GPU.
Data collection, processing, and visualization techniques have been going through a rapid evolution in recent years. Various applications utilize these new results, especially in combination. In parallel with this, there are new developments within the Cyber-Physical Systems (CPS) domain. Besides the main purpose of CPS - making physical systems work better, faster, and in a more optimized way - the concept can improve the long-term usability of the physical equipment as well. The principles of proactive maintenance are rooted in the need for short-term fault correction and long-term usability. The idea is to track system status not only for operations but for maintenance purposes as well. This allows for scheduling maintenance based on need - rather than based on operating time or "mileage". This paper presents the MANTIS framework for proactive maintenance. It utilizes CPS concepts for system modeling; furthermore, it proposes a combined tool set for data collection, processing, and presentation. In order to cover the full value chain, the framework applies to all three tiers: the Edge, the Platform, and the Enterprise.
ISBN:
(Digital) 9783319654829
ISBN:
(Print) 9783319654829; 9783319654812
Heterogeneous CPU-GPU platforms include resources that benefit from the different kinds of parallelism present in many data mining applications based on evolutionary algorithms that evolve solutions with time-demanding fitness evaluation. This paper describes an evolutionary parallel multi-objective feature selection procedure with subpopulations, using two scheduling alternatives for the evaluation of individuals according to the number of subpopulations. Evolving subpopulations usually provides good diversity properties and avoids premature convergence in evolutionary algorithms. The proposed procedure has been implemented with OpenMP, to dynamically distribute either subpopulations or individuals among devices, and OpenCL, to evaluate the individuals taking into account the devices' characteristics, providing two parallelism levels on the CPU and up to three levels on GPUs. Different configurations of the proposed procedure have been evaluated and compared with a master-worker approach, considering not only the runtime and achieved speedups but also the energy consumption of both scheduling models.
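The two scheduling grains described above can be sketched minimally in Python, with a thread pool standing in for the OpenMP device scheduler and a trivial function standing in for the time-demanding OpenCL fitness kernel; all names here are illustrative assumptions, not the paper's code.

```python
# Sketch: dynamic scheduling at two grains - whole subpopulations vs.
# single individuals - as the abstract describes. Illustrative only.
from concurrent.futures import ThreadPoolExecutor

def evaluate(individual):
    """Stand-in fitness; in the paper this is a costly OpenCL kernel."""
    return sum(individual)

def schedule(subpops, workers, grain="subpopulation"):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        if grain == "subpopulation":
            # coarse grain: one task per subpopulation
            return list(pool.map(lambda sp: [evaluate(i) for i in sp], subpops))
        # fine grain: one task per individual, flattened across subpopulations
        flat = [i for sp in subpops for i in sp]
        scores = list(pool.map(evaluate, flat))
        results, k = [], 0
        for sp in subpops:
            results.append(scores[k:k + len(sp)])
            k += len(sp)
        return results

pops = [[(1, 2), (3, 4)], [(5, 6)]]
# Both grains compute the same fitnesses; they differ only in how work
# units are handed to the pool (and hence in load balance on real devices).
assert schedule(pops, 2, "subpopulation") == schedule(pops, 2, "individual")
```

The fine grain balances load better when individuals have very uneven evaluation cost; the coarse grain amortizes dispatch overhead, which mirrors the trade-off the paper measures.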
This paper deals with the design of RNS (Residue Number System) based building blocks for applications in digital signal processing. RNS provides parallel, carry-free operations and, since it deals with small numbers, is faster than other conventional methods. RNS-based processing is performed in three stages, namely Forward Conversion (FC), Modular Operations, and Reverse Conversion (RC). This paper is aimed at the design and analysis of efficient blocks in terms of area, delay, and power for the special moduli set {2^n - 1, 2^n, 2^n + 1} using standard cells at 32/28 nm technology. A modification is made to the earlier proposed forward converter architecture to make it work for all valid combinations of input data. In all basic blocks, the binary adder is the main component. Verilog HDL is used to design the different blocks. The Synopsys Design Compiler is used for area, power, and delay calculation at 32/28 nm technology. It is observed that the CSA (Carry Select Adder) based reverse converter is approximately 20% faster than that based on the RCA (Ripple Carry Adder). We have also compared LUT-based designs of two different types of reverse converter, namely CRT (Chinese Remainder Theorem) and MRC (Mixed Radix Conversion).
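The forward and reverse conversion stages for the {2^n - 1, 2^n, 2^n + 1} moduli set can be illustrated in a few lines of Python; the function names are assumptions for illustration, and the CRT-based reverse conversion shown here is the arithmetic that the paper's hardware converters implement.

```python
# Sketch of RNS forward conversion and CRT-based reverse conversion
# for the special moduli set {2^n - 1, 2^n, 2^n + 1}.

def forward_convert(x, n):
    """Forward conversion: binary integer -> residues for the moduli set."""
    moduli = (2**n - 1, 2**n, 2**n + 1)
    return tuple(x % m for m in moduli), moduli

def reverse_convert(residues, moduli):
    """Reverse conversion via the Chinese Remainder Theorem (CRT)."""
    M = 1
    for m in moduli:
        M *= m                              # dynamic range (pairwise coprime)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)        # modular inverse of Mi mod m
    return x % M

n = 4                                        # moduli 15, 16, 17; range 4080
res, mods = forward_convert(200, n)
assert reverse_convert(res, mods) == 200
# RNS addition is carry-free: each residue channel adds independently.
a, _ = forward_convert(100, n)
b, _ = forward_convert(57, n)
s = tuple((ra + rb) % m for ra, rb, m in zip(a, b, mods))
assert reverse_convert(s, mods) == 157
```

The carry-free channel-wise addition is exactly why RNS adders are short and fast in hardware: no carry propagates between the three residue channels.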
ISBN:
(Digital) 9783319654829
ISBN:
(Print) 9783319654829; 9783319654812
Multicore clusters are widely used to solve combinatorial optimization problems, which require high computing power and a large amount of memory. In this sense, Hash Distributed A* (HDA*) parallelizes A*, a combinatorial optimization algorithm, using the MPI library. HDA* scales well on multicore clusters and on multicore machines. Additionally, there exist several versions of HDA* that were adapted for multicore machines using the Pthreads library. In this paper, we present Hybrid HDA* (HHDA*), a hybrid parallel search algorithm based on HDA* that combines message passing (MPI) with shared-memory programming (Pthreads) to better exploit the computing power and memory of multicore clusters. We evaluate the performance and memory consumption of HHDA* on a multicore cluster, using the 15-puzzle as a case study. The results reveal that HHDA* achieves slightly higher average performance and uses considerably less memory than HDA*. These improvements allowed HHDA* to solve one of the hardest 15-puzzle instances.
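The core idea of HDA* - a global hash function that assigns each generated state to exactly one owner process, so duplicate detection needs no shared open/closed lists - can be sketched as follows. The function name and the choice of SHA-256 are illustrative assumptions, not the paper's implementation.

```python
# Sketch of HDA*-style work distribution: hash each search state to a
# unique owner, so any duplicate of that state lands at the same place.
import hashlib

def owner(state, num_workers):
    """Map a search state (here, a 15-puzzle tile tuple) to one worker."""
    digest = hashlib.sha256(bytes(state)).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

state = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0)
w = owner(state, 8)
assert 0 <= w < 8
# Determinism is the point: every process that generates this state
# sends it to the same owner, which detects duplicates locally.
assert owner(state, 8) == w
```

In the hybrid HHDA* setting, the same idea applies hierarchically: one hash level picks the MPI process (machine), and within it the Pthreads workers share the process's tables directly instead of exchanging messages.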
In this paper, a software/hardware framework is proposed for generating uniform random numbers in parallel. Using the fast jump-ahead technique, the software can produce initial states for each generator to guarantee the independence of the different sub-streams. With support from the software, the hardware structure can be easily constructed by simply replicating the single generator. We apply the framework to parallelize the MT19937 algorithm. Experimental results show that our framework is capable of generating an arbitrary number of independent parallel random sequences while obtaining speedup roughly proportional to the number of parallel cores. Meanwhile, our framework is superior to existing architectures reported in the literature in both throughput rate and scalability. Furthermore, we implement 149 parallel instances of the MT19937 generator on a Xilinx Virtex-5 FPGA device, achieving a throughput of 42.61 M samples/s. Compared to CPU and GPU implementations, the throughput is 10.0 and 2.5 times higher, while the throughput-power efficiency achieves 167.3 and 18.1 times speedup, respectively.
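MT19937's jump-ahead is a polynomial computation over GF(2) and too long to reproduce here, but the same technique is easy to show on a small LCG, where the k-step jump has a closed form: x_{i+k} = a^k x_i + c(a^k - 1)/(a - 1) (mod m). The constants and names below are illustrative, not the paper's generator.

```python
# Sketch of fast jump-ahead on an LCG: advance k steps in O(log k) time,
# so each parallel generator can be seeded at a disjoint sub-stream start.
M, A, C = 2**31 - 1, 1103515245, 12345   # illustrative LCG parameters

def step(x):
    """One sequential step of the generator."""
    return (A * x + C) % M

def jump(x, k):
    """Jump the generator ahead k steps without iterating."""
    ak = pow(A, k, M)                            # a^k mod M, O(log k)
    # geometric sum c*(a^k - 1)/(a - 1) mod M, via modular inverse
    geo = C * (ak - 1) * pow(A - 1, -1, M) % M
    return (ak * x + geo) % M

seed = 42
x = seed
for _ in range(1000):
    x = step(x)
assert x == jump(seed, 1000)
# Generator p of P starts at jump(seed, p * substream_len), so the P
# sub-streams are guaranteed-disjoint pieces of one long sequence.
```

This is exactly the division of labor the framework describes: the software computes the jumped initial states once, and the replicated hardware generators then only need the cheap sequential `step`.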
ISBN:
(Digital) 9783319654829
ISBN:
(Print) 9783319654829; 9783319654812
This article presents massively parallel execution of the BLAST algorithm on supercomputers and HPC clusters using thousands of processors. Our work is based on optimally splitting up the set of queries run with the unmodified NCBI-BLAST package for sequence alignment. The work distribution and search management have been implemented in Java using the PCJ (Parallel Computing in Java) library. The PCJ-BLAST package is responsible for reading the sequences for comparison, splitting them up, and starting multiple NCBI-BLAST executables. We also investigated the problem of parallel I/O, and thanks to the PCJ library we deliver high-throughput execution of BLAST. The presented results show that using Java and the PCJ library we achieved very good performance and efficiency. As a result, we have significantly reduced the time required for sequence analysis. We have also demonstrated that the PCJ library can be used as an efficient tool for fast development of scalable applications.
ISBN:
(Print) 9781538610428
Parallel computing architectures like GPUs have traditionally been used to accelerate applications with dense and highly structured workloads; however, many important applications in science and engineering are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Numerical simulation of charged particle beam dynamics is one such application, where the distribution of work and data in the accurate computation of collective effects at each time step is irregular and exhibits control-flow and memory access patterns that are not readily amenable to the GPU's architecture. Algorithms with these properties tend to present both significant branch and memory divergence on GPUs, which leads to severe performance bottlenecks. We present a novel cache-aware algorithm that uses machine learning to address this problem. The algorithm presented here uses supervised learning to adaptively model and track irregular access patterns in the computation of collective effects at each time step of the simulation, anticipating future control-flow and data access patterns. Access pattern forecasts are then used to formulate runtime decisions that minimize branch and memory divergence on GPUs, thereby improving the performance of collective effects computation at a future time step based on observations from earlier time steps. Experimental results on an NVIDIA Tesla K40 GPU show that our approach is effective in maximizing data reuse, ensuring workload balance among parallel threads, and minimizing both branch and memory divergence. Further, the parallel implementation delivers up to 485 Gflops of double-precision performance, which translates to a speedup of up to 2.5x compared to the fastest known GPU implementation.
ISBN:
(Print) 9781538659304
In the emerging digital age, massive amounts of data are produced, actively or passively, by collecting data from users and the environment via applications, sensor devices, and so on. This makes it important and crucial to have the ability to process big data efficiently and effectively utilize it. The challenge in processing big data is that it has high volume, velocity, and variety, as well as veracity and value. In this paper, we present a survey of related work and prescribe our recommendations towards building Bayesian classification for big data environments. The approach is based on MapReduce and is distributed, parallel, single-pass, and incremental, which makes it feasible to deploy and execute on a cloud computing platform. We also carry out a scalability analysis showing that the proposed solution can train a Bayesian classifier to perform predictive analytics by processing big data with large volume, velocity, and variety.
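The reason naive Bayes suits a single-pass, incremental, MapReduce-style setting is that the model is nothing but counts: mappers can count disjoint partitions independently, and a reducer merges models by adding counts. A minimal sketch, with class and method names that are illustrative assumptions rather than the paper's design:

```python
# Sketch: incremental naive Bayes whose "reduce" step is count addition,
# so training is distributed, single-pass, and order-independent.
import math
from collections import defaultdict

class IncrementalNB:
    def __init__(self):
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def update(self, features, label):
        """Map step: fold one record into the counts (single pass)."""
        self.class_counts[label] += 1
        for f in features:
            self.feat_counts[label][f] += 1

    def merge(self, other):
        """Reduce step: combine two partial models by summing counts."""
        for c, n in other.class_counts.items():
            self.class_counts[c] += n
            for f, k in other.feat_counts[c].items():
                self.feat_counts[c][f] += k

    def predict(self, features):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for c, n in self.class_counts.items():
            lp = math.log(n / total)
            for f in features:                       # Laplace smoothing
                lp += math.log((self.feat_counts[c][f] + 1) / (n + 2))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb_a, nb_b = IncrementalNB(), IncrementalNB()        # two data partitions
nb_a.update({"rain"}, "wet")
nb_a.update({"sun"}, "dry")
nb_b.update({"rain"}, "wet")
nb_a.merge(nb_b)                                     # reducer sums counts
assert nb_a.predict({"rain"}) == "wet"
```

Because `merge` is associative and commutative, partial models from any number of mappers can be combined in any order, which is what makes the classifier a natural fit for MapReduce.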
ISBN:
(Print) 9781538617762
Although the advancement of cyber technologies in sensing, communication, and smart measurement devices has significantly enhanced power system security and reliability, its dependency on data communications makes it vulnerable to cyber-attacks. Coordinated false data injection (FDI) attacks manipulate power system measurements in a way that emulates the real behaviour of the system and remains unobservable, which misleads the state estimation process and may result in power outages and even system blackouts. In this paper, a robust dynamic state estimation (DSE) algorithm is proposed and implemented on the massively parallel architecture of a graphics processing unit (GPU). Numerical simulations on the IEEE 118-bus system demonstrate the efficiency and accuracy of the proposed mechanism.