ISBN: 9783319780542; 9783319780535 (print)
The paper discusses selected bi-clustering algorithms in terms of energy efficiency. We demonstrate the need for power-aware software development, elaborate on bi-clustering methods and applications, and describe the experimental computational cluster with custom-built energy measurement instrumentation.
ISBN: 9783319780238 (print)
The proceedings contain 101 papers. The special focus in this conference is on Parallel Processing and Applied Mathematics. The topics include: Performance and energy analysis of scientific workloads executing on LPSoCs; Energy efficient dynamic load balancing over multiGPU heterogeneous systems; Scheduling data gathering with maximum lateness objective; Fair scheduling in grid VOs with anticipation heuristic; A security-driven approach to online job scheduling in IaaS cloud computing systems; Dynamic load balancing algorithm for heterogeneous clusters; Multi-objective extremal optimization in processor load balancing for distributed programs; Pardis: A process calculus for parallel and distributed programming in Haskell; Towards high-performance Python; Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources; Actor model of a new functional language - Anemone; Almost optimal column-wise prefix-sum computation on the GPU; A combination of intra- and inter-place work stealing for the APGAS library; Benchmarking molecular dynamics with OpenCL on many-core architectures; Efficient language-based parallelization of computational problems using Cilk Plus; A taxonomy of task-based technologies for high-performance computing; Interoperability of GASPI and MPI in large scale scientific applications; Evaluation of the parallel performance of the Java and PCJ on the Intel KNL based systems; Fault-tolerance mechanisms for the Java parallel codes implemented with the PCJ library; Exploring graph analytics with the PCJ toolbox; Relaxing the correctness conditions on concurrent data structures for multicore CPUs. A numerical case study; Big data analytics in Java with PCJ library: Performance comparison with Hadoop; Parallel exact diagonalization approach to large molecular nanomagnets modelling.
ISBN: 9783319780535 (print)
The proceedings contain 45 papers. The special focus in this conference is on Parallel Processing and Applied Mathematics. The topics include: Performance and energy analysis of scientific workloads executing on LPSoCs; Energy efficient dynamic load balancing over multiGPU heterogeneous systems; Scheduling data gathering with maximum lateness objective; Fair scheduling in grid VOs with anticipation heuristic; A security-driven approach to online job scheduling in IaaS cloud computing systems; Dynamic load balancing algorithm for heterogeneous clusters; Multi-objective extremal optimization in processor load balancing for distributed programs; Pardis: A process calculus for parallel and distributed programming in Haskell; Towards high-performance Python; Using GPGPU accelerated interpolation algorithms for marine bathymetry processing with on-premises and cloud based computational resources; Actor model of a new functional language - Anemone; Almost optimal column-wise prefix-sum computation on the GPU; A combination of intra- and inter-place work stealing for the APGAS library; Benchmarking molecular dynamics with OpenCL on many-core architectures; Efficient language-based parallelization of computational problems using Cilk Plus; A taxonomy of task-based technologies for high-performance computing; Interoperability of GASPI and MPI in large scale scientific applications; Evaluation of the parallel performance of the Java and PCJ on the Intel KNL based systems; Fault-tolerance mechanisms for the Java parallel codes implemented with the PCJ library; Exploring graph analytics with the PCJ toolbox; Relaxing the correctness conditions on concurrent data structures for multicore CPUs. A numerical case study; Big data analytics in Java with PCJ library: Performance comparison with Hadoop; Parallel exact diagonalization approach to large molecular nanomagnets modelling.
ISBN: 9783319984469; 9783319984452 (print)
The paper describes a GEP-based ensemble classifier constructed using the stacked generalization concept. The classifier has been implemented with a view to enabling parallel processing, with the use of Spark and SWIM, an open source genetic programming library. The classifier has been validated in computational experiments carried out on benchmark datasets.
ISBN: 9781450359504 (print)
Stencil computation is one of the most important kernels in many application domains such as image processing, solving partial differential equations, and cellular automata. Many stencil kernels are complex, usually consist of multiple stages or iterations, and are often computation-bound. Such kernels are often offloaded to FPGAs to take advantage of the efficiency of dedicated hardware. However, implementing such complex kernels efficiently is not trivial, due to complicated data dependencies, the difficulty of programming FPGAs with RTL, and the large design space. In this paper we present SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs. The SODA microarchitecture minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism. The SODA automation framework takes high-level user input and generates an efficient, high-frequency dataflow implementation. This significantly reduces the difficulty of programming FPGAs efficiently for stencil algorithms. The SODA design-space exploration framework models the resource constraints and searches for the performance-optimized configuration with accurate models for post-synthesis resource utilization and on-board execution throughput. Experimental results from on-board execution using a wide range of benchmarks show up to 3.28x speedup over a 24-thread CPU, and our fully automated framework achieves better performance than manually designed state-of-the-art FPGA accelerators.
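To make the notion of a reuse buffer concrete, the following is a minimal software sketch and not the SODA microarchitecture itself. It assumes a 1D 3-point stencil with illustrative weights; the function names `stencil_3pt_streaming` and `stencil_3pt_reference` are hypothetical. Each input element is read from the stream exactly once and kept in a small sliding window until every output that needs it has been produced, which is the idea behind full data reuse.

```python
from collections import deque


def stencil_3pt_streaming(stream, weights=(0.25, 0.5, 0.25)):
    """Apply a 1D 3-point stencil to a stream, reading each input only once.

    The deque plays the role of a reuse buffer: it holds just the last
    len(weights) inputs, which is the minimum needed for this stencil.
    The weights are an assumption made for illustration.
    """
    window = deque(maxlen=len(weights))
    out = []
    for x in stream:
        window.append(x)  # the oldest element is evicted automatically
        if len(window) == len(weights):
            out.append(sum(w * v for w, v in zip(weights, window)))
    return out


def stencil_3pt_reference(data, weights=(0.25, 0.5, 0.25)):
    """Naive version with random access, re-reading each input up to three times."""
    return [
        weights[0] * data[i - 1] + weights[1] * data[i] + weights[2] * data[i + 1]
        for i in range(1, len(data) - 1)
    ]
```

Because the streaming version only needs an iterable, it also works on a generator, for example `stencil_3pt_streaming(iter([1, 2, 3, 4]))`. In a 2D stencil the same idea requires buffering roughly the rows spanned by the stencil rather than three scalars.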
ISBN: 9781538649756 (print)
In silico investigation of biological systems requires the knowledge of numerical parameters that cannot be easily measured in laboratory experiments, leading to the Parameter Estimation (PE) problem, in which the unknown parameters are automatically inferred by means of optimization algorithms exploiting the available experimental data. Here we present MS²PSO, an efficient parallel and distributed implementation of a PE method based on Particle Swarm Optimization (PSO) for the estimation of reaction constants in mathematical models of biological systems, considering as target for the estimation a set of discrete-time measurements of molecular species amounts. In particular, this PE method accounts for the availability of experimental data typically measured under different experimental conditions, by considering a multi-swarm PSO in which the best particles of the swarms can migrate. This strategy makes it possible to infer a common set of reaction constants that simultaneously fits all target data used in the PE. To tackle the PE problem efficiently, MS²PSO embeds the execution of cupSODA, a deterministic simulator that relies on Graphics Processing Units to achieve a massive parallelization of the simulations required in the fitness evaluation of particles. In addition, a further level of parallelism is realized by exploiting the Master-Slave distributed programming paradigm. We apply MS²PSO to the PE of synthetic biochemical models with 10, 20 and 30 parameters to be estimated, and compare the performance obtained with different GPUs and different configurations (i.e., numbers of processes) of the Master-Slave.
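The multi-swarm idea with migrating best particles can be sketched as follows. This is a serial toy version under stated assumptions, not the authors' implementation: one swarm per objective (standing in for one experimental condition), a ring migration topology, fixed inertia and acceleration coefficients, and plain Python callables instead of GPU simulations are all choices made here for illustration. The function name `multi_swarm_pso` is hypothetical.

```python
import random


def multi_swarm_pso(objectives, dim, bounds=(0.0, 1.0), particles=10,
                    iters=50, migrate_every=10, seed=0):
    """Toy multi-swarm PSO: one swarm per objective, best particles migrate."""
    rng = random.Random(seed)
    lo, hi = bounds
    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration coefficients
    swarms = []
    for f in objectives:
        pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(particles)]
        swarms.append({
            "f": f,
            "pos": pos,
            "vel": [[0.0] * dim for _ in range(particles)],
            "pbest": [p[:] for p in pos],
            "pfit": [f(p) for p in pos],
        })

    def best_index(s):
        return min(range(particles), key=s["pfit"].__getitem__)

    for it in range(1, iters + 1):
        for s in swarms:
            gbest = s["pbest"][best_index(s)][:]
            for i in range(particles):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    s["vel"][i][d] = (w * s["vel"][i][d]
                                      + c1 * r1 * (s["pbest"][i][d] - s["pos"][i][d])
                                      + c2 * r2 * (gbest[d] - s["pos"][i][d]))
                    # clamp the new position to the search bounds
                    s["pos"][i][d] = min(hi, max(lo, s["pos"][i][d] + s["vel"][i][d]))
                fit = s["f"](s["pos"][i])
                if fit < s["pfit"][i]:
                    s["pfit"][i] = fit
                    s["pbest"][i] = s["pos"][i][:]
        if it % migrate_every == 0:
            # ring migration (assumed): swarm k receives the best of swarm k-1
            bests = [s["pbest"][best_index(s)][:] for s in swarms]
            for k, s in enumerate(swarms):
                migrant = bests[k - 1]
                worst = max(range(particles), key=s["pfit"].__getitem__)
                s["pos"][worst] = migrant[:]
                s["vel"][worst] = [0.0] * dim
                mfit = s["f"](migrant)
                if mfit < s["pfit"][worst]:
                    s["pfit"][worst] = mfit
                    s["pbest"][worst] = migrant[:]

    # return the personal best that fits all objectives best in total
    candidates = [p for s in swarms for p in s["pbest"]]
    return min(candidates, key=lambda p: sum(f(p) for f in objectives))
```

A usage example with two made-up objectives that share the optimum (0.3, 0.6): `multi_swarm_pso([lambda k: (2*k[0] + k[1] - 1.2)**2, lambda k: (k[0] - k[1] + 0.3)**2], dim=2)`. Each objective alone is minimized along a whole line, so only a parameter set that fits both conditions is a good common solution, which is what migration is meant to encourage.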
ISBN: 9783319780245; 9783319780238 (print)
In our previous papers [12,13], we proposed the parallel realization of the Deep Belief Network (DBN). This research confirmed the potential usefulness of the first generation of the Intel MIC architecture for implementing DBN and similar algorithms. In this work, we investigate how the Intel MIC and CPU platforms can be applied to efficiently implement the complete learning process using DBNs with layers corresponding to Restricted Boltzmann Machines. The focus is on the new generation of Intel MIC devices known as Knights Landing. Unlike the previous generation, called Knights Corner, they are delivered not as coprocessors, but as standalone processors. The learning procedure is based on the matrix approach, where learning samples are grouped into packages and represented as matrices. We study the possible ways of improving the performance, taking into account features of the Knights Landing architecture and parameters of the learning algorithm. In particular, the influence of the package size on the accuracy of learning, as well as on the performance of computations, is investigated using a conventional CPU and Intel Xeon Phi. The performance advantages of Knights Landing over Knights Corner are presented and discussed.
ISBN: 9781510872219 (print)
We describe initial work on an extension of the Kaldi toolkit that supports weighted finite-state transducer (WFST) decoding on Graphics Processing Units (GPUs). We implement token recombination as an atomic GPU operation in order to fully parallelize the Viterbi beam search, and propose a dynamic load balancing strategy for more efficient token passing scheduling among GPU threads. We also redesign the exact lattice generation and lattice pruning algorithms for better utilization of the GPUs. Experiments on the Switchboard corpus show that the proposed method achieves identical 1-best results and lattice quality in recognition and confidence measure tasks, while running 3 to 15 times faster than the single-process Kaldi decoder. The above results are reported on different GPU architectures. Additionally, we obtain a 46-fold speedup with sequence parallelism and multi-process service (MPS) on the GPU.
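One common way to turn token recombination into a single atomic operation is to pack the path cost and a back-pointer into one integer, so that a plain minimum selects the cheapest token and its origin at once. The sketch below illustrates that idea serially in Python. It is an assumption about how such an operation could look, not the Kaldi extension's code: the helpers `pack`, `unpack` and `expand_frame` are hypothetical, costs are assumed non-negative, and beam pruning is omitted.

```python
import struct


def pack(cost, arc_id):
    """Pack a non-negative float32 cost and an arc id into one integer.

    Bit patterns of non-negative IEEE-754 floats sort in the same order
    as the floats, so comparing packed values compares cost first.
    """
    bits = struct.unpack("<I", struct.pack("<f", cost))[0]
    return (bits << 32) | arc_id


def unpack(packed):
    """Recover (cost, arc_id) from a packed value."""
    cost = struct.unpack("<f", struct.pack("<I", packed >> 32))[0]
    return cost, packed & 0xFFFFFFFF


def expand_frame(tokens, arcs):
    """Expand active tokens along arcs, keeping the cheapest token per state.

    tokens: {state: cost}; arcs: list of (src, dst, weight).
    Returns {dst: (cost, arc_id)} for every reached destination state.
    """
    best = {}
    for arc_id, (src, dst, weight) in enumerate(arcs):
        if src not in tokens:
            continue
        cand = pack(tokens[src] + weight, arc_id)
        # On a GPU, this update would be one atomic min on best[dst],
        # with each arc handled by a different thread.
        best[dst] = min(best.get(dst, cand), cand)
    return {dst: unpack(p) for dst, p in best.items()}
```

For example, with `tokens = {0: 0.0, 1: 1.0}` and `arcs = [(0, 2, 2.5), (1, 2, 1.0), (0, 3, 0.5)]`, state 2 is reached by two arcs and keeps the cheaper one, so the result is `{2: (2.0, 1), 3: (0.5, 2)}`.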
ISBN: 9781728109749 (digital); 9781728109749 (print)
A Coarse-Grained Reconfigurable Architecture called RASP2.0 is proposed in this paper for communication baseband signal processing. Based on the pipeline bubbles theory, the reconfigurable data path is divided into the data flow between processing elements and the data interaction between the reconfigurable arrays and the memory structure. To reduce the data transmission delay, the data flow features are summarized based on the locality and lifetime of data. By employing a parallel memory structure combined with the DLT-based data updating strategy, the access performance is improved by 33% on average compared with RASP1.0. As a result, the reconfigurable system offers better performance and flexibility than other similar platforms.
In this paper, the directivity of the circular array is analyzed, and a real-time frequency-domain beamforming algorithm for the circular array, similar to a parallel FIR filter structure, is proposed by using the characteristic ...