ISBN: (print) 9783642281501
The proceedings contain 77 papers. The topics discussed include: on aggressive early deflation in parallel variants of the QR algorithm; a model for efficient onboard actualization of an instrumental cyclogram for the Mars MetNet mission on a public cloud infrastructure; distributed Java programs initial mapping based on extremal optimization; global asynchronous parallel program control for multicore processors; streaming model computation of the FDTD problem; numerical investigation of the cumulant expansion for Fourier path integrals; simulated annealing with coarse graining and distributed computing; high performance computing techniques for scaling image analysis workflows; parallel computation of bivariate polynomial resultants on graphics processing units; an interval version of the Crank-Nicolson method - the first approach; and an interval finite difference method of Crank-Nicolson type for solving the one-dimensional heat conduction equation with mixed boundary conditions.
ISBN: (print) 9783642281440
The proceedings contain 77 papers. The topics discussed include: on aggressive early deflation in parallel variants of the QR algorithm; a model for efficient onboard actualization of an instrumental cyclogram for the Mars MetNet mission on a public cloud infrastructure; distributed Java programs initial mapping based on extremal optimization; global asynchronous parallel program control for multicore processors; streaming model computation of the FDTD problem; numerical investigation of the cumulant expansion for Fourier path integrals; simulated annealing with coarse graining and distributed computing; high performance computing techniques for scaling image analysis workflows; parallel computation of bivariate polynomial resultants on graphics processing units; an interval version of the Crank-Nicolson method - the first approach; and an interval finite difference method of Crank-Nicolson type for solving the one-dimensional heat conduction equation with mixed boundary conditions.
ISBN: (digital) 9783642297403
ISBN: (print) 9783642297403; 9783642297397
In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. Modern Chip Multiprocessor (CMP) designs are mainly based on a shared-bus communication architecture. As the number of cores increases, this architecture suffers from high communication delays. Therefore, NoC-based architectures have been proposed. The N-Body problem is the classical problem of approximating the motion of bodies. Two methods, namely Barnes-Hut (Barnes) and Fast Multipole (FMM), have been developed for fast simulation. The two algorithms have been implemented and studied on conventional computer systems and Graphics Processing Units (GPUs). However, the evaluation of N-Body methods on a NoC platform, a promising unconventional multicore architecture, has not been well addressed. We define a NoC model based on state-of-the-art systems. Evaluation results are presented using a cycle-accurate full-system simulator. Experiments show that Barnes scales better (53.7x for Barnes and 36.6x for FMM with 64 processing elements) and requires less cache than FMM. However, we observe hot-spot traffic in Barnes. Our analysis and experimental results provide a guideline for studying N-Body methods on a NoC platform.
Most Data Warehouses (DW) are stored in Relational Database Management Systems (RDBMS) using a star-schema model. While this model yields a trade-off between performance and storage requirements, huge data warehouses ...
Nowadays, not only CPUs but also GPUs follow the trend toward multi-core processors. Parallel processing presents both an opportunity and a challenge. To explicitly parallelize the software by ...
ISBN: (print) 9781467308595
A new transient analysis method is proposed for general linear dynamic networks, such as on-chip power grid networks, using a hybrid GPU-based multicore platform. The new method, called ETBR-GPU, first performs sampling-like reduction on the original circuit matrices, where the frequency-domain responses at different frequency points can be calculated in parallel on a multicore CPU. After the reduction, the reduced circuit matrices, which are dense but well suited to the GPU's data-parallel computing, are simulated on the GPU. Such a reduction-based simulation technique is amenable to parallelization on hybrid multicore and GPU platforms, where both coarse-grained task-level and fine-grained lightweight-thread-level parallelism can be exploited. The proposed method is very general: it can analyze any linear network with complicated structures and macromodels, and it does not assume any structural properties in order to build problem-specific preconditioners, as many iterative solvers do. Experiments show that the new method achieves about one or two orders of magnitude speedup compared to the general LU-based simulation method on some recently published IBM power grid benchmark circuits.
Computer simulations with the first-principle (kinetic) model are essential for studying multi-scale processes in space plasma. We develop numerical schemes for Vlasov simulations for practical use on currently-existi...
In this paper, a new parallel phase algorithm for parallel turbo decoders is proposed. The traditional sliding-window turbo algorithm exchanges extrinsic information phase by phase, which induces long decoding latency. th...
ISBN: (print) 9781467323703; 9781467323727
Using passwords to verify a user's identity is the most widely deployed method for electronic authentication. When system administrators need to recover lost passwords or test accounts for easily guessable passwords, it can require millions of hash function and string comparison operations. These operations can be computationally expensive but are easily parallelizable because each password can be tested independently. Therefore, using high performance computing (HPC) can greatly reduce the time required to perform password recovery. Due to the high level of fine-grained parallelism of this type of problem, GPU computing using Compute Unified Device Architecture (CUDA) can be used to further improve performance. The scale of HPC can be further increased through the use of multiple GPUs, but this requires communication between the GPU devices and can reduce the overall performance due to increased communication latency. In this work, a well-established HPC framework, Message Passing Interface (MPI), was used to minimize latency and handle the communication between the devices. This allowed for a coarse-grained division of the problem using MPI, where each device applies a fine-grained division of the problem using CUDA to perform the actual calculations. This paper describes three dictionary-based password recovery algorithms that use both MPI and CUDA. In this approach, the hashed values of known words are computed and compared with the hash values of unknown user passwords. The algorithms differ in GPU memory utilization and in how the data is divided and distributed among the MPI nodes and GPU devices. A divided dictionary algorithm split the dictionary of potential passwords over the GPUs and copied the password database to each GPU. A divided password database algorithm split the password database and copied the potential passwords. A minimal memory algorithm split the password database and sequentially processed individual passwords on the GPUs. The div
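The divided dictionary idea described in this abstract can be illustrated with a minimal, CPU-only sketch. It is not the paper's MPI/CUDA implementation: the workers are simulated by a plain loop, SHA-256 is an assumed hash function (the abstract does not name one), and all function names (`sha256_hex`, `split_evenly`, `worker_search`, `divided_dictionary`) are hypothetical.

```python
import hashlib


def sha256_hex(word: str) -> str:
    """Hash a candidate password (SHA-256 is an assumption, not from the source)."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def worker_search(dictionary_chunk, password_hashes):
    """One simulated worker: hash its share of the dictionary and
    compare each digest against the full set of stored hashes."""
    found = {}
    for word in dictionary_chunk:
        digest = sha256_hex(word)
        if digest in password_hashes:
            found[digest] = word
    return found


def divided_dictionary(dictionary, password_hashes, num_workers):
    """Split the dictionary across workers; every worker sees the
    whole password database (here, a shared set of hashes)."""
    targets = set(password_hashes)
    recovered = {}
    for chunk in split_evenly(dictionary, num_workers):
        recovered.update(worker_search(chunk, targets))
    return recovered


dictionary = ["alpha", "bravo", "charlie", "delta", "echo"]
stored = [sha256_hex("charlie"), sha256_hex("echo"), sha256_hex("not-in-dictionary")]
recovered = divided_dictionary(dictionary, stored, num_workers=2)
```

In a real deployment each call to `worker_search` would presumably run on a separate MPI rank with the inner loop offloaded to a GPU; the sketch only shows how the data is partitioned.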
ISBN: (print) 9783642281501; 9783642281518
Current multicore system technology enables implementation of particular program functions, such as library operations, special functions generation, and optimized data search, using dedicated computing units to increase overall program performance. A parallel system can be equipped with a set of such units to speed up execution of applications that use such functionality. To properly model and schedule programs that use such functions running on dedicated hardware, a proper program representation must be introduced. The paper presents a special scheduling algorithm for programs represented as graphs, based on a modified ETF heuristic. The algorithm is meant for a modular architecture composed of many CMP modules interconnected by a global data communication network. The assumed architecture of dedicated CMP modules enables personalized, fully synchronous program execution, which uses communication on the fly to strongly reduce inter-core communication overheads.
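For orientation, the following is a minimal sketch of the basic Earliest Task First (ETF) list-scheduling idea on a task graph, not the modified heuristic from the paper. It assumes identical processors, an acyclic graph, and communication cost paid only when a predecessor ran on a different processor; ties are broken by task name for simplicity, whereas published ETF variants typically use a priority such as the static level. The function name `etf_schedule` and the example graph are hypothetical.

```python
def etf_schedule(durations, edges, num_procs):
    """Greedy ETF sketch.

    durations: {task: execution_time}
    edges:     {(pred, succ): communication_cost}
    Returns    {task: (processor, start, finish)}.
    Assumes the task graph is acyclic.
    """
    preds = {task: [] for task in durations}
    for (u, v), cost in edges.items():
        preds[v].append((u, cost))

    proc_free = [0] * num_procs   # time at which each processor becomes idle
    placed = {}

    while len(placed) < len(durations):
        best = None
        for task in sorted(durations):
            if task in placed:
                continue
            if any(u not in placed for u, _ in preds[task]):
                continue  # not ready: some predecessor is unscheduled
            for proc in range(num_procs):
                # Data arrives when every predecessor has finished and,
                # if it ran elsewhere, its message has been transferred.
                data_ready = 0
                for u, cost in preds[task]:
                    u_proc, _, u_finish = placed[u]
                    arrival = u_finish + (0 if u_proc == proc else cost)
                    data_ready = max(data_ready, arrival)
                start = max(proc_free[proc], data_ready)
                candidate = (start, task, proc)
                if best is None or candidate < best:
                    best = candidate
        # Commit the (task, processor) pair with the earliest start time.
        start, task, proc = best
        finish = start + durations[task]
        placed[task] = (proc, start, finish)
        proc_free[proc] = finish
    return placed


# Hypothetical diamond-shaped task graph with unit communication costs.
durations = {"a": 2, "b": 3, "c": 2, "d": 1}
edges = {("a", "b"): 1, ("a", "c"): 1, ("b", "d"): 1, ("c", "d"): 1}
schedule = etf_schedule(durations, edges, num_procs=2)
makespan = max(finish for _, _, finish in schedule.values())
```

On this example the sketch keeps `b` on the same processor as `a` to avoid the transfer, and moves `c` to the idle second processor because it can start there earlier.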