This paper presents two methods for solving a second-order partial differential equation, with application to the well-known Poisson equation. These methods are aimed at building a high-speed hardware solver. The solutions presented will form part of a hardware device simulator called "Virtual Device". We present simulation results comparing the two methods for solving this equation. We start with an iterative method (the Gauss-Seidel method) and end with a direct method (the LU method).
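To make the contrast between the two solver families concrete, here is a minimal software sketch, not the paper's hardware design. It assumes a 1D Poisson problem -u'' = f on (0, 1) with zero boundary values, discretised by central differences. The function names and the choice of a tridiagonal LU solve (the Thomas algorithm) are illustrative assumptions.

```python
def gauss_seidel_poisson(f, h, sweeps):
    """Iterative solve of (2*u[i] - u[i-1] - u[i+1]) / h**2 = f[i], zero boundaries."""
    n = len(f)
    u = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            # Gauss-Seidel uses the freshest neighbour values in place.
            u[i] = 0.5 * (h * h * f[i] + left + right)
    return u


def lu_tridiagonal_poisson(f, h):
    """Direct solve: LU elimination of the tridiagonal matrix, then back substitution."""
    n = len(f)
    diag = [2.0] * n                 # main diagonal; off-diagonals are -1
    rhs = [h * h * fi for fi in f]
    for i in range(1, n):
        m = -1.0 / diag[i - 1]       # multiplier stored in the L factor
        diag[i] += m                 # equals 2 - 1/diag[i-1], the pivot of U
        rhs[i] -= m * rhs[i - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return u


if __name__ == "__main__":
    n = 9
    h = 1.0 / (n + 1)
    f = [2.0] * n                    # exact solution is u(x) = x * (1 - x)
    direct = lu_tridiagonal_poisson(f, h)
    iterative = gauss_seidel_poisson(f, h, sweeps=500)
    print(max(abs(a - b) for a, b in zip(direct, iterative)))
```

The iterative method needs many cheap sweeps with only nearest-neighbour data, while the direct method finishes in a fixed number of steps with a sequential dependency chain. That trade-off is the kind that matters when mapping a solver onto hardware.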
The highly specialised Neural Accelerator Board (NAB) and the Shiva (a reconfigurable multiprocessor) are two architectures designed to augment general-purpose workstations. The NAB uses custom VLSI components to perform weighted 10-bit fixed-point multiply/accumulate operations at a total rate of 20 giga-operations per second. The Shiva uses Intel i860 microprocessors to achieve a good balance of floating-point, graphics, and integer performance. The two can be combined into a heterogeneous, reconfigurable multiprocessor, which has applications in image analysis and radar signal processing. This paper outlines the benefits of heterogeneity and reconfigurability. The architectures of the NAB and the Shiva are described, with details of the proposed applications.
Deep Affine Normalizing Flows are efficient and powerful models for high-dimensional density estimation and sample generation. Yet little is known about how they succeed in approximating complex distributions, given t...
While GPUs are becoming a compelling acceleration solution for a range of scientific applications, most existing work on climate models has achieved only limited speedup. This is due to partial porting of the huge code base and the inherently memory-bound nature of these models. In this work, we design and implement a customized GPU-based acceleration of the Princeton Ocean Model (gpuPOM) based on mpiPOM, which is one of the parallel versions of the Princeton Ocean Model. Targeting Nvidia's state-of-the-art GPU architectures (K20X and K40m), we rewrite the full mpiPOM model from the original Fortran version into a CUDA-C version. We present the GPU acceleration methods used in gpuPOM, especially the techniques that ease its memory-bound problem through better use of the GPU's memory hierarchy. The experimental results indicate that gpuPOM with one K40m GPU achieves a 6.3-fold to 16.7-fold speedup over different Intel multi-core CPUs, and with one K20X GPU a 5.8-fold to 15.5-fold speedup.
Growing bandwidth demand in the Internet requires new algorithms and architectures to provide a high degree of QoS. To complicate the problem of Traffic Engineering further, real-time data processing requires higher priority than non-real-time data processing. In this paper, we present an effective solution for improving the QoS of audio and video packets in MPLS networks under real-time traffic conditions. The contribution of this work is twofold. First, we investigate the impact of increased traffic on QoS parameters under heavy loading conditions. Second, we propose an efficient routing mechanism based on active networking concepts [1] to satisfy the QoS requirements of audio and video packets.
We have designed data list processing for multicore-GPU platforms and significantly improved the performance of both numerical and symbolic applications. For the latter, a novel aspect of our design was the management and processing of new data dynamically generated within GPUs. This paper presents various optimisations to our first design [1], aimed at making greater use of the GPU by reducing communication between the host (a multicore) and the GPU, in order to improve performance further. We present experimental results for three applications with different granularities and access patterns. Performance improved again, significantly in some cases. However, using multicore-GPU platforms efficiently may involve complex changes to software.
Conventional Brownian dynamics (BD) simulations with hydrodynamic interactions utilize 3n×3n dense mobility matrices, where n is the number of simulated particles. This limits the size of BD simulations, particularly on accelerators with low memory capacities. In this paper, we formulate a matrix-free algorithm for BD simulations, allowing us to scale to very large numbers of particles while also being efficient for small numbers of particles. We discuss the implementation of this method for multicore and many-core architectures, as well as a hybrid implementation that splits the workload between CPUs and Intel Xeon Phi coprocessors. For 10,000 particles, the limit of the conventional algorithm on a 32 GB system, the matrix-free algorithm is 35 times faster than the conventional matrix-based algorithm. We show numerical tests for the matrix-free algorithm with up to 500,000 particles. For large systems, our hybrid implementation using two Intel Xeon Phi coprocessors achieves a speedup of over 3.5x compared to the CPU-only case. Our optimizations also make the matrix-free algorithm faster than the conventional dense matrix algorithm on as few as 1000 particles.
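The memory pressure of a dense 3n×3n matrix can be illustrated with a short calculation. This sketch assumes double-precision entries (8 bytes each) and counts only the mobility matrix itself. The abstract states neither the storage format nor the additional working storage a conventional algorithm needs, so the helper name and parameters below are hypothetical.

```python
def dense_mobility_bytes(n_particles, bytes_per_entry=8):
    """Bytes needed to store one dense 3n x 3n matrix (assumed 8-byte entries)."""
    dim = 3 * n_particles
    return dim * dim * bytes_per_entry


if __name__ == "__main__":
    for n in (1_000, 10_000, 500_000):
        gigabytes = dense_mobility_bytes(n) / 1e9
        print(f"n = {n:>7}: {gigabytes:,.3f} GB")
```

Storage grows quadratically with n, which is why a matrix-free formulation is needed to reach particle counts far beyond what a dense matrix allows.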
GPU hardware architectures have evolved into a suitable platform for the hardware acceleration of complex computing tasks. Stereo vision is one such task where acceleration is desirable for robotic and automotive systems. Much research has been invested in developing stereo vision algorithms with increased quality, but real-time implementations are still lacking. In this work we focus on creating a real-time dense stereo reconstruction system. We selected the Semi-Global Matching method as the basis of our system due to its high quality and reduced computational complexity. The Census transform is selected as the matching metric because our results show that it can reduce the matching errors for traffic images compared to classical solutions. We also present two modifications to the original Semi-Global algorithm to improve the sub-pixel accuracy and the execution time. The system was implemented and evaluated on a current-generation GPU, with a running time of 19 ms for images with a resolution of 512×383.
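As a rough illustration of the Census transform as a matching metric, the sketch below encodes a 3×3 neighbourhood as a bit string and compares two encodings by Hamming distance. The window size, bit ordering, and function names are assumptions for illustration. The abstract does not specify the authors' parameters, and this is plain Python rather than a GPU implementation.

```python
def census_3x3(img, r, c):
    """Census signature of pixel (r, c): one bit per neighbour, set if neighbour < centre."""
    centre = img[r][c]
    bits = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # the centre pixel is not compared with itself
            bits = (bits << 1) | (1 if img[r + dr][c + dc] < centre else 0)
    return bits


def hamming(a, b):
    """Matching cost: number of differing bits between two census signatures."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    left = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    # Same patch with a uniform brightness offset, as between two cameras.
    right = [[v + 50 for v in row] for row in left]
    print(hamming(census_3x3(left, 1, 1), census_3x3(right, 1, 1)))
```

Because the signature depends only on the ordering of intensities, a uniform brightness change between the two views leaves the cost unchanged. This is one commonly cited reason for preferring Census over raw intensity differences.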
The Internet fundamentally changes the model of software development, the demands on software quality, and the process of software resource sharing. An Internet-based environment for trustworthy software production is recognized as a key topic of software engineering in both academia and the software industry. In this paper, we introduce the concepts and models of trustworthy software that guide the design of the Trustie environment. Trustie supports the sharing of trustworthy software components through an evolving software repository, and supports collaborative software development on a customizable development platform powered by a software production line framework. Finally, layered research and application practices based on Trustie provide preliminary evidence of the effectiveness and promise of this environment.
The most popular vectorization methods for SIMD extensions extract parallelism from programs by relying on the compiler's data dependence analysis. However, data dependence analysis cannot deal with non-structured control flow statements. Therefore, current compilers are extremely limited in their ability to vectorize these statements. We present a vectorization method of the export branch for SIMD extensions, which can automatically and effectively vectorize the export branch within the vector length. Performance test results show that this method can both fully ensure the semantic correctness of the control flow and exploit the parallelism of the data flow.