The Gaia Astrometric Verification Unit-Global Sphere Reconstruction (AVU-GSR) parallel Solver aims to find the astrometric parameters for similar to 10(8) stars in the Milky Way, the attitude and the instrumental spec...
详细信息
The Gaia Astrometric Verification Unit-Global Sphere Reconstruction (AVU-GSR) parallel Solver aims to find the astrometric parameters for similar to 10(8) stars in the Milky Way, the attitude and the instrumental specifications of the Gaia satellite, and the global parameter gamma of the post Newtonian formalism. The code iteratively solves a system of linear equations, A x x = b, where the coefficient matrix A is large (similar to 10(1)1x10(8) elements) and sparse. To solve this system of equations, the code exploits a hybrid implementation of the iterative PC-LSQR algorithm, where the computation related to different horizontal portions of the coefficient matrix is assigned to separate MPI processes. In the original code, each matrix portion is further parallelized over the OpenMP threads. To further improve the code performance, we ported the application to the GPU, replacing the OpenMP parallelization language with OpenACC. In this port, similar to 95% of the data is copied from the host to the device at the beginning of the entire cycle of iterations, making the code compute bound rather than data-transfer bound. The OpenACC code presents a speedup of similar to 1.5 over the OpenMP version but further optimizations are in progress to obtain higher gains. The code runs on multiple GPUs and it was tested on the CINECA supercomputer Marconi100, in anticipation of a port to the pre-exascale system Leonardo, that will be installed at CINECA in 2022.
We describe the development of a data-level, massivelyparallel software system for the solution of multicommodity network flow problems. Using a smooth linear-quadratic penalty (LQP) algorithm we transform the multic...
详细信息
We describe the development of a data-level, massivelyparallel software system for the solution of multicommodity network flow problems. Using a smooth linear-quadratic penalty (LQP) algorithm we transform the multicommodity network flow problem into a sequence of independent min-cost network flow subproblems. The solution of these problems is coordinated via a simple, dense, nonlinear master program to obtain a solution that is feasible within some user-specified tolerance to the original multicommodity network flow problem. Particular emphasis is placed on the mapping of both the subproblem and master problem data to the processing elements of a massivelyparallel computer, the Connection Machine CM-2. As a result of this design we can solve large and sparse optimization problems on current SIMD massivelyparallel architectures. Details of the implementation are reported, together with summary computational results with a set of test problems drawn from a Military Airlift Command application.
We present edge-friend, a data structure for quad meshes with access to neighborhood information required for Catmull-Clark subdivision surface refinement. Edge-friend enables efficient real-time subdivision surface r...
详细信息
We present edge-friend, a data structure for quad meshes with access to neighborhood information required for Catmull-Clark subdivision surface refinement. Edge-friend enables efficient real-time subdivision surface rendering. In particular, the resulting algorithm is deterministic, does not require hardware support for atomic floating-point arithmetic, and is optimized for efficient rendering on GPUs. Edge-friend exploits that after one subdivision step, two edges can be uniquely and implicitly assigned to each quad. Additionally, edge-friend is a compact data structure, adding little overhead. Our algorithm is simple to implement in a single compute shader kernel, and requires minimal synchronization which makes it particularly suited for asynchronous execution. We easily extend our kernel to support relevant Catmull-Clark subdivision surface features, including semi-smooth creases, boundaries, animation and attribute interpolation. In case of topology changes, our data structure requires little preprocessing, making it amendable for a variety of applications, including real-time editing and animations. Our method can process and render billions of triangles per second on modern GPUs. For a sample mesh, our algorithm generates and renders 2.9 million triangles in 0.58ms on an AMD Radeon RX 7900 XTX GPU.
暂无评论