is there a way to pass and receive gpu pointers to linear solvers? #113

ImBlackMagic opened this issue Jan 5, 2024 · 3 comments

@ImBlackMagic
Hello!

I'm working on a project that needs a large sparse linear system solved on each iteration of a simulation; the solve takes about 90% of each iteration's time. The matrix is about 10,000 x 10,000 with 126k nonzeros, and it is unsymmetric.

The final objective of the project is to have everything running on the GPU (CUDA kernels for everything that isn't the linear solver). According to STRUMPACK's documentation, the interface receives and returns host memory pointers, so I would need to copy from GPU to CPU, and then STRUMPACK internally would make the transfer twice more: CPU -> GPU to compute, and GPU -> CPU to return the data.
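To make the overhead concrete, here's a minimal sketch of the round trip I mean (the device buffers `d_b`/`d_x` and the host staging buffers `h_b`/`h_x` are my own names; the `solve` call follows the CSR examples in the docs, so take the exact signature with a grain of salt):

```cpp
#include <cuda_runtime.h>
#include "StrumpackSparseSolver.hpp"

// Called once per simulation iteration: the RHS is produced on the GPU,
// but the solver interface wants host pointers, so we bounce through
// host memory in both directions.
void solve_step(strumpack::StrumpackSparseSolver<double,int>& sp,
                const double* d_b, double* d_x,   // device buffers
                double* h_b, double* h_x, int n)  // host staging buffers
{
  // GPU -> CPU: bring the right-hand side down to the host
  cudaMemcpy(h_b, d_b, n * sizeof(double), cudaMemcpyDeviceToHost);
  // STRUMPACK internally moves the data back to the GPU to compute,
  // then returns the solution in host memory
  sp.solve(h_b, h_x);
  // CPU -> GPU: push the solution back up for the CUDA kernels
  cudaMemcpy(d_x, h_x, n * sizeof(double), cudaMemcpyHostToDevice);
}
```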

So, I have a couple of questions:

  • Is there a way to avoid this overhead and directly pass and receive GPU buffers?
  • Does STRUMPACK use a hybrid CPU/GPU approach to solve sparse systems (which would most probably render the first question moot)?

Thanks in advance!

PS: I spent several hours staring at the source code to no avail; I guess I have yet to attain higher arcane powers.

@pghysels
Owner

pghysels commented Jan 7, 2024

Hi

No, at the moment all the input is from host memory.

The main code for the GPU factorization is in src/sparse/fronts/FrontalMatrixGPU.cpp, or in src/sparse/fronts/FrontalMatrixMAGMA.cpp if MAGMA is enabled.
For now, only the MAGMA code does the triangular solve on the GPU.

There are a number of steps in the code that are still done on the CPU, such as the application of the permutations (for fill reduction and static pivoting), and the scaling. This is why the input is still required on the CPU.
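Schematically (my sketch of the algebra, not the exact code path): with row/column scalings $D_r, D_c$, a column permutation $Q_c$ from static pivoting, and a fill-reducing permutation $P$, the factorization effectively operates on something like

$$\hat{A} = P \, (D_r A D_c Q_c) \, P^T,$$

so the right-hand side has to be scaled and permuted before the triangular solves, and the solution unpermuted and unscaled afterwards; those vector operations currently run on the host.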

We will try to move everything to GPU in the future.

Pieter

@ImBlackMagic
Author

Thanks for your answer.

If I understand your reply correctly, do I also need to install MAGMA to run as many steps as possible on the GPU? I didn't install MAGMA in my current setup, but it is still faster than what I had running previously.

Thanks again for your answer!

@pghysels
Owner

pghysels commented Jan 8, 2024

Yes, we have two implementations for the numerical factorization, one using just the CUDA libraries, and another using MAGMA. For the factorization, the MAGMA code is slightly faster than the CUDA code.
The MAGMA implementation also does the triangular solve on the GPU, if the triangular factors fit on the device (and when using a single MPI rank). If you do not configure with MAGMA, the triangular solve phase is always done on the CPU.
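For reference, here is roughly what a GPU-enabled solve looks like from the user side (a sketch, assuming a build configured with CUDA and, optionally, MAGMA, and assuming the SPOptions::enable_gpu() toggle; n, row_ptr, col_ind, values, b, and x stand in for your CSR system, and all pointers are still host memory):

```cpp
#include "StrumpackSparseSolver.hpp"

// Minimal sketch: set up the solver, factor on the GPU, then solve.
void solve_on_gpu(int n, const int* row_ptr, const int* col_ind,
                  const double* values, const double* b, double* x) {
  strumpack::StrumpackSparseSolver<double,int> sp;
  sp.options().enable_gpu();   // use the GPU back end if compiled in
  sp.set_csr_matrix(n, row_ptr, col_ind, values,
                    /*symmetric_pattern=*/false);
  sp.reorder();                // fill-reducing ordering (runs on the CPU)
  sp.factor();                 // numerical factorization on the GPU
  sp.solve(b, x);              // triangular solve: on the GPU only with MAGMA
}
```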

I think in the future we will drop the non-MAGMA implementation, since we are relying more and more on MAGMA.
