STRUMPACK with > 1 GPU per node #108

Open
sebastiangrimberg opened this issue Sep 28, 2023 · 11 comments

@sebastiangrimberg

I'm having some issues running STRUMPACK with more than a single GPU per node; by comparison, SuperLU_DIST is fine. This is with CUDA. In particular, if I run with 2 MPI processes (not CUDA-aware), where each process is assigned in my own code to device 0 or 1 respectively, I get:

CUDA assertion failed: invalid resource handle /build/extern/STRUMPACK/src/dense/CUDAWrapper.cu 112

Is there anything special we need to do when building STRUMPACK + MPI with CUDA support?

@pghysels
Owner

You should run 1 MPI process per GPU (sounds like that is what you are trying to do).
On systems like Perlmutter, Frontier, etc., you can use the job scheduler to make sure that each MPI process only sees a single GPU device.

But if an MPI process sees multiple GPU devices, STRUMPACK does:
cudaSetDevice(rank % devs);
where rank is the MPI rank and devs is the number of GPUs, as returned by cudaGetDeviceCount.
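
For illustration, a minimal sketch of that rank-to-device mapping (the helper name is hypothetical, not STRUMPACK's actual internals):

```cpp
// Hypothetical sketch of the rank % device-count mapping described above;
// not STRUMPACK's actual code.
#include <mpi.h>
#include <cuda_runtime.h>

void bind_rank_to_device(MPI_Comm comm) {
  int rank = 0, devs = 0;
  MPI_Comm_rank(comm, &rank);
  cudaGetDeviceCount(&devs);
  if (devs > 0)
    cudaSetDevice(rank % devs);  // each MPI rank selects one of the visible GPUs
}
```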

What exactly do you mean by "where each process is assigned in my own code to devices 0 and 1"?

@sebastiangrimberg
Author

> What exactly do you mean by "where each process is assigned in my own code to devices 0 and 1"?

I mean that I am calling cudaSetDevice(device_id); in my application code, before (and after) calling STRUMPACK, since I'm doing some matrix assembly on the GPU.

I'm not sure why this isn't working, though. The cudaSetDevice call inside STRUMPACK should be selecting exactly the same GPU device I have already set in my code beforehand, so it shouldn't make a difference in this circumstance.
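
Roughly the pattern in my application code (a simplified sketch; local_rank and the solve call are placeholders, not the actual code):

```cpp
// Simplified sketch of the application-side pattern described above;
// local_rank and the solver call are placeholders.
#include <cuda_runtime.h>

void assemble_and_solve(int local_rank, int num_devices) {
  const int device_id = local_rank % num_devices;
  cudaSetDevice(device_id);   // device used for matrix assembly
  // ... assemble the matrix on device_id ...
  // ... call the STRUMPACK solver here ...
  cudaSetDevice(device_id);   // re-select the same device after the solve
}
```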

@pghysels
Owner

Could it be due to MAGMA? Can you try without MAGMA?

I think the code at this line is not executed when running without MAGMA:
CUDA assertion failed: invalid resource handle /build/extern/STRUMPACK/src/dense/CUDAWrapper.cu 112

@sebastiangrimberg
Author

Hm, odd. Building without MAGMA, the error message I get is:

CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost)) failed with error:
 --> an illegal memory access was encountered

which occurs outside of the STRUMPACK solve. Again, everything is totally fine when using SuperLU_DIST (which is also using the GPU) or a CPU-based direct solver. It seems like maybe STRUMPACK is corrupting memory somewhere?

@pghysels
Owner

I'll see if I can find a machine with multiple GPUs per node.
Maybe I can reproduce it on Perlmutter.

@sebastiangrimberg
Author

Awesome, thanks! For reference, I'm running on AWS, on a single p3.8xlarge EC2 instance (4 x V100 GPUs).

@pghysels
Owner

I can reproduce it on Perlmutter, using setup (1): https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-all-gpus-visible-to-all-tasks
With setup (2) (https://docs.nersc.gov/systems/perlmutter/running-jobs/#1-node-4-tasks-4-gpus-1-gpu-visible-to-each-task) it works fine.

The only difference between setups (1) and (2) is that for (1) STRUMPACK calls cudaSetDevice.

I did also notice that it runs fine with setup (1) when I add export OMP_NUM_THREADS=1.

I will investigate further.

You say it works with SuperLU. Did you set SUPERLU_BIND_MPI_GPU?

@sebastiangrimberg
Author

Awesome, thank you for your help with this, and great that you can reproduce it. No, I did not set SUPERLU_BIND_MPI_GPU when testing with SuperLU_DIST. That variable looks like it controls whether or not SuperLU will call cudaSetDevice during setup, so this is consistent with your finding that STRUMPACK works OK when it does not call cudaSetDevice. It's interesting that OMP_NUM_THREADS also affects the result; I was also running with OMP_NUM_THREADS=1 for my tests. I'll spend some time looking into this as well, so let me know if I can be of help in any way.

@pghysels
Owner

pghysels commented Oct 2, 2023

It works correctly with SuperLU with SUPERLU_BIND_MPI_GPU.

I can't figure out what is wrong. I know that switching devices with cudaSetDevice makes streams etc. created on the previous device unusable on the new one, but cudaSetDevice is the first CUDA call in STRUMPACK.
I thought it might be due to CUDA-aware MPI, so I tried disabling that, but it doesn't make a difference.
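
For reference, a standalone sketch of that failure mode (my own illustration, not STRUMPACK code): a stream created while one device is current cannot be used for a launch after switching to another device.

```cpp
// Standalone sketch (not STRUMPACK code): using a stream created on device 0
// for a kernel launch after switching to device 1 is expected to fail with
// something like "invalid resource handle".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
  int devs = 0;
  cudaGetDeviceCount(&devs);
  if (devs < 2) { std::printf("needs at least 2 GPUs\n"); return 0; }

  cudaSetDevice(0);
  cudaStream_t s;
  cudaStreamCreate(&s);            // stream is associated with device 0

  cudaSetDevice(1);                // switch the current device
  noop<<<1, 1, 0, s>>>();          // launch on device 1 with a device-0 stream
  std::printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
```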

Perhaps you can set CUDA_VISIBLE_DEVICES, but you need to find a way to set that to a different value for different MPI ranks.
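
One possible way to do that from the application itself, assuming the MPI launcher exports a local-rank environment variable (OMPI_COMM_WORLD_LOCAL_RANK for Open MPI, SLURM_LOCALID for Slurm) and that this runs before the first CUDA call:

```cpp
// Sketch only: restrict each rank to one GPU by setting CUDA_VISIBLE_DEVICES
// from the launcher's local-rank variable, before any CUDA call is made.
// Assumes the local rank does not exceed the number of GPUs on the node.
#include <stdlib.h>

void restrict_rank_to_one_gpu() {
  const char* local = getenv("OMPI_COMM_WORLD_LOCAL_RANK");  // Open MPI
  if (!local) local = getenv("SLURM_LOCALID");               // Slurm
  if (local)
    setenv("CUDA_VISIBLE_DEVICES", local, 1);  // this rank now sees a single device (as device 0)
}
```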

@sebastiangrimberg
Author

sebastiangrimberg commented Oct 2, 2023

I wonder if the issue is somehow an interplay between STRUMPACK and SLATE (I noticed SLATE also calls cudaSetDevice internally). I'm not super familiar with how STRUMPACK uses SLATE, but this is definitely a differentiator vs. SuperLU.

@pghysels
Owner

pghysels commented Oct 2, 2023

I also see the issue without linking with SLATE (or MAGMA).
