Multi-GPU via threading instead of processes #50

Draft · wants to merge 6 commits into master

Conversation

marius311 (Owner)

WIP towards using one-GPU-per-thread instead of one-GPU-per-process. The big advantages are that you don't need to launch multiple processes (which is slow and at least doubles the startup time), and that memory can be shared between the GPUs via unified memory rather than being serialized and distributed between the different processes (which necessarily passes through CPU memory).

Right now this works:

tmap(collect(devices())) do dev
    device!(dev)
    for i = 1:N
        gradient(ϕ -> norm(LenseFlow(ϕ)*f), ϕ)
    end
end

although MAP and sampling are still WIP.
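For reference, the thread-per-GPU pattern above can also be written with Base's `Threads.@spawn` instead of `tmap` (which comes from a separate package). This is only a rough sketch of the same idea, not part of this PR; `devices()` and `device!` are the standard CUDA.jl calls for enumerating and selecting GPUs, and the per-GPU workload here is a placeholder:

```julia
using CUDA

# Spawn one task per GPU; device!(dev) binds that GPU to whichever
# thread runs the task, so each task's allocations and kernels land
# on its own device.
tasks = map(collect(devices())) do dev
    Threads.@spawn begin
        device!(dev)           # select this GPU on the current thread
        x = CUDA.rand(1024)    # allocated on the selected device
        sum(abs2, x)           # placeholder per-GPU work
    end
end

fetch.(tasks)  # wait for all GPUs and collect the results
```

Note that Julia must be started with multiple threads (e.g. `julia -t auto`) for the tasks to actually run in parallel.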

Right now this needs CUDA 11.2 and my branch of CUDA.jl https://github.com/marius311/CUDA.jl/tree/no_gc_ctx_switch
