GPU support via CuArrays.jl #86

Open · kose-y opened this issue Nov 24, 2019 · 12 comments
kose-y commented Nov 24, 2019

Thanks for the nice package. I think it would be great to have GPU support (along with multi-threaded evaluation on the CPU, #82).

Edited by @lostella on Dec 7, 2019:

Note: similar optimizations could be used in the computation of f and its gradient, besides prox.

@kose-y changed the title from "GPU support via Cup" to "GPU support via CuArrays.jl" on Nov 24, 2019
lostella (Member) commented

Hi @kose-y, I’m not familiar with CuArrays, but from what I understand the type of arrays it provides is compatible with AbstractArray, and that should be sufficient for many operators in ProximalOperators to work.

Could you provide a specific example of a case where that doesn’t work as expected?

kose-y (Author) commented Nov 24, 2019

Looping with direct elementwise access (e.g. https://github.com/kul-forbes/ProximalOperators.jl/blob/master/src/functions/normL2.jl#L37) is expected to be very slow on GPUs, and prox_naive would be much more efficient there.

Adding prox_naive! and defining prox and prox! for CuArrays in terms of prox_naive would be a good approach; see the sketch below.
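One possible shape for this (a sketch only; prox!(y, f, x, gamma) and prox_naive(f, x, gamma) follow the package's existing conventions, but the CuArray method itself is hypothetical):

    using CuArrays
    using ProximalOperators

    # Hypothetical method: route CuArray inputs through prox_naive, which
    # uses only whole-array operations and so avoids scalar indexing on the GPU.
    function ProximalOperators.prox!(y::CuArray, f::NormL2, x::CuArray, gamma=1.0)
        p, fy = ProximalOperators.prox_naive(f, x, gamma)
        copyto!(y, p)  # write into the preallocated output
        return fy      # the function value at the prox point
    end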

lostella (Member) commented

Well, in that specific case I think it makes sense to just remove that loop and instead rely on broadcasting. Other cases may be just as simple. Thanks for pointing this out!
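For NormL2, a loop-free version could look like this (a sketch, not the package's actual code): the scalar shrinkage factor is computed once, then applied in a single fused broadcast, which maps to one kernel on GPU arrays.

    using LinearAlgebra

    # Broadcast-based prox for f(x) = lambda * ||x||_2.
    function prox_norml2!(y, lambda, x, gamma)
        scale = max(zero(eltype(x)), 1 - lambda * gamma / norm(x))
        y .= scale .* x          # one pass over the data, no scalar indexing
        return lambda * norm(y)  # the function value at the prox point
    end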

kose-y (Author) commented Nov 25, 2019

Thank you for the comment. I should point out that this is not just one specific case: I quickly found many of them under the functions/ directory. I will try to list them when I have time.

lostella (Member) commented

That would be nice, thanks. You can add the list as items to this issue by editing the original post, so that progress toward solving it is tracked.

lostella (Member) commented Dec 7, 2019

@kose-y I've edited your post, adding a list of proxes that could be improved. I'm not 100% sure about all of them, but I think they're worth a try.

I guess any attempt at solving this issue would need to come with evidence that performance improves with CuArray on the GPU and doesn't degrade with Array on the CPU.

kose-y (Author) commented Dec 7, 2019

@lostella Thank you for the list. I was thinking about separate implementations for the GPU because, with multi-threading support (#82) under consideration, efficient implementations for CPU and GPU would not coincide. CuArrays would be an optional dependency in that case.
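One common way to make CuArrays an optional dependency in 2019-era Julia is Requires.jl; a minimal sketch (the included file name is hypothetical):

    # In the package's top-level module:
    using Requires

    function __init__()
        # The CuArray-specific methods are loaded only if the user also
        # loads CuArrays; the string is CuArrays' registered package UUID.
        @require CuArrays="3a865a2d-5b23-5a0f-bc46-62713ec82fae" begin
            include("gpu/prox.jl")  # hypothetical file with GPU prox! methods
        end
    end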

lostella (Member) commented Dec 7, 2019

That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

mfalt (Collaborator) commented Dec 7, 2019

> That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

How did you write the broadcast? The simple version on the CPU,

y .= max.(f.lb, min.(f.ub, x))

takes for me with scalar lb/ub:

  • broadcast: 7.837 μs
  • loop: 43.034 μs (32 μs with @inbounds @simd)

However, for vector lb/ub I get:

  • broadcast: 75.665 μs
  • loop: 47.566 μs (42 μs with @inbounds @simd)

i.e. slower in the second case. Ran with size 10000.

Edit: Interestingly enough, with Float32 I get:

Scalar:
  • broadcast: 3.708 μs
  • loop: 43.810 μs (1.564 μs with @inbounds @simd)

Vector:
  • broadcast: 122.975 μs
  • loop: 98.015 μs (95.285 μs with @inbounds @simd)
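For reference, a self-contained script along these lines reproduces the comparison with BenchmarkTools (the helper names are mine; vector bounds assumed):

    using BenchmarkTools

    n = 10_000
    x = randn(n); y = similar(x)
    lb = fill(-0.5, n); ub = fill(0.5, n)  # vector bounds

    # Broadcast version: a single fused elementwise expression.
    broadcast_clip!(y, x, lb, ub) = (y .= max.(lb, min.(ub, x)); y)

    # Explicit single-loop version, annotated as in the comments above.
    function loop_clip!(y, x, lb, ub)
        @inbounds @simd for i in eachindex(x)
            y[i] = max(lb[i], min(ub[i], x[i]))
        end
        return y
    end

    @btime broadcast_clip!($y, $x, $lb, $ub)
    @btime loop_clip!($y, $x, $lb, $ub)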

lostella (Member) commented Dec 7, 2019

@mfalt I used exactly the line you mentioned, with vector lb/ub. I tried sizes 10k and 100k, and observed a slowdown similar to yours. In the 100k case:

  • loop: ~100 μs
  • broadcast: ~370 μs

lostella (Member) commented Dec 8, 2019

My takeaway so far: for serial computation, looping through the data just once wins, and should be preferred over broadcasting.

For IndBox with Array bounds (size = 100k), I'm getting no significant difference between single and double precision. My CPU: 2.5 GHz Intel Core i7.

  • Single loop (current implementation):
    • Float32: 103.812 μs (0 allocations: 0 bytes)
    • Float64: 104.059 μs (0 allocations: 0 bytes)
  • Broadcasting min/max: y .= min.(f.ub, max.(f.lb, x))
    • Float32: 376.113 μs (0 allocations: 0 bytes)
    • Float64: 376.062 μs (0 allocations: 0 bytes)
  • Broadcasting clamp: y .= clamp.(x, f.lb, f.ub)
    • Float32: 195.196 μs (0 allocations: 0 bytes)
    • Float64: 195.527 μs (0 allocations: 0 bytes)

With scalar bounds, things are a bit faster in all cases:

  • Single loop:
    • Float32: 84.856 μs (0 allocations: 0 bytes)
    • Float64: 84.885 μs (0 allocations: 0 bytes)
  • Broadcasting min/max:
    • Float32: 286.705 μs (0 allocations: 0 bytes)
    • Float64: 286.718 μs (0 allocations: 0 bytes)
  • Broadcasting clamp:
    • Float32: 121.591 μs (0 allocations: 0 bytes)
    • Float64: 114.609 μs (0 allocations: 0 bytes)
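Given these numbers, one way to get both behaviors (a sketch with hypothetical names, not the package's code) is to keep the single loop as the Array method and let GPU array types hit a generic broadcast fallback:

    # Generic fallback: a single fused broadcast, suitable for CuArray and
    # other GPU array types (assumes vector bounds of matching length).
    box_clamp!(y::AbstractArray, x::AbstractArray, lb, ub) =
        (y .= clamp.(x, lb, ub); y)

    # Specialized method for plain Array: one @inbounds/@simd loop, which
    # the benchmarks above show is faster for serial CPU execution.
    function box_clamp!(y::Array, x::Array, lb::Array, ub::Array)
        @inbounds @simd for i in eachindex(x, y)
            y[i] = clamp(x[i], lb[i], ub[i])
        end
        return y
    end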

JeffFessler commented

It may be helpful to use @. to coalesce the broadcasting operations.
https://docs.julialang.org/en/v1/base/arrays/#Base.Broadcast.@__dot__

For example:

y = sign.(x) .* max.(0, abs.(x) .- gamma .* f.lambda)
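With @., the dots are inserted automatically, and the assignment becomes the in-place y .= form:

    # Equivalent to the line above, written with @. (note: done in place):
    @. y = sign(x) * max(0, abs(x) - gamma * f.lambda)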
