GPU support via CuArrays.jl #86

Open · kose-y opened this issue Nov 24, 2019 · 12 comments
kose-y commented Nov 24, 2019

Thanks for the nice package. I think it would be great to have GPU support (along with multi-threaded evaluation on the CPU, #82).

Edited by @lostella on Dec 7, 2019:

Note: similar optimizations could be used in the computation of f and its gradient, besides prox.

@kose-y changed the title from "GPU support via Cup" to "GPU support via CuArrays.jl" on Nov 24, 2019
lostella (Member) commented

Hi @kose-y, I’m not familiar with CuArrays, but from what I understand the type of arrays it provides is compatible with AbstractArray, and that should be sufficient for many operators in ProximalOperators to work.

Could you provide a specific example of a case where that doesn’t work as expected?

kose-y (Author) commented Nov 24, 2019

Looping with direct elementwise access (e.g. https://github.com/kul-forbes/ProximalOperators.jl/blob/master/src/functions/normL2.jl#L37) is expected to be very slow on GPUs, and prox_naive would be much more efficient there.

Adding prox_naive! and defining prox and prox! for CuArrays in terms of prox_naive would be a good approach; see the sketch below.
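One possible shape for this (a sketch only; prox!(y, f, x, gamma) and prox_naive(f, x, gamma) follow the package's existing conventions, but the CuArray method itself is hypothetical):

    using CuArrays
    using ProximalOperators

    # Hypothetical method: route CuArray inputs through prox_naive, which
    # uses only whole-array operations and so avoids scalar indexing on the GPU.
    function ProximalOperators.prox!(y::CuArray, f::NormL2, x::CuArray, gamma=1.0)
        p, fy = ProximalOperators.prox_naive(f, x, gamma)
        copyto!(y, p)  # write into the preallocated output
        return fy      # the function value at the prox point
    end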

lostella (Member) commented

Well, in that specific case I think it makes sense to just remove that loop and instead rely on broadcasting. Other cases may be just as simple. Thanks for pointing this out!
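For NormL2, a loop-free version could look like this (a sketch, not the package's actual code): the scalar shrinkage factor is computed once, then applied in a single fused broadcast, which maps to one kernel on GPU arrays.

    using LinearAlgebra

    # Broadcast-based prox for f(x) = lambda * ||x||_2.
    function prox_norml2!(y, lambda, x, gamma)
        scale = max(zero(eltype(x)), 1 - lambda * gamma / norm(x))
        y .= scale .* x          # one pass over the data, no scalar indexing
        return lambda * norm(y)  # the function value at the prox point
    end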

kose-y (Author) commented Nov 25, 2019

Thank you for the comment. I should point out that this is not just one specific case: I quickly found many of them under the functions/ directory. I will try to list them when I have time.

lostella (Member) commented

That would be nice, thanks. You can add the list as items to this issue by editing the original post, so that progress toward solving it is tracked.

lostella (Member) commented Dec 7, 2019

@kose-y I've edited your post, adding a list of proxes that could be improved. I'm not 100% sure about all of them, but I think they're worth a try.

I guess any attempt at solving this issue would need to come with evidence that performance improves with CuArray on the GPU and doesn't degrade with Array on the CPU.

kose-y (Author) commented Dec 7, 2019

@lostella Thank you for the list. I was thinking about separate implementations for the GPU because, with multi-threading support (#82) under consideration, efficient implementations for CPU and GPU would not coincide. CuArrays would be an optional dependency in that case.
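One common way to make CuArrays an optional dependency in 2019-era Julia is Requires.jl; a minimal sketch (the included file name is hypothetical):

    # In the package's top-level module:
    using Requires

    function __init__()
        # The CuArray-specific methods are loaded only if the user also
        # loads CuArrays; the string is CuArrays' registered package UUID.
        @require CuArrays="3a865a2d-5b23-5a0f-bc46-62713ec82fae" begin
            include("gpu/prox.jl")  # hypothetical file with GPU prox! methods
        end
    end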

lostella (Member) commented Dec 7, 2019

That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

mfalt (Collaborator) commented Dec 7, 2019

> That's a good idea, even before any multi-threading is introduced: after a quick benchmark, I observed an increase in time of ~10% for NormL2 and ~270% (!!!) for IndBox, when removing loops in favor of broadcasting on my machine (using the standard Array, on CPU).

How did you write the broadcast? The simple version on the CPU,

y .= max.(f.lb, min.(f.ub, x))

takes for me with scalar lb/ub:

  • broadcast: 7.837 μs
  • loop: 43.034 μs (32 μs with @inbounds @simd)

However, for vector lb/ub I get:

  • broadcast: 75.665 μs
  • loop: 47.566 μs (42 μs with @inbounds @simd)

i.e. slower in the second case. Ran with size 10000.

Edit: Interestingly enough, with Float32 I get:

Scalar:
  • broadcast: 3.708 μs
  • loop: 43.810 μs (1.564 μs with @inbounds @simd)

Vector:
  • broadcast: 122.975 μs
  • loop: 98.015 μs (95.285 μs with @inbounds @simd)
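For reference, a self-contained script along these lines reproduces the comparison with BenchmarkTools (the helper names are mine; vector bounds assumed):

    using BenchmarkTools

    n = 10_000
    x = randn(n); y = similar(x)
    lb = fill(-0.5, n); ub = fill(0.5, n)  # vector bounds

    # Broadcast version: a single fused elementwise expression.
    broadcast_clip!(y, x, lb, ub) = (y .= max.(lb, min.(ub, x)); y)

    # Explicit single-loop version, annotated as in the comments above.
    function loop_clip!(y, x, lb, ub)
        @inbounds @simd for i in eachindex(x)
            y[i] = max(lb[i], min(ub[i], x[i]))
        end
        return y
    end

    @btime broadcast_clip!($y, $x, $lb, $ub)
    @btime loop_clip!($y, $x, $lb, $ub)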

lostella (Member) commented Dec 7, 2019

@mfalt I used exactly the line you mentioned, with vector lb/ub. I tried sizes 10k and 100k, and observed a slowdown similar to yours. In the 100k case:

  • loop: ~100 μs
  • broadcast: ~370 μs

lostella (Member) commented Dec 8, 2019

My takeaway so far: for serial computation, looping through the data just once wins, and should be preferred over broadcasting.

For IndBox with Array bounds (size = 100k), I'm getting no significant difference between single and double precision. My CPU: 2.5 GHz Intel Core i7.

  • Single loop (current implementation):
    • Float32: 103.812 μs (0 allocations: 0 bytes)
    • Float64: 104.059 μs (0 allocations: 0 bytes)
  • Broadcasting min/max: y .= min.(f.ub, max.(f.lb, x))
    • Float32: 376.113 μs (0 allocations: 0 bytes)
    • Float64: 376.062 μs (0 allocations: 0 bytes)
  • Broadcasting clamp: y .= clamp.(x, f.lb, f.ub)
    • Float32: 195.196 μs (0 allocations: 0 bytes)
    • Float64: 195.527 μs (0 allocations: 0 bytes)

With scalar bounds, things are a bit faster in all cases:

  • Single loop:
    • Float32: 84.856 μs (0 allocations: 0 bytes)
    • Float64: 84.885 μs (0 allocations: 0 bytes)
  • Broadcasting min/max:
    • Float32: 286.705 μs (0 allocations: 0 bytes)
    • Float64: 286.718 μs (0 allocations: 0 bytes)
  • Broadcasting clamp:
    • Float32: 121.591 μs (0 allocations: 0 bytes)
    • Float64: 114.609 μs (0 allocations: 0 bytes)
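Given these numbers, one way to get both behaviors (a sketch with hypothetical names, not the package's code) is to keep the single loop as the Array method and let GPU array types hit a generic broadcast fallback:

    # Generic fallback: a single fused broadcast, suitable for CuArray and
    # other GPU array types (assumes vector bounds of matching length).
    box_clamp!(y::AbstractArray, x::AbstractArray, lb, ub) =
        (y .= clamp.(x, lb, ub); y)

    # Specialized method for plain Array: one @inbounds/@simd loop, which
    # the benchmarks above show is faster for serial CPU execution.
    function box_clamp!(y::Array, x::Array, lb::Array, ub::Array)
        @inbounds @simd for i in eachindex(x, y)
            y[i] = clamp(x[i], lb[i], ub[i])
        end
        return y
    end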

JeffFessler commented

It may be helpful to use @. to coalesce the broadcasting operations.
https://docs.julialang.org/en/v1/base/arrays/#Base.Broadcast.@__dot__

For example:

y = sign.(x) .* max.(0, abs.(x) .- gamma .* f.lambda)
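With @., the dots are inserted automatically, and the assignment becomes the in-place y .= form:

    # Equivalent to the line above, written with @. (note: done in place):
    @. y = sign(x) * max(0, abs(x) - gamma * f.lambda)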
