Performance Status #77

Open · tknopp opened this issue Jan 30, 2022 · 4 comments
tknopp (Member) commented Jan 30, 2022

I made several major performance improvements over the last couple of weeks and thought it makes sense to open an issue describing where we stand and what I have found out. This issue can also be used for performance tracking in the future. I estimate that I squeezed out a factor of 3-5, and we are now on the same level as the fastest NFFT package on the CPU (FINUFFT).

  • The most important improvement was achieved by blocking the domain in which the equidistant sampling nodes lie. It was clear that this would enable multi-threading of the adjoint, but it turned out that it also improves the serial NFFT, because it avoids a large number of cache misses.
  • Blocking does increase the precomputation time, but precomputation is still quite fast and, in particular, much faster than the transform itself.
  • Within a block it is important not to touch the non-equidistant sampling nodes, since that would again lead to cache misses. This can be avoided by precalculating the required offsets for each block separately: the cache misses still occur, but they are moved to precomputation time (see the sketch after this list).
  • I removed quite a few type instabilities; the Julia profiler helped me a lot in finding the bottlenecks.
  • The full precomputation is now much slower than using the LUT. This is not too surprising, since our B matrix has quite some structure that cannot easily be exploited when treating it just as a generic sparse matrix. However, we keep it as a backend since it enables GPU computing.
  • I needed to switch Julia multi-threading libraries: Polyester.jl turned out not to play well with FFTW, so I now use FLoops.jl.
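
To illustrate the blocking idea, here is a hypothetical, heavily simplified 1D sketch (not the actual NFFT.jl internals; `blk.start`, `blk.len`, `blk.nodes`, `blk.offsets`, and `window` are illustrative names). Each node is assigned to a block of the oversampled grid, and its grid offset within that block is precomputed once; the adjoint then scatters into a small block-local buffer that stays in cache:

```julia
# Simplified 1D sketch of blocking: blocks are independent up to their 2m halo,
# which is what makes multi-threading of the adjoint possible. The per-block
# node offsets (blk.offsets) are precomputed, so the cache misses caused by
# looking them up happen at precomputation time, not during the transform.
function blocked_adjoint!(g::Vector{ComplexF64}, f::Vector{ComplexF64},
                          blocks, m::Int, window)
    for blk in blocks                                # parallelizable over blocks
        buf = zeros(ComplexF64, blk.len + 2m)        # block-local accumulator
        for (j, off) in zip(blk.nodes, blk.offsets)  # offsets precomputed once
            for l in 1:2m
                buf[off+l] += f[j] * window(j, l)    # cache-friendly scatter
            end
        end
        for i in eachindex(buf)                      # merge buffer into full grid
            g[mod1(blk.start + i - 1, length(g))] += buf[i]
        end
    end
    return g
end
```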

There are certainly some smaller improvements possible. Here are some ideas:

  • Precomputation is not optimized yet.
  • In the trafo/adjoint it might be possible to exploit SIMD instructions explicitly. Simply using the @simd macro did not help; possibly the compiler already vectorizes these loops (see the snippet after this list for one way to check).
  • Right now the block size is hardcoded to 64 in each direction, which is certainly not the best value in all possible situations. We probably need something like an NFFT.MEASURE mode at some point.
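
To check whether the compiler already emits SIMD instructions, one can inspect the generated LLVM IR for the inner loop. A minimal sketch with a toy stand-in for the convolution kernel loop (not the actual NFFT.jl kernel):

```julia
using InteractiveUtils

# Toy stand-in for the convolution inner loop; inspecting the LLVM IR shows
# whether the compiler already emits vector instructions such as "<4 x double>".
function axpy_kernel!(g::Vector{Float64}, w::Vector{Float64}, f::Float64)
    @inbounds @simd for l in eachindex(g, w)
        g[l] += w[l] * f
    end
    return g
end

@code_llvm axpy_kernel!(zeros(16), rand(16), 2.0)  # look for "<N x double>" ops
```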
tknopp (Member, Author) commented Feb 12, 2022

Status Update:

  • I managed to make the FINUFFT benchmarks exclude the precomputation time. It is a complete hack, since I pull the timing out of the FINUFFTPlan: https://github.com/JuliaMath/NFFT.jl/blob/master/benchmark/performance.jl#L69
    This means this number is a lower bound that cannot be reached in practice, since FINUFFT does not allow caching plans.
  • Of course this made FINUFFT a little faster, requiring me to optimize even more.
  • We now implement both LUT and TENSOR precomputation. The latter is basically as fast as one can get (IMHO); see the usage sketch after this list.
  • I moved from a kernel size of (2m+1) to (2m) after great advice from the NFFT3 colleagues. This makes the NFFT slightly less accurate but improves the runtime.
  • I added new “performance vs. accuracy” graphs, see: https://juliamath.github.io/NFFT.jl/dev/performance/
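
For reference, the precomputation strategy is selected when constructing the plan. A hedged sketch (keyword names as I read them from the NFFT.jl docs; double-check against the current API):

```julia
using NFFT

# 2D example: M random nodes in [-1/2, 1/2)^2, transform size 64×64.
M, N = 10_000, (64, 64)
k = rand(2, M) .- 0.5

# Select the precomputation strategy via the plan constructor:
# NFFT.TENSOR is the fast CPU path, NFFT.LUT the lookup table, and
# NFFT.FULL keeps the sparse B matrix (the backend enabling GPU computing).
p = plan_nfft(k, N; precompute = NFFT.TENSOR, m = 4, σ = 2.0)

f    = randn(ComplexF64, N)   # values on the equidistant grid
fHat = p * f                  # forward NFFT: grid -> M nodes
f2   = adjoint(p) * fHat      # adjoint NFFT
```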

The only benchmark where we are slightly slower than FINUFFT is the 3D one. Two remarks:

  • When switching from m=7 to m=8 one can see a sudden performance drop. This is likely because we are using tuples, and there seems to be some magic number around L = 2m = 16 entries where things slow down.
  • In the adjoint we are very slightly slower for m < 8. It is not clear why, since the C++ code in FINUFFT and our Julia code look nearly identical.

tknopp (Member, Author) commented Feb 13, 2022

Status update: I improved toBlock and addBlock quite a bit, which basically closed the gap to FINUFFT for the 3D NFFT. The m ≥ 8 issue above was worked around by switching from a tuple implementation to an array implementation for larger m. Furthermore, a pure real implementation seems to be slightly faster for large m, while small m are faster with the complex tuple implementation. Now we have this ugly heuristic: https://github.com/JuliaMath/NFFT.jl/blob/master/src/multidimensional.jl#L376
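In spirit, the heuristic amounts to something like the following hypothetical simplification (not a verbatim excerpt from the linked code; `win` is an illustrative window function, and the real code uses compile-time parameters rather than a runtime branch):

```julia
# Hypothetical simplification of the dispatch heuristic: small kernels use
# stack-allocated tuples, which the compiler fully unrolls and vectorizes;
# around L = 2m ≥ 16 entries tuples hit a compiler slow path, so larger
# kernels fall back to an ordinary heap-allocated array.
function kernel_weights(win, x::Float64, m::Int)
    if m < 8
        return ntuple(l -> win(x, l), 2m)  # stack allocated, SIMD friendly
    else
        return [win(x, l) for l in 1:2m]   # array avoids the tuple slow path
    end
end
```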

Since we are now faster than or on the same level as FINUFFT, the performance work is basically done.

roflmaostc commented
I'll just hook in here. I noticed large allocations which were caused by the automatic threading settings of NFFT.jl but ultimately came from FFTW, see here. What's your opinion on that? By default you always do multi-threading, right?

tknopp (Member, Author) commented Jun 19, 2022

@roflmaostc: It is a little bit more complicated than that:

  • We automatically thread if, and only if, the user starts Julia with multiple threads. Unfortunately, Julia does not allow setting the number of threads at runtime.
  • Right now we do not change the number of FFTW threads, because I did not want to change global state. If you look at my test script, you can see that I set it manually: https://github.com/JuliaMath/NFFT.jl/blob/master/benchmark/performance_simple.jl. However, given that FFTW.jl#171 (Add get_num_threads) has landed, we can now properly choose the number of threads. In your concrete situation you might want to set the FFTW threads globally to 1 to get better performance.
  • You can disable threading by setting NFFT._use_threads[] = false. But this works only if you do it once, before calling any function, i.e. it is not a runtime option (see the snippet after this list). I am not happy with this situation, and we need to discuss it with the Julia core threading devs.
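
Concretely, for the short-1D case you describe this could look like the following (a sketch; the `plan_nfft` call and node layout are as in the package docs, and `NFFT._use_threads` is the internal flag mentioned above):

```julia
using FFTW, NFFT

# Must happen before the first NFFT call; the flag is read once, not at runtime.
NFFT._use_threads[] = false   # disable NFFT.jl's own multi-threading

# Keep FFTW serial as well: for short 1D transforms the threading overhead
# (and the allocations it triggers) can outweigh any speedup.
FFTW.set_num_threads(1)

k = rand(1, 1_000) .- 0.5     # 1000 nodes in [-1/2, 1/2)
p = plan_nfft(k, (1024,))     # short 1D transform
fHat = p * randn(ComplexF64, 1024)
```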

Independently of all that: I am benchmarking large 1D/2D/3D transforms, and they all benefit from multi-threading. Could the issue here be that your transforms are 1D and short (< 10000)?
