Faster radix conversion #499

Open · czurnieden wants to merge 8 commits into develop from faster_radix_conversion

Conversation

czurnieden
Contributor

Implemented the Schönhage method (divide & conquer) for radix conversion. I used normal division because Lars Helmström's method was a bit unstable for larger input (or I was too stupid to implement it, but the values differ from those of the normal division).
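A minimal sketch of the divide & conquer idea in the writing direction, using only public calls as found in libtommath 1.2.x (`mp_to_radix`, `mp_div`, `mp_expt_u32`, `mp_init_multi`); the helper name `dnc_to_radix` is made up here, and the PR's `s_mp_faster_to_radix` sits on the private API and additionally handles sign, cut-off tuning and the error paths properly:

```c
/*
 * Convert a >= 0 into exactly "digits" characters at "out", left-padded with
 * '0' so that both halves of the recursion keep their fixed slot widths.
 * Illustrative sketch only, not the PR's actual implementation.
 */
#include <string.h>
#include "tommath.h"

static mp_err dnc_to_radix(const mp_int *a, int radix, size_t digits, char *out)
{
   mp_err err;

   if (digits <= 32u) {
      /* Small piece: classic conversion, then left-pad to the slot width. */
      char tmp[64];
      size_t len;
      if ((err = mp_to_radix(a, tmp, sizeof(tmp), NULL, radix)) != MP_OKAY) {
         return err;
      }
      len = strlen(tmp);
      if (len > digits) {
         return MP_VAL;            /* caller's digit estimate was off */
      }
      memset(out, '0', digits - len);
      memcpy(out + (digits - len), tmp, len);
      return MP_OKAY;
   }

   {
      /* Split a = q * radix^lo + r and emit q (high half) before r (low half). */
      size_t lo = digits / 2u, hi = digits - lo;
      mp_int base, pow, q, r;
      if ((err = mp_init_multi(&base, &pow, &q, &r, NULL)) != MP_OKAY) {
         return err;
      }
      mp_set_u32(&base, (uint32_t)radix);
      if ((err = mp_expt_u32(&base, (uint32_t)lo, &pow)) != MP_OKAY)   goto LTM_ERR;
      if ((err = mp_div(a, &pow, &q, &r)) != MP_OKAY)                  goto LTM_ERR;
      if ((err = dnc_to_radix(&q, radix, hi, out)) != MP_OKAY)         goto LTM_ERR;
      err = dnc_to_radix(&r, radix, lo, out + hi);
LTM_ERR:
      mp_clear_multi(&base, &pow, &q, &r, NULL);
      return err;
   }
}
```

A top-level caller would get `digits` from `mp_radix_size()` (minus the sign and NUL slots), NUL-terminate the buffer afterwards, and would also cache the radix powers instead of recomputing them at every split.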

Speed enhancement is as expected:
Tested s_mp_faster_to_radix with the loop

for (i = 2; i < 64; i++) {
   for (j = 1; j < 1700; j++) {
      mp_rand(&N, j);
      /* convert N to base i with the old and the new routine */
   }
}
old: 324 sec 119 ms 629 usec 727 nsec
new: 9 sec 149 ms 193 usec 858 nsec

Same loop with base 10 only (i = 10 fixed) and j = 17477 limbs: log(17477 * 60)/log(10) ≈ 6.02, i.e. over a million bits.
old: 62 sec 920 ms 351 usec 67 nsec
new: 0 sec 458 ms 471 usec 969 nsec

Results have been checked against the old function.
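For anyone who wants to reproduce timings of this kind, a stand-alone harness along the following lines works; it is a hypothetical driver assuming a POSIX `clock_gettime()`, not the tooling the numbers above came from:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "tommath.h"

int main(void)
{
   mp_int N;
   struct timespec t0, t1;
   size_t buflen = (size_t)8 << 20;          /* generous output buffer */
   char *buf = malloc(buflen);
   int i, j;

   if ((buf == NULL) || (mp_init(&N) != MP_OKAY)) return EXIT_FAILURE;

   clock_gettime(CLOCK_MONOTONIC, &t0);
   for (i = 2; i < 64; i++) {
      for (j = 1; j < 1700; j++) {
         if (mp_rand(&N, j) != MP_OKAY) return EXIT_FAILURE;
         if (mp_to_radix(&N, buf, buflen, NULL, i) != MP_OKAY) return EXIT_FAILURE;
      }
   }
   clock_gettime(CLOCK_MONOTONIC, &t1);

   printf("%.3f s\n",
          (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9);
   mp_clear(&N);
   free(buf);
   return EXIT_SUCCESS;
}
```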

Looks good enough to me.

I did not benchmark the new s_mp_faster_read_radix, and testing is only done with a small round-about test in test.c.
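For orientation, the reading direction uses the same splitting idea in reverse; a rough sketch against the public API (the helper name is made up, and the real `s_mp_faster_read_radix` differs in the details):

```c
/*
 * out = value of the digit string str[0..len-1] in base "radix".
 * Assumes a plain unsigned digit string (no sign) and that "out" has been
 * mp_init()ed by the caller. Illustrative sketch only.
 */
#include <string.h>
#include "tommath.h"

static mp_err dnc_read_radix(mp_int *out, const char *str, size_t len, int radix)
{
   mp_err err;

   if (len <= 32u) {
      char tmp[33];
      memcpy(tmp, str, len);
      tmp[len] = '\0';
      return mp_read_radix(out, tmp, radix);
   }

   {
      /* out = high * radix^len(low) + low */
      size_t hi = len / 2u, lo = len - hi;
      mp_int low, base, pow;
      if ((err = mp_init_multi(&low, &base, &pow, NULL)) != MP_OKAY) {
         return err;
      }
      if ((err = dnc_read_radix(out, str, hi, radix)) != MP_OKAY)        goto LTM_ERR;
      if ((err = dnc_read_radix(&low, str + hi, lo, radix)) != MP_OKAY)  goto LTM_ERR;
      mp_set_u32(&base, (uint32_t)radix);
      if ((err = mp_expt_u32(&base, (uint32_t)lo, &pow)) != MP_OKAY)     goto LTM_ERR;
      if ((err = mp_mul(out, &pow, out)) != MP_OKAY)                     goto LTM_ERR;
      err = mp_add(out, &low, out);
LTM_ERR:
      mp_clear_multi(&low, &base, &pow, NULL);
      return err;
   }
}
```

A real implementation would precompute the needed radix powers once instead of recomputing them at every split; this sketch keeps it simple.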

Ah, yes: restricting the new method to base 10 only wouldn't have saved a lot. One additional small table and about half a dozen lines of code would have been saved, maybe 150 bytes or so, if at all.

I used some ideas from Lars Helmström and some more from @MasterDuke17 to write this code. Don't blame them for my mistakes, please.

@czurnieden
Contributor Author

Yepp, should have tested read_radix better; labelled it as work in progress now.

I got an `undefined reference to s_mp_radix_exponent_y` because I made it private, as it should be, but it cannot be used in `test.c` any more because of that. It is a small table; I can C&P it into `test.c`, or does somebody here have a better idea?

@czurnieden
Contributor Author

clang-5.0 -I./ -Wall -Wsign-compare -Wextra -Wshadow -fsanitize=undefined -fno-sanitize-recover=all -fno-sanitize=float-divide-by-zero -Wdeclaration-after-statement -Wbad-function-cast -Wcast-align -Wstrict-prototypes -Wpointer-arith -Wsystem-headers -O3 -funroll-loops -fomit-frame-pointer -Wno-typedef-redefinition -Wno-tautological-compare -Wno-builtin-requires-header -m64    demo/test.o demo/shared.o libtommath.a -o test
Run test clang-5.0 -m64 
s_mp_faster_to_radix.c:25:12: runtime error: signed integer overflow: 65536 * 65536 cannot be represented in type 'int'
error 1 while running tests
The command "./testme.sh ${BUILDOPTIONS}" exited with 128.
after_failure.1
0.00s$ cat test_*.log
Digit size 64 Bit 
Size of mp_digit: 8
Size of mp_word: 16
MP_DIGIT_BIT: 60
MP_DEFAULT_DIGIT_COUNT: 32

That cannot happen at that place; the result is carefully chosen not to exceed 2^20, which fits in an int32_t. Great, now I have to convince the compiler to accept that fact, too ;-)
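For the record, the usual cure in such a spot is to do the multiplication in a wider type before narrowing; an illustrative snippet, not the actual line in s_mp_faster_to_radix.c:

```c
#include <stdint.h>

/* Illustrative only: a product like 65536 * 65536 evaluated in "int" is
 * undefined behaviour, so widen one operand first and the multiplication
 * is carried out in 64 bits.                                             */
static int64_t widened_product(int a, int b)
{
   return (int64_t)a * b;
}
```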

@czurnieden
Contributor Author

Oh my, LTM's testing environment is really killing me!
Did I just spend 2 hours to find out that I have to sanitize the output of mtest?

But it's all green now, so: Happy Holidays, y'all!

@czurnieden
Contributor Author

Was able to make Helmström's N-R trick work for base 10. Benchmarked against div-only up to a string length of 2^19 (both together did not allow for more, because with that amount of recursion the 8 MiB stack (quite a standard size for Linux) overflowed; both work for larger input stand-alone).

The N-R trick is about as fast/slow as div stand-alone, maybe a bit slower for small input, but it uses much more heap.
To make N-R work for all of the other bases it needs a table with the individual error corrections, which I can only find empirically, not mathematically (I do not know whether that is actually impossible).
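For context, the step in question is the standard Newton-Raphson reciprocal iteration; a toy version in my own notation (not the branch's code) that refines a scaled reciprocal so that the divisions in the split can be replaced by multiplications:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Toy illustration only: Newton-Raphson refinement of x ~ 2^k / d, where d
 * would be a power of the radix in the real code. Each step roughly doubles
 * the number of correct bits; the per-base correction of the lowest digit(s)
 * afterwards is the part discussed above.
 */
int main(void)
{
   const uint32_t d = 10;        /* divisor */
   const unsigned k = 16;        /* scale: we approximate 2^16 / d */
   uint64_t x = 1u << (k - 4);   /* deliberately rough initial guess */
   int i;

   for (i = 0; i < 5; i++) {
      /* x' = x * (2^(k+1) - d*x) / 2^k */
      x = (x * ((UINT64_C(1) << (k + 1)) - d * x)) >> k;
      printf("step %d: x = %u (target %u)\n", i, (unsigned)x, (1u << k) / d);
   }
   return 0;
}
```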

I don't know whether the large amount of recursion (not only the up to 29 levels deep tree of the D&C itself, but also the recursions from the Karatsuba and Toom-Cook multiplications, and the fast division is implemented recursively, too) is a problem outside my testing. I can try to change the recursions in read_radix to iterations, but I have to check whether that can be done without a stack-on-the-heap, otherwise it gets quite complicated and ugly.

@czurnieden
Contributor Author

Added the N-R method for testing. Add MP_TO_RADIX_USE_NEWTON_RAPHSON to LTM_CFLAGS to switch it on for base 10.

@czurnieden
Contributor Author

Needs a bit of a clean-up but otherwise good to go.

@sjaeckel force-pushed the faster_radix_conversion branch 2 times, most recently from c5ccf14 to e60867d on April 11, 2023 14:31
@sjaeckel
Member

I took the liberty to rebase, add some more changes and force-push the branch.

Could we maybe squash this list of commits and fixups from 2020 and 2021 into 1-3 commits? Or does it make sense to retain the history? Maybe those N-R based tests could be of interest?

@czurnieden
Contributor Author

It's always nice when you come to your bench, fresh coffee in the pot, just to find out that all of the work has been done already! :-)

Thanks a lot!

> Could we maybe squash this list of commits and fixups from 2020 and 2021 into 1-3 commits?

Yes, of course.

> Or does it make sense to retain the history?

Oh, please don't! It was quite a painful experience: large numbers mean long waits for the tests to complete, just to find out that something didn't quite work out at the end, and so on. Nah, dust it with quicklime, bury the remains, and let the horses run over the ground so that nobody can find it.

> Maybe those N-R based tests could be of interest?

Yes, but it either needs a table with the corrections, for which I only have empirical methods to find the values at this point (which is a bit cumbersome, to say the least), or another N-R round. The difference in speed is already negligible, and a second Newton round would ruin that slight speed advantage completely.
But I am still not sure whether I just made a small mistake somewhere which caused that mess, so it might be a good idea to keep that part?

What I vaguely remember having planned was to pluck the Barrett division out as a distinct function. But we can still do that if there is an actual use-case.

@czurnieden
Contributor Author

Removed all of the little commits bragging about doing chores (e.g. correcting typos, etc.).

Backup of old branch is in faster_radix_conversion_full_history, just in case.

I had a bit of a hiccup while merging with local; I hope I have repaired it successfully.

@czurnieden
Contributor Author

@MasterDuke17 You are marked as one of the reviewers. Do you want to add your 2 cents, too? I mean: more eyes, more better! (to steal the term from A.v.E.)

@czurnieden
Contributor Author

The timings here are quite different from the timings on my machine: the smaller the digit, the smaller the ratio, because of the values we use for the cut-offs. I got for MP_16BIT a ratio of about 2:1, for MP_32BIT one of around 3:1, and for MP_64BIT about 4:1. Most of the results of the CI are about the same, except when they are not.
Mmh…
