
Examples: add ggml training for MNIST #908

Merged
merged 9 commits into ggerganov:master on Aug 20, 2024

Conversation

JohannesGaessler
Collaborator

This PR aims to add training to the MNIST example. The current state is that it seems to work (with bad performance). My current plans include:

  • Rewrite the MNIST example so that code for e.g. data loading is deduplicated via something like mnist-common.h, and consistently use GGUF files instead of binary files.
  • Add fully connected MNIST training via Python (based on the Google Colab notebook).
  • Add randomized tensor initialization to ggml. I think there is no C-native way to get normally distributed floats, so I will implement a transformation from the uniform distribution (see the sketch after this list).
  • Add some code to ggml that defines datasets as well as functions that perform one epoch of optimization on such a dataset. This would also make it easier to do a train/validation split and to calculate validation loss/accuracy. It could also transform raw data into a structured format usable by ggml.
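
As a concrete illustration of the third point, a minimal sketch of turning uniform samples into normally distributed ones via the Box-Muller transform; the function names and the use of rand() are placeholders, not actual ggml API:

#include <math.h>
#include <stdlib.h>

// Uniform sample in (0, 1]; the +1 offsets avoid passing 0 to logf below.
static float frand(void) {
    return ((float) rand() + 1.0f) / ((float) RAND_MAX + 1.0f);
}

// Box-Muller transform: two independent uniform samples -> one normal sample.
static float frand_normal(float mean, float sigma) {
    const float two_pi = 6.28318530718f;
    const float u1 = frand();
    const float u2 = frand();
    const float z  = sqrtf(-2.0f*logf(u1)) * cosf(two_pi*u2);
    return mean + sigma*z;
}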

Please provide feedback, especially regarding whether the last two points should be part of ggml.

@slaren
Collaborator

slaren commented Aug 1, 2024

  • Add randomized tensor initialization to ggml. I think there is no C-native way to get normally distributed floats, so I will implement a transformation from the uniform distribution.

  • Add some code to ggml that defines datasets as well as functions that perform one epoch of optimization on such a dataset. This would also make it easier to do a train/validation split and to calculate validation loss/accuracy. It could also transform raw data into a structured format usable by ggml.

I imagine that ultimately your goal is to implement GPU support, but none of the functions in ggml.c that access the tensor data are compatible with ggml-backend, since they access tensor->data directly, which is a backend-dependent pointer to device memory. If you want to support GPU acceleration, then this function would need to be able to work with ggml-backend in some way, possibly by calling ggml_backend_tensor_set to modify the tensor data.
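
A minimal sketch of what such backend-compatible initialization could look like, assuming the random values are generated in a host buffer and then copied into the (possibly device-resident) tensor via ggml_backend_tensor_set; the helper name is illustrative, not part of the ggml API:

#include <random>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Fill an F32 tensor with normally distributed values without touching
// tensor->data directly, so that it also works for non-CPU backends.
static void init_tensor_normal(struct ggml_tensor * t, float mean, float sigma) {
    GGML_ASSERT(t->type == GGML_TYPE_F32);
    std::vector<float> host(ggml_nelements(t));
    std::mt19937 rng(1234);
    std::normal_distribution<float> dist(mean, sigma);
    for (float & v : host) {
        v = dist(rng);
    }
    // copies the host data into whatever backend buffer holds the tensor
    ggml_backend_tensor_set(t, host.data(), 0, ggml_nbytes(t));
}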

It might be a good time to move the CPU backend code in ggml.c to a different file and keep only the common graph, tensor, etc. functions in ggml.c, with compatibility with ggml-backend.

I think there is no C-native way to get normally distributed floats, so I will implement a transformation from the uniform distribution.

Alternatively, it might also be a good time to start porting the code in ggml to C++.

@JohannesGaessler
Collaborator Author

Alternatively, it might also be a good time to start porting the code in ggml to C++.

I don't really have strong feelings regarding C vs. C++ in either direction; the transformation of uniformly distributed random numbers to normally distributed numbers is pretty easy though. I don't know whether the C++ standard library has support for SIMD instructions which could be used to speed up the transformation (but I think this will not be relevant for the current use cases of ggml).

@slaren
Collaborator

slaren commented Aug 1, 2024

I don't know whether the C++ standard library has support for SIMD instructions which could be used to speed up the transformation

Probably not, but since this is only done once during initialization, I don't think it needs to be very optimized. Mostly I was thinking about simplifying the implementation.

@JohannesGaessler
Collaborator Author

I refactored the MNIST code for the fully connected model to be mostly concentrated in mnist-common.cpp. I converted the Python Jupyter notebook for training in PyTorch into a Python script. I fixed the ggml training performance issue (10000 Adam iterations instead of a single one per call). With a batch size of 1000, ggml CPU training is ~2x faster than PyTorch CPU training; with a batch size of 100 it's slightly slower (I set the batch size to 1000 because it's faster for both).

The workflow that I envision for the MNIST example is that someone would first run the PyTorch training example (which downloads the MNIST dataset and writes a GGUF model to disk) and then the ggml train and eval examples. There are issues with downloading the MNIST dataset directly from https://yann.lecun.com/exdb/mnist/ ; this way they would be avoided without the need to add the dataset to the ggml repository.

@ggerganov
Owner

Thanks for the update. The implementation looks nice so far.

with a batch size of 100 it's slightly slower

Any guess why that would be?

@JohannesGaessler
Collaborator Author

I'm not sure why the performance with a batch size of 100 is worse; my best guess is that there is for some reason more overhead in the ggml code.

@JohannesGaessler
Collaborator Author

I re-added support for ggml graph export/import. I wrote a function mnist_graph_eval that should be able to read both fully connected and convolutional ggml graph dumps as long as they have well-defined tensors for the inputs and outputs. I made it so that the graph is dumped at the end of the ggml training, but notably this does not work with the master ggml code: the training parameters are treated as nodes rather than leaves due to the presence of gradients, so they aren't being dumped to the ggml file. The tensor flags for input/output/parameter are also not being dumped, which I assume is an oversight. I made it so that the tensor flags are dumped and read and that the tensor data is saved/loaded for parameters.

I also added more information on loss/accuracy.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Aug 15, 2024

I pushed training support for a convolutional neural network. There is now a single binary each for ggml training and evaluation. For the training binary the user specifies mnist-fc or mnist-cnn as the arch; the binary then trains a model and saves both a GGUF model and a ggml graph. The user can then pass either file to the evaluation binary. ggml CNN training performance is quite bad, ~10x slower than TensorFlow (presumably because im2col -> matrix multiplication is inefficient). I scaled down the CNN in order to keep the training sufficiently fast. I also removed dropout from the TensorFlow script since ggml doesn't support it. I changed the data type of all tensors in the CNN to FP32 since that is the only data type for which ggml supports backpropagation.

I implemented backwards passes for im2col and pool_2d to enable ggml training of the CNN. This necessitated new ggml ops GGML_OP_IM2COL_BACK and GGML_OP_POOL_2D_BACK since I don't think it's possible to implement the backwards passes using existing ggml ops. I did not touch backwards passes for these new ops since I don't understand the purpose of having backwards passes for dedicated backwards pass ops.
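
For illustration, a rough sketch of what the backward pass of average pooling has to compute (the gradient w.r.t. each output cell is distributed uniformly over the corresponding k×k input window); this is a simplified standalone version, not the actual GGML_OP_POOL_2D_BACK kernel:

// dst_grad: gradients w.r.t. the pooled output (H/k x W/k),
// src_grad: gradients w.r.t. the pooling input (H x W), accumulated here.
// Assumes stride == kernel size and H, W divisible by k (simplification).
static void avg_pool_2d_back(
        const float * dst_grad, float * src_grad, int H, int W, int k) {
    const float scale = 1.0f / (k*k);
    for (int oy = 0; oy < H/k; ++oy) {
        for (int ox = 0; ox < W/k; ++ox) {
            const float g = dst_grad[oy*(W/k) + ox] * scale;
            for (int dy = 0; dy < k; ++dy) {
                for (int dx = 0; dx < k; ++dx) {
                    src_grad[(oy*k + dy)*W + (ox*k + dx)] += g;
                }
            }
        }
    }
}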

I changed ggml_conv_2d to use the data type of the convolution kernel as the intermediary data type for im2col (this is needed for training and I think makes more sense than just using FP16 unconditionally).

I implemented support for batch sizes > 1 for ggml_pool_2d (just required passing one more value from the input, may have been an oversight).

I added a fix to test-grad0 for discontinuous gradients. Near discontinuities the numerical calculation of the gradient can fail because it assumes that the first-order derivative is approximately constant on the order of eps. I added an optional argument to specify a finite list of expected values that the gradient can assume. If the numerically calculated gradient is not close to any of these values, the comparison to backpropagation is skipped. For example, for ReLU the expected gradient values are 0 and 1. This fixes the issue where test-grad0 randomly fails on master for ops like relu, step, or abs if run long enough.
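
A simplified sketch of the idea behind that change: compare the backpropagated gradient against a finite-difference estimate, but skip the comparison when the numerical estimate is not close to any of the expected values for ops with discontinuous derivatives (e.g. {0, 1} for ReLU). Names and tolerances are illustrative, not the actual test-grad0 code:

#include <cmath>
#include <cstddef>

// Returns true if the check passes or is skipped, false on a real mismatch.
// grad_bp: gradient from backpropagation, grad_num: finite-difference estimate,
// expected/n_expected: optional finite list of values the true gradient can take.
static bool check_gradient(
        float grad_bp, float grad_num,
        const float * expected, size_t n_expected, float tol) {
    if (expected != nullptr) {
        bool close_to_expected = false;
        for (size_t i = 0; i < n_expected; ++i) {
            if (std::fabs(grad_num - expected[i]) <= tol) {
                close_to_expected = true;
                break;
            }
        }
        if (!close_to_expected) {
            // finite differences straddled a discontinuity -> skip this value
            return true;
        }
    }
    return std::fabs(grad_bp - grad_num) <= tol;
}

// e.g. for ReLU: const float expected_relu[] = {0.0f, 1.0f};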

I am close to finishing the minimum features that I had planned for the MNIST example. I think changes to ggml will be fairly minimal from now on. One issue that I still need to resolve is what to do with the Metal files for MNIST. The README doesn't say anything about them but I would assume that they are to demonstrate Metal inference. I think it would make more sense to handle this via the ggml backends interface (which I plan to do in the future for CUDA).

@ggerganov
Owner

I am close to finishing the minimum features that I had planned for the MNIST example. I think changes to ggml will be fairly minimal from now on. One issue that I still need to resolve is what to do with the Metal files for MNIST. The README doesn't say anything about them but I would assume that they are to demonstrate Metal inference. I think it would make more sense to handle this via the ggml backends interface (which I plan to do in the future for CUDA).

Let's remove the Metal files. These were needed at the start as a PoC and demonstration, but currently they don't serve any purpose.

@JohannesGaessler JohannesGaessler marked this pull request as ready for review August 16, 2024 18:54
@JohannesGaessler
Collaborator Author

I would like help with the web demo. For me it's already broken on master: it needs an additional source file to compile, and even then the demo is not functional when I run it locally. With this PR I have the same issue, so I can't check whether I have broken anything.

I adapted the README to the new code. I did not add images of the graphs because they will be subject to change once there is broadcasting support for the backwards pass of ggml_add.

I did not add the MNIST dataset or any pretrained models to the ggml repository. They could in principle be added, but if these files are iterated upon, each iteration would increase the repository size. I don't have a strong opinion either way.

Other than the things mentioned, I think this PR is in a state where it can be merged (after a rebase).

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Aug 16, 2024

@jboero now would be a good time to help with code review if you're interested.

@rgerganov
Collaborator

I would like help with the web demo. For me it's already broken on master: it needs an additional source to compile and even then the demo is not functional when I run it locally. With this PR I have the same issue so I can't check whether I have broken anything.

Make sure that you have the ggml-model-f32.bin model file in models/mnist. The current Python script for producing this model (convert-h5-to-ggml.py) pulls in a ton of dependencies and needs to be updated to use GGUF.

Install Emscripten by following their instructions and then:

$ source $EMSDK_PATH/emsdk_env.sh
$ emcc -I../../include -I../../include/ggml -I../../examples ../../src/ggml.c ../../src/ggml-quants.c ../../src/ggml-aarch64.c main.cpp -o web/mnist.js -s EXPORTED_FUNCTIONS='["_wasm_eval","_wasm_random_digit","_malloc","_free"]' -s EXPORTED_RUNTIME_METHODS='["ccall"]' -s ALLOW_MEMORY_GROWTH=1 --preload-file models/mnist

(I have no idea why we need ggml-aarch64.c here; this is the file missing from the current README.) Finally, start an HTTP server in the web dir:

cd web
python3 -m http.server

If you still have problems, check the JS console for errors.

@jboero

jboero commented Aug 17, 2024

@jboero now would be a good time to help with code review if you're interested.

Yes, I'll test this out in the airport today. I see the comment on performance. That's been my experience too, though purely CPU training would always be slow.

@JohannesGaessler
Collaborator Author

@rgerganov thank you, I was able to get the web code to work. The problem was that I was trying to run it with emrun and that for whatever reason didn't work. With the Python HTTP server I was able to get it working.

@JohannesGaessler
Collaborator Author

I should maybe also mention that the MNIST example on master (and the web demo in this PR) is not 100% correct. The PyTorch code implicitly normalizes the MNIST data to the interval $[0, 1]$ but the inputs used by the ggml code are in the interval $[0, 255]$. Because the network only uses dense layers and ReLU this does not matter for the order of the logits/probabilities, but the loss/confidence is wrong. With the code in this PR I get identical results between ggml and PyTorch/TensorFlow.
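
For reference, the missing normalization is just a rescaling of the raw pixel bytes before they are fed to the network; a sketch with illustrative buffer names:

#include <cstdint>

// MNIST pixels are stored as bytes in [0, 255]; the PyTorch code implicitly
// rescales them to [0, 1], so the ggml inputs need the same scaling.
static void normalize_mnist_image(const uint8_t * pixels_u8, float * input_f32) {
    for (int i = 0; i < 28*28; ++i) {
        input_f32[i] = pixels_u8[i] / 255.0f;
    }
}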

@ggerganov
Owner

Looks like the CI is failing because of the removed Metal sources.

@jboero

jboero commented Aug 17, 2024

Is it just me or is the MNIST download site in docs giving 403?
https://yann.lecun.com/exdb/mnist/


@JohannesGaessler
Collaborator Author

JohannesGaessler commented Aug 17, 2024

For the direct download I had the same issue. Download via PyTorch worked.

@jboero

jboero commented Aug 17, 2024

OK, I see it attempts a backup link and works. I'm having build issues though - maybe my env is too new? Fedora 40, gcc 14.
Actually, this is a different error than before: a missing mnist_model_eval for some reason. Will dig more.

[ 97%] Linking CXX executable ../../bin/magika
[ 98%] Linking CXX executable ../../bin/mnist-train
[100%] Linking CXX executable ../../bin/mnist-eval
[100%] Built target gpt-2-ctx
[100%] Built target gpt-2-backend
[100%] Built target gpt-2-alloc
[100%] Built target gpt-2-sched
[100%] Built target gpt-j-quantize
[100%] Built target gpt-2-quantize
/usr/bin/ld: CMakeFiles/mnist-eval.dir/mnist-eval.cpp.o: in function `main':
mnist-eval.cpp:(.text.startup+0x23a): undefined reference to `mnist_model_eval(mnist_model const&, float const*, float const*, int)'
[100%] Built target gpt-2-batched
[100%] Built target magika
collect2: error: ld returned 1 exit status
make[2]: *** [examples/mnist/CMakeFiles/mnist-eval.dir/build.make:100: bin/mnist-eval] Error 1
make[1]: *** [CMakeFiles/Makefile2:1308: examples/mnist/CMakeFiles/mnist-eval.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
[100%] Built target gpt-j
[100%] Built target mnist-train
make: *** [Makefile:146: all] Error 2
jboero@xps ~/c/g/build (mnist-train) [2]> gcc --version
gcc (GCC) 14.2.1 20240801 (Red Hat 14.2.1-1)
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@JohannesGaessler
Collaborator Author

I had at some point accidentally pushed a version that did not compile. Make sure you use the latest commit I force-pushed.

@jboero

jboero commented Aug 17, 2024

Right, much better. I can see it's slow and single-threaded but it works. Also I can crank up the CPU speed live because it's just one thread and it doesn't overheat me too much. (No CUDA on this ultrabook, but I'll retry with CUDA next week.)

Looks good to me but I will review line by line during my flight.

Owner

@ggerganov ggerganov left a comment

ggml CNN training performance is quite bad, ~10x slower than TensorFlow (presumably because im2col -> matrix multiplication is inefficient)

The pool and repeat_back ops could benefit from multi-threading; that should bring a factor of ~2x to the ggml CPU performance. The rest of the performance gap is indeed likely due to the inefficient convolution implementation.
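
For context, a generic sketch of how ggml CPU ops are typically parallelized: the rows of the destination tensor are split across threads based on the thread index (ith) and thread count (nth). This is a simplified illustration, not the actual pool or repeat_back code:

#include <stdint.h>

// Each thread processes a contiguous slice of rows of the destination tensor.
// ith: index of this thread, nth: total number of threads.
static void op_rows_parallel(float * dst, const float * src,
                             int64_t nrows, int64_t row_size, int ith, int nth) {
    const int64_t dr  = (nrows + nth - 1)/nth;                // rows per thread
    const int64_t ir0 = dr*ith;                               // first row for this thread
    const int64_t ir1 = ir0 + dr < nrows ? ir0 + dr : nrows;  // one past the last row
    for (int64_t ir = ir0; ir < ir1; ++ir) {
        for (int64_t i = 0; i < row_size; ++i) {
            dst[ir*row_size + i] = src[ir*row_size + i]; // op-specific work goes here
        }
    }
}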

Feel free to merge - I sent you a collab invite.

@jboero

jboero commented Aug 20, 2024

Yes, I agree the code looks fine, just slow and not very parallelized. Anyway, it's a great example. Thanks @JohannesGaessler

@JohannesGaessler JohannesGaessler merged commit bada316 into ggerganov:master Aug 20, 2024
4 checks passed
@JohannesGaessler
Collaborator Author

I just realized that I pressed the wrong button when merging. I only rebased my commits but did not squash them. Notably, this means that there are now several commits on master that will not compile due to another rebase that I had done. There have not been any other commits to master since, so if we're going to fix this we should do it now.

@ggerganov
Owner

Can you try to force push on master? I'm not at a PC and probably will be on one tomorrow.
