Ophys Example with sample data on GPU: OOM Error #81

Open
ChrisWiesbrock opened this issue Dec 20, 2021 · 5 comments
Comments

@ChrisWiesbrock

Hello there!

We are trying to run the example tiny ophys training on our computers. It runs perfectly fine when we just use the CPU, but we get the same error on different systems as soon as we run it on a GPU. So far we have tried a 2070 Ti, a 1070, and Google Colab. We use CUDA 11, cuDNN 8.0.4, and TensorFlow 2.4.4. The example tiny ephys training runs fine on these GPUs. Is our hardware too weak, or is there anything else we could try?

Have a nice day and all the best!

Chris

This is the error we get:

ResourceExhaustedError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13456/3701641197.py in
----> 1 training_class.run()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\deepinterpolation\trainor_collection.py in run(self)
243 use_multiprocessing=self.use_multiprocessing,
244 callbacks=self.callbacks_list,
--> 245 initial_epoch=0,
246 )
247 else:

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
1134 workers=workers,
1135 use_multiprocessing=use_multiprocessing,
-> 1136 return_dict=True)
1137 val_logs = {'val_' + name: val for name, val in val_logs.items()}
1138 epoch_logs.update(val_logs)

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\keras\engine\training.py in evaluate(self, x, y, batch_size, verbose, sample_weight, steps, callbacks, max_queue_size, workers, use_multiprocessing, return_dict)
1382 with trace.Trace('test', step_num=step, _r=1):
1383 callbacks.on_test_batch_begin(step)
-> 1384 tmp_logs = self.test_function(iterator)
1385 if data_handler.should_sync:
1386 context.async_wait()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\def_function.py in call(self, *args, **kwds)
826 tracing_count = self.experimental_get_tracing_count()
827 with trace.Trace(self._name) as tm:
--> 828 result = self._call(*args, **kwds)
829 compiler = "xla" if self._experimental_compile else "nonXla"
830 new_tracing_count = self.experimental_get_tracing_count()

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
893 # If we did not create any variables the trace we have is good enough.
894 return self._concrete_stateful_fn._call_flat(
--> 895 filtered_flat_args, self._concrete_stateful_fn.captured_inputs) # pylint: disable=protected-access
896
897 def fn_with_cond(inner_args, inner_kwds, inner_filtered_flat_args):

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
1917 # No tape is watching; skip to running the function.
1918 return self._build_call_outputs(self._inference_function.call(
-> 1919 ctx, args, cancellation_manager=cancellation_manager))
1920 forward_backward = self._select_forward_and_backward_functions(
1921 args,

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
558 inputs=args,
559 attrs=attrs,
--> 560 ctx=ctx)
561 else:
562 outputs = execute.execute_with_cancellation(

H:\Programme\Anaconda\envs\deepinter\lib\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
58 ctx.ensure_initialized()
59 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60 inputs, attrs, num_outputs)
61 except core._NotOkStatusException as e:
62 if name is not None:

ResourceExhaustedError: OOM when allocating tensor with shape[20,256,256,256] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node model/concatenate_2/concat-0-TransposeNHWCToNCHW-LayoutOptimizer}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_test_function_1959]

Function call stack:
test_function

@jeromelecoq
Collaborator

Hi Chris,
I am happy to help with this.
What is your batch size for 2p? This can increase the need for GPU memory quite quickly.
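(For reference, in the DeepInterpolation example scripts the batch size is usually set through the generator parameter dictionary. A minimal sketch, assuming the key names used by the example scripts; they may differ in your local copy:)

```python
# Sketch only: where the batch size typically lives in the example scripts.
# "generator_param" and "batch_size" follow the example scripts and are
# assumptions if your script is structured differently.
generator_param = {}
# ... other generator settings from the example script ...
generator_param["batch_size"] = 1  # smaller batches reduce per-step GPU memory
```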

@ChrisWiesbrock
Author

Hi Jerome,

Sorry for the late response. I hope you had a great holiday.

We reduced the batch size to 1 in order to see if this is the cause of the problem, but we still get the same error message.

We run it in a Jupyter Notebook. Are there any known issues with that?

@jtchang

jtchang commented Jan 26, 2022

I ran into this issue on Windows 10, python 3.7, TF2.4.4 on a GTX 1070.

It looks like the fit function is loading the entire validation set, which is why it gives an OOM error.

Lowering the window and test set size lets me run it on the GPU, but the results are not quite as good.

Edit: I did a little more digging, and the GPU was loading the entire validation set because of the cache_validation call, instead of being passed the generator, which respects the batch size. Setting the "caching_validation" field in training_params to False prevents this from happening.
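For anyone else hitting this, a minimal sketch of the change, assuming the training parameters are built as a plain dict the way the example scripts do (other keys elided):

```python
# Sketch: disable caching of the full validation set so evaluation goes
# through the validation generator (which respects batch_size) instead of
# loading the whole set onto the GPU at once.
training_param = {}
# ... other training settings from the example script ...
training_param["caching_validation"] = False
```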

@ChrisWiesbrock
Author

@jtchang

Thank you so much for your edit! This did the job for me as well.

Now the training is running smoothly in Google Colab without the OOM issue.

@tomomano

@jtchang This saved my day! Thanks!

For those facing the same issue, you want to add this line to your training code:

training_param["caching_validation"] = False
