I just tried to submit ~300 jobs to our HPC cluster without specifying run_uid. The default is run_uid="YYYY_MM_DD_HH_mm", and because dozens of jobs started in the same minute, they collided when they all tried to write to the same {run_uid}_generator.json and {run_uid}_inference.json (for my own convenience, I was writing all of my inferred data products to the same directory). I'm not sure whether something should be done about this. Options are:
Add some random salt to the default run_uid. This would make it harder to associate {run_uid}_generator.json files with finished data products after the fact (I'm not sure whether that is a concern).
Make run_uid a required parameter.
Test for the existence of {run_uid}_generator.json before writing and emit a warning such as "trying to write file {run_uid}_generator.json, but that file already exists", so that users can more quickly work out why their jobs crashed.
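For illustration, the first option might look something like the sketch below. The helper name default_run_uid and its salt_bytes parameter are hypothetical and not part of the package; the only thing taken from the current behavior is the minute-resolution timestamp format.

```python
import secrets
from datetime import datetime


def default_run_uid(now=None, salt_bytes=3):
    """Hypothetical helper: minute-resolution timestamp plus a random hex suffix."""
    now = now or datetime.now()
    # Same YYYY_MM_DD_HH_mm layout as the current default.
    stamp = now.strftime("%Y_%m_%d_%H_%M")
    # token_hex(n) returns 2*n hex characters, so 3 bytes gives a 6-character salt.
    return f"{stamp}_{secrets.token_hex(salt_bytes)}"
```

Keeping the timestamp as a prefix means the files still sort by start time, but two jobs launched in the same minute would almost certainly get different names. It does not address the traceability concern: the salt would still need to be recorded alongside the data products.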
Maybe this is just user error on my part, but it took a bit of digging for me to figure out what was going wrong.
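As a rough sketch of the third option, the write could be made to fail immediately with a message that names the file. The function write_run_file and its arguments are hypothetical and only stand in for wherever the package writes these JSON files. Note that this version raises an error instead of emitting a warning, since opening with mode "x" refuses to overwrite an existing file.

```python
import json
from pathlib import Path


def write_run_file(output_dir, run_uid, kind, payload):
    """Hypothetical writer: create {run_uid}_{kind}.json, refusing to overwrite."""
    path = Path(output_dir) / f"{run_uid}_{kind}.json"
    try:
        # Mode "x" creates the file and raises FileExistsError if it already exists.
        with open(path, "x") as f:
            json.dump(payload, f)
    except FileExistsError as err:
        raise FileExistsError(
            f"trying to write file {path}, but that file already exists; "
            "another job may be using the same run_uid"
        ) from err
    return path
```

With something like this, the second job to reach a given {run_uid}_generator.json would stop with an explicit message pointing at run_uid. If a warning is preferred, the except branch could call warnings.warn instead of raising.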