default run_uid behavior not conducive to HPC #64

danielsf · 2021-10-29T21:25:05Z

I just tried to submit ~300 jobs to our HPC cluster without specifying run_uid. The default, run_uid="YYYY_MM_DD_HH_mm", only has minute resolution, so when dozens of jobs started within the same minute they collided: they all tried to write to the same {run_uid}_generator.json and {run_uid}_inference.json (for convenience, I was writing all of my inferred data products to the same directory). I'm not sure whether something should be done about this. Options are:

Add some random salt to the default run_uid. This will make it hard to associate {run_uid}_generator.json files with finished data products after the fact (I'm not sure whether that is a concern).
Make run_uid a required parameter.
Test for the existence of {run_uid}_generator.json and emit a warning, "trying to write file {run_uid}_generator.json, but that file already exists", so that users can more quickly infer why their jobs crashed.
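
To make the first and third options concrete, here is a minimal sketch. It is not taken from the package: `default_run_uid` and `check_output_path` are hypothetical helper names, and the only things borrowed from the behavior described above are the minute-resolution timestamp and the `{run_uid}_generator.json` naming.

```python
import datetime
import pathlib
import uuid
import warnings


def default_run_uid(now=None, salt=True):
    """Hypothetical helper: minute-resolution timestamp, optionally salted.

    With salt=False this mimics the "YYYY_MM_DD_HH_mm" default, so every
    job that starts within the same minute gets the same run_uid.
    """
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y_%m_%d_%H_%M")
    if not salt:
        return stamp
    # Option 1: append a short random suffix so concurrent jobs differ.
    return f"{stamp}_{uuid.uuid4().hex[:8]}"


def check_output_path(output_dir, run_uid, suffix="generator"):
    """Hypothetical helper for option 3: warn if the target file exists."""
    path = pathlib.Path(output_dir) / f"{run_uid}_{suffix}.json"
    if path.exists():
        warnings.warn(
            f"trying to write file {path}, but that file already exists"
        )
    return path
```

Note that an existence check like this is not race-free: two jobs can both pass the check before either one writes. If a hard failure is acceptable, opening the file with mode `"x"` raises `FileExistsError` when the file already exists, which would turn the collision into an explicit error.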

Maybe this is just user error on my part, but it took a bit of digging for me to figure out what was going wrong.
