test_esmf.py currently spawns a separate child process to run collect_artifacts.sh after the build and test phases of each combo it executes. Each of those collect_artifacts.sh scripts runs collect_artifacts.py, which issues Git commands from within the local esmf-test-artifacts clone directory. This causes race conditions between all of those collect_artifacts.py instances.
collect_artifacts.py already contains code that is supposed to act as a lock mechanism and prevent the race condition. The locking implementation is file-based, however, and is not guaranteed to work when there are file-system (FS) issues. In fact, on a Lustre FS it does not work reliably at all!
One solution might be to avoid having multiple collect_artifacts.py instances in the first place. Instead, there could be a single instance that is responsible for processing all of the running build and test jobs. This could be managed by a simple file that lists all of the job ids to wait on. The single collect_artifacts.py instance then just loops over those ids, checks whether any of them is done, and if so handles the collection. With this approach there is virtually no potential for conflict between the collection steps of the different combos. At the same time it should be just as flexible: whatever finishes gets processed as soon as possible, with no serialization of the order, because the single instance keeps looping over all ids and checking which ones are ready for collection. The process finishes once all ids have finished and have been processed.
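A minimal sketch of what such a single-collector loop could look like. This is not the existing collect_artifacts.py code. The names read_job_ids, collect_all, is_done, and collect are hypothetical, the one-id-per-line file format is an assumption, and how "done" is detected (for example by querying the batch scheduler) is deliberately left to a caller-supplied function.

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def read_job_ids(path: Path) -> List[str]:
    """Read job ids from a file, one per line (assumed format); skip blank lines."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def collect_all(
    job_ids: Iterable[str],
    is_done: Callable[[str], bool],   # hypothetical: e.g. asks the scheduler
    collect: Callable[[str], None],   # hypothetical: copies artifacts, runs Git
    poll_interval: float = 30.0,
) -> List[str]:
    """Collect artifacts for every job from one process, in completion order.

    Returns the job ids in the order in which they were collected.
    """
    pending = list(job_ids)
    processed: List[str] = []
    while pending:
        still_running = []
        for job_id in pending:
            if is_done(job_id):
                # Only this process ever touches the Git clone, so no lock is needed.
                collect(job_id)
                processed.append(job_id)
            else:
                still_running.append(job_id)
        pending = still_running
        if pending:
            time.sleep(poll_interval)  # wait before polling the remaining jobs
    return processed
```

Because collect is only ever called from this one loop, the Git operations are serialized by construction rather than by a file lock, while jobs are still handled in whatever order they finish.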