CI Self Test of develop branch [Do Not Merge] #2085

TerrenceMcGuinness-NOAA
Collaborator

This is a CI self-test on develop. We are seeing failures on Orion during development of CI, and we also need to baseline Hera.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI/CD Issue related to CI/CD labels Nov 21, 2023
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA changed the title from "CI Self Test of Development branch" to "CI Self Test of develop branch" Nov 21, 2023
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera labels Nov 21, 2023
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA changed the title from "CI Self Test of develop branch" to "CI Self Test of develop branch [Do Not Merge]" Nov 21, 2023
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA marked this pull request as draft November 21, 2023 14:24
@emcbot emcbot added CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress and removed CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Nov 21, 2023
@emcbot

emcbot commented Nov 21, 2023

Automated global-workflow Testing Results:

Machine: Hera
Start: Tue Nov 21 14:24:28 UTC 2023 on hfe05
---------------------------------------------------
Checkout: Completed at Tue Nov 21 14:27:07 UTC 2023
Build: Completed at Tue Nov 21 15:13:34 UTC 2023
Case setup: Completed for experiment C48_ATM_2d389c8a
Case setup: Completed for experiment C48_S2SA_gefs_2d389c8a
Case setup: Completed for experiment C48_S2SW_2d389c8a
Case setup: Completed for experiment C96C48_hybatmDA_2d389c8a
Case setup: Completed for experiment C96_atm3DVar_2d389c8a

@emcbot emcbot added CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Nov 21, 2023
@emcbot

emcbot commented Nov 21, 2023

Automated global-workflow Testing Results:

Machine: Orion
Start: Tue Nov 21 08:24:19 CST 2023 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Checkout: Completed at Tue Nov 21 08:25:15 CST 2023
Build: Completed at Tue Nov 21 09:14:44 CST 2023
Case setup: Completed for experiment C48_ATM_2d389c8a
Case setup: Completed for experiment C48_S2SA_gefs_2d389c8a
Case setup: Completed for experiment C48_S2SW_2d389c8a
Case setup: Completed for experiment C96_atm3DVar_2d389c8a
Case setup: Completed for experiment C96C48_hybatmDA_2d389c8a

@emcbot emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Nov 21, 2023
@emcbot

emcbot commented Nov 21, 2023

Experiment C96_atm3DVar_2d389c8a Terminated: *** FAILED *** on Orion
Experiment C96_atm3DVar_2d389c8a Terminated with 1 tasks failed at Tue Nov 21 10:10:33 CST 2023
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/PR/2085/RUNTESTS/COMROT/C96_atm3DVar_2d389c8a/logs/2021122106/gdasanal.log

srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 84 with the number of requested nodes 11. Ignoring --ntasks-per-node.

+ exglobal_atmos_analysis.sh[918]: /bin/cp -p /work2/noaa/stmp/GFS_CI_ROOT/PR/2085/global-workflow/exec/gsi.x /work/noaa/stmp/mterry/RUNDIRS/C96_atm3DVar_2d389c8a/anal.303567
++ exglobal_atmos_analysis.sh[919]: basename /work2/noaa/stmp/GFS_CI_ROOT/PR/2085/global-workflow/exec/gsi.x
+ exglobal_atmos_analysis.sh[919]: srun -l --export=ALL -n 84 --cpus-per-task=5 /work/noaa/stmp/mterry/RUNDIRS/C96_atm3DVar_2d389c8a/anal.303567/gsi.x
srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 84 with the number of requested nodes 11. Ignoring --ntasks-per-node.
16: [Orion-25-42:342733:0:342733]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 14
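
For readers unfamiliar with the `srun` warning in the log above, the arithmetic behind it can be sketched as follows. This is only an illustration: the function name `node_layout` is hypothetical, and the precise rule Slurm applies before emitting the warning is not shown in this thread, so the sketch only checks whether 84 tasks exactly fill 11 nodes at 8 tasks per node.

```python
import math


def node_layout(ntasks: int, ntasks_per_node: int, nodes: int) -> dict:
    """Compare a requested task layout with what the task count implies.

    Hypothetical helper for illustration; it does not reproduce Slurm's
    internal checks.
    """
    # Slots available if every node ran the full per-node task count.
    capacity = nodes * ntasks_per_node
    # Fewest nodes that can hold all tasks at the per-node limit.
    min_nodes = math.ceil(ntasks / ntasks_per_node)
    return {
        "capacity": capacity,
        "min_nodes": min_nodes,
        "unused_slots": capacity - ntasks,
        "exact_fit": capacity == ntasks,
    }


# Values taken from the warning text in the log.
layout = node_layout(ntasks=84, ntasks_per_node=8, nodes=11)
print(layout)
```

With the values from the log, 11 nodes at 8 tasks each provide 88 slots for 84 tasks, so the request is not an exact fit. The warning is a separate matter from the assertion on the last log line.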

@emcbot

emcbot commented Nov 21, 2023

Experiment C48_S2SA_gefs_2d389c8a SUCCESS Tue Nov 21 16:27:09 UTC 2023

@emcbot

emcbot commented Nov 21, 2023

Experiment C48_ATM_2d389c8a SUCCESS Tue Nov 21 18:15:17 UTC 2023

@emcbot

emcbot commented Nov 21, 2023

Experiment C96C48_hybatmDA_2d389c8a SUCCESS Tue Nov 21 19:27:20 UTC 2023

@emcbot

emcbot commented Nov 21, 2023

Experiment C96_atm3DVar_2d389c8a SUCCESS Tue Nov 21 19:30:19 UTC 2023

@WalterKolczynski-NOAA
Contributor

Experiment C96_atm3DVar_2d389c8a Terminated: *** FAILED *** on Orion
Experiment C96_atm3DVar_2d389c8a Terminated with 1 tasks failed at Tue Nov 21 10:10:33 CST 2023
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/PR/2085/RUNTESTS/COMROT/C96_atm3DVar_2d389c8a/logs/2021122106/gdasanal.log

srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 84 with the number of requested nodes 11. Ignoring --ntasks-per-node.

+ exglobal_atmos_analysis.sh[918]: /bin/cp -p /work2/noaa/stmp/GFS_CI_ROOT/PR/2085/global-workflow/exec/gsi.x /work/noaa/stmp/mterry/RUNDIRS/C96_atm3DVar_2d389c8a/anal.303567
++ exglobal_atmos_analysis.sh[919]: basename /work2/noaa/stmp/GFS_CI_ROOT/PR/2085/global-workflow/exec/gsi.x
+ exglobal_atmos_analysis.sh[919]: srun -l --export=ALL -n 84 --cpus-per-task=5 /work/noaa/stmp/mterry/RUNDIRS/C96_atm3DVar_2d389c8a/anal.303567/gsi.x
srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match the requested tasks 84 with the number of requested nodes 11. Ignoring --ntasks-per-node.
16: [Orion-25-42:342733:0:342733]     address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 14

CI misidentified the real problem, which begins on the last line shown here. It may be insufficient memory, but I think we need an expert to look at it. @DavidHuber-NOAA can you take a look and determine whether this is a settings problem, a machine problem, or a GSI problem?

Tests were passing until this started late Friday. Maybe some combination of the PRs that were merged, even though they individually passed?
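
The point about CI reporting the wrong line could be illustrated with a minimal sketch like the one below. It is not the CI system's actual logic: the patterns `BENIGN` and `FATAL` and the function `first_fatal_line` are hypothetical, chosen only to show how a scan might skip the `srun` warnings and land on the assertion line instead.

```python
import re

# Assumption: illustrative patterns only, not the rules used by the real CI scripts.
BENIGN = re.compile(r"^srun: warning:")
FATAL = re.compile(r"Assertion `.*' failed")


def first_fatal_line(log_lines):
    """Return (line_number, text) of the first fatal-looking line, or None."""
    for number, line in enumerate(log_lines, start=1):
        if BENIGN.search(line):
            continue  # known-noisy warnings are not the failure
        if FATAL.search(line):
            return number, line.strip()
    return None


# Lines copied from the gdasanal.log excerpt quoted in this thread.
sample = [
    "srun: warning: can't honor --ntasks-per-node set to 8 which doesn't match "
    "the requested tasks 84 with the number of requested nodes 11. "
    "Ignoring --ntasks-per-node.",
    "+ exglobal_atmos_analysis.sh[919]: srun -l --export=ALL -n 84 --cpus-per-task=5 "
    "/work/noaa/stmp/mterry/RUNDIRS/C96_atm3DVar_2d389c8a/anal.303567/gsi.x",
    "16: [Orion-25-42:342733:0:342733]     address.c:1052 Assertion "
    "`*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 14",
]

result = first_fatal_line(sample)
print(result)
```

On this excerpt the scan skips the warning and the trace line and returns the third line, the failed assertion.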

@emcbot

emcbot commented Nov 21, 2023

Experiment C48_S2SW_2d389c8a SUCCESS Tue Nov 21 20:15:16 UTC 2023

@emcbot emcbot added CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully and removed CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Nov 21, 2023
@emcbot

emcbot commented Nov 21, 2023

All CI Test Cases Passed:

Experiment C48_S2SA_gefs_2d389c8a **SUCCESS** at Tue Nov 21 16:27:09 UTC 2023
Experiment C48_ATM_2d389c8a **SUCCESS** at Tue Nov 21 18:15:17 UTC 2023
Experiment C96C48_hybatmDA_2d389c8a **SUCCESS** at Tue Nov 21 19:27:20 UTC 2023
Experiment C96_atm3DVar_2d389c8a **SUCCESS** at Tue Nov 21 19:30:19 UTC 2023
Experiment C48_S2SW_2d389c8a **SUCCESS** at Tue Nov 21 20:15:16 UTC 2023

@DavidHuber-NOAA
Contributor

@WalterKolczynski-NOAA I am not sure exactly what to make of this error. It's obviously an MPI error, but beyond that I am not sure. I would be happy to test some things out on Orion, though. Is there a way to run this test on Orion directly?

@WalterKolczynski-NOAA
Contributor

@DavidHuber-NOAA This might actually be a different problem from the other issue. This one seems ephemeral (I can't reproduce it); the other (higher-res) seems consistent.

@WalterKolczynski-NOAA WalterKolczynski-NOAA added CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Nov 29, 2023
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Nov 29, 2023
@emcbot

emcbot commented Nov 29, 2023

Automated global-workflow Testing Results:

Machine: Orion
Start: Tue Nov 28 19:13:04 CST 2023 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Checkout: Completed at Tue Nov 28 19:14:05 CST 2023
Build: Completed at Tue Nov 28 20:06:38 CST 2023
Case setup: Completed for experiment C48_ATM_2d389c8a
Case setup: Completed for experiment C48_S2SA_gefs_2d389c8a
Case setup: Completed for experiment C48_S2SW_2d389c8a
Case setup: Completed for experiment C96_atm3DVar_2d389c8a
Case setup: Completed for experiment C96C48_hybatmDA_2d389c8a

@emcbot

emcbot commented Nov 29, 2023

Experiment C48_S2SA_gefs_2d389c8a SUCCESS at Tue Nov 28 21:20:16 CST 2023 on Orion

@emcbot

emcbot commented Nov 29, 2023

Experiment C48_ATM_2d389c8a SUCCESS at Tue Nov 28 21:26:06 CST 2023 on Orion

@emcbot

emcbot commented Nov 29, 2023

Experiment C96C48_hybatmDA_2d389c8a SUCCESS at Tue Nov 28 22:36:17 CST 2023 on Orion

@emcbot

emcbot commented Nov 29, 2023

Experiment C96_atm3DVar_2d389c8a SUCCESS at Tue Nov 28 22:40:21 CST 2023 on Orion

@emcbot

emcbot commented Nov 29, 2023

Experiment C48_S2SW_2d389c8a SUCCESS at Wed Nov 29 00:36:07 CST 2023 on Orion

@emcbot emcbot added CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Nov 29, 2023
@emcbot

emcbot commented Nov 29, 2023

All CI Test Cases Passed:

Experiment C48_S2SA_gefs_2d389c8a *** SUCCESS *** at Tue Nov 28 21:20:16 CST 2023 on Orion
Experiment C48_ATM_2d389c8a *** SUCCESS *** at Tue Nov 28 21:26:06 CST 2023 on Orion
Experiment C96C48_hybatmDA_2d389c8a *** SUCCESS *** at Tue Nov 28 22:36:17 CST 2023 on Orion
Experiment C96_atm3DVar_2d389c8a *** SUCCESS *** at Tue Nov 28 22:40:21 CST 2023 on Orion
Experiment C48_S2SW_2d389c8a *** SUCCESS *** at Wed Nov 29 00:36:07 CST 2023 on Orion

@DavidHuber-NOAA
Contributor

@WalterKolczynski-NOAA Apologies, I am looking through my emails and I don't see the other issue you are referring to. Can you point me to it again? Glad to see that Orion is running again, though.

Labels
CI/CD Issue related to CI/CD CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully