[DO NOT CLOSE (see RDHPCS Helpdesk response)] Weekly CI Tests Friday Jan 05, 2024 #2202

Closed
wants to merge 1 commit into from

Conversation

emcbot

@emcbot emcbot commented Jan 5, 2024

[DO NOT MERGE] Weekly CI Tests Friday Jan 05, 2024

@emcbot emcbot marked this pull request as draft January 5, 2024 22:46
@emcbot emcbot added CI/CD Issue related to CI/CD CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion labels Jan 5, 2024
@emcbot
Author

emcbot commented Jan 5, 2024

CI Update on Orion at 01/05/24 04:48:08 PM
============================================
Cloning and Building global-workflow PR: 2202
with PID: 66697 on host: Orion-login-1

@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera labels Jan 5, 2024
@emcbot
Author

emcbot commented Jan 5, 2024

CI Update on Hera at 01/05/24 10:48:09 PM
============================================
Cloning and Building global-workflow PR: 2202
with PID: 40636 on host: hfe05

@emcbot emcbot added CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress and removed CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Jan 5, 2024
@emcbot
Author

emcbot commented Jan 5, 2024

Automated global-workflow Testing Results:

Machine: Hera
Start: Fri Jan  5 22:53:41 UTC 2024 on hfe05
---------------------------------------------------
Build: Completed at 01/05/24 11:39:14 PM
Case setup: Completed for experiment C384C192_hybatmda_e54438e5
Case setup: Completed for experiment C384_S2SWA_e54438e5
Case setup: Completed for experiment C384_atm3DVar_e54438e5

@emcbot emcbot added CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Jan 5, 2024
@emcbot
Author

emcbot commented Jan 5, 2024

Automated global-workflow Testing Results:

Machine: Orion
Start: Fri Jan  5 16:50:23 CST 2024 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build: Completed at 01/05/24 05:39:22 PM
Case setup: Completed for experiment C384_atm3DVar_e54438e5
Case setup: Completed for experiment C384C192_hybatmda_e54438e5
Case setup: Completed for experiment C384_S2SWA_e54438e5

@emcbot emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Jan 6, 2024
@emcbot
Author

emcbot commented Jan 6, 2024

Experiment C384_atm3DVar_e54438e5  *** FAILED *** on Orion
Experiment C384_atm3DVar_e54438e5  with 1 tasks failed at 01/05/24 06:40:25 PM
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/RUNTESTS/COMROT/C384_atm3DVar_e54438e5/logs/2023040200/gfsfcst.log

@emcbot emcbot added CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed and removed CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Jan 6, 2024
@emcbot
Author

emcbot commented Jan 6, 2024

Experiment C384C192_hybatmda_e54438e5  *** FAILED *** on Hera
Experiment C384C192_hybatmda_e54438e5  with 1 tasks failed at 01/06/24 12:48:09 AM
Error logs:
/scratch1/NCEPDEV/global/Terry.McGuinness/GFS_CI_ROOT/PR/2202/RUNTESTS/COMROT/C384C192_hybatmda_e54438e5/logs/2023040200/gfsanalcalc.log

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Jan 8, 2024

Hera failed with an MPI layout issue in gfs analcalc, which caused aprun's MPI collective operations to fail under Slurm when running calc_anl.x:

srun -n 127 --verbose --export=ALL /scratch1/NCEPDEV/stmp2/Terry.McGuinness/RUNDIRS/C384C192_hybatmda_e54438e5/analcalc.300847/calcanl_ensres_06/calc_anl.x submitted
srun: Warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 127 with the number of requested nodes 4. Ignoring --ntasks-per-node.

With the subsequent MPI-I/O system errors:

ADIOI_GEN_SetLock:: Cannot send after transport endpoint shutdown ADIOI_GEN_SetLock:offset 67270586, length 524288
Abort(1) on node 53 (rank 53 in comm 0): application called
MPI_Abort(MPI_COMM_WORLD, 1) - process 53
This requires fcntl(2) to be implemented. As of 8/25/2011 it is not.
Generic MPICH Message: File locking failed in ADIOI_GEN_SetLock(fd 15,cmd
F_SETLK64/6,type F_UNLOCK/2,whence 0) with return value FFFFFFFF and errno 6C.
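
For reference, the srun warning quoted earlier in this comment comes down to simple arithmetic: 4 nodes at 40 tasks per node gives 160 slots, while the step asked for 127 tasks. Below is a minimal sketch of that comparison. The helper name `layout_report` is hypothetical, and the equality test is an assumption made for illustration; Slurm's own condition for printing the warning may be different.

```shell
# Hypothetical helper (not part of global-workflow or Slurm):
# compare a requested task count against nodes x tasks-per-node.
layout_report() {
  lr_ntasks=$1
  lr_nodes=$2
  lr_per_node=$3
  lr_capacity=$(( lr_nodes * lr_per_node ))
  if [ "$lr_capacity" -eq "$lr_ntasks" ]; then
    echo "match: ${lr_nodes} x ${lr_per_node} = ${lr_ntasks}"
  else
    echo "mismatch: ${lr_nodes} x ${lr_per_node} = ${lr_capacity}, requested ${lr_ntasks}"
  fi
}

# Values taken from the srun warning in this comment.
layout_report 127 4 40
# prints: mismatch: 4 x 40 = 160, requested 127
```

This only restates why srun ignored --ntasks-per-node; it does not by itself explain the MPI-I/O locking errors.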

@aerorahul This discrepancy has been sent to the RDHPCS Help Desk.
It is currently being processed as ticket number 2024011054000182.

@DavidHuber-NOAA / @WalterKolczynski-NOAA
A quick turnaround from RDHPCS Help Desk analyst Wei Yu recommended using the following MPI hint flags:

export ROMIO_PRINT_HINTS=1
export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
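
One possible way to apply these hints only where they are relevant is to guard them on the filesystem type of the run directory. The sketch below is an assumption about how that could look, not what the workflow does: the function name `apply_lustre_hints` is hypothetical, and the caller is expected to pass in a filesystem type string.

```shell
# Hypothetical helper: export the three hints from the Help Desk response
# only when the given filesystem type is "lustre".
apply_lustre_hints() {
  # $1 is a filesystem type string supplied by the caller.
  if [ "$1" = "lustre" ]; then
    export ROMIO_PRINT_HINTS=1
    export I_MPI_EXTRA_FILESYSTEM=1
    export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
    return 0
  fi
  # Any other type: leave the environment untouched.
  return 1
}
```

With GNU coreutils, the type string could come from `stat -f -c %T "$PWD"`, for example `apply_lustre_hints "$(stat -f -c %T "$PWD")" || echo "not Lustre; hints left unset"`. Whether the reported type reads exactly `lustre` on Hera is an assumption that should be checked on the system.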

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Jan 8, 2024
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

CI Update on Orion at 01/08/24 09:44:07 AM
============================================
Cloning and Building global-workflow PR: 2202
with PID: 98371 on host: Orion-login-1

@TerrenceMcGuinness-NOAA
Collaborator

TerrenceMcGuinness-NOAA commented Jan 8, 2024

The failures on Orion came from the compute nodes being unable to access, at runtime, the staged executables on /stmp that srun needed, even though those files were indeed present when checked from the head nodes:


mterry (Orion-login-3) ~ $ grep "couldn't chdir to" /work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/RUNTESTS/COMROT/C384_atm3DVar_e54438e5/logs/2023040200/gfsfcst.log | tail -4
  80: slurmstepd: error: couldn't chdir to `/work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306': No such file or directory: going to /tmp instead
  80: slurmstepd: error: couldn't chdir to `/work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306': No such file or directory: going to /tmp instead
  80: slurmstepd: error: couldn't chdir to `/work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306': No such file or directory: going to /tmp instead
  80: slurmstepd: error: couldn't chdir to `/work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306': No such file or directory: going to /tmp instead
mterry (Orion-login-3) ~ $ 
mterry (Orion-login-3) ~ $ file /work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306/ufs_model.x
/work/noaa/stmp/mterry/RUNDIRS/C384_atm3DVar_e54438e5/fcst.266306/ufs_model.x: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.32, BuildID[sha1]=ac4e1f1a1d8b6303c3479c0a59513023ab71e11b, not stripped

Purged /stmp and will report these discrepancies to the RDHPCS Help Desk, as we have seen similar anomalies on Hercules. We suspect this may be related to a known Lustre issue with Rocky 9, and this information may be helpful to the system admins.
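
When collecting evidence for a ticket like this, the repeated slurmstepd messages can be reduced to the unique directories the compute nodes could not enter. Below is a minimal sketch, assuming the message format shown in the grep output above; the function name `missing_dirs` is hypothetical.

```shell
# Hypothetical helper: read a job log on stdin and print each directory
# that slurmstepd reported it could not chdir into, one per line, de-duplicated.
missing_dirs() {
  # The message quotes the path as `path', so capture the text between
  # the backtick and the closing single quote.
  sed -n "s/.*couldn't chdir to \`\([^']*\)'.*/\1/p" | sort -u
}
```

Usage would be along the lines of `missing_dirs < gfsfcst.log`. Each path it prints can then be checked from both a login node and a compute node to document the difference.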

@emcbot emcbot added CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

Automated global-workflow Testing Results:

Machine: Orion
Start: Mon Jan  8 09:46:29 CST 2024 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build: Completed at 01/08/24 10:36:41 AM
Case setup: Completed for experiment C384_atm3DVar_e54438e5
Case setup: Completed for experiment C384C192_hybatmda_e54438e5
Case setup: Completed for experiment C384_S2SWA_e54438e5

@emcbot emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

Experiment C384_S2SWA_e54438e5  *** FAILED *** on Orion
Experiment C384_S2SWA_e54438e5  with 1 tasks failed at 01/08/24 10:56:17 AM
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/RUNTESTS/COMROT/C384_S2SWA_e54438e5/logs/2016070100/gfsfcst.log

@DavidHuber-NOAA
Contributor

DavidHuber-NOAA commented Jan 8, 2024

The log file for Orion is reporting the following error:

[Orion-24-70:306707:0:306707] address.c:1052 Assertion `*addr_version == UCP_OBJECT_VERSION_V2' failed: addr version 8

I reported this to the Orion admins. Issue number 2024010854000105.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Jan 8, 2024
@emcbot emcbot added CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion and removed CI-Orion-Ready **CM use only** PR is ready for CI testing on Orion labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

CI Update on Orion at 01/08/24 01:32:40 PM
============================================
Cloning and Building global-workflow PR: 2202
with PID: 164713 on host: Orion-login-1

@emcbot emcbot added CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Building **Bot use only** CI testing is cloning/building on Orion labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

Automated global-workflow Testing Results:

Machine: Orion
Start: Mon Jan  8 13:35:03 CST 2024 on Orion-login-1.HPC.MsState.Edu
---------------------------------------------------
Build: Completed at 01/08/24 02:21:47 PM
Case setup: Completed for experiment C384_atm3DVar_e54438e5
Case setup: Completed for experiment C384C192_hybatmda_e54438e5
Case setup: Completed for experiment C384_S2SWA_e54438e5

@emcbot
Author

emcbot commented Jan 8, 2024

Experiment C384_atm3DVar_e54438e5 SUCCESS on Orion at 01/08/24 05:22:09 PM

@emcbot emcbot added CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Jan 8, 2024
@emcbot
Author

emcbot commented Jan 8, 2024

Experiment C384_atm3DVar_e54438e5 *** SUCCESS *** at 01/08/24 05:22:09 PM
Experiment C384_S2SWA_e54438e5  *** FAILED *** on Orion
Experiment C384_S2SWA_e54438e5  with 1 tasks failed at 01/08/24 05:40:29 PM
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/RUNTESTS/COMROT/C384_S2SWA_e54438e5/logs/2016070100/gfsocnpost_f090-f102.log

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress and removed CI-Orion-Failed **Bot use only** CI testing on Orion for this PR has failed labels Jan 9, 2024
@TerrenceMcGuinness-NOAA
Collaborator

mterry (Orion-login-3) 2016070100 $ tail -3 gfsocnpost_f090-f102.log.0
+ run_reg2grb2.sh[66]: ln -sf /work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/global-workflow/fix/reg2grb2/mask.0p25x0p25.grb2 ./iceocnpost.g2
+ run_reg2grb2.sh[67]: /work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/global-workflow/exec/reg2grb2.x
slurmstepd: error: *** JOB 16197778 ON Orion-01-02 CANCELLED AT 2024-01-08T17:35:21 DUE TO TIME LIMIT ***
mterry (Orion-login-3) 2016070100 $ 

Timed out after making link. Restarted.
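
A failure like this one, where the job hit its wall-clock limit rather than crashing, can be told apart from other failures by looking for the Slurm cancellation message in the log. Below is a minimal sketch, assuming the message wording shown in the tail output above; the function name `hit_time_limit` is hypothetical and not part of the CI scripts.

```shell
# Hypothetical helper: read a job log on stdin and succeed (exit 0)
# if it contains a Slurm time-limit cancellation message.
hit_time_limit() {
  grep -q 'CANCELLED AT .* DUE TO TIME LIMIT'
}
```

For example, `hit_time_limit < gfsocnpost_f090-f102.log.0 && echo "timed out"` would flag this log. Whether a timed-out task is worth rerunning, as was done here, remains a judgment call.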

@emcbot
Author

emcbot commented Jan 9, 2024

Experiment C384C192_hybatmda_e54438e5 SUCCESS on Orion at 01/09/24 10:44:11 AM

@emcbot
Author

emcbot commented Jan 9, 2024

Experiment C384_S2SWA_e54438e5 SUCCESS on Orion at 01/09/24 10:44:15 AM

@emcbot emcbot added CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully and removed CI-Orion-Running **Bot use only** CI testing on Orion for this PR is in-progress labels Jan 9, 2024
@emcbot
Author

emcbot commented Jan 9, 2024

All CI Test Cases Passed on Orion:

Experiment C384_atm3DVar_e54438e5 *** SUCCESS *** at 01/08/24 05:22:09 PM
Experiment C384_S2SWA_e54438e5 *** FAILED *** on Orion
Experiment C384_S2SWA_e54438e5 with 1 tasks failed at 01/08/24 05:40:29 PM
Error logs:
/work2/noaa/stmp/GFS_CI_ROOT/ORION/PR/2202/RUNTESTS/COMROT/C384_S2SWA_e54438e5/logs/2016070100/gfsocnpost_f090-f102.log
Experiment C384C192_hybatmda_e54438e5 *** SUCCESS *** at 01/09/24 10:44:11 AM
Experiment C384_S2SWA_e54438e5 *** SUCCESS *** at 01/09/24 10:44:15 AM

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA changed the title from "[DO NOT MERGE] Weekly CI Tests Friday Jan 05, 2024" to "[DO NOT CLOSE (ticket pending in RDHPCS Helpdesk)] Weekly CI Tests Friday Jan 05, 2024" on Jan 10, 2024
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA changed the title from "[DO NOT CLOSE (ticket pending in RDHPCS Helpdesk)] Weekly CI Tests Friday Jan 05, 2024" to "[DO NOT CLOSE (see RDHPCS Helpdesk response)] Weekly CI Tests Friday Jan 05, 2024" on Jan 10, 2024
Labels
CI/CD Issue related to CI/CD CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed CI-Orion-Passed **Bot use only** CI testing on Orion for this PR has completed successfully

4 participants