Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: dockerfile build and CI error reporting #494

Merged
merged 9 commits into from
Aug 9, 2024

Conversation

scrasmussen
Copy link
Member

ISSUE FIXED: Docker Github Actions CI was failing

DESCRIPTION OF CHANGES:

  • Installing OpenMPI and python3-netcdf4 packages to meet MPI and NetCDF requirements. For local development cmake-curses-gui was needed, also added non-GUI Emacs for development.
  • Docker prefers apt-get to install packages but f90nml wasn't available for that, needed to use pip with --break-system-packages. No need for a virtual environment now.
  • Cleaned up Docker warnings by switching to LABEL and ENV foo=bar

@grantfirl
Copy link
Collaborator

@scrasmussen @mkavulich @dustinswales This fixes creating the container, although most of the tests are actually failing in the container, so we probably need to debug that within the container.

@scrasmussen scrasmussen marked this pull request as draft July 25, 2024 16:22
@scrasmussen scrasmussen force-pushed the bugfix/dockerfile-build branch 12 times, most recently from 4a6f0ec to ffb73cb Compare July 29, 2024 17:23
@scrasmussen scrasmussen marked this pull request as ready for review July 29, 2024 17:42
@scrasmussen
Copy link
Member Author

scrasmussen commented Jul 29, 2024

This PR is now ready to be reviewed.

DESCRIPTION OF SECOND SET OF CHANGES:

  • Make sure Docker CI will pull the correct PR and build it, not just make the main branch
  • Converted stop statements to error stop statements so Python script can detect errors. Adding integers after like stop 1 would have also worked. Note, because of this some previously passing CI tests that were failing now accurately report failure on the Github CI
  • Added error summary when a batch of test runs are completed. Current Docker CI run produces the following output:
image
  • A line from the summary can be copied and searched in the output to directly go to the location of the terminal output for more detail on the failure.
image
  • Make sure terminal output on an error is logged so it gets printed to the CI output. Makes debugging easier.

@scrasmussen scrasmussen changed the title bugfix: dockerfile build bugfix: dockerfile build and CI error reporting Jul 29, 2024
@grantfirl
Copy link
Collaborator

@scrasmussen Were all of the whitespace changes intentional? The reason that I ask is that the GFS_typedefs.F90 and CCPP_typedefs.F90 files are often compared with those found in fv3atm, and I'd like to not introduce more trivial differences between the two repos. If it's important, it's fine, but we should introduce the same whitespace changes to the fv3atm repo sometime to sync them up.

&& cd bin \
&& cmake ../src \
&& make -j4
&& ./contrib/get_aerosol_climo.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't download this new aerosol data by default: it balloons the size of the container from 3.3 GB to a whopping 15.5 GB. We can update the instructions so users can download this data interactively if need be.

Actually now that I think about it, maybe it's best to have all the contrib/ data downloaded separately from the container, using a bind mount to get that data into the container. That would reduce the whole container down to 1.9 GB. Do you think that makes sense?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was setting up the Docker file to download the repo, build it, and prepare everything to run. We could certainly slim down the Docker file to not download the contrib/ data. Happy to set it up however we want it!

@grantfirl
Copy link
Collaborator

@scrasmussen Do you know if there are any new CI failures actually introduced by this PR or that they are simply being reported more accurately?

@grantfirl
Copy link
Collaborator

Can anyone else see why the docker CI task is still failing? I'm not seeing the actual error - reporting stops at 75% of some task.

@mkavulich
Copy link
Collaborator

@grantfirl @scrasmussen It's only an educated guess but I bet the test is failing because of the enormous size of the new aerosol data, hitting some limit either in the Github runner or the Docker setup therein. Another reason to have that data downloaded separately.

@scrasmussen
Copy link
Member Author

@grantfirl @mkavulich

So the docker fails [35/50] and the run_scm_rts fails [14/50] so the docker does have additional failures.
I thought that may be because of the lack of aerosol data, but I think that doesn't matter much because the docker image runs out of memory if it tries to download the aerosol data, so I'll remove that commit

  • Run SCM RTS Report
image

@scrasmussen
Copy link
Member Author

@scrasmussen Were all of the whitespace changes intentional? The reason that I ask is that the GFS_typedefs.F90 and CCPP_typedefs.F90 files are often compared with those found in fv3atm, and I'd like to not introduce more trivial differences between the two repos. If it's important, it's fine, but we should introduce the same whitespace changes to the fv3atm repo sometime to sync them up.

I guess intentional in that I my editor is setup to remove all trailing whitespace whenever I save. But it is not intentional in that is important in the grand scheme of things. It follows most style-guides but I'm happy to undo those changes for GFS_typedefs.F90 and CCPP_typedefs.F90 because it causes more issues!

@scrasmussen
Copy link
Member Author

@grantfirl @mkavulich

So the docker fails [35/50] and the run_scm_rts fails [14/50] so the docker does have additional failures. I thought that may be because of the lack of aerosol data, but I think that doesn't matter much because the docker image runs out of memory if it tries to download the aerosol data, so I'll remove that commit

* Run SCM RTS Report
image

One question about the error report, should I remove the status column?

In it the error status is reported, which isn't that useful because a user really needs to copy the line to search for the matching string in the output, then read through the output to where it fails.

The first time the string is printed it is printed as.

| arm_sgp_summer_1997_A | SCM_HRRR_gf           | input_HRRR_gf.nml           |

It doesn't have the error code because of course at this point the error hasn't been detected. So maybe I should remove the error column?

And then after the whole report add a line saying something like "copy a test combination line and search for line to see individual run"?

…mands into multiple RUN statements for quicker debugging by making the failing step easier to find

Installing ccmake in cmake-curses-gui fixes some CMake issues that were occurring, unsure why. Changing venv install location, old one wasn't working. Python packages needed to be updated. Switching to parallel builds. Make sure python package f90nml can be found in paths
…hub actions flag to run_scm.py so it will stop on first error
…ases exists

Continue docker run so report of all failed tests gets produced.
@mkavulich
Copy link
Collaborator

@scrasmussen I don't see any reason to change how the output appears right now. There's no labels on the first info print anyway so all the user sees is casename | suitename | namelistname |, it may not be super pretty but our standard output isn't all that pretty to begin with. I appreciate having the error column as a quick reference at the end, to see if multiple cases are failing for the same reason at a glance for example.

One thing I would change is to print a summary of all test results, instead of just failures. Or maybe include this as an option? This isn't necessary though, just preference.

Copy link
Collaborator

@mkavulich mkavulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scrasmussen @grantfirl based on today's discussion at the CCPP meeting I think it's good to go ahead and merge this change to fix the Docker problems, the remaining issues will be taken care of in other PRs.

@grantfirl
Copy link
Collaborator

@scrasmussen @mkavulich What's the story with 1a54e8d? Is the change to physics really needed? If so, we need a PR into main for ccpp-physics that needs to get merged first.

@scrasmussen
Copy link
Member Author

@scrasmussen @mkavulich What's the story with 1a54e8d? Is the change to physics really needed? If so, we need a PR into main for ccpp-physics that needs to get merged first.

This leads to a more accurate error message. If the return statement on the aerosol data isn't there, then it will still fail but later in the subroutine. But the reason for failure isn't clear unless the user dives into the code, this change clearly reports the reason for the error. I just created PR#1081 in the CCPP-Physics

@scrasmussen
Copy link
Member Author

One thing I would change is to print a summary of all test results, instead of just failures. Or maybe include this as an option? This isn't necessary though, just preference.

@mkavulich Good idea! I'll add this right now

@grantfirl
Copy link
Collaborator

@scrasmussen I merged NCAR/ccpp-physics#1081 into main. Could you please update the ccpp/physics submodule pointer to point to the top of main? As soon as you finish @mkavulich 's request, I agree that we should just merge this in.

@scrasmussen
Copy link
Member Author

@scrasmussen I merged NCAR/ccpp-physics#1081 into main. Could you please update the ccpp/physics submodule pointer to point to the top of main? As soon as you finish @mkavulich 's request, I agree that we should just merge this in.

@grantfirl those two things should be good to go (maybe just wait for the CI to finish to make sure things look correct?), if I missed anything or you need anything else please let me know!

@grantfirl
Copy link
Collaborator

@scrasmussen Were all of the whitespace changes intentional? The reason that I ask is that the GFS_typedefs.F90 and CCPP_typedefs.F90 files are often compared with those found in fv3atm, and I'd like to not introduce more trivial differences between the two repos. If it's important, it's fine, but we should introduce the same whitespace changes to the fv3atm repo sometime to sync them up.

I guess intentional in that I my editor is setup to remove all trailing whitespace whenever I save. But it is not intentional in that is important in the grand scheme of things. It follows most style-guides but I'm happy to undo those changes for GFS_typedefs.F90 and CCPP_typedefs.F90 because it causes more issues!

In the past, all of us have agreed to turn off automatic white space changes in code editors because it makes reviewing harder. We should merge this due to the release, but for future PRs, I think that we don't want to include a bunch of whitespace changes.

@grantfirl grantfirl merged commit 93eaf52 into NCAR:main Aug 9, 2024
18 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants