
feat: add mixed-precision and agc to gradaccum optimizer #131

Open · wants to merge 30 commits into base: main
Conversation

@dPys commented Jan 21, 2024

  • Optimizer now supports mixed precision with loss scaling (a rough sketch of the pattern follows this list)
  • Purges eager ops from the optimizer in favor of pure TF and @tf.function, supports static graph mode, and reduces excessive graph retracing
  • Cleans up docstrings, simplifies conditional branching where possible, abstracts repetitive code into helper methods, and adds missing properties expected by the API in some TF versions (e.g. lr)
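For readers unfamiliar with the pattern, here is a rough, self-contained sketch of how loss scaling and gradient accumulation can be combined inside a @tf.function train step. It uses only stock TensorFlow APIs; the model, data, and accumulation count are placeholders, and this is not the package's actual optimizer implementation.

```python
# Hedged sketch only: mixed-precision loss scaling plus gradient accumulation
# in graph mode. Everything here is illustrative, not this package's code.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1, dtype="float32"),  # keep outputs in float32
])
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.SGD(learning_rate=1e-2)
)
accum_steps = 4

# One float32 accumulator per trainable variable.
accumulated = [tf.Variable(tf.zeros_like(v), trainable=False)
               for v in model.trainable_variables]

@tf.function  # static graph; no eager ops inside
def accumulate_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
        scaled_loss = optimizer.get_scaled_loss(loss)  # avoid fp16 underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    for acc, grad in zip(accumulated, grads):
        acc.assign_add(grad / accum_steps)  # running average of gradients
    return loss

@tf.function
def apply_accumulated():
    grads = [acc.read_value() for acc in accumulated]
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    for acc in accumulated:
        acc.assign(tf.zeros_like(acc))  # reset for the next accumulation window

# Accumulate over `accum_steps` mini-batches, then apply once.
data = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((32, 8)), tf.random.normal((32, 1)))
).batch(8)
for i, (x, y) in enumerate(data):
    accumulate_step(x, y)
    if (i + 1) % accum_steps == 0:
        apply_accumulated()
```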

@andreped self-requested a review on January 22, 2024
@andreped (Owner) left a comment

@dPys Great job! Looks very interesting!

I have enabled CI runs for this PR. Note that linting fails straight away; please fix that, then run the tests to see whether the proposed modifications are compatible with the unit/integration tests.

@andreped (Owner) commented

@dPys Do you have time to address my concerns, so that the tests can be run? I cannot add new features/changes/PRs before the tests are passing.

@dPys (Author) commented Jan 29, 2024

Hi @andreped, yes -- I'm currently very sick / out of commission, but I should be able to get to this in the coming days. Stay tuned!

Review threads (now outdated and resolved) on: setup.py, gradient_accumulator/accumulators.py, gradient_accumulator/agc.py (two threads)
@andreped self-requested a review on February 5, 2024
@andreped (Owner) left a comment

I see that tests seem to be failing:
https://github.com/andreped/GradientAccumulator/actions/runs/7779143998/job/21215129854?pr=131

You should run these locally first and see if you can resolve them there. The easiest way to run the tests is with pytest; see:
https://github.com/andreped/GradientAccumulator/blob/main/.github/workflows/test.yml#L92
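For convenience, a programmatic equivalent of running the suite locally might look like the following; the tests/ path is an assumption based on the repository layout, and this is just pytest's standard entry point, not a project-specific script.

```python
# Equivalent to running `pytest -v tests` from the repository root, assuming
# pytest and the package's test dependencies are already installed.
import sys

import pytest

sys.exit(pytest.main(["-v", "tests"]))
```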

@dPys requested a review from andreped on February 6, 2024
@dPys (Author) commented Feb 6, 2024

Tests should now be passing (or at least very close) :-)

@andreped (Owner) commented Feb 7, 2024

I recommend running the tests listed here locally and verifying that they pass:

A good idea would be to build the wheel in the original virtual env you are working in locally, then install it in a new virtual env before running all the tests. That's a nice way of verifying that the wheel works as it should.
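As an illustration of that workflow, a rough sketch is shown below; the dist/ output folder, the wheel filename pattern, and the fresh-env directory are placeholders rather than project conventions.

```python
# Rough sketch of the build-then-test-in-a-clean-env workflow described above.
# Paths, the wheel name pattern, and the env location are illustrative only.
import glob
import subprocess
import sys
import venv

# 1. Build the wheel from the working checkout into dist/.
subprocess.run([sys.executable, "-m", "pip", "wheel", ".", "--no-deps", "-w", "dist"],
               check=True)
wheel = sorted(glob.glob("dist/gradient_accumulator-*.whl"))[-1]

# 2. Create a clean virtual environment.
venv.EnvBuilder(with_pip=True).create("fresh-env")
fresh_python = "fresh-env/bin/python"  # fresh-env\Scripts\python.exe on Windows

# 3. Install the freshly built wheel plus pytest, then run the tests there.
subprocess.run([fresh_python, "-m", "pip", "install", wheel, "pytest"], check=True)
subprocess.run([fresh_python, "-m", "pytest", "-v", "tests"], check=True)
```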

I have enabled CIs to run now without requiring that you have had a PR merged previously, so running new tests should be possible for future commits in this PR.


EDIT: I also see that you have modified the tests/test_mixed_precision.py test, which I guess you used when debugging, but it should not be altered in the main branch. So feel free to revert that.


codecov bot commented Feb 7, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (2eb285e) 97.58% compared to head (58e42e8) 98.00%.
Report is 2 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| gradient_accumulator/accumulators.py | 98.09% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #131      +/-   ##
==========================================
+ Coverage   97.58%   98.00%   +0.41%     
==========================================
  Files           5        5              
  Lines         248      350     +102     
==========================================
+ Hits          242      343     +101     
- Misses          6        7       +1     

☔ View full report in Codecov by Sentry.

@dPys (Author) commented Feb 7, 2024

> I recommend running the tests listed here locally and verifying that they pass:
>
> A good idea would be to build the wheel in the original virtual env you are working in locally, then install it in a new virtual env before running all the tests. That's a nice way of verifying that the wheel works as it should.
>
> I have enabled CIs to run now without requiring that you have had a PR merged previously, so running new tests should be possible for future commits in this PR.
>
> EDIT: I also see that you have modified the tests/test_mixed_precision.py test, which I guess you used when debugging, but it should not be altered in the main branch. So feel free to revert that.

Whoops, I had accidentally pushed the wrong commit earlier to trigger the build; that has now been corrected. As for the tests/test_mixed_precision.py change, this was actually intentional, in order to cover the mixed-precision case for the optimizer, which was not covered before.
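For context, a mixed-precision smoke test along these lines could look roughly as follows. This is only a sketch with a plain SGD optimizer, not the contents of tests/test_mixed_precision.py; the real test would wrap the optimizer with the package's gradient-accumulation class.

```python
# Hypothetical sketch of a mixed-precision smoke test; not the actual test.
import numpy as np
import tensorflow as tf


def test_mixed_precision_smoke():
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
    try:
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
            tf.keras.layers.Dense(1, dtype="float32"),
        ])
        # The real test would use the package's gradient-accumulation optimizer
        # here; plain SGD keeps this sketch self-contained.
        model.compile(optimizer=tf.keras.optimizers.SGD(1e-2), loss="mse")
        x = np.random.rand(16, 4).astype("float32")
        y = np.random.rand(16, 1).astype("float32")
        history = model.fit(x, y, batch_size=4, epochs=1, verbose=0)
        assert np.isfinite(history.history["loss"][0])
    finally:
        tf.keras.mixed_precision.set_global_policy("float32")
```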

@dPys (Author) commented Feb 7, 2024

Happy to add further unit tests to cover the remaining uncovered lines of the new optimizer class following code review.

@andreped (Owner) commented Feb 7, 2024

> Happy to add further unit tests to cover the remaining uncovered lines of the new optimizer class following code review.

You are free to do that now, or in a new PR once this is merged. But let's see if all tests pass first. Looks very promising :] Great work!!!

@andreped (Owner) commented Feb 7, 2024

All tests seem to succeed. I see that you have removed a lot of the documentation on various methods that we use with autodoc to build the docs. We need to follow this standard, at least to verify that the documentation still builds and shows what we want the user to see.

You can see how I build the docs here, and use that as a reference when adding documentation to the different classes and methods:
https://github.com/andreped/GradientAccumulator/tree/main/docs
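For reference, a docstring in roughly this shape is what Sphinx autodoc picks up; the method and parameters below are placeholders, and the exact field convention should follow whatever the existing modules already use.

```python
# Illustrative docstring layout only; the method signature is hypothetical.
def apply_gradients(self, grads_and_vars, name=None, **kwargs):
    """Apply the accumulated gradients to the trainable variables.

    :param grads_and_vars: List of (gradient, variable) pairs.
    :param name: Optional name for the returned operation.
    :param kwargs: Keyword arguments forwarded to the wrapped optimizer.
    :return: The operation that applies the gradients.
    """
    ...
```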

> Happy to add further unit tests to cover the remaining uncovered lines of the new optimizer class following code review.

Lastly, yes, the code coverage seems to have taken a massive hit. We need to bring the unit tests back up to ensure that our code has been properly tested. This does not need to be perfect, but we should verify that as many of the methods and components as possible at least don't crash on first usage.

Feel free to ask questions when you make an attempt :]

@dPys (Author) commented Feb 7, 2024

> All tests seem to succeed. I see that you have removed a lot of the documentation on various methods that we use with autodoc to build the docs. We need to follow this standard, at least to verify that the documentation still builds and shows what we want the user to see.
>
> You can see how I build the docs here, and use that as a reference when adding documentation to the different classes and methods: https://github.com/andreped/GradientAccumulator/tree/main/docs
>
> > Happy to add further unit tests to cover the remaining uncovered lines of the new optimizer class following code review.
>
> Lastly, yes, the code coverage seems to have taken a massive hit. We need to bring the unit tests back up to ensure that our code has been properly tested. This does not need to be perfect, but we should verify that as many of the methods and components as possible at least don't crash on first usage.
>
> Feel free to ask questions when you make an attempt :]

Absolutely. I'm very used to a high standard for code coverage, both in and out of open-source projects, so I can very much appreciate this aspect. I've essentially refactored the majority of the optimizer, so the reduction in code coverage comes as no surprise. Provided that you are on board with all of the proposed changes in functionality and code organization, I'm certainly happy to put in the time to get the test coverage to where it needs to be :-)

@andreped (Owner) commented Feb 7, 2024

> Provided that you are on board with all of the proposed changes in functionality and code organization, I'm certainly happy to put in the time to get the test coverage to where it needs to be :-)

It would be better for me to have everything ready and in a similar state to the original code before I do my full review, but the solution looks fine for now, as all tests seem to pass. So feel free to get the code coverage to the same level as, or better than, it was :] Looking forward to testing this!

I will have to run some benchmarks to see if there are any performance differences, as well as test whether it still works in a distributed setting (to some extent). But I will do that after my full review.

@dPys (Author) commented Feb 7, 2024

> > Provided that you are on board with all of the proposed changes in functionality and code organization, I'm certainly happy to put in the time to get the test coverage to where it needs to be :-)
>
> It would be better for me to have everything ready and in a similar state to the original code before I do my full review, but the solution looks fine for now, as all tests seem to pass. So feel free to get the code coverage to the same level as, or better than, it was :] Looking forward to testing this!
>
> I will have to run some benchmarks to see if there are any performance differences, as well as test whether it still works in a distributed setting (to some extent). But I will do that after my full review.

Excited to be working on this with you!

@dPys (Author) commented Feb 7, 2024

As for the distributed setting, I've tested it for TensorFlow 2.0-2.10, but after 2.10, when the optimizer API changed, my guess is that apply_gradients could very well break (as it already does in main?), so that part might have to be addressed in a follow-up PR...
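For illustration, one common way to paper over the 2.10/2.11 optimizer API split is a version guard like the one below; this only sketches the general idea and is not necessarily the approach taken in this PR or in main.

```python
# Hedged sketch: select an optimizer base class depending on the TF version.
# The alias name is hypothetical; apply_gradients overrides would differ
# between the two base classes.
import tensorflow as tf

_TF_MAJOR_MINOR = tuple(int(p) for p in tf.__version__.split(".")[:2])

if _TF_MAJOR_MINOR >= (2, 11):
    # From TF 2.11 the new Keras optimizer API is the default; the previous
    # behaviour lives under tf.keras.optimizers.legacy.
    OptimizerBase = tf.keras.optimizers.legacy.Optimizer
else:
    OptimizerBase = tf.keras.optimizers.Optimizer
```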

@andreped (Owner) commented Feb 7, 2024

> As for the distributed setting, I've tested it for TensorFlow 2.0-2.10, but after 2.10, when the optimizer API changed, my guess is that apply_gradients could very well break (as it already does in main?), so that part might have to be addressed in a follow-up PR...

Nah, no need to address distributed support in this PR; that is outside its scope. Let's just get this PR through :]

Regarding distributed support, see issue #132, where we aim to find a solution. We can make a new PR if we ever get one, but I have yet to see a solution that does what I expect it to do.

@andreped (Owner) left a comment

The code looks technically sound, but let's get the code coverage back to a similarly high level as before this PR, and then I can do a more comprehensive review and thorough testing :]

@andreped (Owner) commented Feb 8, 2024

I see that you are making great progress on code coverage :]

I guess you are aware of it, but if you want to see exactly what remains to be covered, you can see the current report here:
https://app.codecov.io/gh/andreped/GradientAccumulator/pull/131?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=checks&utm_campaign=pr+comments&utm_term=André+Pedersen

Feel free to tag me when you feel it is good enough :] I noticed now that I had some "# pragma: no cover" statements, which you removed; removing them naturally lowers the reported coverage (as those lines were previously ignored altogether). I had added them because some of the methods were not that critical, but then again, having coverage on these as well would be better.
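For anyone skimming, this is the kind of marker being referred to; the class and method below are purely hypothetical and only show where the pragma sits.

```python
# A line or block ending in "# pragma: no cover" is excluded from the coverage
# report; removing the marker makes those lines count again. Hypothetical class.
class ExampleAccumulator:
    def get_config(self):  # pragma: no cover
        # Rarely exercised serialization path; previously skipped by coverage.
        return {"accum_steps": 4}
```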

@dPys (Author) commented Feb 11, 2024

Done

@andreped (Owner) commented

> Done

Great work, @dPys!! I will have a look tomorrow :] Looking forward to it!

@andreped (Owner) commented Feb 28, 2024

Sorry for the really late reply, @dPys! I have been quite preoccupied lately with work. I can review this PR after work today :]

@dPys (Author) commented Apr 16, 2024

Hi @andreped! Just wanted to follow up and see whether the clouds have parted for you yet with work. Eager to hear any final thoughts you might have on this contribution, and really just hopeful to see it merged :-)
