Implement SigmoidFocalLoss operation #3143
base: develop
Conversation
Force-pushed from 561ccf3 to 53b5359
It seems a part of this PR was included in #3146.
Force-pushed from 587d5e0 to 5427607
I have updated my code following the comments in #3146. Please take another look at my PR.
Force-pushed from e6ac834 to e53a4e3
@junliume can you take a look at the Windows build state, please? I added MIOPEN_INTERNALS_EXPORT but it still fails.
Force-pushed from e53a4e3 to 182ea0b
@iq136boy can you help me check the log in the Windows build stage, please?
May I ask you to merge the reduction and unreduction kernels and CPU code, as I suggested in #3166?
driver/sigmoid_focal_loss_driver.hpp (Outdated)
loss = alphaT * loss;
}
workspaceHost[id] = static_cast<Tcheck>(loss / divisor);
You don't need an external workspace for host functions - you are free to allocate memory right in this function (which is usually not possible or even does not make much sense on GPU).
But actually you don't need a workspace for CPU reduction, it's not a parallel algorithm without external synchronization.
Suggested change:
- workspaceHost[id] = static_cast<Tcheck>(loss / divisor);
+ outputHost[0] += static_cast<Tcheck>(loss / divisor);
Just don't forget to initialize outputHost[0] with 0.0f.
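A minimal standalone sketch of the suggested workspace-free CPU reduction (the `lossPerElement` vector and the free function are my own illustration; `Tcheck`, `divisor`, and `outputHost` follow the snippet above, and the accumulation into `outputHost[0]` replaces the staging into `workspaceHost`):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical standalone version of the suggested CPU-side reduction:
// accumulate every element's contribution directly into outputHost[0]
// instead of staging per-element results in a workspace buffer.
template <typename Tcheck>
void cpuReduceLoss(const std::vector<float>& lossPerElement,
                   float divisor,
                   std::vector<Tcheck>& outputHost)
{
    outputHost[0] = static_cast<Tcheck>(0.0f); // the initialization mentioned above
    for(std::size_t id = 0; id < lossPerElement.size(); ++id)
        outputHost[0] += static_cast<Tcheck>(lossPerElement[id] / divisor);
}
```

On the CPU this runs serially, so no external synchronization or scratch memory is needed, which is the reviewer's point.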
I declare workspaceHost in the driver as this allows us to compare the CPU workspace with the GPU workspace before executing the reduction.
Additionally, outputHost[0] += static_cast<Tcheck>(loss / divisor)
results in a verification failure since we use a reduction algorithm on the GPU. To avoid the verification failure, I have to mimic the same reduction algorithm in the CPU check function.
Shortly: CPU verification code must be as generic as possible and must not mimic any particular GPU implementation.
I declare workspaceHost in the driver as this allows us to compare the CPU workspace with the GPU workspace before executing the reduction.
Basically, workspace content is not specified and can't be compared. We can use multiple implementations for the same algorithm, so we can't do this check in the common case; and since this CPU implementation is used for that common case, it can't contain algorithm-specific features.
Additionally,
outputHost[0] += static_cast<Tcheck>(loss / divisor)
results in a verification failure since we use a reduction algorithm on the GPU. To avoid the verification failure, I have to mimic the same reduction algorithm in the CPU check function.
Mathematically it must be the same; because of FP rounding errors and other FP effects it may produce a bigger difference, but it must still work. If the error is too big, then the algorithm is not applicable and can't be used.
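The FP-rounding point can be illustrated outside MIOpen: summing the same float values in different orders gives results that agree only up to rounding error, which is why a generic CPU verifier should compare against the GPU result with a tolerance instead of replicating the GPU's reduction order. A hypothetical standalone example (not MIOpen code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum in natural order -- what a naive CPU verifier would do.
float sumForward(const std::vector<float>& v)
{
    float s = 0.0f;
    for(float x : v)
        s += x;
    return s;
}

// Pairwise (tree) sum over [lo, hi) -- mimicking the order a typical
// GPU parallel reduction uses. Requires hi > lo.
float sumPairwise(const std::vector<float>& v, std::size_t lo, std::size_t hi)
{
    if(hi - lo == 1)
        return v[lo];
    std::size_t mid = lo + (hi - lo) / 2;
    return sumPairwise(v, lo, mid) + sumPairwise(v, mid, hi);
}
```

The two sums are mathematically identical but differ in the last bits, so the verification compares them within a relative tolerance rather than bit-exactly.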
I have updated the verify method on the CPU to use naive accumulation.
Force-pushed from 0fdbf27 to a41a005
Force-pushed from a066413 to afc738f
Force-pushed from 2f827fd to 41b9300
@iq136boy For this PR only, can you and your colleagues give comments about the documentation problems? It's great to learn by example. Please point out the parts that you believe are important to have or to mention in the documents.
The PR is blocked by the following CI issue and needs a CI restart after the issue is fixed.
[2024-09-23T14:07:43.339Z] Exception occurred: org.kohsuke.github.HttpException: {"message":"API rate limit exceeded for user ID 49319081. If you reach out to GitHub Support for help, please include the request ID
This PR implements the torchvision.ops.sigmoid_focal_loss operation. There are no constraints here; MIOpen is faster than ROCm in all cases.
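For context, torchvision.ops.sigmoid_focal_loss computes a focal-modulated binary cross-entropy per element. A scalar CPU sketch following the torchvision definition (my own illustration, not the PR's kernel code; a real implementation would use a numerically stable BCE-with-logits rather than log(p) directly):

```cpp
#include <cassert>
#include <cmath>

// Per-element sigmoid focal loss, following the torchvision definition:
//   p      = sigmoid(input)
//   ce     = binary cross-entropy on p vs. target
//   p_t    = p*target + (1-p)*(1-target)
//   loss   = ce * (1 - p_t)^gamma
//   alphaT = alpha*target + (1-alpha)*(1-target), applied when alpha >= 0
float sigmoidFocalLoss(float input, float target,
                       float alpha = 0.25f, float gamma = 2.0f)
{
    float p  = 1.0f / (1.0f + std::exp(-input));
    float ce = -(target * std::log(p) + (1.0f - target) * std::log(1.0f - p));
    float pT = p * target + (1.0f - p) * (1.0f - target);
    float loss = ce * std::pow(1.0f - pT, gamma);
    if(alpha >= 0.0f)
    {
        float alphaT = alpha * target + (1.0f - alpha) * (1.0f - target);
        loss = alphaT * loss; // matches the alphaT scaling in the driver snippet above
    }
    return loss;
}
```

The reduced variants then sum (or average) these per-element values, which is exactly the reduction discussed in the review comments above.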
Average improvement over ROCm
Reduced kernels:
Unreduced kernels:
Detailed benchmark
Float32
Float16
BFloat16