HTTP 503s from Azure Devops Test Result API in dnceng-public/public #11872

Closed

MichalStrehovsky opened this issue Dec 7, 2022 · 15 comments

MichalStrehovsky (Member) commented Dec 7, 2022

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=104274

Build leg reported

System.Net.Http.Json.Unit.Tests.WorkItemExecution

Pull Request

dotnet/runtime#79332

Action required for the engineering services team

To triage this issue (First Responder / @dotnet/dnceng):

  • Open the failing build above and investigate
  • Add a comment explaining your findings

If this issue is causing build breaks across multiple builds and would benefit from being listed on the build analysis check, follow these steps:

  1. Add the label "Known Build Error"
  2. Edit this issue and add an error string in the JSON below that can help us match this issue with future build breaks, following the known issues documentation

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

If I follow the link from https://github.com/dotnet/runtime/pull/79332/checks?check_run_id=9936286963 to the console log, the console log reports success, but Build analysis said it was deadlettered.

https://helix.dot.net/api/2019-06-17/jobs/878ed4a3-d870-465c-9160-161f619b846d/workitems/System.Net.Http.Json.Unit.Tests/console
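
A quick way to double-check this is to pull the console log straight from the Helix API. A minimal sketch, assuming the endpoint above serves the raw log as plain text and needs no authentication for public jobs:

```python
# Minimal sketch: fetch the Helix console log linked above and print its tail.
# Assumes the endpoint returns the raw log as plain text (no auth for public jobs).
import requests

url = (
    "https://helix.dot.net/api/2019-06-17/jobs/"
    "878ed4a3-d870-465c-9160-161f619b846d/workitems/"
    "System.Net.Http.Json.Unit.Tests/console"
)
resp = requests.get(url, timeout=30)
resp.raise_for_status()
print(resp.text[-2000:])  # the pass/fail summary usually sits near the end of the log
```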

Report

Summary

24-Hour Hit Count: 0
7-Day Hit Count: 0
1-Month Count: 0
MattGal (Member) commented Dec 7, 2022

@MichalStrehovsky I'm not sure this should be using the "known build error" feature, but I will investigate and let you know what's happening either way.

MichalStrehovsky (Member Author) commented

Thanks! Sorry, I just followed the "Report infrastructure issue" link from build analysis and this was the template. None of the labels were intentional.

MattGal (Member) commented Dec 7, 2022

From the logs of this work item's last attempt, the problem is that, despite the tests passing, the server returned HTTP 503 when trying to insert the test results into Azure DevOps:

2022-12-07T08:30:50.479Z	ERROR  	azure_devops_result_publisher(305)	_send	Failed request with ClientRequestError, saved request to: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-79332-merge-878ed4a3d870465c91/System.Net.Http.Json.Unit.Tests.Attempt.3/__failed_azdo_request_content.json?helixlogtype=result
2022-12-07T08:30:50.479Z	ERROR  	azure_devops_result_publisher(164)	log_error	got error: Traceback (most recent call last):
  File "C:\h\scripts\helix-scripts\requests\adapters.py", line 489, in send
    resp = conn.urlopen(
  File "C:\h\scripts\helix-scripts\urllib3\connectionpool.py", line 725, in urlopen
    return self.urlopen(
  File "C:\h\scripts\helix-scripts\urllib3\connectionpool.py", line 725, in urlopen
    return self.urlopen(
  File "C:\h\scripts\helix-scripts\urllib3\connectionpool.py", line 725, in urlopen
    return self.urlopen(
  File "C:\h\scripts\helix-scripts\urllib3\connectionpool.py", line 711, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "C:\h\scripts\helix-scripts\urllib3\util\retry.py", line 399, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='dev.azure.com', port=443): Max retries exceeded with url: /dnceng-public/public/_apis/test/Runs/2232394/Results (Caused by ResponseError('too many 503 error responses'))

This is definitely an infrastructure problem with our Azure DevOps org. When this happens and all the retries are exhausted, the work item reports a -4 exit code (it has to; runtime tests will return a 0 exit code even when tests fail, so we need to make sure that any failures are detected) and the work item then goes to another machine for a retry.

When this retry happened, the problem kept occurring; after the third attempt the work item is considered un-runnable and goes to deadletter, as you've seen.
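
To make the failure mode concrete, here is a minimal sketch of the retry behavior described above. It is not the actual azure_devops_result_publisher code; the retry budget, backoff, and payload are illustrative assumptions, and only the endpoint shape and the 503 status come from the traceback:

```python
# Minimal sketch of the retry-until-MaxRetryError behavior, NOT the real publisher code.
# Retry budget, backoff, and payload are illustrative assumptions.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=Retry(
    total=3,                    # assumed retry budget
    backoff_factor=1,           # exponential backoff between attempts
    status_forcelist=[503],     # retry when Azure DevOps answers 503
    allowed_methods=["POST"],   # result publishing is a POST, which is not retried by default
)))

try:
    # Endpoint shape taken from the traceback above; run 2232394 is the failing example.
    session.post(
        "https://dev.azure.com/dnceng-public/public/_apis/test/Runs/2232394/Results",
        json=[],                # real requests carry the serialized test results
        timeout=30,
    )
except requests.exceptions.RetryError as err:
    # When every attempt comes back 503, urllib3 raises MaxRetryError ("too many 503
    # error responses") and requests surfaces it as RetryError; the publisher then exits
    # nonzero (the -4 above) so Helix marks the work item failed and re-queues it.
    print(f"Giving up after repeated 503s: {err}")
```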

I'll retitle this issue to reflect what's going on and create a ticket asking for investigation. Please let us know if this becomes a 100% repro (we'll keep an eye on it too); the severity of the incident can be bumped up in that case.

MattGal changed the title from "Build analysis reports test as deadlettered, but the logs say it succeeded in the end" to "HTTP 503s from Azure Devops Test Result API in dnceng-public/public" on Dec 7, 2022
MattGal (Member) commented Dec 7, 2022

@MichalStrehovsky I created https://portal.microsofticm.com/imp/v3/incidents/details/353884008 to ask Azure DevOps to perform an investigation here. Looking at the time window where your test failed, it only occurred for something like 0.2% of the work items in the same 2-hour period, so unless this starts happening more consistently we'll keep this at severity 3.

MattGal (Member) commented Dec 7, 2022

Put this in tracking until we hear back from the IcM.

MattGal (Member) commented Dec 13, 2022

Checked in today; no update on the IcM since 12/9, when it was assigned to the DRI. Pinged for status.

MattGal (Member) commented Dec 20, 2022

Checked in again, graphed values over time, and made myself available to meet.

MattGal (Member) commented Jan 4, 2023

I met with two folks from the IDC Azure Test Results team today and was able to show them several examples of 503s being hit in our logging. They will investigate and get back to me.

MattGal (Member) commented Jan 11, 2023

No updates since 1/4 meeting, so I pinged the IcM.

MattGal (Member) commented Feb 1, 2023

The IcM seems to have gained some traction with the related Azure teams. @ilyas1974, we only saw 2 of these in the last week, and both succeeded via Helix infra retry. Should we just close it?

MattGal (Member) commented Feb 15, 2023

Checked in today. 614 instances in the past 10 days. Pinged IcM.

dougbu (Member) commented Feb 16, 2023

How does this issue differ from #11723, other than not tracking failures automatically here❔

MattGal (Member) commented Feb 16, 2023

This is about two different services. While both are ostensibly "Azure DevOps", this issue is about 503s when posting test results from Helix machines, while #11723 is about NuPkg feeds. They are handled by different teams despite being in the same overall organization, so I think it merits being tracked separately.

MattGal (Member) commented Feb 28, 2023

Assigning to @ilyas1974 as I am no longer able to care about this issue.

MattGal assigned ilyas1974 and unassigned MattGal on Feb 28, 2023
ilyas1974 (Contributor) commented

duplicate of #11723

ilyas1974 closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 10, 2023