[windows][CI/CD] ADOT collector delayed start #1788

Closed
wants to merge 2 commits into from

Kausik-A
Contributor

Description:

Sets the ADOT collector agent service to the Automatic (Delayed Start) startup type to mitigate a known Go issue on Windows, reported with 1.9.2: golang/go#23479

With the Automatic startup type, services would not restart across reboots: they would time out before coming up, and the Service Control Manager would give up spawning them.

Link to tracking Issue: #1767

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Kausik-A Kausik-A requested a review from a team as a code owner January 27, 2023 19:55
Contributor

@bryan-aguilar bryan-aguilar left a comment

Based on the comment on the linked issue it appears that more investigation/testing is required on whether this proposed fix is the best path forward.

@DevOpsFu

FWIW, I have seen the same issue running the OTel collector contrib build on Windows (also packaged up and configured on the target system with WiX).

We already run this with the Automatic (Delayed Start) startup type, and we still see this issue on reboots, particularly after Windows updates. So unfortunately this probably won't mitigate the issue.

@bryan-aguilar
Contributor

@DevOpsFu thanks for the insight! Are there any other mitigations that lessen the effect? We are still exploring solutions around this problem and any feedback would help.

@DevOpsFu

I've not yet found anything that would mitigate the issue. At least, nothing that isn't a horrible hack that I wouldn't consider using, e.g.:

  • Setting the service to manual startup and starting it WAY after system boot to give the system time to settle down first.
  • Tweaking the ServicesPipeTimeout system registry value.

Even the Windows service recovery options to automatically restart the service on failure do not work - presumably because the service needs to at least start up successfully once and then die for the recovery measures to kick in, and this isn't happening here.

Here are some useful links though:

  1. A good summary of what might be the root cause: Windows Service Timeout shirou/gopsutil#570 (comment)
  2. An example of how the code might be refactored to move the loading of DLLs into an init function rather than global scope: Windows Service Timeout shirou/gopsutil#570 (comment)
  3. The core golang issue for this bug: runtime: Windows service timeout during system startup golang/go#23479
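
To make the idea in link 2 concrete, here is a minimal sketch of deferring an expensive load until it is first needed. It is not taken from the collector or from gopsutil: `expensiveLoad`, `getHandle` and `loadCount` are hypothetical names, and the sleep stands in for a slow DLL load. Note that it uses `sync.Once` for lazy loading on first use, which is a different mechanism from a Go `init()` function (an `init()` still runs before `main`).

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// loadCount records how many times the expensive load has run.
var loadCount int

// expensiveLoad is a hypothetical stand-in for a slow startup step such as
// loading a DLL. Writing "var handle = expensiveLoad()" at package level
// would run it before main starts.
func expensiveLoad() string {
	loadCount++
	time.Sleep(10 * time.Millisecond) // simulate slow I/O
	return "handle"
}

var (
	once   sync.Once
	handle string
)

// getHandle defers the load until first use and runs it at most once,
// even when called from several goroutines.
func getHandle() string {
	once.Do(func() { handle = expensiveLoad() })
	return handle
}

func main() {
	fmt.Println("loads before first use:", loadCount)
	getHandle()
	getHandle()
	fmt.Println("loads after two uses:", loadCount)
}
```

The point of the pattern is only that nothing slow happens before `main` gets a chance to run; whether this is where the collector actually spends its startup time is still an open question.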

I've not yet delved deeply into the OTel collector code (I'm not a seasoned Go developer either), but it seems to make sense to start by looking for any areas around the Windows-specific code that might be loading DLLs outside of a function declaration. I had a very quick look earlier today and couldn't find anything using GitHub's search functions. Maybe my searches were too specific though, or perhaps this isn't the issue at all.

From my limited understanding of the root cause though, the problem seems to be that something is making startup take a long time on first invocation of the executable. Normally this isn't a problem and probably is never visible on a fast system, but if you have the perfect storm of a system that has:

  • Fewer resources to start with
  • Just been rebooted after applying a load of OS updates, causing a load of I/O contention

Then it creates the perfect set of conditions for the startup to take longer than 30s, and then the Windows SCM considers the service to be dead.

@DevOpsFu

Having done some more digging into this tonight, it seems to me that it might not be possible to solve this one within the OTel collector itself without changing the service startup code. This issue describes the problem really nicely.

A viable workaround which would not require any hacks is to remove the responsibility of communicating with the Windows SCM from the OTel collector entirely - i.e. run it in interactive mode via a service wrapper such as NSSM or WinSW.

I prefer WinSW myself, and it lends itself well to being packaged up and installed by something like WiX. Rather than using the built-in install command to register the service, you can continue to set it up using a <ServiceInstall> block like you're already doing. In the associated XML file for the service, make sure to set the NO_WINDOWS_SERVICE environment variable to something nonzero so that the collector will definitely not start in service mode.

An example WinSW XML file might look like this:

<service>
  <id>otel-collector</id>
  <name>Open Telemetry Collector Service</name>
  <description>Some interesting description</description>
  <env name="NO_WINDOWS_SERVICE" value="1" />
  <logpath>%BASE%\logs</logpath>
  <log mode="append" />

  <executable>%BASE%\otelcol.exe</executable>
  <arguments>
    --config="%BASE%\otel-config.yaml"
  </arguments>
</service>
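
For anyone wondering what the NO_WINDOWS_SERVICE setting above is doing, here is a rough sketch of the kind of check it feeds into. This is an illustration under assumptions, not a copy of the collector's code: `interactiveRequested` is a hypothetical helper, and treating every value other than "0" (including an empty one) as "use interactive mode" is my reading of "something nonzero".

```go
package main

import (
	"fmt"
	"os"
)

// interactiveRequested reports whether NO_WINDOWS_SERVICE asks the process to
// skip service mode. The lookup function is injected so the rule can be
// exercised without touching the real environment.
// Assumption: any value other than "0" counts, including an empty string.
func interactiveRequested(lookup func(string) (string, bool)) bool {
	value, present := lookup("NO_WINDOWS_SERVICE")
	return present && value != "0"
}

func main() {
	// os.LookupEnv has the same signature as the lookup parameter.
	fmt.Println("interactive mode requested:", interactiveRequested(os.LookupEnv))
}
```

With the `<env>` element from the XML above in place, a check like this would see the value "1" and pick interactive mode, which leaves all SCM communication to WinSW.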

Hope this helps!

@DevOpsFu

Sorry for spamming this issue (plus sorry for not making these comments against the issue rather than this PR, I only just realised! 🤦 )

I tested the OTel collector on a Windows system this morning under heavy CPU load. Using the binary directly as a service yielded the issue with the service failing to start up quickly enough for the Windows SCM.

Testing the OTel collector running under WinSW was successful; the service responded to the Windows SCM quickly and the service status was shown as running. The OTel collector itself took about 7 minutes to fully start up in the background (this was on a system with the CPU load at a constant 100%). Right now, in the absence of any solution in the initialization code in the OTel collector, I'd say that this is the best mitigation for this issue.
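
The behaviour I saw with WinSW (status reported quickly, slow startup continuing in the background) can be sketched in a few lines of Go. This is only a simulation of the pattern: `startService` is a hypothetical function, the 200 ms sleep stands in for the slow startup, and nothing here talks to the real SCM. Real service code would do that through a package such as golang.org/x/sys/windows/svc.

```go
package main

import (
	"fmt"
	"time"
)

// startService returns a "running" status immediately and performs the slow
// initialization in a background goroutine. The returned channel is closed
// once initialization has finished.
func startService(slowInit func()) (string, <-chan struct{}) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		slowInit()
	}()
	return "running", done
}

func main() {
	start := time.Now()
	status, ready := startService(func() {
		time.Sleep(200 * time.Millisecond) // simulated slow startup
	})
	fmt.Printf("status %q reported after %v\n", status, time.Since(start).Round(time.Millisecond))

	<-ready // wait for the background initialization to complete
	fmt.Printf("initialization finished after %v\n", time.Since(start).Round(time.Millisecond))
}
```

The trade-off is that "running" no longer means "fully initialized", so anything depending on the collector needs to tolerate it not accepting data for a while after the service shows as started.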

@bryan-aguilar
Contributor

@DevOpsFu Thank you very much for sharing your learning with us. We appreciate this and will definitely be working off your initial research as we look for the correct long-term solution. Other team members have shared the initialization fix proposal with me, and it is something that we are looking at.

We do have control over that in the ADOT distribution, but the effort required to implement it is not clear to me. In the Prometheus example the implementation was pretty straightforward, so I think we could take a stab at that. I'm hoping that with a significant amount of testing we can reduce this issue. The good thing about the initialization fix proposal is that it would be something we could contribute back upstream if it works out well for the ADOT distribution.

@github-actions
Contributor

This PR is stale because it has been open 30 days with no activity.

Successfully merging this pull request may close these issues.

[windows][CI/CD] ADOT collector delayed start
3 participants