Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RSDK-9213] restart on fd error #351

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mattjperez
Copy link
Member

No description provided.

@mattjperez mattjperez self-assigned this Nov 14, 2024
@mattjperez mattjperez requested a review from a team as a code owner November 14, 2024 21:49
@acmorrow
Copy link
Member

The ticket says that we should either raise the fd limit, or restart when out of fds.

  • Did we investigate raising the fd limit? If so, what were the results? Related: did we check whether there is any sort of sdkconfig that makes file descrriptor exhaustion a crash?
  • This only handles fd exhaustion in the app signaling path, but that isn't the only place we create files (sockets). Is there something particularly unique about failing to obtain an fd in the app signaling path that leads to a hang such that it requires special handling, but that we can get away with not doing elsewhere?

@mattjperez
Copy link
Member Author

mattjperez commented Nov 15, 2024

The ticket says that we should either raise the fd limit, or restart when out of fds.

* Did we investigate raising the fd limit? If so, what were the results? Related: did we check whether there is any sort of sdkconfig that makes file descrriptor exhaustion a crash?

* This only handles fd exhaustion in the app signaling path, but that isn't the only place we create files (sockets). Is there something particularly unique about failing to obtain an fd in the app signaling path that leads to a hang such that it requires special handling, but that we can get away with not doing elsewhere?

There is an option in sdkconfig, however we don't appear to be setting it yet and I don't see any sane default set in the documentation, might need to dive into source. We could choose to set it, but we would need to make sure we do that in all the sdkconfigs and ensure whoever uses our ffi has these settings somehow (maybe an option in esp-idf component options). I didn't see any direct way to set behavior when fds are exhausted

As for handling outside of signaling:

  • we could write a general wrapper/macro for all EspErrors, that way they can be identified at any point throughout the stack
  • signaling happens fairly regularly, no? as the exhaustion would be system-wide, should another part of the stack experience this error first, signaling may eventually run into the same issue, triggering the restart anyway.

To answer the second question in your second point better, I believe this prevents incoming connections from succeeding, though I've only encountered this particular error once live. I can try setting the sdkconfig with a low limit to observe the error better and see what the implications are.

UPDATE: looking at esp-idf, the default is 0 (no limit).

@acmorrow
Copy link
Member

UPDATE: looking at esp-idf, the default is 0 (no limit).

I don't think that configuration parameter controls the number of FDs - I think it is specifically about protecting FAT files from corruption due to simultaneous operations.

@acmorrow
Copy link
Member

@mattjperez - I suspect the limit we are hitting may be https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/kconfig-reference.html#config-lwip-max-sockets.

Have you been able to create a repro for https://viam.atlassian.net/browse/RSDK-9213? If not, could you please try to find one? Maybe try increasing the max connections and then instantiate several clients.

And then you could try rebuilding with a higher value for CONFIG_LWIP_MAX_SOCKETS and see if it cures it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants