[Bug] Failed to obtain RX buffer #595

Open
alexzeit opened this issue Aug 15, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@alexzeit

Describe the bug

When the MCU running zenoh-pico is restarted, it sporadically logs "<err> eth_stm32_hal: Failed to obtain RX buffer" and the subscriber stops receiving messages.
This happens only if the publisher (e.g. the Python one) is already running, i.e. it was started before the MCU with zenoh-pico. Adding a delay (e.g. 10 s) before z_declare_subscriber does not help.
Peer mode is used.

To reproduce

  1. Configure a Python publisher and subscriber in peer mode with interval=2ms
  2. Configure a zenoh-pico subscriber and publisher in peer mode with interval=1ms
  3. Start the Python publisher and subscriber
  4. Start the zenoh-pico publisher and subscriber
  5. Restart the MCU running zenoh-pico

System info

  • zenoh-pico v1.0.0.6 running on a Nucleo-144 H753ZI (Cortex-M7, 480 MHz)
    Zephyr v3.7.0.0, SDK v0.16.8
  • zenoh-python v1.0.0.6 running on Ubuntu 22.04 on an NXP RDB3 eval board
@alexzeit alexzeit added the bug Something isn't working label Aug 15, 2024
@jean-roland
Contributor

Thanks for the report @alexzeit, I will try to reproduce it on my F767ZI.

@jean-roland
Contributor

jean-roland commented Aug 27, 2024

Just to clarify: you have a Python publisher sending a message every 2 ms to a pico subscriber, and a pico publisher sending a message every 1 ms to a Python subscriber, all in peer mode.
Do you have two boards, or are the publisher and subscriber on the same Nucleo?

Actually, it would probably be easier if you could send me the project files you used.

@alexzeit
Author

Hi Jean-Roland
Yes, but I have observed the same behaviour with the C++ pub/sub at 1 ms in peer mode. We have one board, on which the publisher and subscriber run in separate threads of the Zephyr RTOS.

@jean-roland
Contributor

jean-roland commented Aug 30, 2024

Alright, so it seems the error message is produced by Zephyr when it runs out of RX buffers to store incoming messages.

My guess is that this breaks the connection, and since we do not yet have connectivity event support (see issue #333), the only option is to restart the node.
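Until connectivity events exist, the stall could at least be detected from the application side so the node can restart itself instead of silently going deaf. A minimal sketch in plain C, assuming the subscriber callback can record a millisecond timestamp for every sample (on Zephyr, e.g. from `k_uptime_get()`); the `rx_watchdog_*` names and the 100 ms timeout are hypothetical and are not part of zenoh-pico or Zephyr. The actual restart (for example via Zephyr's `sys_reboot()`) is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical liveness monitor (not a zenoh-pico API): the subscriber
 * callback feeds it on every sample, and a periodic task polls it to decide
 * whether the node should be restarted. */
typedef struct {
    int64_t last_rx_ms; /* timestamp of the most recent sample */
    int64_t timeout_ms; /* allowed silence before declaring a stall */
    bool seen_any;      /* never trip before the first sample has arrived */
} rx_watchdog;

void rx_watchdog_init(rx_watchdog *wd, int64_t timeout_ms) {
    assert(timeout_ms > 0);
    wd->last_rx_ms = 0;
    wd->timeout_ms = timeout_ms;
    wd->seen_any = false;
}

/* Call from the subscriber callback with the current uptime in ms. */
void rx_watchdog_feed(rx_watchdog *wd, int64_t now_ms) {
    wd->last_rx_ms = now_ms;
    wd->seen_any = true;
}

/* True once samples have stopped arriving for longer than the timeout. */
bool rx_watchdog_expired(const rx_watchdog *wd, int64_t now_ms) {
    return wd->seen_any && (now_ms - wd->last_rx_ms) > wd->timeout_ms;
}
```

With a 100 ms timeout and samples every 2 ms up to t = 50 ms, the monitor still reports healthy at t = 120 ms and expired at t = 151 ms. One caveat of this design: it cannot tell a stalled link from a publisher that has legitimately stopped, so the timeout should only be armed while the remote side is expected to be sending.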

Alternatively, you can try increasing the number of RX buffers; that should reduce the occurrence rate. See https://docs.zephyrproject.org/2.7.5/reference/kconfig/CONFIG_NET_BUF_RX_COUNT.html and https://docs.zephyrproject.org/2.7.5/reference/kconfig/CONFIG_NET_PKT_RX_COUNT.html
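As a rough rule of thumb (an assumption, not something the Zephyr docs state), the RX pool has to hold every frame that arrives while nothing is draining it: about stall duration / message interval frames, plus some headroom, with each frame occupying several fixed-size data fragments of CONFIG_NET_BUF_DATA_SIZE bytes. The sketch below computes that estimate; `rx_pool_needed`, the one-frame-per-message assumption and the example numbers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rough estimate of RX pool sizes needed to ride out a period in which the
 * application is not draining the Ethernet RX queue. Assumes one Ethernet
 * frame per zenoh message and fixed-size net_buf data fragments; both are
 * simplifying assumptions. */
typedef struct {
    uint32_t pkts; /* candidate value for CONFIG_NET_PKT_RX_COUNT */
    uint32_t bufs; /* candidate value for CONFIG_NET_BUF_RX_COUNT */
} rx_pool_estimate;

rx_pool_estimate rx_pool_needed(uint32_t stall_ms, uint32_t interval_ms,
                                uint32_t frame_bytes, uint32_t buf_data_size,
                                uint32_t headroom) {
    assert(interval_ms > 0 && buf_data_size > 0);
    /* Frames that arrive while nobody reads: ceil(stall / interval). */
    uint32_t frames = (stall_ms + interval_ms - 1) / interval_ms;
    /* Each frame occupies ceil(frame_bytes / buf_data_size) fragments. */
    uint32_t bufs_per_frame = (frame_bytes + buf_data_size - 1) / buf_data_size;
    rx_pool_estimate est;
    est.pkts = frames + headroom;
    est.bufs = (frames + headroom) * bufs_per_frame;
    return est;
}
```

For example, a hypothetical 100 ms stall at the Python publisher's 2 ms interval, with ~200-byte frames, 128-byte fragments and a headroom of 4, gives 54 packets and 108 fragments. The required pool grows linearly with the stall, so if the stall is unbounded, a bigger pool only delays exhaustion.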

It also means pico has a hard time keeping up with this message rate; as discussed before, we're going to look into performance after the 1.0 release.

@alexzeit
Author

alexzeit commented Aug 31, 2024

Yes, the message seems to come from Zephyr, but it is caused by the zenoh core. I think the issue is that zenoh starts the Ethernet receiver, but it takes some time before it starts consuming bytes from the Ethernet RX buffer. This is supported by the fact that when the Python publisher is not running during zenoh startup, the issue does not occur. I have tried increasing the RX buffer, but this did not solve the problem.
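The hypothesis above (frames pile up in a fixed pool while zenoh-pico is not yet reading, so the failure only shows up when the remote publisher is already active at start-up) can be illustrated with a toy simulation. This is a hypothetical model written for illustration only: `rx_model` and its parameters are invented, one frame per tick is assumed, and it does not reflect how eth_stm32_hal or Zephyr's network stack actually allocate buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (not Zephyr's driver logic) of a fixed RX buffer pool:
 * one frame arrives per tick once the remote publisher is active, and the
 * application only starts draining the pool at `consumer_start`. */
typedef struct {
    uint32_t pool_size;       /* number of RX buffers in the pool */
    uint32_t publisher_start; /* tick at which the remote publisher starts */
    uint32_t consumer_start;  /* tick at which the application starts reading */
    uint32_t drain_per_tick;  /* frames the application consumes per tick */
} rx_model;

/* Returns how many frames found no free buffer (the "Failed to obtain RX
 * buffer" case) over `ticks` simulated ticks. */
uint32_t rx_model_drops(const rx_model *m, uint32_t ticks) {
    assert(m->pool_size > 0);
    uint32_t in_use = 0, drops = 0;
    for (uint32_t t = 0; t < ticks; t++) {
        if (t >= m->publisher_start) {      /* a frame arrives */
            if (in_use < m->pool_size) {
                in_use++;
            } else {
                drops++;                    /* pool exhausted: frame lost */
            }
        }
        if (t >= m->consumer_start) {       /* application drains the pool */
            uint32_t n = m->drain_per_tick < in_use ? m->drain_per_tick : in_use;
            in_use -= n;
        }
    }
    return drops;
}
```

With `pool_size = 8`, `consumer_start = 20` and `drain_per_tick = 2`, a publisher that is already sending at tick 0 causes 13 drops over 100 ticks, while one that starts at tick 30 causes none; with a 32-buffer pool the same 20-tick delay is absorbed. In this model a larger pool only helps while it exceeds the backlog built up during the delay.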

@jean-roland
Copy link
Contributor

So I tried reproducing the issue on my board with a pub/sub at a 1 ms period, without success (or failure?). Is it possible for you to send me the files you used for the board and PC?
