Matter server "freezes" every 3-4 days with >10% ram usage and constant >25% CPU usage #124647

Closed
3oris opened this issue Aug 26, 2024 · 11 comments

Comments

@3oris

3oris commented Aug 26, 2024

The problem

In my setup on a Home Assistant Green with about 90 Matter-over-Thread devices (mostly Nanoleaf and Eve), the Matter server more or less freezes after about 4 days: entities are shown as offline although they are online and usable in other fabrics, while CPU usage stays constantly above 25% and RAM usage above 10%.

When running healthy, CPU usage mostly shows <1%. During that time RAM usage increases steadily from 3% to >10%, up to the point where all Matter devices suddenly show as offline.
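A rough way to put a number on the growth described above is to treat it as linear, which is an assumption: the report only says that usage rises from about 3% to over 10% in roughly 4 days. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def growth_per_day(start_pct: float, end_pct: float, days: float) -> float:
    """Average growth in percentage points per day, assuming a linear trend."""
    if days <= 0:
        raise ValueError("days must be positive")
    return (end_pct - start_pct) / days


def days_until(current_pct: float, threshold_pct: float, rate_per_day: float) -> float:
    """Days until the threshold is reached at the given rate (inf if not growing)."""
    if rate_per_day <= 0:
        return float("inf")
    return max(0.0, (threshold_pct - current_pct) / rate_per_day)


# Figures taken from the report: ~3% at start, ~10% after ~4 days.
rate = growth_per_day(3.0, 10.0, 4.0)
remaining = days_until(3.0, 10.0, rate)
```

With those figures the rate works out to 1.75 percentage points per day, which is consistent with the server reaching the 10% mark after about 4 days.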

See logs below

What version of Home Assistant Core has the issue?

core-2024.8.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

matter

Link to integration documentation on our website

No response

Diagnostics information

core_matter_server_2024-08-26T15-21-30.760Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Observing an extremely long re-subscription back-off, e.g.

Previous subscription failed with Error: 50, re-subscribing in 13140 ms...
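To see how often this happens and how large the back-off gets, the attached log can be scanned for lines of this shape. This is a minimal sketch: the pattern is modelled only on the single line quoted above, so other variants of the message would not be matched, and the function names are hypothetical.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern derived from the one log line quoted in this report (an assumption
# about the general format, not a documented log schema).
RESUBSCRIBE_RE = re.compile(
    r"Previous subscription failed with Error: (?P<error>\d+), "
    r"re-subscribing in (?P<delay_ms>\d+) ms"
)


class Resubscribe(NamedTuple):
    error: int
    delay_ms: int


def find_resubscribes(lines: Iterable[str]) -> Iterator[Resubscribe]:
    """Yield one Resubscribe per log line that matches the pattern."""
    for line in lines:
        match = RESUBSCRIBE_RE.search(line)
        if match:
            yield Resubscribe(int(match["error"]), int(match["delay_ms"]))


def summarize(lines: Iterable[str]) -> dict:
    """Count re-subscription events and report the longest back-off seen."""
    events = list(find_resubscribes(lines))
    if not events:
        return {"count": 0, "max_delay_ms": None}
    return {"count": len(events), "max_delay_ms": max(e.delay_ms for e in events)}
```

Passing an open file object (for example the downloaded add-on log) to `summarize` works because files iterate line by line.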

Additional information

6 border routers (BRs):

  • 5 Nest Hubs (Fuchsia 18)
  • 1 OTBR hosted outside the HA Green
@3oris 3oris changed the title Matter server "freezes" every 3-4 days with >10% ram usage and constantly >25% CPU usage Matter server "freezes" every 3-4 days with >10% ram usage and constant >25% CPU usage Aug 26, 2024
@home-assistant

Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of matter can trigger bot actions by commenting:

  • @home-assistant close Closes the issue.
  • @home-assistant rename Awesome new title Renames the issue.
  • @home-assistant reopen Reopen the issue.
  • @home-assistant unassign matter Removes the current integration label and assignees on the issue, add the integration domain after the command.
  • @home-assistant add-label needs-more-information Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
  • @home-assistant remove-label needs-more-information Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


matter documentation
matter source
(message by IssueLinks)

@3oris
Author

3oris commented Aug 29, 2024

This also happens in core-2024.8.3.

@marcelveldt
Member

This issue is most likely related to #123835.

We are investigating the "devices become unavailable" issue, but the CPU/RAM usage is new to me; it could be a side effect of the actual issue. You have a large number of devices, which may be why others are not seeing it.

@olivievranska

I saw similar issues with other devices and integrations when I was running a Home Assistant Green (HAG). After some tests and trials I moved my HA installation to a Raspberry Pi 5 (8 GB RAM, 256 GB NVMe) and realised that the issues were related to the weak HAG, because they are gone now. The HAG is fine if you want to play with a few devices, but if you want to manage more devices and/or several integrations, you should consider moving to more powerful hardware.

@3oris
Author

3oris commented Sep 10, 2024

@marcelveldt -- in the meantime I had scheduled my Matter server to restart every 3 days...

Since I also follow #124503 and observe the same behavior described there, I just updated to Matter server 6.5.0b2. I will turn off the automatic server restart now and report back whether one of the changes in b2 also positively affects my issue.

@3oris
Author

3oris commented Sep 10, 2024

@marcelveldt -- First observations:

  1. Much faster recovery of the whole system after a Matter server restart: 95% of the devices are back online within 5 minutes, whereas earlier it took over an hour for most devices to show an online state in the HA UI.
  2. All online devices now show the correct state (sometimes with a noticeable delay though, probably due to Thread limitations).
  3. RAM usage tripled:
    In my earlier bug report I stated that RAM usage started at ~3% and grew to >10% by the time the server died. Now with 6.5.0b2, RAM usage starts at ~10% right away. As I understand it, you made a few changes that affect subscription management, so that might have an immediate effect.

Let's see what happens during the course of the next 3-4 days.

@marcelveldt
Member

Thanks for testing. Yeah, with your number of devices, about 10% memory usage on an HA Green sounds about right. There is probably some optimization left for us to do, but on the other hand it's not that crazy considering that you have 90 devices for which we keep all attributes and subscriptions in memory. So if CPU usage stays stable/minimal and memory doesn't rise over time, we can consider the issue fixed.

BTW, we got a lot of good feedback on 6.5.0b2, so we just promoted it to stable.

@3oris
Author

3oris commented Sep 16, 2024

Hi @marcelveldt, I have since updated to 6.5.0 and 6.5.1, so the server didn't even have the chance to run 4 days in a row.

Still,

  • CPU usage has typically stayed below 6% so far (95% of the time even <1%)
  • In contrast to the beta release, with the final 6.5.x releases memory consumption started at ~3% again and never got any higher than 6%

So this is already good progress.

I did see a few devices lose their connection though (offline for about a day). They could still be pinged from HA at that point, and after the ping they promptly came back online.

So I wonder whether this has to do with the change in home-assistant-libs/python-matter-server#882, which landed in 6.5.1.

(fyi, @agners )

@marcelveldt
Member

To me, your original issue is resolved. That CPU and memory consumption is perfectly fine for your amount of devices on the HA Green. Great!

As for the availability issue:
What are you using as border router(s)? We have multiple reports of people with multiple Apple border routers running into issues where devices become unresponsive. Do you see errors like CASE session timeouts in the logs?

If you also have Apple border routers, track progress here:
#123835

Otherwise, please create a new issue report for the new issue so we can close/finalize this one.
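For checking the logs for CASE session timeouts, a simple line filter may be enough as a first pass. This is only a sketch: the exact wording of those log messages is not given in this thread, so the filter assumes that a relevant line contains the upper-case acronym `CASE` together with some spelling of "timeout". The function name and the sample lines in the usage note are made up for illustration.

```python
import re
from typing import Iterable, List

# Assumption: relevant lines contain the acronym CASE in upper case ...
CASE_RE = re.compile(r"\bCASE\b")
# ... and some spelling of "timeout" / "timed out" / "time-out", in any letter case.
TIMEOUT_RE = re.compile(r"timed?[ -]?out", re.IGNORECASE)


def find_case_timeouts(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention both CASE and a timeout."""
    hits = []
    for line in lines:
        if CASE_RE.search(line) and TIMEOUT_RE.search(line):
            hits.append(line.rstrip("\n"))
    return hits
```

Called on an open log file, this returns the matching lines in order. A line such as "in case of timeout, retry" is skipped because the acronym check is case-sensitive.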

@3oris
Author

3oris commented Sep 16, 2024

> To me, your original issue is resolved. That CPU and memory consumption is perfectly fine for your amount of devices on the HA Green. Great!

Yes. As I said, on my side it should be able to survive 5-6 days; 4 days was usually the threshold at which the Matter server itself lost the connection to all devices at once.

> As for the availability issue: What are you using as border router(s)? We have multiple reports of people with multiple Apple border routers running into issues where devices become unresponsive. Do you see errors like CASE session timeouts in the logs?
>
> If you also have Apple border routers, track progress here: #123835

Thanks for the heads-up. As per the description I have

  • 1 OTBR hosted on RPI 3
  • 5 Nest Hubs 2nd gen (albeit updated to F20 now)

So I should have a look into the logs and open a new issue.

@3oris
Author

3oris commented Sep 17, 2024

Follow-up here: #126136


4 participants