Matter server "freezes" every 3-4 days with >10% ram usage and constant >25% CPU usage #124647

3oris · 2024-08-26T15:39:37Z

The problem

In my setup on a Home Assistant Green with about 90 Matter-over-Thread devices (mostly Nanoleaf and Eve), the Matter server more or less freezes after about 4 days: entities are shown as offline although they are online and usable in other fabrics, while CPU usage sits constantly above 25% and RAM usage above 10%.

When running healthily, CPU usage mostly shows <1%. During that time RAM usage increases steadily from 3% to >10%, up to the point where all Matter devices suddenly show as offline.

See logs below

What version of Home Assistant Core has the issue?

core-2024.8.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

matter

Link to integration documentation on our website

No response

Diagnostics information

core_matter_server_2024-08-26T15-21-30.760Z.log

Example YAML snippet

No response

Anything in the logs that might be useful for us?

Observing an extremely long re-subscription back-off, e.g.

Previous subscription failed with Error: 50, re-subscribing in 13140 ms...
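To get an overview of how often this happens and how long the back-off gets, lines like the one above can be pulled out of the attached log with a small script. This is only a minimal sketch: the pattern is inferred from the single line quoted here, so the real log may contain variations it does not match, and the names `parse_resubscribe` and `summarize` are made up for illustration.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Pattern inferred from the single log line quoted in this report
# (assumption: other re-subscription lines follow the same wording).
RESUBSCRIBE_RE = re.compile(
    r"Previous subscription failed with Error: (?P<error>\d+), "
    r"re-subscribing in (?P<delay_ms>\d+) ms"
)


def parse_resubscribe(line: str) -> Optional[Tuple[int, int]]:
    """Return (error_code, delay_ms) for a re-subscription line, else None."""
    match = RESUBSCRIBE_RE.search(line)
    if match is None:
        return None
    return int(match["error"]), int(match["delay_ms"])


def summarize(lines: Iterable[str]) -> Tuple[Dict[int, int], int]:
    """Count failures per error code and track the longest back-off seen."""
    counts: Dict[int, int] = {}
    max_delay_ms = 0
    for line in lines:
        parsed = parse_resubscribe(line)
        if parsed is None:
            continue  # not a re-subscription line
        error, delay_ms = parsed
        counts[error] = counts.get(error, 0) + 1
        max_delay_ms = max(max_delay_ms, delay_ms)
    return counts, max_delay_ms
```

Feeding it the lines of the log file (for example `summarize(open(path))`, with `path` pointing at the downloaded log) returns the per-error counts and the largest delay in milliseconds.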

Additional information

6 BRs:

5 Nest Hubs (Fuchsia 18)
1 OTBR hosted outside the HA Green

home-assistant · 2024-08-26T19:45:55Z

Hey there @home-assistant/matter, mind taking a look at this issue as it has been labeled with an integration (matter) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of matter can trigger bot actions by commenting:

@home-assistant close: Closes the issue.
@home-assistant rename Awesome new title: Renames the issue.
@home-assistant reopen: Reopens the issue.
@home-assistant unassign matter: Removes the current integration label and assignees on the issue; add the integration domain after the command.
@home-assistant add-label needs-more-information: Adds a label (needs-more-information, problem in dependency, problem in custom component) to the issue.
@home-assistant remove-label needs-more-information: Removes a label (needs-more-information, problem in dependency, problem in custom component) from the issue.

(message by CodeOwnersMention)

matter documentation
matter source
(message by IssueLinks)

3oris · 2024-08-29T09:19:56Z

Also happens in core-2024.8.3.

marcelveldt · 2024-09-02T09:20:15Z

This issue is most likely related to #123835.

We are investigating the "devices become unavailable" issue, but the CPU/RAM usage is new to me; it could be a side effect of the actual issue. You have a large number of devices, so that may be why others are not seeing it.

olivievranska · 2024-09-08T13:12:11Z

I saw similar issues with other devices and integrations when I was running a HAG. After some trial and error I moved my HA installation to a Raspberry Pi 5 (8 GB RAM, 256 GB NVMe), and since the issues are gone now, I concluded they were caused by the HAG being underpowered. The HAG is fine if you want to play with a few devices, but if you want to manage more devices and/or several integrations, you should consider moving to more powerful hardware.

3oris · 2024-09-10T12:53:06Z

@marcelveldt -- in the meantime I had scheduled my matter server to restart every 3 days...

Since I also follow #124503 and do observe the same behavior described over there, I just updated to matter server 6.5.0b2. I will turn off the automatic server restart now, and report back if one of the changes in b2 also positively affects my issue.

3oris · 2024-09-10T13:09:41Z

@marcelveldt -- First observations:

Way faster recovery of the "whole system" after a Matter server restart: 95% of the devices are back online within 5 minutes, whereas earlier it took over an hour for most devices to show an online state in the HA UI.
All online devices now show the correct state (sometimes with a noticeable delay though, probably due to Thread limitations).
RAM usage tripled:
In my earlier bug report I stated that RAM usage started at ~3% and grew to >10% by the time the server died. Now with 6.5.0b2 RAM usage starts at ~10% right away. As I understand it, you made a few changes that affect subscription management, so that might have an immediate effect.

Let's see what happens during the course of the next 3-4 days.

marcelveldt · 2024-09-10T13:55:20Z

Thanks for testing. Yeah, with your number of devices, about 10% memory usage on an HA Green sounds about right. There is probably some optimization left for us to do, but on the other hand it's not that crazy considering that you have 90 devices for which we keep all attributes and subscriptions in memory. So if CPU stays stable/minimal and memory doesn't rise over time, we can consider the issue fixed.

BTW, we got a lot of good responses on 6.5.0b2, so we just promoted it to stable.
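One rough way to check the "memory doesn't rise over time" criterion is to note the add-on's RAM percentage once a day and fit a straight line through the readings. The following is only a sketch under stated assumptions: the sample values are made up (loosely echoing the 3% to >10% growth over about 4 days reported above), the 0.25 points-per-day tolerance is an arbitrary choice, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (days since restart, RAM usage in percent)


def slope_per_day(samples: Sequence[Sample]) -> float:
    """Least-squares slope of the samples, in percentage points per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def memory_is_rising(samples: Sequence[Sample], tolerance: float = 0.25) -> bool:
    """True if RAM usage grows faster than `tolerance` points per day."""
    return slope_per_day(samples) > tolerance


# Made-up readings for illustration only.
leaking = [(0, 3.0), (1, 5.0), (2, 7.0), (3, 9.0)]
stable = [(0, 3.0), (1, 4.5), (2, 3.5), (3, 4.0)]
```

With these illustrative readings, `memory_is_rising(leaking)` flags steady growth while `memory_is_rising(stable)` does not; a few days of real readings would be needed before drawing any conclusion.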

3oris · 2024-09-16T09:18:42Z

Hi @marcelveldt, I have since updated to 6.5.0 and then 6.5.1, so the server hasn't even had the chance to run 4 days in a row.

Still,

CPU usage has so far stayed below 6% (95% of the time even <1%)
In contrast to the beta release, with 6.5.x final, memory consumption started at ~3% again and never went above 6%

So this is already good progress.

I did see a few devices lose their connection though (they were offline for about a day). They could still be pinged from HA at that point, and after the ping they promptly came back online.

So, I wonder if it has to do with this change home-assistant-libs/python-matter-server#882 which happened in 6.5.1.

(fyi, @agners )

marcelveldt · 2024-09-16T09:34:35Z

To me, your original issue is resolved. That CPU and memory consumption is perfectly fine for your amount of devices on the HA Green. Great!

As for the availability issue:
What are you using as border router(s)? We have multiple reports of people with multiple Apple border routers running into issues where devices become unresponsive. Do you see errors like CASE session timeouts in the logs?

If you also have Apple border routers, track progress here:
#123835

Otherwise, create a new issue report for the new issue so we can close/finalize this one.

3oris · 2024-09-16T15:47:14Z

> To me, your original issue is resolved. That CPU and memory consumption is perfectly fine for your amount of devices on the HA Green. Great!

Yes. As said, to be sure on my side it should survive 5-6 days; 4 days was usually the threshold at which the Matter server itself just lost connection to all devices at once.

> As for the availability issue: What are you using as border router(s)? We have multiple reports of people with multiple Apple border routers running into issues where devices become unresponsive. Do you see errors like CASE session timeouts in the logs?
>
> If you also have Apple border routers, track progress here: #123835

Thanks for the heads-up. As per the description I have

1 OTBR hosted on RPI 3
5 Nest Hubs 2nd gen (albeit updated to F20 now)

So I should have a look into the logs and open a new issue.

3oris · 2024-09-17T16:00:18Z

Follow-up here: #126136

3oris changed the title ~~Matter server "freezes" every 3-4 days with >10% ram usage and constantly >25% CPU usage~~ Matter server "freezes" every 3-4 days with >10% ram usage and constant >25% CPU usage Aug 26, 2024

mib1185 added the integration: matter label Aug 26, 2024

3oris closed this as completed Sep 17, 2024

3oris mentioned this issue Sep 20, 2024

Matter Server: All device offline all of a sudden #126136

Closed

github-actions bot locked and limited conversation to collaborators Oct 17, 2024
