skip generating auto_restart.conf for syncd #51

yanjundeng · 2023-03-30T22:56:18Z

[Issue Desc]
When kill critical process inside syncd container, it will trigger restart of syncd/swss and other containers.
During the process, we see the following ERR messages in syslog:

Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_LAG_ID = oid:0x200000000000007
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_PORT_ID = oid:0x10000000000000f
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- processSingleVid: failed to create object SAI_OBJECT_TYPE_LAG_MEMBER: SAI_STATUS_OBJECT_IN_USE

In fact, besides the above ERR messages, there are some other ERR messages related to syncd.

[Rootcause Analysis]
After killing one critical process, for example “syncd” inside syncd container, the supervisor monitor “supervisor-proc-exit-listener” gets the event.
The monitor then tries to send SIGTERM to its parent to stop all other processes inside syncd container.
When syncd is stopped, these tun interfaces “EthernetXX” inside host become Down and are removed. We can see many following logs during this period:

Mar 21 04:48:32.976780 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet232(ok:down) to state db
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981661 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet232(ok) from state db
Mar 21 04:48:33.032850 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035246 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet216(ok:down) to state db
Mar 21 04:48:33.035347 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035572 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet216(ok) from state db

Once the syncd container is stopped, the systemd tries to run its syncd stop script according to the “ExecStop” setting to do some postStop handling.
Later, the swss service tries stop itself by “docker-wait-any” since its peer syncd exited.
At the same time, the syncd tries to auto-restart by systemd since its auto-restart is TRUE. Yet it need lock “/tmp/swss-syncd-lock” before initializing.

After swss stops all process inside its container, it unlocks the “/tmp/swss-syncd-lock” and tries to stop its dependent and peer services in turn.
The syncd got the lock and begin to initializing. During this process of syncd initializing, we hit some syncd ERR msgs reported by this Jira.

After a while, the swss script tries to stop its peer service : syncd service.
After stopped all its dependent and peer services, the swss is auto-restarted by systemd.
After swss container is started, it then tries to start its dependent and peer service, include syncd service. Then the system recovers to normal.

According to the above analysis of the whole process, these syncd ERR messages happen during the period of the first syncd auto-restart.
As to this restart, the DB is not flushed, especially ASIC_DB. Since the stop of syncd triggers tun interfaces Down and remove, which could affect the state DB and ASIC DB.
Then, this incorrect ASIC DB could affect the initializing of syncd.

[Solution]
The syncd should avoid using this intermediate incorrect DB.
There are two possible solutions for this issue:

when stop swss, flush ASIC_DB for non-warm stop.
Don't auto-restart syncd by itself, let its peer swss watch and start it. ( same as 202012)

As to option 1, we can flush ASIC_DB when executing STOP for swss service if it’s not a WARM case. After flushing the ASIC_DB, the syncd could initialize from a clean DB.
As to option 2, it’s same as 202012 branch, we should remove the auto-restart config for syncd service, i.e, the following file:
/etc/systemd/system/syncd.service.d/auto_restart.conf

This auto_restart config file is generated by “src/sonic-host-services/scripts/hostcfgd”, imported by
[hostcfgd] Initialize Restart= in feature's systemd config by the v… · sonic-net/sonic-buildimage@8a76cdc (github.com)

The 202012 branch doesn’t have this merge.
We can simply skip “syncd” in “hostcfgd” when generating “auto_restart.conf”.

Since the status of syncd service is monitored by swss service, we can always let swss service take charge of the restart of syncd service.

We choose solution 2 here.

[Similar issue]
Marvell reported a similar issue, which should have the same rootcause:
syslog errors observed during syncd init process · Issue #1248 · sonic-net/SONiC (github.com)

Our fix solution could resolve this kinda issue together.

vaibhavhd · 2023-04-11T23:53:54Z

scripts/hostcfgd

@@ -383,6 +383,10 @@ class FeatureHandler(object):
        # On multi-ASIC device, creates systemd configuration file for each feature instance
        # residing in difference namespace.
        for feature_name in feature_names:
+            # As to syncd service, swss will monitor its stop and restart it, don't need systemd to auto-restart it.
+            if "syncd" in feature_name:


This change will permanently disable syncd auto restart, right?

Intention for this change stems from the fact that swss restart will auto-trigger syncd restart .
Therefore, there should be a verification of such a dependency before disabling auto restart.

I think better design will be that syncd auto restart sequence checks (at runtime) the dependency of swss, and then aborts the syncd auto restart. Benefits:

Any future change on swss dependency will not break syncd autorestart logic as the check for dependency will happen at runtime and not at image build time.
So if syncd is removed from dependent service list then yncd autorestart will still work.

cc: @prsunny , @yxieca for awareness.

@vaibhavhd vaibhavhd, Currently, the dependency between syncd and swss is defined in swss.sh. Now the unnecessary auto-restart of syncd is triggered by systemd. And the swss also use systemctl to re-start syncd. Seems we could not differentiate these two auto-restart from syncd service's definition. Any idea to check the dependency and block one at runtime?

BTW, do we have any plan to remove the dependency between SWSS and syncd in future? If we could remove their dependency in future, for example, we could just restart syncd only, or just restart swss only. If we could make it, we must be able to ensure the ASIC DB is consistent and correct during these process. If so, it's safe to add the auto-restart flag back for syncd in future.

In fact, the current issue in syncd we hit is kinda like that we have removed the dependency. During that short window, the syncd is auto-restarted by systemd and it still tries to use the legacy ASIC_DB during startup.

Why we let swss take charge of auto-restart of syncd here? That's because before swss start syncd and other services, it always flush DB at first. So that the syncd's restart could get a clear ASIC DB.

Syncd and swss are designed to start/stop together due to they needs to be synchronized.

However, we do let the swss to take more control than syncd from day 1.

In this sense, I think syncd auto-stop is sufficient as long as swss can pick up the signal (by docker wait any) and restart both services.. @prsunny do you have concerns?

@yanjundeng do you know who added the syncd auto restart after 202012? Let's get the author of that change also chime in here.

kevinskwang · 2023-04-13T01:31:31Z

As discussed, please check the error at SAI/SDK level by first.

skip auto_restart.conf for syncd

f60ce33

kevinskwang requested review from saiarcot895 and prsunny April 3, 2023 01:58

kevinskwang requested a review from vaibhavhd April 10, 2023 08:30

vaibhavhd requested a review from yxieca April 11, 2023 22:25

vaibhavhd reviewed Apr 11, 2023

View reviewed changes

kevinskwang mentioned this pull request Apr 13, 2023

skip generating auto_restart.conf for syncd #52

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

skip generating auto_restart.conf for syncd #51

skip generating auto_restart.conf for syncd #51

yanjundeng commented Mar 30, 2023 •

edited

Loading

vaibhavhd Apr 11, 2023

vaibhavhd Apr 11, 2023

yanjundeng Apr 12, 2023 •

edited

Loading

yanjundeng Apr 12, 2023

yxieca Apr 18, 2023

yxieca Apr 18, 2023

kevinskwang commented Apr 13, 2023

skip generating auto_restart.conf for syncd #51

Are you sure you want to change the base?

skip generating auto_restart.conf for syncd #51

Conversation

yanjundeng commented Mar 30, 2023 • edited Loading

vaibhavhd Apr 11, 2023

Choose a reason for hiding this comment

vaibhavhd Apr 11, 2023

Choose a reason for hiding this comment

yanjundeng Apr 12, 2023 • edited Loading

Choose a reason for hiding this comment

yanjundeng Apr 12, 2023

Choose a reason for hiding this comment

yxieca Apr 18, 2023

Choose a reason for hiding this comment

yxieca Apr 18, 2023

Choose a reason for hiding this comment

kevinskwang commented Apr 13, 2023

yanjundeng commented Mar 30, 2023 •

edited

Loading

yanjundeng Apr 12, 2023 •

edited

Loading