Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

skip generating auto_restart.conf for syncd #51

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yanjundeng
Copy link

@yanjundeng yanjundeng commented Mar 30, 2023

[Issue Desc]
When kill critical process inside syncd container, it will trigger restart of syncd/swss and other containers.
During the process, we see the following ERR messages in syslog:

Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_INGRESS_DISABLE = false
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_LAG_ID = oid:0x200000000000007
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- listFailedAttributes: SAI_LAG_MEMBER_ATTR_PORT_ID = oid:0x10000000000000f
Mar 21 02:36:12.410980 dut-400g ERR syncd#syncd: :- processSingleVid: failed to create object SAI_OBJECT_TYPE_LAG_MEMBER: SAI_STATUS_OBJECT_IN_USE

In fact, besides the above ERR messages, there are some other ERR messages related to syncd.

[Rootcause Analysis]
After killing one critical process, for example “syncd” inside syncd container, the supervisor monitor “supervisor-proc-exit-listener” gets the event.
The monitor then tries to send SIGTERM to its parent to stop all other processes inside syncd container.
When syncd is stopped, these tun interfaces “EthernetXX” inside host become Down and are removed. We can see many following logs during this period:

Mar 21 04:48:32.976780 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet232(ok:down) to state db
Mar 21 04:48:32.981504 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet232 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:183 master:0 type:tun
Mar 21 04:48:32.981661 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet232(ok) from state db
Mar 21 04:48:33.032850 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:16 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035246 dut-400g NOTICE swss#portsyncd: :- onMsg: Publish Ethernet216(ok:down) to state db
Mar 21 04:48:33.035347 dut-400g NOTICE swss#portsyncd: :- onMsg: nlmsg type:17 key:Ethernet216 admin:0 oper:0 addr:a4:78:06:a1:5a:00 ifindex:182 master:0 type:tun
Mar 21 04:48:33.035572 dut-400g NOTICE swss#portsyncd: :- onMsg: Delete Ethernet216(ok) from state db

Once the syncd container is stopped, the systemd tries to run its syncd stop script according to the “ExecStop” setting to do some postStop handling.
Later, the swss service tries stop itself by “docker-wait-any” since its peer syncd exited.
At the same time, the syncd tries to auto-restart by systemd since its auto-restart is TRUE. Yet it need lock “/tmp/swss-syncd-lock” before initializing.

After swss stops all process inside its container, it unlocks the “/tmp/swss-syncd-lock” and tries to stop its dependent and peer services in turn.
The syncd got the lock and begin to initializing. During this process of syncd initializing, we hit some syncd ERR msgs reported by this Jira.

After a while, the swss script tries to stop its peer service : syncd service.
After stopped all its dependent and peer services, the swss is auto-restarted by systemd.
After swss container is started, it then tries to start its dependent and peer service, include syncd service. Then the system recovers to normal.

According to the above analysis of the whole process, these syncd ERR messages happen during the period of the first syncd auto-restart.
As to this restart, the DB is not flushed, especially ASIC_DB. Since the stop of syncd triggers tun interfaces Down and remove, which could affect the state DB and ASIC DB.
Then, this incorrect ASIC DB could affect the initializing of syncd.

[Solution]
The syncd should avoid using this intermediate incorrect DB.
There are two possible solutions for this issue:

  1. when stop swss, flush ASIC_DB for non-warm stop.
  2. Don't auto-restart syncd by itself, let its peer swss watch and start it. ( same as 202012)

As to option 1, we can flush ASIC_DB when executing STOP for swss service if it’s not a WARM case. After flushing the ASIC_DB, the syncd could initialize from a clean DB.
As to option 2, it’s same as 202012 branch, we should remove the auto-restart config for syncd service, i.e, the following file:
/etc/systemd/system/syncd.service.d/auto_restart.conf

This auto_restart config file is generated by “src/sonic-host-services/scripts/hostcfgd”, imported by
[hostcfgd] Initialize Restart= in feature's systemd config by the v… · sonic-net/sonic-buildimage@8a76cdc (github.com)

The 202012 branch doesn’t have this merge.
We can simply skip “syncd” in “hostcfgd” when generating “auto_restart.conf”.

Since the status of syncd service is monitored by swss service, we can always let swss service take charge of the restart of syncd service.

We choose solution 2 here.

[Similar issue]
Marvell reported a similar issue, which should have the same rootcause:
syslog errors observed during syncd init process · Issue #1248 · sonic-net/SONiC (github.com)

Our fix solution could resolve this kinda issue together.

@kevinskwang kevinskwang requested a review from vaibhavhd April 10, 2023 08:30
@vaibhavhd vaibhavhd requested a review from yxieca April 11, 2023 22:25
@@ -383,6 +383,10 @@ class FeatureHandler(object):
# On multi-ASIC device, creates systemd configuration file for each feature instance
# residing in difference namespace.
for feature_name in feature_names:
# As to syncd service, swss will monitor its stop and restart it, don't need systemd to auto-restart it.
if "syncd" in feature_name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will permanently disable syncd auto restart, right?

Intention for this change stems from the fact that swss restart will auto-trigger syncd restart .
Therefore, there should be a verification of such a dependency before disabling auto restart.

I think better design will be that syncd auto restart sequence checks (at runtime) the dependency of swss, and then aborts the syncd auto restart. Benefits:

Any future change on swss dependency will not break syncd autorestart logic as the check for dependency will happen at runtime and not at image build time.
So if syncd is removed from dependent service list then yncd autorestart will still work.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @prsunny , @yxieca for awareness.

Copy link
Author

@yanjundeng yanjundeng Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vaibhavhd vaibhavhd, Currently, the dependency between syncd and swss is defined in swss.sh. Now the unnecessary auto-restart of syncd is triggered by systemd. And the swss also use systemctl to re-start syncd. Seems we could not differentiate these two auto-restart from syncd service's definition. Any idea to check the dependency and block one at runtime?

BTW, do we have any plan to remove the dependency between SWSS and syncd in future? If we could remove their dependency in future, for example, we could just restart syncd only, or just restart swss only. If we could make it, we must be able to ensure the ASIC DB is consistent and correct during these process. If so, it's safe to add the auto-restart flag back for syncd in future.

In fact, the current issue in syncd we hit is kinda like that we have removed the dependency. During that short window, the syncd is auto-restarted by systemd and it still tries to use the legacy ASIC_DB during startup.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we let swss take charge of auto-restart of syncd here? That's because before swss start syncd and other services, it always flush DB at first. So that the syncd's restart could get a clear ASIC DB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Syncd and swss are designed to start/stop together due to they needs to be synchronized.

However, we do let the swss to take more control than syncd from day 1.

In this sense, I think syncd auto-stop is sufficient as long as swss can pick up the signal (by docker wait any) and restart both services.. @prsunny do you have concerns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanjundeng do you know who added the syncd auto restart after 202012? Let's get the author of that change also chime in here.

@kevinskwang
Copy link

As discussed, please check the error at SAI/SDK level by first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants