Root Cause Analysis (RCA) on disappeared SLE BCI 15 SP5 based containers on registry.suse.com #25
dirkmueller announced in Announcements (closed)
Also consider generating an SBOM for the images before publishing and checking it for the presence of critical packages, their version numbers, os-release metadata, etc. Google has a container test project as well to help with this.
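The SBOM check suggested here could look roughly like the following minimal sketch. It assumes the SBOM is available as CycloneDX-style JSON with a top-level `components` list whose entries carry `name` and `version`. The function name `check_sbom`, the package names, and the version strings are hypothetical and chosen only for illustration; they do not describe SUSE's actual pipeline.

```python
import json


def check_sbom(sbom, required):
    """Return a list of problems found in an SBOM (an empty list means it passed).

    sbom     -- parsed SBOM dict with a "components" list (CycloneDX-style JSON)
    required -- dict mapping package name to an expected version,
                or to None when only presence matters
    """
    found = {c["name"]: c.get("version") for c in sbom.get("components", [])}
    problems = []
    for name, expected in required.items():
        if name not in found:
            problems.append(f"missing package: {name}")
        elif expected is not None and found[name] != expected:
            problems.append(
                f"{name}: expected version {expected}, found {found[name]}"
            )
    return problems


# Illustrative SBOM fragment; names and versions are made up for the example.
example_sbom = json.loads("""
{
  "components": [
    {"type": "library", "name": "glibc", "version": "2.38"},
    {"type": "library", "name": "ca-certificates", "version": "2.60"}
  ]
}
""")

problems = check_sbom(example_sbom, {"glibc": "2.38", "zypper": None})
```

A publishing job could refuse to push the image whenever the returned list is non-empty.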
Between Saturday, June 22nd, 04:15 UTC and Tuesday, June 25th, 09:30 UTC, several containers from SLE BCI disappeared or were replaced by previous versions on registry.suse.com.
On Saturday morning, a release event of an updated SLE BCI container led to all SLE BCI containers being deleted from registry.suse.com. This was recognized and reported on Saturday by internal and external users. SUSE investigated the reports and identified that the container build and publishing system failed to complete an important aggregation step. Rather than collecting the containers to publish to the registry, it incorrectly finished with no aggregations due to an erroneous code change in the build system. This led to the temporary loss of containers with metadata in the public registry.suse.com instance.
In some cases, prior versions of the containers reappeared under the ":latest" or other respective floating tags. The containers reverted to the state of the registry as of approximately July 2023, which led to a number of failures in CI systems and meant that certain patches applied since July 2023 may no longer have been in effect. Also, some users may have experienced difficulties with launching SLE BCI containers.
A restoration of the missing containers in registry.suse.com started on Monday, June 24th around 08:30 UTC. This process was completed by Tuesday, June 25 at 09:32 UTC.
No data was modified or tampered with from the outside. No intrusion occurred, and no integrity was violated within our systems.
Technical details
On Friday, June 20, 2024, the SUSE Build Operations Team deployed a code change to the Internal Build Service that was intended to aggregate Helm charts. Due to a logic error not caught by tests, this code change led to the deletion rather than the aggregation of containers.
On Saturday morning, the usual automated updating pipeline of SLE BCI ran, leading to a new aggregation with a now empty result. This resulted in the temporary deletion of all SLE 15 SP5 based containers on the registry.
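One way to guard against this failure mode is to refuse to publish when the aggregation result is empty or would remove an unusually large share of what is already in the registry. The following is a minimal sketch under assumptions: `plan_publish`, `PublishAborted`, the 20% threshold, and the image references are hypothetical names and values chosen for illustration, not part of SUSE's build system.

```python
class PublishAborted(Exception):
    """Raised when a publish plan looks destructive and needs human review."""


def plan_publish(published, aggregated, max_removal_fraction=0.2):
    """Return (to_add, to_remove), or raise PublishAborted.

    published  -- image references currently present in the registry
    aggregated -- image references produced by the aggregation step
    """
    published = set(published)
    aggregated = set(aggregated)

    # An empty aggregation while the registry holds content is far more
    # likely to be a bug than an intentional removal of everything.
    if published and not aggregated:
        raise PublishAborted(
            "aggregation returned no images; refusing to delete everything"
        )

    to_add = aggregated - published
    to_remove = published - aggregated

    # Large removals are also suspicious, so cap them at a fraction
    # of what is currently published.
    if published and len(to_remove) / len(published) > max_removal_fraction:
        raise PublishAborted(
            f"plan would remove {len(to_remove)} of {len(published)} images"
        )

    return to_add, to_remove


# Illustrative registry content: ten hypothetical image references.
published = {f"bci/image-{i}:15.5" for i in range(10)}

try:
    plan_publish(published, set())
    aborted = False
except PublishAborted:
    aborted = True  # the empty aggregation is rejected instead of applied
```

The threshold trades safety against convenience: a legitimate large cleanup would need an explicit override, which seems preferable to an automated pipeline silently emptying the registry.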
Within hours, users and customers began alerting SUSE of the issue. The issue was analyzed by SUSE and escalated internally to the respective teams through Saturday and Sunday. On Sunday, part of the incident (the issue that triggered HTTP 500 errors on some registry pages) was resolved. On Monday around 08:13 UTC, the remaining issue was resolved by deploying a fix and re-running the aggregation step. Over the course of about 24 hours, all content was restored on the SUSE registry. After the completion of the restoration, these issues were resolved.
Learnings
SUSE has taken the following learnings from this incident, which have already been implemented or will be implemented shortly.
Improve response times in the event of incidents.