
Split brain using Akka.Cluster.Sharding, Akka.Management, Akka.Discovery.KubernetesApi #2494

Open
garethjames-imburse opened this issue May 14, 2024 · 4 comments

@garethjames-imburse commented May 14, 2024

I’m hoping someone can shed some light on why our Kubernetes deployments are so sensitive to split brains with our Helm chart and Akka configuration set up the way they are.

The documentation is fairly brief, not always explaining how the various settings work and when to use them, so it's unclear to us if we're following best practices for our deployment scenario.

We have five applications (alpha, bravo, charlie, delta, echo), each deployed from a single Helm chart as a stateful set to Kubernetes. Each stateful set has three replicas. The pods that are created are as follows:

pod-alpha-0
pod-alpha-1
pod-alpha-2
pod-bravo-0
pod-bravo-1
pod-bravo-2
pod-charlie-0
pod-charlie-1
pod-charlie-2
pod-delta-0
pod-delta-1
pod-delta-2
pod-echo-0
pod-echo-1
pod-echo-2

We are using Akka.Cluster.Sharding and Akka.Management + Akka.Discovery.KubernetesApi to form the cluster. This generally works well, except that approximately 3% of the time we end up with a split brain when performing a rolling deployment. This seems like an unusually high failure rate and is causing some problems.

The HOCON we were using initially was as follows:

akka {
    cluster {
        downing-provider-class = "Akka.Cluster.SBR.SplitBrainResolverProvider, Akka.Cluster"
        split-brain-resolver {
            active-strategy = keep-majority
        }
    }

    discovery {
        method = "kubernetes-api"
        kubernetes-api {
            class = "Akka.Discovery.KubernetesApi.KubernetesApiServiceDiscovery, Akka.Discovery.KubernetesApi"

            api-ca-path = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
            api-token-path = "/var/run/secrets/kubernetes.io/serviceaccount/token"
            api-service-host-env-name = "KUBERNETES_SERVICE_HOST"
            api-service-port-env-name = "KUBERNETES_SERVICE_PORT"

            pod-namespace-path = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"
            pod-domain = "cluster.local"
            pod-label-selector = "actorsystem={0}"
            use-raw-ip = false
            container-name = ""
        }
    }

    extensions = ["Akka.Management.Cluster.Bootstrap.ClusterBootstrapProvider, Akka.Management"]

    management {
        http {
            port = 8558
            hostname = "" # <- Overridden in Helm chart template with pod IP address as env var
        }
        cluster.bootstrap {
            new-cluster-enabled = on
            contact-point-discovery {
                service-name = "myactorsystem"
                port-name = "management"
                required-contact-point-nr = 2
                stable-margin = 5s
                contact-with-all-contact-points = true
            }
        }
    }
}
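
For context, our SBR block above only sets the strategy and otherwise relies on defaults. A minimal sketch of the timing knobs involved (the values shown are what we understand the Akka.NET defaults to be, not settings we have changed):

akka.cluster.split-brain-resolver {
    active-strategy = keep-majority

    # How long the membership/reachability view must remain unchanged
    # before the resolver takes a downing decision
    stable-after = 20s

    # If the cluster keeps changing (flapping reachability, joining members)
    # for too long past stable-after, down everything rather than risk
    # two sides each deciding they are the surviving majority
    down-all-when-unstable = on
}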

Following the section Deployment Considerations from the Akka.Management repo docs, we made the following changes to the configuration:

akka.cluster.shutdown-after-unsuccessful-join-seed-nodes = 30s
akka.discovery.kubernetes-api.container-name = "..." # <- Overridden in Helm chart template with container name as env var
akka.management.cluster.bootstrap.new-cluster-enabled=off
akka.management.cluster.bootstrap.contact-point-discovery.stable-margin = 15s
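
One setting that is commonly recommended alongside shutdown-after-unsuccessful-join-seed-nodes (our assumption about the usual guidance; it is not something we had set explicitly) is letting CoordinatedShutdown terminate the process, so Kubernetes restarts the pod and bootstrap retries with a clean slate:

# Sketch: exit the CLR when CoordinatedShutdown runs (e.g. after an unsuccessful join),
# so the pod is restarted by Kubernetes instead of lingering outside the cluster
akka.coordinated-shutdown.exit-clr = on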

After making these changes, deployments appeared to work as expected while testing (just as they do most of the time). However, when we were more aggressive and randomly killed a handful of pods, we would often end up with none of the nodes being in a cluster (verified with PBM).

The last adjustments we made were as follows:

akka.cluster.shutdown-after-unsuccessful-join-seed-nodes = 30s
akka.discovery.kubernetes-api.container-name = "..." # <- Overridden in Helm chart template with container name as env var
akka.management.cluster.bootstrap.new-cluster-enabled=on
akka.management.cluster.bootstrap.contact-point-discovery.contact-with-all-contact-points = false
akka.management.cluster.bootstrap.contact-point-discovery.required-contact-point-nr = 5
akka.management.cluster.bootstrap.contact-point-discovery.stable-margin = 15s
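
For readability, here are the same final overrides expressed in nested form (the container name is still injected from the Helm chart template):

akka {
    cluster.shutdown-after-unsuccessful-join-seed-nodes = 30s

    discovery.kubernetes-api.container-name = "..." # <- Overridden in Helm chart template with container name as env var

    management.cluster.bootstrap {
        new-cluster-enabled = on
        contact-point-discovery {
            contact-with-all-contact-points = false
            required-contact-point-nr = 5
            stable-margin = 15s
        }
    }
}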

This seems to have yielded the best results overall, but we're concerned that setting new-cluster-enabled=off has not proved very useful and that we're still vulnerable to split brains during deployment.

Does anyone have any experience and/or advice for similar scenarios using these Akka features?

@Aaronontheweb (Member) commented

Hi @garethjames-imburse - sorry for the delay on this. Have you run into this problem again since?

@garethjames-imburse (Author) commented Jun 20, 2024

Hi @Aaronontheweb, we were unable to get anywhere with new-cluster-enabled=off - very often, when killing pods, we ended up with no cluster at all (since we required at least 2 contact points to form one). Instead we tuned many of the other available settings through trial and error, and deployments appear to be stable now.

@Aaronontheweb (Member) commented

@garethjames-imburse would you mind sharing some of your configuration settings? I'm very interested in seeing if we can reproduce this issue in our test lab at all, since we rely heavily on K8s service discovery there.

@garethjames-imburse (Author) commented

@Aaronontheweb, apologies for the delay - thank you for inviting us to share our configuration. I've reached out to you separately to discuss further but I'll paste any useful information back here.
