Health check does not seem to work on shard-worker when redis is down #1627

aisbaa · 2024-02-09T13:37:14Z

I've noticed that when redis gets restarted workers become unavailable for 5-10 minutes. Despite workers not working properly, they still reports that its being healthy. I tried 3 approaches for health checks: is tcp port is open, is http request on /metrics, grpc health check request, all of them indicate that service is healthy.

While trying to build a target with bazel it reports that no workers are available:

WARNING: Uploading BEP referenced local file /home/user/.cache/bazel/_bazel_aistis/b97476d719d716accead0f2d5b93104f/execroot/__main__/bazel-out/k8-fastbuild/bin/src/... hash: "664187f8ab2293dd73c447f1dffad1c017b30741fe13c81f43104c255e116828"                                                                                                                                                                            size_bytes: 48311816
: UNAVAILABLE: no available workers                                                                                                                                                                                                                
WARNING: Uploading BEP referenced local file /home/user/.cache/bazel/_bazel_aistis/b97476d719d716accead0f2d5b93104f/execroot/__main__/bazel-out/k8-fastbuild/bin/src/... hash: "664187f8ab2293dd73c447f1dffad1c017b30741fe13c81f43104c255e116828"
size_bytes: 48311816
: UNAVAILABLE: no available workers
WARNING: Uploading BEP referenced local file /home/user/.cache/bazel/_bazel_aistis/b97476d719d716accead0f2d5b93104f/execroot/__main__/bazel-out/k8-fastbuild/bin/src/... hash: "bf0ba0c0768ba81de802e1da7dfea43d3ea75cff00692a2bee6a0ca7451871dc"
size_bytes: 43277416
: UNAVAILABLE: no available workers
WARNING: Uploading BEP referenced local file /home/user/.cache/bazel/_bazel_aistis/b97476d719d716accead0f2d5b93104f/execroot/__main__/bazel-out/k8-fastbuild/bin/src/...  hash: "bf0ba0c0768ba81de802e1da7dfea43d3ea75cff00692a2bee6a0ca7451871dc"
size_bytes: 43277416
: UNAVAILABLE: no available workers
WARNING: Uploading BEP referenced local file /tmp/bazel_20240209122150_7zuty hash: "b3fb4189066629798f27c0f45f8bfb93be652b8cb60ab385c239d67c70ebd2b2"
size_bytes: 447227
: UNAVAILABLE: no available workers

After restarting workers build completed with no warnings.

The text was updated successfully, but these errors were encountered:

werkt · 2024-02-15T14:33:13Z

Buildfarm is currently intolerant to redis configuration changes. Redis clusters that change IPs/names are not reflected in server/worker grouped communication. Its something I've been meaning to experiment with tolerating, but the current guidance is to only run buildfarm on stable redis clusters, and to bring it down when clusters' layouts (including slot allocations) change. A number of modifications will need to be made to have the entire cluster respond correctly to reconfigured clusters.

aisbaa · 2024-02-16T09:33:04Z

Thanks for explanation @werkt, from what I've observed is that if I restart buildfarm after redis comes back up, it starts working as usual. I can give a demo if your up for zoom call.

One more anecdotal data point, we have old internal buildfarm build (before 2.8.0) which closes the tcp port if redis goes down. This allows me to use tcp liveness probe. In case of redis restart buildfarm gets restarted by kubernetes.

Edit:

Redis clusters that change IPs/names are not reflected in server/worker grouped communication.

For redis we use kubernetes ClusterIP service which is like load balance with static IP address. We also pass redis address as domain (f.e.: redis://redis.buildfarm-java:6379).

werkt · 2024-02-19T14:01:02Z

I think I see - you're questioning the discrepancy between 'No available workers' and the threefold service availability on the workers.

The problem needs to be looked at on the servers first - conditions in the cluster are leading to them, not the workers, responding with UNAVAILABLE.

The servers need to see the manifest of current storage workers in redis to read/write blobs. I don't know if a failing connection to redis will inspire it into something other than the UNAVAILABLE behavior from above.
Further, if all of the workers listed in storage are inaccessible as named in the storage map, you will get the UNAVAILABLE response above.
I suggest trying this with a single instance, seeding broken values, introduce redis shutdowns or node removals, and see the results. With explicit steps for an unexpected reproducer case, you can augment this issue and/or write a unit test.

aisbaa · 2024-02-20T13:16:21Z

I think we are going away from the main problem I was referring to. Which is buildfarm shard-worker is not healthy (as observed in the dashboard), but liveness and readiness probes report that shard-worker is fine.

Manually sending curl request to worker-shard readiness/liveness endpoit returns 200:

% curl -I http://10.32.8.185:9090/
HTTP/1.1 200 OK
Date: Tue, 20 Feb 2024 13:12:18 GMT
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 37180

werkt · 2024-02-20T18:23:19Z

200 on the prometheus endpoint doesn't necessarily indicate 'fine', but I agree that there's no presentation difference on any public interface that a worker can't talk to the backplane if it happens to be down. Ostensibly, the worker is fine, and coping with the fact that it can't talk to redis at a particular moment.

Starting a worker does fail currently if it cannot connect to redis, which happens because backplane initialization was never made tolerant to it. With little effort, I could actually make it start cleanly and function normally, without any connection available to redis.
Killing my connected redis server during steady state and then putting it back up however does allow the worker to continue without a restart.
Workers with a disconnected redis are able to serve files as normal CAS members, and without an execution pipeline enabled, would essentially serve as standalone grpc CAS endpoints.

So the statement "Health check does not seem to work..." is not accurate: the worker is healthy. Redis, hosting the backplane, is not.

What do you want the worker to do differently when the backplane cannot be reached?

aisbaa · 2024-03-04T07:54:17Z

Thank you for explanation, it does make sense. I initially though that redis is required for all operations and killing it makes the cluster unavailable.

To add to that, I did find one worker in bad state and prometheus endpoint was not responding for that worker. So the health checks must be working too.

Do you know a simple way to put a worker in a bad state so that it would report that it is not healthy?

aisbaa · 2024-03-18T15:37:57Z

I've got a new data point. We are using automated GKE node pool upgrade, which happened over the weekend, which also restarted redis. While servers have to survive the redis restart, workers failed to rejoin servers.

Health check says that everything is fine:

 > curl -I http://10.80.17.157:9090/
HTTP/1.1 200 OK
Date: Mon, 18 Mar 2024 15:07:02 GMT
Content-type: text/plain; version=0.0.4; charset=utf-8
Content-length: 33036

Logs indicate that worker failed to reconnect to redis:

Exception in thread "Worker.failsafeRegistration" redis.clients.jedis.exceptions.JedisClusterMaxAttemptsException: No more cluster attempts left.
        at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:152)
        at redis.clients.jedis.JedisClusterCommand.run(JedisClusterCommand.java:27)
        at redis.clients.jedis.JedisCluster.hset(JedisCluster.java:477)
        at build.buildfarm.common.redis.RedisHashMap.insert(RedisHashMap.java:61)
        at build.buildfarm.instance.shard.RedisShardBackplane.addWorkerByType(RedisShardBackplane.java:632)
        at build.buildfarm.instance.shard.RedisShardBackplane.lambda$addWorker$5(RedisShardBackplane.java:614)
        at build.buildfarm.common.redis.RedisClient.call(RedisClient.java:117)
        at build.buildfarm.instance.shard.RedisShardBackplane.addWorker(RedisShardBackplane.java:610)
        at build.buildfarm.worker.shard.Worker.addWorker(Worker.java:445)
        at build.buildfarm.worker.shard.Worker$1.registerIfExpired(Worker.java:497)
        at build.buildfarm.worker.shard.Worker$1.run(Worker.java:507)
        at java.base/java.lang.Thread.run(Thread.java:833)
        Suppressed: redis.clients.jedis.exceptions.JedisConnectionException: Unexpected end of stream.
...
        Suppressed: redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
                at redis.clients.jedis.util.Pool.getResource(Pool.java:59)
                at redis.clients.jedis.JedisPool.getResource(JedisPool.java:234)
                at redis.clients.jedis.JedisSlotBasedConnectionHandler.getConnectionFromSlot(JedisSlotBasedConnectionHandler.java:89)
                at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:102)
                ... 11 more
        Caused by: redis.clients.jedis.exceptions.JedisConnectionException: Failed connecting to host bf-redis-standalone.cache-buildfarm:6379
                at redis.clients.jedis.Connection.connect(Connection.java:225)
                at redis.clients.jedis.BinaryClient.connect(BinaryClient.java:100)
                at redis.clients.jedis.BinaryJedis.connect(BinaryJedis.java:1896)
                at redis.clients.jedis.JedisFactory.makeObject(JedisFactory.java:117)
                at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:571)
                at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:298)
                at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:223)
                at redis.clients.jedis.util.Pool.getResource(Pool.java:50)
                ... 14 more
        Caused by: java.net.ConnectException: Connection refused
                at java.base/sun.nio.ch.Net.pollConnect(Native Method)
                at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
                at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:549)
                at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
                at java.base/java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
                at java.base/java.net.Socket.connect(Socket.java:639)
                at redis.clients.jedis.Connection.connect(Connection.java:201)
                ... 21 more

Bazel build completes, but gives us these warnings:

INFO: Elapsed time: 129.918s, Critical Path: 62.01s
INFO: 928 processes: 57 internal, 871 processwrapper-sandbox.
INFO: Build completed successfully, 928 total actions
WARNING: Uploading BEP referenced local file /tmp/bazel_20240318152138_ogvru hash: "48c7622b5379be1139c26015f69eb9c6bf41b8e0fe2fd5b04ea8197ac74dee0c"
size_bytes: 2530893
: UNAVAILABLE: no available workers
INFO: Build Event Protocol files produced successfully.

Health checks are configured as follows:

> kubectl describe pod --context brazil -n cache-buildfarm buildfarm-worker-0 | grep -iE '(Liveness|Readiness|Startup):' 
    Liveness:   http-get http://:metrics/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:metrics/ delay=0s timeout=2s period=10s #success=1 #failure=3
    Startup:    http-get http://:metrics/ delay=0s timeout=2s period=10s #success=1 #failure=10

aisbaa · 2024-03-18T15:48:45Z

What do you want the worker to do differently when the backplane cannot be reached?

I would love that worker would report that it is not healthy if it can not talk to backplane, in that case kubernetes would restart it.

Workers with a disconnected redis are able to serve files as normal CAS members, and without an execution pipeline enabled, would essentially serve as standalone grpc CAS endpoints.

Is there a flag to tell worker that it must connect to redis and it must exit if it fails to do so?

werkt · 2024-03-22T18:23:27Z

I would love that worker would report that it is not healthy if it can not talk to backplane, in that case kubernetes would restart it.

Without the below - where we shut down in the case where our connectivity to redis can't be established (any node), then we need to have a signal light that you'd get to see - I'd expect this to be on the prometheus endpoint, but will require more than just querying the endpoint and checking for 200: the specific metric we select to indicate connectivity failures would need to be observed.

Is there a flag to tell worker that it must connect to redis and it must exit if it fails to do so?

I'll interpret this as 'can we make one', because there isn't one currently. Answer is yes.

aisbaa · 2024-04-02T07:52:46Z

A prometheus metric would work, I could add a shell script that greps for metric and checks the value. As an alternative, would you consider GRCP health check to provide this signal. I think it would work nicely with potential flag/setting to report not healthy if connection to redis is lost.

amishra-u · 2024-06-18T04:12:39Z

@aisbaa I think you are facing this problem #1552

Workers attempt to register to Redis at 10-second intervals. If Redis goes down, the connection request fails, and since the exception is unhandled, the thread responsible for registering the worker terminates.

Applying the above pull request should resolve your issue.

I would love that worker would report that it is not healthy if it can not talk to backplane, in that case kubernetes would restart it.

In this situation, the worker can still communicate with the backplane; it's just that the thread responsible for registering the worker to the backplane has terminated.

aisbaa changed the title ~~Health check does not seem to work on shared worker when redis is down~~ Health check does not seem to work on shard-worker when redis is down Feb 20, 2024

werkt added this to the v2.10 milestone Mar 5, 2024

werkt modified the milestones: v2.10, v2.11 Apr 1, 2024

werkt modified the milestones: v2.11, v2.12 Sep 24, 2024

werkt modified the milestones: v2.12, v2.13 Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Health check does not seem to work on shard-worker when redis is down #1627

Health check does not seem to work on shard-worker when redis is down #1627

aisbaa commented Feb 9, 2024

werkt commented Feb 15, 2024

aisbaa commented Feb 16, 2024 •

edited

Loading

werkt commented Feb 19, 2024

aisbaa commented Feb 20, 2024

werkt commented Feb 20, 2024

aisbaa commented Mar 4, 2024

aisbaa commented Mar 18, 2024

aisbaa commented Mar 18, 2024

werkt commented Mar 22, 2024

aisbaa commented Apr 2, 2024

amishra-u commented Jun 18, 2024 •

edited

Loading

Health check does not seem to work on shard-worker when redis is down #1627

Health check does not seem to work on shard-worker when redis is down #1627

Comments

aisbaa commented Feb 9, 2024

werkt commented Feb 15, 2024

aisbaa commented Feb 16, 2024 • edited Loading

werkt commented Feb 19, 2024

aisbaa commented Feb 20, 2024

werkt commented Feb 20, 2024

aisbaa commented Mar 4, 2024

aisbaa commented Mar 18, 2024

aisbaa commented Mar 18, 2024

werkt commented Mar 22, 2024

aisbaa commented Apr 2, 2024

amishra-u commented Jun 18, 2024 • edited Loading

aisbaa commented Feb 16, 2024 •

edited

Loading

amishra-u commented Jun 18, 2024 •

edited

Loading