
Postgres Connection Timeout Crash #1606

Open · omad opened this issue Jun 27, 2024 · 9 comments

@omad
Member

omad commented Jun 27, 2024

I'm hunting a crash for some users running ODC code on the NCI, somewhere in database connection handling. A unique quirk of the NCI environment is that idle TCP connections are killed after 5 minutes, whereas the usual default is 2 hours. This is probably relevant, but I'm not sure.

An example stack trace:
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 60, in decorated
    raise ex
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/dass_logs.py", line 52, in decorated
    result = fn(*args, **kwargs)
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 851, in scene_select
    l1_count, uuids2archive = l1_scenes_to_process(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 588, in l1_scenes_to_process
    files2process, uuids2archive, duplicates = l1_filter(
  File "/g/data/v10/private/modules/ard-scene-select-py3-dea/20240405/lib/python3.10/site-packages/scene_select/ard_scene_select.py", line 446, in l1_filter
    for l1_dataset in dc.index.datasets.search(product=l1_product, time=month_as_range(year, month)):
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 548, in search
    for product, datasets in self._do_search_by_product(query,
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_datasets.py", line 691, in _do_search_by_product
    with self._db_connection() as connection:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/_transaction.py", line 60, in _db_connection
    with self._index._active_connection(transaction=transaction) as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/index/postgres/index.py", line 157, in _active_connection
    with self._db._connect() as conn:
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/datacube/drivers/postgres/_connections.py", line 231, in _connect
    connection.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 1251, in close
    conn.close()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1166, in close
    self._checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 1008, in _checkin
    _finalize_fairy(
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 800, in _finalize_fairy
    connection_record.checkin()
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 542, in checkin
    finalizer(connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 677, in _reset_characteristics
    characteristic.reset_characteristic(self, dbapi_connection)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/characteristics.py", line 50, in reset_characteristic
    dialect.reset_isolation_level(dbapi_conn)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 748, in reset_isolation_level
    self.set_isolation_level(dbapi_conn, self.default_isolation_level)
  File "/g/data/v10/public/modules/dea/20231204/lib/python3.10/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 851, in set_isolation_level
    connection.set_isolation_level(level)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

I can reliably cause exceptions using plain psycopg2 by attempting to use a connection that's been idle for more than 5 minutes.

I haven't been able to reproduce the exception using the ODC Index API. I've been inserting time.sleep(330) between calls, which is the same trick that triggers the error with plain psycopg2.
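
For what it's worth, the plain psycopg2 reproduction looks roughly like this sketch (connection details are placeholders):

import time
import psycopg2

# Placeholder DSN for the NCI database.
conn = psycopg2.connect("dbname=datacube host=dbhost user=me")

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    cur.fetchone()

# Leave the connection idle for longer than the ~300 second TCP idle limit.
time.sleep(330)

with conn.cursor() as cur:
    # Raises psycopg2.OperationalError: server closed the connection unexpectedly
    cur.execute("SELECT 1")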

I suspect that ODC is more resilient due to setting pool_recycle=60 on the SQLAlchemy connection pool:

# If a connection is idle for this many seconds, SQLAlchemy will renew it rather
# than assuming it's still open. Allows servers to close idle connections without clients
# getting errors.
pool_recycle=pool_timeout,
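
For context, a minimal sketch of where pool_recycle ends up when the SQLAlchemy engine is created (the connection URL is a placeholder; the relevant part is the keyword argument):

from sqlalchemy import create_engine

# Placeholder connection URL; only pool_recycle matters here.
engine = create_engine(
    "postgresql+psycopg2://user:password@dbhost/datacube",
    pool_recycle=60,  # connections idle for more than 60 seconds are replaced on next checkout
)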

One line in the traceback that I'm suspicious of is _connections.py line 231: exiting the context manager should already close the connection, so manually calling .close() seems unnecessary.

with self._engine.connect() as connection:
    try:
        connection.execution_options(isolation_level="AUTOCOMMIT")
        yield _api.PostgresDbAPI(connection)
    finally:
        connection.close()
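
If the explicit close really is redundant, the block could presumably rely on the with statement alone. A sketch only, not a tested patch:

with self._engine.connect() as connection:
    connection.execution_options(isolation_level="AUTOCOMMIT")
    yield _api.PostgresDbAPI(connection)
    # Connection.__exit__ closes the connection (returning it to the pool)
    # on normal exit and when an exception propagates out of the yield.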

Summary

  • This script doesn't always crash.
  • This isn't the only time we've seen crashes that look like this on the NCI.
  • I've run out of steam investigating this, since I can't get a small reproducible example. For now at least.
  • I really want something reproducible before attempting a fix.
  • The connection handling code in ODC is far more complicated than I would have expected, and I'm not convinced it needs to be that way. However, maybe it does; I'm just not sure. :)
@Ariana-B
Contributor

Here's a test script that causes a connection timeout error for some users:

#!/usr/bin/env python3
import datacube
import logging


def main(product="ga_ls8c_ard_3"):
    # This is ok
    # logging.basicConfig(level=logging.INFO)

    # Setting up a file handler causes the new dea module to error out
    logging.basicConfig(
        filename="file.log",
        level=logging.DEBUG,
        format="%(asctime)s:%(levelname)s:%(name)s:%(message)s",
    )
    dc = datacube.Datacube(app="gen-list")
    for l1_dataset in dc.index.datasets.search(product=product):
        logging.info("This is a test")


if __name__ == "__main__":
    main(product="ga_s2am_ard_3")

@jeremyh
Contributor

jeremyh commented Jul 11, 2024

The suspicious part, I think, is:

    for l1_dataset in dc.index.datasets.search(product=product):
  1. It's a huge query — it's returning all datasets for a product. This will take a while as it's a lot of data over the wire.
  2. It's returning an iterator, and (probably) streaming the results back with an open db cursor.

While the cursor is active, the pool_recycle time wouldn't help us, since it can't recycle a connection halfway through getting its results back.

Perhaps each chunk of results that the iterator returns takes long enough to "use" that it goes beyond the timeout?

In this particular example we're only logging, so I wouldn't think we're being too slow consuming our chunk of the iterator. But we do know the Lustre filesystem sometimes has very large pauses, and a handful of those pauses within one chunk of the iterator could push it over the connection timeout (i.e. nothing has been read from the cursor/connection within that time).

(Only a guess at this point.)
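
One way to test this guess: time the gap between loop iterations and flag anything approaching the idle limit. A rough sketch, with an arbitrary 60 second threshold and product name:

import logging
import time

import datacube

logging.basicConfig(filename="file.log", level=logging.INFO)

dc = datacube.Datacube(app="iteration-timing")
prev = time.monotonic()
for i, dataset in enumerate(dc.index.datasets.search(product="ga_s2am_ard_3")):
    gap = time.monotonic() - prev
    if gap > 60:  # anything close to the 300 second idle limit is suspicious
        logging.warning("Gap of %.1f seconds before dataset %d", gap, i)
    logging.info("This is a test")
    prev = time.monotonic()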

@SpacemanPaul
Contributor

SpacemanPaul commented Jul 11, 2024

I agree that is almost certainly the root cause.

What happens if we greedily suck up the results before logging? e.g.

for l1_dataset in list(dc.index.datasets.search(product=product)):
    logging.info("This is a test")

(Obviously it hangs with no feedback to the user for a long time while it sucks the whole product's worth of datasets into memory - but does it time out?)

@SpacemanPaul
Contributor

Also try:

batch_size = 1000
i = 0
for prod, doc, uris in dc.index.datasets.get_all_docs_for_product(
        dc.index.products.get(product),
        batch_size=batch_size,
):
    logging.info(f"Dataset {doc['id']} is test {i % batch_size} of batch {i // batch_size}")
    i += 1

@Ariana-B
Contributor

@SpacemanPaul your first example also fails on my end, although with no stack trace - it simply outputs Killed.
The second example on the other hand seems to work.

@jeremyh
Contributor

jeremyh commented Jul 11, 2024

> @SpacemanPaul your first example also fails on my end, although with no stack trace - it simply outputs Killed. The second example on the other hand seems to work.

Killed means the script hit a limit on the login nodes (exceeding the memory, cpu or time limit) — they want people to qsub jobs to the batch system to run anything that needs more resources.

@SpacemanPaul
Contributor

That makes sense. I think the batch-get API above is probably the best approach to this class of problems.

@omad
Member Author

omad commented Jul 12, 2024

I've been digging a bit deeper, and now suspect it's caused by the (mis?)-configuration of PGBouncer on the NCI.

I ran the test script, and traced the network calls it made before crashing.

strace -e trace=%net --relative-timestamps -o ~/utils/stracenetlog.txt python ~/odc1606test.py

     0.000093 recvfrom(7, "\": 6390220.0}, \"lr\": {\"x\": 60978"..., 62212, 0, NULL, NULL) = 4096
     0.000090 recvfrom(7, "a06c8818b3\", \"crs\": \"epsg:32751\""..., 65438, 0, NULL, NULL) = 12288
     0.000099 recvfrom(7, "a_s2am_nbar_3-2-1_54HUD_2024-01-"..., 60873, 0, NULL, NULL) = 12288
     0.000159 recvfrom(7, "\", \"path\": \"ga_s2am_nbart_3-2-1_"..., 56854, 0, NULL, NULL) = 8192
     0.001896 recvfrom(7, "01/17/20240117T015348/ga_s2am_ar"..., 57805, 0, NULL, NULL) = 106   <-- This is where we finish receiving the results from the network connection.

*** LONG PAUSE HERE ***

   100.631082 sendto(7, "Q\0\0\0001SET default_transaction_iso"..., 50, MSG_NOSIGNAL, NULL, 0) = 50   <-- Then we try to set transaction isolation on the connection... for some reason.
     0.000276 recvfrom(7, "E\0\0\0(SFATAL\0C08P01\0Mclient_idle_"..., 65536, 0, NULL, NULL) = 41  <-- And receive an error reply
     0.000358 recvfrom(7, "", 65536, 0, NULL, NULL) = 0
     0.000102 recvfrom(7, "", 65536, 0, NULL, NULL) = 0
     2.241205 +++ exited with 1 +++

Searching for FATAL client_idle_ points towards PGBouncer's client_idle_timeout configuration setting.

On our NCI PGBouncer server, the value has been changed from the default of 0 (disabled) to 90 seconds.

So it looks like PGBouncer has closed the connection, but the ODC client hasn't realised and keeps trying to change the transaction isolation settings.

I'm not sure exactly where we want to change things. I suspect a couple of things aren't quite right, including the PGBouncer settings and the fact that ODC doesn't notice the connection has been closed. I also don't recall why that PGBouncer setting is even set; it seems unnecessary to me, although at some point it may have helped work around the 300 second TCP idle closures on the NCI.
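
For anyone re-checking this later, the running value can be read back from PGBouncer's admin console; a sketch assuming admin access, with placeholder host, port and user:

import psycopg2

# PGBouncer exposes its admin console as a virtual "pgbouncer" database.
conn = psycopg2.connect("dbname=pgbouncer host=pgbouncer-host port=6432 user=admin")
conn.autocommit = True  # the admin console doesn't support transactions

with conn.cursor() as cur:
    cur.execute("SHOW CONFIG;")
    for key, value, *_ in cur.fetchall():
        if key == "client_idle_timeout":
            print(key, value)  # expect 90 here; the PGBouncer default is 0 (disabled)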

@SpacemanPaul
Contributor

Interesting. So we have:

  1. a viable workaround through batched reads; and
  2. a potential infrastructure configuration issue that may fix the issue directly.
