
Configurable Timeouts when Using as a Library #44

Open · kevinkreiser opened this issue Sep 6, 2024 · 0 comments

kevinkreiser commented Sep 6, 2024

Hi! Thanks for making this excellent tooling and for making sure the data is accessible to everyone who wants to use it. I'm currently installing this repo's pip package and calling `core.geodataframe` to pull a given layer from the data in my own scripting.
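
For context, this is roughly the shape of my script; the layer name and bbox are just placeholder values, not my actual query, and I believe the bbox order is (xmin, ymin, xmax, ymax):

    from overturemaps import core

    # pull one Overture layer clipped to a bounding box (lon/lat order)
    bbox = (-122.35, 47.60, -122.30, 47.63)  # placeholder bbox
    gdf = core.geodataframe("segment", bbox=bbox)
    print(len(gdf))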

I have a very poor internet connection (over the cellular network). I'm not exactly sure how the GeoParquet format decides which bits of the files it needs to pull, but it must do a decent amount of back and forth (HTTP range requests, probably?) when fetching data for a given bbox. What I'm seeing locally is all kinds of timeout errors. They vary slightly, but the bulk of them look like the following:

    IOError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached. Detail: Python exception: Traceback (most recent call last):
      File "/home/kk/scratch/venv/lib/python3.10/site-packages/overturemaps/core.py", line 45, in <genexpr>
        non_empty_batches = (b for b in batches if b.num_rows > 0)
      File "pyarrow/_dataset.pyx", line 3769, in _iterator
      File "pyarrow/_dataset.pyx", line 3387, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
      File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
      File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    OSError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached

Do you have any advice for this scenario? Can I configure the timeout to give my request a bit longer to do the back and forth needed to get the data from a given parquet file? I haven't yet checked whether I can control curl, AWS, or pyarrow externally, but I'll research that shortly. Thanks in advance!

I should also mention that I found a similar issue that discusses some different potential problems with AWS configuration: apache/arrow#36007

EDIT:

I've modified the bit in `core.py:record_batch_reader` to set timeouts:

    # ds = pyarrow.dataset, fs = pyarrow.fs
    s3_options = {
        'anonymous': True,
        'region': 'us-west-2',    # the Overture bucket lives in us-west-2
        'connect_timeout': 60,    # seconds allowed to establish the connection
        'request_timeout': 120,   # seconds allowed for each request to complete
    }
    dataset = ds.dataset(path, filesystem=fs.S3FileSystem(**s3_options))

Sadly, the result is a different error:

    AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 56, Failure when receiving data from the peer

Perhaps this is AWS itself hanging up on me because I am too slow?
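
One more thing I may try, assuming the installed pyarrow is new enough to expose `AwsStandardS3RetryStrategy`, is an explicit retry strategy on the filesystem so that transient drops get retried instead of surfacing as errors. A minimal sketch:

    import pyarrow.fs as fs

    # retry transient S3 failures a few extra times before giving up
    filesystem = fs.S3FileSystem(
        anonymous=True,
        region="us-west-2",
        connect_timeout=60,
        request_timeout=120,
        retry_strategy=fs.AwsStandardS3RetryStrategy(max_attempts=10),
    )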

EDIT:

It seems that if I reduce the amount of parallelism to 1, that is to say no parallelism (1 Overture type with 1 bbox at a time), then it consistently returns results (see the sketch below). Maybe this is because the AWS SDK underneath pyarrow is already parallelizing requests and I'm pushing the limits of my connection?
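
For reference, a minimal sketch of what "no parallelism" looks like in my script; the type names and bbox are placeholders:

    from overturemaps import core

    # one request in flight at a time: a plain loop instead of a thread/process pool
    bboxes = [(-122.35, 47.60, -122.30, 47.63)]  # placeholder bboxes
    results = {}
    for overture_type in ["segment", "connector"]:
        for bbox in bboxes:
            results[(overture_type, bbox)] = core.geodataframe(overture_type, bbox=bbox)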
