Hi! Thanks for making this excellent tooling and for making sure the data is accessible to everyone who wants to use it. I'm currently installing this repo's pip package and calling core.geodataframe to pull a given layer from the data in my own scripting.
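For reference, this is roughly the shape of my call; the bbox coordinates here are just an illustrative placeholder, not my real inputs:

```python
from overturemaps import core

# hypothetical bbox (xmin, ymin, xmax, ymax) purely for illustration
bbox = (-122.35, 47.60, -122.32, 47.62)

# pull one layer for one bbox into a GeoDataFrame
gdf = core.geodataframe("segment", bbox=bbox)
print(len(gdf))
```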
I have a very crappy internet connection (over the cellular network). I'm not exactly sure how the GeoParquet format works in terms of deciding which bits of the files it needs to pull, but it must do a decent amount of back and forth (HTTP range requests, probably?) when fetching data for a given bbox. What I'm seeing locally is all kinds of timeout errors. They vary slightly, but the bulk of them look similar to the following:
```
IOError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached. Detail: Python exception: Traceback (most recent call last):
  File "/home/kk/scratch/venv/lib/python3.10/site-packages/overturemaps/core.py", line 45, in <genexpr>
    non_empty_batches = (b for b in batches if b.num_rows > 0)
  File "pyarrow/_dataset.pyx", line 3769, in _iterator
  File "pyarrow/_dataset.pyx", line 3387, in pyarrow._dataset.TaggedRecordBatchIterator.__next__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Could not open Parquet input source 'overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/part-00004-ba565738-b231-4d1d-961a-46858c2454e8-c000.zstd.parquet': AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 28, Timeout was reached
```
Do you have any advice for this scenario? Can I configure the timeout to give my request a bit longer to do the back and forth needed to get the data from a given Parquet file? I've not yet checked whether I can control curl, AWS, or pyarrow externally, but I'll research that shortly. Thanks in advance!
I should also mention that I found a similar issue which discusses some different potential problems with AWS configuration: apache/arrow#36007
EDIT:
I've modified the bit in core.py:record_batch_reader to set timeouts.
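This is a minimal sketch of the kind of change I made, assuming pyarrow's S3FileSystem connect_timeout and request_timeout kwargs (both in seconds); the 60/120 values are just what I tried, not recommendations:

```python
from pyarrow import fs
import pyarrow.dataset as ds

# roughly the filesystem construction inside core.py:record_batch_reader,
# with two extra timeout kwargs added
filesystem = fs.S3FileSystem(
    anonymous=True,
    region="us-west-2",
    connect_timeout=60,    # seconds to wait to establish the connection
    request_timeout=120,   # seconds to wait on each GetObject/range request
)
dataset = ds.dataset(
    "overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/",
    filesystem=filesystem,
)
```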
And sadly the result is a different error:

```
AWS Error NETWORK_CONNECTION during GetObject operation: curlCode: 56, Failure when receiving data from the peer
```
Perhaps this is AWS itself hanging up on me because I am too slow?
EDIT:
It seems that if I reduce the amount of parallelism to 1, that is to say no parallelism at all (1 Overture type with 1 bbox at a time), then it consistently returns results (a sketch of that serialized loop is below). Maybe this is because the AWS (boto) APIs are doing parallelism underneath and I'm pushing the limits on it?
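For completeness, this is roughly the shape of the serialized loop that works for me; the type names and bbox list are hypothetical placeholders standing in for my real inputs:

```python
from overturemaps import core

# hypothetical inputs; in my real script these come from elsewhere
overture_types = ["segment", "building"]
bboxes = [(-122.35, 47.60, -122.32, 47.62)]

# one type, one bbox at a time -- no threads, no concurrent requests
results = {}
for overture_type in overture_types:
    for bbox in bboxes:
        gdf = core.geodataframe(overture_type, bbox=bbox)
        results[(overture_type, bbox)] = gdf
```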