Why GeoParquet Is A Poor Cloud Native Format #82
-
Anyone care to explain their position beyond a 'thumbs down'? If your cloud native data is stored on S3, GCS, or ABS, you have to download significant amounts of data to realize the performance benefit of Parquet's columnar format. That simple fact contradicts the entire purpose of cloud native data, which strives to minimize the amount of data moved across a network without intermediate processing. Parquet's performance advantage can only be realized when the data sits near a processor.
-
Sure, you say:

> […]

Which is why GeoParquet allows range requests 🤷

Well, I could argue that GDAL is a heavy dependency as well (I love GDAL, but it's really heavy, especially if you want to focus on just one or two file formats).

Again, I think there are multiple ways to understand the definition of Cloud Native, but in the case of the Cloud Optimized GeoTIFF (one of the first formats to introduce this notion), the most important feature was that you could access any part of the data via range requests.
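To make the range-request point concrete, here is a minimal sketch (against a hypothetical URL) of the first thing a Parquet reader does over HTTP: fetch only the footer from the end of the file, then request just the byte ranges of the row groups and columns it needs:

```python
import requests

# Hypothetical remote GeoParquet file; any store that honors Range
# headers (S3, GCS, ABS, plain HTTP servers) behaves the same way.
URL = "https://example.com/data/buildings.parquet"

# A Parquet file ends with the footer metadata, a 4-byte little-endian
# footer length, and the 4-byte magic "PAR1". Grab only the last 8 bytes.
tail = requests.get(URL, headers={"Range": "bytes=-8"})
assert tail.status_code == 206          # 206 Partial Content: ranges work
assert tail.content[4:] == b"PAR1"      # sanity-check the magic
footer_len = int.from_bytes(tail.content[:4], "little")

# A real reader now range-requests just the footer, parses the row-group
# and column-chunk offsets, and fetches only the bytes it actually needs;
# the bulk of the file is never transferred.
print(f"footer metadata is {footer_len} bytes")
```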
I totally disagree with this statement, which is why I 👎. Let's agree to disagree!
-
@PostholerCom Advancements in cloud native data formats have come out of constructive criticism, a healthy amount of skepticism, and the desire to challenge the status quo. I'm really glad you brought up the SOZip format because it is a great example of this. As far as I am aware, ZIP was first proposed as a cloud native geospatial format by Tapalcatl in 2018, which used compressed archives to reduce the number of static files required to store large vector tilesets while providing random access through HTTP range requests. While Tapalcatl and its successor Tapalcatl 2 were focused exclusively on storing tilesets, the idea of using compressed formats quickly grew with the release of cotar in 2021 and SOZip in 2022, which are both more generic implementations that apply to other types of data beyond strictly tilesets.

All of us here are smart enough to know that no data format is a silver bullet. The proper data format depends on the use case, and everyone's use case is different ("hike your own hike" comes to mind here). And that's totally fine! There is nothing wrong with that! In fact, it's even encouraged! If SOZip works for your use case you should continue to use it, just like many other people in the industry will continue to use GeoParquet because it works for their use cases.

As someone who recently got into thru-hiking (Colorado Trail this past summer, loved every second), I think your use case is a very interesting one for exploring cloud native geospatial formats. You are in a perfect position to provide constructive criticism to advance the GeoParquet spec (along with many other specs!). You seem to have a solid understanding of geospatial data formats, and you obviously care enough about these topics to contribute to the conversation.
What I don't understand is why, when given the opportunity, you choose instead to belittle and demean the very community that you are a part of. I am fortunate enough to have just attended FOSS4G NA, where there was much discussion about GeoParquet, and I can promise you that nobody is trying to manipulate anyone. Manipulation suggests ill intent, which is incredibly disrespectful to the many developers who have dedicated their time and energy, often without anything in return, to contribute to the GeoParquet specification and surrounding software ecosystems. I would highly recommend you instead focus on providing criticism that is more constructive. Let me help show what that could look like:

- […]

The bullet points above respectfully challenge the current state of GeoParquet while providing actionable insights that the community can use to iterate and improve, which is what software development is all about.
The thumbs down are not because of your opinions but because of your tone. If you don't respect your peers, they won't respect you in return. That is a lose-lose for everyone involved, and a situation I personally try to avoid because it benefits no one. If you don't have anything constructive to say, please don't say anything at all. And if you do have something constructive to say, please say it, because those opinions are valued more than I think you realize.
-
So we're both on the same page: the initial post read "Why GeoParquet is a Poor Cloud Native Format", not "cloud services" or "server side processing" or "furthering a spec".
No, I don't work with Arrow. Other than it being a dependency for my GDAL builds, I have no concerns about it.
The columnar nature of Parquet has ZERO performance benefit when your massive Parquet file is sitting in an S3 bucket on the opposite coast. Parquet's performance benefit comes when your data is next to a processor, not across a network. There are ZERO improvements to the spec that will overcome that fact. As a cloud native format, Parquet brings nothing new to the table. Think about it: a massive Parquet file, and the performance it's capable of, is useless when it's sitting in an S3 bucket. That's why GeoParquet is a poor cloud native format. Just use FGB without the baggage of Parquet.
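(For what it's worth, here is a minimal sketch of that cloud native FGB access pattern; the URL and bounding box are hypothetical, and pyogrio is just one of several GDAL-based readers that support it.)

```python
import pyogrio  # GDAL-based reader; no Arrow/Parquet dependency needed

# Hypothetical remote FlatGeobuf file read via GDAL's /vsicurl/ handler.
URL = "/vsicurl/https://example.com/data/restaurants.fgb"

# FlatGeobuf's packed Hilbert R-tree index lets GDAL turn this bbox
# filter into a handful of range requests, so only the features inside
# the window are transferred, never the whole file.
gdf = pyogrio.read_dataframe(URL, bbox=(-105.6, 38.8, -105.3, 39.1))
print(len(gdf), "features downloaded")
```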
That's not what I said. I said FGB and SOZip are supported internally; Parquet is not, since it requires external dependencies when building GDAL. Parquet/Arrow are obviously supported by GDAL.
-
It certainly does. And used in this manner it brings nothing new to the table. For the performance benefits of Parquet's columnar layout to be realized, a huge amount of data needs to be moved into the browser. The network lag, and likely the browser's inability to handle that much data, renders this unusable. Skip the baggage of Parquet and DuckDB in your browser and just use FGB.
Through tinkering, I became aware in July that the GDAL Parquet driver was not reading JSON structures in Overture OSM data. I reported this to gdal-dev and Even Rouault quickly fixed it. I have contributed in some small way, and I'm not operating in a vacuum.
Over the years I've experienced the joy/curse of watching the endless parade of 'The Next Big Thing'. I've seen companies buy into dubious tech ideas, throw software at the problem, and end up with a Rube Goldberg tech stack that nobody understands. Maybe you've seen this too? In my opinion, the concept of cloud native data IS one of those things that has legs and will be around tomorrow. It simplifies the problem rather than adding complexity. Parquet is an excellent format when sitting next to a processor. As a cloud native format it brings nothing new to the table. Clearly, you're emotionally attached to Parquet; please stop trying to sell it as something it's not. Parquet is a poor cloud native format.
-
The bulk of your opposition to Parquet as a format appears to stem from focusing on a narrow use case and a limited interpretation of cloud native computing. I'm relatively new to this tabular space since my work is more focused on n-dimensional array storage formats, but at a high level the principles are the same in both of these worlds. In both cases we have massive volumes of data, and we would like to optimize storage footprint and network transit impacts through some combination of compression, intelligent indexing, and chunked storage partitioning.

The use case you describe, reducing network transit impacts for "thousands of features" via requests for spatial subsets in high-latency browser environments, is very valid but, as I mentioned above, very narrow. My daily work is focused on improving the scalability of data access for massively parallel compute clusters which routinely process the "millions of features" you describe. This computational power was previously only accessible to researchers with HPC access and, even more restrictively, with the necessary data stored in their HPC environments. Cloud native computing has democratized access to high performance computing resources, and, more importantly, massively scalable object storage has made data previously guarded in institutional HPCs accessible to any researcher. This interpretation of cloud native is well described in the Cloud-Native Geospatial Foundation's descriptions.

With respect to the efficacy of Parquet as a format, the parallel algorithms we use are often optimized so that each worker node accesses only a subset of the entire dataset in order to perform computations, allowing us to work on massive datasets which cannot be managed in memory. Parquet's partitioning, columnar selection, and predicate pushdown filtering excel in these applications and, when coupled with object storage, allow us to scale our parallel computations quickly and efficiently. I'd suggest reading through this description of reading data into a Dask dataframe to better understand how powerful this is. With respect to implementations, if you find Apache Arrow cumbersome as a dependency, there are excellent alternatives such as fastparquet, which is widely used in the Dask ecosystem.

So really, I think when you said

> […]
you might have meant

> […]
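As a concrete illustration of the partitioning, columnar selection, and predicate pushdown described above, here is a minimal Dask sketch; the bucket path, column names, and filter values are all hypothetical:

```python
import dask.dataframe as dd

# Hypothetical dataset; only the listed columns and the row groups whose
# statistics can match the filter are ever fetched from object storage.
df = dd.read_parquet(
    "s3://my-bucket/observations/*.parquet",
    columns=["station_id", "timestamp", "temperature"],  # columnar selection
    filters=[("temperature", ">", 40.0)],                # predicate pushdown
    storage_options={"anon": True},
)

# Workers read their own subsets in parallel; the full dataset never has
# to fit in any single machine's memory.
result = df.groupby("station_id").temperature.max().compute()
print(result.head())
```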
Finally, our community is able to build and use these amazing tools because of a beautiful tradition of open, collaborative discourse. We often have strong opinions, but we strive to express those opinions through concrete demonstrations that allow us to objectively compare and contrast technology choices. I personally try to follow the mantra of "contribute before criticizing". The FlatGeobuf section of the guide could definitely be improved with expanded information on the benefits of Seek-Optimized ZIP (SOZip). PRs are always appreciated.
-
Thank you for the detailed response.
I would argue that yours is the 'very narrow use case'. You could probably count on one hand the number of folks doing what you're doing. The 'widest possible use case' is probably much more mundane: web developers putting restaurants or social venues on Leaflet maps, or a national real estate agency doing the same. Or maybe every building footprint in the US and the FEMA flood zone(s) it occupies. It's not sexy or fancy. This is what the vast majority of cloud native data users will probably look like. No need for the baggage Parquet brings; FGB does nicely.
-
(Geo)Parquet is an extremely useful column-oriented data format. When working with massive local data sets containing many millions of features, its performance is second to none. Plenty of test cases demonstrate this.
The whole idea of cloud native data is to transfer the minimal amount of data from the cloud to the client. Put another way, only a subset of the original source is transferred: either a truncated portion or a sampling of the original data set. You would never transfer millions of features (or pixels) over the internet without serious performance consequences.
Ideally, on the client you would be dealing with a few thousand features or fewer. In modern JavaScript, doing spatial analysis or other kinds of processing on that many features is often trivial.
Parquet's columnar format provides no performance benefit in a cloud native setting, and your development environment will require the non-trivial addition of packages like Apache Arrow. The reality is that GeoParquet offers no advantage over an SOZip'ed cloud native .fgb file (FlatGeobuf). SOZip and FlatGeobuf are both supported internally by GDAL; Parquet is not.
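As a minimal sketch of that SOZip'ed FGB access pattern (the URL and coordinates are hypothetical):

```python
from osgeo import ogr

# GDAL chains /vsizip/ over /vsicurl/ to read one member of a remote ZIP
# with range requests; SOZip's internal index makes seeks inside the
# compressed member cheap.
PATH = "/vsizip//vsicurl/https://example.com/data/parcels.fgb.zip/parcels.fgb"

ds = ogr.Open(PATH)
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(-122.5, 37.7, -122.4, 37.8)  # bbox of interest
print(layer.GetFeatureCount(), "features in the window")
```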
It's disheartening to see GeoParquet being pushed as some extraordinary cloud native format. No cloud native data source will benefit from Parquet. Let's not manipulate folks into flushing resources down another 'Big Data' rabbit hole they don't need.
Note: the term 'cloud native' here refers to processing cloud data with an app/web client, without intermediate backend servers/services.