Why GeoParquet Is A Poor Cloud Native Format #82
-
Anyone care to explain their position beyond a 'thumbs down'? If your cloud native data is stored on S3, GCS, or ABS, you have to download significant amounts of data to realize the performance benefit of Parquet's columnar format. That simple fact contradicts the entire purpose of cloud native data, which strives to minimize the amount of data moved across a network without intermediate processing. Parquet's performance advantage can only be realized when the data sits near a processor.
-
Sure, you say:

> […]

Which is why GeoParquet allows range requests 🤷

Well, I could argue that GDAL is a heavy dependency as well (I love GDAL, but it's really heavy, especially if you want to focus on just one or two file formats).

Again, I think there are multiple ways to understand the definition of Cloud Native, but in the case of the Cloud Optimized GeoTIFF (one of the first formats to introduce this notion), the most important feature was that you could access any part of the data via range requests.
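To make the range-request point concrete, here is a minimal sketch (against a hypothetical URL) of the first thing a Parquet reader does over HTTP: fetch only the footer from the end of the file, then request just the byte ranges of the row groups and columns it needs:

```python
import requests

# Hypothetical remote GeoParquet file; any store that honors Range
# headers (S3, GCS, ABS, plain HTTP servers) behaves the same way.
URL = "https://example.com/data/buildings.parquet"

# A Parquet file ends with the footer metadata, a 4-byte little-endian
# footer length, and the 4-byte magic "PAR1". Grab only the last 8 bytes.
tail = requests.get(URL, headers={"Range": "bytes=-8"})
assert tail.status_code == 206          # 206 Partial Content: ranges work
assert tail.content[4:] == b"PAR1"      # sanity-check the magic
footer_len = int.from_bytes(tail.content[:4], "little")

# A real reader now range-requests just the footer, parses the row-group
# and column-chunk offsets, and fetches only the bytes it actually needs;
# the bulk of the file is never transferred.
print(f"footer metadata is {footer_len} bytes")
```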
I totally disagree with this statement, which is why I 👎. Let's agree to disagree!
-
@PostholerCom Advancements in cloud native data formats have come out of constructive criticism, a healthy amount of skepticism, and the desire to challenge the status quo. I'm really glad you brought up the SOZip format because it is a great example of this. As far as I am aware, ZIP was first proposed as a cloud native geospatial format by Tapalcatl in 2018, which used compressed archives to reduce the number of static files required to store large vector tilesets while providing random access through HTTP range requests. While Tapalcatl and its successor Tapalcatl 2 were focused exclusively on storing tilesets, the idea of using compressed formats quickly grew with the release of cotar in 2021 and SOZip in 2022, which are both more generic implementations that apply to other types of data beyond strictly tilesets.

All of us here are smart enough to know that no data format is a silver bullet. The proper data format depends on the use case, and everyone's use case is different ("hike your own hike" comes to mind here). And that's totally fine! There is nothing wrong with that! In fact, it's even encouraged! If SOZip works for your use case you should continue to use it, just like many other people in the industry will continue to use GeoParquet because it works for their use cases.

As someone who recently got into thru-hiking (Colorado Trail this past summer, loved every second), I think your use case is a very interesting one for exploring cloud native geospatial formats. You are in a perfect position to provide constructive criticism to advance the GeoParquet spec (along with many other specs!). You seem to have a solid understanding of geospatial data formats, and you obviously care enough about these topics to contribute to the conversation.
What I don't understand is why, when given the opportunity, you choose instead to belittle and demean the very community that you are a part of. I am fortunate enough to have just attended FOSS4G NA, where there was much discussion about GeoParquet, and I can promise you that nobody is trying to manipulate anyone. Manipulation suggests ill intent, which is incredibly disrespectful to the many developers who have dedicated their time and energy, often without anything in return, to contribute to the GeoParquet specification and surrounding software ecosystems. I would highly recommend you instead focus on providing criticism that is more constructive. Let me help show what that could look like:

- […]

The bullet points above respectfully challenge the current state of GeoParquet while providing actionable insights that the community can use to iterate and improve, which is what software development is all about.
The thumbs down are not because of your opinions but because of your tone. If you don't respect your peers, they won't respect you in return. That is a lose-lose for everyone involved, and a situation I personally try to avoid because it benefits no one. If you don't have anything constructive to say, please don't say anything at all. And if you do have something constructive to say, please say it, because those opinions are valued more than I think you realize.
-
So we're both on the same page: the initial post read "Why GeoParquet is a Poor Cloud Native Format", not "cloud services" or "server side processing" or "furthering a spec".
No, I don't work with Arrow. Other than it being a dependency for my GDAL builds, I have no concerns about it.
The columnar nature of Parquet has ZERO performance benefit when your massive Parquet file is sitting in an S3 bucket on the opposite coast. Parquet's performance benefit comes when your data is next to a processor, not across a network. There are ZERO improvements to the spec that will overcome that fact. As a cloud native format, Parquet brings nothing new to the table. Think about it: a massive Parquet file, and the performance it's capable of, is useless when it's sitting in an S3 bucket. That's why GeoParquet is a poor cloud native format. Just use FGB without the baggage of Parquet.
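(For what it's worth, here is a minimal sketch of that cloud native FGB access pattern; the URL and bounding box are hypothetical, and pyogrio is just one of several GDAL-based readers that support it.)

```python
import pyogrio  # GDAL-based reader; no Arrow/Parquet dependency needed

# Hypothetical remote FlatGeobuf file read via GDAL's /vsicurl/ handler.
URL = "/vsicurl/https://example.com/data/restaurants.fgb"

# FlatGeobuf's packed Hilbert R-tree index lets GDAL turn this bbox
# filter into a handful of range requests, so only the features inside
# the window are transferred, never the whole file.
gdf = pyogrio.read_dataframe(URL, bbox=(-105.6, 38.8, -105.3, 39.1))
print(len(gdf), "features downloaded")
```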
That's not what I said. I said FGB and SOZip are supported internally; Parquet is not, since it requires external dependencies when building GDAL. Parquet/Arrow are obviously supported by GDAL.
-
It certainly does. And used in this manner it brings nothing new to the table. For the performance benefits of Parquet's columnar layout to be realized, a huge amount of data needs to be moved into the browser. The network lag, and likely the browser's inability to handle that much data, renders this unusable. Skip the baggage of Parquet and DuckDB in your browser and just use FGB.
Through tinkering, I became aware in July that the GDAL Parquet driver was not reading JSON structures in Overture OSM data. I reported this to gdal-dev and Even Rouault quickly fixed it. I have contributed in some small way, and I'm not operating in a vacuum.
Over the years I've experienced the joy/curse of watching the endless parade of 'The Next Big Thing'. I've seen companies buy into dubious tech ideas, throw software at the problem, and end up with a Rube Goldberg tech stack that nobody understands. Maybe you've seen this too? In my opinion, the concept of cloud native data IS one of those things that has legs and will be around tomorrow. It simplifies the problem rather than adding complexity. Parquet is an excellent format when sitting next to a processor. As a cloud native format it brings nothing new to the table. Clearly, you're emotionally attached to Parquet; please stop trying to sell it as something it's not. Parquet is a poor cloud native format.
-
The bulk of your opposition to Parquet as a format appears to stem from focusing on a narrow use case and a limited interpretation of cloud native computing. I'm relatively new to this tabular space since my work is more focused on n-dimensional array storage formats, but at a high level the principles are the same in both of these worlds. In both cases we have massive volumes of data, and we would like to optimize storage footprint and network transit impacts through some combination of compression, intelligent indexing, and chunked storage partitioning.

The use case you describe, reducing network transit impacts for "thousands of features" via requests for spatial subsets in high-latency browser environments, is very valid but, as I mentioned above, very narrow. My daily work is focused on improving the scalability of data access for massively parallel compute clusters which routinely process the "millions of features" you describe. This computational power was previously only accessible to researchers with HPC access and, even more restrictively, with the necessary data stored in their HPC environments. Cloud native computing has democratized access to high performance computing resources, and, more importantly, massively scalable object storage has made data previously guarded in institutional HPCs accessible to any researcher. This interpretation of cloud native is well described in the Cloud-Native Geospatial Foundation's descriptions.

With respect to the efficacy of Parquet as a format, the parallel algorithms we use are often optimized so that each worker node accesses only a subset of the entire dataset in order to perform computations, allowing us to work on massive datasets which cannot be managed in memory. Parquet's partitioning, columnar selection, and predicate pushdown filtering excel in these applications and, when coupled with object storage, allow us to scale our parallel computations quickly and efficiently. I'd suggest reading through this description of reading data into a Dask dataframe to better understand how powerful this is. With respect to implementations, if you find Apache Arrow cumbersome as a dependency, there are excellent alternatives such as fastparquet, which is widely used in the Dask ecosystem.

So really, I think when you said

> […]
you might have meant

> […]
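As a concrete illustration of the partitioning, columnar selection, and predicate pushdown described above, here is a minimal Dask sketch; the bucket path, column names, and filter values are all hypothetical:

```python
import dask.dataframe as dd

# Hypothetical dataset; only the listed columns and the row groups whose
# statistics can match the filter are ever fetched from object storage.
df = dd.read_parquet(
    "s3://my-bucket/observations/*.parquet",
    columns=["station_id", "timestamp", "temperature"],  # columnar selection
    filters=[("temperature", ">", 40.0)],                # predicate pushdown
    storage_options={"anon": True},
)

# Workers read their own subsets in parallel; the full dataset never has
# to fit in any single machine's memory.
result = df.groupby("station_id").temperature.max().compute()
print(result.head())
```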
Finally, our community is able to build and use these amazing tools because of a beautiful tradition of open, collaborative discourse. We often have strong opinions, but we strive to express those opinions through concrete demonstrations that allow us to objectively compare and contrast technology choices. I personally try to follow the mantra of "contribute before criticizing". The FlatGeobuf section of the guide could definitely be improved with expanded information on the benefits of Seek-Optimized ZIP (SOZip). PRs are always appreciated.
-
Thank you for the detailed response.
I would argue that yours is the 'very narrow use case'. You could probably count on one hand the number of folks doing what you're doing. The 'widest possible use case' is probably much more mundane: web developers putting restaurants or social venues on Leaflet maps, or a national real estate agency doing the same. Or maybe every building footprint in the US and the FEMA flood zone(s) it occupies. It's not sexy or fancy. This is what the vast majority of cloud native data users will probably look like. No need for the baggage Parquet brings; FGB does nicely.
-
(Geo)Parquet is an extremely useful column-oriented data format. When working with massive local data sets containing many millions of features, its performance is second to none. Plenty of test cases demonstrate this.
The whole idea of cloud native data is to transfer the minimal amount of data from the cloud to the client. Put another way, only a subset of the original source is transferred: either a truncated portion or a sampling of the original data set. You would never transfer millions of features (or pixels) over the internet without serious performance consequences.
Ideally, on the client you would be dealing with a few thousand features or fewer. In modern JavaScript, doing spatial analysis or other kinds of processing on that many features is often trivial.
Parquet's columnar format provides no performance benefit in a cloud native setting, and your development environment will require the non-trivial addition of packages like Apache Arrow. The reality is that GeoParquet offers no advantage over an SOZip'ed cloud native .fgb file (FlatGeobuf). SOZip and FlatGeobuf are both supported internally by GDAL; Parquet is not.
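As a minimal sketch of that SOZip'ed FGB access pattern (the URL and coordinates are hypothetical):

```python
from osgeo import ogr

# GDAL chains /vsizip/ over /vsicurl/ to read one member of a remote ZIP
# with range requests; SOZip's internal index makes seeks inside the
# compressed member cheap.
PATH = "/vsizip//vsicurl/https://example.com/data/parcels.fgb.zip/parcels.fgb"

ds = ogr.Open(PATH)
layer = ds.GetLayer(0)
layer.SetSpatialFilterRect(-122.5, 37.7, -122.4, 37.8)  # bbox of interest
print(layer.GetFeatureCount(), "features in the window")
```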
It's disheartening to see GeoParquet being pushed as some extraordinary cloud native format. No cloud native data source will benefit from Parquet. Let's not manipulate folks into flushing resources down another 'Big Data' rabbit hole they don't need.
Note: the term 'cloud native' here refers to processing cloud data with an app/web client, without intermediate backend servers/services.