An Exploration of ‘Cloud-Native Vector’

Diverse Requirements

I think part of the reason that an obvious cloud-native vector format has not emerged is that there are at least two main use cases that pull in different directions.

  • Visualization — one of the key things to do with data resting on the cloud is to visualize it, especially in web browsers. Vector tiles are super optimized for this use case and can be stored natively on the cloud, so they provide a great answer, but they aren’t great for non-visualization uses.
  • Analysis — the other thing you want to do is process the data to get insight. Analysis with cloud data warehouses is the most interesting to me, but it can also be done by streaming subsets to desktop (QGIS & ESRI) and browser-based (Unfolded, OpenLayers, etc.) tools.

Visualization Challenges

Enabling visualization is also much more challenging with vector data than with raster. With both, a key step is to create overviews that give a user a sensible view of the data at lower zoom levels. With raster data, a small number of well-known resampling algorithms (nearest neighbor, average, bilinear) work consistently on almost any input and give pretty great results. With vector data, a naive mathematical approach, like just keeping 25% of the points at each overview level, yields results that don’t ‘look’ right and aren’t useful for a person trying to figure out where to zoom in. And what works for one type of data may well not work for another: an algorithm that produces great road overviews might well yield garbage when you plug hydrographic data into it. The tippecanoe project did a great job of providing tools to create great vector overviews, but it needs a good bit of user configuration and its techniques have not yet been widely adopted.
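
To make that concrete, here’s a minimal sketch (using shapely, with made-up coordinates) of why naive decimation falls short next to even a simple shape-aware algorithm like Douglas-Peucker:

```python
from shapely.geometry import LineString

# A road-like polyline (made-up coordinates).
road = LineString([(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 2.1), (5, 2)])
coords = list(road.coords)

# Naive overview: keep every 4th vertex (roughly 25% of the points).
# This happens to drop the sharp bend at (3, 2), so the overview no
# longer 'looks' like the original feature.
naive = LineString(coords[::4] + [coords[-1]])

# Shape-aware overview: Douglas-Peucker keeps the vertices that matter
# most to the geometry's shape, within the given tolerance.
simplified = road.simplify(0.5, preserve_topology=True)

print(len(coords), len(naive.coords), len(simplified.coords))
```

And even Douglas-Peucker is only a starting point: it says nothing about point datasets or about keeping a whole layer coherent, which is exactly where tippecanoe-style heuristics and configuration come in.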

Further Requirements

Past the core visualization vs. analysis issue, there are a few other potential desires that can complicate things further. One particular trade-off is the ability to handle lots of transactions versus optimizing for the speed of large-scale analysis. Row-based formats are better suited to the former, while newer columnar formats like Parquet and ORC have become popular for read-heavy analytic workflows. A cloud-native geospatial format would ideally handle both transactions and analysis, but realistically particular formats will likely serve one workflow better than the other.
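
As a rough illustration of the columnar advantage, here’s a sketch with pyarrow (the file name is hypothetical) of an analytic query that touches only the two columns it needs:

```python
import pyarrow.parquet as pq

# With a columnar layout, an analytic query that needs only two of the
# table's columns can skip reading everything else from object storage.
# (Hypothetical file "roads.parquet".)
table = pq.read_table("roads.parquet", columns=["road_type", "length_km"])

# A row-based format would have to scan every full record to answer this.
print(table.group_by("road_type").aggregate([("length_km", "sum")]))
```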

Potential Formats

In my research, I came across a few different formats and ideas that have potential. If you’ve got another format to add or a correction to what I’ve said, then please add it as a comment on Medium and I’ll incorporate it; my hope is that this document can represent the ‘state of play’ for vector formats for the next bit of time.

FlatGeobuf

The newest format that lines up with many of the requirements for cloud-native vector data is called FlatGeobuf. It aims to handle large volumes of static data, with streaming and random access, performing faster than legacy formats and with no size limits. It does a ton of things right in my book:

  • Based on a more modern serialization format. In this case it builds on FlatBuffers, which was created at Google and is now completely open source.
  • Focused on its use cases. It aims at serving large amounts of static data with streaming and random access. It isn’t trying to be everything to everyone but does end up being much better than alternatives.
  • Good implementation support. They prioritized making sure that GDAL, GeoTools, GeoServer, PostGIS, QGIS, OpenLayers & Leaflet all support it, even though it’s a relatively young project. (A sketch of the kind of streaming, bbox-filtered read this enables follows below.)
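
To give a flavor of what that support buys you, here’s a minimal sketch of a streaming, bbox-filtered read (hypothetical URL, going through geopandas and GDAL’s FlatGeobuf driver):

```python
import geopandas as gpd

# FlatGeobuf's packed Hilbert R-tree index lets a reader fetch only the
# byte ranges for features in a bounding box, even over plain HTTP.
# (Hypothetical URL; assumes GDAL's FlatGeobuf driver is available.)
url = "https://example.com/countries.fgb"
subset = gpd.read_file(url, bbox=(-10.0, 35.0, 30.0, 60.0))
print(len(subset), "features streamed for the bbox")
```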

Cloud-optimized Shapefile

Paul Ramsey has been promoting this idea, and I love that it has the same backward-compatibility ethos as Cloud-Optimized GeoTIFFs. Just like COG and COPC take a flexible format and put stricter requirements on how the data is ordered, a cloud-optimized shapefile would take the traditional shapefile parts and structure them better. There are a couple of twists that would need to be resolved, though. First, a spatial index is key here, but ESRI’s spatial index sidecar files aren’t an open format. Paul’s proposed candidate is the .qix file that MapServer uses. Unfortunately, the ecosystem of tools to generate that isn’t very fleshed out, so some more utilities would need to be developed to easily generate it, but it could work quite well. It would also end up a bit ‘different’ from a typical shapefile, so the backward-compatibility aspect is a bit less than I was imagining. The biggest problem in my opinion is that it would inherit all the problems of shapefile, as well articulated by switchfromshapefile.org.
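
For a sense of what ‘cloud-optimized’ would mean in practice, here’s a sketch (hypothetical URL) of the kind of remote, range-request-driven read GDAL can already attempt over HTTP; a stricter internal layout plus a .qix index is what would make such reads actually efficient:

```python
from osgeo import ogr

# GDAL's /vsicurl/ virtual filesystem turns HTTP range requests into
# file reads, so a well-ordered .shp (with .shx, .dbf and a .qix index
# alongside) could be queried without downloading the whole thing.
# (Hypothetical URL.)
ds = ogr.Open("/vsicurl/https://example.com/parcels.shp")
layer = ds.GetLayer(0)

# A spatial filter is where the index and ordering would pay off.
layer.SetSpatialFilterRect(-122.5, 37.7, -122.3, 37.9)
print(layer.GetFeatureCount(), "features in the window")
```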

GeoParquet & GeoArrow

The format that I explored the most during my OGC Visiting Fellow time was the idea of adding standard geospatial support to Parquet. This was an extension of an effort I started with Javier from CARTO to better align the cloud data warehouse providers. The first goal of that effort is to ensure that exports and imports of various formats to and from each of the cloud data warehouse vendors are interoperable. For the most part, this is just aligning on the exact forms of Well-Known Binary and GeoJSON. That effort has stalled out a bit, but I’m hoping we’ll revive it soon.
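
The basic shape of the idea is straightforward: geometry becomes a WKB-encoded column in an ordinary Parquet file, with file-level metadata describing it. Here’s a minimal sketch with pyarrow and shapely (the metadata keys are illustrative rather than a finalized spec):

```python
import json

import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import Point

# The core idea: geometry is just another column, encoded as WKB bytes.
points = [Point(0, 0), Point(1, 1)]
table = pa.table({
    "name": ["a", "b"],
    "geometry": [p.wkb for p in points],
})

# File-level metadata tells readers which column holds geometry and how
# it is encoded. (Illustrative keys, not a finalized specification.)
geo_meta = {
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB"}},
}
table = table.replace_schema_metadata({"geo": json.dumps(geo_meta)})
pq.write_table(table, "points.parquet")
```

Because the geometry column is plain binary, any Parquet reader can still work with the rest of the table even if it knows nothing about geospatial data.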

GeoPackage

I’m a big fan of GeoPackage, in part because it’s probably the OGC standard I’ve been most involved in, and it was the first OGC standard to really embrace working in the open on GitHub, thanks to Jeff Yutzler’s leadership. Interestingly, it has options to store data optimized for analysis (as core spatial tables) and for visualization (through several vector tile community extensions). It even defines a ‘Vector Tiles Attribute Extension’ to associate vector tiles with the core spatial tables. I’m very curious whether this has seen adoption, as it seems to be an attempt to unite the visualization and analysis capabilities in one file.
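
Since a GeoPackage is just a SQLite database, it’s easy to peek at how those two worlds coexist: the required gpkg_contents table lists both kinds of layers. A minimal sketch (hypothetical file name):

```python
import sqlite3

# A GeoPackage is a SQLite database; the required gpkg_contents table
# lists every layer. (Hypothetical file "example.gpkg".)
con = sqlite3.connect("example.gpkg")
for table_name, data_type in con.execute(
        "SELECT table_name, data_type FROM gpkg_contents"):
    # data_type is 'features' for core spatial tables and 'tiles' for
    # tile pyramids; both can live side by side in one file.
    print(table_name, data_type)
con.close()
```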

GeoJSON?

I’m pretty sure GeoJSON is not the answer, but I wanted to include it for completeness’ sake, as it’s the other format recommended by switchfromshapefile.org. Though even there they make clear that it’s just for ‘data interchange, particularly for web services’. It is worth mentioning some interesting projects that evolve the core concept of GeoJSON to make it work a bit better in cloud-native use cases:

  • Cloud-Optimized GeoJSON was an experiment from Boundless (now part of Planet as Planet Federal), with an explicit goal to make a vector equivalent to Cloud-Optimized GeoTIFF. It mostly focused on providing a spatially-oriented ‘table of contents’ upfront, for better seek access.
  • Newline-delimited GeoJSON is an upgrade to GeoJSON that enables more efficient streaming, using JSON text sequences. From the overview site: ‘Each line can be read independently, parsed, and processed, without the entire file ever having to be parsed in memory, which is great for huge files.’ You can read more about it in this great blog post. (A minimal streaming read is sketched below.)
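
Here’s a minimal sketch of the constant-memory streaming that one-feature-per-line enables (hypothetical file and attributes):

```python
import json

# Newline-delimited GeoJSON: one feature per line, so a file of any size
# can be processed with constant memory. (Hypothetical file and attributes.)
with open("buildings.ndjson") as f:
    for line in f:
        feature = json.loads(line)
        # Each feature is processed independently, e.g. filtered by attribute.
        if feature["properties"].get("height", 0) > 100:
            print(feature["properties"].get("name"))
```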

What’s next?

I hope the above is a decent overview of the potential formats. If I’ve missed one, let me know and I’d be happy to add it; I’m also happy to tweak any of the above with more information.

  • Formats for Cloud Data Warehouses. As a baseline, I believe we should be helping ensure interoperability between the cloud data warehouse vendors. They have adopted some new formats like Parquet, Avro & ORC, and as a geo-community we should help them get ‘geo’ right. But the real potential is to get to a format that they can use in a ‘cloud-native’ way, as an alternative to their internal formats that works ‘well enough’. This would enable data providers to publish in one format and have users of BigQuery, Snowflake, Redshift and more be able to use it directly in their workflows.
  • ‘Overviews included’. I’m far from sure this will work, but it feels like it’s worth experimenting with including some amount of ‘overviews’ directly in formats, to help meet the visualization and analysis use cases in a single file. Doing this will necessarily involve some duplication, but ideally there’s innovation on smart ways to link data within a format. The core ‘challenge’ would be to have a 1+ gigabyte file that can be visualized quickly in OpenLayers/Leaflet at the full layer extent while also providing full attribute data and indices (spatial and even others) for use in analysis.

Have thoughts & ideas on this? Share them!

And I’ll close this post with a plug for an event I’m helping pull together with the OGC, on ‘Cloud-Native Geospatial’. We just put up the ‘save the date’ and call for speakers. I’d love to have a set of talks exploring different aspects of ‘cloud-native vector’ formats. It could just be an idea you have, or a little experiment you want to share (even if you haven’t even made the experiment yet). It’s going to be a series of 5-minute lightning talks, with a very friendly audience, so please submit any ideas at https://forms.gle/fXJD7xtztffR3vJs9.
