`format_shapefile()` `utils::zip()` usage #5

brownag · 2024-03-04T02:44:17Z

An alternative to utils::zip() may be to write shapefiles as .shz/.shp.zip extension and readily use seek-optimized ZIP

From https://gdal.org/user/virtual_file_systems.html#sozip-seek-optimized-zip

GDAL (>= 3.7) has full read and write support for .zip files following the SOZip (Seek-Optimized ZIP) profile.

The ESRI Shapefile / DBF and GPKG -- GeoPackage vector drivers can directly generate SOZip-enabled .shz/.shp.zip or .gpkg.zip files.

While this would allow for us to easily make use of seek-optimize ZIP, I think it still does not get us around the need to have the target object name end in ".shz" or ".zip" so that GDAL can detect the file on read. #4 (comment)

The text was updated successfully, but these errors were encountered:

brownag · 2024-03-04T05:23:03Z

A reprex of how this can be implemented. Indeed there is no need to have the target end in .zip, nor for the calls to list.files()/utils::zip()

geotargets/R/tar-shapefile.R

Lines 20 to 29 in 69a08fe

    
           zf <- list.files( 
        
             pattern = replace_dot_zip( 
        
               x = path, 
        
               replacement = "" 
        
             ) 
        
           ) 
        
           utils::zip( 
        
             zipfile = path, 
        
             files = zf 
        
           )

The target is written as a .shz file, then renamed to lose the extension (targets checks that there is a file with same name as R object)
The target is read using the curly braces notation, allowing GDAL to recognize the archive even though it lacks the .shz/.zip extension

targets::tar_script({
    library(terra)
    library(targets)

    lux_area <- function(projection = "EPSG:4326") {
        terra::project(terra::vect(system.file("ex", "lux.shp", package = "terra")),
                       projection)
    }

    format_shapefile_zip <- tar_format(
        read = function(path) terra::vect(paste0("/vsizip/{", path, "}")),
        write = function(object, path) {
            terra::writeVector(
                x = object,
                filename = paste0(path, ".shz"),
                filetype = "ESRI Shapefile"
            )
            file.rename(paste0(path, ".shz"), path)
        },
        marshal = function(object) terra::wrap(object),
        unmarshal = function(object) terra::unwrap(object)
    )

    tar_target(
        lux_area1,
        command = lux_area(),
        format = format_shapefile_zip
    )
})
targets::tar_make()
#> terra 1.7.73
#> ▶ dispatched target lux_area1
#> ● completed target lux_area1 [0.037 seconds]
#> ▶ completed pipeline [0.131 seconds]
targets::tar_load("lux_area1")
lux_area1
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux_area1} (lux_area1)
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

njtierney · 2024-03-11T01:12:04Z

That looks great to me, thanks @brownag !

How would one implement getting a shapefile via gadm?

Currently I've got a function written like so:

get_gadm_country <- function(country = "Australia") {
  dir.create("data/shapefiles", recursive = TRUE, showWarnings = FALSE)
  gadm(
    country = country,
    level = 0,
    path = "data/shapefiles",
    version = "4.1",
    # low res
    resolution = 2
  )
}

But this feels against the suggested workflow in that it requires creating a directory and specifying a path?

…n author

brownag · 2024-03-12T07:07:13Z

So, I think that in the case of get_gadm_country() you will get one .rds file in the directory path for each country specified. Each .rds file contains a wrapped SpatVector of a single country. When the cache exists for a particular country/version/resolution, rather than calling download.file() the SpatVector is read from file and unwrapped. Each of these could be "file" targets if you can determine what a particular call to gadm() should produce (i.e. 1 file with specific file name given parameters in path for each country specified)

After Aariq@f82edd4 you can easily do something like tar_terra_vect(some_countries, get_gadm_country(c("Australia", "New Zealand"))) which will write the SpatVector combination of two .rds files as a new seek-optimized shapefile (.shz). If the source .rds files don't exist (i.e. on first run of the pipeline) or are deleted (i.e. a temp folder is used), they would be re-downloaded. This does not rely on external files outside the target store, but technically you have copies of the data in the initial gadm() download (stored as PackedSpatVector) and the target(s) (stored as zipped shapefile)

If you want to have the file(s) produced by gadm() be the targets... we need to consider that they are not shapefiles but rather PackedSpatVector in "rds" format... and a single call could produce multiple "rds" files (multiple targets)

You might need to have one format="file" target for each country of interest, possibly using dynamic branching or using a wrapper function around gadm() so you can identify what the file names are, rather than return the unwrapped SpatVector result in memory. EDIT: format="file" can be multiple file paths or directory, so the number of targets to use would depend on what you want to do with them

In general format="file" can probably be used for a variety of cases of geospatial data, especially when the user has some specific folder structure or process they need to follow that wouldn't work well out of the target store proper. Using format="file" does require some additional bookkeeping on the paths, filenames and types, though. https://books.ropensci.org/targets/data.html#external-files

Aariq · 2024-03-13T17:39:35Z

Let's move the discussion about gadm() to a new issue (if still relevant) and close this one, yeah?

njtierney · 2024-03-13T23:57:04Z

Yes I think let's close this issue and open a new one about gadm :)

njtierney · 2024-03-14T00:17:18Z

Also I should say, thank you @brownag for your explanation above! Much appreciated, :)

brownag mentioned this issue Mar 4, 2024

Ideas on generalization of spatial package backends and file sources using GDAL (terra, sf, stars, etc.) #4

Closed

njtierney added a commit to Aariq/geotargets that referenced this issue Mar 12, 2024

implement shapefile method from @brownag njtierney#5 and add him as a…

f82edd4

…n author

njtierney closed this as completed Mar 13, 2024

njtierney mentioned this issue Mar 14, 2024

using geodata::gadm with geotargets #22

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`format_shapefile()` `utils::zip()` usage #5

`format_shapefile()` `utils::zip()` usage #5

brownag commented Mar 4, 2024 •

edited

Loading

brownag commented Mar 4, 2024

njtierney commented Mar 11, 2024

brownag commented Mar 12, 2024 •

edited

Loading

Aariq commented Mar 13, 2024

njtierney commented Mar 13, 2024

njtierney commented Mar 14, 2024

format_shapefile() utils::zip() usage #5

format_shapefile() utils::zip() usage #5

Comments

brownag commented Mar 4, 2024 • edited Loading

brownag commented Mar 4, 2024

njtierney commented Mar 11, 2024

brownag commented Mar 12, 2024 • edited Loading

Aariq commented Mar 13, 2024

njtierney commented Mar 13, 2024

njtierney commented Mar 14, 2024

`format_shapefile()` `utils::zip()` usage #5

`format_shapefile()` `utils::zip()` usage #5

brownag commented Mar 4, 2024 •

edited

Loading

brownag commented Mar 12, 2024 •

edited

Loading