Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

format_shapefile() utils::zip() usage #5

Closed
brownag opened this issue Mar 4, 2024 · 6 comments
Closed

format_shapefile() utils::zip() usage #5

brownag opened this issue Mar 4, 2024 · 6 comments

Comments

@brownag
Copy link
Contributor

brownag commented Mar 4, 2024

An alternative to utils::zip() may be to write shapefiles as .shz/.shp.zip extension and readily use seek-optimized ZIP

From https://gdal.org/user/virtual_file_systems.html#sozip-seek-optimized-zip

GDAL (>= 3.7) has full read and write support for .zip files following the SOZip (Seek-Optimized ZIP) profile.

The ESRI Shapefile / DBF and GPKG -- GeoPackage vector drivers can directly generate SOZip-enabled .shz/.shp.zip or .gpkg.zip files.

While this would allow for us to easily make use of seek-optimize ZIP, I think it still does not get us around the need to have the target object name end in ".shz" or ".zip" so that GDAL can detect the file on read. #4 (comment)

@brownag
Copy link
Contributor Author

brownag commented Mar 4, 2024

A reprex of how this can be implemented. Indeed there is no need to have the target end in .zip, nor for the calls to list.files()/utils::zip()

zf <- list.files(
pattern = replace_dot_zip(
x = path,
replacement = ""
)
)
utils::zip(
zipfile = path,
files = zf
)

  1. The target is written as a .shz file, then renamed to lose the extension (targets checks that there is a file with same name as R object)
  2. The target is read using the curly braces notation, allowing GDAL to recognize the archive even though it lacks the .shz/.zip extension
targets::tar_script({
    library(terra)
    library(targets)

    lux_area <- function(projection = "EPSG:4326") {
        terra::project(terra::vect(system.file("ex", "lux.shp", package = "terra")),
                       projection)
    }

    format_shapefile_zip <- tar_format(
        read = function(path) terra::vect(paste0("/vsizip/{", path, "}")),
        write = function(object, path) {
            terra::writeVector(
                x = object,
                filename = paste0(path, ".shz"),
                filetype = "ESRI Shapefile"
            )
            file.rename(paste0(path, ".shz"), path)
        },
        marshal = function(object) terra::wrap(object),
        unmarshal = function(object) terra::unwrap(object)
    )

    tar_target(
        lux_area1,
        command = lux_area(),
        format = format_shapefile_zip
    )
})
targets::tar_make()
#> terra 1.7.73
#> ▶ dispatched target lux_area1
#> ● completed target lux_area1 [0.037 seconds]
#> ▶ completed pipeline [0.131 seconds]
targets::tar_load("lux_area1")
lux_area1
#>  class       : SpatVector 
#>  geometry    : polygons 
#>  dimensions  : 12, 6  (geometries, attributes)
#>  extent      : 5.74414, 6.528252, 49.44781, 50.18162  (xmin, xmax, ymin, ymax)
#>  source      : lux_area1} (lux_area1)
#>  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#>  names       :  ID_1   NAME_1  ID_2   NAME_2  AREA   POP
#>  type        : <num>    <chr> <num>    <chr> <num> <int>
#>  values      :     1 Diekirch     1 Clervaux   312 18081
#>                    1 Diekirch     2 Diekirch   218 32543
#>                    1 Diekirch     3  Redange   259 18664

@njtierney
Copy link
Owner

That looks great to me, thanks @brownag !

How would one implement getting a shapefile via gadm?

Currently I've got a function written like so:

get_gadm_country <- function(country = "Australia") {
  dir.create("data/shapefiles", recursive = TRUE, showWarnings = FALSE)
  gadm(
    country = country,
    level = 0,
    path = "data/shapefiles",
    version = "4.1",
    # low res
    resolution = 2
  )
}

But this feels against the suggested workflow in that it requires creating a directory and specifying a path?

njtierney added a commit to Aariq/geotargets that referenced this issue Mar 12, 2024
@brownag
Copy link
Contributor Author

brownag commented Mar 12, 2024

So, I think that in the case of get_gadm_country() you will get one .rds file in the directory path for each country specified. Each .rds file contains a wrapped SpatVector of a single country. When the cache exists for a particular country/version/resolution, rather than calling download.file() the SpatVector is read from file and unwrapped. Each of these could be "file" targets if you can determine what a particular call to gadm() should produce (i.e. 1 file with specific file name given parameters in path for each country specified)

After Aariq@f82edd4 you can easily do something like tar_terra_vect(some_countries, get_gadm_country(c("Australia", "New Zealand"))) which will write the SpatVector combination of two .rds files as a new seek-optimized shapefile (.shz). If the source .rds files don't exist (i.e. on first run of the pipeline) or are deleted (i.e. a temp folder is used), they would be re-downloaded. This does not rely on external files outside the target store, but technically you have copies of the data in the initial gadm() download (stored as PackedSpatVector) and the target(s) (stored as zipped shapefile)

If you want to have the file(s) produced by gadm() be the targets... we need to consider that they are not shapefiles but rather PackedSpatVector in "rds" format... and a single call could produce multiple "rds" files (multiple targets)

You might need to have one format="file" target for each country of interest, possibly using dynamic branching or using a wrapper function around gadm() so you can identify what the file names are, rather than return the unwrapped SpatVector result in memory. EDIT: format="file" can be multiple file paths or directory, so the number of targets to use would depend on what you want to do with them

In general format="file" can probably be used for a variety of cases of geospatial data, especially when the user has some specific folder structure or process they need to follow that wouldn't work well out of the target store proper. Using format="file" does require some additional bookkeeping on the paths, filenames and types, though. https://books.ropensci.org/targets/data.html#external-files

@Aariq
Copy link
Collaborator

Aariq commented Mar 13, 2024

Let's move the discussion about gadm() to a new issue (if still relevant) and close this one, yeah?

@njtierney
Copy link
Owner

Yes I think let's close this issue and open a new one about gadm :)

@njtierney
Copy link
Owner

Also I should say, thank you @brownag for your explanation above! Much appreciated, :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants