- Functionality and good practices
- 1. Input data and high-level functions needed to achieve a sampling design and analysis results
- 1.1 Data & functions for drawing probability samples
- 1.2 Data & functions for making sampling frames
- 1.3 Data & functions for making the 'base' sampling frame
- 1.4 Data & functions for composing a temporary sampling schedule as a selection from a legacy judgment sample (existing measurement locations)
- 1.5 Data & functions for model-building in support of the design
- 1.6 Data & functions for inferences (also relevant for inference simulations in the design stage)
- 2. Data and intermediate functionality, needed in support of the high-level functions
- 2.1 Data & functions to obtain the attributes 'type' and the type's spatial proportion
- 2.2 Data & functions for restricting the spatial target population for each monitoring scheme (target population restricting data)
- 2.3 Data & functions for defining relevant existing measurement locations with 'usefulness' attributes
- 2.4 Intermediate-level helper functions
- 3. Low-level helper functions
Note: this document also applies to related 'n2khab-' repositories (mentioned below). This picture shows their relations.
## Functionality and good practices

- data (pre)processing is to be reproducible, and is therefore defined by (see the sketch after this list):
    - R functions that aim at standardized data reading, data conversion etc., with arguments for undecided aspects that the user can set (including the directory of the dataset). These R functions are made available to the user through the `n2khab` package.
    - R scripts or, ideally, literate scripts (R Markdown) that define the actual workflow (processing pipeline), including the chosen arguments of the functions.
- in some cases it can be useful to store (and version) the resulting dataset of a workflow as well (although it can be reproduced), especially if:
    - it is useful to offer immediate access to the resulting data version, e.g.:
        - to other colleagues
        - to compare versions over time more easily
    - the workflow is computationally intensive

  In those cases, the workflow is stored and versioned in the n2khab-preprocessing repository.
- hence, the aim is to write easily usable functions that achieve the targeted results. A function can call other (helper) functions; consequently, high-level functions are also to be made, to enhance automation and reproducibility.
- when contributing, please use the `tidyverse`, `sf`, `raster` and `git2rdata` packages for data reading and processing. See the README file!
- use standardized names of data files, to be found here (see column `ID`). These names are independent of the actual data version, which has an ID of its own. There are some useful filter views available in the Google Sheet.
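The division of labour between package functions and workflow scripts could look as follows; this is a minimal sketch, in which the data location and the chosen reading function are merely examples:

```r
# workflow.Rmd (processing pipeline): only pins down the chosen arguments;
# all reading and conversion logic lives in the n2khab package
library(n2khab)

habitatmap <- read_habitatmap_stdized(
  path = "data/20_processed",   # hypothetical local data directory
  file = "habitatmap_stdized"
)
```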
Explanatory notes on function arguments used further on:

- `programme`: refers to MNE or MHQ
- `scheme`: the specific monitoring scheme within MNE or MHQ
- `object`: a data object in R
- `path`: the directory where the dataset can be found
- `file`: the filename, without extension in the case of multiple files with different extensions
- `outputdir`: the directory under which the dataset is to be written (the dataset should go into a subfolder of `outputdir`, named as the dataset's ID)
- `threshold_pct`: the areal percentage threshold used to withhold types from `habitatmap`
- `resolution`: the resolution that the user wants for the high-level raster
- `highlevel_raster`: coarse raster based on `GRTSmaster_habitats`
- `cell_samplesizes`: dataframe with the sample size per evaluation cell
- `connection`: database connection
Dataset IDs can be found in this Google Sheet.

In the context below, XG3 refers to HG3 and/or LG3 (in piezometers).
## 1. Input data and high-level functions needed to achieve a sampling design and analysis results

### 1.1 Data & functions for drawing probability samples

(Only briefly considered for now.)
- sampling frame
- the sampling design:
    - the design type
    - attributes of the spatial, temporal and revisit design
Needed functions: in repo n2khab-mne-design / n2khab-mhq-design
Scripts/Rmarkdown: in repo n2khab-mne-design / n2khab-mhq-design
Results: to be written into repo n2khab-mne-design / n2khab-mhq-design, or in a separate repo with the sampling administration
### 1.2 Data & functions for making sampling frames

- `base_samplingframe`: a 'base' sampling frame (see 1.3) that does not distinguish between monitoring schemes, but instead provides the unioned spatial target population for the monitoring schemes of MHQ and/or MNE
- target population restricting data (see 2.2): information that complements the 'base' sampling frame, in order to restrict the spatial target population for each monitoring scheme and completely define the respective target populations
Needed functions: in repo n2khab-samplingframes:
`write_samplingframe(programme, scheme, outputdir)`
Scripts/Rmarkdown: in repo n2khab-samplingframes
Results: to be written into repo n2khab-samplingframes
### 1.3 Data & functions for making the 'base' sampling frame

A 'base' sampling frame is implemented as a unioned dataframe of the spatial target populations of the respective types (for all monitoring schemes of MNE or MHQ as a whole). Each row represents a spatial unit that belongs to the target population of one type. The 'base' sampling frame can either be split between MNE and MHQ, or provided with a TRUE/FALSE attribute for each of MNE and MHQ. The spatial unit can correspond to a raster cell from a GRTS master raster (terrestrial types), a line segment (lotic types) or a polygon of non-fixed size (lentic types).
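For illustration, a minimal sketch of such a structure as a tibble; all column names and values are hypothetical:

```r
library(tibble)

# one row per combination of spatial unit and type: a unit hosting two
# types occurs twice (illustrative columns and values only)
base_samplingframe <- tribble(
  ~unit_id,    ~unit_class,    ~type,     ~mne,  ~mhq,  ~grts_ranking,
  "cell_1041", "raster_cell",  "6230_ha", TRUE,  TRUE,  15203L,
  "cell_1041", "raster_cell",  "6410_mo", TRUE,  FALSE, 15203L,
  "seg_0007",  "line_segment", "3260",    FALSE, TRUE,  890L
)
```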
The 'base' sampling frame needs input data in order to provide the following attributes when drawing samples:
- spatial unit definition (ID, spatial attributes): derived from:
    - `GRTSmaster_habitats`
    - `habitatmap_terr`
    - `habitatmap_integrated`
    - `mhq_terrestrial_locs`
    - `mhq_lentic_locs`
    - `mhq_lotic_locs`
    - `flanders` (used to restrict the previous layers, as far as needed)
- the attributes 'type' and the type's spatial proportion (see 2.1)
- domains:
    - `sac` (Special Areas of Conservation: Habitats Directive (Flanders))
    - `biogeoregions`
- GRTS ranking number: derived from `GRTSmaster_habitats`
- algorithms to join the `GRTSmaster_habitats` ranking number to spatial units (of terrestrial, lotic and lentic types respectively)
In practice, it is possible to build up the base sampling frame in steps, i.e. according to the needs. E.g., the addition of `habitatmap_integrated` (which combines `habitatmap_terr` with `watersurfaces` and `habitatstreams`) can also be postponed.
Needed functions: in repo n2khab-samplingframes:
`write_base_samplingframe(outputdir)`
Scripts/Rmarkdown: in repo n2khab-samplingframes
Results: to be written into repo n2khab-samplingframes
### 1.4 Data & functions for composing a temporary sampling schedule as a selection from a legacy judgment sample (existing measurement locations)
(Cf. chapter 6 of this report -- in Dutch.)
- relevant existing measurement locations with 'usefulness' attributes (see 2.3)
- `samplingframe`
- high-level raster for evaluation
- aimed sample size for each evaluation cell
- sampling design attributes, especially spatial sample sizes
Plus the supporting functions (see further) to scale up `GRTSmaster_habitats` into the high-level raster, and to calculate the expected sample size for each evaluation cell.
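The scaling up could plausibly be done as follows; this is a sketch assuming the `raster` package, a hypothetical file name, and the convention that a coarse cell inherits the minimum (i.e. best) GRTS ranking number among its constituent cells:

```r
library(raster)

# GRTSmaster_habitats at its native resolution (hypothetical file name)
grts <- raster("GRTSmaster_habitats.tif")

# scale up by an integer factor; taking the minimum ranking number per
# coarse cell is one way to retain a GRTS ordering at the coarser level
highlevel_raster <- aggregate(grts, fact = 10, fun = min)
```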
Needed functions: in repo n2khab-mne-design:
`sample_legacy_sites_groundwater(highlevel_raster, cell_samplesizes, outputdir)`

`sample_probabilistic_sites_groundwater(highlevel_raster, cell_samplesizes, outputdir)`

The suffix `_groundwater` can also be replaced by something else: subprogramme-specific functions (e.g. for groundwater) are used here because of each subprogramme's peculiarities.
Scripts/Rmarkdown: in repo n2khab-mne-design
Results: to be written into repo n2khab-mne-design
### 1.5 Data & functions for model-building in support of the design

(Cf. chapter 7 of this report -- in Dutch.)
- a dataset of the environmental variable of interest that has at least some relevance to the target population (and ideally, spatial and/or temporal overlap)
- optionally:
    - spatial attributes of existing measurement locations
    - the attributes 'type' and the type's spatial proportion
    - target population restricting data
    - spatial layers supporting sample simulation:
        - `soilmap`
        - `ecoregions`
        - etc.
### 1.6 Data & functions for inferences (also relevant for inference simulations in the design stage)

(Cf. chapter 9 of this report -- in Dutch.)

I.e. including model-assisted inference.
- sampling-unit-level design attributes, including type, sampling weights, time (at least at the level of the revisit design's specifications), domain and poststratum specification
- `scheme_types`, defining typegroups if applicable
- auxiliary variable(s) (see draft list -- in Dutch), known for the whole of the sampling frame, either:
    - a categorical variable defining poststrata (for poststratification)
    - a continuous variable (for regression estimation)

  Examples include `soilmap` and `ecoregions`.
- domain population sizes for more efficient domain estimation (i.e. subpopulation estimation through poststratification)
Needed functions: in repo n2khab-mne-design or in a more general n2khab inference functions repo:
`status_estimates(samplingunits_dataframe, domain, auxiliaries, typegroups)`

- `samplingunits_dataframe` provides sampling weights, the population sizes of domain, poststratum and total population, typegroup membership, and type
- `auxiliaries` are variable names, to be provided in `samplingunits_dataframe`, which will be evaluated as categorical (for poststratification) or continuous (for regression estimation)
- returns estimates and confidence intervals from design-based spatial or spatiotemporal inference

`localtrend_estimates()`

- returns a modelled temporal trend parameter for each site (mean & confidence interval), which can subsequently be fed into `status_estimates()` for design-based spatial inference
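Internally, `status_estimates()` could be built on an existing design-based estimation engine. A minimal sketch with the `survey` package (all column names are hypothetical, and regression estimation is left out):

```r
library(survey)

# samplingunits_dataframe: one row per sampling unit, with an observed
# variable y, sampling weights and a poststratum variable
design <- svydesign(ids = ~1, weights = ~weight,
                    data = samplingunits_dataframe)

# poststratification needs the population size (Freq) of each poststratum
ps_design <- postStratify(design, strata = ~poststratum,
                          population = poststratum_sizes)

# domain (subpopulation) estimates with confidence intervals
svyby(~y, ~domain, ps_design, svymean, vartype = "ci")
```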
Scripts/Rmarkdown: in same repo
Results: to be written into repo n2khab-mne-design (simulations) or n2khab-mne-result (results); the same could be done for n2khab-mhq.
## 2. Data and intermediate functionality, needed in support of the high-level functions {#intermediate}
### 2.1 Data & functions to obtain the attributes 'type' and the type's spatial proportion

Possible dataframes or spatial objects to join this information to include `GRTSmaster_habitats`, `groundwater_sites`, `lenticwater_sites`, ... Often, more than one type can be linked to a spatial unit; therefore the information is typically not made a direct part of a spatial object (both use a common identifier). Instead, a long (tidy) dataframe (tibble) is generated to list all types that are recorded at a location.
Depending on the purpose, the type attribute is to be derived from one or more of the following:

- `habitatmap`
- `habitatdune`
- `habitatstreams`
- `watersurfaces`
- `mhq_lentic_locs`
- `mhq_lotic_locs`

Moreover, it is made consistent with, and restricted to, the type codes from the data source `types`.
A processed layer `habitatmap_stdized` is constructed to standardise the format of `habitatmap`:

- main types need to be linked to their corresponding subtypes in order to be picked up when selections are defined at the subtype level (needed when no subtype information exists for a given spatial object);
- extra attributes are needed in most applications: the rank and the associated areal proportion of each row.
Ideally, an intermediate spatial layer is generated that combines the above layers and integrates their information.
Needed functions: in package n2khab:
The functions take into account type code consistency and link subtypes to main types. Most functions return a data set consisting of both a spatial object and a long / tidy dataframe (tibble), including areal proportions.
`read_habitatmap_stdized(path, file)`

- returns a spatial object and a long (tidy) tibble based on `habitatmap`

`read_habitatmap_terr(path, file)`

- returns `habitatmap_terr`: this integrates the terrestrial locations of `habitatmap_stdized` (but excluding those with very low certainty of containing habitat or RIB), `habitatdune` and `mhq_terrestrial_locs`, and adds further interpretation (especially: translating some main type codes into a specific subtype)

`expand_types()`

- takes a type column in a dataframe and expands it, based on relationships between main types and subtypes, in order to optimize joins with `habitatmap_terr` (see the sketch after this list)

`read_habitatmap_integrated(path, file)`

- returns `habitatmap_integrated`: based on `habitatmap_terr`; inserts the spatial units from `habitatstreams` and `watersurfaces_interpr` while retaining useful (type) attributes, ideally including those from `mhq_lentic_locs` and `mhq_lotic_locs`
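A sketch of the intended `expand_types()` behaviour; the main/subtype relation shown and the implementation are illustrative assumptions, not the final API:

```r
library(tibble)
library(dplyr)

# hypothetical lookup of main types and their subtypes
type_relations <- tribble(
  ~main_type, ~subtype,
  "6410",     "6410_mo",
  "6410",     "6410_ve"
)

# a type column coded at the main-type level ...
scheme_types <- tibble(scheme = "scheme_x", type = "6410")

# ... gets expanded with the corresponding subtypes, so that joins with
# habitatmap_terr (which may be coded at the subtype level) succeed
expanded <- scheme_types %>%
  left_join(type_relations, by = c("type" = "main_type")) %>%
  transmute(scheme, type = coalesce(subtype, type)) %>%
  bind_rows(scheme_types) %>%
  distinct()
```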
Dedicated writing workflow (scripts/Rmarkdown): in repo n2khab-preprocessing
Results of the dedicated writing workflow: to be written into `data/20_processed`
### 2.2 Data & functions for restricting the spatial target population for each monitoring scheme (target population restricting data)
Separate data, next to the sampling frame, are needed to restrict the spatial target population for each monitoring scheme, in order to completely define the respective spatial target populations. These data comprise:

- `schemes`: provides an ID for each monitoring scheme and its defining attributes (e.g. in MNE: compartment, environmental pressure, (sometimes) variable), and states whether a further spatial restriction layer is needed
- `scheme_types`: dataframe that lists the types of the target population of the respective monitoring schemes
- spatial restriction to units irrespective of type -- depending on the monitoring scheme (see list): derived from:
    - `shallowgroundwater`
    - `floodsensitive`
    - other possible spatial layers
Needed functions: in package n2khab:
`read_schemes(path, file)`

`read_scheme_types(path, file)`

`read_shallowgroundwater(path, file)`

`read_floodsensitive(path, file)`
Results: NOT to be written
### 2.3 Data & functions for defining relevant existing measurement locations with 'usefulness' attributes
- spatial attributes of existing measurement locations
- 'usefulness' attributes of the locations, which allow making selections that maximize 1) the usefulness of existing data and 2) the potential for follow-up in the near future. These are derived from a dataset of the environmental variable of interest that has at least a relevant overlap with the target population
- usefulness selection criteria
- spatial selection criteria:
    - the attributes 'type' and the type's spatial proportion (see 2.1)
    - target population restricting data (see 2.2)
    - topological criteria for spatially joining the target population with the existing measurement locations
Needed functions: in repo inborutils:
`qualify_groundwater_sites(xg3_metadata, xg3_data, chemistry_metadata, chemistry_data)`

- the input objects conform to the formats returned by `read_groundwater_xg3()` and `read_groundwater_chemistry()` (see part 3)
- the function flags sites with the following quality characteristics:
    - XG3 data available
    - hydrochemical data available
    - recent data available (either XG3 or hydrochemical)
    - length of the 'useful XG3 data series', i.e. the longest available XG3 data series that has more hydrological years with than without an XG3 observation, and that starts and ends with a hydrological year with an XG3 observation (see the sketch after this list)
    - number of gaps (missing years) in the useful XG3 data series
    - first hydrological year of the useful XG3 data series
    - last hydrological year of the useful XG3 data series
- the function returns a spatial object (hereafter named `groundwater_sites`) with the quality criteria, and with the piezometer IDs and coordinates
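A sketch of how the 'useful XG3 data series' could be determined; this is a brute-force helper, assuming a logical vector with one element per consecutive hydrological year:

```r
# obs: TRUE if the hydrological year has an XG3 observation
useful_xg3_series <- function(obs) {
  best <- c(length = 0L, start = NA_integer_, end = NA_integer_)
  n <- length(obs)
  for (i in seq_len(n)) {
    for (j in i:n) {
      window <- obs[i:j]
      # the series must start and end with an observation year and have
      # more years with than without an observation
      if (window[1] && window[length(window)] &&
          sum(window) > sum(!window) &&
          length(window) > best[["length"]]) {
        best <- c(length = length(window), start = i, end = j)
      }
    }
  }
  # the number of gaps follows as the series length minus the number of
  # observation years within it
  best
}
```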
`spatialjoin_groundwater_sites(object, topological_criterion, groundwater_sites)`

- takes a spatial R object (e.g. `soilmap`, `habitatmap_terr`, `habitatmap_integrated`) and uses a `topological_criterion` (e.g. intersection with a buffer of radius x around the piezometers) to make a spatial join with a spatial object `groundwater_sites`, as returned by `qualify_groundwater_sites()`
- returns a tidy dataframe (tibble) (hereafter named `groundwater_joinedattributes`) with piezometer IDs and the joined attributes (as buffers may be used, a long format is necessary)
Results: NOT to be written
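A sketch of the buffer-based variant of such a join with `sf`; the buffer radius and column names are hypothetical, and a projected CRS (in meters) is assumed:

```r
library(sf)
library(dplyr)

# intersect a 50 m buffer around each piezometer with, e.g., the soil map;
# one piezometer may hit several polygons, hence the long output format
groundwater_joinedattributes <- groundwater_sites %>%
  st_buffer(dist = 50) %>%
  st_join(soilmap, join = st_intersects) %>%
  st_drop_geometry() %>%
  as_tibble() %>%
  select(piezometer_id, soil_type)
```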
Needed functions: in package n2khab:
`filter_groundwater_sites(groundwater_sites, groundwater_joinedattributes, scheme, usefulness)`

- combines the spatial object returned by `qualify_groundwater_sites()` with a dataframe (tibble) returned by `spatialjoin_groundwater_sites()`, which provides type & type attributes, and restricts the result:
    - according to the types and optional spatial restrictions as imposed by the specified MNE `scheme`;
    - according to `usefulness` criteria, which could be given as a dataframe with the allowed minimum and maximum values of quality characteristics
- returns the shrunken forms of `groundwater_sites` and `groundwater_joinedattributes`, as a GeoJSON file or shapefile (points) and a dataframe (tibble), respectively
- alternatively, define a function that encapsulates `qualify_groundwater_sites()` and `spatialjoin_groundwater_sites()` and applies the restriction
Dedicated writing workflow (scripts/Rmarkdown): in repo n2khab-mne-design
Results of the dedicated writing workflow: to be written into repo n2khab-mne-design
### 2.4 Intermediate-level helper functions

Needed functions: in package n2khab:
`spatialjoin_GRTS(object)`

- takes a spatial R object (polygons, line segments, points), makes a spatial join with `GRTSmaster_habitats` and returns a spatial R object with GRTS attributes added;
- potentially involves an open GIS backend;
- for polygons and line segments, implements a point selection procedure to comply with MHQ selections.

`write_highlevel_GRTSmh(resolution)`

- writes `highlevel_raster`, i.e. a scaled-up version (using `resolution`) of `GRTSmaster_habitats`

`soiltexture_coarse()`

- takes a vector with soil type codes (character or factor) and converts it into a factor with three coarse texture classes (fine / coarse / peat); see the sketch below
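The conversion could look as follows; the mapping from Belgian soil map texture codes to the three classes is an assumption for illustration, not the decided rule:

```r
library(dplyr)

soiltexture_coarse <- function(x) {
  x <- as.character(x)
  texture <- case_when(
    x %in% c("Z", "S", "P") ~ "coarse",     # sandy textures (assumed)
    x %in% c("L", "A", "E", "U") ~ "fine",  # loamy/clayey textures (assumed)
    x == "V" ~ "peat",
    TRUE ~ NA_character_
  )
  factor(texture, levels = c("fine", "coarse", "peat"))
}
```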
Results: NOT to be written
Needed functions: in repo n2khab-mne-design:
`expected_sample_size(programme, scheme, highlevel_raster)`
Results: NOT to be written
## 3. Low-level helper functions

To recall, `read_xxx()` functions typically return:

- tidy formatted data (which may mean that a spatial dataset is to be kept separate from long-formatted attributes).
    - While several `read_xxx()` functions refer to data that are rather specific to n2khab projects, other `read_xxx()` functions are of broader interest. Therefore, place the latter (only) in the inborutils package.
- data with English variable names and labels of identifiers (such as types, pressures, ...)
- tibbles instead of dataframes
- only the variables needed for n2khab projects
So, depending on the data source, this may require more than a single `read_vc()` or `st_read()` statement.
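As an illustration of that point, a minimal sketch of a `read_xxx()` implementation, assuming `git2rdata` storage; the source column names are invented:

```r
library(git2rdata)
library(dplyr)

# hypothetical reader: reads the versioned file, translates variable names
# to English and keeps only the variables needed for n2khab projects
read_env_pressures <- function(path, file = "env_pressures") {
  read_vc(file = file, root = path) %>%
    as_tibble() %>%
    select(pressure = druk, explanation = uitleg)
}
```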
Needed functions: in package n2khab:
- For reading input data:
    - `read_env_pressures(path, file)`
    - `read_schemes(path, file)`
    - `read_scheme_types(path, file)`
    - `read_types(path, file)`
    - `read_namelist(path, file)`: holds the names corresponding to codes in other textual data sources (`types`, `env_pressures` etc.), supporting multiple languages
    - `read_GRTSmh(path, file)`: if this is not feasible within R, an open GIS backend needs to be called by R
    - `read_habitatdune(path, file)`
    - `read_habitatstreams(path, file)`
    - `read_sac(path, file)`
    - `read_mhq_terrestrial_locs(path, file)`
    - `read_mhq_lentic_locs(path, file)`
    - `read_mhq_lotic_locs(path, file)`
- In some cases, for reading generated data:
    - `read_habitatmap_terr(path, file)`: loads the R objects returned by `write_habitatmap_terr()`
    - `read_habitatmap_integrated(path, file)`: loads the R objects returned by `write_habitatmap_integrated()`
    - `read_samplingframe(path, file)`: loads the R object returned by `write_samplingframe()`
    - `read_base_samplingframe(path, file)`: loads the R object returned by `write_base_samplingframe()`
Results: NOT to be written
Needed functions: in inborutils package:
- For reading input data:
    - `read_watersurfaces(path, file)`
    - `read_flanders(path, file)`
    - `read_provinces(path, file)`
    - `read_biogeoregions(path, file)`
    - `read_ecoregions(path, file)`
    - `read_soilmap(path, file)`
    - `read_groundwater_xg3(connection, selection)`
        - defines the query to be executed in the groundwater database, in order to extract metadata and XG3 data
        - implements criteria, which can be given by a dataframe argument `selection`:
            - maximum filter bottom depth in meters below soil surface (workflow implementation: at most 3 meters below soil surface)
            - which piezometer to retain from piezometer couples (workflow implementation: only the shallowest one)
    - `read_groundwater_chemistry(connection, selection)`
        - defines the query to be executed in the groundwater database, in order to extract metadata and hydrochemical data
        - implements criteria, which can be given by a dataframe argument `selection`:
            - maximum filter bottom depth in meters below soil surface (workflow implementation: at most 3 meters below soil surface)
            - which piezometer to retain from piezometer couples (workflow implementation: only the shallowest one)
Results: NOT to be written
Workflows (in whichever repo) depend on the user placing data in the right location, and most of all: placing the right data in the right location.
The following things are therefore needed in each repo where data processing is done (analysis repositories and n2khab-preprocessing):
- `datalist_chosen`: a tabular file that clearly defines which data sources, and which versions thereof, are needed (the file is versioned in the respective repo)
- functionality regarding the definition of data sources and data versions (see 3.3)
- a housekeeping workflow that checks whether the right data are present, as defined by `datalist_chosen`, and which should be run on a regular basis
Needed functions: in package n2khab:

`check_inputdata(checksums, root, checksumdelay = 14*24*3600)`

- checks data presence, data version and integrity, cf. the functionality described here (see the sketch below)
- next to each file, it generates a metadata file and, under certain conditions, a checksum file
- it checks the current metadata against the metadata file, and it checks the checksum against the checksum in the `checksums` dataframe, which is to be generated from `datalist_chosen`, `dataversions` and `datasources`
- it reports to the user
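A sketch of the integrity check at the core of such a function, assuming md5 checksums and hypothetical `checksums` columns (`file`, `md5`); the metadata and `checksumdelay` logic are left out:

```r
check_inputdata <- function(checksums, root) {
  files <- file.path(root, checksums$file)
  actual <- unname(tools::md5sum(files))  # NA for missing files
  data.frame(
    file = checksums$file,
    present = !is.na(actual),
    intact = !is.na(actual) & actual == checksums$md5
  )
}
```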
Dedicated workflow (scripts/Rmarkdown): in repo n2khab-preprocessing plus other repos with data processing
Results: NOT to be written
In order to allow for checks (see 3.2) and further metadata, a definition of the data is needed:

- `datasources`: tabular file that defines the data sources (mirrors the worksheet 'data sources' in this googlesheet), with attributes like ID, n2khab repo, data owner, authoritative source location, and the relative local path where the data are to be expected
- `dataversions`: tabular file that defines the data versions (mirrors the worksheet 'data source versions' in this googlesheet), with attributes like sourceID, versionID, authoritative source location, fileserver data path, filename, and checksum
- a housekeeping workflow that updates `datasources` and `dataversions`, and which should be run on a regular basis
Needed functions: in package n2khab:

Functions that keep `datasources` and `dataversions` in sync with the mirror Google Sheet:

`write_datasources(outputdir)`

`write_dataversions(outputdir)`
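Such a sync function could be sketched as follows, e.g. with the `googlesheets4` and `git2rdata` packages; the worksheet name, sorting column and public readability of the sheet are assumptions:

```r
library(googlesheets4)
library(git2rdata)

write_datasources <- function(outputdir) {
  gs4_deauth()  # assuming the mirror Google Sheet is publicly readable
  datasources <- read_sheet("<sheet ID>", sheet = "data sources")
  # version the dataframe as a tabular git2rdata file under outputdir
  write_vc(datasources, file = "datasources", root = outputdir,
           sorting = "ID")
}
```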
Dedicated writing workflow (scripts/Rmarkdown): in repo n2khab
Results of the dedicated writing workflow: to be written into repo n2khab