Automate R script (#25)
* feat: developing algorithm training and evaluation module

* fix: minor bug fixes with paths and np array values retrieval

* feat: create initial fs_algo package

* feat: contain all training/eval into a single class

* feat: simplify evaluation file write and module import

* feat: add basic unit testing for AlgoTrainEval class

* feat: convert save dir structure creation and fsds dataset reader into modular functions inside package

* Update README.md

describe the unique dependencies

* feat: simplify aspects of attribute organization and combining with metrics

* feat: beginning to convert attribute wrangling into a class

* feat: add algorithm configuration file

* feat: established class for attribute configuration file, scripts functioning

* feat: add verbose option

* fix: update to warnings.warn()

* feat: building out additional unit tests for AttrConfigAndVars class

* chore: remove spaces

* feat: add unit test for fs_read_attr_comid

* feat: add UserWarnings and associated unit test

* feat: add unit tests for _find_feat_srce_id and fs_retr_nhdp_comids, and fix the associated functions where behavior didn't match expectations

* feat: add unit test for fs_save_algo_dir_struct

* feat: a basic unit test for _open_response_data_fsds

* chore: simplify algo script based on functionality moved into fs_algo_train_eval module

* doc: add sphinx documentation to _read_attr_config and fs_read_attr_comid

* doc: add sphinx-formatted documentation to the functions in the fs_algo_train_eval module; feat: move some hard-coded variables into the algorithm config file

* fix: change vars to attrs in AlgoTrainEval arg

* fix: added the new parameters that were hard-coded (test_size & seed)

* fix: swapped the train/test fractions to appropriate printout order

* feat: make sphinx documentation

* fix: reinstall sphinx docs for fsds_proc

* fix: remove unused path_camels

* fix: remove unused references to path_camels

* fix: update standard fsds_proc config files to create netcdf rather than csv; rename these files from schema to config

* doc: update config file documentation on preferred save_type

* doc: update description of yaml file's dataset

* fix: update config files with featureID and featureSource entries

* fix: change vars to attrs based on package's object name change

* fix: change logic to ensure config file read if dataset attribute read failed

* feat: add a raw data input checker/corrector for cases when nwissite gage ids are missing the leading 0

* fix: changed path_data to represent the raw input files containing corrected nwissite USGS gage ids (leading zeros)

* fix: added appropriate fillna for nwissite gage ids not needed to be corrected

* fix: adjust path check for attributes instead of algo

* doc: add descriptive notes on algo pre-processing and suggest future improvements for datasets not processed with fsds_proc with TODO

* doc: simplify attr_config, change dir_attrs to dir_db_attrs

* chore: add some additional hydroatlas and USGS NHD variables for consideration

* chore: add updated attribute variables to config files, based on top 5 variables considered by Bolotin et al 2022 SI work

* fix: add error handling when hydrofabric could not be downloaded for a given comid

* fix: avoid index error generated from attr_ddf_sub.shape[0].compute() by simply performing attr_ddf_sub.compute() first, which is needed anyway

* fix: change fs_read_attr_comid to return pd.DataFrame instead of dask df, and add checks ensuring 'value' data column being float type, check for no NA values present

* feat: add NA drop prior to train/test split

* feat: create a separate function that standardizes the algorithm file save path

* doc: add documentation to the std_algo_path func

* feat: create script to generate algo prediction data for testing

* feat: generating predictions from trained algos under dev

* feat: add processing of xssa locations, randomly selecting a subset to use for algo prediction

* feat: develop algo prediction's config ingest, and determine paths to prediction locations and trained algos

* feat: add config file path builder

* feat: create metric prediction and write results to file

* feat: build unit test for build_cfig_path()

* feat: build unit test for build_cfig_path()

* feat: add unit tests for std_pred_path and _read_pred_comid; test coverage now at 92%

* feat: add oob = True as default for RandomForestRegressor

* feat: add hyperparameterization capability using grid search and associated unit tests

* feat: add unit testing for train_eval()

* chore: change algo config for testing out hyperparameterization

* chore: add UserWarning category specification to warnings.warn

* fix: algo config assignment accidentally only looked at first line of params

* fix: make sure that hyperparameter key:value pairings contained inside dict, not list

* fix: adjust unit test's algo_config formats to represent the issue of a dict of a list, which the list_to_dict() function then converts

* fix: _check_attributes_exist now appropriately reports missing attributes and comids

* fix: ensure algo and pipeline keys contain algo and pipeline object types in the grid search case

* Update pkg/fs_algo/fs_algo/fs_algo_train_eval.py

Co-authored-by: LaurenBolotin-NOAA <[email protected]>

* Update pkg/fs_algo/fs_algo/fs_algo_train_eval.py

Co-authored-by: LaurenBolotin-NOAA <[email protected]>

* chore: Update README.md

Rename proc_fsds to fsds_proc

* fix: remove network hardcoding for lyrs in proc_attr_wrap call

* fix: rename ext to fileext since ext is a pre-defined object

* fix: change unit test use of ext to fileext

* feat: experimenting with attribute grabbing

* doc: revise function documentation for clarity

* chore: rename fsds to fs in all python-related files and config files

* chore: rename fsds_proc directory to fs_proc

* chore: rename additional fsds to fs

* chore: rename remaining fsds to fs

* doc: minor change to install instructions of fs_proc

* feat: add requirements for fs_algo package

* feat: add requirements.yml for conda environment of fs_algo/fs_proc python packages

* doc: add details on func for creating col_schema_df

* feat: add nwissite gage id leading zero checker as automated step

* fix: new line continuation in f-string messages related to nwis checker

* fix: update local config path and example in script

* doc: change install description for this package

* fix: modify logical test on elif featureSource == nwissite

* feat: update and add new unit testing that accommodates the check_fix_nwissite_gageids function

* fix: update temp directory assignment to work with non-Unix systems

* doc: minor adjustment for instructional example on running unit tests

* Make the change match the exact repo name

* Make changes match exact repo name

* doc: minor changes that will be removed: comid loc lookup

* fix: rename fsds to fs in files corresponding to proc.attr.hydfab R package

* feat: update R package with name change of fsds to fs

* chore: update fsds to fs in config files and R unit tests

* doc: update README from fsds to fs in non-url instances

* doc: Update README.md

Update hyperlinks and descriptions with latest fsds to fs change, and OWP repo location.

* Update README.md

doc: minor path fix

* chore: rename fsds_attrs_grab.R to fs_attrs_grab.R and add updated Rd documentation using fs instead of fsds

* doc: update arg name change of ext to fileext

* doc: remove commented out code and create delineations on code sections

* doc: correct mis-spellings

---------

Co-authored-by: LaurenBolotin-NOAA <[email protected]>
glitt13 and bolotinl authored Oct 24, 2024
1 parent 6e54890 commit a51de8d
Showing 7 changed files with 392 additions and 49 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -143,7 +143,7 @@ Attributes from non-standardized datasets may need to be acquired for RaFTS mode
Run [`flow.install.proc.attr.hydfab.R`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/flow.install.proc.attr.hydfab.R) to install the package. Note that a user may need to modify the section that creates the `fs_dir` for their custom path to this repo's directory.

## Usage - `proc.attr.hydfab`
-The following is an example script that runs the attribute grabber: [`fs_attrs_grab`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/fsds_attrs_grab.R).
+The following is an example script that runs the attribute grabber: [`fs_attrs_grab`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/fs_attrs_grab.R).

This script grabs attribute data corresponding to locations of interest, and saves those attribute data inside a directory as multiple parquet files. The `proc.attr.hydfab::retrieve_attr_exst()` function may then efficiently query and then retrieve desired data by variable name and comid from those parquet files.
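For context, a minimal sketch of that query step — assumed usage based on the `retrieve_attr_exst()` signature visible in the next file's diff; the comids, variable names, and directory below are hypothetical placeholders:

# Hedged sketch (assumed usage): query previously saved attribute parquet files
# by comid and variable name. All values below are hypothetical placeholders.
library(proc.attr.hydfab)

comids <- c("1520007", "1623207")                 # hypothetical USGS comids
vars <- c("TOT_BASIN_AREA", "pet_mm_s01")         # attribute variable names
dir_db_attrs <- "~/noaa/regionalization/data/input/attributes"

attrs_df <- proc.attr.hydfab::retrieve_attr_exst(comids = comids, vars = vars,
                                                 dir_db_attrs = dir_db_attrs)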
16 changes: 8 additions & 8 deletions pkg/proc.attr.hydfab/R/proc_attr_grabber.R
@@ -105,16 +105,16 @@ retrieve_attr_exst <- function(comids, vars, dir_db_attrs, bucket_conn=NA){
}


-proc_attr_std_hfsub_name <- function(comid,custom_name='', ext='gpkg'){
+proc_attr_std_hfsub_name <- function(comid,custom_name='', fileext='gpkg'){
  #' @title Standardidze hydrofabric subsetter's local filename
  #' @description Internal function that ensures consistent filename
  #' @param comid the USGS common identifier, generated by nhdplusTools
  #' @param custom_name Desired custom name following 'hydrofab_'
-  #' @param ext file extension of the hydrofrabric data. Default 'gpkg'
+  #' @param fileext file extension of the hydrofrabric data. Default 'gpkg'

  hfsub_fn <- base::gsub(pattern = paste0(custom_name,"__"),
                         replacement = "_",
-                         base::paste0('hydrofab_',custom_name,'_',comid,'.',ext))
+                         base::paste0('hydrofab_',custom_name,'_',comid,'.',fileext))
  return(hfsub_fn)
}

@@ -185,7 +185,7 @@ proc_attr_usgs_nhd <- function(comid,usgs_vars){
}


-proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg',
+proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",fileext = 'gpkg',
                         lyrs=c('divides','network')[2],
                         hf_cat_sel=TRUE, overwrite=FALSE){
@@ -195,7 +195,7 @@ proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg'
  #' @param comid character class. The common identifier USGS location code for a surface water feature.
  #' @param dir_db_hydfab character class. Local directory path for storing hydrofabric data
  #' @param custom_name character class. A custom name to insert into hydrofabric file. Default \code{glue("{lyrs}_")}
-  #' @param ext character class. file extension of hydrofabric file. Default 'gpkg'
+  #' @param fileext character class. file extension of hydrofabric file. Default 'gpkg'
  #' @param lyrs character class. The layer name(s) of interest from hydrofabric. Default 'network'.
  #' @param hf_cat_sel boolean. TRUE for a total catchment characterization specific to a single comid, FALSE (or anything else) for all subcatchments
  #' @param overwrite boolean. Overwrite local data when pulling from hydrofabric s3 bucket? Default FALSE.
@@ -204,7 +204,7 @@ proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg'
  # Build the hydfab filepath
  name_file <- proc.attr.hydfab:::proc_attr_std_hfsub_name(comid=comid,
                                custom_name=glue::glue('{lyrs}_'),
-                                ext=ext)
+                                fileext=fileext)
  fp_cat <- base::file.path(dir_db_hydfab, name_file)

  if(!base::dir.exists(dir_db_hydfab)){
@@ -225,7 +225,7 @@

  # Read the hydrofabric file gpkg for each layer
  hfab_ls <- list()
-  if (ext == 'gpkg') {
+  if (fileext == 'gpkg') {
    # Define layers
    layers <- sf::st_layers(dsn = fp_cat)
    for (lyr in layers$name){
@@ -469,7 +469,7 @@ proc_attr_gageids <- function(gage_ids,featureSource,featureID,Retr_Params,
      # Retrieve the variables corresponding to datasets of interest & update database
      loc_attrs <- try(proc.attr.hydfab::proc_attr_wrap(comid=comid,
                                    Retr_Params=Retr_Params,
-                                    lyrs='network',overwrite=FALSE))
+                                    lyrs=lyrs,overwrite=FALSE))
      if("try-error" %in% class(loc_attrs)){
        message(glue::glue("Skipping gage_id {gage_id} corresponding to comid {comid}"))
      }
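For orientation, a hedged sketch of calling `proc_attr_hf()` with the renamed `fileext` argument — assumed usage mirroring the signature above; the comid and directory are hypothetical placeholders:

# Hedged sketch (assumed usage): pull hydrofabric data for a single comid,
# saving it locally as a .gpkg file. Comid and directory are hypothetical.
hfab <- proc.attr.hydfab::proc_attr_hf(comid = "1520007",
                                       dir_db_hydfab = "~/noaa/regionalization/data/input/hydrofabric",
                                       custom_name = "{lyrs}_",
                                       fileext = "gpkg",
                                       lyrs = "network",
                                       hf_cat_sel = TRUE,
                                       overwrite = FALSE)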
91 changes: 56 additions & 35 deletions pkg/proc.attr.hydfab/flow/fs_attrs_grab.R
@@ -12,6 +12,7 @@
#' term 'gage_id' is used as a variable in glue syntax to create featureID
#' @seealso [fs_proc] A python package that processes input data for the
#' formulation-selector
+#' @usage Rscript fs_attrs_grab.R "/path/to/attribute_config.yaml"

# Changelog / Contributions
#   2024-07-24 Originally created, GL
@@ -21,61 +22,81 @@
library(yaml)
library(ncdf4)
library(proc.attr.hydfab)
library(glue)

-# TODO is AWS_NO_SIGN_REQUEST necessary??
-# Sys.setenv(AWS_NO_SIGN_REQUEST="YES")
-
-# TODO create config yaml.
-
-# TODO read in config yaml, must populate NA for items that are empty.
-
-# Define input directory:
-# TODO change this to reading the standardized metadata, not the generated data
+# Define command line argument
+cmd_args <- commandArgs("trailingOnly" = TRUE)

-# raw_config <- yaml::read_yaml("/Users/guylitt/git/formulation-selector/scripts/eval_ingest/xssa/xssa_attr_config.yaml")
-raw_config <- yaml::read_yaml("/Users/guylitt/git/formulation-selector/scripts/eval_ingest/SI/SI_attr_config.yaml")
+if(base::length(cmd_args)!=1){
+  warning("Unexpected to have more than one argument in Rscript fs_attrs_grab.R /path/to/attribute_config.yaml.")
+}
+
+# Read in config file, e.g. "~/git/formulation-selector/scripts/eval_ingest/SI/SI_attr_config.yaml"
+path_attr_config <- cmd_args[1] # "~/git/formulation-selector/scripts/eval_ingest/xssa/xssa_attr_config.yaml"
+raw_config <- yaml::read_yaml(path_attr_config)

-datasets <- ds <- raw_config$formulation_metadata[[grep("datasets",raw_config$formulation_metadata)]]$datasets #c("juliemai-xSSA",'all')[1] # A listing of datasets to grab attributes. Dataset names match what is inside dir_std_base. 'all' processes all datasets inside dir_std_base.
-#ds_nc_filenames <- c('juliemai-xSSA_Raven_blended.nc','*.nc')[1]
+# A listing of datasets to grab attributes. Dataset names match what is inside dir_std_base. 'all' processes all datasets inside dir_std_base.
+datasets <- raw_config$formulation_metadata[[grep("datasets",
+                          raw_config$formulation_metadata)]]$datasets #c("juliemai-xSSA",'all')[1]

+# Define directory paths from the config file
home_dir <- Sys.getenv("HOME")
-dir_base <- file.path(home_dir,'noaa','regionalization','data')
-
-dir_std_base <- file.path(dir_base,"input","user_data_std") # The location of standardized data generated by fs_proc python package
-dir_db_hydfab <- file.path(dir_base,'input','hydrofabric') # The local dir where hydrofabric data are stored to limit s3 connections
-dir_db_attrs <- file.path(dir_base,'input','attributes') # The parent dir where each comid's attribute parquet file is stored in the subdirectory 'comid/', and each dataset's aggregated parquet attributes are stored in the subdirectory '/{dataset_name}
-
-
-s3_base <- "s3://lynker-spatial/tabular-resources" # s3 path containing hydrofabric-formatted attribute datasets
-s3_bucket <- 'lynker-spatial' # s3 bucket containing hydrofabric data
-
-s3_path_hydatl <- glue::glue('{s3_base}/hydroATLAS/hydroatlas_vars.parquet') # path to hydroatlas data formatted for hydrofabric
+dir_base <- glue::glue(base::unlist(raw_config$file_io)[['dir_base']]) #file.path(home_dir,'noaa','regionalization','data')
+dir_std_base <- glue::glue(base::unlist(raw_config$file_io)[['dir_std_base']]) #file.path(dir_base,"input","user_data_std") # The location of standardized data generated by fs_proc python package
+dir_db_hydfab <- glue::glue(base::unlist(raw_config$file_io)[['dir_db_hydfab']]) # file.path(dir_base,'input','hydrofabric') # The local dir where hydrofabric data are stored to limit s3 connections
+dir_db_attrs <- glue::glue(base::unlist(raw_config$file_io)[['dir_db_attrs']]) # file.path(dir_base,'input','attributes') # The parent dir where each comid's attribute parquet file is stored in the subdirectory 'comid/', and each dataset's aggregated parquet attributes are stored in the subdirectory '/{dataset_name}

+# Read s3 connection details
+s3_base <- base::unlist(raw_config$hydfab_config)[['s3_base']] # "s3://lynker-spatial/tabular-resources" # s3 path containing hydrofabric-formatted attribute datasets
+s3_bucket <- base::unlist(raw_config$hydfab_config)[['s3_bucket']] # 'lynker-spatial' # s3 bucket containing hydrofabric data
+
+# s3 path to hydroatlas data formatted for hydrofabric
+if ("s3_path_hydatl" %in% names(base::unlist(raw_config$attr_select))){
+  s3_path_hydatl <- glue::glue(base::unlist(raw_config$attr_select)[['s3_path_hydatl']]) # glue::glue('{s3_base}/hydroATLAS/hydroatlas_vars.parquet')
+} else {
+  s3_path_hydatl <- NULL
+}

# Additional config options
-hf_cat_sel <- c("total","all")[1] # total: interested in the single location's aggregated catchment data; all: all subcatchments of interest
-ext <- 'gpkg'
-attr_sources <- c("hydroatlas","usgs") # "streamcat",
-# TODO communicate to user that these are standardized variable names
-ha_vars <- c('pet_mm_s01', 'cly_pc_sav', 'cly_pc_uav','cly_pc_sav','ari_ix_sav') # hydroatlas variables
-sc_vars <- c() # TODO look up variables. May need to select datasets first
-usgs_vars <- c('TOT_TWI','TOT_PRSNOW','TOT_POPDENS90','TOT_EWT','TOT_RECHG','TOT_PPT7100_ANN','TOT_AET','TOT_PET','TOT_SILTAVE','TOT_BASIN_AREA','TOT_BASIN_SLOPE','TOT_ELEV_MEAN','TOT_ELEV_MAX','TOT_Intensity','TOT_Wet','TOT_Dry' ) # list of variables retrievable using nhdplusTools::get_characteristics_metadata()
+hf_cat_sel <- base::unlist(raw_config$hydfab_config)[['hf_cat_sel']] #c("total","all")[1] # total: interested in the single location's aggregated catchment data; all: all subcatchments of interest
+ext <- base::unlist(raw_config$hydfab_config)[['ext']] # 'gpkg'

+#-----------------------------------------------------
+# Variable listings:
+names_attr_sel <- base::unlist(base::lapply(raw_config$attr_select,
+                                            function(x) base::names(x)))
+
+# Transform into single named list of lists rather than nested sublists
+idxs_vars <- base::grep("_vars", names_attr_sel)
+var_names <- names_attr_sel[idxs_vars]
+sub_attr_sel <- base::lapply(idxs_vars, function(i)
+                             raw_config$attr_select[[i]][[1]])
+base::names(sub_attr_sel) <- var_names
+
+# Subset to only those non-null variables:
+sub_attr_sel <- sub_attr_sel[base::unlist(base::lapply(sub_attr_sel,
+                             function(x) base::any(!base::is.null(unlist(x)))))]
+var_names_sub <- names(sub_attr_sel)
+#-----------------------------------------------------
+message(glue::glue("Attribute dataset sources include the following:\n
+                   {paste0(var_names_sub,collapse='\n')}"))

-# TODO generate this listing structure based on what is provided in yaml config
-# & accounting for empty entries
+message(glue::glue("Attribute variables to be acquired include :\n
+                   {paste0(sub_attr_sel,collapse='\n')}"))

-Retr_Params <- list(paths = list(# Note that if a path is provided, ensure the
+Retr_Params <- base::list(paths = base::list(
+                          # Note that if a path is provided, ensure the
                          # name includes 'path'. Same for directory having variable name with 'dir'
                          dir_db_hydfab=dir_db_hydfab,
                          dir_db_attrs=dir_db_attrs,
                          s3_path_hydatl = s3_path_hydatl,
                          dir_std_base = dir_std_base),
-                    vars = list(usgs_vars = usgs_vars,
-                                ha_vars = ha_vars,
-                                sc_vars = sc_vars),
+                    vars = sub_attr_sel,
                    datasets = datasets
                    )
# PROCESS ATTRIBUTES

ls_comids <- proc.attr.hydfab:::grab_attrs_datasets_fs_wrap(Retr_Params,overwrite = TRUE)
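With this change the script is driven entirely by the config file passed on the command line. The sketch below is a hypothetical, minimal attribute config illustrating the keys the revised script reads (`formulation_metadata`, `file_io`, `hydfab_config`, `attr_select`); all values are placeholders, and the example configs under scripts/eval_ingest define the authoritative schema:

# Hypothetical minimal attribute config (placeholder values; structure inferred
# from how the script above unlists each section).
formulation_metadata:
  - datasets: 'juliemai-xSSA'    # or 'all' to process everything in dir_std_base
file_io:
  - dir_base: '{home_dir}/noaa/regionalization/data'   # {home_dir} expanded via glue
  - dir_std_base: '{dir_base}/input/user_data_std'
  - dir_db_hydfab: '{dir_base}/input/hydrofabric'
  - dir_db_attrs: '{dir_base}/input/attributes'
hydfab_config:
  - s3_base: 's3://lynker-spatial/tabular-resources'
  - s3_bucket: 'lynker-spatial'
  - hf_cat_sel: 'total'
  - ext: 'gpkg'
attr_select:
  - s3_path_hydatl: '{s3_base}/hydroATLAS/hydroatlas_vars.parquet'
  - ha_vars:
      - 'pet_mm_s01'
  - usgs_vars:
      - 'TOT_TWI'
      - 'TOT_BASIN_AREA'

The script would then be invoked as documented in the new `@usage` line:

Rscript fs_attrs_grab.R "/path/to/attribute_config.yaml"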
4 changes: 2 additions & 2 deletions pkg/proc.attr.hydfab/man/proc_attr_hf.Rd

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions pkg/proc.attr.hydfab/man/proc_attr_std_hfsub_name.Rd

Some generated files are not rendered by default.

@@ -292,7 +292,7 @@ testthat::test_that("proc_attr_usgs_nhd", {

testthat::test_that("proc_attr_hf not a comid",{
  testthat::expect_error(proc.attr.hydfab::proc_attr_hf(comid="13Notacomid14", dir_db_hydfab,
-                                custom_name="{lyrs}_",ext = 'gpkg',
+                                custom_name="{lyrs}_",fileext = 'gpkg',
                                lyrs=c('divides','network')[2],
                                hf_cat_sel=TRUE, overwrite=FALSE))
})
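To verify the rename locally, a hedged sketch of running this test file with testthat — the file path below is assumed from the package layout, not confirmed by this diff:

# Hedged sketch: run the proc_attr_grabber tests after the ext -> fileext rename.
# The test file path is an assumption, not shown in this commit.
library(testthat)
testthat::test_file("pkg/proc.attr.hydfab/tests/testthat/test_proc_attr_grabber.R")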