Automate R script (#25)
* feat: developing algorithm training and evaluation module

* fix: minor bug fixes with paths and np array values retrieval

* feat: create initial fs_algo package

* feat: contain all training/eval into a single class

* feat: simplify evaluation file write and module import

* feat: add basic unit testing for AlgoTrainEval class

* feat: convert save dir structure creation and fsds dataset reader into modular functions inside package

* Update README.md

describe the unique dependencies

* feat: simplify aspects of attribute organization and combining with metrics

* feat: beginning to convert attribute wrangling into a class

* feat: add algorithm configuration file

* feat: established class for attribute configuration file, scripts functioning

* feat: add verbose option

* fix: update to warnings.warn()

* feat: building out additional unit tests for AttrConfigAndVars class

* chore: remove spaces

* feat: add unit test for fs_read_attr_comid

* feat: add UserWarnings and associated unit test

* feat: add unit tests for _find_feat_srce_id and fs_retr_nhdp_comids, and fix the associated functions where behavior didn't match expectations

* feat: add unit test for fs_save_algo_dir_struct

* feat: a basic unit test for _open_response_data_fsds

* chore: simplify algo script based on functionality moved into fs_algo_train_eval module

* doc: add sphinx documentation to _read_attr_config and fs_read_attr_comid

* doc: add sphinx-formatted documentation to the functions in the fs_algo_train_eval module; feat: move some hard-coded variables into the algorithm config file

* fix: change vars to attrs in AlgoTrainEval arg

* fix: added the new parameters that were hard-coded (test_size & seed)

* fix: swapped the train/test fractions to appropriate printout order

* feat: make sphinx documentation

* fix: reinstall sphinx docs for fsds_proc

* fix: remove unused path_camels

* fix: remove unused references to path_camels

* fix: update standard fsds_proc config files to create netcdf rather than csv; rename these files from schema to config

* doc: update config file documentation on preferred save_type

* doc: update description of yaml file's dataset

* fix: update config files with featureID and featureSource entries

* fix: change vars to attrs based on package's object name change

* fix: change logic to ensure config file read if dataset attribute read failed

* feat: add a raw data input checker/corrector for cases when nwissite gage ids are missing the leading 0

* fix: changed path_data to represent the raw input files containing corrected nwissite USGS gage ids (leading zeros)

* fix: added appropriate fillna for nwissite gage ids not needed to be corrected

* fix: adjust path check for attributes instead of algo

* doc: add descriptive notes on algo pre-processing and suggest future improvements for datasets not processed with fsds_proc with TODO

* doc: simplify attr_config, change dir_attrs to dir_db_attrs

* chore: add some additional hydroatlas and USGS NHD variables for consideration

* chore: add updated attribute variables to config files, based on top 5 variables considered by Bolotin et al 2022 SI work

* fix: add error handling when hydrofabric could not be downloaded for a given comid

* fix: avoid index error generated from attr_ddf_sub.shape[0].compute() by simply performing attr_ddf_sub.compute() first, which is needed anyway

* fix: change fs_read_attr_comid to return pd.DataFrame instead of dask df, and add checks ensuring 'value' data column being float type, check for no NA values present

* feat: add NA drop prior to train/test split

* feat: create a separate function that standardizes the algorithm file save path

* doc: add documentation to the std_algo_path func

* feat: create script to generate algo prediction data for testing

* feat: generating predictions from trained algos under dev

* feat: add processing of xssa locations, randomly selecting a subset to use for algo prediction

* feat: develop algo prediction's config ingest, and determine paths to prediction locations and trained algos

* feat: add config file path builder

* feat: create metric prediction and write results to file

* feat: build unit test for build_cfig_path()

* feat: build unit test for build_cfig_path()

* feat: add unit tests for std_pred_path and _read_pred_comid; test coverage now at 92%

* feat: add oob = True as default for RandomForestRegressor

* feat: add hyperparameterization capability using grid search and associated unit tests

* feat: add unit testing for train_eval()

* chore: change algo config for testing out hyperparameterization

* chore: add UserWarning category specification to warnings.warn

* fix: algo config assignment accidentally only looked at first line of params

* fix: make sure that hyperparameter key:value pairings contained inside dict, not list

* fix: adjust unit test's algo_config formats to represent the issue of a dict of a list, which the list_to_dict() function then converts

* fix: _check_attributes_exist now appropriately reports missing attributes and comids

* fix: ensure algo and pipeline keys contain algo and pipeline object types in the grid search case

* Update pkg/fs_algo/fs_algo/fs_algo_train_eval.py

Co-authored-by: LaurenBolotin-NOAA <[email protected]>

* Update pkg/fs_algo/fs_algo/fs_algo_train_eval.py

Co-authored-by: LaurenBolotin-NOAA <[email protected]>

* chore: Update README.md

Rename proc_fsds to fsds_proc

* fix: remove network hardcoding for lyrs in proc_attr_wrap call

* fix: rename ext to fileext since ext is a pre-defined object

* fix: change unit test use of ext to fileext

* feat: experimenting with attribute grabbing

* doc: revise function documentation for clarity

* chore: rename fsds to fs in all python-related files and config files

* chore: rename fsds_proc directory to fs_proc

* chore: rename additional fsds to fs

* chore: rename remaining fsds to fs

* doc: minor change to install instructions of fs_proc

* feat: add requirements for fs_algo package

* feat: add requirements.yml for conda environment of fs_algo/fs_proc python packages

* doc: add details on func for creating col_schema_df

* feat: add nwissite gage id leading zero checker as automated step

* fix: new line continuation in f-string messages related to nwis checker

* fix: update local config path and example in script

* doc: change install description for this package

* fix: modify logical test on elif featureSource == nwissite

* feat: update and add new unit testing that accommodates the check_fix_nwissite_gageids function

* fix: update temp directory assignment to work with non-Unix systems

* doc: minor adjustment for instructional example on running unit tests

* Make the change match the exact repo name

* Make changes match exact repo name

* doc: minor changes that will be removed: comid loc lookup

* fix: rename fsds to fs in files corresponding to proc.attr.hydfab R package

* feat: update R package with name change of fsds to fs

* chore: update fsds to fs in config files and R unit tests

* doc: update README from fsds to fs in non-url instances

* doc: Update README.md

Update hyperlinks and descriptions with latest fsds to fs change, and OWP repo location.

* Update README.md

doc: minor path fix

* chore: rename fsds_attrs_grab.R to fs_attrs_grab.R and add updated Rd documentation using fs instead of fsds

* doc: update arg name change of ext to fileext

* doc: remove commented out code and create delineations on code sections

* doc: correct mis-spellings

---------

Co-authored-by: LaurenBolotin-NOAA <[email protected]>
glitt13 and bolotinl authored Oct 24, 2024
1 parent 6e54890 commit a51de8d
Showing 7 changed files with 392 additions and 49 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -143,7 +143,7 @@ Attributes from non-standardized datasets may need to be acquired for RaFTS mode
Run [`flow.install.proc.attr.hydfab.R`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/flow.install.proc.attr.hydfab.R) to install the package. Note that a user may need to modify the section that creates the `fs_dir` for their custom path to this repo's directory.

## Usage - `proc.attr.hydfab`
-The following is an example script that runs the attribute grabber: [`fs_attrs_grab`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/fsds_attrs_grab.R).
+The following is an example script that runs the attribute grabber: [`fs_attrs_grab`](https://github.com/NOAA-OWP/formulation-selector/blob/main/pkg/proc.attr.hydfab/flow/fs_attrs_grab.R).

This script grabs attribute data corresponding to locations of interest, and saves those attribute data inside a directory as multiple parquet files. The `proc.attr.hydfab::retrieve_attr_exst()` function may then efficiently query and then retrieve desired data by variable name and comid from those parquet files.
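For context, a minimal sketch of that query step — assumed usage based on the `retrieve_attr_exst()` signature visible in the next file's diff; the comids, variable names, and directory below are hypothetical placeholders:

# Hedged sketch (assumed usage): query previously saved attribute parquet files
# by comid and variable name. All values below are hypothetical placeholders.
library(proc.attr.hydfab)

comids <- c("1520007", "1623207")                 # hypothetical USGS comids
vars <- c("TOT_BASIN_AREA", "pet_mm_s01")         # attribute variable names
dir_db_attrs <- "~/noaa/regionalization/data/input/attributes"

attrs_df <- proc.attr.hydfab::retrieve_attr_exst(comids = comids, vars = vars,
                                                 dir_db_attrs = dir_db_attrs)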
16 changes: 8 additions & 8 deletions pkg/proc.attr.hydfab/R/proc_attr_grabber.R
@@ -105,16 +105,16 @@ retrieve_attr_exst <- function(comids, vars, dir_db_attrs, bucket_conn=NA){
}


-proc_attr_std_hfsub_name <- function(comid,custom_name='', ext='gpkg'){
+proc_attr_std_hfsub_name <- function(comid,custom_name='', fileext='gpkg'){
  #' @title Standardidze hydrofabric subsetter's local filename
  #' @description Internal function that ensures consistent filename
  #' @param comid the USGS common identifier, generated by nhdplusTools
  #' @param custom_name Desired custom name following 'hydrofab_'
-  #' @param ext file extension of the hydrofrabric data. Default 'gpkg'
+  #' @param fileext file extension of the hydrofrabric data. Default 'gpkg'

  hfsub_fn <- base::gsub(pattern = paste0(custom_name,"__"),
                         replacement = "_",
-                         base::paste0('hydrofab_',custom_name,'_',comid,'.',ext))
+                         base::paste0('hydrofab_',custom_name,'_',comid,'.',fileext))
  return(hfsub_fn)
}

@@ -185,7 +185,7 @@ proc_attr_usgs_nhd <- function(comid,usgs_vars){
}


-proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg',
+proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",fileext = 'gpkg',
                         lyrs=c('divides','network')[2],
                         hf_cat_sel=TRUE, overwrite=FALSE){
@@ -195,7 +195,7 @@ proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg'
  #' @param comid character class. The common identifier USGS location code for a surface water feature.
  #' @param dir_db_hydfab character class. Local directory path for storing hydrofabric data
  #' @param custom_name character class. A custom name to insert into hydrofabric file. Default \code{glue("{lyrs}_")}
-  #' @param ext character class. file extension of hydrofabric file. Default 'gpkg'
+  #' @param fileext character class. file extension of hydrofabric file. Default 'gpkg'
  #' @param lyrs character class. The layer name(s) of interest from hydrofabric. Default 'network'.
  #' @param hf_cat_sel boolean. TRUE for a total catchment characterization specific to a single comid, FALSE (or anything else) for all subcatchments
  #' @param overwrite boolean. Overwrite local data when pulling from hydrofabric s3 bucket? Default FALSE.
@@ -204,7 +204,7 @@ proc_attr_hf <- function(comid, dir_db_hydfab,custom_name="{lyrs}_",ext = 'gpkg'
  # Build the hydfab filepath
  name_file <- proc.attr.hydfab:::proc_attr_std_hfsub_name(comid=comid,
                                custom_name=glue::glue('{lyrs}_'),
-                                ext=ext)
+                                fileext=fileext)
  fp_cat <- base::file.path(dir_db_hydfab, name_file)

  if(!base::dir.exists(dir_db_hydfab)){
@@ -225,7 +225,7 @@

  # Read the hydrofabric file gpkg for each layer
  hfab_ls <- list()
-  if (ext == 'gpkg') {
+  if (fileext == 'gpkg') {
    # Define layers
    layers <- sf::st_layers(dsn = fp_cat)
    for (lyr in layers$name){
@@ -469,7 +469,7 @@ proc_attr_gageids <- function(gage_ids,featureSource,featureID,Retr_Params,
      # Retrieve the variables corresponding to datasets of interest & update database
      loc_attrs <- try(proc.attr.hydfab::proc_attr_wrap(comid=comid,
                                    Retr_Params=Retr_Params,
-                                    lyrs='network',overwrite=FALSE))
+                                    lyrs=lyrs,overwrite=FALSE))
      if("try-error" %in% class(loc_attrs)){
        message(glue::glue("Skipping gage_id {gage_id} corresponding to comid {comid}"))
      }
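For orientation, a hedged sketch of calling `proc_attr_hf()` with the renamed `fileext` argument — assumed usage mirroring the signature above; the comid and directory are hypothetical placeholders:

# Hedged sketch (assumed usage): pull hydrofabric data for a single comid,
# saving it locally as a .gpkg file. Comid and directory are hypothetical.
hfab <- proc.attr.hydfab::proc_attr_hf(comid = "1520007",
                                       dir_db_hydfab = "~/noaa/regionalization/data/input/hydrofabric",
                                       custom_name = "{lyrs}_",
                                       fileext = "gpkg",
                                       lyrs = "network",
                                       hf_cat_sel = TRUE,
                                       overwrite = FALSE)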
91 changes: 56 additions & 35 deletions pkg/proc.attr.hydfab/flow/fs_attrs_grab.R
@@ -12,6 +12,7 @@
#' term 'gage_id' is used as a variable in glue syntax to create featureID
#' @seealso [fs_proc] A python package that processes input data for the
#' formulation-selector
+#' @usage Rscript fs_attrs_grab.R "/path/to/attribute_config.yaml"

# Changelog / Contributions
#   2024-07-24 Originally created, GL
@@ -21,61 +22,81 @@
library(yaml)
library(ncdf4)
library(proc.attr.hydfab)
library(glue)

-# TODO is AWS_NO_SIGN_REQUEST necessary??
-# Sys.setenv(AWS_NO_SIGN_REQUEST="YES")
-
-# TODO create config yaml.
-
-# TODO read in config yaml, must populate NA for items that are empty.
-
-# Define input directory:
-# TODO change this to reading the standardized metadata, not the generated data
+# Define command line argument
+cmd_args <- commandArgs("trailingOnly" = TRUE)

-# raw_config <- yaml::read_yaml("/Users/guylitt/git/formulation-selector/scripts/eval_ingest/xssa/xssa_attr_config.yaml")
-raw_config <- yaml::read_yaml("/Users/guylitt/git/formulation-selector/scripts/eval_ingest/SI/SI_attr_config.yaml")
+if(base::length(cmd_args)!=1){
+  warning("Unexpected to have more than one argument in Rscript fs_attrs_grab.R /path/to/attribute_config.yaml.")
+}
+
+# Read in config file, e.g. "~/git/formulation-selector/scripts/eval_ingest/SI/SI_attr_config.yaml"
+path_attr_config <- cmd_args[1] # "~/git/formulation-selector/scripts/eval_ingest/xssa/xssa_attr_config.yaml"
+raw_config <- yaml::read_yaml(path_attr_config)

-datasets <- ds <- raw_config$formulation_metadata[[grep("datasets",raw_config$formulation_metadata)]]$datasets #c("juliemai-xSSA",'all')[1] # A listing of datasets to grab attributes. Dataset names match what is inside dir_std_base. 'all' processes all datasets inside dir_std_base.
-#ds_nc_filenames <- c('juliemai-xSSA_Raven_blended.nc','*.nc')[1]
+# A listing of datasets to grab attributes. Dataset names match what is inside dir_std_base. 'all' processes all datasets inside dir_std_base.
+datasets <- raw_config$formulation_metadata[[grep("datasets",
+                          raw_config$formulation_metadata)]]$datasets #c("juliemai-xSSA",'all')[1]

+# Define directory paths from the config file
home_dir <- Sys.getenv("HOME")
-dir_base <- file.path(home_dir,'noaa','regionalization','data')
-
-dir_std_base <- file.path(dir_base,"input","user_data_std") # The location of standardized data generated by fs_proc python package
-dir_db_hydfab <- file.path(dir_base,'input','hydrofabric') # The local dir where hydrofabric data are stored to limit s3 connections
-dir_db_attrs <- file.path(dir_base,'input','attributes') # The parent dir where each comid's attribute parquet file is stored in the subdirectory 'comid/', and each dataset's aggregated parquet attributes are stored in the subdirectory '/{dataset_name}
-
-
-s3_base <- "s3://lynker-spatial/tabular-resources" # s3 path containing hydrofabric-formatted attribute datasets
-s3_bucket <- 'lynker-spatial' # s3 bucket containing hydrofabric data
-
-s3_path_hydatl <- glue::glue('{s3_base}/hydroATLAS/hydroatlas_vars.parquet') # path to hydroatlas data formatted for hydrofabric
+dir_base <- glue::glue(base::unlist(raw_config$file_io)[['dir_base']]) #file.path(home_dir,'noaa','regionalization','data')
+dir_std_base <- glue::glue(base::unlist(raw_config$file_io)[['dir_std_base']]) #file.path(dir_base,"input","user_data_std") # The location of standardized data generated by fs_proc python package
+dir_db_hydfab <- glue::glue(base::unlist(raw_config$file_io)[['dir_db_hydfab']]) # file.path(dir_base,'input','hydrofabric') # The local dir where hydrofabric data are stored to limit s3 connections
+dir_db_attrs <- glue::glue(base::unlist(raw_config$file_io)[['dir_db_attrs']]) # file.path(dir_base,'input','attributes') # The parent dir where each comid's attribute parquet file is stored in the subdirectory 'comid/', and each dataset's aggregated parquet attributes are stored in the subdirectory '/{dataset_name}

+# Read s3 connection details
+s3_base <- base::unlist(raw_config$hydfab_config)[['s3_base']] # "s3://lynker-spatial/tabular-resources" # s3 path containing hydrofabric-formatted attribute datasets
+s3_bucket <- base::unlist(raw_config$hydfab_config)[['s3_bucket']] # 'lynker-spatial' # s3 bucket containing hydrofabric data
+
+# s3 path to hydroatlas data formatted for hydrofabric
+if ("s3_path_hydatl" %in% names(base::unlist(raw_config$attr_select))){
+  s3_path_hydatl <- glue::glue(base::unlist(raw_config$attr_select)[['s3_path_hydatl']]) # glue::glue('{s3_base}/hydroATLAS/hydroatlas_vars.parquet')
+} else {
+  s3_path_hydatl <- NULL
+}

# Additional config options
-hf_cat_sel <- c("total","all")[1] # total: interested in the single location's aggregated catchment data; all: all subcatchments of interest
-ext <- 'gpkg'
-attr_sources <- c("hydroatlas","usgs") # "streamcat",
-# TODO communicate to user that these are standardized variable names
-ha_vars <- c('pet_mm_s01', 'cly_pc_sav', 'cly_pc_uav','cly_pc_sav','ari_ix_sav') # hydroatlas variables
-sc_vars <- c() # TODO look up variables. May need to select datasets first
-usgs_vars <- c('TOT_TWI','TOT_PRSNOW','TOT_POPDENS90','TOT_EWT','TOT_RECHG','TOT_PPT7100_ANN','TOT_AET','TOT_PET','TOT_SILTAVE','TOT_BASIN_AREA','TOT_BASIN_SLOPE','TOT_ELEV_MEAN','TOT_ELEV_MAX','TOT_Intensity','TOT_Wet','TOT_Dry' ) # list of variables retrievable using nhdplusTools::get_characteristics_metadata()
+hf_cat_sel <- base::unlist(raw_config$hydfab_config)[['hf_cat_sel']] #c("total","all")[1] # total: interested in the single location's aggregated catchment data; all: all subcatchments of interest
+ext <- base::unlist(raw_config$hydfab_config)[['ext']] # 'gpkg'

+#-----------------------------------------------------
+# Variable listings:
+names_attr_sel <- base::unlist(base::lapply(raw_config$attr_select,
+                                            function(x) base::names(x)))
+
+# Transform into single named list of lists rather than nested sublists
+idxs_vars <- base::grep("_vars", names_attr_sel)
+var_names <- names_attr_sel[idxs_vars]
+sub_attr_sel <- base::lapply(idxs_vars, function(i)
+                             raw_config$attr_select[[i]][[1]])
+base::names(sub_attr_sel) <- var_names
+
+# Subset to only those non-null variables:
+sub_attr_sel <- sub_attr_sel[base::unlist(base::lapply(sub_attr_sel,
+                             function(x) base::any(!base::is.null(unlist(x)))))]
+var_names_sub <- names(sub_attr_sel)
+#-----------------------------------------------------
+message(glue::glue("Attribute dataset sources include the following:\n
+                   {paste0(var_names_sub,collapse='\n')}"))

-# TODO generate this listing structure based on what is provided in yaml config
-# & accounting for empty entries
+message(glue::glue("Attribute variables to be acquired include :\n
+                   {paste0(sub_attr_sel,collapse='\n')}"))

-Retr_Params <- list(paths = list(# Note that if a path is provided, ensure the
+Retr_Params <- base::list(paths = base::list(
+                          # Note that if a path is provided, ensure the
                          # name includes 'path'. Same for directory having variable name with 'dir'
                          dir_db_hydfab=dir_db_hydfab,
                          dir_db_attrs=dir_db_attrs,
                          s3_path_hydatl = s3_path_hydatl,
                          dir_std_base = dir_std_base),
-                    vars = list(usgs_vars = usgs_vars,
-                                ha_vars = ha_vars,
-                                sc_vars = sc_vars),
+                    vars = sub_attr_sel,
                    datasets = datasets
                    )
# PROCESS ATTRIBUTES

ls_comids <- proc.attr.hydfab:::grab_attrs_datasets_fs_wrap(Retr_Params,overwrite = TRUE)
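With this change the script is driven entirely by the config file passed on the command line. The sketch below is a hypothetical, minimal attribute config illustrating the keys the revised script reads (`formulation_metadata`, `file_io`, `hydfab_config`, `attr_select`); all values are placeholders, and the example configs under scripts/eval_ingest define the authoritative schema:

# Hypothetical minimal attribute config (placeholder values; structure inferred
# from how the script above unlists each section).
formulation_metadata:
  - datasets: 'juliemai-xSSA'    # or 'all' to process everything in dir_std_base
file_io:
  - dir_base: '{home_dir}/noaa/regionalization/data'   # {home_dir} expanded via glue
  - dir_std_base: '{dir_base}/input/user_data_std'
  - dir_db_hydfab: '{dir_base}/input/hydrofabric'
  - dir_db_attrs: '{dir_base}/input/attributes'
hydfab_config:
  - s3_base: 's3://lynker-spatial/tabular-resources'
  - s3_bucket: 'lynker-spatial'
  - hf_cat_sel: 'total'
  - ext: 'gpkg'
attr_select:
  - s3_path_hydatl: '{s3_base}/hydroATLAS/hydroatlas_vars.parquet'
  - ha_vars:
      - 'pet_mm_s01'
  - usgs_vars:
      - 'TOT_TWI'
      - 'TOT_BASIN_AREA'

The script would then be invoked as documented in the new `@usage` line:

Rscript fs_attrs_grab.R "/path/to/attribute_config.yaml"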
4 changes: 2 additions & 2 deletions pkg/proc.attr.hydfab/man/proc_attr_hf.Rd

Some generated files are not rendered by default.

4 changes: 2 additions & 2 deletions pkg/proc.attr.hydfab/man/proc_attr_std_hfsub_name.Rd

Some generated files are not rendered by default.

@@ -292,7 +292,7 @@ testthat::test_that("proc_attr_usgs_nhd", {

testthat::test_that("proc_attr_hf not a comid",{
  testthat::expect_error(proc.attr.hydfab::proc_attr_hf(comid="13Notacomid14", dir_db_hydfab,
-                                custom_name="{lyrs}_",ext = 'gpkg',
+                                custom_name="{lyrs}_",fileext = 'gpkg',
                                lyrs=c('divides','network')[2],
                                hf_cat_sel=TRUE, overwrite=FALSE))
})
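To verify the rename locally, a hedged sketch of running this test file with testthat — the file path below is assumed from the package layout, not confirmed by this diff:

# Hedged sketch: run the proc_attr_grabber tests after the ext -> fileext rename.
# The test file path is an assumption, not shown in this commit.
library(testthat)
testthat::test_file("pkg/proc.attr.hydfab/tests/testthat/test_proc_attr_grabber.R")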