Attribute aggregation/transformation + plotting & evaluation analyses #34

Status: Open
wants to merge 96 commits into base: main
7c0a582
Add alternate comid retrieval via sf geometry in case nwissite return…
glitt13 Nov 1, 2024
44e1182
merge upstream/main
glitt13 Nov 1, 2024
8a937ce
fix: add gage_id inside each loc_attrs df; fix: set fill=TRUE for rbi…
glitt13 Nov 1, 2024
3951059
fix: add usgs_vars sublist to Retr_Params
glitt13 Nov 1, 2024
8c7ed2c
feat: add a format checker on Retr_Params
glitt13 Nov 1, 2024
339279c
feat: add attribute variable name checker, incorporate check_attr_sel…
glitt13 Nov 1, 2024
df57202
feat: developing approach to transform attributes
glitt13 Nov 5, 2024
e7d5e0f
feat: add cmd/config file capability to retrieving camels attributes …
glitt13 Nov 5, 2024
f5b01b7
fix: update script to work with return of a data.table rather than a …
glitt13 Nov 5, 2024
34efc2e
fix: address path/glue format issues
glitt13 Nov 5, 2024
96b01a9
refactor: negligible change
glitt13 Nov 7, 2024
3a7d4c1
feat: add parquet file read option based on check for comid in filename
glitt13 Nov 7, 2024
218eba0
doc: update fs_read_attr_comid documentation based on read_type
glitt13 Nov 7, 2024
08b867b
doc: update yaml config files to jive with latest developments in att…
glitt13 Nov 8, 2024
a9b215e
feat: core functionality that aggregates & transforms attributes
glitt13 Nov 12, 2024
d61c9c2
refactor: move config file read out of for-loop; fix: ensure str form…
glitt13 Nov 12, 2024
2ec53f1
fix: add error if Null vals returned following aggregation/transforma…
glitt13 Nov 12, 2024
e2d79c1
feat: create file listing needed comid-attributes pairings
glitt13 Nov 13, 2024
6cf7a7b
doc: describe steps in creating transformed attributes; feat: update …
glitt13 Nov 13, 2024
5d288c3
feat: add attribute generation script for camels catchments
glitt13 Nov 13, 2024
c9bdad6
fix: remove deprecated wrapper function from tfrm_attr
glitt13 Nov 13, 2024
35f0663
fix: resolve merge conflicts
glitt13 Nov 14, 2024
0ea2a92
fix: change dask dataframe to eager evaluation
glitt13 Nov 14, 2024
329a8e1
feat: partially-created unit tests corresponding to attribute transfo…
glitt13 Nov 14, 2024
3e0f378
feat: convert missing comid/attrs scripts into functions; doc: augmen…
glitt13 Nov 15, 2024
67398a3
fix: add in home_dir as optional part of attr config's directory form…
glitt13 Nov 15, 2024
b14c63a
fix: add logic on whether a warning prints after first checking if mi…
glitt13 Nov 15, 2024
60b776c
feat: add attribute config file parser function to R package proc.att…
glitt13 Nov 18, 2024
02de2b0
fix: address undefined objects in attr_cfig_parse
glitt13 Nov 18, 2024
33d11ec
fix: remove duplicated attr_cfig_parse from package file
glitt13 Nov 18, 2024
2c40272
feat: create missing attributes finder wrapper function
glitt13 Nov 19, 2024
5ee6d28
doc: update descriptive documentation for fs_attrs_miss_wrap()
glitt13 Nov 19, 2024
b8ff3cb
feat: add the missing attributes Rscript and wrapper documentation
glitt13 Nov 19, 2024
5b57a09
fix: remove items in transformation config file no longer used; doc: …
glitt13 Nov 19, 2024
d00fa8b
cherry-pick transform config file doc updates and remove deprecated i…
glitt13 Nov 19, 2024
21a41a7
fix: patch the attribute metadata comid column read by searching for …
glitt13 Nov 19, 2024
f2dfdd1
feat: add Rscript call that attempts to retrieve missing attributes i…
glitt13 Nov 19, 2024
7ada133
doc: add printout explaining Rscript called to retrieve missing attri…
glitt13 Nov 19, 2024
93d0655
merge missing comid-attribute grabber functionality
glitt13 Nov 19, 2024
1828f43
feat: create the kratzert et al 2019 preprocessing script to standard…
glitt13 Nov 21, 2024
233fea5
fix: allow multiple datasets to be parsed
glitt13 Nov 22, 2024
358def7
fix: allow multiple datasets to be parsed, assign datasets to Retr_Pa…
glitt13 Nov 22, 2024
fbab547
fix: streamline script now that Retr_Params created by attr_cfig_parse()
glitt13 Nov 22, 2024
3b42c55
fix: multidataset processing streamlining; fix: allow user to specify…
glitt13 Nov 22, 2024
131e5eb
doc: improve config file and processing script documentation
glitt13 Nov 22, 2024
c3ea15e
feat: add ealstm config files for processing
glitt13 Nov 22, 2024
7a819cf
fix: ensure algo config object doesn't become empty when looping over…
glitt13 Nov 22, 2024
874ae40
Add scripts and associated cfg file for model performance viz
bolotinl Nov 22, 2024
55a4147
Remove scratch code
bolotinl Nov 22, 2024
f3e4bad
fix: remove erroneous aggregation function for TOT_WB5100_yr_mean in …
glitt13 Nov 25, 2024
29b41c0
fix: path_meta.exists missing ()
glitt13 Nov 26, 2024
1ab3244
feat: add duplicate attribute checker/remover; feat: create attribute…
glitt13 Nov 26, 2024
d7650b4
fix: add duplicate handling to transformation processing
glitt13 Nov 26, 2024
8d098b4
rm: remove deprecated script
glitt13 Nov 26, 2024
2bf3d75
feat: implment parsing & function call for attribute csv file read op…
glitt13 Nov 26, 2024
370b73d
feat: update config files with new options to read attributes by csv,…
glitt13 Nov 26, 2024
59c3cca
feat: create demonstrative script on how to remove bad data generated…
glitt13 Nov 26, 2024
82566c4
feat: create ealstm attribute transformation config file, and an alte…
glitt13 Nov 26, 2024
2d885a7
Include download of US map
bolotinl Nov 26, 2024
aa1c3ba
Get ds_type, write_type from pred cfg; convert os to Pathlib
bolotinl Nov 26, 2024
fb5d15d
feat: work in progress on developing functions for analysis of rf mod…
glitt13 Nov 27, 2024
3c0cc87
Use existing functions for pulling info from attr config
bolotinl Nov 27, 2024
ae0a05d
feat: adding attributes of interest file for ealstm analysis
glitt13 Nov 27, 2024
d63d800
feat: adding PCA to agu script
glitt13 Nov 27, 2024
6ec6ba9
feat: add analysis dir to save directory structure
glitt13 Nov 28, 2024
170036a
feat: create correlation analyses
glitt13 Nov 28, 2024
91def5d
fix: simplify attribute filtering in dask dfs
glitt13 Nov 28, 2024
9b93ead
feat: add principal component analysis to dataset characterization
glitt13 Nov 28, 2024
f436080
feat: add figure importance plotting; feat: developing learning curve…
glitt13 Nov 29, 2024
e65d539
feat: add feature importance plot wrapper functional call
glitt13 Nov 29, 2024
b400d39
feat: create the learning curve plotting for each trained algorithm
glitt13 Nov 30, 2024
10a568e
merge Lauren's data viz for further usage
glitt13 Nov 30, 2024
bcbe77c
feat: integrate bolotinl's geospatial & regression plotting; refactor…
bolotinl Dec 2, 2024
f69fbba
fix: update dataset preprocessing
glitt13 Dec 2, 2024
56984fe
refactor: Adapt to updated comid/geometry retrieval
glitt13 Dec 3, 2024
b90bdcb
feat: Integrate visualization plotting for each dataset into the stan…
glitt13 Dec 3, 2024
8f3a738
feat: create a cross-comparison 'best' predictor analysis; refactor: …
glitt13 Dec 3, 2024
a37faec
fix: modify best map plotting for AGU 2024 poster
glitt13 Dec 3, 2024
de2bfe2
fix: non-multi param training should not access params from algo_conf…
glitt13 Dec 17, 2024
c60bc4e
fix: update function name change
glitt13 Dec 17, 2024
ddbc2e9
feat: all set for AGU24
glitt13 Dec 17, 2024
5ff2f04
fix: explicitly define arg names in AlgoTrainEval; fix: update new re…
glitt13 Dec 17, 2024
697eef5
feat: add a new 'metric' mapping for xSSA sobol' sensitivities
glitt13 Dec 18, 2024
56098da
fix: remove print message looking for objects that don't exist
glitt13 Dec 18, 2024
ac3067a
fix: rename accidental base path inside std_eval_metrs_path()
glitt13 Dec 18, 2024
c6cf6cb
Change viz scripts to call functions in fsate; add consistent plot st…
bolotinl Dec 19, 2024
374c8e5
doc: add documentation to fs_algo functions
glitt13 Dec 19, 2024
f269094
fix: remove scratch analysis
glitt13 Dec 19, 2024
59b25cf
fix: remove hydroatlas vars from config file
glitt13 Dec 19, 2024
b4c87e7
fix: move printout confirming write after write happens
glitt13 Dec 19, 2024
de8f937
refactor: hydroatlas accommodates local or s3 paths and nhdplus pulls…
glitt13 Dec 22, 2024
a653285
refactor: create a multi-attribute & multi-comid query approach for e…
glitt13 Dec 24, 2024
0cd57ef
refactor: remake attribute retrieval to pull multiple comids and attr…
glitt13 Dec 29, 2024
de5642b
fix: address issues exposed during unit testing
glitt13 Dec 30, 2024
e88f295
test: expand and revise unit tests for current functionality
glitt13 Dec 30, 2024
78e17c4
fix: minor spelling correction in fs_attr_menu.yaml; doc: convenience…
glitt13 Dec 30, 2024
19 changes: 19 additions & 0 deletions pkg/fs_algo/fs_algo/RaFTS_theme.mplstyle
@@ -0,0 +1,19 @@
# Style theme for RaFTS data visualizations

axes.labelsize : 12
lines.linewidth : 2
xtick.labelsize : 11
ytick.labelsize : 11
legend.fontsize : 11
font.family : Arial

# viridis color codes: https://waldyrious.net/viridis-palette-generator/
# viridis with a slightly lighter purple:
axes.prop_cycle: cycler('color', ['7e3b8a', '21918c', 'fde725', '3b528b', '5ec962'])

# Other odd options -------
# viridis:
# axes.prop_cycle: cycler('color', ['440154', '21918c', 'fde725', '3b528b', '5ec962'])

# viridis plasma:
# axes.prop_cycle: cycler('color', ['f89540', 'cc4778', '7e03a8', '0d0887', 'f0f921'])
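The .mplstyle sheet above uses matplotlib's plain `key : value` rcParams format, with `#` comments and blank lines ignored. As a minimal sketch of how those lines map to a parameter dict (the `parse_mplstyle` helper here is hypothetical, for illustration only — matplotlib does this internally via `plt.style.use`):

```python
def parse_mplstyle(text):
    """Parse simple 'key : value' lines from a .mplstyle-format string."""
    params = {}
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if not line or ':' not in line:
            continue  # skip blank/comment-only lines
        key, value = line.split(':', 1)
        params[key.strip()] = value.strip()
    return params

style = """
axes.labelsize : 12
lines.linewidth : 2
font.family : Arial
"""
print(parse_mplstyle(style))
# {'axes.labelsize': '12', 'lines.linewidth': '2', 'font.family': 'Arial'}
```

In practice the sheet is applied exactly as fs_perf_viz.py does below, by passing its path to `plt.style.use()`.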
1,322 changes: 1,249 additions & 73 deletions pkg/fs_algo/fs_algo/fs_algo_train_eval.py

Large diffs are not rendered by default.

190 changes: 190 additions & 0 deletions pkg/fs_algo/fs_algo/fs_perf_viz.py
@@ -0,0 +1,190 @@
'''
@title: Produce data visualizations for RaFTS model performance outputs
@author: Lauren Bolotin <[email protected]>
@description: Reads in several config files,
visualizes results for the specified RaFTS algorithms and evaluation metrics,
and saves plots to .png files.
@usage: python fs_perf_viz.py "/full/path/to/viz_config.yaml"

Changelog/contributions
2024-11-22 Originally created, LB
'''
import geopandas as gpd
import os
import pandas as pd
from shapely.geometry import Point
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.metrics import r2_score
from sklearn.metrics import root_mean_squared_error
import yaml
from pathlib import Path
import argparse
import fs_algo.fs_algo_train_eval as fsate
import xarray as xr
import urllib.request
import zipfile
import pkg_resources


if __name__ == "__main__":
parser = argparse.ArgumentParser(description = 'process the data visualization config file')
parser.add_argument('path_viz_config', type=str, help='Path to the YAML configuration file specific for data visualization')
args = parser.parse_args()

home_dir = Path.home()
path_viz_config = Path(args.path_viz_config) #Path(f'{home_dir}/FSDS/formulation-selector/scripts/eval_ingest/xssa/xssa_viz_config.yaml')

with open(path_viz_config, 'r') as file:
viz_cfg = yaml.safe_load(file)

# Get features from the viz config file --------------------------
algos = viz_cfg.get('algos')
print('Visualizing data for the following RaFTS algorithms:')
print(algos)
print('')
metrics = viz_cfg.get('metrics')
print('And for the following evaluation metrics:')
print(metrics)
print('')

plot_types = viz_cfg.get('plot_types')
plot_types_dict = {k: v for d in plot_types for k, v in d.items()}
true_keys = [key for key, value in plot_types_dict.items() if value is True]
print('The following plots will be generated:')
print(true_keys)
print('')

# Get features from the pred config file --------------------------
path_pred_config = fsate.build_cfig_path(path_viz_config,viz_cfg.get('name_pred_config',None)) # currently, this gives the pred config path, not the attr config path
pred_cfg = yaml.safe_load(open(path_pred_config, 'r'))
path_attr_config = fsate.build_cfig_path(path_pred_config,pred_cfg.get('name_attr_config',None))
ds_type = pred_cfg.get('ds_type')
write_type = pred_cfg.get('write_type')

# Get features from the attr config file --------------------------
attr_cfg = fsate.AttrConfigAndVars(path_attr_config)
attr_cfg._read_attr_config()
datasets = attr_cfg.attrs_cfg_dict.get('datasets')
dir_base = attr_cfg.attrs_cfg_dict.get('dir_base')
dir_std_base = attr_cfg.attrs_cfg_dict.get('dir_std_base')

# Get features from the main config file --------------------------
# NOTE: This assumes that the main config file is just called [same prefix as all other config files]_config.yaml
# Build the path to the main config file by referencing the other config files we've already read in
prefix_viz = str(path_viz_config.name).split('_')[0]
prefix_attr = str(path_attr_config.name).split('_')[0]
if (prefix_viz != prefix_attr):
raise ValueError('All config files must be in the same directory and be\
identifiable using the same prefix as each other (e.g.\
[dataset]_config.yaml, [dataset]_pred_config.yaml, \
[dataset]_attr_config.yaml, etc.)')
else:
prefix = prefix_viz

path_main_config = fsate.build_cfig_path(path_viz_config,f'{prefix_viz}_config.yaml')
with open(path_main_config, 'r') as file:
main_cfg = yaml.safe_load(file)

# NOTE: This is something I'm not totally sure will function properly with multiple datasets
formulation_id = list([x for x in main_cfg['formulation_metadata'] if 'formulation_id' in x][0].values())[0]
save_type = list([x for x in main_cfg['file_io'] if 'save_type' in x][0].values())[0]
if save_type.lower() == 'netcdf':
save_type_obs = 'nc'
engine = 'netcdf4'
else:
save_type_obs = 'zarr'
engine = 'zarr'

# Access the location metadata for prediction sites
path_meta_pred = pred_cfg.get('path_meta')

# Location for accessing existing outputs and saving plots
dir_out = fsate.fs_save_algo_dir_struct(dir_base).get('dir_out')
dir_out_viz_base = Path(dir_out/Path("data_visualizations"))

# Enforce style
style_path = pkg_resources.resource_filename('fs_algo', 'RaFTS_theme.mplstyle')
plt.style.use(style_path)

# Loop through all datasets
for ds in datasets:
path_meta_pred = f'{path_meta_pred}'.format(ds = ds, dir_std_base = dir_std_base, ds_type = ds_type, write_type = write_type)
meta_pred = pd.read_parquet(path_meta_pred)

# Loop through all algorithms
for algo in algos:
# Loop through all metrics
for metric in metrics:
# Pull the predictions
path_pred = fsate.std_pred_path(dir_out,algo=algo,metric=metric,dataset_id=ds)
pred = pd.read_parquet(path_pred)
data = pd.merge(meta_pred, pred, how = 'inner', on = 'comid')
Path(f'{dir_out}/data_visualizations').mkdir(parents=True, exist_ok=True)
# If you want to export the merged data for any reason:
# data.to_csv(f'{dir_out}/data_visualizations/{ds}_{algo}_{metric}_data.csv')

# Does the user want a map of the module performance predicted by RaFTS?
if 'pred_map' in true_keys:
states = fsate.gen_conus_basemap(f'{dir_out}/data_visualizations/')

# Plot performance on map
lat = data['Y']
lon = data['X']
geometry = [Point(xy) for xy in zip(lon,lat)]
geo_df = gpd.GeoDataFrame(geometry = geometry)
geo_df['performance'] = data['prediction'].values
geo_df.crs = ("EPSG:4326")

fsate.plot_map_pred(geo_df=geo_df, states=states,
title=f'RaFTS Predicted Performance Map: {ds}',
metr=metric, colname_data='performance')

# Save the plot as a .png file
output_path = fsate.std_map_pred_path(dir_out_viz_base=dir_out_viz_base,
ds=ds, metr=metric, algo_str=algo,
split_type='prediction')
plt.savefig(output_path, dpi=300, bbox_inches='tight')
plt.clf()
plt.close()


if 'obs_vs_sim_scatter' in true_keys:
# Scatter plot of observed vs. predicted module performance
# Strip non-digit characters (e.g. the 'USGS-' prefix) so ids can be merged with the observed performance data
data['identifier'] = data['identifier'].str.replace(r'\D', '', regex=True)
data['identifier'] = data['identifier'].str.strip() # remove leading and trailing spaces

# Read in the observed performance data
path_obs_perf = f'{dir_std_base}/{ds}/{ds}_{formulation_id}.{save_type_obs}'
obs = xr.open_dataset(path_obs_perf, engine=engine)
# NOTE: Below is one option, but it assumes there is only one possible .nc or .zarr file to read in (it only reads the first one it finds with that file extension)
# obs = fsate._open_response_data_fs(dir_std_base=dir_std_base, ds=ds)
obs = obs.to_dataframe()

# Standardize column names
obs.reset_index(inplace=True)
obs = obs.rename(columns={"gage_id": "identifier"})

# Subset columns
data = data[['identifier', 'comid', 'X', 'Y', 'prediction', 'metric', 'dataset']]
data = data[data['metric'] == metric]
data.columns = data.columns.str.lower()
obs = obs[['identifier', metric]]

# Merge the observed and predicted data
data = pd.merge(data, obs, how = 'inner', on = 'identifier')

# Plot the observed vs. predicted module performance
fsate.plot_pred_vs_obs_regr(y_pred=data['prediction'], y_obs=data[metric],
ds = ds, metr=metric)

# Save the plot as a .png file
output_path = fsate.std_regr_pred_obs_path(dir_out_viz_base=dir_out_viz_base,
ds=ds, metr=metric, algo_str=algo,
split_type='prediction')
plt.savefig(output_path, dpi=300, bbox_inches='tight')
plt.clf()
plt.close()

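The `plot_types` handling near the top of fs_perf_viz.py flattens the YAML list of single-key dicts (e.g. `- pred_map: True`) into one dict and keeps only the plots enabled with a literal `True`. A self-contained sketch of that logic, with an example config value standing in for `viz_cfg.get('plot_types')`:

```python
# Example of what yaml.safe_load produces for a viz config like:
#   plot_types:
#     - pred_map: True
#     - obs_vs_sim_scatter: False
plot_types = [{'pred_map': True}, {'obs_vs_sim_scatter': False}]

# Flatten the list of single-key dicts into one dict...
plot_types_dict = {k: v for d in plot_types for k, v in d.items()}
# ...then keep only the plot types explicitly set to True
true_keys = [key for key, value in plot_types_dict.items() if value is True]
print(true_keys)  # ['pred_map']
```

Note the `is True` check means truthy-but-not-boolean values (e.g. `pred_map: 1`) would be silently skipped, so the config must use YAML booleans.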
37 changes: 30 additions & 7 deletions pkg/fs_algo/fs_algo/fs_proc_algo.py
@@ -4,6 +4,7 @@
from pathlib import Path
import fs_algo.fs_algo_train_eval as fsate
import ast
import numpy as np

"""Workflow script to train algorithms on catchment attribute data for predicting
formulation metrics and/or hydrologic signatures.
@@ -27,10 +28,12 @@
algo_config = {k: algo_cfg['algorithms'][k] for k in algo_cfg['algorithms']}
if algo_config['mlp'][0].get('hidden_layer_sizes',None): # purpose: evaluate string literal to a tuple
algo_config['mlp'][0]['hidden_layer_sizes'] = ast.literal_eval(algo_config['mlp'][0]['hidden_layer_sizes'])

algo_config_og = algo_config.copy()

verbose = algo_cfg['verbose']
test_size = algo_cfg['test_size']
seed = algo_cfg['seed']
read_type = algo_cfg.get('read_type','all') # Arg for how to read attribute data using comids in fs_read_attr_comid(). May be 'all' or 'filename'.

#%% Attribute configuration
name_attr_config = algo_cfg.get('name_attr_config', Path(path_algo_config).name.replace('algo','attr'))
@@ -45,7 +48,18 @@
attr_cfig = fsate.AttrConfigAndVars(path_attr_config)
attr_cfig._read_attr_config()

attrs_sel = attr_cfig.attrs_cfg_dict.get('attrs_sel', None)


# Grab the attributes of interest from the attribute config file,
# OR a .csv file if specified in the algo config file.
name_attr_csv = algo_cfg.get('name_attr_csv')
colname_attr_csv = algo_cfg.get('colname_attr_csv')
attrs_sel = fsate._id_attrs_sel_wrap(attr_cfig=attr_cfig,
path_cfig=path_attr_config,
name_attr_csv = name_attr_csv,
colname_attr_csv = colname_attr_csv)

# Define directories/datasets from the attribute config file
dir_db_attrs = attr_cfig.attrs_cfg_dict.get('dir_db_attrs')
dir_std_base = attr_cfig.attrs_cfg_dict.get('dir_std_base')
dir_base = attr_cfig.attrs_cfg_dict.get('dir_base')
@@ -71,22 +85,29 @@

# %% COMID retrieval and assignment to response variable's coordinate
[featureSource,featureID] = fsate._find_feat_srce_id(dat_resp,attr_cfig.attr_config) # e.g. ['nwissite','USGS-{gage_id}']
comids_resp = fsate.fs_retr_nhdp_comids(featureSource,featureID,gage_ids=dat_resp['gage_id'].values)
gdf_comid = fsate.fs_retr_nhdp_comids_geom(featureSource=featureSource,
featureID=featureID,
gage_ids=dat_resp['gage_id'].values)
comids_resp = gdf_comid['comid']
dat_resp = dat_resp.assign_coords(comid = comids_resp)

# Remove the unknown comids:
dat_resp = dat_resp.dropna(dim='comid',how='any')
comids_resp = [x for x in comids_resp if x is not np.nan]
# TODO allow secondary option where featureSource and featureIDs already provided, not COMID

#%% Read in predictor variable data (aka basin attributes)
# Read the predictor variable data (basin attributes) generated by proc.attr.hydfab
df_attr = fsate.fs_read_attr_comid(dir_db_attrs, comids_resp, attrs_sel = attrs_sel,
_s3 = None,storage_options=None)
_s3 = None,storage_options=None,read_type=read_type)
# Convert into wide format for model training
df_attr_wide = df_attr.pivot(index='featureID', columns = 'attribute', values = 'value')

# %% Train, test, and evaluate
rslt_eval = dict()
for metr in metrics:
print(f' - Processing {metr}')
if len(algo_config) == 0:
algo_config = algo_config_og.copy()
# Subset response data to metric of interest & the comid
df_metr_resp = pd.DataFrame({'comid': dat_resp['comid'],
metr : dat_resp[metr].data})
@@ -103,10 +124,12 @@
metr=metr,test_size=test_size, rs = seed,
verbose=verbose)
train_eval.train_eval() # Train, test, eval wrapper

# Retrieve evaluation metrics dataframe
rslt_eval[metr] = train_eval.eval_df

path_eval_metr = fsate.std_eval_metrs_path(dir_out_alg_ds, ds,metr)
train_eval.eval_df.to_csv(path_eval_metr)
del train_eval
# Compile results and write to file
rslt_eval_df = pd.concat(rslt_eval).reset_index(drop=True)
rslt_eval_df['dataset'] = ds
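Before training, fs_proc_algo.py reshapes the long attribute table into wide format with `df_attr.pivot(index='featureID', columns='attribute', values='value')`, so each row is a catchment and each column an attribute. A pure-Python sketch of that long-to-wide reshape (the feature ids and attribute names here are illustrative only, not from the PR):

```python
# Long-format rows as (featureID, attribute, value), mimicking df_attr
rows = [
    ('comid_1', 'TOT_ELEV_MEAN', 512.0),
    ('comid_1', 'TOT_PPT7100_ANN', 900.0),
    ('comid_2', 'TOT_ELEV_MEAN', 233.0),
    ('comid_2', 'TOT_PPT7100_ANN', 1210.0),
]

# Pivot to wide: one dict of attributes per featureID,
# equivalent in shape to df_attr.pivot(index='featureID',
# columns='attribute', values='value')
wide = {}
for feature_id, attribute, value in rows:
    wide.setdefault(feature_id, {})[attribute] = value

print(wide['comid_1']['TOT_ELEV_MEAN'])  # 512.0
```

With pandas, `pivot` would raise on duplicate (featureID, attribute) pairs — which is why the PR's commits add a duplicate attribute checker/remover upstream.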