Dev ife (#51)
* xdate works for overall series correlation

* Added code for creating bins and dividing series into segments

* Cleaning up and commenting related to xdate

* series_corr works but is inefficient

* WIP changes

* Added comments, updated working jupyter notebook

* Changes since start of fall semester

* variance stabilization produces accurate values

* Unit tests for readers, summary, stats and tbrm

* Added unit tests for detrend and chron

* Added tests for chron_stabilized, series_corr and writers

---------

Co-authored-by: Ifeoluwa Ale <[email protected]>
Co-authored-by: cosimichele <[email protected]>
3 people authored Nov 3, 2023
1 parent 8f29cbd commit 8394f7c
Showing 33 changed files with 4,551 additions and 3,836 deletions.
7 changes: 7 additions & 0 deletions .gitignore
@@ -1,5 +1,12 @@
# Scripts for testing
src/test_*.py
src/*.txt
src/misc.py

# IDE stuff
.DS_Store
.vscode/
tests/data/.DS_Store

# Byte-compiled / optimized / DLL files
__pycache__/
106 changes: 106 additions & 0 deletions dev-instructions.md
@@ -0,0 +1,106 @@
# dplPy Developer Instructions (in progress)

Welcome to the dplPy developer manual.

## Environment setup
To contribute to dplPy, you will need to set up a few tools.

### 1. GitHub setup

#### 1.1 Create a dplPy fork on GitHub

You will need your own copy of dplPy to work on the code. Go to the dplPy GitHub page and click the Fork button. Make sure the option to copy only the main branch is unchecked.


#### 1.2 Create local repository
In your local terminal, clone the fork to your computer using the commands shown below. Replace {your-user} with your GitHub username.
```
$ git clone https://github.com/{your-user}/dplPy.git dplpy-{your-user}
$ cd dplpy-{your-user}
$ git remote add upstream https://github.com/OpenDendro/dplPy.git
$ git fetch upstream
```

This clones your fork into a local repository named `dplpy-{your-user}` on your computer and adds the main dplPy repository as the `upstream` remote.

#### 1.3 Create feature branch

TBC


### 2. Conda environment

The packages required to run dplPy are all specified in `environment.yml`.

#### 2.1\. Create your environment with the required packages installed.

If you're using conda, run

```
$ conda env create -f environment.yml
```

If you're using mamba, run

```
$ mamba env create -f environment.yml
```

If prompted for permission to install the required packages, enter `y`.

#### 2.2\. Activate your environment.
You will need to have the conda environment activated anytime you want to test code from the package.

```
$ conda activate dplpy
```

After running this command, you should see `(dplpy)` at the start of your terminal prompt.

#### 2.3\. Run unit and integration tests to ensure that installation was successful.
TBA: Instructions for running tests

### 3. IDE setup

We recommend using VSCode for development. The following instructions show how to set up VSCode to recognize the conda environment and debug tests.

#### 3.1\. Open the dplpy folder in VSCode
In VSCode, open the folder containing your local dplpy repository. If you followed the instructions above, this should be a folder named `dplpy-{your-user}`. Then, open the file `src/dplpy.py`.

#### 3.2\. Change the python interpreter to use the conda environment's interpreter
In the bottom-right corner of the VSCode window, click the interpreter selector in the status bar.

Choose the interpreter `Python 3.x ('dplpy')`, with a path that ends with `/envs/dplpy/python`.

Now you should be able to run any Python file in the currently open folder with the Run button in VSCode, instead of running it through the terminal.

Note: if you open a new terminal after the interpreter has been set to the conda environment, VSCode activates the environment automatically, so `conda activate dplpy` does not need to be run again.

#### 3.3\. Set up unit testing tools

Go to the Testing tab (on the left side of the VSCode display) with your environment set. If the tests are not automatically discovered, open `.vscode/settings.json` and add the following lines inside the curly braces, so that your file looks like this:

```
{
    // ...any settings already in the file stay as they are...
    "python.testing.pytestArgs": [
        "./src/unittests"
    ],
    "python.testing.unittestEnabled": false,
    "python.testing.pytestEnabled": true
}
```

If `.vscode/settings.json` has not been created, create it and add the lines shown above.

Go back to the testing tab and verify that the dplpy unit tests are showing. They should look like this:

TBA: Image


Run the tests by clicking the play button next to `src`.


## Overview of dplPy functions

3 changes: 2 additions & 1 deletion environment.yml
@@ -15,4 +15,5 @@ dependencies:
- pip:
- csaps
- jupyterlab
- notebook
- notebook
- pytest
4 changes: 3 additions & 1 deletion src/__init__.py
@@ -2,4 +2,6 @@

__author__ = "Tyson Lee Swetnam"
__email__ = "[email protected]"
__version__ = "0.1"
__version__ = "0.1"

from src import dplpy
29 changes: 18 additions & 11 deletions src/autoreg.py
@@ -44,23 +44,25 @@

def ar_func(data, max_lag=5):
if isinstance(data, pd.DataFrame):
res = {}
start_df = pd.DataFrame(index=pd.Index(data.index))
to_concat = [start_df]
for column in data.columns:
res[column] = ar_func_series(data[column], max_lag).tolist()
to_concat.append(ar_func_series(data[column], max_lag))
res = pd.concat(to_concat, axis=1)
return res
elif isinstance(data, pd.Series):
res = ar_func_series(data, max_lag)
return res
else:
return TypeError("argument should be either pandas dataframe or pandas series.")
raise TypeError("Data argument should be either pandas dataframe or pandas series.")

# This function returns residuals plus mean of the best fit AR
# model of the data
def ar_func_series(data, max_lag):
nullremoved_data = data.dropna()
pars = autoreg(nullremoved_data, max_lag)

y = nullremoved_data.to_numpy()
y = nullremoved_data

yi = fitted_values(y, pars)

@@ -70,13 +72,18 @@ def ar_func_series(data, max_lag):

# Add mean to the residuals
for i in range(len(res)):
res[i] += mean
res.iloc[i] += mean

return res

# This method selects the best AR model with a specified maximum order
# The best model is selected based on AIC value
def autoreg(data, max_lag=5):
def autoreg(data: pd.Series, max_lag=5):
# validate data?
if not isinstance(data, pd.Series):
raise TypeError("Data argument should be pandas series. Received " + str(type(data)) + " instead.")

# Need to change this to only ignore specific warnings instead of all
with warnings.catch_warnings():
warnings.filterwarnings("ignore")
ar_data = ar_select_order(data.dropna(), max_lag, ic='aic', old_names=False)
Expand All @@ -86,13 +93,13 @@ def autoreg(data, max_lag=5):
# This function calculates the in-sample predicted values of a series,
# given an array containing the original data and the parameters for
# the AR model
def fitted_values(data_array, params):
mean = np.mean(data_array)
def fitted_values(data_series, params):
mean = np.mean(data_series)
results = []

for i in range((len(params)-1), len(data_array)):
pred = params[0]
for i in range((len(params)-1), len(data_series)):
pred = params.iloc[0]
for j in range(1, len(params)):
pred += (params[j] * data_array[i-j])
pred += (params.iloc[j] * data_series.iloc[i-j])
results.append(pred)
return np.asarray(results)
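The prediction loop in `fitted_values` can be sanity-checked in isolation. A minimal sketch with made-up AR(1) values, assuming `params[0]` is the intercept and `params[1:]` are the lag coefficients (the layout the code above indexes into):

```python
import numpy as np
import pandas as pd

# Hypothetical AR(1) parameters for illustration:
# pred[i] = 0.5 + 0.8 * y[i-1]
params = pd.Series([0.5, 0.8])
data = pd.Series([1.0, 2.0, 3.0, 4.0])

# Same loop structure as fitted_values: predictions start once
# enough lagged observations are available.
preds = []
for i in range(len(params) - 1, len(data)):
    pred = params.iloc[0]
    for j in range(1, len(params)):
        pred += params.iloc[j] * data.iloc[i - j]
    preds.append(pred)

print(np.asarray(preds))  # [1.3 2.1 2.9]
```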
5 changes: 4 additions & 1 deletion src/chron.py
@@ -47,7 +47,10 @@

# Main function for creating chronology of series. Formats input, prewhitens if necessary
# and produces output mean value chronology in a dataframe.
def chron(rwi_data, biweight=True, prewhiten=False, plot=True):
def chron(rwi_data: pd.DataFrame, biweight=True, prewhiten=False, plot=True):
if not isinstance(rwi_data, pd.DataFrame):
raise TypeError("Expected pandas dataframe as input, got " + str(type(rwi_data)) + " instead")

chron_data = {}
for series in rwi_data:
series_data = rwi_data[series].dropna()
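For intuition, the mean-value chronology `chron` produces boils down to a per-year mean across series plus a sample-depth count. A simplified sketch with made-up series names (this ignores the biweight and prewhiten options):

```python
import pandas as pd

# Two hypothetical ring-width index series; SER1 has no value in 1902.
rwi = pd.DataFrame({
    "SER1": [1.0, 1.2, None],
    "SER2": [0.8, 1.0, 1.1],
}, index=[1900, 1901, 1902])

mean_rwi = rwi.mean(axis=1)             # NaNs are skipped per year
sample_depth = rwi.notna().sum(axis=1)  # series present per year

print(mean_rwi.tolist())       # [0.9, 1.1, 1.1]
print(sample_depth.tolist())   # [2, 2, 1]
```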
90 changes: 90 additions & 0 deletions src/chron_stabilized.py
@@ -0,0 +1,90 @@
from rbar import get_running_rbar, mean_series_intercorrelation
from chron import chron
import numpy as np
import pandas as pd
import warnings


def chron_stabilized(rwi_data: pd.DataFrame, win_length=50, min_seg_ratio=0.33, biweight=True, running_rbar=False):
if not isinstance(rwi_data, pd.DataFrame):
raise TypeError("Expected data input to be a pandas dataframe, not " + str(type(rwi_data)) + ".")


num_years = rwi_data.shape[0]

if win_length > num_years:
raise ValueError("Window length should not be greater than the number of rows in the dataset")

if min_seg_ratio <= 0 or min_seg_ratio > 1:
raise ValueError("min_seg_ratio cannot be <= 0 or > 1")

if win_length < 0.3*num_years or win_length >= 0.5*num_years:
warnings.warn("We recommend using a window length greater than 30% but less than 50% of the chronology length\n")

print("Generating variance stabilized chronology...\n")

# give rbar function a range of years (window length) to calculate rbar for
# calculate rbar for that window, using either osborn's or frank's or 67spline
# get rbar for each relevant segment of the dataframe


mean_val = rwi_data.mean().mean()

zero_mean_data = rwi_data - mean_val

rbar_array = np.zeros(zero_mean_data.shape[0])

if win_length % 2 == 0:
target = (win_length)/2
else:
target = (win_length-1)/2

for i in range(num_years-win_length + 1):
data_segment = zero_mean_data[i:i + win_length]
if data_segment.shape[0] < win_length:
continue
target_index = int(i + target)
rbar_array[target_index] = get_running_rbar(data_segment, min_seg_ratio)

rbar_array = pad_rbar_array(rbar_array)

reg_chron = chron(zero_mean_data, biweight=biweight, plot=False)

mean_rwis = reg_chron["Mean RWI"].to_numpy()
samp_deps = reg_chron["Sample depth"].to_numpy()
denom = np.multiply(samp_deps-1, rbar_array) + 1

n_eff = np.minimum(np.divide(samp_deps, denom), samp_deps)
rbar_const = mean_series_intercorrelation(zero_mean_data, "pearson", min_seg_ratio)
stabilized_means = np.multiply(mean_rwis, np.sqrt(n_eff * rbar_const))

if running_rbar:
stabilized_chron = pd.DataFrame(data={"Adjusted CRN": stabilized_means + mean_val, "Running rbar": rbar_array, "Sample depth": samp_deps}, index=reg_chron.index)
else:
stabilized_chron = pd.DataFrame(data={"Adjusted CRN": stabilized_means + mean_val, "Sample depth": samp_deps}, index=reg_chron.index)

print("SUCCESS!\n")
return stabilized_chron

def pad_rbar_array(rbar_array):
# double check that rbar cannot be 0
first = 0
first_valid = 0
for val in rbar_array:
if val != 0 and not np.isnan(val):
first = val
break
first_valid += 1

last = 0
last_valid = len(rbar_array) - 1
for val in np.flip(rbar_array):
if val != 0 and not np.isnan(val):
last = val
break
last_valid -= 1

# Pad leading/trailing zero or NaN entries with the nearest valid rbar value
rbar_array[:first_valid] = np.full(first_valid, first)
rbar_array[last_valid:] = np.full(len(rbar_array) - last_valid, last)

return rbar_array
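The core of the stabilization above is the effective-sample-size correction: with sample depth n and interseries correlation rbar, n_eff = n / (1 + (n − 1) · rbar), capped at n, and the plain mean chronology is scaled by sqrt(n_eff · rbar). A minimal numpy sketch with made-up values:

```python
import numpy as np

# Hypothetical inputs for illustration only.
samp_deps = np.array([10.0, 10.0, 5.0])  # trees per year
rbar = np.array([0.4, 0.5, 0.4])         # running interseries correlation
mean_rwis = np.array([1.1, 0.9, 1.0])    # plain mean chronology
rbar_const = 0.45                        # dataset-wide interseries correlation

# Effective number of independent series, capped at the actual depth
denom = (samp_deps - 1) * rbar + 1
n_eff = np.minimum(samp_deps / denom, samp_deps)

# Scale the mean chronology to stabilize its variance
stabilized = mean_rwis * np.sqrt(n_eff * rbar_const)
print(np.round(stabilized, 3))
```

Years with low sample depth or weak replication get pulled toward the mean, which is the point of the adjustment.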
19 changes: 13 additions & 6 deletions src/detrend.py
@@ -40,17 +40,26 @@
from autoreg import ar_func
import curvefit

def detrend(data, fit="spline", method="residual", plot=True, period=None):
def detrend(data: pd.DataFrame | pd.Series, fit="spline", method="residual", plot=True, period=None):
if isinstance(data, pd.DataFrame):
res = pd.DataFrame(index=pd.Index(data.index))
to_add = [res]
for column in data.columns:
to_add.append(detrend_series(data[column], column, fit, method, plot, period=None))
output_df = pd.concat(to_add, axis=1)
return output_df.rename_axis(data.index.name)
elif isinstance(data, pd.Series):
return detrend_series(data, data.name, fit, method, plot)
else:
return TypeError("argument should be either pandas dataframe or pandas series.")
raise TypeError("argument should be either pandas dataframe or pandas series.")

# Takes a series as input and by default fits it to a spline, then
# detrends it by calculating residuals
@@ -74,17 +83,15 @@ def detrend_series(data, series_name, fit, method, plot, period=None):
yi = curvefit.horizontal(x, y)
else:
# give error message for unsupported curve fit
print()
return ValueError("unsupported keyword for curve-fit type. See documentation for more info.")
raise ValueError("unsupported keyword for curve-fit type. See documentation for more info.")

if method == "residual":
detrended_data = residual(y, yi)
elif method == "difference":
detrended_data = difference(y, yi)
else:
# give error message for unsupported detrending method
print()
return ValueError("unsupported keyword for detrending method. See documentation for more info.")
raise ValueError("unsupported keyword for detrending method. See documentation for more info.")

if plot:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(7,3))
19 changes: 19 additions & 0 deletions src/dplpy.py
@@ -185,6 +185,18 @@ def autoreg_from_parser(args):

def xdate_from_parser(args):
xdate(input=args.input)

def chron_stabilized_from_parser(args):
chron_stabilized(input=args.input)

def write_from_parser(args):
write(input=args.input)

def series_corr_from_parser(args):
series_corr(input=args.input)
@@ -209,9 +221,16 @@ def rbar_from_parser(args):
from detrend import detrend
from autoreg import ar_func, autoreg
from chron import chron
from chron_stabilized import chron_stabilized
from xdate import xdate, xdate_plot
from series_corr import series_corr
from writers import write
from rbar import rbar, common_interval

def main(args=None):
parser = argparse.ArgumentParser(description="dplPy v0.1") # update version as we update packages