From 0b91b845dc2008f995963610f98d46ed653c9809 Mon Sep 17 00:00:00 2001 From: Dean Henze Date: Thu, 11 Jul 2024 13:37:58 -0700 Subject: [PATCH] Fixed typos and quarto text incompatibilities. Updated some language. --- notebooks/Advanced_cloud/basic_dask.ipynb | 26 ++++-- .../Advanced_cloud/coiled_cluster_01.ipynb | 27 +++--- .../Advanced_cloud/coiled_function_01.ipynb | 93 ++++++++++++------- .../Advanced_cloud/dask_cluster_01.ipynb | 32 ++++--- .../Advanced_cloud/dask_delayed_01.ipynb | 82 ++++++++++------ 5 files changed, 164 insertions(+), 96 deletions(-) diff --git a/notebooks/Advanced_cloud/basic_dask.ipynb b/notebooks/Advanced_cloud/basic_dask.ipynb index 3f0ded8e..2182d37c 100644 --- a/notebooks/Advanced_cloud/basic_dask.ipynb +++ b/notebooks/Advanced_cloud/basic_dask.ipynb @@ -23,14 +23,15 @@ "\n", "Dask is a Python package to utilize parallel processing. With data accessible in the Cloud this capability presents a great opportunity, since any user can spin up a virtual machine with more RAM and processors than their local machine/laptop, allowing them to magnify the benefits of parallel processing. Great introductions to parallel computing and Dask (brief yet pedagogical) are linked to in the Prerequisites section further down and a reader completely new to these concepts is encouraged to read them.\n", "\n", - "In the author's experience, many Earthdata analyses will fall into one of two categrories, and determining which one is key to set up the appropriate parallelization method. \n", - "1. A computation which simply needs to be replicated many times, such as applying the same computation to 1000 files. The first schematic (below) shows an example for a common NASA Earthdata set format, where each file contains data for one timestamp, as well as spatial dimensions such as *x1*=latitude, *x2*=longitude. We want to apply a function *F(x1,x2)* to each file. Alternately, each file could correspond to a satellite orbit, and *x1, x2* are the satellite cross-track and along-track dimensions.\n", + "This notebook covers two types of parallel computations that will be applicable to many earth data analyses. Determining which one is needed for your analysis is key to receiving the performance benefits of parallel computing. \n", + "\n", + "1. **Function replication:** A computation which needs to be replicated many times, such as applying the same computation to 1000 files. The first schematic (below) shows an example for a common NASA Earthdata set format, where each file contains data for one timestamp, as well as spatial dimensions such as *x1*=latitude, *x2*=longitude. We want to apply a function *F(x1,x2)* to each file. Alternately, each file could correspond to a satellite orbit, and *x1, x2* are the satellite cross-track and along-track dimensions.\n", "\n", "\"sch1\"\n", "\n", "

\n", "\n", - "2. A computation which cannot trivially be replicated over multiple files, or over parts of a single file. In the example of the NASA Earthdata set, where each file corresponds to a separate time stamp, this type of parallelization challenge could correspond to taking the mean and standard deviation over time at each latitude, longitude grid point (second schematic below). In this case, data from all the files is required to compute these quantities. Another example is an empirical orthogonal function (EOF) analysis, which needs to be performed on the entire 3D dataset as it extracts key modes of variability in both the time and spatial dimensions (third schematic).\n", + "2. **Large dataset chunking:** For cases where a large data set needs to be worked on as a whole. This could e.g. because the function works on some or all of the data from each of the files, so we cannot work on each file independently. In the example of the NASA Earthdata set, where each file corresponds to a separate time stamp, this type of parallelization challenge could correspond to taking the mean and standard deviation over time at each latitude, longitude grid point (first schematic below). Another example is an empirical orthogonal function (EOF) analysis, which needs to be performed on the entire 3D dataset as it extracts key modes of variability in both the time and spatial dimensions (second schematic below).\n", "\n", "

\n", "\n", @@ -38,11 +39,12 @@ "\n", "\"sch3\"\n", "\n", - "This notebook covers basic examples of both cases 1 and 2 using Dask. In two subsequent notebooks, more complex examples of each are demo'd. However, in both cases the underlying ideas covered in this notebook will be the foundation of the workflows.\n", + "This notebook covers basic examples of both cases using Dask. In subsequent notebooks, more complex examples of each are covered. However, in both cases the underlying ideas covered in this notebook will be the foundation of the workflows.\n", "\n", "\n", "### Requirements, prerequisite knowledge, learning outcomes\n", "#### Requirements to run this notebook\n", + "\n", "* Run this notebook in an EC2 instance in us-west-2. **It is recommended to have a minimum of an m6i.4xlarge EC2 type for this demo, in order to start local clusters with the number of processors and memory per worker we used.**\n", "* Have an Earthdata Login account.\n", "\n", @@ -57,7 +59,15 @@ "#### Learning outcomes\n", "This notebook demonstrates two methods to parallelize analyses which access NASA Earthdata directly in the cloud. The first method is used to compute the timeseries of global mean sea surface temperatures using the Multiscale Ultrahigh Resolution (MUR) Global Foundation Sea Surface Temperature data set (a gridded SST product), https://doi.org/10.5067/GHGMR-4FJ04. The second method uses the same data set to compute a 2D spatial map of SST's at each grid point. \n", "\n", - "In both cases the analyses are parallelized on a local cluster (e.g. using the computing power of only the specific EC2 instance spun up). This notebook does not cover using multi-node, distributed clusters (e.g. combining computing power of multiple VMs at once). \n" + "In both cases the analyses are parallelized on a local cluster (e.g. using the computing power of only the specific EC2 instance spun up). This notebook does not cover using multi-node, distributed clusters (e.g. combining computing power of multiple VMs at once)." + ] + }, + { + "cell_type": "markdown", + "id": "05ece8e1-5a0e-4eaa-943f-013d27a6c0ad", + "metadata": {}, + "source": [ + "## Import packages" ] }, { @@ -808,7 +818,7 @@ "id": "874e0285-dc71-40df-9082-54d3e33ba813", "metadata": {}, "source": [ - "# 2. Parallelization use case 1: computation which is easily iterable over multiple files\n", + "# 2. Function replication\n", "\n", "The example computation used here is computing the global mean at each time stamp. Since each MUR file corresponds to one time stamp, the task is straightforward: replicate the computation once per file. \n", "\n", @@ -1099,7 +1109,7 @@ "id": "d19cd491-48e2-405f-939c-3b4c7a095768", "metadata": {}, "source": [ - "# 3. Parallelization use case 2: computation which is not easily iterable over multiple files\n", + "# 3. Large dataset chunking\n", "\n", "The example computation used here is the temporal mean of SST calculated at each grid point. Since each file corresponds to a timestamp, we need data accross all files to compute the mean at each grid point. Therefore, the previous parallelization method would not work well. 
" ] @@ -2925,7 +2935,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.8.13" + "version": "3.12.3" } }, "nbformat": 4, diff --git a/notebooks/Advanced_cloud/coiled_cluster_01.ipynb b/notebooks/Advanced_cloud/coiled_cluster_01.ipynb index f8298f24..b9fcc191 100644 --- a/notebooks/Advanced_cloud/coiled_cluster_01.ipynb +++ b/notebooks/Advanced_cloud/coiled_cluster_01.ipynb @@ -20,13 +20,14 @@ "\"sch1\"\n", "\"sch1\"\n", "\n", - "In a previous notebook, a toy example was used to demonstrate this basic functionality using a local dask cluster and Xarray built-in functions to work on the data set in chunks. In this notebook, we expand that workflow to a more complex analysis, representing something closer to a real-world use-case. In this notebook, we parallelize computations using the third party software/package `Coiled`. In short, `Coiled` will allow us to spin up AWS virtual machines (EC2 instances) and create a distributed cluster out of them, all with a few lines of Python from within this notebook. *You will need a Coiled account, but once set up, you can run this notebook entirely from your laptop while the parallel computation portion will be run on the distributed cluster in AWS.* \n", + "In a previous notebook, a toy example was used to demonstrate this basic functionality using a local dask cluster and Xarray built-in functions to work on the data set in chunks. In this notebook, that workflow is expanded to a more complex analysis. Parallel computations are performed via the third party software/package `Coiled`. In short, `Coiled` allows us to spin up AWS virtual machines (EC2 instances) and create a distributed cluster out of them, all with a few lines of Python from within a notebook. *You will need a Coiled account, but once set up, you can run this notebook entirely from your laptop while the parallel computation portion will be run on the distributed cluster in AWS.* \n", "\n", "\n", "#### Analysis: Mean Seasonal Cycle of SST Anomalies\n", "\n", "The analysis will generate the mean seasonal cycle of sea surface temperature (SST) at each gridpoint in a region of the west coast of the U.S.A. \n", - "The analysis uses a PO.DAAC hosted gridded global SST data sets:\n", + "The analysis uses a PO.DAAC hosted gridded global SST data set:\n", + "\n", "* GHRSST Level 4 MUR Global Foundation SST Analysis, V4.1: 0.01° x 0.01° resolution, global map, daily files, https://doi.org/10.5067/GHGMR-4FJ04\n", "\n", "The analysis will use files over the first decade of the time record. The following procedure is used to generate seasonal cycles:\n", @@ -34,18 +35,20 @@ "\"sch_sst-ssh-corr\"\n", "\n", "\n", - "In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about \\\\$0.20. In Section 3, we perform the computation on all the files in the first decade, ~4000 files, ~10 TB of uncompressed data, for about \\\\$3.\n", + "In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. 
In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about $\\$$ 0.20. In Section 3, we perform the computation on all the files in the first decade, ~4000 files, ~10 TB of uncompressed data, for about $\\$$ 3.\n", "\n", "\n", "## Requirements, prerequisite knowledge, learning outcomes\n", "\n", "#### Requirements to run this notebook\n", + "\n", "* **Earthdata login account:** An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. \n", "* **Coiled account:** Create a coiled account (free to sign up), and connect it to an AWS account. For more information on Coiled, setting up an account, and connecting it to an AWS account, see their website https://www.coiled.io. \n", "* **Compute environment:** This notebook can be run either in the cloud (AWS instance running in us-west-2), or on a local compute environment (e.g. laptop, server), but the data loading step currently works substantially faster in the cloud. In both cases, the parallel computations are still sent to VM's in the cloud.\n", "\n", "\n", "#### Prerequisite knowledge\n", + "\n", "* The [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n", "\n", "#### Learning outcomes\n", @@ -655,7 +658,7 @@ } ], "source": [ - "fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file objects from the endpoints which are compatible with Xarray\n", + "fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file-like objects compatible with Xarray\n", "sst_test = xr.open_dataset(fileobj_test)['analysed_sst']\n", "sst_test" ] @@ -1616,9 +1619,9 @@ "sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n", "\n", "## Remove linear warming trend:\n", - "p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n", - "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n", - "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n", + "p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n", + "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n", + "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n", "\n", "## Mean seasonal cycle:\n", "seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")" @@ -1806,7 +1809,7 @@ } ], "source": [ - "fileobjs = earthaccess.open(datainfo) # Generate file objects from the endpoints which are compatible with Xarray" + "fileobjs = earthaccess.open(datainfo) # Generate file-like objects compatible with Xarray" ] }, { @@ -2564,15 +2567,15 @@ "## ----------------\n", "## Set up analysis\n", "## ----------------\n", - "## (Since we're dealing with dask arrays, these functions calls don't do the computations yet, just set them up)\n", + "## (Since these are dask arrays, functions calls don't do the computations yet, just set them up)\n", "\n", "## Subset to region off U.S.A. 
west coast:\n", "sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n", "\n", "## Remove linear warming trend:\n", - "p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n", - "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n", - "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n", + "p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n", + "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n", + "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n", "\n", "## Mean seasonal cycle:\n", "seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")" diff --git a/notebooks/Advanced_cloud/coiled_function_01.ipynb b/notebooks/Advanced_cloud/coiled_function_01.ipynb index e3486323..bb00a680 100644 --- a/notebooks/Advanced_cloud/coiled_function_01.ipynb +++ b/notebooks/Advanced_cloud/coiled_function_01.ipynb @@ -25,6 +25,7 @@ "#### SST-SSH Correlation Analysis\n", "\n", "The analysis will generate global maps of spatial correlation between sea surface temperature (SST) and sea surface height (SSH). The analysis uses PO.DAAC hosted, gridded SSH and SST data sets:\n", + "\n", "* MEaSUREs gridded SSH Version 2205: 0.17° x 0.17° resolution, global map, one file per 5-days, https://doi.org/10.5067/SLREF-CDRV3\n", "* GHRSST Level 4 MW_OI Global Foundation SST, V5.0: 0.25° x 0.25° resolution, global map, daily files, https://doi.org/10.5067/GHMWO-4FR05\n", "\n", @@ -38,13 +39,14 @@ "## Requirements, prerequisite knowledge, learning outcomes\n", "\n", "#### Requirements to run this notebook\n", + "\n", "* **Earthdata login account:** An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. \n", "* **Coiled account:** Create a coiled account (free to sign up), and connect it to an AWS account. For more information on Coiled, setting up an account, and connecting it to an AWS account, see their website https://www.coiled.io. \n", "* **Compute environment:** This notebook can be run either in the cloud (AWS instance running in us-west-2), or on a local compute environment (e.g. laptop, server), and will run equally as well in either. In both cases, the parallel computations are still sent to VM's in the cloud.\n", "\n", "\n", "#### Prerequisite knowledge\n", - "* Please make sure you are comfortable with the [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n", + "* The [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n", "\n", "#### Learning outcomes\n", "This notebook demonstrates how to use Coiled with a distributed cluster to replicate a function over many files in parallel. You will get better insight on how to apply this workflow to your own analysis." @@ -143,8 +145,8 @@ "def spatialcorr(x, y, p1, p2):\n", " \"\"\"\n", " Correlation between two 2D variables p1(x, y), p2(x, y), over the domain (x, y). 
Correlation is \n", - " computed between the anomalies of p1, p2, where anomalies for each variables are the deviations from \n", - " respective linear 2D surface fits.\n", + " computed between the anomalies of p1, p2, where anomalies for each variables are the deviations \n", + " from respective linear 2D surface fits.\n", " \"\"\"\n", " # Compute anomalies:\n", " ssha, _ = anomalies_2Dsurf(x, y, p1) # See function further down.\n", @@ -152,8 +154,9 @@ " \n", " # Compute correlation coefficient:\n", " a, b = ssta.flatten(), ssha.flatten()\n", - " if ( np.nansum(abs(a))==0 ) or ( np.nansum(abs(b))==0 ): # For cases where all anomalies are 0, the correlation \n", - " # should be 0. Numpy computes this correctly but throws a lot of warnings. So instead, we manually append 0.\n", + " if ( np.nansum(abs(a))==0 ) or ( np.nansum(abs(b))==0 ): \n", + " # For cases with all 0's, the correlation should be 0. Numpy computes this correctly \n", + " # but throws a lot of warnings. So instead, manually append 0.\n", " return 0\n", " else:\n", " return np.nanmean(a*b)/np.sqrt(np.nanvar(a) * np.nanvar(b))\n", @@ -176,11 +179,11 @@ " va, vm: 2D NumPy arrays\n", " Anomalies (va) and mean surface fit (vm).\n", " \"\"\"\n", - " # Functions to (1) output a 2D surface and (2) output the difference between 2D data and the computed surface:\n", + " # Function for a 2D surface:\n", " def surface(c,x0,y0): # Takes independent vars and poly coefficients\n", " a,b,c=c\n", " return a + b*x0 + c*y0\n", - " \n", + " # Function for the difference between data and the computed surface:\n", " def err(c,x0,y0,p): # Takes independent/dependent vars and poly coefficients\n", " a,b,c=c\n", " return p - (a + b*x0 + c*y0 )\n", @@ -216,23 +219,23 @@ "def spatial_corrmap(grans, lat_halfwin, lon_halfwin, lats=None, lons=None, f_notnull=0.5):\n", " \"\"\"\n", " Get a 2D map of SSH-SST spatial correlation coefficients, for one each of the \n", - " SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205 and MW_OI-REMSS-L4-GLOB-v5.0 collections. \n", - " At each gridpoint, the spatial correlation is computed over a lat, lon window of size \n", - " 2*lat_halfwin x 2*lon_halfwin. Correlation is computed from the SSH, SST anomalies, which are computed \n", - " in turn as the deviations from a fitted 2D surface over the window.\n", + " SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205 and MW_OI-REMSS-L4-GLOB-v5.0 \n", + " collections. At each gridpoint, the spatial correlation is computed over a lat, lon window of size \n", + " 2*lat_halfwin x 2*lon_halfwin. Correlation is computed from the SSH, SST anomalies, which are \n", + " computed in turn as the deviations from a fitted 2D surface over the window.\n", " \n", " Inputs\n", " ------\n", " grans: 2-tuple of earthaccess.results.DataGranule objects\n", - " Granule info for the SSH, SST files (in that order). These objects contain https and S3 locations.\n", + " Metadata for the SSH, SST files. These objects contain https and S3 locations.\n", " lat_halfwin, lon_halfwin: floats\n", " Half window size in degrees for latitude and longitude dimensions, respectively.\n", " lats, lons: None or 1D array-like\n", " Latitude, longitude gridpoints at which to compute the correlations. \n", " If None, will use the SSH grid.\n", " f_notnull: float between 0-1 (default = 0.5)\n", - " Threshold fraction of non-NAN values in a window in order for the correlation to be computed,\n", - " otherwise NAN is returned for that grid point. 
For edge cases, 'ghost' elements are counted as NAN.\n", + " Threshold fraction of non-NAN values in a window, otherwise the correlation is not computed, \n", + " and NAN is returned for that grid point. For edge cases, 'ghost' elements are counted.\n", "\n", " Returns\n", " ------\n", @@ -280,7 +283,10 @@ " lat_top = lat_cen + lat_halfwin\n", " lon_left = lon_cen - lon_halfwin\n", " lon_right = lon_cen + lon_halfwin\n", - " data_win = mergeddata.sel(Longitude=slice(lon_left, lon_right), Latitude=slice(lat_bottom, lat_top))\n", + " data_win = mergeddata.sel(\n", + " Longitude=slice(lon_left, lon_right), \n", + " Latitude=slice(lat_bottom, lat_top)\n", + " )\n", " \n", " # If number of non-nan values in window is less than threshold \n", " # value, append np.nan, else compute correlation coefficient:\n", @@ -288,7 +294,10 @@ " if n_notnul < n_thresh:\n", " coef.append(np.nan)\n", " else:\n", - " c = spatialcorr(data_win['Longitude'], data_win['Latitude'], data_win['SLA'].data, data_win['analysed_sst'].data)\n", + " c = spatialcorr(\n", + " data_win['Longitude'], data_win['Latitude'], \n", + " data_win['SLA'].data, data_win['analysed_sst'].data\n", + " )\n", " coef.append(c)\n", " \n", " return np.array(coef).reshape((len(lats), len(lons))), np.array(lats), np.array(lons)" @@ -343,8 +352,14 @@ "source": [ "## Granule info for all files in 2018:\n", "dt2018 = (\"2018-01-01\", \"2018-12-31\")\n", - "grans_ssh = earthaccess.search_data(short_name=\"SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205\", temporal=dt2018)\n", - "grans_sst = earthaccess.search_data(short_name=\"MW_OI-REMSS-L4-GLOB-v5.0\", temporal=dt2018)" + "grans_ssh = earthaccess.search_data(\n", + " short_name=\"SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205\", \n", + " temporal=dt2018\n", + " )\n", + "grans_sst = earthaccess.search_data(\n", + " short_name=\"MW_OI-REMSS-L4-GLOB-v5.0\", \n", + " temporal=dt2018\n", + " )" ] }, { @@ -424,7 +439,7 @@ "source": [ "## 3. Test the computation on a pair of files, output on a coarse resolution grid\n", "\n", - "To verify the functions work, we test them on the first pair of files. To reduce computation time, we compute them for a 2 degree x 2 degree output grid for now." + "To verify the functions work, test them on the first pair of files. To reduce computation time, we compute them for a 2 degree x 2 degree output grid for now." ] }, { @@ -701,10 +716,13 @@ "source": [ "def corrmap_toda(grans, lat_halfwin=3, lon_halfwin=3, lats=None, lons=None, f_notnull=0.5):\n", " \"\"\"\n", - " Calls spatial_corrmap() for a pair of SSH, SST granules and collects output into an Xarray DataArray. \n", - " Returns this along with the date of the file coverages.\n", + " Calls spatial_corrmap() for a pair of SSH, SST granules and collects output into an \n", + " Xarray DataArray. Returns this along with the date of the file coverages.\n", " \"\"\"\n", - " coef, lats, lons = spatial_corrmap(grans, lat_halfwin, lon_halfwin, lats=lats, lons=lons, f_notnull=0.5)\n", + " coef, lats, lons = spatial_corrmap(\n", + " grans, lat_halfwin, lon_halfwin, \n", + " lats=lats, lons=lons, f_notnull=0.5\n", + " )\n", " date = grans[0]['umm']['GranuleUR'].split(\"_\")[-1][:8] # get date from SSH UR.\n", " return date, xr.DataArray(data=coef, dims=[\"lat\", \"lon\"], coords=dict(lon=lons, lat=lats), name='corr_ssh_sst')" ] @@ -807,19 +825,22 @@ "# Wrap function in a Coiled function. 
Here is where we make cluster specifications like VM type:\n", "corrmap_toda_parallel = coiled.function(\n", " region=\"us-west-2\", spot_policy=\"on-demand\", vm_type=\"m6i.large\", \n", - " environ=earthaccess.auth_environ(), # This ensures that our EDL credentials are passed to each worker.\n", - " n_workers=73 # Optional: have the cluster manually scale to 73 VM's. Omit this line to have Coiled adaptively scale\n", + " environ=earthaccess.auth_environ(), # Ensure that our EDL credentials are passed to each worker.\n", + " n_workers=73 # Optional: have the cluster manually scale to 73 VM's. Omit for adaptive scaling.\n", " )(corrmap_toda)\n", "\n", "# Begin computations:\n", "grans_2tuples = list(zip(grans_ssh_analyze, grans_sst_analyze))\n", - "results = corrmap_toda_parallel.map(grans_2tuples, lat_halfwin=3, lon_halfwin=3, lats=lats, lons=lons, f_notnull=0.5)\n", + "results = corrmap_toda_parallel.map(\n", + " grans_2tuples, lat_halfwin=3, lon_halfwin=3, \n", + " lats=lats, lons=lons, f_notnull=0.5\n", + " )\n", "\n", "# Retreive the results from the cluster as they become available and save as .nc files locally:\n", "for date, result_da in results:\n", " result_da.to_netcdf(dir_results+\"spatial-corr-map_ssh-sst_\" + date + \".nc\")\n", "\n", - "# Since we manually scaled up the cluster, we have to manually scale it back down if we want to stop using resources:\n", + "# Since we manually scaled up the cluster, manually scale it back down to stop using resources:\n", "corrmap_toda_parallel.cluster.scale(1)\n", "\n", "\n", @@ -1493,7 +1514,7 @@ "\n", "Only Section 1 needs to be run prior to this. Sections 2-4 can be skipped.\n", "\n", - "This section mirrors the workflow of Section 4, but processes all 1808 pairs of files, spanning a little over two decades, at higher resolution. To get an estimate of how long this would take without parallel computing, you can re-run section 3 but replace a value of 0.5 for the higher_res variable with 0.25 (in the part where we estimate comp times). Trying this on a few machines, we get that it would take anywhere from 21 to 34 hours to process the record over a single year, which means for 22 years it would take 19 to 31 days to complete the entire record. When we used this code to run the computation in parallel on a 250 *t4g.small* instances, computation took us instead ~5 hours and cost ~$20, obtaining the following mean map for the full record:\n", + "This section mirrors the workflow of Section 4, but processes all 1808 pairs of files, spanning a little over two decades, at higher resolution. To get an estimate of how long this would take without parallel computing, you can re-run section 3 but replace a value of 0.5 for the higher_res variable with 0.25 (in the part where we estimate comp times). Trying this on a few machines, we get that it would take anywhere from 21 to 34 hours to process the record over a single year, which means for 22 years it would take 19 to 31 days to complete the entire record. When we used this code to run the computation in parallel on a 250 t4g.small instances, computation took us instead ~5 hours and cost ~$\\$$ 20, obtaining the following mean map for the full record:\n", "\n", "![image.png](attachment:210f8467-aef9-4240-bd2d-ffc11c955094.png)" ] @@ -1558,10 +1579,13 @@ "source": [ "def corrmap_toda(grans, lat_halfwin=3, lon_halfwin=3, lats=None, lons=None, f_notnull=0.5):\n", " \"\"\"\n", - " Calls spatial_corrmap() for a pair of SSH, SST granules and collects output into an Xarray DataArray. 
\n", - " Returns this along with the date of the file coverages.\n", + " Calls spatial_corrmap() for a pair of SSH, SST granules and collects output into an \n", + " Xarray DataArray. Returns this along with the date of the file coverages.\n", " \"\"\"\n", - " coef, lats, lons = spatial_corrmap(grans, lat_halfwin, lon_halfwin, lats=lats, lons=lons, f_notnull=0.5)\n", + " coef, lats, lons = spatial_corrmap(\n", + " grans, lat_halfwin, lon_halfwin, \n", + " lats=lats, lons=lons, f_notnull=0.5\n", + " )\n", " date = grans[0]['umm']['GranuleUR'].split(\"_\")[-1][:8] # get date from SSH UR.\n", " return date, xr.DataArray(data=coef, dims=[\"lat\", \"lon\"], coords=dict(lon=lons, lat=lats), name='corr_ssh_sst')" ] @@ -1618,13 +1642,16 @@ "corrmap_toda_parallel = coiled.function(\n", " region=\"us-west-2\", spot_policy=\"on-demand\", \n", " vm_type=\"t4g.small\", # General purpose vm that strikes a balance between performance and cost.\n", - " environ=earthaccess.auth_environ(), # This ensures that our EDL credentials are passed to each worker.\n", - " n_workers=[0,250] # Have cluster adaptively scale to at most 250 VM's. Will go back to 0 once work is finished.\n", + " environ=earthaccess.auth_environ(), # Ensure our EDL credentials are passed to each worker.\n", + " n_workers=[0,250] # Adaptively scale to at most 250 VM's. Will go back to 0 once finished.\n", " )(corrmap_toda)\n", "\n", "# Begin computations:\n", "grans_2tuples = list(zip(grans_ssh_analyze, grans_sst_analyze))\n", - "results = corrmap_toda_parallel.map(grans_2tuples, lat_halfwin=3, lon_halfwin=3, lats=lats, lons=lons, f_notnull=0.5)\n", + "results = corrmap_toda_parallel.map(\n", + " grans_2tuples, lat_halfwin=3, lon_halfwin=3, \n", + " lats=lats, lons=lons, f_notnull=0.5\n", + " )\n", "\n", "# Retreive the results from the cluster as they become available and save as .nc files locally:\n", "for date, result_da in results:\n", diff --git a/notebooks/Advanced_cloud/dask_cluster_01.ipynb b/notebooks/Advanced_cloud/dask_cluster_01.ipynb index a715d152..e65633c2 100644 --- a/notebooks/Advanced_cloud/dask_cluster_01.ipynb +++ b/notebooks/Advanced_cloud/dask_cluster_01.ipynb @@ -25,7 +25,8 @@ "#### Analysis: Mean Seasonal Cycle of SST Anomalies\n", "\n", "The analysis will generate the mean seasonal cycle of sea surface temperature (SST) at each gridpoint in a region of the west coast of the U.S.A. \n", - "The analysis uses a PO.DAAC hosted gridded global SST data sets:\n", + "The analysis uses a PO.DAAC hosted gridded global SST data set:\n", + "\n", "* GHRSST Level 4 MUR Global Foundation SST Analysis, V4.1: 0.01° x 0.01° resolution, global map, daily files, https://doi.org/10.5067/GHGMR-4FJ04\n", "\n", "The analysis will use files over the first decade of the time record. The following procedure is used to generate seasonal cycles:\n", @@ -33,18 +34,20 @@ "\"sch_sst-ssh-corr\"\n", "\n", "\n", - "In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about \\\\$0.10. 
In Section 3, we perform the computation on all the files in the first decade, ~4000 files, ~10 TB of uncompressed data, for about \\\\$3.\n", + "In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about $\\$$ 0.10. In Section 3, we perform the computation on all the files in the first decade, roughly 4000 files and 10 TB of uncompressed data, for $\\$$ 3.\n", "\n", "\n", "## Requirements, prerequisite knowledge, learning outcomes\n", "\n", "#### Requirements to run this notebook\n", + "\n", "* **Earthdata login account:** An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. \n", "* **Compute environment:** This notebook can technically be run either in the cloud (AWS instance running in us-west-2), or on a local compute environment (e.g. laptop, server). However, running in the cloud is recommended, since it is unlikely that a laptop has the power to replicate the compute times we quote here. If running in a local environment, the number of workers spun up Section 4 will likely need to be decreased.\n", "* **VM type:** For running in AWS (recommended), we used general purpose VM types for Sections 1 and 2, then switched to memory-optimized VM's for the large computations of Section 3 (more on this later). Since the cost per hour of these instances go up in size, we recommend the following workflow to explore this notebook.\n", - "> 1. Start by running Section 1 in a lower cost m6i.xlarge instance (fractions of a $1 per hour).\n", - "> 2. Section 2 includes the \"medium-sized\" parallel computations. For this, we ran Sections 1 and 2 on a `m7g.8xlarge` instance. At the time this notebook was written, this VM type took ~2-3 minutes to run the computations, and cost ~\\\\$1.3/hr.\n", - "> 3. For Optional Section 3, we ran using several memory optimized instances, including `r7a.32xlarge`, `r7i.48xlarge`, `r7iz.16xlarge` VMs.\n", + " \n", + " 1. Start by running Section 1 in a lower cost m6i.xlarge instance (fractions of a $\\$$ 1 per hour).\n", + " 2. Section 2 includes the \"medium-sized\" parallel computations. For this, we ran Sections 1 and 2 on a m7g.8xlarge instance. At the time this notebook was written, this VM type took ~2-3 minutes to run the computations, and cost ~$\\$$ 1.3/hr.\n", + " 3. 
For Optional Section 3, we ran using several memory optimized instances, including r7a.32xlarge, r7i.48xlarge, r7iz.16xlarge VMs.\n", "\n", "\n", "#### Prerequisite knowledge\n", @@ -658,7 +661,7 @@ } ], "source": [ - "fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file objects from the endpoints which are compatible with Xarray\n", + "fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file-like objects compatible with Xarray\n", "sst_test = xr.open_dataset(fileobj_test)['analysed_sst']\n", "sst_test" ] @@ -1571,9 +1574,9 @@ "sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n", "\n", "## Remove linear warming trend:\n", - "p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n", - "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n", - "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n", + "p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n", + "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n", + "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n", "\n", "## Mean seasonal cycle:\n", "seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")" @@ -1694,6 +1697,7 @@ "For this computation, [memory-optimized VMs](https://aws.amazon.com/ec2/instance-types/) were chosen (they have high memory per CPU). This [Coiled post](https://docs.coiled.io/blog/coiled-xarray.html) was used as inspiration, where it is found that having larger chunks are more efficient, which is why memory-optimized VMs are needed to handle the larger chunks. Additionally, we found that having 32 GB per worker, regardless of the worker number, was necessary to complete the computations without receiving memory errors.\n", "\n", "**VM runs that worked**\n", + "\n", "* `r7a.32xlarge` (128 CPUs, 1024 GiB memory): 32 workers each with 32 GB of memory; completed in 13 minutes for \\\\$3\n", "* `r7i.48xlarge` (192 CPUs, 1,536 GiB memory): 48 workers each with 32 GB of memory; completed in 13 minutes for \\\\$4\n", "* `r7iz.16xlarge` (64 CPUs, 512 GiB memory): 16 workers each with 32 GB of memory; completed in 18 minutes for \\\\$2.50" @@ -1757,7 +1761,7 @@ } ], "source": [ - "fileobjs = earthaccess.open(datainfo) # Generate file objects from the endpoints which are compatible with Xarray" + "fileobjs = earthaccess.open(datainfo) # Generate file-like objects compatible with Xarray" ] }, { @@ -2473,15 +2477,15 @@ "## ----------------\n", "## Set up analysis\n", "## ----------------\n", - "## (Since we're dealing with dask arrays, these functions calls don't do the computations yet, just set them up)\n", + "## (Since these are dask arrays, functions calls don't do the computations yet, just set them up)\n", "\n", "## Subset to region off U.S.A. 
west coast:\n", "sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n", "\n", "## Remove linear warming trend:\n", - "p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n", - "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n", - "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n", + "p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n", + "fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n", + "sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n", "\n", "## Mean seasonal cycle:\n", "seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")" diff --git a/notebooks/Advanced_cloud/dask_delayed_01.ipynb b/notebooks/Advanced_cloud/dask_delayed_01.ipynb index 3dcb9709..ae804d21 100644 --- a/notebooks/Advanced_cloud/dask_delayed_01.ipynb +++ b/notebooks/Advanced_cloud/dask_delayed_01.ipynb @@ -15,9 +15,10 @@ "\n", "\"sch1\"\n", "\n", - "In the previous notebook, a toy example was used to demonstrate this basic functionality using a local cluster and `dask.delayed()`. In this notebook, we expand that workflow to a more complex analysis, representing something closer to a real-world use-case.\n", + "In the previous notebook, a toy example was used to demonstrate this basic functionality using a local cluster and `dask.delayed()`. In this notebook, the workflow is used on a more complex analysis.\n", "\n", "The analysis will generate global maps of spatial correlation between sea surface temperature (SST) and sea surface height (SSH). The analysis uses PO.DAAC hosted, gridded SSH and SST data sets:\n", + "\n", "* MEaSUREs gridded SSH Version 2205: 0.17° x 0.17° resolution, global map, one file per 5-days, https://doi.org/10.5067/SLREF-CDRV3\n", "* GHRSST Level 4 MW_OI Global Foundation SST, V5.0: 0.25° x 0.25° resolution, global map, daily files, https://doi.org/10.5067/GHMWO-4FR05\n", "\n", @@ -25,20 +26,23 @@ "\n", "\"sch_sst-ssh-corr\"\n", "\n", - "This notebook will first define the functions to read in the data and perform the computations, then test them on a single file. Next a smaller parallel computation will be performed on all pairs of files in 2018 (73 pairs in total), reducing what would have otherwise taken hours to minutes instead. Finally, an optional section will demonstrate what was used to perform the full computation on all 1808 pairs of files at 0.25 degree resolution. \n", + "This notebook will first define the functions to read in the data and perform the computations, then test them on a single file. Next a smaller parallel computation will be performed on all pairs of files in 2018 (73 pairs in total), reducing what would have otherwise taken hours to minutes instead. Finally, the last section presents the code used to perform the full computation on all 1808 pairs of files at 0.25 degree resolution. \n", "\n", "## Requirements, prerequisite knowledge, learning outcomes\n", "\n", "#### Requirements to run this notebook\n", + "\n", "* **Earthdata login account:** An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. 
\n", "* **Compute environment:** This notebook can technically be run either in the cloud (AWS instance running in us-west-2), or on a local compute environment (e.g. laptop, server). However, running in the cloud is recommended, since it is unlikely that a laptop has the power to replicate the compute times we quote here. If running in a local environment, the number of workers spun up Section 4 will likely need to be decreased.\n", "* **VM type:** For running in AWS (recommended), we used C7i instances, which are \"compute-optimized\" VM's well-suited to the particular computations performed here (more on this later). Since the cost per hour of these instances go up in size, we recommend the following workflow to explore this notebook.\n", - "> 1. Start by running Sections 1 - 3 in a low cost c7i.2xlarge instance (fractions of a $1 per hour).\n", - "> 2. Parallel computations start in Section 4. For this, we ran Sections 1-4 with a c7i.24xlarge. At the time this notebook was written, this VM type took 7-10 minutes to run the computations, and cost ~\\\\$4/hr. You can run a smaller, less expensive VM type, but will need to change one line of code in Section 4.\n", - "> 3. For Optional Section 5, we ran using a c7i.24xlarge, although you could try running it with a c7i.48xlarge (double the vCPUs and double the cost).\n", + "\n", + " 1. Start by running Sections 1 - 3 in a low cost c7i.2xlarge instance (fractions of a \\$1 per hour).\n", + " 2. Parallel computations start in Section 4. For this, we ran Sections 1-4 with a c7i.24xlarge. At the time this notebook was written, this VM type took 7-10 minutes to run the computations, and cost ~\\$4/hr. You can run a smaller, less expensive VM type, but will need to change one line of code in Section 4.\n", + " 3. For Optional Section 5, we ran using a c7i.24xlarge, although you could try running it with a c7i.48xlarge (double the vCPUs and double the cost).\n", "\n", "#### Prerequisite knowledge\n", - "* Please make sure you are comfortable with the [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n", + "\n", + "* The [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n", "\n", "#### Learning outcomes\n", "This notebook demonstrates how to use `dask.delayed()` with a local cluster on an analysis mirroring what someone might want to do in a real-world setting. As such, you will get better insight on how to apply this workflow to your own analysis. Further, this notebook touches briefly on choosing VM types, which may be many user's first introduction to the topic. " @@ -138,8 +142,8 @@ "def spatialcorr(x, y, p1, p2):\n", " \"\"\"\n", " Correlation between two 2D variables p1(x, y), p2(x, y), over the domain (x, y). 
Correlation is \n", - " computed between the anomalies of p1, p2, where anomalies for each variables are the deviations from \n", - " respective linear 2D surface fits.\n", + " computed between the anomalies of p1, p2, where anomalies for each variables are the deviations \n", + " from respective linear 2D surface fits.\n", " \"\"\"\n", " # Compute anomalies:\n", " ssha, _ = anomalies_2Dsurf(x, y, p1) # See function further down.\n", @@ -147,8 +151,9 @@ " \n", " # Compute correlation coefficient:\n", " a, b = ssta.flatten(), ssha.flatten()\n", - " if ( np.nansum(abs(a))==0 ) or ( np.nansum(abs(b))==0 ): # For cases where all anomalies are 0, the correlation \n", - " # should be 0. Numpy computes this correctly but throws a lot of warnings. So instead, we manually append 0.\n", + " if ( np.nansum(abs(a))==0 ) or ( np.nansum(abs(b))==0 ): \n", + " # For cases with all 0's, the correlation should be 0. Numpy computes this correctly \n", + " # but throws a lot of warnings. So instead, manually append 0.\n", " return 0\n", " else:\n", " return np.nanmean(a*b)/np.sqrt(np.nanvar(a) * np.nanvar(b))\n", @@ -171,11 +176,11 @@ " va, vm: 2D NumPy arrays\n", " Anomalies (va) and mean surface fit (vm).\n", " \"\"\"\n", - " # Functions to (1) output a 2D surface and (2) output the difference between 2D data and the computed surface:\n", + " # Function for a 2D surface:\n", " def surface(c,x0,y0): # Takes independent vars and poly coefficients\n", " a,b,c=c\n", " return a + b*x0 + c*y0\n", - " \n", + " # Function for the difference between data and the computed surface:\n", " def err(c,x0,y0,p): # Takes independent/dependent vars and poly coefficients\n", " a,b,c=c\n", " return p - (a + b*x0 + c*y0 )\n", @@ -203,7 +208,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 1, "id": "f7e23d8b-0c6a-4c39-943e-4c37b6cc0177", "metadata": {}, "outputs": [], @@ -211,23 +216,23 @@ "def spatial_corrmap(grans, lat_halfwin, lon_halfwin, lats=None, lons=None, f_notnull=0.5):\n", " \"\"\"\n", " Get a 2D map of SSH-SST spatial correlation coefficients, for one each of the \n", - " SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205 and MW_OI-REMSS-L4-GLOB-v5.0 collections. \n", - " At each gridpoint, the spatial correlation is computed over a lat, lon window of size \n", - " 2*lat_halfwin x 2*lon_halfwin. Correlation is computed from the SSH, SST anomalies, which are computed \n", - " in turn as the deviations from a fitted 2D surface over the window.\n", + " SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205 and MW_OI-REMSS-L4-GLOB-v5.0 \n", + " collections. At each gridpoint, the spatial correlation is computed over a lat, lon window of \n", + " size 2*lat_halfwin x 2*lon_halfwin. Correlation is computed from the SSH, SST anomalies, \n", + " which are computed in turn as the deviations from a fitted 2D surface over the window.\n", " \n", " Inputs\n", " ------\n", " grans: 2-tuple of earthaccess.results.DataGranule objects\n", - " Granule info for the SSH, SST files (in that order). These objects contain https and S3 locations.\n", + " Metadata for the SSH, SST files. These objects contain https and S3 locations.\n", " lat_halfwin, lon_halfwin: floats\n", " Half window size in degrees for latitude and longitude dimensions, respectively.\n", " lats, lons: None or 1D array-like\n", " Latitude, longitude gridpoints at which to compute the correlations. 
\n", " If None, will use the SSH grid.\n", " f_notnull: float between 0-1 (default = 0.5)\n", - " Threshold fraction of non-NAN values in a window in order for the correlation to be computed,\n", - " otherwise NAN is returned for that grid point. For edge cases, 'ghost' elements are counted as NAN.\n", + " Threshold fraction of non-NAN values in a window, otherwise the correlation is not computed, \n", + " and NAN is returned for that grid point. For edge cases, 'ghost' elements are counted.\n", "\n", " Returns\n", " ------\n", @@ -275,7 +280,10 @@ " lat_top = lat_cen + lat_halfwin\n", " lon_left = lon_cen - lon_halfwin\n", " lon_right = lon_cen + lon_halfwin\n", - " data_win = mergeddata.sel(Longitude=slice(lon_left, lon_right), Latitude=slice(lat_bottom, lat_top))\n", + " data_win = mergeddata.sel(\n", + " Longitude=slice(lon_left, lon_right), \n", + " Latitude=slice(lat_bottom, lat_top)\n", + " )\n", " \n", " # If number of non-nan values in window is less than threshold \n", " # value, append np.nan, else compute correlation coefficient:\n", @@ -283,7 +291,10 @@ " if n_notnul < n_thresh:\n", " coef.append(np.nan)\n", " else:\n", - " c = spatialcorr(data_win['Longitude'], data_win['Latitude'], data_win['SLA'].data, data_win['analysed_sst'].data)\n", + " c = spatialcorr(\n", + " data_win['Longitude'], data_win['Latitude'], \n", + " data_win['SLA'].data, data_win['analysed_sst'].data\n", + " )\n", " coef.append(c)\n", " \n", " return np.array(coef).reshape((len(lats), len(lons))), np.array(lats), np.array(lons)" @@ -327,8 +338,14 @@ "source": [ "## Granule info for all files in 2018:\n", "dt2018 = (\"2018-01-01\", \"2018-12-31\")\n", - "grans_ssh = earthaccess.search_data(short_name=\"SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205\", temporal=dt2018)\n", - "grans_sst = earthaccess.search_data(short_name=\"MW_OI-REMSS-L4-GLOB-v5.0\", temporal=dt2018)" + "grans_ssh = earthaccess.search_data(\n", + " short_name=\"SEA_SURFACE_HEIGHT_ALT_GRIDS_L4_2SATS_5DAY_6THDEG_V_JPL2205\", \n", + " temporal=dt2018\n", + " )\n", + "grans_sst = earthaccess.search_data(\n", + " short_name=\"MW_OI-REMSS-L4-GLOB-v5.0\", \n", + " temporal=dt2018\n", + " )" ] }, { @@ -408,7 +425,7 @@ "source": [ "## 3. Test the computation on a pair of files, output on a coarse resolution grid\n", "\n", - "To verify the functions work, we test them on the first pair of files. To reduce computation time, we compute them for a 2 degree x 2 degree output grid for now." + "To verify the functions work, test them on the first pair of files. To reduce computation time, we compute them for a 2 degree x 2 degree output grid for now." ] }, { @@ -665,7 +682,7 @@ "source": [ "## 4. Parallel computations with Dask\n", "\n", - "Sections 1 and 2 need to be run prior to this.\n", + "*Sections 1 and 2 need to be run prior to this.*\n", "\n", "The previous section showed that analyzing a year's worth of data at 0.5 x 0.5 degree output resolution would take hours. In this section, we use `dask.delayed()` and a local cluster to complete this task in 7 - 20 minutes, depending on the VM type used. 
" ] @@ -689,9 +706,16 @@ " \"\"\"\n", " Calls spatial_corrmap() for a pair of SSH, SST granules and saves the results to a netCDF file.\n", " \"\"\"\n", - " coef, lats, lons = spatial_corrmap(grans, lat_halfwin, lon_halfwin, lats=lats, lons=lons, f_notnull=0.5)\n", + " coef, lats, lons = spatial_corrmap(\n", + " grans, lat_halfwin, lon_halfwin, \n", + " lats=lats, lons=lons, f_notnull=0.5\n", + " )\n", " date = grans[0]['umm']['GranuleUR'].split(\"_\")[-1][:8] # get date from SSH UR.\n", - " corrmap_da = xr.DataArray(data=coef, dims=[\"lat\", \"lon\"], coords=dict(lon=lons, lat=lats), name='corr_ssh_sst')\n", + " corrmap_da = xr.DataArray(\n", + " data=coef, dims=[\"lat\", \"lon\"], \n", + " coords=dict(lon=lons, lat=lats), \n", + " name='corr_ssh_sst'\n", + " )\n", " corrmap_da.to_netcdf(dir_results+\"spatial-corr-map_ssh-sst_\" + date + \".nc\")\n", " return" ] @@ -1444,7 +1468,7 @@ "source": [ "## 5. Optional: Parallel computation on full record at 0.25 degree resolution\n", "\n", - "Only Section 1 needs to be run prior to this. Sections 2-4 can be skipped.\n", + "*Only Section 1 needs to be run prior to this. Sections 2-4 can be skipped.*\n", "\n", "This section mirrors the workflow of Section 4, but processes all 1808 pairs of files, spanning a little over two decades, at higher resolution. To get an estimate of how long this would take without parallel computing, you can re-run section 3 but replace a value of *0.5* for the `higher_res` variable with *0.25* (in the part where we estimate comp times). Trying this on a few machines, we get that it would take anywhere from 21 to 34 hours to process the record over a single year, which means for 22 years it would take 19 to 31 days to complete the entire record. When we used this code to run the computation in parallel on a c7i.24xlarge instance, computation took us instead ~10.5 hours and cost ~$45, obtaining the following mean map for the full record:\n", "\n",