Merge pull request #236 from podaac/polish_daskcoiled_20240711
Fixed typos and quarto text incompatibilities. Updated some language.
cassienickles committed Jul 11, 2024
2 parents 4107564 + 0b91b84 commit acb281c
Showing 5 changed files with 164 additions and 96 deletions.
26 changes: 18 additions & 8 deletions notebooks/Advanced_cloud/basic_dask.ipynb
Original file line number Diff line number Diff line change
@@ -23,26 +23,28 @@
"\n",
"Dask is a Python package for parallel processing. With data accessible in the cloud, this capability presents a great opportunity: any user can spin up a virtual machine with more RAM and processors than their local machine/laptop, magnifying the benefits of parallel processing. Great introductions to parallel computing and Dask (brief yet pedagogical) are linked in the Prerequisites section further down, and a reader completely new to these concepts is encouraged to read them.\n",
"\n",
"In the author's experience, many Earthdata analyses will fall into one of two categories, and determining which one applies is key to setting up the appropriate parallelization method. \n",
"1. A computation which simply needs to be replicated many times, such as applying the same computation to 1000 files. The first schematic (below) shows an example for a common NASA Earthdata set format, where each file contains data for one timestamp, as well as spatial dimensions such as *x1*=latitude, *x2*=longitude. We want to apply a function *F(x1,x2)* to each file. Alternately, each file could correspond to a satellite orbit, and *x1, x2* are the satellite cross-track and along-track dimensions.\n",
"This notebook covers two types of parallel computations that are applicable to many Earth data analyses. Determining which one is needed for your analysis is key to realizing the performance benefits of parallel computing. \n",
"\n",
"1. **Function replication:** A computation which needs to be replicated many times, such as applying the same computation to 1000 files. The first schematic (below) shows an example for a common NASA Earthdata set format, where each file contains data for one timestamp, as well as spatial dimensions such as *x1*=latitude, *x2*=longitude. We want to apply a function *F(x1,x2)* to each file. Alternately, each file could correspond to a satellite orbit, and *x1, x2* are the satellite cross-track and along-track dimensions.\n",
"\n",
"<img src=\"../../images/basicdask_schematic1.png\" alt=\"sch1\" width=\"500\"/>\n",
"\n",
"<br/><br/>\n",
"\n",
"2. A computation which cannot trivially be replicated over multiple files, or over parts of a single file. In the example of the NASA Earthdata set, where each file corresponds to a separate time stamp, this type of parallelization challenge could correspond to taking the mean and standard deviation over time at each latitude, longitude grid point (second schematic below). In this case, data from all the files is required to compute these quantities. Another example is an empirical orthogonal function (EOF) analysis, which needs to be performed on the entire 3D dataset as it extracts key modes of variability in both the time and spatial dimensions (third schematic).\n",
"2. **Large dataset chunking:** For cases where a large data set needs to be worked on as a whole. This could be, e.g., because the function needs some or all of the data from each of the files, so we cannot work on each file independently. In the example of the NASA Earthdata set, where each file corresponds to a separate time stamp, this type of parallelization challenge could correspond to taking the mean and standard deviation over time at each latitude, longitude grid point (first schematic below). Another example is an empirical orthogonal function (EOF) analysis, which needs to be performed on the entire 3D dataset as it extracts key modes of variability in both the time and spatial dimensions (second schematic below).\n",
"\n",
"<br/><br/>\n",
"\n",
"<img src=\"../../images/basicdask_schematic2.png\" alt=\"sch2\" width=\"500\"/>\n",
"\n",
"<img src=\"../../images/basicdask_schematic3.png\" alt=\"sch3\" width=\"500\"/>\n",
"\n",
"This notebook covers basic examples of both cases 1 and 2 using Dask. In two subsequent notebooks, more complex examples of each are demo'd. However, in both cases the underlying ideas covered in this notebook will be the foundation of the workflows.\n",
"This notebook covers basic examples of both cases using Dask. In subsequent notebooks, more complex examples of each are covered. However, in both cases the underlying ideas covered in this notebook will be the foundation of the workflows.\n",
"\n",
"\n",
"### Requirements, prerequisite knowledge, learning outcomes\n",
"#### Requirements to run this notebook\n",
"\n",
"* Run this notebook in an EC2 instance in us-west-2. **It is recommended to have a minimum of an m6i.4xlarge EC2 type for this demo, in order to start local clusters with the number of processors and memory per worker we used.**\n",
"* Have an Earthdata Login account.\n",
"\n",
@@ -57,7 +59,15 @@
"#### Learning outcomes\n",
"This notebook demonstrates two methods to parallelize analyses which access NASA Earthdata directly in the cloud. The first method is used to compute the time series of global mean sea surface temperature using the Multiscale Ultrahigh Resolution (MUR) Global Foundation Sea Surface Temperature data set (a gridded SST product), https://doi.org/10.5067/GHGMR-4FJ04. The second method uses the same data set to compute a 2D map of the temporal mean of SST at each grid point. \n",
"\n",
"In both cases the analyses are parallelized on a local cluster (e.g. using the computing power of only the specific EC2 instance spun up). This notebook does not cover using multi-node, distributed clusters (e.g. combining computing power of multiple VMs at once). \n"
"In both cases the analyses are parallelized on a local cluster (e.g. using the computing power of only the specific EC2 instance spun up). This notebook does not cover using multi-node, distributed clusters (e.g. combining computing power of multiple VMs at once)."
]
},
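As a minimal illustration of case 1 (function replication), the pattern can be sketched with only the Python standard library; in the notebook, Dask plays this role with real files and a cluster of workers, and the "file" contents below are invented for the sketch:

```python
# Stdlib-only sketch of the "function replication" pattern: the same
# function is applied independently to each "file". Dask replaces the
# thread pool in the notebook, with real files and true parallelism.
from concurrent.futures import ThreadPoolExecutor

def global_mean(file_values):
    # Stand-in for F(x1, x2) applied to one file's data.
    return sum(file_values) / len(file_values)

fake_files = [[i, i + 1, i + 2] for i in range(8)]  # one list per "file"
with ThreadPoolExecutor(max_workers=4) as pool:
    means = list(pool.map(global_mean, fake_files))
# means[0] is the mean of [0, 1, 2], i.e. 1.0
```

Because each task touches only its own "file", the tasks can run in any order on any worker, which is exactly what makes this case easy to parallelize.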
{
"cell_type": "markdown",
"id": "05ece8e1-5a0e-4eaa-943f-013d27a6c0ad",
"metadata": {},
"source": [
"## Import packages"
]
},
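After the imports, the analyses below run on a local Dask cluster. A hedged sketch of that step follows; the worker count and thread settings here are illustrative, not the settings used in the notebook:

```python
# Minimal sketch of starting a local Dask cluster and attaching a client.
# processes=False keeps the workers in-process (threads), which is the
# simplest configuration; the notebook's settings differ.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)
n_workers = len(client.scheduler_info()["workers"])
client.close()
cluster.close()
```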
{
@@ -808,7 +818,7 @@
"id": "874e0285-dc71-40df-9082-54d3e33ba813",
"metadata": {},
"source": [
"# 2. Parallelization use case 1: computation which is easily iterable over multiple files\n",
"# 2. Function replication\n",
"\n",
"The example computation used here is computing the global mean at each time stamp. Since each MUR file corresponds to one time stamp, the task is straightforward: replicate the computation once per file. \n",
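A hedged sketch of this pattern with `dask.delayed`, using synthetic NumPy arrays as stand-ins for the MUR files (the array shapes and values are invented for illustration):

```python
# "Function replication" with dask.delayed: one lazy task per "file",
# all tasks executed in parallel by dask.compute.
import dask
import numpy as np

@dask.delayed
def file_global_mean(sst_2d):
    # Stand-in for: open one file, compute the global mean SST.
    return float(np.mean(sst_2d))

fake_files = [np.full((4, 5), fill_value=t, dtype=float) for t in range(6)]
tasks = [file_global_mean(arr) for arr in fake_files]  # lazy: nothing runs yet
timeseries = dask.compute(*tasks)                      # one task per "file"
print(timeseries)  # (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
```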
"\n",
@@ -1099,7 +1109,7 @@
"id": "d19cd491-48e2-405f-939c-3b4c7a095768",
"metadata": {},
"source": [
"# 3. Parallelization use case 2: computation which is not easily iterable over multiple files\n",
"# 3. Large dataset chunking\n",
"\n",
"The example computation used here is the temporal mean of SST calculated at each grid point. Since each file corresponds to a timestamp, we need data across all files to compute the mean at each grid point. Therefore, the previous parallelization method would not work well. "
]
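A minimal sketch of this pattern with a Dask-backed Xarray object: a synthetic `(time, lat, lon)` array is chunked along time, and the reduction then runs across all chunks (the array here is invented for illustration):

```python
# "Large dataset chunking": chunk the array with Dask, express the
# reduction lazily, then trigger the parallel computation.
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(24, dtype=float).reshape(4, 2, 3),
    dims=("time", "lat", "lon"),
)
chunked = data.chunk({"time": 2})  # dask-backed chunks along time
tmean = chunked.mean("time")       # lazy: needs data from every chunk
result = tmean.compute()           # runs the chunked computation
```

Note that, unlike the per-file case, the mean at each grid point needs values from every chunk, so Dask coordinates the partial results across workers.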
@@ -2925,7 +2935,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
"version": "3.12.3"
}
},
"nbformat": 4,
27 changes: 15 additions & 12 deletions notebooks/Advanced_cloud/coiled_cluster_01.ipynb
@@ -20,32 +20,35 @@
"<img src=\"../../images/basicdask_schematic3.png\" alt=\"sch1\" width=\"500\"/>\n",
"<img src=\"../../images/basicdask_schematic2.png\" alt=\"sch1\" width=\"500\"/>\n",
"\n",
"In a previous notebook, a toy example was used to demonstrate this basic functionality using a local dask cluster and Xarray built-in functions to work on the data set in chunks. In this notebook, we expand that workflow to a more complex analysis, representing something closer to a real-world use-case. In this notebook, we parallelize computations using the third party software/package `Coiled`. In short, `Coiled` will allow us to spin up AWS virtual machines (EC2 instances) and create a distributed cluster out of them, all with a few lines of Python from within this notebook. *You will need a Coiled account, but once set up, you can run this notebook entirely from your laptop while the parallel computation portion will be run on the distributed cluster in AWS.* \n",
"In a previous notebook, a toy example was used to demonstrate this basic functionality using a local dask cluster and Xarray built-in functions to work on the data set in chunks. In this notebook, that workflow is expanded to a more complex analysis. Parallel computations are performed via the third party software/package `Coiled`. In short, `Coiled` allows us to spin up AWS virtual machines (EC2 instances) and create a distributed cluster out of them, all with a few lines of Python from within a notebook. *You will need a Coiled account, but once set up, you can run this notebook entirely from your laptop while the parallel computation portion will be run on the distributed cluster in AWS.* \n",
"\n",
"\n",
"#### Analysis: Mean Seasonal Cycle of SST Anomalies\n",
"\n",
"The analysis will generate the mean seasonal cycle of sea surface temperature (SST) at each gridpoint in a region off the west coast of the U.S.A. \n",
"The analysis uses a PO.DAAC hosted gridded global SST data sets:\n",
"The analysis uses a PO.DAAC hosted gridded global SST data set:\n",
"\n",
"* GHRSST Level 4 MUR Global Foundation SST Analysis, V4.1: 0.01° x 0.01° resolution, global map, daily files, https://doi.org/10.5067/GHGMR-4FJ04\n",
"\n",
"The analysis will use files over the first decade of the time record. The following procedure is used to generate seasonal cycles:\n",
"\n",
"<img src=\"../../images/schematic_sst-cycle.png\" alt=\"sch_sst-ssh-corr\" width=\"800\"/>\n",
"\n",
"\n",
"In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about \\\\$0.20. In Section 3, we perform the computation on all the files in the first decade, ~4000 files, ~10 TB of uncompressed data, for about \\\\$3.\n",
"In Section 1 of this notebook, the first decade of MUR files are located on PO.DAAC using the `earthaccess` package, then a file is inspected and memory requirements for this data set are assessed. In Section 2, a \"medium-sized\" computation is performed, deriving the mean seasonal cycle for the files thinned out to once per week (570 files, 1.3 TB of uncompressed data) for about $\\$$ 0.20. In Section 3, we perform the computation on all the files in the first decade, ~4000 files, ~10 TB of uncompressed data, for about $\\$$ 3.\n",
"\n",
"\n",
"## Requirements, prerequisite knowledge, learning outcomes\n",
"\n",
"#### Requirements to run this notebook\n",
"\n",
"* **Earthdata login account:** An Earthdata Login account is required to access data from the NASA Earthdata system. Please visit https://urs.earthdata.nasa.gov to register and manage your Earthdata Login account. \n",
"* **Coiled account:** Create a coiled account (free to sign up), and connect it to an AWS account. For more information on Coiled, setting up an account, and connecting it to an AWS account, see their website https://www.coiled.io. \n",
"* **Compute environment:** This notebook can be run either in the cloud (AWS instance running in us-west-2), or on a local compute environment (e.g. laptop, server), but the data loading step currently works substantially faster in the cloud. In both cases, the parallel computations are still sent to VMs in the cloud.\n",
"\n",
"\n",
"#### Prerequisite knowledge\n",
"\n",
"* The [notebook on Dask basics](https://podaac.github.io/tutorials/notebooks/Advanced_cloud/basic_dask.html) and all prerequisites therein.\n",
"\n",
"#### Learning outcomes\n",
@@ -655,7 +658,7 @@
}
],
"source": [
"fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file objects from the endpoints which are compatible with Xarray\n",
"fileobj_test = earthaccess.open([datainfo[0]])[0] # Generate file-like objects compatible with Xarray\n",
"sst_test = xr.open_dataset(fileobj_test)['analysed_sst']\n",
"sst_test"
]
@@ -1616,9 +1619,9 @@
"sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n",
"\n",
"## Remove linear warming trend:\n",
"p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n",
"fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n",
"sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n",
"p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n",
"fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n",
"sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n",
"\n",
"## Mean seasonal cycle:\n",
"seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")"
@@ -1806,7 +1809,7 @@
}
],
"source": [
"fileobjs = earthaccess.open(datainfo) # Generate file objects from the endpoints which are compatible with Xarray"
"fileobjs = earthaccess.open(datainfo) # Generate file-like objects compatible with Xarray"
]
},
{
Expand Down Expand Up @@ -2564,15 +2567,15 @@
"## ----------------\n",
"## Set up analysis\n",
"## ----------------\n",
"## (Since we're dealing with dask arrays, these function calls don't do the computations yet, just set them up)\n",
"## (Since these are dask arrays, function calls don't do the computations yet, just set them up)\n",
"\n",
"## Subset to region off U.S.A. west coast:\n",
"sst_regional = sst.sel(lat=slice(*lat_region), lon=slice(*lon_region))\n",
"\n",
"## Remove linear warming trend:\n",
"p = sst_regional.polyfit(dim='time', deg=1) # Degree 1 polynomial fit coefficients over time for each lat, lon.\n",
"fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Compute linear trend time series at each lat, lon.\n",
"sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim only.\n",
"p = sst_regional.polyfit(dim='time', deg=1) # Deg 1 poly fit coefficients at each grid point.\n",
"fit = xr.polyval(sst_regional['time'], p.polyfit_coefficients) # Linear fit time series at each point.\n",
"sst_detrend = (sst_regional - fit) # xarray is smart enough to subtract along the time dim.\n",
"\n",
"## Mean seasonal cycle:\n",
"seasonal_cycle = sst_detrend.groupby(\"time.month\").mean(\"time\")"