diff --git a/docs/usage/chunking.ipynb b/docs/usage/chunking.ipynb index 09c53ef..5df0da1 100644 --- a/docs/usage/chunking.ipynb +++ b/docs/usage/chunking.ipynb @@ -13,11 +13,26 @@ " If you don’t know what a dask array is, check out the dask array documentation [here](https://docs.dask.org/en/stable/array.html)\n", "```\n", "\n", - "The default dask chunking used when loading datasets from an Intake-ESM datastore of NetCDF files is a single chunk per file (this is the same as the `xarray.open_mfdataset` default). In many cases, this may produce poor chunk sizes for the subsequent analysis.\n", + "Intake-ESM uses `xarray.open_dataset` to open datasets. By default, the argument `chunks={}` is used, which gives the following chunking:\n", + "- xarray < v2023.09.0: a single chunk per file\n", + "- xarray >= v2023.09.0: the \"preferred\" chunking for the file type being opened, which for NetCDF4 is the chunking on disk.\n", + "\n", + "In some cases, these defaults may produce poor chunk sizes for the subsequent analysis.\n", "\n", "In this tutorial, we briefly demonstrate how one might go about choosing and setting the dask chunk size when opening some data using the ACCESS-NRI catalog. This tutorial may be relevant to users wanting to open and work with some of the particularly large data products in the catalog, where careful consideration of chunking can make the difference between an analysis running or crashing. You can download the Jupyter notebook rendered below from [here](https://github.com/ACCESS-NRI/access-nri-intake-catalog/blob/main/docs/usage/chunking.ipynb) and run it yourself in an NCI ARE instance." ] }, + { + "cell_type": "code", + "execution_count": 1, + "id": "9722bf0e-bb48-4328-bc40-f2485c7a9e27", + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.filterwarnings(\"ignore\") # Suppress warnings for these docs" + ] + }, { "cell_type": "markdown", "id": "537ef34d-82d0-42fa-ba01-fa8be8b66ae5", @@ -27,7 +42,7 @@ "\n", "Before diving into the tutorial, it's worth touching on a few important considerations when choosing dask chunk sizes:\n", "\n", - "- Dask will distribute work (i.e. your analysis) on different chunks to different workers. If chunks are too small, the overhead burden of coordinating all the chunks starts to become significant. If chunk sizes are too large, your workers are likely to run out of memory. For typical dask cluster configurations, a good rule of thumb is to aim for chunk sizes between 100MB and 1GB.\n", + "- Dask will distribute work (i.e. your analysis) on different chunks to different workers. If chunks are too small, the overhead burden of coordinating all the chunks starts to become significant. If chunk sizes are too large, your workers are likely to run out of memory. For typical dask cluster configurations, a good rule-of-thumb is to aim for chunk sizes between 100 MB and 1 GB.\n", "\n", "- It's important to consider the analysis you're wanting to do when choosing chunk sizes. In general, you want to avoid dask having to pass data between workers. Think about how you can chunk your data so that each worker can do your analysis more-or-less independently of other workers. For example, if you're only doing operations in time, it may make sense to chunk your data along the spatial dimensions.\n", "\n", @@ -43,12 +58,12 @@ "source": [ "## Open some difficult data with default chunking\n", "\n", - "In this example, we'll be using the ACCESS-OM2 `01deg_jra55v13_ryf9091` product. 
This was chosen because it includes data that is particularly ill suited to the default chunking strategy. We can load the ACCESS-NRI catalog and open the Intake-ESM datastore for this product directly." + "In this example, we'll be using the ACCESS-OM2 `01deg_jra55v13_ryf9091` product. This was chosen because it includes data that is particularly ill-suited to the default chunking strategy used by Intake-ESM. We can load the ACCESS-NRI catalog and open the Intake-ESM datastore for this product directly." ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 2, "id": "40a7fee3-feec-450d-ba4f-1274f0e35aa5", "metadata": { "tags": [] @@ -62,7 +77,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 3, "id": "b0ab565b-51f5-411d-a102-74480abd6453", "metadata": { "tags": [] @@ -115,7 +130,7 @@ " \n", " \n", " end_date\n", - " 3360\n", + " 3361\n", " \n", " \n", " variable_long_name\n", @@ -123,11 +138,15 @@ " \n", " \n", " variable_standard_name\n", - " 35\n", + " 36\n", " \n", " \n", " variable_cell_methods\n", - " 2\n", + " 3\n", + " \n", + " \n", + " variable_units\n", + " 50\n", " \n", " \n", " filename\n", @@ -171,7 +190,7 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 4, "id": "8845b022-b6cf-4d33-b80c-b8119e33b4e1", "metadata": { "tags": [] @@ -183,7 +202,7 @@ "'/proxy/8787/status'" ] }, - "execution_count": 3, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -206,7 +225,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 5, "id": "16ed873e-d7e9-42e0-b946-9c7aa221fcc4", "metadata": { "tags": [] @@ -225,7 +244,17 @@ "id": "6eaebd7d-74e6-4c5e-8ddc-e9b14153b48d", "metadata": {}, "source": [ - "As mentioned above, the default dask chunking when opening data from our datastore is to have one chunk per file, which in this case results in chunks that are 250 GB each (see the \"Chunk\" column in the summary below)!" + "As mentioned above, the default dask chunking when opening data from our datastore with xarray < v2023.09.0 is to have one chunk per file, which in this case results in chunks that are 250 GB each (see the \"Chunk\" column in the DataArray summary below)!" ] }, + { + "cell_type": "markdown", + "id": "2bd099ac-cbe7-43e5-8391-594464fd876e", + "metadata": {}, + "source": [ + "```{note}\n", + " If you ran this notebook with xarray >= v2023.09.0, you will see chunk sizes of 3.2 MB. This is because the behaviour of the default chunking argument used by Intake-ESM changed with xarray v2023.09.0. 3.2 MB is definitely better than 250 GB, but it's still quite far from our rule-of-thumb of 100 MB - 1 GB, so what follows is still relevant.\n", + "```" ] }, { @@ -870,7 +899,17 @@ "source": [ "## Choosing better dask chunk sizes\n", "\n", - "To choose out chunk sizes, we want to take into account the chunk layout of the file we're reading. In this case, as for all products currently in the ACCESS-NRI catalog, the files are NetCDF4 format. We can see how they are chunked by looking at the output of `ncdump -hs ` run on one of the files that make up the dataset, or by looking at the `encoding` attribute of the variable we're interested in our xarray dataset" + "To choose our chunk sizes, we want to take into account the chunk layout of the files we're reading. In this case, as for the vast majority of the products currently in the ACCESS-NRI catalog, the files are NetCDF4 format. 
We can see how they are chunked by looking at the output of `ncdump -hs <file>` run on one of the files that make up the dataset, or by looking at the `encoding` attribute of the variable of interest in our xarray dataset." ] }, + { + "cell_type": "markdown", + "id": "feadcc85-6912-47e7-a8c8-af5410c627eb", + "metadata": {}, + "source": [ + "```{note}\n", + "If you ran this notebook with xarray >= v2023.09.0, the default chunking returned by Intake-ESM is the file chunking, so there's no need to use `ncdump` or look at the `encoding` attribute: you can just read it from the \"Chunk\" column in the DataArray summary above.\n", + "```" ] }, { @@ -917,7 +956,7 @@ "source": [ "You can see that the NetCDF4 chunk sizes are 1, 7, 300 and 400 for the \"time\", \"st_ocean\", \"yt_ocean\" and \"xt_ocean\" dimensions, respectively.\n", "\n", - "Choosing the optimal dask chunk sizes for a given analysis can sometimes be a bit of an iterative process. As a first try we'll use a chunk layout that provides ~300 MB chunk sizes, doesn't divide the chunking on disk, and chunks along the spatial dimensions, since our \"analysis\" here is to take the standard deviation in time. We can specify the chunking we want when we open the dataset from our Intake-ESM datastore using the `xarray_open_kwargs` argument. This can be passed to either `to_dataset_dict` or `to_dask` depending on how many datasets you are trying to open." + "Choosing the optimal dask chunk sizes for a given analysis can sometimes be a bit of an iterative process. As a first try we'll use a chunk layout that provides ~300 MB chunk sizes, doesn't divide the chunking on disk, and chunks along the spatial dimensions, since our \"analysis\" here is to take the standard deviation in time. We can specify the chunking we want when we open the dataset from our Intake-ESM datastore using the `xarray_open_kwargs` argument. This can be passed to either `to_dataset_dict` or `to_dask` depending on how many datasets you are trying to open. For more details on specifying chunk sizes, see the xarray documentation on [Parallel computing with Dask](https://xarray.pydata.org/en/v2023.09.0/user-guide/dask.html)." ] }, { @@ -929,7 +968,7 @@ }, "outputs": [], "source": [ - "xarray_open_kwargs = {\"chunks\": {\"st_ocean\": 7, \"xt_ocean\": 400, \"yt_ocean\": 300}}\n", + "xarray_open_kwargs = {\"chunks\": {\"time\": -1, \"st_ocean\": 7, \"xt_ocean\": 400, \"yt_ocean\": 300}}\n", "\n", "ds = esm_datastore_filtered.to_dask(xarray_open_kwargs=xarray_open_kwargs)" ] }, { @@ -1606,7 +1645,7 @@ "id": "53f4b043-419b-41e5-bc21-ab90cbc2bfd3", "metadata": {}, "source": [ - "With this chunking, memory is now managed effectively my our dask workers and our computation now finishes in a timely manner." + "With this chunking, memory is now managed effectively by our dask workers and our computation now finishes in a timely manner." ] }, { @@ -1667,9 +1706,9 @@ ], "metadata": { "kernelspec": { - "display_name": "Python (access-nri-intake-test)", + "display_name": "Python 3 (ipykernel)", "language": "python", - "name": "access-nri-intake-test" + "name": "python3" }, "language_info": { "codemirror_mode": { @@ -1681,7 +1720,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0" + "version": "3.10.12" } }, "nbformat": 4,
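Below is a minimal sketch of the chunk checking described in the notebook cells above. It assumes `ds` is the dataset returned by the `to_dask` call and uses `temp` as a placeholder variable name (substitute any variable listed in the datastore summary); `encoding["chunksizes"]` and the dask array `chunksize` attribute are standard xarray/dask properties, so nothing here is specific to this product.

```python
import numpy as np

# Placeholder variable name -- substitute a variable from the datastore summary
var = "temp"
da = ds[var]  # "ds" is assumed to come from esm_datastore_filtered.to_dask(...)

# On-disk NetCDF4 chunk layout recorded by xarray when the file was opened
print("NetCDF4 chunk sizes on disk:", da.encoding.get("chunksizes"))

# Current dask chunk shape and the approximate size of one chunk in MB
chunk_shape = da.data.chunksize  # shape of the largest dask chunk
chunk_mb = np.prod(chunk_shape) * da.dtype.itemsize / 1e6
print(f"dask chunk shape: {chunk_shape}, ~{chunk_mb:.0f} MB per chunk")
```

If the reported chunk size falls well outside the 100 MB to 1 GB rule-of-thumb, adjust the `chunks` entry passed via `xarray_open_kwargs` and reopen the dataset before running the analysis.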