FESOM data compression

This repo contains code examples and a description of ways to compress FESOM data. For now we consider only netCDF compression, but leverage the new capabilities that became available in netCDF 4.9.0.

TL;DR

  • In netCDF 4.9.0, support for new compression filters (fast and efficient) was added.
  • To use them you have to install the latest netCDF and a couple of other libraries, since HPC systems (levante, for example) do not provide them in their modules. The good news is that this is easy to do with conda.
  • To work with compressed data in python (e.g. xarray) you also have to upgrade netCDF and install one or two libraries, but it should work smoothly.
  • The result is up to 50% data reduction and improved speed of parallel data processing, so it is probably worth the effort.

Installing necessary libs

You should have your own installation of conda/mamba. It might work with the conda provided by the HPC system, but I have never tried it, so no guarantee.

conda env create -f requirements-compress.yaml

This will create the compression environment for you, which you should activate with:

conda activate compression
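
A quick way to check that the environment indeed provides netCDF 4.9.0 or newer is to ask nc-config, which should be installed together with the netCDF library:

nc-config --version
# should print netCDF 4.9.0 or newer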

Now we have to set up the path to the HDF5 plugins installed through the hdf5plugin library. In my case the path looks like this:

export HDF5_PLUGIN_PATH='/work/ab0995/a270088/mambaforge/envs/cc/lib/python3.10/site-packages/hdf5plugin/plugins/'

A simple way to find out where your hdf5plugin library is installed is to run:

python -c "import hdf5plugin; print(hdf5plugin.__path__)"

Simple way

Now you hopefully have everything set up. We are going to use plain nccopy to perform the compression. To compress using zstd:

nccopy -4 -F '*,32015,3' input.nc output.nc 

The -4 means the output file will be in netCDF-4 format, and the cryptic numbers after -F are your compression settings (filter id and parameter). The filter ids are listed in the HDF Group's Information on Registered Filter Plugins page. The * at the beginning means we compress all variables, so if you want to compress only one of them (e.g. ignoring coordinate variables), you can use 'salt,32015,3'.
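
For example, to apply the zstd filter only to the salt variable:

nccopy -4 -F 'salt,32015,3' input.nc output.nc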

So far we have worked with:

  • *,32015,3 - zstd: better compression, a bit slower
  • *,32004,0 - lz4: slightly worse compression, but faster

There is a lot of work left to explore which options could be better, but for now these two seem like a good starting point.
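
If you want to experiment, the third number in the zstd filter specification should be the compression level (3 in the examples above), so a slower but stronger setting would look something like this (untested here, just to illustrate the knob):

# higher zstd level: better compression ratio, slower compression
nccopy -4 -F '*,32015,9' input.nc output.nc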

On 12 years of CORE2 data, the results are:

| Uncompressed | lz4 | zstd |
| --- | --- | --- |
| 63.6 Gb | 40.87 Gb | 36.67 Gb |

We decided to go with zstd for now, while lz4 is something to explore as well.

For 13 years of data on the 3M D3 mesh:

| Uncompressed | zstd | ratio |
| --- | --- | --- |
| 2139 Gb | 1327 Gb | 0.62 |

An additional thing you can do is provide chunking options, like this:

nccopy -4 -c 'salt:5,5,10000'  -F '*,32004,0' input.nc output.nc 

This should improve your post-processing, but we still have to test it.
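
To check which filters and chunk sizes actually ended up in the output file, you can look at the hidden attributes with ncdump:

# -h prints only the header, -s adds special attributes such as _Filter, _ChunkSizes and _Shuffle
ncdump -hs output.nc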

Simpler way

You can use the compress_data_parallel.sh script from this repo. It simply converts all *.nc files from one folder to another. The usage is:

./compress_data_parallel.sh  /INPUT/FOLDER /OUTPUT/FOLDER/ zstd 1

The compression options are zstd and lz4; the last argument is the number of parallel processes, which are handled by GNU parallel.

Running it in parallel seems to work pretty well - we can compress 13 years of D3 in just 7 minutes with 50 processes. But PLEASE use compute nodes to do this; people on login nodes will thank you. On levante you can allocate an interactive session like this:

salloc --partition=compute --nodes=1 --time=03:00:00 --account abXXXX

Of course using batch mode is even better.
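
A minimal batch-script sketch (the account, the folders and the number of processes are placeholders you will want to adapt):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --time=03:00:00
#SBATCH --account=abXXXX

# activate the conda environment (adapt to how conda is set up in your shell)
source activate compression
# point HDF5 to the compression plugins shipped with hdf5plugin
export HDF5_PLUGIN_PATH="$(python -c "import hdf5plugin, os; print(os.path.join(hdf5plugin.__path__[0], 'plugins'))")"

# compress with zstd using 50 parallel processes
./compress_data_parallel.sh /INPUT/FOLDER /OUTPUT/FOLDER/ zstd 50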

Although the script does some simple checks, making sure that everything is fine before deleting your original data is on you! :)

Advanced way

We are working on a more advanced solution that will give you more control over how you compress your files, but it is still very much work in progress.

Reading the files

With command line utilities things seem to work fine as long as you have HDF5_PLUGIN_PATH pointing to the right place.

xarray

To make it work in xarray, you should add an import of hdf5plugin to your scripts:

import xarray as xr
import hdf5plugin  # the import itself makes the compression filters available

data = xr.open_mfdataset('./salt.fesom.*.nc')
data.salt[0,0,0].values
>> array(33.12693892)

dask

With dask you need one more trick to make it work when creating the cluster:

import xarray as xr
import hdf5plugin
from dask.distributed import Client
import dask

client = Client(n_workers=40, threads_per_worker=1, memory_limit='5GB')
client.run(hdf5plugin.register)  # register the compression filters on every worker

A test of a dask-powered TKE computation (tke = 0.5*(uu+vv), 40 workers with 5Gb each) over 13 years of monthly D3 data shows that it takes 4 minutes with uncompressed data and only 3 min 20 sec with compressed (zstd) data. So there is no degradation in compute time, but rather an improvement.
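
For reference, the computation in that test is essentially of the following shape (a minimal sketch: the file patterns and the time mean are assumptions, only tke = 0.5*(uu+vv) and the cluster setup come from above):

import xarray as xr
import hdf5plugin
from dask.distributed import Client

client = Client(n_workers=40, threads_per_worker=1, memory_limit='5GB')
client.run(hdf5plugin.register)  # register the compression filters on every worker

# hypothetical file patterns for the monthly uu and vv output
uu = xr.open_mfdataset('./uu.fesom.*.nc').uu
vv = xr.open_mfdataset('./vv.fesom.*.nc').vv

tke = 0.5 * (uu + vv)
tke_mean = tke.mean(dim='time').compute()  # reading and computing happen here, in parallel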

Authors

Nikolay Koldunov, with great help of Suvarchal Cheedela and Fabian Wachsmann.
