"""
=================================================
Cache on disk intermediate data processing states
=================================================
This example shows how intermediate data processing
states can be cached on disk to speed up the loading
of this data in subsequent calls.
When a MOABB paradigm processes a dataset, it will
first apply processing steps to the raw data, this is
called the ``raw_pipeline``. Then, it will convert the
raw data into epochs and apply processing steps on the
epochs, this is called the ``epochs_pipeline``.
Finally, it will eventually convert the epochs into arrays,
this is called the ``array_pipeline``. In summary:
``raw_pipeline`` --> ``epochs_pipeline`` --> ``array_pipeline``
After each step, MOABB offers the possibility to save on disk
the result of the step. This is done by setting the ``cache_config``
parameter of the paradigm's ``get_data`` method.
The ``cache_config`` parameter is a dictionary that can take all
the parameters of ``moabb.datasets.base.CacheConfig`` as keys,
they are the following: ``use``, ``save_raw``, ``save_epochs``,
``save_array``, ``overwrite_raw``, ``overwrite_epochs``,
``overwrite_array``, and ``path``. You can also directly pass a
``CacheConfig`` object as ``cache_config``.
If ``use=False``, the ``save_*`` and ``overwrite_*``
parameters are ignored.
When trying to use the cache (i.e. ``use=True``), MOABB will
first check if there exist a cache of the result of the full
pipeline (i.e. ``raw_pipeline`` --> ``epochs_pipeline`` ->
``array_pipeline``).
If there is none, we remove the last step of the pipeline and
look for its cached result. We keep removing steps and looking
for a cached result until we find one or until we reach an
empty pipeline.
Every time, if the ``overwrite_*`` parameter
of the corresponding step is true, we first try to erase the
cache of this step.
Once a cache has been found or the empty pipeline has been reached,
depending on the case we either load the cache or the original dataset.
Then, apply the missing steps one by one and save their result
if their corresponding ``save_*`` parameter is true.
By default, only the result of the ``raw_pipeline`` is saved.
This is usually a good compromise between speed and disk space
because, when using cached raw data, the epochs can be obtained
without preloading the whole raw signals, only the necessary
intervals. Yet, because only the raw data is cached, the epoching
parameters can be changed without creating a new cache each time.
However, if your epoching parameters are fixed, you can directly
cache the epochs or the arrays to speed up the loading and
reduce the disk space used.
.. note::
The ``cache_config`` parameter is also available for the ``get_data``
method of the datasets. It works the same way as for a
paradigm except that it will save un-processed raw recordings.
"""
# Authors: Pierre Guetschel <[email protected]>
#
# License: BSD (3-clause)
import shutil
import tempfile
import time
from pathlib import Path

from moabb import set_log_level
from moabb.datasets import Zhou2016
from moabb.paradigms import LeftRightImagery


set_log_level("info")
###############################################################################
# Basic usage
# -----------
#
# The ``cache_config`` parameter is a dictionary that has the
# following default values:
default_cache_config = dict(
save_raw=False,
save_epochs=False,
save_array=False,
use=False,
overwrite_raw=False,
overwrite_epochs=False,
overwrite_array=False,
path=None,
)
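###############################################################################
# You can also pass a ``CacheConfig`` object directly instead of a dict.
# A minimal sketch; the import path follows the reference given in the
# introduction (``moabb.datasets.base.CacheConfig``):
from moabb.datasets.base import CacheConfig

# Same settings as ``default_cache_config``; every field that is not
# specified keeps its default value:
equivalent_cache_config = CacheConfig(use=False, path=None)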
###############################################################################
# You do not need to specify all the keys of ``cache_config``, only the ones
# you want to change.
#
# By default (i.e. when ``path=None``), the cache is saved in the MNE data
# directory, which can be found with ``mne.get_config('MNE_DATA')``. For this
# example, we will save it in a temporary directory instead:
temp_dir = Path(tempfile.mkdtemp())
###############################################################################
# We will use the Zhou2016 dataset and the LeftRightImagery paradigm in this
# example, but this works for any dataset and paradigm pair:
dataset = Zhou2016()
paradigm = LeftRightImagery()
###############################################################################
# And we will only use the first subject for this example:
subjects = [1]
###############################################################################
# Then, saving a cache can simply be done by setting the desired parameters
# in the ``cache_config`` dictionary:
cache_config = dict(
use=True,
save_raw=True,
save_epochs=True,
save_array=True,
path=temp_dir,
)
_ = paradigm.get_data(dataset, subjects, cache_config=cache_config)
###############################################################################
# Time comparison
# ---------------
#
# Now, we will compare the time it takes to load the data with different
# levels of cache. For this, we will use the cache saved in the previous
# block and overwrite the step results one by one, so that we can compare
# the time it takes to load the data and compute the missing steps with an
# increasing number of missing steps.
#
# Using array cache:
cache_config = dict(
use=True,
path=temp_dir,
save_raw=False,
save_epochs=False,
save_array=False,
overwrite_raw=False,
overwrite_epochs=False,
overwrite_array=False,
)
t0 = time.time()
_ = paradigm.get_data(dataset, subjects, cache_config=cache_config)
t_array = time.time() - t0
###############################################################################
# Using epochs cache:
cache_config = dict(
use=True,
path=temp_dir,
save_raw=False,
save_epochs=False,
save_array=False,
overwrite_raw=False,
overwrite_epochs=False,
overwrite_array=True,
)
t0 = time.time()
_ = paradigm.get_data(dataset, subjects, cache_config=cache_config)
t_epochs = time.time() - t0
###############################################################################
# Using raw cache:
cache_config = dict(
use=True,
path=temp_dir,
save_raw=False,
save_epochs=False,
save_array=False,
overwrite_raw=False,
overwrite_epochs=True,
overwrite_array=True,
)
t0 = time.time()
_ = paradigm.get_data(dataset, subjects, cache_config=cache_config)
t_raw = time.time() - t0
###############################################################################
# Using no cache:
cache_config = dict(
use=False,
path=temp_dir,
save_raw=False,
save_epochs=False,
save_array=False,
overwrite_raw=False,
overwrite_epochs=False,
overwrite_array=False,
)
t0 = time.time()
_ = paradigm.get_data(dataset, subjects, cache_config=cache_config)
t_nocache = time.time() - t0
###############################################################################
# Time needed to load the data with different levels of cache:
print(f"Using array cache: {t_array:.2f} seconds")
print(f"Using epochs cache: {t_epochs:.2f} seconds")
print(f"Using raw cache: {t_raw:.2f} seconds")
print(f"Without cache: {t_nocache:.2f} seconds")
###############################################################################
# As you can see, using a raw cache is more than 5 times faster than
# loading without a cache.
# This is because, when using the raw cache, the data is not fully preloaded;
# only the desired epochs are loaded into memory.
#
# Using the epochs cache is a little faster than the raw cache. This is
# because several preprocessing steps are applied after the epoching by the
# ``epochs_pipeline``. This difference would be greater if the ``resample``
# argument were different from the sampling frequency of the dataset. Indeed,
# the data loading time is directly proportional to the sampling frequency,
# and the resampling is done by the ``epochs_pipeline``.
#
# Finally, we observe very little difference between the array and epochs
# caches. The main interest of the array cache is when the user passes a
# computationally heavy but fixed additional preprocessing step (for example,
# computing the covariance matrices of the epochs). This can be done by using
# the ``postprocess_pipeline`` argument. The output of this additional
# pipeline (necessarily a numpy array) will be saved to avoid re-computing it
# each time.
#
#
# Technical details
# -----------------
#
# Under the hood, the cache is saved on disk in a Brain Imaging Data Structure
# (BIDS) compliant format. More details on this structure can be found in the
# tutorial :doc:`./plot_bids_conversion`.
#
# However, there are two particular aspects of the way MOABB saves the data
# that are not specific to BIDS:
#
# * For each file, we set a
# `description key <https://bids-specification.readthedocs.io/en/stable/appendices/entities.html#desc>`_.
# This key is a code that corresponds to a hash of the
# pipeline that was used to generate the data (i.e. from raw to the state
#   of the cache). This code is unique for each pipeline and makes it
#   possible to identify all the files that were generated by the same
#   pipeline.
# * Once we finish saving all the files for a given combination of dataset,
# subject, and pipeline, we write a file ending in ``"_lockfile.json"`` at
# the root directory of this subject. This file serves two purposes:
#
# * It indicates that the cache is complete for this subject and pipeline.
# If it is not present, it means that something went wrong during the
# saving process and the cache is incomplete.
# * The file contains the un-hashed string representation of the pipeline.
# Therefore, it can be used to identify the pipeline used without having
# to decode the description key.
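#
# The description key can be sketched as follows. This is only an
# illustration of the idea; the hash function and code length MOABB actually
# uses may differ:
import hashlib


def description_key(pipeline_repr):
    # Hash the pipeline's string representation into a short, stable code:
    return hashlib.md5(pipeline_repr.encode()).hexdigest()[:12]


print(description_key("raw_pipeline->epochs_pipeline"))
###############################################################################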
#
# Cleanup
# -------
#
# Finally, we can delete the temporary folder:
shutil.rmtree(temp_dir)