PoC for reading cuts in background thread in dynamic bucketing #680

Open
wants to merge 6 commits into master

Conversation

pzelasko
Collaborator

@pzelasko pzelasko commented Apr 19, 2022

@danpovey it may address the issue described in #678, but I haven't tested it beyond running the unit tests successfully. I added a background thread for collect_cuts_in_buckets. Threading should be sufficient, as I expect the main process CPU to be mostly idle during forward passes on the GPU. This implementation should be stable, but I don't think it covers every possible multithreading hazard. I might turn this into a full-blown thread-safe queue if you can confirm it helps with the training speed (or ends up in a deadlock when run in real training...)

EDIT: I'm not even 100% sure that the mutex is needed at all...
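For context, a minimal sketch of the idea (illustrative only; the class and method names below are made up and the real DynamicBucketingSampler internals differ): a single-worker ThreadPoolExecutor fills the per-duration bucket deques in the background while the main thread keeps training, with a lock guarding the shared deques.

from collections import deque
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class _BackgroundBucketFiller:
    # Illustrative only: fill bucket deques from a single background thread.
    def __init__(self, cuts_iter, num_buckets):
        self.cuts_iter = cuts_iter
        self.buckets = [deque() for _ in range(num_buckets)]
        self.lock = Lock()  # guards concurrent access to the bucket deques
        self.pool = ThreadPoolExecutor(max_workers=1)  # mirrors ThreadPoolExecutor(1) in the diff below

    def collect_async(self, n_cuts):
        # Schedule bucket filling without blocking the caller.
        return self.pool.submit(self._collect, n_cuts)

    def _collect(self, n_cuts):
        for _ in range(n_cuts):
            cut = next(self.cuts_iter, None)
            if cut is None:
                break  # source CutSet exhausted
            with self.lock:
                self.buckets[self._find_bucket(cut)].append(cut)

    def _find_bucket(self, cut):
        # Placeholder: the real sampler picks the bucket from the cut's duration.
        return 0

(In CPython, single append/popleft calls on a deque are atomic, which is presumably why the mutex may not strictly be needed; a lock still matters for compound check-then-pop sequences.)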

@csukuangfj
Contributor

Thanks! I will test it and post the training time with and without this PR.

@csukuangfj
Contributor

It throws the following exception:

2022-04-19 16:48:07,508 INFO [train.py:1069] (2/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-19 16:48:07,611 INFO [asr_datamodule.py:266] (6/8) About to create dev dataset
2022-04-19 16:48:07,623 INFO [asr_datamodule.py:285] (6/8) About to create dev dataloader
2022-04-19 16:48:07,624 INFO [train.py:1069] (6/8) Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-04-19 16:48:24,925 INFO [train.py:1010] (0/8) Loading grad scaler state dict
2022-04-19 16:48:24,928 INFO [train.py:1010] (3/8) Loading grad scaler state dict
2022-04-19 16:48:24,929 INFO [train.py:1010] (6/8) Loading grad scaler state dict
2022-04-19 16:48:24,930 INFO [train.py:1010] (7/8) Loading grad scaler state dict
2022-04-19 16:48:24,930 INFO [train.py:1010] (2/8) Loading grad scaler state dict
2022-04-19 16:48:24,932 INFO [train.py:1010] (4/8) Loading grad scaler state dict
2022-04-19 16:48:24,936 INFO [train.py:1010] (5/8) Loading grad scaler state dict
2022-04-19 16:48:24,977 INFO [train.py:1010] (1/8) Loading grad scaler state dict
Traceback (most recent call last):
  File "./pruned_transducer_stateless3/train.py", line 1123, in <module>
    main()
  File "./pruned_transducer_stateless3/train.py", line 1114, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 7 terminated with the following error:
Traceback (most recent call last):
  File "/ceph-fj/fangjun/software/py38/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/ceph-fj/fangjun/open-source-2/icefall-multi-2/egs/librispeech/ASR/pruned_transducer_stateless3/train.py", line 1023, in run
    train_one_epoch(
  File "/ceph-fj/fangjun/open-source-2/icefall-multi-2/egs/librispeech/ASR/pruned_transducer_stateless3/train.py", line 840, in train_one_epoch
    loss_value = tot_loss["loss"] / tot_loss["frames"]
ZeroDivisionError: division by zero

@@ -284,6 +286,9 @@ def __init__(
deque() for _ in range(len(duration_bins) + 1)
]

self._cut_reading_thread = ThreadPoolExecutor(1)
Contributor

Any reason not to use a process pool? Due to the global interpreter lock, only one thread can execute Python code at any given time, I think.

Collaborator Author

Yes, with some setups that use IterableDatasetWrapper you are placing the sampler in a dataloader worker process, and AFAIK you can't spawn a nested process pool there because that process is daemonic.

Anyway, a thread should be sufficient here, as I expect the CPU to be mostly idle when running forward and backward passes on GPUs... The reason it didn't work for you is likely that the thread could not populate the buckets fast enough and the sampler thought they were depleted (a race condition). This can be solved with a proper synchronization mechanism, but unfortunately I don't have the time to add it right now. I'll return to it sometime.
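Such a synchronization mechanism could look roughly like this (a sketch only, not code from this PR; the _BucketState name is made up): the consumer blocks on a condition variable until the background thread has either added cuts or signaled that the source is exhausted, so the buckets never look depleted just because the filler is lagging behind.

from threading import Condition

class _BucketState:
    # Sketch of a producer/consumer handshake that avoids the race described above.
    def __init__(self):
        self.cond = Condition()
        self.num_available = 0
        self.exhausted = False  # set once the source CutSet runs out

    # Producer side (background filling thread).
    def added(self, n=1):
        with self.cond:
            self.num_available += n
            self.cond.notify_all()

    def finished(self):
        with self.cond:
            self.exhausted = True
            self.cond.notify_all()

    # Consumer side (the sampler's __iter__).
    def wait_for_cut(self):
        # Returns True once a cut is available, False only when truly depleted.
        with self.cond:
            while self.num_available == 0 and not self.exhausted:
                self.cond.wait()
            if self.num_available > 0:
                self.num_available -= 1
                return True
            return False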

@danpovey
Collaborator

So do we know how the num-frames could be zero?
I don't know how the forward() could have succeeded if the tensors were empty.

@pzelasko
Collaborator Author

I suppose the sampler yielded an empty CutSet and the dataset somehow didn't crash and collated an empty tensor. It can be fixed with proper synchronization between threads, but to check quickly whether it works, it might be enough to put time.sleep(5) after this line of code, so that the buckets have time to be populated at the start of __iter__ before they are consumed:

self._collect_cuts_in_buckets(self.buffer_size)
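i.e., a quick-and-dirty check (a sketch, not a proper fix) would add the sleep right after that call inside __iter__:

import time

self._collect_cuts_in_buckets(self.buffer_size)  # kick off the initial bucket fill
time.sleep(5)  # crude head start for the background thread; a stand-in for real synchronization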

@pzelasko
Collaborator Author

... when I have more time again, I'll take care of it and test it end-to-end.

@pzelasko
Collaborator Author

@danpovey @csukuangfj please check if it is faster now (I checked that it does synchronize correctly with the latest changes). In quick local testing I could not see any difference, but maybe you will notice some in your setup.

@csukuangfj
Contributor

@danpovey @csukuangfj please check if it is faster now (I checked that it does synchronize correctly with the latest changes). In quick local testing I could not see any difference, but maybe you will notice some in your setup.

I will test it when we have free GPUs.

@SongLi89

SongLi89 commented Feb 6, 2024

Hi, has this PR been merged? I may have a similar problem: reading the cuts is not very fast.

@pzelasko
Collaborator Author

pzelasko commented Feb 9, 2024

No, it hasn’t been merged — I didn’t find any difference with this implementation in quick testing. Can you describe your environment a bit more? What’s your sampler, max_duration, num_workers, data size, are you reading audio or features, etc. Also I recommend running py-spy on your script (or dataloading worker processes) to understand where the time is being spent.
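For reference, typical py-spy invocations against a running training or dataloader worker process look like this (assuming py-spy is installed and <pid> is the process of interest):

pip install py-spy
py-spy dump --pid <pid>                    # one-off stack snapshot
py-spy top --pid <pid>                     # live, top-like view of where time is spent
py-spy record -o profile.svg --pid <pid>   # flame graph recorded over a time window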
