[BUG] EOF error in pickle when reading arrow file #155

Open
2 tasks done
AvisP opened this issue Jul 23, 2024 · 9 comments · May be fixed by #156
Labels
bug: Something isn't working
windows: Concerns running code on Windows

Comments

@AvisP

AvisP commented Jul 23, 2024

Bug report checklist

  • I provided code that demonstrates a minimal reproducible example.
  • I confirmed the bug exists on the latest mainline of Chronos via source install.

Describe the bug
An error occurs when executing the training script on a dataset generated using the process mentioned here. The data files used can be downloaded from here. The issue is similar to #149. The error message is shown below:

D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Using SEED: 3565056063
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Logging dir: output\run-7
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Loading and filtering 2 datasets for training: ['D://Chronos-Finetune//noise-data.arrow', 'D://Chronos-Finetune//kernelsynth-data.arrow']
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Mixing probabilities: [0.9, 0.1]
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Initializing model
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Using random initialization
max_steps is given, it will override any value given in num_train_epochs
2024-07-22 23:38:16,324 - D:\Chronos-Finetune\train.py - INFO - Training
  0%|                                                                                                                                                                                                                    | 0/200000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:\Chronos-Finetune\train.py", line 692, in <module>
    app()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 326, in __call__
    raise e
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 661, in main
    return _main(
           ^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 193, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 692, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer_config\decorators.py", line 92, in wrapped
    return cmd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\train.py", line 679, in main
    trainer.train()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 2230, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\accelerate\data_loader.py", line 671, in __iter__
    main_iterator = super().__iter__()
                    ^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
    w.start()
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "<stringsource>", line 2, in pyarrow.lib._RecordBatchFileReader.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
  0%|                                                                                                                                                                                                                    | 0/200000 [00:00<?, ?it/s]

(Chronos_venv) D:\Chronos-Finetune>D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input

Expected behavior
Training/fine-tuning should proceed smoothly.

To reproduce

  1. Download the data files from here
  2. Change the path location in config script for the data files
  3. Run the training script using python train.py --config chronos-t5-small.yaml

Environment description
Operating system: Windows 11
CUDA version: 12.4
NVCC version: cuda_12.3.r12.3/compiler.33567101_0
PyTorch version: 2.3.1+cu121
HuggingFace transformers version: 4.42.4
HuggingFace accelerate version: 0.32.1

@AvisP AvisP added the bug Something isn't working label Jul 23, 2024
@AvisP AvisP changed the title [BUG] EOF error in pickle when reading arrow file [BUG, WINDOWS] EOF error in pickle when reading arrow file Jul 23, 2024
@AvisP AvisP changed the title [BUG, WINDOWS] EOF error in pickle when reading arrow file [BUG][WINDOWS] EOF error in pickle when reading arrow file Jul 23, 2024
@AvisP AvisP changed the title [BUG][WINDOWS] EOF error in pickle when reading arrow file [BUG] EOF error in pickle when reading arrow file Jul 23, 2024
@lostella
Contributor

This seems relevant, see also the first answer here.

TL;DR: we probably need to add freeze_support() after if __name__ == "__main__": in the training script.
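
For reference, a minimal sketch of that pattern, assuming the script's entry point is the typer app() call visible in the traceback (the structure of train.py shown here is an assumption, not the actual code):

# Minimal sketch, not the actual train.py: on Windows the "spawn" start method
# re-imports the main module, so anything that launches processes must sit
# behind the __main__ guard.
from multiprocessing import freeze_support

import typer

app = typer.Typer()

@app.command()
def main():
    ...  # training logic that eventually creates DataLoader worker processes

if __name__ == "__main__":
    freeze_support()  # needed for frozen Windows executables; harmless otherwise
    app()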

@lostella lostella added the windows Concerns running code on Windows label Jul 23, 2024
@lostella lostella linked a pull request Jul 23, 2024 that will close this issue
3 tasks
@lostella
Contributor

@AvisP could you check if the fix proposed in #156 makes it work for you?

@AvisP
Author

AvisP commented Jul 23, 2024

@lostella I added freeze_support() after this line but it is not working. In the example from the Python website that you shared, there is a call to Process, which I don't see happening in the training code; maybe freeze_support() needs to be inserted before that?

I tried to get WSL on Windows and make it work from there, but unfortunately it is not working properly.

@abdulfatir
Contributor

@AvisP can you share the exact config.yaml that you're using?

@AvisP
Author

AvisP commented Jul 23, 2024

Sure, here it is. I also tried with two datasets, setting the probabilities to 0.9 and 0.1.

training_data_paths:
# - "D://Chronos-Finetune//noise-data.arrow"
- "D://Chronos-Finetune//kernelsynth-data.arrow"
probability:
- 1.0
# - 0.1
context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 32
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true

@abdulfatir
Contributor

The config looks okay to me. Could you make the following changes and try again?

  • Set torch_compile: false.
  • Set dataloader_num_workers: 0.

Let's use just the one KernelSynth dataset, as you have.

@AvisP
Author

AvisP commented Jul 24, 2024

It is running now after making these two changes. Does setting dataloader_num_workers to 0 cause any slowdown of the data loading process? I will try out the evaluation script next. Thanks for your time!

@abdulfatir
Contributor

@AvisP This looks like a Windows multiprocessing issue. Setting dataloader_num_workers=0 may lead to some loss in training speed.
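
As a rough, illustrative sketch (not the actual Chronos code): with num_workers=0 the DataLoader produces batches in the main process, so nothing has to be pickled, but loading no longer overlaps with the training step; with num_workers>0, Windows (and macOS) start each worker with spawn and pickle the dataset into it, and a dataset holding a pyarrow RecordBatchFileReader cannot be pickled, which is what the traceback above shows. DummyDataset below is a hypothetical stand-in for the Chronos training dataset.

# Illustrative only; DummyDataset stands in for the dataset built from the .arrow files.
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

if __name__ == "__main__":
    # num_workers=0: batches are built in the main process, no pickling involved.
    loader = DataLoader(DummyDataset(), batch_size=4, num_workers=0)
    print(list(loader))
    # num_workers>0: each worker is a spawned process that receives a pickled
    # copy of the dataset; un-picklable members (e.g. an open pyarrow reader)
    # raise errors like the TypeError/EOFError reported in this issue.
    # loader = DataLoader(DummyDataset(), batch_size=4, num_workers=1)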

@RemiKalbe

I'm having this issue on macOS; setting dataloader_num_workers=0 does "fix" it.

The only difference is that it crashes at:

TypeError: no default __reduce__ due to non-trivial __cinit__
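
That would be consistent with macOS also defaulting to the "spawn" start method since Python 3.8, so the same pickling constraint on DataLoader workers applies there. A quick way to check (a sketch, not part of the training code):

import multiprocessing as mp

if __name__ == "__main__":
    # Prints "spawn" on Windows and on macOS (Python 3.8+), typically "fork" on Linux.
    print(mp.get_start_method())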
