[BUG] EOF error in pickle when reading arrow file #155

Open
2 tasks done
AvisP opened this issue Jul 23, 2024 · 9 comments · May be fixed by #156
Labels
bug: Something isn't working
windows: Concerns running code on Windows

Comments

@AvisP

AvisP commented Jul 23, 2024

Bug report checklist

  • I provided code that demonstrates a minimal reproducible example.
  • I confirmed the bug exists on the latest mainline of Chronos via source install.

Describe the bug
An error occurs when executing the training script on a dataset generated using the process mentioned here. The data files used can be downloaded from here. The issue is similar to #149. The error message is shown below:

D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Using SEED: 3565056063
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Logging dir: output\run-7
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Loading and filtering 2 datasets for training: ['D://Chronos-Finetune//noise-data.arrow', 'D://Chronos-Finetune//kernelsynth-data.arrow']
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Mixing probabilities: [0.9, 0.1]
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Initializing model
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Using random initialization
max_steps is given, it will override any value given in num_train_epochs
2024-07-22 23:38:16,324 - D:\Chronos-Finetune\train.py - INFO - Training
  0%|                                                                                                                                                                                                                    | 0/200000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "D:\Chronos-Finetune\train.py", line 692, in <module>
    app()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 326, in __call__
    raise e
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 661, in main
    return _main(
           ^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 193, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 692, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer_config\decorators.py", line 92, in wrapped
    return cmd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\train.py", line 679, in main
    trainer.train()
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 2230, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\accelerate\data_loader.py", line 671, in __iter__
    main_iterator = super().__iter__()
                    ^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
    return self._get_iterator()
           ^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
    w.start()
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "<stringsource>", line 2, in pyarrow.lib._RecordBatchFileReader.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
  0%|                                                                                                                                                                                                                    | 0/200000 [00:00<?, ?it/s]

(Chronos_venv) D:\Chronos-Finetune>D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
  warnings.warn(
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
    self = reduction.pickle.load(from_parent)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input

Expected behavior
Training/fine-tuning should proceed smoothly.

To reproduce

  1. Download the data files from here
  2. Change the path location in config script for the data files
  3. Run the training script using python train.py --config chronos-t5-small.yaml

Environment description
Operating system: Windows 11
CUDA version: 12.4
NVCC version: cuda_12.3.r12.3/compiler.33567101_0
PyTorch version: 2.3.1+cu121
HuggingFace transformers version: 4.42.4
HuggingFace accelerate version: 0.32.1

@AvisP AvisP added the bug Something isn't working label Jul 23, 2024
@AvisP AvisP changed the title [BUG] EOF error in pickle when reading arrow file [BUG, WINDOWS] EOF error in pickle when reading arrow file Jul 23, 2024
@AvisP AvisP changed the title [BUG, WINDOWS] EOF error in pickle when reading arrow file [BUG][WINDOWS] EOF error in pickle when reading arrow file Jul 23, 2024
@AvisP AvisP changed the title [BUG][WINDOWS] EOF error in pickle when reading arrow file [BUG] EOF error in pickle when reading arrow file Jul 23, 2024
@lostella
Contributor

This seems relevant, see also the first answer here.

TL;DR: we probably need to add freeze_support() after if __name__ == "__main__": in the training script.
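
For reference, a minimal sketch of that pattern, assuming the script's entry point is the typer app() call visible in the traceback (the structure of train.py shown here is an assumption, not the actual code):

# Minimal sketch, not the actual train.py: on Windows the "spawn" start method
# re-imports the main module, so anything that launches processes must sit
# behind the __main__ guard.
from multiprocessing import freeze_support

import typer

app = typer.Typer()

@app.command()
def main():
    ...  # training logic that eventually creates DataLoader worker processes

if __name__ == "__main__":
    freeze_support()  # needed for frozen Windows executables; harmless otherwise
    app()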

@lostella lostella added the windows Concerns running code on Windows label Jul 23, 2024
@lostella lostella linked a pull request Jul 23, 2024 that will close this issue
3 tasks
@lostella
Contributor

@AvisP could you check if the fix proposed in #156 makes it work for you?

@AvisP
Author

AvisP commented Jul 23, 2024

@lostella I added freeze_support() after this line but it is not working. In the example from the Python website that you shared, there is a call to Process, which I don't see happening in the training code; maybe freeze_support() needs to be inserted before that?

I tried to get WSL on Windows and make it work from there, but unfortunately it is not working properly.

@abdulfatir
Contributor

@AvisP can you share the exact config.yaml that you're using?

@AvisP
Author

AvisP commented Jul 23, 2024

Sure, here it is. I also tried with two datasets, setting the probabilities to 0.9 and 0.1.

training_data_paths:
# - "D://Chronos-Finetune//noise-data.arrow"
- "D://Chronos-Finetune//kernelsynth-data.arrow"
probability:
- 1.0
# - 0.1
context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 32
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true

@abdulfatir
Contributor

The config looks okay to me. Could you make the following changes and try again?

  • Set torch_compile: false.
  • Set dataloader_num_workers: 0.

Let's use just the one KernelSynth dataset, as you have.

@AvisP
Author

AvisP commented Jul 24, 2024

It is running now after making these two changes. Does setting dataloader_num_workers to 0 cause any slowdown of the data loading process? I will try out the evaluation script next. Thanks for your time!

@abdulfatir
Contributor

@AvisP This looks like a Windows multiprocessing issue. Setting dataloader_num_workers=0 may lead to some loss in training speed.
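
As a rough, illustrative sketch (not the actual Chronos code): with num_workers=0 the DataLoader produces batches in the main process, so nothing has to be pickled, but loading no longer overlaps with the training step; with num_workers>0, Windows (and macOS) start each worker with spawn and pickle the dataset into it, and a dataset holding a pyarrow RecordBatchFileReader cannot be pickled, which is what the traceback above shows. DummyDataset below is a hypothetical stand-in for the Chronos training dataset.

# Illustrative only; DummyDataset stands in for the dataset built from the .arrow files.
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

if __name__ == "__main__":
    # num_workers=0: batches are built in the main process, no pickling involved.
    loader = DataLoader(DummyDataset(), batch_size=4, num_workers=0)
    print(list(loader))
    # num_workers>0: each worker is a spawned process that receives a pickled
    # copy of the dataset; un-picklable members (e.g. an open pyarrow reader)
    # raise errors like the TypeError/EOFError reported in this issue.
    # loader = DataLoader(DummyDataset(), batch_size=4, num_workers=1)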

@RemiKalbe

I'm having this issue on macOS; setting dataloader_num_workers=0 does "fix" it.

The only difference is that it crashes at:

TypeError: no default __reduce__ due to non-trivial __cinit__
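
That would be consistent with macOS also defaulting to the "spawn" start method since Python 3.8, so the same pickling constraint on DataLoader workers applies there. A quick way to check (a sketch, not part of the training code):

import multiprocessing as mp

if __name__ == "__main__":
    # Prints "spawn" on Windows and on macOS (Python 3.8+), typically "fork" on Linux.
    print(mp.get_start_method())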
