How to reproduce
Using a p4d.24xlarge:
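Roughly, this is a standard parallelformers run; a minimal sketch of the kind of setup that triggers this is below (the model name and generation parameters are placeholders, not the exact ones used):

from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# Placeholder model name, not the exact model used.
model_name = "EleutherAI/gpt-neo-2.7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Shard the model across all 8 GPUs of the p4d.24xlarge.
parallelize(model, num_gpus=8, fp16=True)

# Keep generating; the failure shows up after it has been running
# for a while, not immediately.
inputs = tokenizer("Parallelformers is", return_tensors="pt")
for _ in range(1000):
    outputs = model.generate(**inputs, num_beams=5, max_length=40)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])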
The model loads fine and begins performing inference.
nvidia-smi shows all 8 GPUs at 90%+ utilization for a while.
Then eventually one GPU drops to 0% while the others jump to 100%.
The terminal shows:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
    new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
It then seems to hang forever from there.
I realize this stack trace doesn't give enough context to trace the problem back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?
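For what it's worth, every failing frame above is in torch.multiprocessing's file-descriptor based storage sharing (reduce_storage -> DupFd -> os.dup), which PyTorch uses to pass tensors between processes over queues. One thing worth trying (just a guess on my part, not a confirmed fix) is switching the sharing strategy to 'file_system' before the model is loaded and parallelized, since that path doesn't duplicate raw file descriptors at all:

import torch.multiprocessing as mp

# On Linux the default is 'file_descriptor', which is exactly the code path
# that fails above (os.dup on a descriptor that is apparently already closed).
# 'file_system' shares tensor storage via files in shared memory instead,
# so no descriptors get passed between processes.
print(mp.get_sharing_strategy())        # 'file_descriptor' by default on Linux
print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'}

mp.set_sharing_strategy("file_system")  # set this before loading/parallelizing the model

If it still hangs with 'file_system', that would at least narrow things down to something other than the fd-sharing path.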
Environment
OS: Ubuntu 20.04.4 LTS
Python version: 3.8.13
Transformers version: 4.24.0
Docker: No
Misc.: N/A