I was trying to run tape-embed, but received the following error message (everything went fine when I ran it with the --no_cuda flag):
(protein) wbogud@cuda:~/projects/protein$ time tape-embed transformer ../data/test.fasta embeddings.npz models/tape/bert-base/
20/04/22 16:12:11 - INFO - tape.training - device: cuda n_gpu: 4
20/04/22 16:12:11 - INFO - tape.models.modeling_utils - loading configuration file models/tape/bert-base/config.json
20/04/22 16:12:11 - INFO - tape.models.modeling_utils - Model config {
"attention_probs_dropout_prob": 0.1,
"base_model": "transformer",
"finetuning_task": null,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"input_size": 768,
"intermediate_size": 3072,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 8192,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"num_labels": -1,
"output_attentions": false,
"output_hidden_states": false,
"output_size": 768,
"pruned_heads": {},
"torchscript": false,
"type_vocab_size": 1,
"vocab_size": 30
}
20/04/22 16:12:11 - INFO - tape.models.modeling_utils - loading weights file models/tape/bert-base/pytorch_model.bin
0%| | 0/1 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/wbogud/anaconda3/envs/protein/bin/tape-embed", line 8, in <module>
sys.exit(run_embed())
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/main.py", line 234, in run_embed
training.run_embed(**embed_args)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 642, in run_embed
outputs = runner.forward(batch, no_loss=True)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/training.py", line 86, in forward
outputs = self.model(**batch)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 1 on device 1.
Original Traceback (most recent call last):
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/home/wbogud/anaconda3/envs/protein/lib/python3.8/site-packages/tape/models/modeling_bert.py", line 443, in forward
dtype=next(self.parameters()).dtype) # fp16 compatibility
StopIteration
Downgrading to PyTorch 1.4.0 solved the issue.
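Concretely, pinning the older release in the same conda environment was enough (a plain pip pin, nothing TAPE-specific):

pip install torch==1.4.0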
Could the error be related to the known PyTorch 1.5.0 issue described in the release notes at https://github.com/pytorch/pytorch/releases/tag/v1.5.0 (torch.nn.parallel.DistributedDataParallel does not work in Single-Process Multi-GPU mode)?
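For what it's worth, here is a minimal sketch of what I think is going on (it assumes torch 1.5.0 with at least two visible GPUs; the DtypeProbe module is just an illustrative stand-in, not TAPE code). Under 1.5, DataParallel replicas no longer expose their weights through .parameters(), so the next(self.parameters()) call that modeling_bert.py uses for fp16 compatibility hits an empty iterator and raises StopIteration:

import torch
import torch.nn as nn

class DtypeProbe(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 8)

    def forward(self, x):
        # Same pattern as tape/models/modeling_bert.py line 443: look up the
        # parameter dtype for fp16 compatibility. On a torch 1.5 DataParallel
        # replica, self.parameters() is an empty iterator, so next() raises
        # StopIteration -- the error reported above.
        dtype = next(self.parameters()).dtype
        return self.linear(x.to(dtype))

model = nn.DataParallel(DtypeProbe().cuda())
out = model(torch.randn(4, 8).cuda())  # StopIteration in a replica under torch 1.5.0

Restricting the run to a single GPU (for example CUDA_VISIBLE_DEVICES=0) should also sidestep it, assuming the runner only wraps the model in DataParallel when it sees more than one device.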
That seems plausible. We've moved to PyTorch Lightning in an internal version of this code, which sidesteps some of the version issues. We are looking into cleaning that up and making it public.