GPU hang issue #19

jason9693 · 2022-01-09T23:54:23Z

How to reproduce

# model_name: the checkpoint under test (not specified in this report)
from transformers import AutoModelForCausalLM, AutoTokenizer
import parallelformers

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]')
model = AutoModelForCausalLM.from_pretrained(model_name)  # .to(device='cuda', non_blocking=True)
_ = model.eval()
# Shard the model across 4 GPUs in fp16
parallelformers.parallelize(model, num_gpus=4, fp16=True, verbose='detail')

tok = tokenizer("My name is Kevin." * 10, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)

Environment

OS : Ubuntu 18
Python version : 3.7.9
Transformers version :
Whether to use Docker:
Misc.: V100 x 4

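When I run inference repeatedly,
every so often the utilization of one particular GPU climbs to 100% and the process blocks.
Even Ctrl+C cannot stop the process, because it is stuck on a semaphore lock.

Wondering whether this was an error in my own code, I deliberately triggered a bug as shown below. In that case, however, utilization on all GPUs drops to 0% and pressing Ctrl+C
prints the underlying error, so it does not appear to be the same problem.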

# Deliberately oversized input (the sentence repeated 2048 times) to force an error
tok = tokenizer("My name is Kevin." * 2048, return_tensors="pt")
model.generate(
    tok['input_ids'],
    max_length=2048,
    use_cache=True, no_repeat_ngram_size=3, max_time=5.0)

I'm opening this issue in the hope of finding out what causes it.
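
The two symptoms described in this report (one GPU pinned at 100% versus every GPU dropping to 0%) can be told apart by polling `nvidia-smi`. The following is a minimal sketch, not part of the original report: the function names and the `busy`/`idle` thresholds are assumptions chosen for illustration, and a single snapshot only suggests a hang if it persists across several polls.

```python
import subprocess

# Standard nvidia-smi query flags; prints one "index, utilization" line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Parse 'index, util' lines into {gpu_index: util_percent}."""
    util = {}
    for line in csv_text.strip().splitlines():
        index, percent = (field.strip() for field in line.split(","))
        util[int(index)] = int(percent)
    return util


def classify(util, busy=95, idle=5):
    """Label one utilization snapshot (thresholds are hypothetical)."""
    if not util:
        return "unknown"
    busy_gpus = [i for i, u in util.items() if u >= busy]
    idle_gpus = [i for i, u in util.items() if u <= idle]
    if len(idle_gpus) == len(util):
        return "all_idle"        # matches the deliberate-error case
    if busy_gpus and idle_gpus:
        return "suspected_hang"  # some GPUs pinned while others sit idle
    return "running"


def snapshot():
    """Query the local GPUs; requires nvidia-smi on PATH."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_utilization(out)
```

Calling `classify(snapshot())` every few seconds from a separate terminal or thread, and logging the result with a timestamp, would show how often the "suspected_hang" state appears and which GPU index is involved.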

hyunwoongko · 2022-01-11T08:16:51Z

Right... I wonder what the cause could be...
Since it doesn't happen every time, it's not easy to debug.
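
For an intermittent hang like this, one generic option is Python's standard `faulthandler` module, which can dump the stack of every thread when a call runs longer than a timeout. The sketch below is an illustration under assumptions, not something used in this thread: `hang_watchdog` and `timed_calls` are hypothetical helper names, and `fn` stands in for a call such as `lambda: model.generate(...)`. It only shows where the calling process is blocked; if the model runs in separate worker processes, each of those would need its own handler.

```python
import faulthandler
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, log_path):
    """Append all thread tracebacks to log_path if the body exceeds timeout_s."""
    with open(log_path, "a") as log:
        # The dump is written by a watchdog thread; the program keeps running.
        faulthandler.dump_traceback_later(timeout_s, file=log)
        try:
            yield
        finally:
            # Disarm the timer once the body finishes (or raises).
            faulthandler.cancel_dump_traceback_later()


def timed_calls(fn, repeats, timeout_s, log_path):
    """Call fn repeatedly under the watchdog and return per-call durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        with hang_watchdog(timeout_s, log_path):
            fn()
        durations.append(time.perf_counter() - start)
    return durations
```

Because the dump is written from a watchdog thread rather than a signal handler, it may still produce output when Ctrl+C has no effect, and the per-call durations give a record of how many iterations ran before the stall.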

jason9693 · 2022-01-13T04:50:45Z

@hyunwoongko The same behavior also shows up with oslo,
so I'm sharing that for reference.
I'll post here as well if I identify any further causes :)

hyunwoongko · 2022-01-13T08:09:56Z

@jason9693 Does the problem occur when using the OSLO deployment launcher?

jason9693 · 2022-01-13T09:32:18Z

@hyunwoongko Yes, that's right :)

jason9693 added the bug label on Jan 9, 2022
