Training deeplabv3_resnet50 help #342

Open
corey-dawson opened this issue Jan 24, 2024 · 0 comments
corey-dawson commented Jan 24, 2024

Hello,
I'm not sure whether this is the right place for a training question about one of these models, but I'll give it a try. I am starting from the pre-trained deeplabv3_resnet50 vision segmentation model and fine-tuning it for my application. Unfortunately, no matter how large a GPU I use, I always get a "CUDA out of memory" error. My latest attempt on AWS used a 24GB GPU instance. Are there any suggestions for training a vision segmentation model? Thanks in advance.

Vars:

Dataloader batch size: 5
epochs: 5
classes: 1
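To see how much of the card is actually in use right before the failing `.to(device)` call, the standard `torch.cuda` counters can be printed. This is a generic helper of my own (the function name and layout are not from the original script), shown as a sketch:

```python
import torch

def report_cuda_memory():
    """Return (allocated_mib, reserved_mib); prints them when CUDA is present."""
    if not torch.cuda.is_available():
        return 0.0, 0.0  # CPU-only machine: nothing allocated on a GPU
    allocated = torch.cuda.memory_allocated() / 2**20  # bytes -> MiB
    reserved = torch.cuda.memory_reserved() / 2**20
    print(f"allocated: {allocated:.0f} MiB, reserved: {reserved:.0f} MiB")
    return allocated, reserved

alloc_mib, reserved_mib = report_cuda_memory()
```

Calling this once per iteration shows whether usage is flat (the model simply doesn't fit) or growing step by step (something is retained across iterations).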

Error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 860.00 MiB (GPU 0; 21.99 GiB total capacity; 21.38 GiB already allocated; 21.42 GiB reserved in total by PyTorch)
  import torch
  import segmentation_models_pytorch as smp
  from torchvision.models.segmentation import deeplabv3_resnet50, DeepLabV3_ResNet50_Weights

  model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT)
  num_classes = 2  # object + background
  # replace the final 1x1 conv heads so both classifiers predict num_classes channels
  model.classifier[4] = torch.nn.Conv2d(256, num_classes, kernel_size=(1, 1), stride=(1, 1))
  model.aux_classifier[4] = torch.nn.Conv2d(256, num_classes, kernel_size=(1, 1), stride=(1, 1))

  epochs = 5
  device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
  model = model.to(device)
  model.train()
  losses = []
  criterion = smp.losses.DiceLoss(smp.losses.BINARY_MODE, from_logits=True)  # note: BINARY_MODE expects a 1-channel output, but the heads above emit 2 channels
  optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

  for i in range(epochs):
      for batch_idx, (images, masks) in enumerate(trainloader):
          images = images.to(device) # error is thrown when images and masks attempt to load to GPU
          masks = masks.to(device) 
          outputs = model(images)["out"]
          loss = criterion(outputs, masks)
          losses.append(loss.item())  # .item() detaches the scalar; appending the loss tensor keeps every step's autograd graph alive on the GPU
      
          # Backward and optimize
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
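Two changes that usually cut memory in a loop like the one above are mixed precision (`torch.autocast` plus `GradScaler`) and storing `loss.item()` instead of the loss tensor. A minimal sketch of the same loop with both applied, using a tiny stand-in model and random data (assumptions for the sketch; the real code uses deeplabv3_resnet50 and a custom `trainloader`):

```python
import torch
import torch.nn as nn

# Tiny stand-in model and dummy data -- assumptions for the sketch,
# not the real DeepLabV3 pipeline.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.Conv2d(8, 2, 1))
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()  # stand-in for smp.losses.DiceLoss
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

images = torch.randn(2, 3, 64, 64, device=device)        # dummy batch
masks = torch.randint(0, 2, (2, 64, 64), device=device)  # dummy labels
losses = []
for step in range(2):
    optimizer.zero_grad(set_to_none=True)  # set_to_none frees grad tensors between steps
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        outputs = model(images)            # half-precision activations under autocast on GPU
        loss = criterion(outputs, masks)
    scaler.scale(loss).backward()          # scaled backward avoids fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    losses.append(loss.item())             # .item() so no graph is retained across steps
```

Autocast roughly halves activation memory on the GPU; combined with a smaller batch size (e.g. 2 instead of 5) it often makes the difference on a 24GB card.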

Note: running this as an AWS training job on a g5.2xlarge instance. Container stats are below:
[screenshot: container stats]
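Independent of instance size, gradient accumulation keeps the effective batch size at 5 while moving far fewer images to the GPU per step. A sketch of the pattern with a placeholder model and random data (assumptions; only the accumulation structure is the point):

```python
import torch
import torch.nn as nn

# Placeholder model, loss, and data -- assumptions for the sketch.
model = nn.Conv2d(3, 2, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
accum_steps = 5  # 5 micro-batches of size 1 ~= one optimizer step at batch size 5

optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    images = torch.randn(1, 3, 32, 32)                    # micro-batch of size 1
    masks = torch.randint(0, 2, (1, 32, 32))
    loss = criterion(model(images), masks) / accum_steps  # average over micro-batches
    loss.backward()                                       # gradients accumulate in .grad
grad_norm = model.weight.grad.norm().item()               # gradient for the whole "batch"
optimizer.step()                                          # single parameter update
optimizer.zero_grad(set_to_none=True)
```

Only one micro-batch of activations lives on the GPU at a time, so peak memory scales with the micro-batch size rather than the effective batch size.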
