val_loss nan #13461

Open
1 task done
lqh964165950 opened this issue Dec 14, 2024 · 2 comments
Labels
detect (Object Detection issues, PR's); question (Further information is requested)

Comments

@lqh964165950

Search before asking

Question

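I modified YOLOv5 by adding a feature enhancement module between the head and the neck, but ran into the problem shown below: the validation loss is nan for a stretch of training. Why does this happen?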
val_nan

Additional

No response

@lqh964165950 lqh964165950 added the question Further information is requested label Dec 14, 2024
@UltralyticsAssistant UltralyticsAssistant added the detect Object Detection issues, PR's label Dec 14, 2024
@UltralyticsAssistant
Member

👋 Hello @lqh964165950, thank you for your interest in YOLOv5 🚀! It sounds like you've made some interesting custom modifications to YOLOv5 by adding a feature enhancement module. Let's work together to troubleshoot this validation loss issue.

If this is a 🐛 Bug Report, we kindly request a minimum reproducible example to help us debug the problem. This includes:

  1. A clear explanation of the changes you made to the YOLOv5 model, especially the feature enhancement module you added.
  2. The exact steps and commands used to train and validate the model.
  3. Logs and outputs from your experiments, including any warnings or errors.
  4. Details of your dataset, including structure and image counts (if applicable).

If this is a custom training ❓ Question, please provide as much detailed information as possible. Be sure to include screenshots or examples of your dataset, training logs, and loss plots. Additionally, check that you're following best practices for training, such as carefully tuning learning rates, verifying dataset quality, and using appropriate augmentation techniques.

Requirements

Ensure you are using Python>=3.8.0 with all necessary packages installed, including PyTorch>=1.8. To set up the environment:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 supports multiple verified environments for running models, including notebooks with free GPU access, Google Cloud, Amazon AMI, and Docker. Please ensure your environment dependencies like CUDA, cuDNN, Python, and PyTorch are up to date, as out-of-date setups often cause instability.

Status

If all the tests in the YOLOv5 Continuous Integration (CI) workflow are passing, this indicates the base code is functioning correctly, and modifications are likely contributing to the issue. You can verify the training, validation, inference, export, and benchmarking features on various operating systems like macOS, Windows, and Ubuntu.

🔍 This is an automated response to help provide initial guidance. An Ultralytics engineer will take a look at your issue and assist you further as soon as possible.

@pderrenger
Member

@lqh964165950 the issue of validation loss becoming nan often indicates instability in the training process. Since you've modified the YOLOv5 architecture by adding a feature enhancement module between the neck and head, the problem could stem from the following:

  1. Gradient Instabilities: Ensure that your modifications do not introduce exploding gradients. You can monitor gradients through debugging or by enabling gradient clipping.
  2. Loss Computation: Validate that the outputs from your feature enhancement module are compatible with the loss function expectations.
  3. Learning Rate: Experiment with lowering the learning rate, as architectural changes can affect training stability.
  4. Data Issues: Ensure your dataset is properly formatted and does not contain corrupted or inconsistent labels.
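As a rough illustration of point 1, the sketch below measures the total gradient norm, clips it, and skips the optimizer update when the gradients are already non-finite. It is a generic PyTorch pattern, not the code in YOLOv5's training script; `clipped_step`, the `max_norm=10.0` value, and the tiny `nn.Linear` stand-in model are all assumptions made for the example.

```python
import math

import torch
import torch.nn as nn


def clipped_step(model, optimizer, loss, max_norm=10.0):
    """Backward pass with gradient-norm clipping (hypothetical helper).

    Skips the parameter update if the gradients contain NaN or Inf.
    Returns the total gradient norm measured before clipping.
    """
    optimizer.zero_grad()
    loss.backward()
    # clip_grad_norm_ rescales gradients in place and returns the pre-clip total norm
    total_norm = float(torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm))
    if math.isfinite(total_norm):
        optimizer.step()
    return total_norm


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Linear(4, 1)  # stand-in for the modified network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(8, 4), torch.randn(8, 1)
    # Output is inflated on purpose so the gradients are large
    loss = nn.functional.mse_loss(model(x) * 1e4, y)
    print("gradient norm before clipping:", clipped_step(model, optimizer, loss))
```

Logging the returned norm every few iterations shows whether it grows steadily before the nan values appear, which would point to exploding gradients rather than a data problem.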

For debugging, consider starting with a smaller dataset and enabling verbose logging. Additionally, verify whether this issue persists with the latest YOLOv5 version. If the nan issue continues, inspect your custom module and its impact on the network's forward and backward passes.
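One way to inspect the forward pass is to attach forward hooks and report which modules emit NaN or Inf. The sketch below assumes a plain PyTorch model; `find_nonfinite_modules` and the `UnsafeEnhance` block are hypothetical names invented for the example and are not part of YOLOv5 or of the module described in the question.

```python
from collections import OrderedDict

import torch
import torch.nn as nn


def find_nonfinite_modules(model, example_input):
    """Run one forward pass and return the names of modules whose tensor
    output contains NaN or Inf, in the order the modules finished executing."""
    bad, handles = [], []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only plain tensor outputs are checked; tuples and lists are skipped
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module itself
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(example_input)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return bad


class UnsafeEnhance(nn.Module):
    """Hypothetical enhancement block: log of a non-positive value is NaN or -Inf."""

    def forward(self, x):
        return x * torch.log(x)


if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(OrderedDict([
        ("neck", nn.Conv2d(3, 8, 3, padding=1)),
        ("enhance", UnsafeEnhance()),
        ("head", nn.Conv2d(8, 8, 1)),
    ]))
    print(find_nonfinite_modules(model, torch.randn(1, 3, 16, 16)))
```

The first name in the returned list is the earliest module to produce a non-finite output; later entries are usually just propagating it. Operations such as `log`, `sqrt`, division, and `exp` inside a custom block are common places to look.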

For more details on YOLOv5 loss computation, refer to this documentation.
