Lambkin got stuck during a benchmark run #94
Comments
Qualitatively this became much more noticeable after instrumenting measurements with timemory. A previous run without timemory got stuck only once over the four (effective) days it took, whereas so far I've had to restart this one five times and I'm only halfway through the same set of bagfiles. The limited set of logs I inspected seem to have these in common:
I'll try this: ros2/rmw_fastrtps#704
Did it make a difference? Service discovery in FastRTPS (or FastDDS) isn't great, some
All the lambkin runs I did after I created this issue used the proposal in ros2/rmw_fastrtps#704 and I never saw this issue again, so I guess it did.
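For anyone hitting similar stalls, a common way to rule Fast DDS discovery in or out is to re-run the benchmark against a different middleware. This is a hedged sketch, not part of the fix in ros2/rmw_fastrtps#704: it only uses the standard `RMW_IMPLEMENTATION` selector that ROS 2 reads at node startup, and assumes `rmw_cyclonedds_cpp` is installed.

```shell
# Workaround sketch (assumption: the stalls trace back to Fast DDS
# discovery). Force a different RMW implementation for the whole run;
# RMW_IMPLEMENTATION is the standard ROS 2 middleware selector and is
# read by every node launched from this shell.
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

# Confirm the selection before launching the benchmark.
echo "$RMW_IMPLEMENTATION"
```

If the stalls disappear under a different RMW, that points strongly at the DDS discovery layer rather than at lambkin or the benchmarked nodes.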
Bug description
While running a large benchmark testing beluga, lambkin got stuck on a case and never recovered.
How to reproduce
No idea.
Expected behavior
Continue to run until the final case.
Actual behavior
About two days into the run, it stopped making progress. The ROS nodes were up, but nothing relevant was being logged, and the output bagfile was empty.
Additional context
No resources were obviously exhausted on the machine: there was enough disk space, and the (beefy) computer was otherwise idle.
These are the logs of the final few cases/iterations leading up to the stall. I removed the bagfiles due to their size, but all except the last one were of the expected size. The bagfile for the iteration that got stuck was empty, as if nothing had been recorded since the iteration started.
tor_wic_slam_error.tar.gz