Lambkin got stuck during a benchmark run #94

Open
glpuga opened this issue Aug 13, 2024 · 5 comments
Labels
bug Something isn't working
needs-fix Bug confirmed, need a fix

Comments

glpuga (Collaborator) commented Aug 13, 2024

Bug description

While running a large benchmark testing beluga, lambkin got stuck during one of the cases and never recovered.

How to reproduce

No idea.

Expected behavior

Continue to run until the final case.

Actual behavior

About two days into the run, it stopped making progress. The ROS nodes were up, but nothing relevant was logged, and the output bagfile was empty.

Additional context

No resources were obviously exhausted: there was enough disk space, and the (beefy) machine was otherwise idle.

These are the logs of the final few cases/iterations leading up to the stall. I removed the bagfiles due to their size, but all except the last one were of the expected size. The bagfile of the iteration that got stuck was empty, as if nothing had been recorded since the iteration started.

tor_wic_slam_error.tar.gz

glpuga added the bug label Aug 13, 2024
glpuga (Collaborator, Author) commented Aug 26, 2024

Qualitatively, this became much more noticeable after instrumenting measurements with timemory. While a previous run without timemory got stuck only once during its four (effective) days of runtime, so far I've had to restart this one five times and I'm only halfway through the same set of bagfiles.

The limited set of logs I observed seems to have these things in common:

  • the logs never start; the amcl nodes never start processing data.
  • one node in the set (in one case nav2_amcl, in another the rosbag recorder) generates a log like "failed to send response to /rosbag2_recorder/list_parameters (timeout)" (probed in the sketch below).
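For what it's worth, that second symptom can be checked from outside the stuck run. Below is a minimal rclpy probe (not part of lambkin; the node name, target service, and timeout are illustrative) that distinguishes "service never discovered" from "service discovered but the node never responds":

```python
import rclpy
from rcl_interfaces.srv import ListParameters


def probe_list_parameters(service_name: str, timeout_sec: float = 10.0) -> None:
    """Check whether a node's list_parameters service is reachable and responsive."""
    rclpy.init()
    node = rclpy.create_node("stall_probe")
    client = node.create_client(ListParameters, service_name)
    try:
        if not client.wait_for_service(timeout_sec=timeout_sec):
            print(f"{service_name}: not discovered within {timeout_sec}s")
            return
        future = client.call_async(ListParameters.Request())
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_sec)
        if future.done():
            print(f"{service_name}: responded")
        else:
            # Matches the "failed to send response ... (timeout)" log above.
            print(f"{service_name}: discovered but never responded")
    finally:
        node.destroy_node()
        rclpy.shutdown()


probe_list_parameters("/rosbag2_recorder/list_parameters")
```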

glpuga (Collaborator, Author) commented Aug 26, 2024

I'll try this: ros2/rmw_fastrtps#704

hidmic added the needs-fix label Oct 24, 2024
hidmic (Collaborator) commented Oct 24, 2024

> I'll try this: ros2/rmw_fastrtps#704

Did it make a difference? Service discovery in FastRTPS (or FastDDS) isn't great: some ros2cli verbs can hang forever, and LAMBKIN doesn't consistently assign timeouts to all the processes it manages.
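For illustration, here is a minimal sketch of the kind of guard that would help, assuming the orchestrator shells out to ros2cli; `run_ros2_verb` and the 30 s budget are hypothetical, not LAMBKIN's actual API:

```python
import subprocess


def run_ros2_verb(args, timeout_s=30.0):
    """Run a ros2cli verb with a hard timeout so a hung
    discovery/service call cannot stall the whole benchmark."""
    try:
        # subprocess.run kills the child if the timeout expires.
        return subprocess.run(
            ["ros2", *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired as error:
        # Surface the hang so the orchestrator can retry or skip the case.
        raise RuntimeError(f"'ros2 {' '.join(args)}' hung for {timeout_s}s") from error


# e.g. the parameter listing that timed out in the logs above:
# run_ros2_verb(["param", "list", "/rosbag2_recorder"])
```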

glpuga (Collaborator, Author) commented Oct 28, 2024

> > I'll try this: ros2/rmw_fastrtps#704
>
> Did it make a difference? Service discovery in FastRTPS (or FastDDS) isn't great: some ros2cli verbs can hang forever, and LAMBKIN doesn't consistently assign timeouts to all the processes it manages.

All the lambkin runs I did after creating this issue used the proposal in ros2/rmw_fastrtps#704, and I never saw this issue again, so I guess it did.
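I haven't verified what ros2/rmw_fastrtps#704 actually proposes, so the following is only a sketch of a commonly used Fast DDS mitigation for hangs of this kind, which may or may not be the same change: disabling the builtin transports (including shared memory) in favor of UDP-only, via a profile file exported through `FASTRTPS_DEFAULT_PROFILES_FILE`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- udp_only.xml: hypothetical filename; activate it with
     export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/udp_only.xml -->
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <transport_descriptors>
    <transport_descriptor>
      <transport_id>udp_transport</transport_id>
      <type>UDPv4</type>
    </transport_descriptor>
  </transport_descriptors>
  <participant profile_name="udp_only_participant" is_default_profile="true">
    <rtps>
      <userTransports>
        <transport_id>udp_transport</transport_id>
      </userTransports>
      <!-- disable the builtin transports, including shared memory -->
      <useBuiltinTransports>false</useBuiltinTransports>
    </rtps>
  </participant>
</profiles>
```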

glpuga (Collaborator, Author) commented Oct 30, 2024

I permanently added this to #93 with fc23f02.
