Lambkin got stuck during a benchmark run #94

Open
glpuga opened this issue Aug 13, 2024 · 5 comments
Labels
bug Something isn't working
needs-fix Bug confirmed, need a fix

Comments

glpuga (Collaborator) commented Aug 13, 2024

Bug description

While running a large benchmark testing beluga, lambkin got stuck during one of the cases and never recovered.

How to reproduce

No idea.

Expected behavior

Continue to run until the final case.

Actual behavior

About two days into the run, it stopped making progress. The ROS nodes were up, but nothing relevant was logged, and the output bagfile was empty.

Additional context

No resources were obviously exhausted: there was enough disk space, and the (beefy) machine was otherwise idle.

These are the logs of the final few cases/iterations leading up to the stall. I removed the bagfiles due to their size, but all except the last one were of the expected size. The bagfile of the iteration that got stuck was empty, as if nothing had been recorded since the iteration started.

tor_wic_slam_error.tar.gz

glpuga added the bug label Aug 13, 2024
glpuga (Collaborator, Author) commented Aug 26, 2024

Qualitatively, this became much more noticeable after instrumenting measurements with timemory. While a previous run without timemory got stuck only once during its four (effective) days of runtime, so far I've had to restart this one five times and I'm only halfway through the same set of bagfiles.

The limited set of logs I observed seems to have these things in common:

  • the logs never start; the amcl nodes never start processing data.
  • one node in the set (in one case nav2_amcl, in another the rosbag recorder) generates a log like "failed to send response to /rosbag2_recorder/list_parameters (timeout)" (probed in the sketch below).
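For what it's worth, that second symptom can be checked from outside the stuck run. Below is a minimal rclpy probe (not part of lambkin; the node name, target service, and timeout are illustrative) that distinguishes "service never discovered" from "service discovered but the node never responds":

```python
import rclpy
from rcl_interfaces.srv import ListParameters


def probe_list_parameters(service_name: str, timeout_sec: float = 10.0) -> None:
    """Check whether a node's list_parameters service is reachable and responsive."""
    rclpy.init()
    node = rclpy.create_node("stall_probe")
    client = node.create_client(ListParameters, service_name)
    try:
        if not client.wait_for_service(timeout_sec=timeout_sec):
            print(f"{service_name}: not discovered within {timeout_sec}s")
            return
        future = client.call_async(ListParameters.Request())
        rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_sec)
        if future.done():
            print(f"{service_name}: responded")
        else:
            # Matches the "failed to send response ... (timeout)" log above.
            print(f"{service_name}: discovered but never responded")
    finally:
        node.destroy_node()
        rclpy.shutdown()


probe_list_parameters("/rosbag2_recorder/list_parameters")
```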

glpuga (Collaborator, Author) commented Aug 26, 2024

I'll try this: ros2/rmw_fastrtps#704

hidmic added the needs-fix label Oct 24, 2024
hidmic (Collaborator) commented Oct 24, 2024

> I'll try this: ros2/rmw_fastrtps#704

Did it make a difference? Service discovery in FastRTPS (or FastDDS) isn't great: some ros2cli verbs can hang forever, and LAMBKIN doesn't consistently assign timeouts to all the processes it manages.
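For illustration, here is a minimal sketch of the kind of guard that would help, assuming the orchestrator shells out to ros2cli; `run_ros2_verb` and the 30 s budget are hypothetical, not LAMBKIN's actual API:

```python
import subprocess


def run_ros2_verb(args, timeout_s=30.0):
    """Run a ros2cli verb with a hard timeout so a hung
    discovery/service call cannot stall the whole benchmark."""
    try:
        # subprocess.run kills the child if the timeout expires.
        return subprocess.run(
            ["ros2", *args],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired as error:
        # Surface the hang so the orchestrator can retry or skip the case.
        raise RuntimeError(f"'ros2 {' '.join(args)}' hung for {timeout_s}s") from error


# e.g. the parameter listing that timed out in the logs above:
# run_ros2_verb(["param", "list", "/rosbag2_recorder"])
```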

glpuga (Collaborator, Author) commented Oct 28, 2024

> > I'll try this: ros2/rmw_fastrtps#704
>
> Did it make a difference? Service discovery in FastRTPS (or FastDDS) isn't great: some ros2cli verbs can hang forever, and LAMBKIN doesn't consistently assign timeouts to all the processes it manages.

All the lambkin runs I did after creating this issue used the proposal in ros2/rmw_fastrtps#704, and I never saw this issue again, so I guess it did.
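I haven't verified what ros2/rmw_fastrtps#704 actually proposes, so the following is only a sketch of a commonly used Fast DDS mitigation for hangs of this kind, which may or may not be the same change: disabling the builtin transports (including shared memory) in favor of UDP-only, via a profile file exported through `FASTRTPS_DEFAULT_PROFILES_FILE`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- udp_only.xml: hypothetical filename; activate it with
     export FASTRTPS_DEFAULT_PROFILES_FILE=/path/to/udp_only.xml -->
<profiles xmlns="http://www.eprosima.com/XMLSchemas/fastRTPS_Profiles">
  <transport_descriptors>
    <transport_descriptor>
      <transport_id>udp_transport</transport_id>
      <type>UDPv4</type>
    </transport_descriptor>
  </transport_descriptors>
  <participant profile_name="udp_only_participant" is_default_profile="true">
    <rtps>
      <userTransports>
        <transport_id>udp_transport</transport_id>
      </userTransports>
      <!-- disable the builtin transports, including shared memory -->
      <useBuiltinTransports>false</useBuiltinTransports>
    </rtps>
  </participant>
</profiles>
```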

glpuga (Collaborator, Author) commented Oct 30, 2024

I permanently added this to #93 with fc23f02.
