Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

intermittent error generated in wait_for_message #60

Open
dan9thsense opened this issue Jan 27, 2024 · 12 comments
Open

intermittent error generated in wait_for_message #60

dan9thsense opened this issue Jan 27, 2024 · 12 comments
Labels
bug Something isn't working

Comments

@dan9thsense
Copy link

When calling bdai_ros2_wrappers.subscription.wait_for_message I am getting this error chain about 70% of the time. Sometimes the same code completes successfully.

The call is made like this (type and topic are correct, it runs successfully now and then):

from bdai_ros2_wrappers.subscription import wait_for_message
............
msg = wait_for_message(type, topic, timeout_sec=10.0)

[simple_bdai.py-1] Exception in thread Thread-1 (spin):
[simple_bdai.py-1] Traceback (most recent call last):
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[simple_bdai.py-1] self.run()
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 953, in run
[simple_bdai.py-1] self._target(*self._args, **self._kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 295, in spin
[simple_bdai.py-1] self.spin_once()
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/utilities.py", line 55, in wrapper
[simple_bdai.py-1] return callable
(*args, **kwargs)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 628, in spin_once
[simple_bdai.py-1] self._do_spin_once(timeout_sec)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 590, in _do_spin_once
[simple_bdai.py-1] task, entity, node = self.wait_for_ready_callbacks(*args, **kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 723, in wait_for_ready_callbacks
[simple_bdai.py-1] return next(self._cb_iter)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 617, in _wait_for_ready_callbacks
[simple_bdai.py-1] waitable.add_to_wait_set(wait_set)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/event_handler.py", line 124, in add_to_wait_set
[simple_bdai.py-1] with self.__event:
[simple_bdai.py-1] rclpy._rclpy_pybind11.InvalidHandle: cannot use Destroyable because destruction was requested

@dan9thsense
Copy link
Author

Here is another example from a more minimal script. The call to get_floor_robot_gripper_state() worked a few times after the sim started, then failed as shown here. The top section of this message, enclosed in >>> ... <<< appeared with every call but things progressed normally . However, eventually a call generated the bottom section as well. The class here is included with the bdai ros_utilities call to graph() in the script that instantiates all the nodes.

import VacuumGripperControl
from bdai_ros2_wrappers.subscription import wait_for_message

class Gripper(Node):
  def __init__(self, **kwargs):
        super().__init__('gripper', **kwargs)
       
  def get_floor_robot_gripper_state(self):
        topic = "floor_robot_enable_gripper"
        msg = wait_for_message(VacuumGripperState, topic, timeout_sec=10.0)
        if msg is None:
	      self.get_logger().error(f"timed out waiting for floor_robot_enable_gripper message")
        return msg

[simple_bdai.py-1]
[simple_bdai.py-1] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[simple_bdai.py-1] This error state is being overwritten:
[simple_bdai.py-1]
[simple_bdai.py-1] 'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at ./src/type_support_dispatch.hpp:111'
[simple_bdai.py-1]
[simple_bdai.py-1] with this new error message:
[simple_bdai.py-1]
[simple_bdai.py-1] 'Handle's typesupport identifier (rosidl_typesupport_c) is not supported by this library, at ./src/type_support_dispatch.hpp:113'
[simple_bdai.py-1]
[simple_bdai.py-1] rcutils_reset_error() should be called after error handling to avoid this.
[simple_bdai.py-1] <<<
[simple_bdai.py-1] Exception in thread Thread-1 (spin):
[simple_bdai.py-1] Traceback (most recent call last):
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[simple_bdai.py-1] self.run()
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 953, in run
[simple_bdai.py-1] self._target(*self._args, **self._kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 295, in spin
[simple_bdai.py-1] self.spin_once()
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/utilities.py", line 55, in wrapper
[simple_bdai.py-1] return callable
(*args, **kwargs)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 628, in spin_once
[simple_bdai.py-1] self._do_spin_once(timeout_sec)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 590, in _do_spin_once
[simple_bdai.py-1] task, entity, node = self.wait_for_ready_callbacks(*args, **kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 723, in wait_for_ready_callbacks
[simple_bdai.py-1] return next(self._cb_iter)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 617, in _wait_for_ready_callbacks
[simple_bdai.py-1] waitable.add_to_wait_set(wait_set)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/event_handler.py", line 124, in add_to_wait_set
[simple_bdai.py-1] with self.__event:
[simple_bdai.py-1] rclpy._rclpy_pybind11.InvalidHandle: cannot use Destroyable because destruction was requested

@jbarry-bdai
Copy link
Contributor

@mhidalgo-bdai any ideas?

@mhidalgo-bdai
Copy link
Contributor

@dan9thsense I would suspect misuse. Errors indicate some ROS 2 state is getting clobbered. Would you mind showing that minimal script?

Note to myself: watch all activities on this repo. I was not notified when these issues were opened.

@dan9thsense
Copy link
Author

That was the more minimal script. To get more, you would need to run the whole ARIAC/gazebo ROS2 sim.
Other than wait_for_message, the ros utilities are working OK and the code in general is reliably getting good results and runs at reasonable speed. I don't know how to investigate a misuse case with ros utilities.

@mhidalgo-bdai
Copy link
Contributor

I see it uses wait_for_message without specifying any node, which means the global one if any will be used. How is that setup? Where's the ros_process.main decorator?

@dan9thsense
Copy link
Author

ros_process.main occurs in a node that instantiates the Gripper class shown above. It is created inside a graph method that is then run, as per the examples.

I didn't know that wait_for_message could take a node as a parameter. Is there an example of that I missed? Looking at the source code, I see it can. In the script above, I tried replacing:

msg = wait_for_message(VacuumGripperState, topic, timeout_sec=10.0)

with:

msg = wait_for_message(VacuumGripperState, topic, timeout_sec=10.0, node=self)

but it doesn't seem to matter much:

[simple_bdai.py-1]
[simple_bdai.py-1] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[simple_bdai.py-1] This error state is being overwritten:
[simple_bdai.py-1]
[simple_bdai.py-1] 'Handle's typesupport identifier (rosidl_typesupport_cpp) is not supported by this library, at ./src/type_support_dispatch.hpp:111'
[simple_bdai.py-1]
[simple_bdai.py-1] with this new error message:
[simple_bdai.py-1]
[simple_bdai.py-1] 'Handle's typesupport identifier (rosidl_typesupport_c) is not supported by this library, at ./src/type_support_dispatch.hpp:113'
[simple_bdai.py-1]
[simple_bdai.py-1] rcutils_reset_error() should be called after error handling to avoid this.
[simple_bdai.py-1] <<<
[simple_bdai.py-1] Exception in thread Thread-1 (spin):
[simple_bdai.py-1] Traceback (most recent call last):
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
[simple_bdai.py-1] self.run()
[simple_bdai.py-1] File "/usr/lib/python3.10/threading.py", line 953, in run
[simple_bdai.py-1] [INFO] [1707175069.798329631] [gripper]: current gripper type: part_gripper
[simple_bdai.py-1] self._target(*self._args, **self._kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 295, in spin
[simple_bdai.py-1] self.spin_once()
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/utilities.py", line 55, in wrapper
[simple_bdai.py-1] return callable
(*args, **kwargs)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 628, in spin_once
[simple_bdai.py-1] self._do_spin_once(timeout_sec)
[simple_bdai.py-1] File "/home/dbarry/ariac_ws/install/bdai_ros2_wrappers/lib/python3.10/site-packages/bdai_ros2_wrappers/executors.py", line 590, in _do_spin_once
[simple_bdai.py-1] task, entity, node = self.wait_for_ready_callbacks(*args, **kwargs)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 723, in wait_for_ready_callbacks
[simple_bdai.py-1] return next(self._cb_iter)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/executors.py", line 617, in _wait_for_ready_callbacks
[simple_bdai.py-1] waitable.add_to_wait_set(wait_set)
[simple_bdai.py-1] File "/opt/ros/iron/lib/python3.10/site-packages/rclpy/event_handler.py", line 124, in add_to_wait_set
[simple_bdai.py-1] with self.__event:
[simple_bdai.py-1] rclpy._rclpy_pybind11.InvalidHandle: cannot use Destroyable because destruction was requested

@mhidalgo-bdai
Copy link
Contributor

mhidalgo-bdai commented Feb 6, 2024

@dan9thsense it's hard to pinpoint where the issue is from code fragments. Like I said before (above?), the issue is that ROS 2 state is getting clobbered.

ros_process.main occurs in a node that instantiates the Gripper class shown above.

FWIW ros_process.main was designed to decorate the program entrypoint function, typically main(). That is, the first function invoked, or the last function in the call stack, so that its lifetime is almost the same as that of the program. If you are decorating some other function that gets invoked elsewhere, unless you were very careful (careful in a way that is not even documented), chances are that you are losing that state.

@dan9thsense
Copy link
Author

That's exactly where ros.process.main is located. There is only one main function and it decorates that.
I don't know what you mean by "clobbered." Can you be more specific?

@mhidalgo-bdai
Copy link
Contributor

I don't know what you mean by "clobbered."

I mean destroyed.


Hmm, on a closer look, why do you think this is caused by wait_for_message? I picked that up and didn't question it either. The backtrace indicates it's a waitable in the executor that's throwing. Waitables are usually associated to actions. I would try to debug this with pdb or similar.

@dan9thsense
Copy link
Author

It's definitely associated with wait_for_message. It is never there without that. If I take out this one line of code:

msg = wait_for_message(VacuumGripperState, topic, timeout_sec=10.0)

the problem never occurs. Also, that same line generates the same problem in other nodes.

In what way is the ros2 state being destroyed? Are you referring to this line:

[simple_bdai.py-1] rclpy._rclpy_pybind11.InvalidHandle: cannot use Destroyable because destruction was requested

Most of the other lines refer to spin_once, which I'm pretty sure is coming from ros_utilities since I'm not calling it at all in my code.

I can give you more of the code if you want, such as the main function.

@jbarry-bdai
Copy link
Contributor

@mhidalgo-bdai I took a look at the full code here (I know OP) and wait_for_message is being used correctly as far as I could tell. This is supported by the fact that other parts of the program work - only wait_for_message causes the error so it's not like a node or the ros2 context are getting destroyed. I tried to reproduce with a simple example and could not.

A few observations:

  • Removing future.add_done_callback(lambda future: node.destroy_subscription(sub)) from wait_for_message_async keeps this error from being fatal. This indicates to me that the problem is in something getting destroyed too soon (as you noted) but it appears to be associated with the infrastructure inside wait_for_message.
  • My first hypothesis was that the problem was that when we exit from wait_for_message, the future associated with it is destroyed and then this leads to issues when the executor tries to execute the done callbacks associated with that future later on. However, I tried reproducing wait_for_message inside the code using a class variable for the future and that still did not work so I don't think that's the exact problem but I wonder if it's something similar?

@mhidalgo-bdai
Copy link
Contributor

mhidalgo-bdai commented Feb 7, 2024

So, after some digging, I'm 95% sure this is a bug upstream.

I couldn't make sense of the backtrace until I rediscovered subscriptions have QoS event handlers. I believe subscription destruction is racing with wait set construction. QoS event handlers are collected as waitables, which prevents destruction, but doesn't prevent destruction from being requested. Then, as soon as you hit waitable.add_to_wait_set(wait_set), you get that destruction requested exception.

If the above is correct, waitable support in rclpy executors is somewhat broken. It doesn't show for action clients and servers because handle locking was conveniently left out.

We should open a ticket on https://github.com/ros2/rclpy.

@mhidalgo-bdai mhidalgo-bdai added the bug Something isn't working label Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants