diff --git a/rep-0155.rst b/rep-0155.rst
new file mode 100644
index 00000000..b4d08205
--- /dev/null
+++ b/rep-0155.rst
@@ -0,0 +1,912 @@
+REP: 155
+Title: Conventions, Topics, Interfaces for Perception in Human-Robot Interaction
+Author: Séverin Lemaignan
+Status: Draft
+Type: Informational
+Content-Type: text/x-rst
+Created: 11-Jan-2022
+
+
+Abstract
+========
+
+This REP provides a set of conventions and common interfaces for Human-Robot
+Interaction (HRI) scenarios, with a focus on the perception of humans and social
+signals.
+It aims at enabling interoperability and reusability of core functionality
+between the many HRI-related software tools, from skeleton tracking, to face
+recognition, to natural language processing.
+
+In addition, these interfaces are designed to be relevant for a broad range of
+HRI situations, from crowd simulation, to kinaesthetic teaching, to social
+interaction.
+
+Specifically, this REP covers:
+
+- human modeling, as a combination of a permanent identity (person) and
+  transient parts that are intermittently detected (eg face, skeleton, voice);
+- topic naming conventions under the ``/humans/`` topic namespace;
+- human kinematic modeling (based on dynamically generated URDF models), as
+  well as 3D TF frame conventions (naming, orientation);
+- representation of group interactions (groups, mutual gaze).
+
+Rationale
+=========
+
+ROS is widely used in the context of human-robot interaction (HRI).
+However, to date, no single effort (e.g. [1]_ [2]_) has been successful at
+coming up with broadly accepted interfaces and pipelines for that domain, as
+found in other parts of the ROS ecosystem (for manipulation or 2D navigation,
+for instance).
+As a result, many different implementations of common tasks (skeleton tracking,
+face recognition, speech processing, etc.) coexist, and while they achieve
+similar goals, they are not generally compatible, hampering code reusability,
+experiment replicability, and the general sharing of knowledge.
+
+In order to address this issue, this REP aims at structuring the whole "ROS for
+HRI" space by creating an adequate set of ROS messages and services to describe
+the software interactions relevant to the HRI domain, as well as a set of
+conventions (eg topics structure, tf frames) to expose human-related
+information.
+
+The REP aims at modeling these interfaces based on existing, state-of-the-art
+algorithms relevant to HRI perception, while considering the broad range of
+application scenarios in HRI.
+
+It is hoped that such an effort will allow easier collaboration between
+projects and reduce the duplication of efforts to implement the same
+functionality.
+
+Overview
+========
+
+This REP specifies multiple aspects of human-robot interaction, with a primary
+focus on human perception/social signal recognition.
+
+It is split into 4 sections:
+
+1. human representation, as a combination of a permanent identity (*person*)
+   and transient parts that are intermittently detected (e.g. *face*,
+   *skeleton*, *voice*);
+2. topic naming conventions under the ``/humans/`` topic namespace;
+3. kinematic model of the human and 3D tf frame conventions (naming,
+   orientation, compatible with REP-120 [3]_ where possible);
+4. representation of group interactions (groups, mutual gaze).
+
+By following the naming conventions and leveraging the interfaces defined in
+this REP, both tools and libraries can be designed to be reusable between
+different frameworks and experiments.
+
+.. note:: Importantly, the REP does not mandate specific tools or algorithms to
+   perform human perception/social signal recognition per se. It only specifies
+   naming conventions and interfaces between these nodes.
+
+The use of capitalized words such as MUST, SHOULD, MAY must be understood as
+specified in the IETF RFC2119 'Key words for use in RFCs to Indicate Requirement
+Levels' [6]_.
+
+Human Representation
+====================
+
+To accommodate existing tools and techniques used to detect and recognise
+humans, the representation of a person is built on a combination of 4
+unique identifiers (UUIDs): a **person identifier**, a **face identifier**, a
+**body identifier** and a **voice identifier**.
+Future revisions of this REP might add additional identifiers.
+
+These four identifiers are not mutually exclusive, and depending on the
+requirements of the application, the available sensing capabilities, and the
+position/behaviour of the humans, only some might be available for a given
+person, at a given time.
+
+Person Identifier
+-----------------
+
+The **person identifier** MUST be a unique ID (typically, a UUID) permanently
+associated with a unique person.
+This person ID is normally assigned by a node able to perform person
+identification (face recognition node, voice recognition node, sound source
+localisation + name, identification based on physical features like
+height/age/gender, person identification based on pre-defined features like the
+colour of the clothes, etc.)
+This ID is meant to be **persistent** so that the robot can recognize people
+across encounters/sessions.
+Nodes providing person IDs MAY serialise these IDs to a permanent storage, for
+them to persist across robot reboots.
+
+When meaningful (see section `Person frame`_), a TF frame MUST be
+associated with the person ID and named ``person_<personID>``.
+Due to the importance of the head in human-robot interaction, the
+``person_<personID>`` frame is expected to be placed as close as possible to
+the head of the human.
+If neither the face nor the skeleton is tracked, the ``person_<personID>``
+frame might be located at the last known position of the human, or removed
+altogether if no meaningful estimate of the human location is available.
+See section `Person frame`_ for details regarding the ``person_<personID>``
+frame.
+
+At any given time, the list of currently-seen persons is published under the
+``/humans/persons/tracked`` topic as ``hri_msgs/IdsList`` messages, and the list
+of all known persons (ie, persons that have been seen and recognized at least
+once in the past) under ``/humans/persons/known``.
+
+Merging Person Identifiers
+''''''''''''''''''''''''''
+
+In certain cases, two person IDs must be merged (for instance, the robot
+detects that a voice and a face that were thought to belong to different people
+do in fact belong to the same person).
+
+In such a case, one of the person IDs is marked as an *alias* of the other
+person, by publishing the ID of the other person on a special subtopic named
+``alias``.
+See section `Topics structure`_ for details.
+
+.. note:: The reverse operation (splitting a person into two) can be realised
+   by simply publishing a second person ID.
+
+Anonymous persons
+'''''''''''''''''
+
+While person IDs are generally expected to be permanent, one exception exists
+for persons that the robot has detected but not yet identified.
+
+For instance, the robot hears a voice, and therefore knows that a person is
+around, but no voice identification node is available -- or the voice
+identification has not yet recognised the voice. In such a case, an *anonymous
+person* MAY be created, ie a person who has not yet been assigned a permanent
+ID.
+
+*Anonymous persons* are treated like regular persons.
+They however publish a latched ``true`` boolean on their ``/anonymous``
+subtopic, and their ID is not guaranteed to be permanent (it can in fact
+change or be removed at any point).
+
+Face Identifier
+---------------
+
+The **face identifier** MUST be a unique ID that identifies a detected face.
+This ID is typically generated by the face detector/head pose estimator upon
+face detection.
+
+Importantly, **this ID is not persistent**: once a face is lost (for instance,
+the person goes out of frame), its ID is no longer valid nor meaningful.
+To cater for a broad range of applications (where re-identification might not
+always be necessary), there is no expectation that the face detector will
+attempt to recognise the face and re-assign the same face ID if the person
+reappears.
+
+.. note:: A face detector/face tracker MAY reuse the same face ID if it is
+   confident that the face is indeed the same.
+
+There is a one-to-one relationship between this face ID and the estimated 6D
+pose of the head. If the node publishes a head pose estimation, the ROS TF frame
+MUST be named ``face_<faceID>`` (see section `Face and Gaze Frames`_ for the
+face frame conventions).
+
+At any given time, the list of tracked faces SHOULD be published under the
+``/humans/faces/tracked`` topic as ``hri_msgs/IdsList`` messages.
+
+Body Identifier
+---------------
+
+Similarly to the face identifier, the **body identifier** MUST be a unique ID,
+associated with a person’s skeleton.
+It is normally created by a skeleton tracker upon detection of a skeleton.
+
+Like the face ID, **the body ID is not persistent**, and is valid only as long
+as the specific skeleton is tracked by the skeleton tracker which initially
+detected it.
+
+The corresponding TF frame is ``body_<bodyID>``, and the TF frames associated
+with each of the body parts of the person MUST be suffixed with the same ID
+(see section `Body frames`_).
+
+At any given time, the list of tracked bodies SHOULD be published under the
+``/humans/bodies/tracked`` topic as ``hri_msgs/IdsList`` messages.
+
+Voice Identifier
+----------------
+
+Likewise, a speech separation node MUST assign a unique, non-persistent, ID
+to each detected voice.
+Tracked voices SHOULD be published under the ``/humans/voices/tracked`` topic as
+``hri_msgs/IdsList`` messages.
+
+Identifier Matching
+-------------------
+
+Associations between IDs (for instance to denote that a given voice belongs to
+a given person, or a given face to a given body) are expressed by publishing
+``hri_msgs/IdsMatch`` messages on the ``/humans/candidate_matches`` topic.
+The ``hri_msgs/IdsMatch`` message MAY include a confidence level.
+
+A typical implementation will have several specialised nodes publishing
+candidate matches on ``/humans/candidate_matches`` (e.g. a face recognition node
+providing matches between faces and persons; a voice recognition node providing
+matches between voices and persons) and one 'person manager' node collecting
+the candidates, and publishing the most likely associations between a person ID
+and its face/body/voice IDs under the ``/humans/persons/`` namespace.
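+
+A minimal rospy sketch of such a specialised node is shown below (the node name
+is illustrative; the ``IdsMatch`` fields used here are the ones shown in the
+`Illustrative Example`_ below):
+
+.. code:: python
+
+   #!/usr/bin/env python
+   # Sketch: publish a candidate match between a face ID and a person ID.
+   import rospy
+   from hri_msgs.msg import IdsMatch
+
+   rospy.init_node("face_identification_sketch")
+   pub = rospy.Publisher("/humans/candidate_matches", IdsMatch, queue_size=10)
+
+   def on_face_recognised(face_id, person_id, confidence):
+       """Called by eg a face recognition algorithm (not shown here)."""
+       match = IdsMatch()
+       match.face_id = face_id
+       match.person_id = person_id
+       match.confidence = confidence
+       pub.publish(match)
+
+   rospy.sleep(0.5)  # leave some time for subscribers to connect
+   on_face_recognised("23bd5", "76c0c", 0.73)
+   rospy.spin()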
+
+
+Identifier Syntax
+-----------------
+
+Identifiers can be arbitrary, as long as they are unique. We recommend
+following standard syntactic rules so that they are also valid identifiers in
+mainstream programming languages (eg, start with a letter).
+
+Note that using people’s names as identifiers is possible, but not
+generally recommended as collisions are likely.
+
+Common Parameters
+=================
+
+- ``/humans/faces/width`` (default: 128): width in pixels of the cropped faces
+  published under ``/humans/faces/XYZ/cropped``, ``/humans/faces/XYZ/aligned``
+  and ``/humans/faces/XYZ/frontalized``
+- ``/humans/faces/height`` (default: 128): height in pixels of the cropped
+  faces published under ``/humans/faces/XYZ/cropped``,
+  ``/humans/faces/XYZ/aligned`` and ``/humans/faces/XYZ/frontalized``
+- ``/human_description_<bodyID>``: URDF models of detected humans.
+  See Section `Kinematic Model of the Human`_ for details.
+- ``/humans/match_threshold`` (``float``, default: 0.5): the minimum level of
+  likelihood to consider a face/body/voice to belong to a given person.
+- ``/humans/reference_frame`` (``string``, default: `map`): persons' TF frames
+  must be published with respect to `reference_frame`.
+  Typically, faces/bodies/voices frames are published wrt their respective
+  sensor frames. `reference_frame` is usually a 'static' frame (eg `map`), so
+  that if the person moves out of view of the robot (and therefore, their
+  position can not be updated anymore), they remain where they were last seen,
+  independently of any possible robot motion.
+
+Topics Structure
+================
+
+A system implementing this REP MUST follow the following conventions
+for all HRI-related topics:
+
+1. all topics are grouped under the global namespace ``/humans/``;
+2. five sub-namespaces are available:
+
+   - ``/humans/faces``
+   - ``/humans/bodies``
+   - ``/humans/voices``
+   - ``/humans/persons``
+   - ``/humans/interactions``
+
+3. the first four (``/faces``, ``/bodies``, ``/voices``, ``/persons``) expose
+   one sub-namespace per face, body, voice, person detected, named after the
+   corresponding ID (for instance, ``/humans/faces/bd34a/``).
+   In addition, they expose a topic ``/tracked`` (of type ``hri_msgs/IdsList``)
+   where the list of currently tracked faces/bodies/voices/persons is published;
+4. matches between faces/bodies/voices/persons are published on the
+   ``/humans/candidate_matches`` topic, as explained in Section `Identifier
+   matching`_;
+5. the ``/humans/interactions`` namespace exposes group-level signals,
+   including gazing patterns and social groups.
+
+.. note:: the ``hri_msgs`` messages are defined in the `hri_msgs
+   <https://github.com/ros4hri/hri_msgs>`_ repository.
+
+.. note:: The slightly unconventional structure of topics (with one namespace
+   per face, body, person, etc.) enables modular pipelines.
+
+   For instance, a face detector might publish cropped images of detected faces
+   under ``/humans/faces/face_1/cropped``, ``/humans/faces/face_2/cropped``,
+   etc.
+
+   Then, depending on the application, an additional facial expression
+   recognizer might be needed as well.
+   For each detected face, that node would subscribe to the corresponding
+   `/cropped` topic and publish its results under
+   ``/humans/faces/face_1/expression``, ``/humans/faces/face_2/expression``,
+   etc., augmenting the available information about each face in a modular way.
+
+   Such modularity would not be easily possible if we had chosen to publish
+   instead a generic ``Face`` message, as a single node would first have had to
+   fuse all possible information about faces.
+
+   See the `Illustrative Example`_ below for a complete example.
+
+.. note:: `libhri <https://github.com/ros4hri/libhri>`_ can be used to hide
+   away the complexity of tracking new persons/faces/bodies/voices. It
+   automatically handles subscribing/unsubscribing to the right topics when new
+   persons/faces/bodies/voices are detected.
+
+Faces
+-----
+
+The list of currently detected faces (list of face IDs) is published
+under ``/humans/faces/tracked`` (as a ``hri_msgs/IdsList`` message).
+
+For each detected face, a namespace ``/humans/faces/<faceID>/`` is
+created (eg ``/humans/faces/bf3d/``).
+
+The following subtopics MAY then be available, depending on available
+detectors:
+
+=================== ==================================== ======== ========================
+Name                Message type                         Required Description
+=================== ==================================== ======== ========================
+``/roi``            ``sensor_msgs/RegionOfInterest``     x        Region of the face in
+                                                                  the source image
+``/cropped``        ``sensor_msgs/Image``                x        Cropped face image, if
+                                                                  necessary scaled,
+                                                                  centered and 0-padded
+                                                                  to match the
+                                                                  ``/humans/faces/width``
+                                                                  and
+                                                                  ``/humans/faces/height``
+                                                                  ROS parameters
+``/aligned``        ``sensor_msgs/Image``                         Aligned (eg, the two
+                                                                  eyes are horizontally
+                                                                  aligned) version of the
+                                                                  cropped face, with same
+                                                                  resolution as
+                                                                  ``/cropped``
+``/frontalized``    ``sensor_msgs/Image``                         Frontalized version of
+                                                                  the cropped face, with
+                                                                  same resolution as
+                                                                  ``/cropped``
+``/landmarks``      ``hri_msgs/FacialLandmarks``                  2D facial landmarks
+                                                                  extracted from the face
+``/facs``           ``hri_msgs/FacialActionUnits``                The presence and
+                                                                  intensity of facial
+                                                                  action units found in
+                                                                  the face
+``/expression``     ``hri_msgs/Expression``                       The expression
+                                                                  recognised from the
+                                                                  face
+``/softbiometrics`` ``hri_msgs/SoftBiometrics``                   Detected age and gender
+                                                                  of the person
+=================== ==================================== ======== ========================
+
+Bodies
+------
+
+The list of currently detected bodies (list of body IDs) is published
+under ``/humans/bodies/tracked`` (as a ``hri_msgs/IdsList`` message).
+
+For each detected body, a namespace ``/humans/bodies/<bodyID>/`` is
+created.
+The following subtopics MAY then be available, depending on available
+detectors:
+
+================= ==================================== ======== ========================
+Name              Message type                         Required Description
+================= ==================================== ======== ========================
+``/roi``          ``sensor_msgs/RegionOfInterest``     x        Region of the whole body
+                                                                in the source image
+``/cropped``      ``sensor_msgs/Image``                x        Cropped body image
+``/skeleton2d``   ``hri_msgs/Skeleton2D``                       The 2D points of the
+                                                                detected skeleton
+``/joint_states`` ``sensor_msgs/JointState``                    The joint state of the
+                                                                human body, following
+                                                                the `Kinematic Model
+                                                                of the Human`_
+``/posture``      ``hri_msgs/BodyPosture``                      Recognised body posture
+                                                                (eg standing, sitting)
+``/gesture``      ``hri_msgs/Gesture``                          Recognised symbolic
+                                                                gesture (eg waving)
+================= ==================================== ======== ========================
+
+3D body poses SHOULD be exposed via TF frames.
+This is discussed in
+Section `Kinematic Model and Coordinate Frames`_.
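+
+As an illustration of the per-ID namespaces above, the following minimal rospy
+sketch subscribes to the list of tracked faces, and dynamically (un)subscribes
+to the per-face ``/cropped`` topics -- the pattern that `libhri` automates.
+The sketch assumes that ``hri_msgs/IdsList`` carries the current IDs in an
+``ids`` string array; node and callback names are illustrative:
+
+.. code:: python
+
+   #!/usr/bin/env python
+   import rospy
+   from hri_msgs.msg import IdsList
+   from sensor_msgs.msg import Image
+
+   face_subscribers = {}  # face ID -> Subscriber on its /cropped topic
+
+   def on_cropped_face(msg, face_id):
+       rospy.loginfo("Received a cropped face for face <%s>" % face_id)
+
+   def on_tracked_faces(msg):
+       tracked = set(msg.ids)
+       # subscribe to the newly tracked faces...
+       for face_id in tracked - set(face_subscribers):
+           face_subscribers[face_id] = rospy.Subscriber(
+               "/humans/faces/%s/cropped" % face_id,
+               Image, on_cropped_face, callback_args=face_id)
+       # ...and unsubscribe from the faces that are not tracked anymore
+       for face_id in set(face_subscribers) - tracked:
+           face_subscribers.pop(face_id).unregister()
+
+   rospy.init_node("cropped_face_listener_sketch")
+   rospy.Subscriber("/humans/faces/tracked", IdsList, on_tracked_faces)
+   rospy.spin()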
+
+Voices
+------
+
+The list of currently detected voices (list of voice IDs) is published
+under ``/humans/voices/tracked`` (as a ``hri_msgs/IdsList`` message).
+
+For each detected voice, a namespace ``/humans/voices/<voiceID>/`` is
+created.
+
+The following subtopics MAY then be available, depending on available
+detectors:
+
+================ ==================================== ======== ========================
+Name             Message type                         Required Description
+================ ==================================== ======== ========================
+``/audio``       ``audio_common_msgs/AudioData``      x        Separated audio stream
+                                                               for this voice
+``/features``    ``hri_msgs/AudioFeatures``                    INTERSPEECH’09 Emotion
+                                                               challenge [4]_
+                                                               low-level audio features
+``/is_speaking`` ``std_msgs/Bool``                             Whether or not speech is
+                                                               recognised from this
+                                                               voice
+``/speech``      ``hri_msgs/LiveSpeech``                       The live stream of speech
+                                                               recognized via an ASR
+                                                               engine
+================ ==================================== ======== ========================
+
+Persons
+-------
+
+The list of currently tracked persons (list of person IDs) is published
+under ``/humans/persons/tracked`` (as a ``hri_msgs/IdsList`` message).
+
+The list of known persons (either actively tracked, or known but not tracked
+anymore) is published under ``/humans/persons/known`` (as a ``hri_msgs/IdsList``
+message).
+
+For each detected person, a namespace ``/humans/persons/<personID>/`` is
+created.
+
+The following subtopics MAY then be available, depending on available
+detectors, and whether or not the person has yet been matched to a face/body/voice:
+
+======================== ==================================== ======== ========================
+Name                     Message type                         Required Description
+======================== ==================================== ======== ========================
+``/anonymous``           ``std_msgs/Bool``                    x        If true, the person is
+                         (latched)                                     *anonymous*, ie has
+                                                                       not yet been identified,
+                                                                       and has not been issued
+                                                                       a permanent ID
+``/face_id``             ``std_msgs/String``                           Face matched to that
+                         (latched)                                     person (if any)
+``/body_id``             ``std_msgs/String``                           Body matched to that
+                         (latched)                                     person (if any)
+``/voice_id``            ``std_msgs/String``                           Voice matched to that
+                         (latched)                                     person (if any)
+``/alias``               ``std_msgs/String``                           If this person has been
+                         (latched)                                     merged with another,
+                                                                       this topic contains the
+                                                                       person ID of the new
+                                                                       person
+``/engagement_status``   ``hri_msgs/EngagementLevel``                  Engagement status of the
+                                                                       person with the robot
+``/location_confidence`` ``std_msgs/Float32``                          Location confidence; 1
+                                                                       means *person currently
+                                                                       seen*, 0 means *person
+                                                                       location unknown*. See
+                                                                       `Person Frame`_
+``/name``                ``std_msgs/String``                           Name, if known
+``/native_language``     ``std_msgs/String``                           IETF language codes like
+                                                                       ``en-GB``, if known
+======================== ==================================== ======== ========================
+
+Interactions
+------------
+
+Finally, the namespace ``/humans/interactions`` exposes topics where group-level
+interactions are published when detected.
+
+=========== ============================== ===========================
+Name        Message type                   Description
+=========== ============================== ===========================
+``/groups`` ``hri_msgs/Group``             Estimated social groups
+``/gazing`` ``hri_msgs/Gaze``              Estimated gazing behaviours
+=========== ============================== ===========================
+
+See section `Group-level Interactions`_ for details.
+
+Illustrative Example
+--------------------
+
+You run a node ``your_face_detector_node``.
+This node detects two faces, and
+publishes the corresponding regions of interest and cropped faces.
+The node
+effectively advertises and publishes onto the following topics:
+
+.. code::
+
+   > rostopic list
+   /humans/faces/23bd5/roi       # sensor_msgs/RegionOfInterest
+   /humans/faces/23bd5/cropped   # sensor_msgs/Image
+   /humans/faces/b092e/roi       # sensor_msgs/RegionOfInterest
+   /humans/faces/b092e/cropped   # sensor_msgs/Image
+
+.. note:: The IDs (in this example, ``23bd5`` and ``b092e``) are arbitrary, as
+   long as they are unique.
+   However, for practical reasons, it is recommended to keep them reasonably
+   short.
+
+You start an additional node to recognise expressions:
+``your_expression_classifier_node``.
+The node subscribes to the
+``/humans/faces/<faceID>/cropped`` topics and publishes expressions for each
+face under the same namespace:
+
+.. code::
+
+   > rostopic list
+   /humans/faces/23bd5/roi
+   /humans/faces/23bd5/cropped
+   /humans/faces/23bd5/expression   # hri_msgs/Expression
+   /humans/faces/b092e/roi
+   /humans/faces/b092e/cropped
+   /humans/faces/b092e/expression   # hri_msgs/Expression
+
+
+You then launch ``your_body_tracker_node``.
+It detects one body:
+
+.. code::
+
+   > rostopic list
+   /humans/faces/23bd5/...
+   /humans/faces/b092e/...
+   /humans/bodies/67dd1/roi       # sensor_msgs/RegionOfInterest
+   /humans/bodies/67dd1/cropped   # sensor_msgs/Image
+
+In addition, you start a 2D/3D pose estimator ``your_skeleton_estimator_node``.
+The 2D skeleton can be published under the same body namespace, and the 3D
+skeleton is published as a joint state.
+The joint state can then be converted
+into TF frames using eg a URDF model of the human, alongside a
+``robot_state_publisher``:
+
+.. code::
+
+   > rostopic list
+   /humans/faces/23bd5/...
+   /humans/faces/b092e/...
+   /humans/bodies/67dd1/roi
+   /humans/bodies/67dd1/cropped
+   /humans/bodies/67dd1/skeleton2d     # hri_msgs/Skeleton2D
+   /humans/bodies/67dd1/joint_states   # sensor_msgs/JointState
+
+
+   > xacro ws/human_description/urdf/human-tpl.xacro id:=67dd1 height:=1.7 > body-67dd1.urdf
+   > rosparam set human_description_67dd1 -t body-67dd1.urdf
+   > rosrun robot_state_publisher robot_state_publisher joint_states:=/humans/bodies/67dd1/joint_states robot_description:=human_description_67dd1
+
+.. note:: In this example, we manually generate the URDF model of the human,
+   load it onto the ROS parameter server, and start a ``robot_state_publisher``.
+   In practice, this should be done programmatically every time a new body is
+   detected.
+
+
+So far, faces and bodies are detected, but they are not yet 'unified' as a
+person.
+
+First, we need a stable way to associate a face to a person.
+This would typically require a node for facial recognition. Such a node would
+subscribe to each of the detected faces' ``/cropped`` subtopics, and publish
+*candidate matches* on the ``/humans/candidate_matches`` topic, using a
+``hri_msgs/IdsMatch`` message.
+For instance:
+
+.. code::
+
+   > rostopic echo /humans/candidate_matches
+   face_id: "23bd5"
+   body_id: ''
+   voice_id: ''
+   person_id: "76c0c"
+   confidence: 0.73
+   ---
+
+In that example, the person ID ``76c0c`` is created and assigned by the face
+recognition node itself.
+
+Finally, you would need a ``your_person_manager_node`` to publish the
+``/humans/persons/76c0c/`` subtopics based on the candidate matches:
+
+.. code::
+
+   > rostopic list
+   /humans/faces/23bd5/...
+   /humans/faces/b092e/...
+   /humans/bodies/67dd1/...
+   /humans/persons/76c0c/face_id
+
+In this simple example, only the ``/face_id`` subtopic would be advertised (with
+a latched message pointing to the face ID ``23bd5``).
+In practice, additional
+information could be gathered by the ``your_person_manager_node`` to expose eg
+soft biometrics, engagement, etc.
+Similarly, the association between the person
+and their body would have to be performed by a dedicated node.
+
+Overall, six independent nodes are combined to implement this pipeline:
+
+.. raw:: html
+
+   <div class="mermaid">
+   graph TD
+   img(image)
+   FR[your_face_recognizer_node]
+   PE[your_skeleton_estimator_node]
+   BT[your_body_tracker_node]
+   EC[your_expression_classifier_node]
+   FD[your_face_detector_node]
+   PM[your_person_manager_node]
+   img --> FD
+   img --> BT
+   FD --> EC
+   FD --> FR
+   FR --> PM
+   BT --> PE
+   </div>
+
+This possible pipeline is only for illustration purposes: depending on the
+specific pipeline implementation, some of these nodes might be merged or, on
+the contrary, further divided into smaller nodes.
+For instance, one might choose to integrate together the face recogniser node
+and the person manager.
+
+Note as well that in order to build a complete perception pipeline for HRI,
+additional nodes would be needed, for instance for voice processing.
+
+Kinematic Model and Coordinate Frames
+=====================================
+
+Where meaningful, the coordinate frames used for humans follow the
+conventions set out in REP-120 [3]_.
+
+These conventions also follow REP-103 [5]_.
+
+Kinematic Model of the Human
+----------------------------
+
+.. image:: rep-0155/frames.png
+   :width: 600
+   :alt: Main joints of the human kinematic model (right: human URDF model,
+         rendered in rviz)
+
+The 15 main links defined on the human body are presented in the above diagram.
+Frame orientations and naming are based on REP-103 and REP-120.
+Right: render of the reference URDF model of the human body in `rviz`.
+
+The following diagram presents all the links (boxes) and joints (arrows) in the
+recommended human kinematic model.
+
+.. raw:: html
+
+   <div class="mermaid">
+   graph TD
+   B[body] -->|waist| W[waist]
+   W --> |"torso [fixed]"| T[torso]
+
+   T -->|r_head| D[r_head]
+   D -->|y_head| E[y_head]
+   E -->|p_head| F[p_head]
+   F -->|"head [fixed]"| G[head]
+
+   T -->|l_y_shoulder| SLY[l_y_shoulder]
+   SLY -->|l_p_shoulder| SLP[l_p_shoulder]
+   SLP -->|l_r_shoulder| SL[l_shoulder]
+   SL -->|l_elbow| EL[l_elbow]
+   EL -->|"l_wrist [fixed]"| WL[l_wrist]
+
+   T -->|r_y_shoulder| SRY[r_y_shoulder]
+   SRY -->|r_p_shoulder| SRP[r_p_shoulder]
+   SRP -->|r_r_shoulder| SR[r_shoulder]
+   SR -->|r_elbow| ER[r_elbow]
+   ER -->|"r_wrist [fixed]"| WR[r_wrist]
+
+   B -->|l_y_hip| HLY[l_y_hip]
+   HLY -->|l_p_hip| HLP[l_p_hip]
+   HLP -->|l_r_hip| HL[l_hip]
+   HL -->|l_knee| KL[l_knee]
+   KL -->|"l_ankle [fixed]"| AL[l_ankle]
+
+   B -->|r_y_hip| HRY[r_y_hip]
+   HRY -->|r_p_hip| HRP[r_p_hip]
+   HRP -->|r_r_hip| HR[r_hip]
+   HR -->|r_knee| KR[r_knee]
+   KR -->|"r_ankle [fixed]"| AR[r_ankle]
+   </div>
+
+In practice, each of these links and joints must be suffixed with the
+corresponding ``<bodyID>``, as several skeletons might be present at the same
+time.
+
+A parametric URDF model of humans is available in the ``human_description``
+package.
+It SHOULD be used to instantiate, at run-time, a new human URDF model for each
+detected person, adjusted for eg their height.
+The person's joint state
+(published under ``/humans/bodies/<bodyID>/joint_states``) can then be used with
+eg a `robot_state_publisher node <http://wiki.ros.org/robot_state_publisher>`_
+to publish the body's TF frames.
+
+When generated, the URDF models of the humans should be loaded on the ROS
+parameter server under ``/human_description_<bodyID>``.
+
+.. note:: the `human_description
+   <https://github.com/ros4hri/human_description>`_ ROS package contains a
+   launch script ``visualize.launch`` that can be used to quickly experiment
+   with the kinematic model of humans.
+
+
+Face and Gaze Frames
+--------------------
+
+
+- Head pose estimation nodes MUST publish the head 6D pose as
+  a TF frame named ``face_<faceID>`` where ``<faceID>`` stands for the
+  unique face identifier.
+- The parent of this frame is the sensor frame used to estimate the
+  face pose.
+- The origin of the frame must be the sellion (defined as the deepest
+  midline point of the angle formed between the nose and forehead; it can
+  generally be approximated to the mid point of the line connecting the two
+  eyes).
+- The ``x`` axis is expected to point forward (ie, out of the face),
+  the ``z`` axis is expected to point toward the scalp (ie, up when the
+  person is standing vertically).
+- Any other facial landmark published as a TF frame must be parented to
+  the head TF frame.
+  It should be suffixed with the same ``_<faceID>``.
+
+In addition, the person's gaze direction MUST be published as a
+``gaze_<faceID>`` frame, collocated with the ``face_<faceID>`` frame, and with
+its ``z`` axis aligned with the estimated gaze vector, ``x`` right, and ``y``
+down ('optical frame' convention).
+
+If gaze is not estimated beyond general head orientation, the
+``gaze_<faceID>``'s ``z`` axis will be colinear with the ``face_<faceID>``'s
+``x`` axis.
+
+Finally, nodes performing attention estimation MAY publish a frame
+``focus_<personID>`` representing the estimated focus of attention of the
+person.
+
+
+Body Frames
+-----------
+
+
+- The body frame is named ``body_<bodyID>`` where ``<bodyID>`` stands
+  for the unique skeleton identifier.
+- The origin of the frame is located at the mid point of the line
+  connecting the hips.
+- The parent of this frame is the sensor frame used to estimate the
+  body pose.
+- The ``x`` axis is expected to point forward (ie, out of the body),
+  the ``z`` axis is expected to point toward the head (ie, up when the
+  person is standing vertically).
+- The other skeleton points published as TF frames must be parented to
+  the root ``body_<bodyID>`` frame, and all be suffixed with the same
+  ``_<bodyID>``.
+  Section `Kinematic Model of the Human`_ lists the recommended names of body
+  links and body joints.
+- If the skeleton tracker provides an estimate of the head pose, it
+  might publish a frame named ``head_<bodyID>``. *It is the joint
+  responsibility of the face tracker and skeleton tracker to ensure
+  that* ``face_<faceID>`` *and* ``head_<bodyID>`` *are consistent with
+  each other*.
+
+Voice Frame
+-----------
+
+- Sound source localisation algorithms can broadcast estimated TF frames for
+  detected voices.
+  These frames should be named ``voice_<voiceID>``.
+- The orientation of the frame is meaningless, and should be ignored.
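+
+As an illustration of the `Face and Gaze Frames`_ conventions, the following
+minimal rospy sketch broadcasts a ``gaze_<faceID>`` frame for the case where
+gaze is not estimated beyond the general head orientation (the node name and
+face ID are illustrative):
+
+.. code:: python
+
+   #!/usr/bin/env python
+   # Sketch: broadcast a gaze_<faceID> frame collocated with face_<faceID>,
+   # following the 'optical frame' convention (z forward, x right, y down).
+   # With no finer gaze estimate, gaze z is simply colinear with face x.
+   import math
+   import rospy
+   import tf2_ros
+   from geometry_msgs.msg import TransformStamped
+   from tf.transformations import quaternion_from_euler
+
+   rospy.init_node("gaze_frame_broadcaster_sketch")
+
+   face_id = "23bd5"  # in practice, one frame per currently tracked face
+
+   t = TransformStamped()
+   t.header.stamp = rospy.Time.now()
+   t.header.frame_id = "face_%s" % face_id
+   t.child_frame_id = "gaze_%s" % face_id
+   # null translation: the gaze frame is collocated with the face frame;
+   # fixed-axis RPY (-90, 0, -90) degrees maps the x-forward/z-up face
+   # frame onto the optical convention
+   q = quaternion_from_euler(-math.pi / 2, 0, -math.pi / 2)
+   (t.transform.rotation.x, t.transform.rotation.y,
+    t.transform.rotation.z, t.transform.rotation.w) = q
+
+   broadcaster = tf2_ros.StaticTransformBroadcaster()
+   broadcaster.sendTransform(t)
+   rospy.spin()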
+
+Person Frame
+------------
+
+The ``person_<personID>`` frame has slightly more complex semantics and
+must be interpreted in conjunction with the person's ``location_confidence``
+value (see `Persons`_ topics).
+
+We can distinguish three cases:
+
+1. The human is currently being tracked (ie ``personID`` is set, and at
+   least one of ``faceID`` or ``bodyID`` is set).
+   In this case, ``location_confidence`` MUST be set to 1 and:
+
+   - when a face ID is also defined, the ``person_<personID>`` frame must
+     be collocated with the ``face_<faceID>`` frame.
+   - when a body ID is defined (ie the skeleton is being tracked), the
+     ``person_<personID>`` frame must be collocated with the skeleton
+     frame closest to the head.
+   - if both the face and body IDs are defined, the ``person_<personID>``
+     frame must be collocated with the ``face_<faceID>`` frame.
+
+2. The human is not seen, but has been previously seen. In this case,
+   ``location_confidence`` MUST be set to a value ``< 1`` and a
+   ``person_<personID>`` TF frame MUST be published **as long as**
+   ``location_confidence > 0``.
+   Simple implementations might choose to set ``location_confidence = 0.5`` as
+   soon as the person is not actively seen anymore, continuously broadcasting
+   the last known location.
+   More advanced implementations might slowly decrease ``location_confidence``
+   over time to represent the fact that the human might have walked away, for
+   instance.
+
+3. The human is known, but has never been seen before.
+   In this case, ``location_confidence`` MUST be set to ``0``, and no TF frame
+   should be broadcast.
+
+Group-level Interactions
+========================
+
+Representation of Groups
+------------------------
+
+When detected, group-level interactions are published on the
+``/humans/interactions/groups`` topic, using the ``hri_msgs/Group.msg`` message
+type.
+
+Each group is defined by a unique group ID and a list of person IDs
+(groups can only be defined between persons).
+
+Representation of gazing behaviours
+-----------------------------------
+
+Social gazing (eg, gazing between people) is represented as
+``hri_msgs/Gaze.msg`` messages, published on the
+``/humans/interactions/gazing`` topic.
+
+Each ``Gaze.msg`` message contains a *sender* and a *receiver* that MUST be
+known persons.
+Note that the relationship is not symmetrical: "A gazes at B" does not imply "B
+gazes at A".
+As such, *mutual gaze* will lead to two messages being published.
+
+If either the sender or the receiver ID is not set, the robot is
+assumed to respectively originate or be the target of the gaze.
+
+Nodes publishing gazing information are expected to continuously publish
+gaze messages, until the person is not gazing at the target anymore.
+
+
+References
+==========
+
+.. [1] ``people`` package, last commit in 2015
+   (https://github.com/wg-perception/people)
+
+.. [2] ``cob_people_perception`` package, mainly developed between 2012 and
+   2014 (https://github.com/ipa320/cob_people_perception)
+
+.. [3] REP 120, Coordinate Frames for Humanoid Robots
+   (https://ros.org/reps/rep-0120.html)
+
+.. [4] *The INTERSPEECH 2009 emotion challenge*, Schuller, Steidl and Batliner,
+   Tenth Annual Conference of the International Speech Communication Association,
+   2009
+
+.. [5] REP 103, Standard Units of Measure and Coordinate Conventions
+   (http://www.ros.org/reps/rep-0103.html)
+
+.. [6] RFC2119, Key words for use in RFCs to Indicate Requirement Levels
+   (https://datatracker.ietf.org/doc/html/rfc2119)
+
+Acknowledgements
+================
+
+Contributors
+------------
+
+Antonio Andriella, Lorenzo Ferrini, Youssef Mohamed, Andres Ramirez-Duque
+
+*(alphabetical order)*
+
+Funding
+-------
+
+This work has been primarily funded by PAL Robotics, with the Bristol Robotics
+Lab/University of the West of England funding the initial research.
+
+In addition, the work leading to this REP has received funding from the European Union
+through the H2020 SPRING project (grant agreement 871245), and the ACCIÓ Tecniospring
+TALBOT project.
+
+Copyright
+=========
+
+Copyright (c) 2021 by PAL Robotics. This material may be distributed only
+subject to the terms and conditions set forth in the Open Publication License,
+v1.0 or later (the latest version is presently available at
+http://www.opencontent.org/openpub/).
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   coding: utf-8
+   End:
diff --git a/rep-0155/frames.png b/rep-0155/frames.png
new file mode 100644
index 00000000..c34979f0
Binary files a/rep-0155/frames.png and b/rep-0155/frames.png differ