Skip to content

Race condition in ActionClient.*_async functions which result in a future that never complete #1123

@apockill

Description

@apockill

Bug report

Hi all, thank you for maintaining this wonderful library!

I've been encountering an intermittent bug that has left my head scratching. Unfortunately recently it started occurring more often, so I sat down and spent some time tracking down what was going on. Now, I've got a good idea of the problem- but I need some more expert eyes to decide on a good solution. I'll lay out all my findings and some perceived paths forward in this bug report.

Required Info:

  • Operating System:
    • Ubuntu 22.04, within a docker image that is also running Ubuntu 22.04
  • Installation type:
    • Binaries as installed from ros2 apt.
  • Version or commit hash:
    • rclpy==3.3.7, python3.10
  • DDS implementation:
    • CycloneDDS
  • Client library (if applicable):
    • rclpy==3.3.7, python3.10

Steps to reproduce issue

There are multiple places where this race condition manifests itself. I will demonstrate how to reproduce it when calling send_goal_async. Take the ActionClient.send_goal_async and add a sleep(#) directly below the line sequence_number = self._client_handle.send_goal_request(request). For example:

    def send_goal_async(self, goal, feedback_callback=None, goal_uuid=None):
        """
        ...
        """

        if not isinstance(goal, self._action_type.Goal):
            raise TypeError()

        request = self._action_type.Impl.SendGoalService.Request()
        request.goal_id = (
            self._generate_random_uuid() if goal_uuid is None else goal_uuid
        )
        request.goal = goal
        sequence_number = self._client_handle.send_goal_request(request)
        sleep(5)   # <--- this line causes this race condition nearly 100% of the time

        if sequence_number in self._pending_goal_requests:
            raise RuntimeError(
                "Sequence ({}) conflicts with pending goal request".format(
                    sequence_number
                )
            )

        if feedback_callback is not None:
            # TODO(jacobperron): Move conversion function to a general-use package
            goal_uuid = bytes(request.goal_id.uuid)
            self._feedback_callbacks[goal_uuid] = feedback_callback

        future = Future()
        self._pending_goal_requests[sequence_number] = future
        self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id
        future.add_done_callback(self._remove_pending_goal_request)
        # Add future so executor is aware
        self.add_future(future)

        return future

Expected behavior

I would expect 'send_goal_async' to run a bit slower than usual, then return a future. That future might complete quite quickly (or already be complete).

Actual behavior

The future never completes, and I see the following line in the logs

Ignoring unexpected goal response. There may be more than one action server for the action '{ACTION_NAME}'

Additional information

By my sleuthing, it appears that this error is caused by the following important lines being called:

client:Thread1

  1. send_goal_async is called, and within it:
  2. sequence_number = self._client_handle.send_goal_request(request)
  3. self._goal_sequence_number_to_goal_id[sequence_number] = request.goal_id

client:Thread2

  1. A response to a goal request is received in the client C bindings, causing it to return True to is_ready(...)
  2. execute(...) is called
  3. if sequence_number in self._goal_sequence_number_to_goal_id: evaluates to False, because client:Thread1 has not reached its step 3
  4. It ignores the taken_data, and thus the future will never have set_result called.

server:Thread1

  1. A goal request is received, and responded to

Cause Summary

There is a race condition when sending goal requests, because the goal request can complete before the sequence_number is tracked in the self._goal_sequence_number_to_goal_id dictionary. Thus execute is called before it is tracked, and the taken_data is ignored, assumed to be a response from a duplicate action server.

I believe this same race condition affects all usages of the client API, such as _cancel_goal_async, or _get_result_async.

Recommendations

  1. Put a lock around every usage of self._client_handle so that python-side objects can be edited and synced up. That way if execute is called it will block until the self._client_handle isn't being used, and the _goal_sequence_number_to_goal_id will be filled in.
  2. Alternatively, create locks for the various in-memory dictionaries that are used in execute:
  • _goal_handles
  • _pending_goal_requests
  • _goal_sequence_number_to_goal_id
  • _pending_cancel_requests
  • _pending_result_requests
  • _result_sequence_number_to_goal_id

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions