Skip to content

Fix multiprocessing deadlock#255

Merged
gmloose merged 1 commit intomasterfrom
RAP-401_fix-multiprocessing-deadlock
May 6, 2025
Merged

Fix multiprocessing deadlock#255
gmloose merged 1 commit intomasterfrom
RAP-401_fix-multiprocessing-deadlock

Conversation

@gmloose
Copy link
Collaborator

@gmloose gmloose commented May 2, 2025

Bug description

A deadlock may occur in the function run_tasks() in multi_proc.py, if out_q is empty. This may happen, if one or more of the started processes die unexpectedly.

Solution

There are two ways to resolve this problem:

  1. Do a non-blocking read on out_q and check if it's empty. If so, some results are apparently missing.
  2. Check for a non-zero exit code of each joined process.

In both cases an exception should be raised. The second solution has the advantage that you can more easily access the exit code of the first process that died, and use it in the exception message. This MR therefore implements the second solution.

How to test

To test that this solution works, make sure that one of the underlying processes die with a SIGSEGV. This can, for example, be accomplished by reverting this change in src/c++/MGFunction1.cc. Doing so, will raise a SIGSEGV at runtime, when the software is compiled and run with Numpy 2.x.

Note

Building the software with Numpy 2.x is not trivial, because on most host systems, all packages are still compiled against Numpy 1.x. Follow the approach used in the CI pipeline https://github.com/lofar-astron/PyBDSF/blob/master/.github/workflows/ci.yml to build the software in a clean docker container.

It may be easier to just raise a signal in the aforementioned C++ file, instead of triggering the SIGSEGV through the incorrect call of a Numpy function (because that requires Numpy 2.x for building the software). The software can then be built on the host system. Just raising the signal should yield the same effect.

We need to check the exit code of each joined process in the `run_tasks()` function in `multi_proc.py`. A non-zero exit status (especially a negative exit status) indicates that an error occurred, and it is possible that the output queue (`out_q`) will be empty. Not handling this situation properly, will cause `out_q.get()` (a few lines further) to hang.
@gmloose gmloose requested a review from aroffringa May 2, 2025 15:40
@gmloose gmloose self-assigned this May 2, 2025
@gmloose gmloose merged commit 6fc72bc into master May 6, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants