fix: reinitialize locks after fork to prevent deadlocks in child processes #4626


Open · wants to merge 1 commit into main
Conversation

dshivashankar1994

Description

This commit adds post-fork reinitialization of threading locks across multiple components in the OpenTelemetry Python SDK and API. It ensures that threading.Lock() instances are safely reinitialized in child processes after a fork(), preventing potential deadlocks and undefined behavior.

Details

  • Introduced usage of register_at_fork(after_in_child=...) from the os module to reinitialize thread locks.
  • Used weakref.WeakMethod() to safely refer to bound instance methods in register_at_fork.
  • Added _at_fork_reinit() methods to classes using threading locks and registered them to run in child processes post-fork.
  • Applied this to all usages of Lock and RLock.

Rationale

Forked child processes inherit thread state from the parent, including the internal state of locks. This can cause deadlocks or runtime errors if a lock was held at the time of the fork. By reinitializing locks using the register_at_fork mechanism, we ensure child processes start with clean lock states.

This is especially relevant for WSGI servers and environments that use pre-fork models (e.g., gunicorn, uWSGI), where instrumentation and telemetry components may misbehave without this precaution.
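The failure mode can be reproduced in a few lines on a Unix system (a sketch; a 1-second `acquire` timeout stands in for the indefinite hang you would see in practice):

```python
import os
import threading
import time

lock = threading.Lock()

# A background thread grabs the lock and holds it across the fork.
threading.Thread(target=lambda: (lock.acquire(), time.sleep(5)), daemon=True).start()
time.sleep(0.2)  # give the thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # The child inherits the lock in its locked state, but the thread
    # holding it does not exist here, so acquire() can never succeed.
    stuck = not lock.acquire(timeout=1)
    os._exit(0 if stuck else 1)
_, status = os.waitpid(pid, 0)
child_deadlocked = os.waitstatus_to_exitcode(status) == 0
```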

Type of change:

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so that we can reproduce them, and list any relevant details of your test configuration.

  • Test A

Does This PR Require a Contrib Repo Change?

  • Yes. - Link to PR:
  • No.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

@dshivashankar1994 dshivashankar1994 requested a review from a team as a code owner June 9, 2025 09:16

CLA Not Signed

@dshivashankar1994
Author

Adding @aabmass @srikanthccv @ocelotl @codeboten to comment

@aabmass
Member

aabmass commented Jul 24, 2025

Thanks for the PR and apologies no one has taken a look yet. First thing before we get too into it–is it possible to sign the CLA?

We handle some similar cases in the SDK already, which I thought covered most situations, e.g.:

def _at_fork_reinit(self):
    self._export_lock = threading.Lock()
    self._worker_awaken = threading.Event()
    self._queue.clear()
    self._worker_thread = threading.Thread(
        name=f"OtelBatch{self._exporting}RecordProcessor",
        target=self.worker,
        daemon=True,
    )
    self._worker_thread.start()
    self._pid = os.getpid()

My understanding is that pre-fork workers typically create forks off a single non-worker process, which probably doesn't hit most of these code paths in the parent. I wonder if you were able to pinpoint the exact cause of the deadlock so we can limit the proliferation of this type of code.

On a separate note, Python is really discouraging use of fork() and even has some warnings about potential deadlock in CPython. I don't think this will ever be 100% fork safe. I know you're probably hitting this through Gunicorn or similar, but I wonder if Gunicorn (or whatever you're using) has plans to address this or you can use a workaround.
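For Gunicorn specifically, one possible workaround is its built-in `post_fork` server hook, which runs inside each worker right after it is forked (a sketch; `init_telemetry` is a hypothetical application helper that rebuilds the telemetry pipeline):

```python
# gunicorn_conf.py -- passed via `gunicorn -c gunicorn_conf.py app:app`
def post_fork(server, worker):
    # Rebuild the telemetry pipeline (providers, exporters, worker
    # threads) from scratch in the freshly forked worker.
    from myapp.telemetry import init_telemetry  # hypothetical helper
    init_telemetry()
```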

@dshivashankar1994
Author

Regarding the CLA:

If you are in doubt whether your contribution is owned by you or your employer, you should check with your employer or an attorney.

I'll check this part and get back.

@dshivashankar1994
Author

Regarding the issue, #4345 (comment) is one such thing I ran into.

I had also run into a similar issue because of the now-removed RuntimeContext lock (#3763). To be on the safer side, I believe it is better to make all the locks fork-safe.

This was not observed with gunicorn but when running things with ProcessPoolExecutor, which internally uses fork. I agree that fork is not recommended, but it might take a while for all the downstream teams to switch, hence making the locks fork-safe would be a good idea. Let me know what you think.
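Whether a given pool actually forks can be checked up front (a sketch; the default varies by platform and Python version — "fork" on Linux before 3.14, "spawn" on macOS and Windows):

```python
import multiprocessing

# ProcessPoolExecutor uses this start method unless an explicit
# mp_context is passed; "fork" means children inherit lock state.
start_method = multiprocessing.get_start_method()
print(start_method)
```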

@aabmass
Member

aabmass commented Jul 28, 2025

This was not observed with gunicorn but when running things with ProcessPoolExecutor, which internally uses fork.

Ah gotcha. Would it be feasible for you to switch to the forkserver start method? It's been around I believe since 3.4 and is now the default in 3.14+.
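Switching an existing pool over is a one-argument change (a sketch; `square` is a stand-in for real work):

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def square(x):
    return x * x


if __name__ == "__main__":
    # forkserver launches workers from a clean, single-threaded server
    # process, so they never inherit held locks from the application.
    ctx = multiprocessing.get_context("forkserver")
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
        print(list(pool.map(square, range(4))))  # [0, 1, 4, 9]
```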

A couple of questions on implementation:

@dshivashankar1994
Author

Ah gotcha. Would it be feasible for you to switch to the forkserver start method? It's been around I believe since 3.4 and is now the default in 3.14+.

Agree that this will be useful. But we have a lot of use cases where the parent process state is used in the child. While we might switch to such an approach in the very long run, it does not seem possible for the time being.

  • Do you know if Lock._at_fork_reinit() is recommended or safe to depend on, since it's a hidden API?
  • Is there a reason that the standard library doesn't automatically set up register_at_fork hooks for all locks? They do have a hook to make the new main thread fork-safe.
  • Are there any libraries to help with fork safety, potentially that could be added as an optional dep?

I am not sure why this is a hidden API. I see it already being used in places like logging (python/cpython#19416) and multiprocessing (python/cpython#84402), and it is recommended.

I believe it is not public because forking in a multi-threaded environment is discouraged, and starting with Python 3.12, a warning is raised if forking is detected in a multi-threaded process (detailed discussion).

The author of _at_fork_reinit said the following when asked a similar question in python/cpython#84272 (comment):

My question is, will there be a way to reinit standard library locks in general using _at_fork_reinit()? That is, should we expect a future fix in python to do this or is the recommendation to continue to ensure the application reinits locks during process start if we know the process could be a child?

Each module has to set up an os.register_at_fork() callback to reinitialize its locks. It's done by the threading and logging modules, for example. The multiprocessing module has its own util.register_after_fork() machinery (see bpo-40221).

And per my understanding and research, os.register_at_fork is the recommended way of handling fork safety.

@aabmass
Member

aabmass commented Aug 1, 2025

Got it thanks for looking into this.

I feel like we might need to rethink the approach to make things more maintainable and robust. For example, I'm not sure if gRPC clients behave correctly after fork; metrics could get double-counted in the forked process, resource attributes could be stale, etc.

Chatting with @quentinmit offline, he raised a great suggestion about re-using the ProxyMeterProvider machinery to completely swap the SDK in the fork. With a quick test, this approach looks promising:

import os

from opentelemetry.metrics._internal import _ProxyMeterProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

pmp = _ProxyMeterProvider()
meter = pmp.get_meter("foo")
counter = meter.create_counter("foo_counter")


def init_sdk() -> None:
    global reader, mp
    reader = InMemoryMetricReader()
    mp = MeterProvider(metric_readers=[reader])
    pmp.on_set_meter_provider(mp)


# init_sdk must be called before registering the at_fork handler
# (The SDK indirectly registers its own after_in_child hook that needs to run first.)
init_sdk()
os.register_at_fork(after_in_child=init_sdk)

counter.add(123)


def print_info():
    print(f"{os.getpid()=} got", reader.get_metrics_data().to_json())


print_info()

pid = os.fork()
counter.add(100)
print_info()

if pid:
    os.waitpid(pid, 0)

Some tweaks would still be needed, and we'd have to expose the proxy impl to users. @dshivashankar1994:

  • Could you add your own post-fork hook to do this re-initialization in your use case?
  • Do you have interest/time to work on a more robust fix like this?
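As a stop-gap for the first bullet, an application can register its own post-fork hook today (a minimal sketch for Unix systems; the real hook body would rebuild the SDK and hand it to the proxy provider, as in the snippet above):

```python
import os

reinit_pids = []

def _reinit_telemetry():
    # In a real application: construct a fresh MeterProvider/TracerProvider
    # here and point the proxy providers at it.
    reinit_pids.append(os.getpid())

# The hook fires only in forked children, never in the parent.
os.register_at_fork(after_in_child=_reinit_telemetry)

pid = os.fork()
if pid == 0:
    os._exit(0 if reinit_pids == [os.getpid()] else 1)
_, status = os.waitpid(pid, 0)
hook_ran_in_child = os.waitstatus_to_exitcode(status) == 0
hook_skipped_in_parent = reinit_pids == []
```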

I think the next step would be to open a feature request issue.
