fix: reinitialize locks after fork to prevent deadlocks in child processes #4626
Conversation
fix: reinitialize locks after fork to prevent deadlocks in child processes

Summary

This commit adds post-fork reinitialization of threading locks across multiple components in the OpenTelemetry Python SDK and API. It ensures that threading.Lock() instances are safely reinitialized in child processes after a fork(), preventing potential deadlocks and undefined behavior.

Details

- Introduced usage of register_at_fork(after_in_child=...) from the os module to reinitialize thread locks.
- Used weakref.WeakMethod() to safely refer to bound instance methods in register_at_fork.
- Added _at_fork_reinit() methods to classes using threading locks and registered them to run in child processes post-fork.
- Applied this to all usages of Lock and RLock.

Rationale

Forked child processes inherit thread state from the parent, including the internal state of locks. This can cause deadlocks or runtime errors if a lock was held at the time of the fork. By reinitializing locks using the register_at_fork mechanism, we ensure child processes start with clean lock states.

This is especially relevant for WSGI servers and environments that use pre-fork models (e.g., gunicorn, uWSGI), where instrumentation and telemetry components may misbehave without this precaution.
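To make the described pattern concrete, here is a minimal sketch (not the PR's actual diff; _TelemetryComponent is a hypothetical class):

```python
import os
import threading
import weakref


class _TelemetryComponent:
    """Hypothetical component guarding shared state with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        # WeakMethod keeps the fork hook from pinning this instance
        # alive; if the instance has been collected, the hook is a no-op.
        weak_reinit = weakref.WeakMethod(self._at_fork_reinit)
        os.register_at_fork(
            after_in_child=lambda: weak_reinit()() if weak_reinit() else None
        )

    def _at_fork_reinit(self):
        # Replace the inherited (possibly held) lock so the child
        # process starts with clean lock state.
        self._lock = threading.Lock()
```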
Adding @aabmass @srikanthccv @ocelotl @codeboten to comment
Thanks for the PR, and apologies no one has taken a look yet. First thing before we get too into it: is it possible to sign the CLA? We handle some similar cases in the SDK already, which I thought covered most situations, e.g. opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/_shared_internal/__init__.py, lines 118 to 128 at ff9dc82.
My understanding is that pre-fork workers typically create forks off a single non-worker process, which probably doesn't hit most of these code paths in the parent. I wonder if you were able to pinpoint the exact cause of the deadlock so we can limit the proliferation of this type of code. On a separate note, Python is really discouraging use of fork().
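For context, pre-fork servers usually expose a per-worker hook where the SDK can be initialized after the fork, sidestepping inherited lock state entirely. A minimal gunicorn sketch (post_fork is gunicorn's real server hook; configure_telemetry is a hypothetical helper):

```python
# gunicorn.conf.py
def post_fork(server, worker):
    # Initialize the OpenTelemetry SDK inside the worker process itself,
    # so no provider or lock state is inherited from the master process.
    from myapp.telemetry import configure_telemetry  # hypothetical helper

    configure_telemetry()
```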
Reg CLA, I'll check this part and get back.
Regarding the issue, #4345 (comment) is one such thing I ran into. I had also run into a similar issue because of the now-removed RuntimeContext lock (#3763). To be on the safer side, I believe it is better to make all the locks fork-safe. This was not observed with gunicorn, but when running things with ProcessPoolExecutor, which internally uses fork. I agree that fork is not recommended, but it might take a while for all the downstream teams to switch, hence making the locks fork-safe would be a good idea. Let me know what you think?
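For context on the start-method detail above, a minimal sketch (assuming Linux, where ProcessPoolExecutor has historically defaulted to the fork start method):

```python
import concurrent.futures
import multiprocessing


def main():
    # With the "fork" start method, workers inherit the parent's memory,
    # including any lock that happens to be held at fork time.
    fork_ctx = multiprocessing.get_context("fork")
    with concurrent.futures.ProcessPoolExecutor(mp_context=fork_ctx) as pool:
        print(list(pool.map(abs, [-1, -2])))

    # "spawn" starts each worker from a fresh interpreter instead, which
    # avoids inherited lock state at the cost of not sharing parent state.
    spawn_ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(mp_context=spawn_ctx) as pool:
        print(list(pool.map(abs, [-3, -4])))


if __name__ == "__main__":
    main()
```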
Ah gotcha. Would it be feasible for you to switch to the spawn start method? A couple of questions on implementation:
Agree that this will be useful. But we have a lot of use cases where the parent process state is used in the child. While we might switch to such things in the very long run, it doesn't seem possible for the time being.
I am not very sure why this is a hidden API. I see it already being used in places like logging (python/cpython#19416) and multiprocessing (python/cpython#84402), and it is recommended there. I believe it is not public because forking in a multi-threaded environment is discouraged, and starting with Python 3.12, a warning gets raised if a fork is detected in a multi-threaded environment (detailed discussion). The author of the
And per my understanding and research,
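For reference, a minimal sketch of the private API under discussion (assuming CPython 3.9+, where bpo-40089 added _at_fork_reinit() to lock objects):

```python
import threading

lock = threading.Lock()
lock.acquire()

# After a fork, a lock inherited in the acquired state can never be
# released by the (now nonexistent) owning thread. CPython's logging
# module resets its module lock with the private _at_fork_reinit():
lock._at_fork_reinit()

print(lock.acquire(blocking=False))  # True: the lock is fresh again
```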
Got it, thanks for looking into this. I feel like we might need to rethink the approach to make things more maintainable and robust. For example, I'm not sure if gRPC clients behave correctly after fork, whether metrics get double-counted in the forked process, whether the resource attributes are stale, etc. Chatting with @quentinmit offline surfaced a great suggestion about re-using the proxy providers:

```python
import os

from opentelemetry.metrics._internal import _ProxyMeterProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

pmp = _ProxyMeterProvider()
meter = pmp.get_meter("foo")
counter = meter.create_counter("foo_counter")


def init_sdk() -> None:
    global reader, mp
    reader = InMemoryMetricReader()
    mp = MeterProvider(metric_readers=[reader])
    pmp.on_set_meter_provider(mp)


# init_sdk must be called before registering the at_fork handler
# (The SDK indirectly registers its own after_in_child hook that needs to run first.)
init_sdk()
os.register_at_fork(after_in_child=init_sdk)

counter.add(123)


def print_info():
    print(f"{os.getpid()=} got", reader.get_metrics_data().to_json())


print_info()

pid = os.fork()
counter.add(100)
print_info()
if pid:
    os.waitpid(pid, 0)
```

Some tweaks would still be needed and we'd have to expose the proxy impl to users. @dshivashankar1994
I think the next step would be to open up a feature request issue.
Description
This commit adds post-fork reinitialization of threading locks across multiple components in the OpenTelemetry Python SDK and API. It ensures that threading.Lock() instances are safely reinitialized in child processes after a fork(), preventing potential deadlocks and undefined behavior.
Details
- Introduced usage of register_at_fork(after_in_child=...) from the os module to reinitialize thread locks.
- Used weakref.WeakMethod() to safely refer to bound instance methods in register_at_fork.
- Added _at_fork_reinit() methods to classes using threading locks and registered them to run in child processes post-fork.
- Applied this to all usages of Lock and RLock.
Rationale
Forked child processes inherit thread state from the parent, including the internal state of locks. This can cause deadlocks or runtime errors if a lock was held at the time of the fork. By reinitializing locks using the register_at_fork mechanism, we ensure child processes start with clean lock states.
This is especially relevant for WSGI servers and environments that use pre-fork models (e.g., gunicorn, uWSGI), where instrumentation and telemetry components may misbehave without this precaution.
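To illustrate the failure mode, a minimal self-contained repro (not code from this PR; POSIX-only, since it calls os.fork() directly):

```python
import os
import threading
import time

lock = threading.Lock()


def hold_lock():
    with lock:
        time.sleep(5)  # hold the lock across the fork


threading.Thread(target=hold_lock).start()
time.sleep(0.1)  # give the thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # The child inherits the lock in its acquired state but not the
    # thread that would release it, so this acquire can never succeed.
    acquired = lock.acquire(timeout=2)
    print(f"child acquired lock: {acquired}")  # False
    os._exit(0)
os.waitpid(pid, 0)
```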
How Has This Been Tested?
Does This PR Require a Contrib Repo Change?
Checklist: