fix: reinitialize locks after fork to prevent deadlocks in child processes #4626
Conversation
fix: reinitialize locks after fork to prevent deadlocks in child processes

Summary

This commit adds post-fork reinitialization of threading locks across multiple components in the OpenTelemetry Python SDK and API. It ensures that threading.Lock() instances are safely reinitialized in child processes after a fork(), preventing potential deadlocks and undefined behavior.

Details

- Introduced usage of register_at_fork(after_in_child=...) from the os module to reinitialize thread locks.
- Used weakref.WeakMethod() to safely refer to bound instance methods in register_at_fork.
- Added _at_fork_reinit() methods to classes using threading locks and registered them to run in child processes post-fork.
- Applied this to all usages of Lock and RLock.

Rationale

Forked child processes inherit thread state from the parent, including the internal state of locks. This can cause deadlocks or runtime errors if a lock was held at the time of the fork. By reinitializing locks using the register_at_fork mechanism, we ensure child processes start with clean lock states.

This is especially relevant for WSGI servers and environments that use pre-fork models (e.g., gunicorn, uWSGI), where instrumentation and telemetry components may misbehave without this precaution.
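To make the described pattern concrete, here is a minimal sketch (not the PR's actual diff; _TelemetryComponent is a hypothetical class):

```python
import os
import threading
import weakref


class _TelemetryComponent:
    """Hypothetical component guarding shared state with a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        # WeakMethod keeps the fork hook from pinning this instance
        # alive; if the instance has been collected, the hook is a no-op.
        weak_reinit = weakref.WeakMethod(self._at_fork_reinit)
        os.register_at_fork(
            after_in_child=lambda: weak_reinit()() if weak_reinit() else None
        )

    def _at_fork_reinit(self):
        # Replace the inherited (possibly held) lock so the child
        # process starts with clean lock state.
        self._lock = threading.Lock()
```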
Adding @aabmass @srikanthccv @ocelotl @codeboten to comment
Thanks for the PR, and apologies no one has taken a look yet. First thing before we get too into it: is it possible to sign the CLA? We handle some similar cases in the SDK already, which I thought covered most situations, e.g. opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/_shared_internal/__init__.py, lines 118 to 128 at ff9dc82.
My understanding is that pre-fork workers typically create forks off a single non-worker process, which probably doesn't hit most of these code paths in the parent. I wonder if you were able to pinpoint the exact cause of the deadlock so we can limit the proliferation of this type of code. On a separate note, Python is really discouraging use of fork().
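For context, pre-fork servers usually expose a per-worker hook where the SDK can be initialized after the fork, sidestepping inherited lock state entirely. A minimal gunicorn sketch (post_fork is gunicorn's real server hook; configure_telemetry is a hypothetical helper):

```python
# gunicorn.conf.py
def post_fork(server, worker):
    # Initialize the OpenTelemetry SDK inside the worker process itself,
    # so no provider or lock state is inherited from the master process.
    from myapp.telemetry import configure_telemetry  # hypothetical helper

    configure_telemetry()
```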
Reg CLA, I'll check this part and get back.
Regarding the issue, #4345 (comment) is one such thing I ran into. I had also run into a similar issue because of the now-removed RuntimeContext lock (#3763). To be on the safer side, I believe it is better to make all the locks fork-safe. This was not observed with gunicorn, but when running things with ProcessPoolExecutor, which internally uses fork. I agree that fork is not recommended, but it might take a while for all the downstream teams to switch, hence making the locks fork-safe would be a good idea. Let me know what you think?
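For context on the start-method detail above, a minimal sketch (assuming Linux, where ProcessPoolExecutor has historically defaulted to the fork start method):

```python
import concurrent.futures
import multiprocessing


def main():
    # With the "fork" start method, workers inherit the parent's memory,
    # including any lock that happens to be held at fork time.
    fork_ctx = multiprocessing.get_context("fork")
    with concurrent.futures.ProcessPoolExecutor(mp_context=fork_ctx) as pool:
        print(list(pool.map(abs, [-1, -2])))

    # "spawn" starts each worker from a fresh interpreter instead, which
    # avoids inherited lock state at the cost of not sharing parent state.
    spawn_ctx = multiprocessing.get_context("spawn")
    with concurrent.futures.ProcessPoolExecutor(mp_context=spawn_ctx) as pool:
        print(list(pool.map(abs, [-3, -4])))


if __name__ == "__main__":
    main()
```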
Ah gotcha. Would it be feasible for you to switch to the spawn start method? A couple of questions on implementation:
Agree that this will be useful. But we have a lot of use cases where the parent process state is used in the child. While we might switch to such things in the very long run, it doesn't seem possible for the time being.
I am not very sure why this is a hidden API. I see it already being used in places like logging (python/cpython#19416) and multiprocessing (python/cpython#84402), and it is recommended there. I believe it is not public because forking in a multi-threaded environment is discouraged, and starting with Python 3.12, a warning gets raised if a fork is detected in a multi-threaded environment (detailed discussion). The author of the
And per my understanding and research,
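For reference, a minimal sketch of the private API under discussion (assuming CPython 3.9+, where bpo-40089 added _at_fork_reinit() to lock objects):

```python
import threading

lock = threading.Lock()
lock.acquire()

# After a fork, a lock inherited in the acquired state can never be
# released by the (now nonexistent) owning thread. CPython's logging
# module resets its module lock with the private _at_fork_reinit():
lock._at_fork_reinit()

print(lock.acquire(blocking=False))  # True: the lock is fresh again
```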
Got it, thanks for looking into this. I feel like we might need to rethink the approach to make things more maintainable and robust. For example, I'm not sure if gRPC clients behave correctly after fork, whether metrics get double-counted in the forked process, whether the resource attributes are stale, etc. Chatting with @quentinmit offline surfaced a great suggestion about re-using the proxy providers:

```python
import os

from opentelemetry.metrics._internal import _ProxyMeterProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader

pmp = _ProxyMeterProvider()
meter = pmp.get_meter("foo")
counter = meter.create_counter("foo_counter")


def init_sdk() -> None:
    global reader, mp
    reader = InMemoryMetricReader()
    mp = MeterProvider(metric_readers=[reader])
    pmp.on_set_meter_provider(mp)


# init_sdk must be called before registering the at_fork handler
# (The SDK indirectly registers its own after_in_child hook that needs to run first.)
init_sdk()
os.register_at_fork(after_in_child=init_sdk)

counter.add(123)


def print_info():
    print(f"{os.getpid()=} got", reader.get_metrics_data().to_json())


print_info()

pid = os.fork()
counter.add(100)
print_info()
if pid:
    os.waitpid(pid, 0)
```

Some tweaks would still be needed and we'd have to expose the proxy impl to users. @dshivashankar1994
I think the next step would be to open up a feature request issue.
Description
This commit adds post-fork reinitialization of threading locks across multiple components in the OpenTelemetry Python SDK and API. It ensures that threading.Lock() instances are safely reinitialized in child processes after a fork(), preventing potential deadlocks and undefined behavior.
Details
- Introduced usage of register_at_fork(after_in_child=...) from the os module to reinitialize thread locks.
- Used weakref.WeakMethod() to safely refer to bound instance methods in register_at_fork.
- Added _at_fork_reinit() methods to classes using threading locks and registered them to run in child processes post-fork.
- Applied this to all usages of Lock and RLock.
Rationale
Forked child processes inherit thread state from the parent, including the internal state of locks. This can cause deadlocks or runtime errors if a lock was held at the time of the fork. By reinitializing locks using the register_at_fork mechanism, we ensure child processes start with clean lock states.
This is especially relevant for WSGI servers and environments that use pre-fork models (e.g., gunicorn, uWSGI), where instrumentation and telemetry components may misbehave without this precaution.
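To illustrate the failure mode, a minimal self-contained repro (not code from this PR; POSIX-only, since it calls os.fork() directly):

```python
import os
import threading
import time

lock = threading.Lock()


def hold_lock():
    with lock:
        time.sleep(5)  # hold the lock across the fork


threading.Thread(target=hold_lock).start()
time.sleep(0.1)  # give the thread time to acquire the lock

pid = os.fork()
if pid == 0:
    # The child inherits the lock in its acquired state but not the
    # thread that would release it, so this acquire can never succeed.
    acquired = lock.acquire(timeout=2)
    print(f"child acquired lock: {acquired}")  # False
    os._exit(0)
os.waitpid(pid, 0)
```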
How Has This Been Tested?
Does This PR Require a Contrib Repo Change?
Checklist: