
[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 #154055


Closed
kwen2501 wants to merge 1 commit

Conversation

@kwen2501 (Contributor) commented on May 21, 2025:

Stack from ghstack (oldest at bottom):

Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously, torch turned it on by default for eager init (i.e. when device_id is passed) to avoid init overhead.
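
For context, a minimal sketch of the eager-init path this change affects; it assumes a recent torch release where init_process_group accepts device_id and that the rendezvous/rank environment variables are provided by the launcher (e.g. torchrun):

```python
import torch
import torch.distributed as dist

def eager_init(local_rank: int) -> None:
    # Passing device_id triggers eager NCCL communicator initialization;
    # before this PR, that path also switched NCCL into non-blocking API
    # mode by default.
    dist.init_process_group(
        backend="nccl",
        device_id=torch.device(f"cuda:{local_rank}"),
    )
```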

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

[ghstack-poisoned]

pytorch-bot bot commented May 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154055

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e6b5bbe with merge base fa85434:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request May 21, 2025
…NCCL 2.26

ghstack-source-id: e3cd0a6
Pull-Request-resolved: #154055
@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels on May 21, 2025
@kwen2501 requested review from d4l3k, fduwjj, eqy and ngimel on May 21, 2025 20:44
@kwen2501 added the topic: bug fixes label on May 21, 2025
// hang in NCCL 2.26 in non-blocking mode. We can revisit if NCCL fixes the
// bug. See https://github.com/pytorch/pytorch/issues/153960
// else if (getBoundDeviceId()) {
// useNonblocking_ = true;
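
The commented-out branch above is the default that is being turned off. Below is a minimal sketch of how non-blocking mode can still be requested explicitly, assuming the existing TORCH_NCCL_USE_COMM_NONBLOCKING environment variable remains the opt-in knob (verify against your torch build):

```python
import os

import torch
import torch.distributed as dist

# Assumed opt-in: set before the NCCL process group (and its communicators)
# is created so c10d picks it up.
os.environ["TORCH_NCCL_USE_COMM_NONBLOCKING"] = "1"

# Rendezvous/rank env vars are assumed to be provided by the launcher.
dist.init_process_group(backend="nccl", device_id=torch.device("cuda:0"))
```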
@nWEIdia (Collaborator) commented on May 21, 2025:

Is there any existing unit test that would need to be adjusted when this option is toggled on or off?

@kwen2501 (Contributor, Author) commented on May 21, 2025:

The flag is more of an internal choice than a contract.
There are several tests that pass device_id, so hopefully they won't break.

@nWEIdia (Collaborator) commented on May 23, 2025:

I was hoping that this would fix test_non_blocking_with_eager_init, but with the v2.7.1 RC (docker pull ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-runtime) I am still reproducing the timeout/hang:

root@d70999cd4c34:/my_workspace/wei-pytorch/test/distributed# python test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...

@nWEIdia (Collaborator) commented on May 23, 2025:

I am switching to a platform that has a better OS (I was previously using an Ubuntu 20.04 based system, and there could be known issues there).
But now, with the v2.7.1 RC, I am encountering:
ModuleNotFoundError: No module named 'torch.distributed._spmd'
Update: I used the wrong (runtime) container; the devel container should be used instead. Never mind this command.
cc @atalman

@kwen2501 (Contributor, Author) commented:

The test didn't hang for me, on an H100 machine.

@nWEIdia (Collaborator) commented:

Yes, I confirm this test does not hang for me either on H100. I will follow up internally on the potential issues with the Ubuntu 20.04 stack.
Below is on H100:

```
python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ... ok

Ran 1 test in 10.152s

OK
```

Though this potentially means that even if your change lands on main, the upstream CI may still hang, due to the potential OS-related issue. I will double-check on this front (Ubuntu 20.04 or Amazon Linux 2023 + SM75, distributed).
Below is what I get from an Ubuntu 20.04 based host + ghcr.io/pytorch/pytorch-test:2.7.1-cuda12.6-cudnn9-devel on 2x T4:

```
root@d6abe3d5c3dd:/workspace/pytorch/test/distributed# time timeout 30 python3 test_c10d_nccl.py -v -k test_non_blocking_with_eager_init
test_non_blocking_with_eager_init (__main__.ProcessGroupNCCLGroupTest.test_non_blocking_with_eager_init) ...

real    0m30.040s   (i.e. hang)
user    0m4.055s
sys     0m2.664s
```
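
When comparing environments like the ones above, a quick sketch to confirm which torch/CUDA/NCCL versions are in play (torch.cuda.nccl.version() is available in recent torch releases; verify for yours):

```python
import torch

print("torch:", torch.__version__)
print("CUDA :", torch.version.cuda)
print("NCCL :", torch.cuda.nccl.version())  # (major, minor, patch) tuple
```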

@atalman (Contributor) left a comment:

lgtm

@kwen2501 (Contributor, Author) commented:

@pytorchbot merge -f "Unblocking an urgent issue"

@pytorchmergebot commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@atalman (Contributor) commented on May 22, 2025:

@pytorchbot cherry-pick --onto release/2.7 -c critical

pytorchbot pushed a commit that referenced this pull request May 22, 2025
…NCCL 2.26 (#154055)

Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously, torch turned it on by default for eager init (i.e. when `device_id` is passed) to avoid init overhead.

Pull Request resolved: #154055
Approved by: https://github.com/atalman

(cherry picked from commit 87fc5af)
@pytorchbot commented:

Cherry picking #154055

The cherry pick PR is at #154085 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


atalman pushed a commit that referenced this pull request May 22, 2025
…NCCL 2.26 (#154085)

[c10d] Turn off default non-blocking API mode to work around hang in NCCL 2.26 (#154055)

Work around issues like #153960, #152623

NCCL 2.26 seems to introduce random hangs in non-blocking API mode. This PR opts out of non-blocking mode to work around it. Previously, torch turned it on by default for eager init (i.e. when `device_id` is passed) to avoid init overhead.

Pull Request resolved: #154055
Approved by: https://github.com/atalman

(cherry picked from commit 87fc5af)

Co-authored-by: Ke Wen <kw2501@meta.com>
@github-actions github-actions bot deleted the gh/kwen2501/155/head branch June 23, 2025 02:20
Labels: Merged · oncall: distributed · release notes: distributed (c10d) · topic: bug fixes
5 participants