[ML] Ensure that anomaly detection job state update retries if master node is temporarily unavailable #129391

valeriy42 · 2025-06-13T07:44:06Z

During a cluster upgrade, anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If the master node becomes temporarily unavailable during this transition, e.g., due to reassignment, the new job state is not committed to the cluster state. As a result, once a new master becomes available, the cluster state is inconsistent: some anomaly detection jobs are actually opened, but their recorded state is stuck at "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.
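For readers who want the general shape of the fix, here is a minimal plain-Java sketch of retrying a state update with a growing delay until a timeout. It is not the code from this PR and does not use any Elasticsearch classes: `JobStateUpdateRetrySketch`, `MasterUnavailableException`, `retryUntilTimeout`, and the delay and timeout parameters are all hypothetical names chosen for illustration.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: retry a job-state update until it succeeds or a timeout elapses. */
final class JobStateUpdateRetrySketch {

    /** Stand-in for "master node temporarily unavailable"; not an Elasticsearch class. */
    static final class MasterUnavailableException extends RuntimeException {
        MasterUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Calls {@code update} until it returns normally. After each MasterUnavailableException
     * the caller sleeps, then doubles the delay (capped at maxDelayMillis). Once the timeout
     * has elapsed, the last failure is rethrown. Any other exception propagates immediately.
     */
    static <T> T retryUntilTimeout(Supplier<T> update, long initialDelayMillis, long maxDelayMillis, long timeoutMillis) {
        final long deadlineNanos = System.nanoTime() + timeoutMillis * 1_000_000L;
        long delayMillis = initialDelayMillis;
        while (true) {
            try {
                return update.get();
            } catch (MasterUnavailableException e) {
                long remainingMillis = (deadlineNanos - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw e; // out of time: surface the failure instead of retrying forever
                }
                try {
                    Thread.sleep(Math.min(delayMillis, remainingMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("interrupted while waiting to retry", ie);
                }
                delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = { 0 };
        // Simulate a master that is unavailable for the first two attempts.
        String state = retryUntilTimeout(() -> {
            if (++attempts[0] < 3) {
                throw new MasterUnavailableException("no master elected yet");
            }
            return "opened";
        }, 10, 100, 5_000);
        System.out.println(state + " after " + attempts[0] + " attempts");
    }
}
```

The point of the sketch is only that a transient master outage delays the "opening" to "opened" update rather than dropping it; the real change may differ in how it schedules retries and which failures it treats as retryable.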

Fixes #126148

elasticsearchmachine · 2025-06-13T07:45:42Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2025-06-13T07:45:43Z

Hi @valeriy42, I've created a changelog YAML for you.

davidkyle

LGTM

elasticsearchmachine · 2025-06-13T12:11:01Z

💚 Backport successful

Status	Branch	Result
✅	8.19
✅	9.0
✅	8.17

valeriy42 · 2025-06-16T08:31:15Z

💚 All backports created successfully

Status	Branch	Result
✅	8.18

Questions?

Please refer to the Backport tool documentation

… node is temporarily unavailable (elastic#129391) (cherry picked from commit d487eb5)

… node is temporarily unavailable (#129391) (#129402)

… node is temporarily unavailable (#129391) (#129461) (cherry picked from commit d487eb5)

… node is temporarily unavailable (#129391) (#129401)

… node is temporarily unavailable (#129391) (#129403)

valeriy42 added 3 commits June 13, 2025 09:41

initial implementation

ff69cfa

Unit tests implemented

c7c8f3d

Unit tests commented and fixed

c8a2e76

elasticsearchmachine added the v9.1.0 and needs:triage labels Jun 13, 2025

valeriy42 added the >bug, :ml, Team:ML, auto-backport, v8.19.0, v9.0.3, and v8.17.8 labels and removed the needs:triage label Jun 13, 2025

valeriy42 self-assigned this Jun 13, 2025

Update docs/changelog/129391.yaml

78cd7dd

valeriy42 mentioned this pull request Jun 13, 2025

[ML] Ensure that AD job state update retries if master node is temporarily unavailable #129329

Closed

update min delay and set timeout

11844ea

valeriy42 requested a review from davidkyle June 13, 2025 07:51

valeriy42 added the cloud-deploy label Jun 13, 2025

valeriy42 added 2 commits June 13, 2025 09:52

remove obsolete todo

04571d6

removed delay and timeout from the constructor

607e1f4

davidkyle approved these changes Jun 13, 2025

[CI] Auto commit changes from spotless

e75c4a6

valeriy42 merged commit d487eb5 into elastic:main Jun 13, 2025
19 checks passed

valeriy42 deleted the bugfix/is-1543-opening-state-2 branch June 13, 2025 12:09

valeriy42 mentioned this pull request Jun 13, 2025

[8.17] [ML] Ensure that anomaly detection job state update retries if master node is temporarily unavailable (#129391) #129403

Merged

valeriy42 added the v8.18.3 label Jun 13, 2025

valeriy42 mentioned this pull request Jun 16, 2025

[8.18] [ML] Ensure that anomaly detection job state update retries if master node is temporarily unavailable (#129391) #129461

Merged
