Skip to content

[ML] Ensure that anomaly detection job state update retries if master node is temoporarily unavailable #129391

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 13, 2025

Conversation

valeriy42
Copy link
Contributor

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148

@elasticsearchmachine elasticsearchmachine added v9.1.0 needs:triage Requires assignment of a team area label labels Jun 13, 2025
@valeriy42 valeriy42 added >bug :ml Machine learning Team:ML Meta label for the ML team auto-backport Automatically create backport pull requests when merged v8.19.0 v9.0.3 v8.17.8 and removed needs:triage Requires assignment of a team area label labels Jun 13, 2025
@valeriy42 valeriy42 self-assigned this Jun 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @valeriy42, I've created a changelog YAML for you.

@valeriy42 valeriy42 requested a review from davidkyle June 13, 2025 07:51
@valeriy42 valeriy42 added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Jun 13, 2025
Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@valeriy42 valeriy42 merged commit d487eb5 into elastic:main Jun 13, 2025
19 checks passed
@valeriy42 valeriy42 deleted the bugfix/is-1543-opening-state-2 branch June 13, 2025 12:09
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Jun 13, 2025
… node is temoporarily unavailable (elastic#129391)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes elastic#126148
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.19
9.0
8.17

valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Jun 13, 2025
… node is temoporarily unavailable (elastic#129391)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes elastic#126148
valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Jun 13, 2025
… node is temoporarily unavailable (elastic#129391)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes elastic#126148
@valeriy42
Copy link
Contributor Author

💚 All backports created successfully

Status Branch Result
8.18

Questions ?

Please refer to the Backport tool documentation

valeriy42 added a commit to valeriy42/elasticsearch that referenced this pull request Jun 16, 2025
… node is temoporarily unavailable (elastic#129391)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes elastic#126148

(cherry picked from commit d487eb5)
elasticsearchmachine pushed a commit that referenced this pull request Jun 16, 2025
… node is temoporarily unavailable (#129391) (#129402)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148
elasticsearchmachine pushed a commit that referenced this pull request Jun 16, 2025
… node is temoporarily unavailable (#129391) (#129461)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148

(cherry picked from commit d487eb5)
elasticsearchmachine pushed a commit that referenced this pull request Jun 16, 2025
… node is temoporarily unavailable (#129391) (#129401)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148
elasticsearchmachine pushed a commit that referenced this pull request Jun 16, 2025
… node is temoporarily unavailable (#129391) (#129403)

During cluster upgrade, the anomaly detection jobs must be reassigned from one ML node to another. During this reassignment, the jobs transition through several states, including "opening" and "opened". If, during this transition, the master node becomes temporarily unavailable, e.g., due to reassignment, the new job state is not successfully committed to the cluster state. Therefore, once the new master became available, the cluster state was inconsistent: some anomaly detection jobs were opened, but their state got stuck as "opening".

This PR introduces a retryable action for updating the job state to ensure that the job state is successfully updated and the cluster state remains consistent during the upgrade.

Fixes #126148
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.17.8 v8.18.3 v8.19.0 v9.0.3 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Prevent AD jobs from being stuck in the "opening" state
3 participants