Reset relocation/allocation failure counter on node join/shutdown #119968

Merged
merged 9 commits into elastic:main from ps250109-resetCounterOnShutdown
Jan 28, 2025

Conversation

pxsalehi
Member

@pxsalehi pxsalehi commented Jan 10, 2025

We prevent retries of allocations/relocations once they see index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting of the allocation failure counters when a node joins the cluster. As discussed there, it makes sense to extend this reset to relocations and also to consider node shutdown events. With this change, we reset both allocation and relocation failures when a new node joins the cluster or a shutdown metadata is applied. The subset of shutdown events that we consider, and how we track them, is more or less copied from what was done for #106998; the logic seemed to make sense here too.

Closes ES-10492
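
For context, a rough sketch of the gate this change relaxes: once a shard's failure counter reaches the index-scoped index.allocation.max_retries, the allocation decider vetoes further automatic attempts until the counter is reset (previously via a manual reroute with retry_failed or, since #108987, a node join). Names below are approximate, not quoted from MaxRetryAllocationDecider:

    // Illustrative only (names approximate): the decider-style check that stops
    // automatic retries once the per-index maximum has been reached.
    final int maxRetries = SETTING_ALLOCATION_MAX_RETRY.get(indexMetadata.getSettings());
    if (unassignedInfo != null && unassignedInfo.failedAllocations() >= maxRetries) {
        return Decision.NO; // no further automatic attempts until the counter is reset
    }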

@pxsalehi pxsalehi added the >non-issue and :Distributed Coordination/Allocation labels Jan 10, 2025
@pxsalehi pxsalehi force-pushed the ps250109-resetCounterOnShutdown branch 4 times, most recently from d5686fc to 8f93a29 Compare January 15, 2025 14:01
@pxsalehi pxsalehi changed the title Reset relocation failure counter on node join/shutdown Reset relocation/allocation failure counter on node join/shutdown Jan 15, 2025
* Note that removing a non-RESTART shutdown metadata from a node that is still in the cluster is treated similarly and
* will also reset the allocation/relocation failure counters.
*/
private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
Member Author

In this PR, allocation and relocation are treated similarly, and sometimes I've used "allocation" to mean both.

import static org.hamcrest.CoreMatchers.notNullValue;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
public class AllocationFailuresResetOnShutdownIT extends ESIntegTestCase {
Member Author

In these tests, we sometimes need three nodes, since when using a REPLACE shutdown type, NodeReplacementAllocationDecider can prevent the shard from being assigned to any available node.

@pxsalehi pxsalehi force-pushed the ps250109-resetCounterOnShutdown branch from 8f93a29 to cd6a896 Compare January 15, 2025 14:21
@elasticsearchmachine
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

@pxsalehi pxsalehi marked this pull request as ready for review January 15, 2025 14:30
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Jan 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Member

@ywangd ywangd left a comment


Haven't read the tests. Left a few minor comments. Also, I think we want to add some logging when the failure counters are at 5 (max) and get reset. IIUC from the conversation, reaching max failures might be an indication of issues in some part of the system. We reset the counters to ease the support burden, but in the meantime it sort of hides the problems, so logs should be helpful to make them discoverable.

Comment on lines 601 to 615
final var nodes = clusterState.nodes();
final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    // A RESTART doesn't necessarily move around shards, so no need to consider it for a reset.
    // Furthermore, once the node rejoins after restarting, there will be a reset if necessary.
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        shutdownEventAffectsAllocation |= processedNodeShutdowns.add(shutdown.getKey());
    }
}
return (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) && (hasAllocationFailures || hasRelocationFailures);
Member

Nit: can we short-circuit earlier when nodesAdded() is true? That feels a bit easier to read to me.

Member Author

We'd need to do the book-keeping of the seen shutdowns anyway. I can add a comment, or maybe break the return line into multiple lines. Frankly, I found this more readable. Maybe a comment-equivalent of that helps?

Member

I wonder whether we can do away with the book-keeping? Can we compare the old and new cluster states and decide whether a shutdown record has been "added" or "removed but still in the cluster" and thus needs a reset? It may be helpful to enhance ClusterChangedEvent to provide some shutdownChanged() method, similar to nodesChanged(), so that it can be reused. This can be a separate PR. For the work here, a localized solution is fine.

Member Author

@pxsalehi pxsalehi Jan 17, 2025


Unless you feel strongly about this, I'd rather not revisit that part in this PR. We have used this elsewhere and its edge cases have been reviewed (as mentioned in the PR description), and I think it is well-tested in this PR too, so reusing it makes sense. I can follow up with your suggestion and, if it works out, also simplify the similar code we have for auto-resetting the desired balance.

I can break up the return statement into something like

if (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) {
  return hasAllocationFailures || hasRelocationFailures;
}
return false;

Member

@ywangd ywangd Jan 20, 2025


Yeah, for consistency's sake I am OK with keeping that part as is in this PR. We can have a follow-up to potentially change both places.

To make sure we are not talking past each other, I am posting the concrete code suggestion below. To me, the main simplification comes from not needing to track shutdown nodes locally. I tested it with AllocationFailuresResetOnShutdownIT and AllocationFailuresResetIT and they both pass, though I likely have not exhausted all possible test variants. Let's put it through its own review process if you agree it's worth following up on.

    private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
        if (changeEvent.state().getRoutingNodes().hasAllocationFailures() == false
            && changeEvent.state().getRoutingNodes().hasRelocationFailures() == false) {
            return false;
        }
        if (changeEvent.nodesAdded()) {
            return true;
        }
        final var previous = nonRestartShutdownNodes(changeEvent.previousState());
        final var current = nonRestartShutdownNodes(changeEvent.state());
        if (previous.equals(current)) {
            return false;
        }
        return Sets.difference(current, previous).isEmpty() == false // new shutdown
            // removed shutdown but the node is still in cluster
            || Sets.difference(previous, current).stream().anyMatch(nodeId -> changeEvent.state().nodes().get(nodeId) != null);
    }

    // This can be a method on either ClusterState or NodesShutdownMetadata
    private static Set<String> nonRestartShutdownNodes(ClusterState clusterState) {
        return clusterState.metadata()
            .nodeShutdowns()
            .getAll()
            .values()
            .stream()
            .filter(m -> m.getType() != SingleNodeShutdownMetadata.Type.RESTART)
            .map(SingleNodeShutdownMetadata::getNodeId)
            .collect(Collectors.toUnmodifiableSet());
    }

Comment on lines 603 to 605
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
.anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
Member

This will cover restarted nodes as well. I think it's OK since it is covered by nodesAdded anyway. It may lead to a few more resets:

  1. On shutdown metadata (either restart or sigterm for serverless azure cluster)
  2. On node join after restart
  3. On shutdown metadata removal (if in a later cluster state update)

Still might be ok. Just to be explicit.

Member Author

We explicitly avoid a reset when the RESTART metadata is applied. When the node rejoins, yes, there will be a reset. We could explicitly prevent it, but I'm not sure it is worth it. Removal of a RESTART shouldn't cause a reset if the node is still in the cluster. I think that is tested as well.

Member

I should have been clearer: by "restart", I meant the Azure-specific behaviour where the node still goes through the sigterm shutdown record lifecycle while being restarted by the Azure "auto-repair" feature. When that happens, we will reset the counter a few more times, which should be OK.

@pxsalehi
Member Author

add some logs when the failure counters are 5

Yeah, we could. Do you mean the same way that we log resetting the desired balance, or including details of which shards had failures? The latter seems a bit verbose, but maybe some summary of it.

@pxsalehi pxsalehi requested a review from ywangd January 16, 2025 13:56
Member

@ywangd ywangd left a comment


You mean the same way that we log resetting desired balance or including details of which shards had failures?

I was thinking more like the latter; otherwise it is hard to go back in the logs to find the actual shards. It would be great if the log message had a list of shards (potentially truncated if the list is too large) that have 5 failures. What do you think?

@pxsalehi
Member Author

I was thinking more like the latter; otherwise it is hard to go back in the logs to find the actual shards. It would be great if the log message had a list of shards (potentially truncated if the list is too large) that have 5 failures. What do you think?

Yes, I think that's helpful too. I'll add it in this PR.
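
A minimal sketch of what such a summary log could look like (the MAX_SHARDS_IN_LOG_MSG bound and counter names mirror the diff that shows up later in this thread; the iteration, helper names, bound value, and log level are illustrative assumptions):

    // Count every shard at its max, but only keep a bounded sample of shard IDs
    // so the log line cannot grow unbounded.
    int shardsWithMaxFailedAllocations = 0;
    List<ShardId> topShardIdsWithFailedAllocations = new ArrayList<>();
    for (ShardRouting shardRouting : shardsBeingReset) { // hypothetical iteration
        if (failedAllocationsOf(shardRouting) >= maxRetriesFor(shardRouting)) { // hypothetical helpers
            shardsWithMaxFailedAllocations++;
            if (topShardIdsWithFailedAllocations.size() < MAX_SHARDS_IN_LOG_MSG) {
                topShardIdsWithFailedAllocations.add(shardRouting.shardId());
            }
        }
    }
    if (shardsWithMaxFailedAllocations > 0) {
        logger.info(
            "Resetting failure counter for [{}] shard(s) that have reached their max allocation retries ({})",
            shardsWithMaxFailedAllocations,
            topShardIdsWithFailedAllocations
        );
    }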

assertBusy(() -> {
    var stateAfterNodeJoin = internalCluster().clusterService().state();
    var relocatedShard = stateAfterNodeJoin.routingTable().index("index1").shard(0).primaryShard();
    assertThat(relocatedShard.relocationFailureInfo().failedRelocations(), Matchers.lessThan(maxAttempts));
Member

Probably just for my knowledge: Why does this assert failedRelocations < maxAttempts instead of failedRelocations == 0?

Member Author

What we care about is the reset of the counter, so anything below max would do. I don't see a reason to be too strict here, since the test would be extra brittle if, for whatever reason, it takes more than one attempt to allocate/relocate.

Member

@ywangd ywangd left a comment


The PR looks good to me. I have only minor comments. I'll review it again if you are adding logging to it. Alternatively, I am also ready to approve it if you prefer the logging to be a separate PR. Please let me know. Thanks!

@pxsalehi
Member Author

Thanks. Since we've already gotten into both points here, I'll just add it here and ping you. I didn't have time to get back to this yet.

return false;
}

public void resetFailedCounter(RoutingAllocation allocation) {
Member Author

I've chosen to add the logging here, both to be able to produce some reliable information and because it integrates with the existing code that goes through the failures. SETTING_ALLOCATION_MAX_RETRY is an index-scoped setting, so I'm not referring to any specific value, just picking the shards that have reached their max. Anything under max is not mentioned, as was the case before.
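
For illustration, resolving the index-scoped maximum per shard might look roughly like this (the lookup path is an assumption, not quoted from the PR's diff):

    // Assumption: resolve the per-index max-retries value for the shard being
    // inspected; only shards whose failure count has reached it are summarized.
    final int maxRetry = SETTING_ALLOCATION_MAX_RETRY.get(
        allocation.metadata().getIndexSafe(shardRouting.index()).getSettings()
    );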

@pxsalehi
Member Author

Simplified the prod code a bit in d4140e5 and added logging in f6918de. The simplification doesn't apply to #106998, since we don't have the ClusterChangedEvent there like we do here.

@pxsalehi pxsalehi requested a review from ywangd January 24, 2025 10:03
Member

@ywangd ywangd left a comment


LGTM

Thanks for the iterations! Test cases look very solid 👍

}

public void resetFailedCounter(RoutingAllocation allocation) {
final var observer = allocation.changes();
Member

Nit: can we keep the old name of routingChangesObserver?

Member Author

I very much prefer observer, since it returns a RoutingChangesObserver.

);
if (failedAllocations >= maxRetry) {
    shardsWithMaxFailedAllocations++;
    if (topShardIdsWithFailedAllocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
Member

Nit: I think we can check the size along with failedAllocations > 0 to short-circuit a bit earlier and avoid resolving the setting value.

Member Author

@pxsalehi pxsalehi Jan 28, 2025


I don't think so: shardsWithMaxFailedAllocations counts all of the shards that have reached max, while topShardIdsWithFailedAllocations stores only a portion of the shard IDs.

Comment on lines +1383 to +1385
if (topShardIdsWithFailedRelocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
    topShardIdsWithFailedRelocations.add(shardRouting.shardId());
}
Member

Similar nit: checking the size before resolving the setting value would be my preference.

Comment on lines +72 to +74
"Resetting failure counter for %d shard(s) that have reached their max allocation retires (%s)";
public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
"Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
Member

Nit:

Suggested change:
-    "Resetting failure counter for %d shard(s) that have reached their max allocation retries (%s)";
-public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
-    "Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
+    "Resetting failure counter for [%d] shard(s) that have reached their max allocation retries (%s)";
+public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
+    "Resetting failure counter for [%d] shard(s) that have reached their max relocation retries (%s)";

@pxsalehi
Member Author

@elasticmachine update branch

@pxsalehi pxsalehi enabled auto-merge (squash) January 28, 2025 09:12
@pxsalehi pxsalehi merged commit b94a20e into elastic:main Jan 28, 2025
15 checks passed
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 27, 2025
When elastic#119968 was merged into multi-project we introduced a regression
by inserting a call to `.getProject()` within the `RoutingNodes` class
that was supposed to be multi-project-aware.

This commit replaces those calls with `.indexMetadata` lookups
elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2025
When #119968 was merged into multi-project we introduced a regression by
inserting a call to `.getProject()` within the `RoutingNodes` class that
was supposed to be multi-project-aware.

This commit replaces those calls with `.indexMetadata` lookups
Labels
:Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes)
>enhancement
Team:Distributed Coordination (Meta label for Distributed Coordination team)
v9.0.0