Reset relocation/allocation failure counter on node join/shutdown #119968

Merged
merged 9 commits into elastic:main from ps250109-resetCounterOnShutdown
Jan 28, 2025

Conversation

pxsalehi
Member

@pxsalehi pxsalehi commented Jan 10, 2025

We prevent retries of allocations/relocations once they see index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting of the allocation failure counters when a node joins the cluster. As discussed there, it makes sense to extend this reset to relocations and also to consider node shutdown events. With this change, we reset both allocation and relocation failures when a new node joins the cluster or a shutdown metadata is applied. The subset of shutdown events that we consider, and how we track them, is more or less copied from what was done for #106998; the logic seemed to make sense here too.

Closes ES-10492
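
For context, a rough sketch of the gate this change relaxes: once a shard's failure counter reaches the index-scoped index.allocation.max_retries, the allocation decider vetoes further automatic attempts until the counter is reset (previously via a manual reroute with retry_failed or, since #108987, a node join). Names below are approximate, not quoted from MaxRetryAllocationDecider:

    // Illustrative only (names approximate): the decider-style check that stops
    // automatic retries once the per-index maximum has been reached.
    final int maxRetries = SETTING_ALLOCATION_MAX_RETRY.get(indexMetadata.getSettings());
    if (unassignedInfo != null && unassignedInfo.failedAllocations() >= maxRetries) {
        return Decision.NO; // no further automatic attempts until the counter is reset
    }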

@pxsalehi pxsalehi added the >non-issue and :Distributed Coordination/Allocation labels Jan 10, 2025
@pxsalehi pxsalehi force-pushed the ps250109-resetCounterOnShutdown branch 4 times, most recently from d5686fc to 8f93a29 Compare January 15, 2025 14:01
@pxsalehi pxsalehi changed the title Reset relocation failure counter on node join/shutdown Reset relocation/allocation failure counter on node join/shutdown Jan 15, 2025
* Note that removing a non-RESTART shutdown metadata from a node that is still in the cluster is treated similarly and
* will also reset the allocation/relocation failure counters.
*/
private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
Member Author

In this PR, allocation and relocation are treated similarly, and sometimes I've used "allocation" to mean both.

import static org.hamcrest.CoreMatchers.notNullValue;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
public class AllocationFailuresResetOnShutdownIT extends ESIntegTestCase {
Member Author

In these tests, we sometimes need three nodes, since when using a REPLACE shutdown type, NodeReplacementAllocationDecider can prevent the shard from being assigned to any available node.

@pxsalehi pxsalehi force-pushed the ps250109-resetCounterOnShutdown branch from 8f93a29 to cd6a896 Compare January 15, 2025 14:21
@elasticsearchmachine
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

@pxsalehi pxsalehi marked this pull request as ready for review January 15, 2025 14:30
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination label Jan 15, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Member

@ywangd ywangd left a comment


Haven't read the tests. Left a few minor comments. Also, I think we want to add some logging when the failure counters are at 5 (max) and get reset. IIUC from the conversation, reaching max failures might be an indication of issues in some part of the system. We reset the counters to ease the support burden, but in the meantime it sort of hides the problems, so logs should be helpful to make them discoverable.

Comment on lines 601 to 615
final var nodes = clusterState.nodes();
final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    // A RESTART doesn't necessarily move around shards, so no need to consider it for a reset.
    // Furthermore, once the node rejoins after restarting, there will be a reset if necessary.
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        shutdownEventAffectsAllocation |= processedNodeShutdowns.add(shutdown.getKey());
    }
}
return (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) && (hasAllocationFailures || hasRelocationFailures);
Member

Nit: can we short-circuit earlier when nodesAdded() is true? That feels a bit easier to read to me.

Member Author

We'd need to do the book-keeping of the seen shutdowns anyway. I can add a comment, or maybe break the return line into multiple lines. Frankly, I found this more readable. Maybe a comment-equivalent of that helps?

Member

I wonder whether we can do away with the book-keeping? Can we compare the old and new cluster states and decide whether a shutdown record has been "added" or "removed but still in the cluster" and thus needs a reset? It may be helpful to enhance ClusterChangedEvent to provide some shutdownChanged() method, similar to nodesChanged(), so that it can be reused. This can be a separate PR. For the work here, a localized solution is fine.

Member Author

@pxsalehi pxsalehi Jan 17, 2025


Unless you feel strongly about this, I'd rather not revisit that part in this PR. We have used this elsewhere and its edge cases have been reviewed (as mentioned in the PR description), and I think it is well-tested in this PR too, so reusing it makes sense. I can follow up with your suggestion and, if it works out, also simplify the similar code we have for auto-resetting the desired balance.

I can break up the return statement into something like

if (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) {
  return hasAllocationFailures || hasRelocationFailures;
}
return false;

Member

@ywangd ywangd Jan 20, 2025


Yeah, for consistency's sake I am OK with keeping that part as is in this PR. We can have a follow-up to potentially change both places.

To make sure we are not talking past each other, I am posting the concrete code suggestion below. To me, the main simplification comes from not needing to track shutdown nodes locally. I tested it with AllocationFailuresResetOnShutdownIT and AllocationFailuresResetIT and they both pass, though I likely have not exhausted all possible test variants. Let's put it through its own review process if you agree it's worth following up on.

    private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
        if (changeEvent.state().getRoutingNodes().hasAllocationFailures() == false
            && changeEvent.state().getRoutingNodes().hasRelocationFailures() == false) {
            return false;
        }
        if (changeEvent.nodesAdded()) {
            return true;
        }
        final var previous = nonRestartShutdownNodes(changeEvent.previousState());
        final var current = nonRestartShutdownNodes(changeEvent.state());
        if (previous.equals(current)) {
            return false;
        }
        return Sets.difference(current, previous).isEmpty() == false // new shutdown
            // removed shutdown but the node is still in cluster
            || Sets.difference(previous, current).stream().anyMatch(nodeId -> changeEvent.state().nodes().get(nodeId) != null);
    }

    // This can be a method on either ClusterState or NodesShutdownMetadata
    private static Set<String> nonRestartShutdownNodes(ClusterState clusterState) {
        return clusterState.metadata()
            .nodeShutdowns()
            .getAll()
            .values()
            .stream()
            .filter(m -> m.getType() != SingleNodeShutdownMetadata.Type.RESTART)
            .map(SingleNodeShutdownMetadata::getNodeId)
            .collect(Collectors.toUnmodifiableSet());
    }

Comment on lines 603 to 605
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
.anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
Member

This will cover restarted nodes as well. I think it's OK since it is covered by nodesAdded anyway. It may lead to a few more resets:

  1. On shutdown metadata (either restart or sigterm for serverless azure cluster)
  2. On node join after restart
  3. On shutdown metadata removal (if in a later cluster state update)

Still might be ok. Just to be explicit.

Member Author

We explicitly avoid a reset when the RESTART metadata is applied. When the node rejoins, yes, there will be a reset. We could explicitly prevent it, but I'm not sure it is worth it. Removal of a RESTART shouldn't cause a reset if the node is still in the cluster. I think that is tested as well.

Member

I should have been clearer: by "restart", I meant the Azure-specific behaviour where the node still goes through the sigterm shutdown record lifecycle while being restarted by the Azure "auto-repair" feature. When that happens, we will reset the counter a few more times, which should be OK.

@pxsalehi
Member Author

add some logs when the failure counters are 5

Yeah, we could. Do you mean the same way that we log resetting the desired balance, or including details of which shards had failures? The latter seems a bit verbose, but maybe some summary of it.

@pxsalehi pxsalehi requested a review from ywangd January 16, 2025 13:56
Member

@ywangd ywangd left a comment


You mean the same way that we log resetting desired balance or including details of which shards had failures?

I was thinking more like the latter; otherwise it is hard to go back in the logs to find the actual shards. It would be great if the log message had a list of shards (potentially truncated if the list is too large) that have 5 failures. What do you think?

@pxsalehi
Member Author

I was thinking more like the latter; otherwise it is hard to go back in the logs to find the actual shards. It would be great if the log message had a list of shards (potentially truncated if the list is too large) that have 5 failures. What do you think?

Yes, I think that's helpful too. I'll add it in this PR.
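
A minimal sketch of what such a summary log could look like (the MAX_SHARDS_IN_LOG_MSG bound and counter names mirror the diff that shows up later in this thread; the iteration, helper names, bound value, and log level are illustrative assumptions):

    // Count every shard at its max, but only keep a bounded sample of shard IDs
    // so the log line cannot grow unbounded.
    int shardsWithMaxFailedAllocations = 0;
    List<ShardId> topShardIdsWithFailedAllocations = new ArrayList<>();
    for (ShardRouting shardRouting : shardsBeingReset) { // hypothetical iteration
        if (failedAllocationsOf(shardRouting) >= maxRetriesFor(shardRouting)) { // hypothetical helpers
            shardsWithMaxFailedAllocations++;
            if (topShardIdsWithFailedAllocations.size() < MAX_SHARDS_IN_LOG_MSG) {
                topShardIdsWithFailedAllocations.add(shardRouting.shardId());
            }
        }
    }
    if (shardsWithMaxFailedAllocations > 0) {
        logger.info(
            "Resetting failure counter for [{}] shard(s) that have reached their max allocation retries ({})",
            shardsWithMaxFailedAllocations,
            topShardIdsWithFailedAllocations
        );
    }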

assertBusy(() -> {
    var stateAfterNodeJoin = internalCluster().clusterService().state();
    var relocatedShard = stateAfterNodeJoin.routingTable().index("index1").shard(0).primaryShard();
    assertThat(relocatedShard.relocationFailureInfo().failedRelocations(), Matchers.lessThan(maxAttempts));
Member

Probably just for my knowledge: Why does this assert failedRelocations < maxAttempts instead of failedRelocations == 0?

Member Author

What we care about is the reset of the counter, so anything below max would do. I don't see a reason to be too strict here, since the test would be extra brittle if, for whatever reason, it takes more than one attempt to allocate/relocate.

Member

@ywangd ywangd left a comment


The PR looks good to me. I have only minor comments. I'll review it again if you are adding logging to it. Alternatively, I am also ready to approve it if you prefer the logging to be a separate PR. Please let me know. Thanks!

@pxsalehi
Member Author

Thanks. Since we've already gotten into both points here, I'll just add it here and ping you. I didn't have time to get back to this yet.

return false;
}

public void resetFailedCounter(RoutingAllocation allocation) {
Member Author

I've chosen to add the logging here, both to be able to produce some reliable information and because it integrates with the existing code that goes through the failures. SETTING_ALLOCATION_MAX_RETRY is an index-scoped setting, so I'm not referring to any specific value, just picking the shards that have reached their max. Anything under max is not mentioned, as was the case before.
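
For illustration, resolving the index-scoped maximum per shard might look roughly like this (the lookup path is an assumption, not quoted from the PR's diff):

    // Assumption: resolve the per-index max-retries value for the shard being
    // inspected; only shards whose failure count has reached it are summarized.
    final int maxRetry = SETTING_ALLOCATION_MAX_RETRY.get(
        allocation.metadata().getIndexSafe(shardRouting.index()).getSettings()
    );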

@pxsalehi
Member Author

Simplified the prod code a bit in d4140e5 and added logging in f6918de. The simplification doesn't apply to #106998, since we don't have the ClusterChangedEvent there like we do here.

@pxsalehi pxsalehi requested a review from ywangd January 24, 2025 10:03
Member

@ywangd ywangd left a comment


LGTM

Thanks for the iterations! Test cases look very solid 👍

}

public void resetFailedCounter(RoutingAllocation allocation) {
final var observer = allocation.changes();
Member

Nit: can we keep the old name of routingChangesObserver?

Member Author

I very much prefer observer, since it returns a RoutingChangesObserver.

);
if (failedAllocations >= maxRetry) {
    shardsWithMaxFailedAllocations++;
    if (topShardIdsWithFailedAllocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
Member

Nit: I think we can check the size along with failedAllocations > 0 to short-circuit a bit earlier and avoid resolving the setting value.

Member Author

@pxsalehi pxsalehi Jan 28, 2025


I don't think so: shardsWithMaxFailedAllocations counts all of the shards that have reached max, while topShardIdsWithFailedAllocations stores only a portion of the shard IDs.

Comment on lines +1383 to +1385
if (topShardIdsWithFailedRelocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
    topShardIdsWithFailedRelocations.add(shardRouting.shardId());
}
Member

Similar nit: checking the size before resolving the setting value would be my preference.

Comment on lines +72 to +74
"Resetting failure counter for %d shard(s) that have reached their max allocation retires (%s)";
public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
"Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
Member

Nit:

Suggested change:
-    "Resetting failure counter for %d shard(s) that have reached their max allocation retries (%s)";
-public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
-    "Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
+    "Resetting failure counter for [%d] shard(s) that have reached their max allocation retries (%s)";
+public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
+    "Resetting failure counter for [%d] shard(s) that have reached their max relocation retries (%s)";

@pxsalehi
Member Author

@elasticmachine update branch

@pxsalehi pxsalehi enabled auto-merge (squash) January 28, 2025 09:12
@pxsalehi pxsalehi merged commit b94a20e into elastic:main Jan 28, 2025
15 checks passed
tvernum added a commit to tvernum/elasticsearch that referenced this pull request Feb 27, 2025
When elastic#119968 was merged into multi-project we introduced a regression
by inserting a call to `.getProject()` within the `RoutingNodes` class
that was supposed to be multi-project-aware.

This commit replaces those calls with `.indexMetadata` lookups
elasticsearchmachine pushed a commit that referenced this pull request Feb 27, 2025
When #119968 was merged into multi-project we introduced a regression by
inserting a call to `.getProject()` within the `RoutingNodes` class that
was supposed to be multi-project-aware.

This commit replaces those calls with `.indexMetadata` lookups
Labels
:Distributed Coordination/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes)
>enhancement
Team:Distributed Coordination (Meta label for Distributed Coordination team)
v9.0.0