Reset relocation/allocation failure counter on node join/shutdown #119968
Conversation
Force-pushed from d5686fc to 8f93a29
 * Note that removing a non-RESTART shutdown metadata from a node that is still in the cluster is treated similarly and
 * will cause resetting the allocation/relocation failures.
 */
private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
In the PR, allocation and relocation are kind of treated similarly, and sometimes I've used "allocation" to mean both.
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java (resolved)
import static org.hamcrest.CoreMatchers.notNullValue;

@ESIntegTestCase.ClusterScope(scope = ESIntegTestCase.Scope.TEST, numDataNodes = 0)
public class AllocationFailuresResetOnShutdownIT extends ESIntegTestCase {
In these tests we sometimes need three nodes, since when we use a REPLACE shutdown type, NodeReplacementAllocationDecider can prevent the shard from being assigned to any available node.
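As a rough illustration of that setup (a minimal sketch only, not the PR's actual test code; the index name, settings, and variable names are assumptions):

// Minimal sketch of the three-node setup described above; names and settings are illustrative.
String nodeToShutDown = internalCluster().startNode();   // node that will get the REPLACE shutdown marker
String replacementNode = internalCluster().startNode();  // designated replacement target
internalCluster().startNode();                           // spare node so the decider still has a valid assignment target

client().admin()
    .indices()
    .prepareCreate("index1")
    .setSettings(Settings.builder().put("index.number_of_shards", 1).put("index.number_of_replicas", 0))
    .get();
ensureGreen("index1");
// ...then register the REPLACE shutdown for nodeToShutDown, targeting replacementNode.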
Force-pushed from 8f93a29 to cd6a896
Hi @pxsalehi, I've created a changelog YAML for you.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Haven't read the tests. Left a few minor comments. Also, I think we want to add some logs when the failure counters are at 5 (max) and are reset. IIUC from the conversation, reaching max failures might be an indication of issues in some part of the system. We reset the counters to ease the support burden, but in the meantime it sort of hides the problems, and logs should be helpful to make them discoverable.
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java (outdated, resolved)
final var nodes = clusterState.nodes();
final var nodeShutdowns = clusterState.metadata().nodeShutdowns();
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
// Clean up processed shutdowns that are removed from the cluster metadata
processedNodeShutdowns.removeIf(nodeId -> nodeShutdowns.contains(nodeId) == false);
for (var shutdown : nodeShutdowns.getAll().entrySet()) {
    // A RESTART doesn't necessarily move around shards, so no need to consider it for a reset.
    // Furthermore, once the node rejoins after restarting, there will be a reset if necessary.
    if (shutdown.getValue().getType() != SingleNodeShutdownMetadata.Type.RESTART) {
        shutdownEventAffectsAllocation |= processedNodeShutdowns.add(shutdown.getKey());
    }
}
return (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) && (hasAllocationFailures || hasRelocationFailures);
Nit: can we short-circuit earlier when nodesAdded() is true, which feels a bit easier to read to me?
We'd need to do the book-keeping of the seen shutdowns anyway. I can add a comment, or maybe break the return line into multiple lines. Frankly, I found this more readable. Maybe a comment-equivalent of that helps?
I wonder whether we can do away with the book-keeping? Can we compare the old and new cluster states and decide whether a shutdown record has been "added" or "removed but still in the cluster" and thus needs a reset? It may be helpful to enhance ClusterChangedEvent to provide some shutdownChanged() method similar to nodesChanged() so that it can be reused. This can be a separate PR. For the work here, a localized solution is fine.
Unless you feel strongly about this, I'd rather not revisit that part in this PR. We have used this elsewhere and its edge cases have been reviewed (as mentioned in the PR description), and I think it is well tested in this PR too, so reusing it makes sense. I can follow up with your suggestion and, if it works out, also simplify the similar code we have for auto-resetting the desired balance.
I can break up the return statement into something like:
if (changeEvent.nodesAdded() || shutdownEventAffectsAllocation) {
return hasAllocationFailures || hasRelocationFailures;
}
return false;
Yeah, for consistency purposes I am OK with keeping that part as is in this PR. We can have a follow-up to potentially change both places.
To make sure we are not talking past each other, I am posting the concrete code suggestion below. To me the main simplicity comes from not needing to track shutdown nodes locally. I tested it with AllocationFailuresResetOnShutdownIT and AllocationFailuresResetIT and they both pass, though I likely have not exhausted all possible test variants. Let's put it through its own review process if you agree it's worthwhile to follow.
private boolean shouldResetAllocationFailures(ClusterChangedEvent changeEvent) {
if (changeEvent.state().getRoutingNodes().hasAllocationFailures() == false
&& changeEvent.state().getRoutingNodes().hasRelocationFailures() == false) {
return false;
}
if (changeEvent.nodesAdded()) {
return true;
}
final var previous = nonRestartShutdownNodes(changeEvent.previousState());
final var current = nonRestartShutdownNodes(changeEvent.state());
if (previous.equals(current)) {
return false;
}
return Sets.difference(current, previous).isEmpty() == false // new shutdown
// removed shutdown but the node is still in cluster
|| Sets.difference(previous, current).stream().anyMatch(nodeId -> changeEvent.state().nodes().get(nodeId) != null);
}
// This can be a method on either ClusterState or NodesShutdownMetadata
private static Set<String> nonRestartShutdownNodes(ClusterState clusterState) {
return clusterState.metadata()
.nodeShutdowns()
.getAll()
.values()
.stream()
.filter(m -> m.getType() != SingleNodeShutdownMetadata.Type.RESTART)
.map(SingleNodeShutdownMetadata::getNodeId)
.collect(Collectors.toUnmodifiableSet());
}
// If we remove a shutdown marker from a node, but it is still in the cluster, we could re-attempt failed relocations/allocations.
shutdownEventAffectsAllocation = processedNodeShutdowns.stream()
    .anyMatch(nodeId -> nodeShutdowns.contains(nodeId) == false && nodes.get(nodeId) != null);
This will cover restarted nodes as well. I think that's OK since it is covered by nodesAdded anyway. It may lead to a reset a few more times:
- On shutdown metadata being applied (either restart, or sigterm for a serverless Azure cluster)
- On node join after restart
- On shutdown metadata removal (if in a later cluster state update)
Still might be OK. Just to be explicit.
We explicitly avoid a reset when the RESTART metadata is applied. When the node rejoins, yes, there will be a reset. We could explicitly prevent it, but I'm not sure it is worth it. Removal of RESTART shouldn't cause a reset if the node is still in the cluster. I think that is tested as well.
I should have been clearer: by "restart", I meant the Azure-specific behaviour where the node still goes through the sigterm shutdown record lifecycle while being restarted by the Azure "auto-repair" feature. When that happens, we will reset the counter a few more times, which should be OK.
Yeah, we could. You mean the same way that we log resetting the desired balance, or including details of which shards had failures? The latter seems a bit verbose, but maybe some summary of it.
You mean the same way that we log resetting the desired balance or including details of which shards had failures?
I was thinking more of the latter, otherwise it is hard to go back in the logs to find the actual shards. It would be great if the log message had a list of shards (potentially truncated if the list is too large) that have 5 failures. What do you think?
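Concretely, the kind of message being asked for might look like this (a hedged sketch only; the wording, the logger, the shardsAtMaxFailures collection, and the truncation limit of 10 are assumptions, not the PR's final code):

// Hedged sketch of a truncated per-shard log line; all names and limits here are illustrative.
List<ShardId> sample = shardsAtMaxFailures.stream().limit(10).toList();
logger.info(
    "resetting allocation failure counters for [{}] shard(s) that reached the max retries; sample: {}",
    shardsAtMaxFailures.size(),
    sample
);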
Yes, I think that's helpful too. I'll add it in this PR.
...alClusterTest/java/org/elasticsearch/xpack/shutdown/AllocationFailuresResetOnShutdownIT.java (outdated, resolved)
assertBusy(() -> {
    var stateAfterNodeJoin = internalCluster().clusterService().state();
    var relocatedShard = stateAfterNodeJoin.routingTable().index("index1").shard(0).primaryShard();
    assertThat(relocatedShard.relocationFailureInfo().failedRelocations(), Matchers.lessThan(maxAttempts));
Probably for my own knowledge: why does this assert failedRelocations < maxAttempts instead of failedRelocations == 0?
What we care about is the reset of the counter, so anything below max would do. I don't see a reason to be too strict here, since the test would be extra brittle if, for whatever reason, it takes more than one attempt to allocate/relocate.
...alClusterTest/java/org/elasticsearch/xpack/shutdown/AllocationFailuresResetOnShutdownIT.java (outdated, resolved)
The PR looks good to me. I have only minor comments. I'll review it again if you are adding the logging to it. Alternatively, I am also ready to approve it if you prefer the logging to be a separate PR. Please let me know. Thanks!
Thanks. Since we've already gotten into both points here, I'll just add it here and ping you. Didn't have time to get back to this.
    return false;
}

public void resetFailedCounter(RoutingAllocation allocation) {
I've chosen to add the logging here, so that it can produce some reliable information, and it also integrates with the existing code that goes through the failures. SETTING_ALLOCATION_MAX_RETRY is an index-scoped setting, so I'm not referring to any specific value, just picking the shards that have reached max. Anything under max is not mentioned, as was the case before.
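For context, SETTING_ALLOCATION_MAX_RETRY is defined on MaxRetryAllocationDecider and is resolved against each index's own settings. A minimal sketch of that per-shard resolution (not the PR's exact code, and assuming the single-project Metadata#getIndexSafe accessor is available here):

// Sketch: resolve the index-scoped retry limit for the shard being inspected (illustrative only).
final IndexMetadata indexMetadata = allocation.metadata().getIndexSafe(shardRouting.index());
final int maxRetry = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(indexMetadata.getSettings());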
LGTM
Thanks for the iterations! Test cases look very solid 👍
}

public void resetFailedCounter(RoutingAllocation allocation) {
    final var observer = allocation.changes();
Nit: can we keep the old name of routingChangesObserver?
I very much prefer observer, since it returns a RoutingChangesObserver.
);
if (failedAllocations >= maxRetry) {
    shardsWithMaxFailedAllocations++;
    if (topShardIdsWithFailedAllocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
Nit: I think we can check the size along with failedAllocations > 0 to short-circuit a bit earlier and avoid resolving the setting value.
I don't think so: shardsWithMaxFailedAllocations counts all of the ones that have reached max, while topShardIdsWithFailedAllocations stores only a portion of the shard IDs.
if (topShardIdsWithFailedRelocations.size() <= MAX_SHARDS_IN_LOG_MSG) {
    topShardIdsWithFailedRelocations.add(shardRouting.shardId());
}
Similar nit: checking the size before resolving the setting value would be my preference.
    "Resetting failure counter for %d shard(s) that have reached their max allocation retires (%s)";
public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
    "Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
Nit:
Suggested change:
-     "Resetting failure counter for %d shard(s) that have reached their max allocation retires (%s)";
- public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
-     "Resetting failure counter for %d shard(s) that have reached their max relocation retries (%s)";
+     "Resetting failure counter for [%d] shard(s) that have reached their max allocation retires (%s)";
+ public static final String RESET_FAILED_RELOCATION_COUNTER_LOG_MSG =
+     "Resetting failure counter for [%d] shard(s) that have reached their max relocation retries (%s)";
@elasticmachine update branch
When #119968 was merged into multi-project, we introduced a regression by inserting a call to `.getProject()` within the `RoutingNodes` class that was supposed to be multi-project-aware. This commit replaces those calls with `.indexMetadata` lookups.
We prevent retries of allocations/relocations once they see index.allocation.max_retries failed attempts (default 5). In #108987, we added resetting the allocation failure counters when a node joins the cluster. As discussed in the linked discussion, it would make sense to extend this reset to relocations as well, and to also consider node shutdown events. With this change we reset both allocation and relocation failures if a new node joins the cluster or a shutdown metadata is applied. The subset of shutdown events that we consider and how we track them is more or less copied from what was done for #106998; to me the logic seemed to make sense here too.

Closes ES-10492
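For reference, a minimal sketch of adjusting that per-index limit via a test-style client call (the index name and the value 5 are illustrative; index.allocation.max_retries itself is the real, documented setting key behind SETTING_ALLOCATION_MAX_RETRY):

// Illustrative only: set the per-index retry limit before exercising the reset behaviour.
client().admin()
    .indices()
    .prepareUpdateSettings("index1")
    .setSettings(Settings.builder().put(MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.getKey(), 5))
    .get();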