Remove INDEX_REFRESH_BLOCK after index becomes searchable #120807

fcofdez · 2025-01-24T15:43:32Z

This commit enhances the ShardStartedClusterStateTaskExecutor by
introducing functionality to automatically remove the
INDEX_REFRESH_BLOCK once an index becomes searchable.

The change ensures search availability by checking that at least one
copy of each searchable shard is available whenever an unpromotable
shard is started. Once this condition is met, the INDEX_REFRESH_BLOCK
is removed.

Closes ES-10278
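
The check described above can be pictured with a minimal, self-contained sketch. It is not the code from this PR: `RefreshBlockSketch`, `ShardCopy`, `isIndexSearchable`, and the set-based representation of blocks are hypothetical stand-ins for the real cluster state, routing table, and `ClusterBlocks` types, and "searchable" here stands in for an unpromotable (search-only) shard copy.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model; none of these types are Elasticsearch classes.
final class RefreshBlockSketch {

    /** One copy of a shard. "searchable" stands in for an unpromotable (search-only) copy. */
    static final class ShardCopy {
        final String index;
        final int shardId;
        final boolean searchable;
        final boolean started;

        ShardCopy(String index, int shardId, boolean searchable, boolean started) {
            this.index = index;
            this.shardId = shardId;
            this.searchable = searchable;
            this.started = started;
        }
    }

    /** True when every shard id of the index has at least one started searchable copy. */
    static boolean isIndexSearchable(String index, int numberOfShards, List<ShardCopy> routingTable) {
        Set<Integer> covered = new HashSet<>();
        for (ShardCopy copy : routingTable) {
            if (copy.index.equals(index) && copy.searchable && copy.started) {
                covered.add(copy.shardId);
            }
        }
        for (int shardId = 0; shardId < numberOfShards; shardId++) {
            if (!covered.contains(shardId)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Returns a copy of the blocked-index set with the block lifted from every index that
     * just had a search shard started and is now fully searchable. The input set is not modified.
     */
    static Set<String> maybeRemoveRefreshBlocks(
        Set<String> blockedIndices,
        Set<String> indicesWithSearchShardsStarted,
        Map<String, Integer> shardCounts,
        List<ShardCopy> routingTable
    ) {
        Set<String> remaining = new HashSet<>(blockedIndices);
        for (String index : indicesWithSearchShardsStarted) {
            Integer numberOfShards = shardCounts.get(index);
            if (numberOfShards != null
                && remaining.contains(index)
                && isIndexSearchable(index, numberOfShards, routingTable)) {
                remaining.remove(index);
            }
        }
        return remaining;
    }
}
```

In this sketch only the indices that had a search shard started in the current batch are re-examined, which mirrors the idea of running the check "whenever an unpromotable shard is started" rather than scanning every blocked index.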

elasticsearchmachine · 2025-01-24T15:43:56Z

Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing)

fcofdez · 2025-01-24T15:44:11Z

I'll enhance the integration tests in a separate PR.

elasticsearchmachine · 2025-01-24T15:44:19Z

Hi @fcofdez, I've created a changelog YAML for you.

tlrx · 2025-01-27T10:23:41Z

server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java

+            ClusterBlocks.Builder clusterBlocksBuilder = null;
+            for (Index indexWithUnpromotableShardsStarted : indicesWithUnpromotableShardsStarted) {
+                String indexName = indexWithUnpromotableShardsStarted.getName();
+                assert clusterState.blocks().hasIndexBlock(indexName, INDEX_REFRESH_BLOCK);


nit:

Suggested change

-                assert clusterState.blocks().hasIndexBlock(indexName, INDEX_REFRESH_BLOCK);
+                assert clusterState.blocks().hasIndexBlock(indexName, INDEX_REFRESH_BLOCK) : indexWithUnpromotableShardsStarted;

Done in b766d8e

tlrx · 2025-01-27T10:26:15Z

server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java

@@ -776,6 +789,35 @@ public ClusterState execute(BatchExecutionContext<StartedShardUpdateTask> batchE
            return maybeUpdatedState;
        }

+        private ClusterState maybeRemoveIndexRefreshBlocks(
+            ClusterState clusterState,


Should we document that the cluster state here is already updated with the STARTED state of the search shards?

Done in b766d8e

fcofdez · 2025-01-27T18:55:03Z

@elasticmachine update branch

fcofdez · 2025-01-27T20:59:15Z

@elasticmachine update branch

fcofdez · 2025-01-28T08:22:21Z

@elasticmachine update branch

tlrx

LGTM

fcofdez · 2025-01-28T09:43:01Z

@elasticmachine update branch

henningandersen · 2025-01-28T10:05:19Z

server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java

@@ -760,7 +772,10 @@ public ClusterState execute(BatchExecutionContext<StartedShardUpdateTask> batchE
                    maybeUpdatedState = ClusterState.builder(maybeUpdatedState).metadata(metadataBuilder).build();
                }

+                maybeUpdatedState = maybeRemoveIndexRefreshBlocks(maybeUpdatedState, indicesWithUnpromotableShardsStarted);


I am fine with this going in for now, it is a good improvement as is.

I wonder, though, whether we should follow up with a change that removes the refresh block in a subsequent cluster state update. That would avoid problems with a coordinator not seeing the shard started message but an indexing node seeing the refresh block removed, which I think could cause a refresh to be ack'ed and a subsequent search not seeing the refreshed data?

> That would avoid problems with a coordinator not seeing the shard started message but an indexing node seeing the refresh block removed, which I think could cause a refresh to be ack'ed and a subsequent search not seeing the refreshed data?

I believe that there's the same risk with the deferred removal, right? The cluster state update with the unblocked index might be applied a bit later on the search node anyway, unless you're proposing to react to the shard started events differently on the search nodes?

Also, the shard is marked as started as soon as it is ready to serve searches, so that should be fine too?

On the last part, I think we would let the coordinator of the search assume an empty result when refresh is blocked?

Maybe you are suggesting to only do so for unassigned shards, whereas we could send the shard search request to the initializing shard, which could then make the determination. I think that would be safe too.

On the first part, I would rely on cluster state updates trying hard to be applied everywhere before proceeding with the next update. This means we would be sure to see the shard started everywhere before the refresh block is removed.

Implicit in that is that I thought we would then forward search requests always when there is a started shard.

> we could send the shard search request to the initializing shard, which could then make the determination. I think that would be safe too.

This would come with the downside of risking sending search requests to search nodes where the shard is still unassigned. I think that is possible to handle.

But it seems there are two options and the trade-off is not entirely clear.

(but as mentioned I am ok with doing what is in this PR for now and then let us tackle that other piece).

> (but as mentioned I am ok with doing what is in this PR for now and then let us tackle that other piece).

Sounds good, I'll open a follow-up ticket for this.
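
For readers following the thread, the difference between the two orderings can be illustrated with a small, hypothetical model. This is a sketch, not Elasticsearch code: `DeferredUnblockSketch`, `State`, and the helper names are invented for illustration, and the model assumes, as the thread does, that a node may lag the latest published cluster state by one update.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two orderings discussed above; not Elasticsearch code.
final class DeferredUnblockSketch {

    /** Immutable snapshot of the two facts that matter for this discussion. */
    static final class State {
        final long version;
        final boolean searchShardStarted;
        final boolean refreshBlocked;

        State(long version, boolean searchShardStarted, boolean refreshBlocked) {
            this.version = version;
            this.searchShardStarted = searchShardStarted;
            this.refreshBlocked = refreshBlocked;
        }
    }

    /** Ordering in this PR: a single update both starts the shard and lifts the block. */
    static List<State> immediateRemoval(State initial) {
        List<State> published = new ArrayList<>();
        published.add(new State(initial.version + 1, true, false));
        return published;
    }

    /** Follow-up idea: the block is lifted only in the update after the "started" one. */
    static List<State> deferredRemoval(State initial) {
        State started = new State(initial.version + 1, true, initial.refreshBlocked);
        State unblocked = new State(started.version + 1, true, false);
        return Arrays.asList(started, unblocked);
    }

    /**
     * Walks the published states and, at the update that lifts the block, reports whether
     * the preceding state already showed the search shard as started. A node that is one
     * update behind would then still know about the started shard.
     */
    static boolean startedVisibleBeforeUnblock(State initial, List<State> published) {
        State previous = initial;
        for (State next : published) {
            if (previous.refreshBlocked && !next.refreshBlocked) {
                return previous.searchShardStarted;
            }
            previous = next;
        }
        return true; // the block was never lifted, so there is nothing to violate
    }
}
```

Under these assumptions, `immediateRemoval` fails the `startedVisibleBeforeUnblock` check while `deferredRemoval` passes it. The model does not capture the counterpoint raised in the thread that the unblocking update may itself be applied late on a search node.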

fcofdez · 2025-01-29T13:40:35Z

@elasticmachine update branch

…astic#120807)" (elastic#121427) This reverts commit ae0f1a6. The refresh block would be removed in a subsequent cluster state update instead of removing it immediately after an index is ready for searches. Closes ES-10697

fcofdez added the >enhancement, :Distributed Indexing/CRUD, and Team:Distributed Indexing labels Jan 24, 2025

fcofdez requested review from tlrx and henningandersen January 24, 2025 15:43

elasticsearchmachine added the v9.0.0 label Jan 24, 2025

Update docs/changelog/120807.yaml

542ffcb

fcofdez added 4 commits January 24, 2025 16:48

Use INDEX_ONLY role

5f4b677

Merge branch 'clear-refresh-block-once-unpromotables-available' of gi…

75ac414

…thub.com:fcofdez/elasticsearch into clear-refresh-block-once-unpromotables-available

Better assertions

ac4b395

Improve comment

32873ea

tlrx reviewed Jan 27, 2025

Review comments

b766d8e

fcofdez requested a review from tlrx January 27, 2025 16:14

fcofdez added 3 commits January 27, 2025 17:16

Merge remote-tracking branch 'origin/main' into clear-refresh-block-o…

8df2929

…nce-unpromotables-available

Assertion

40ed339

Assertion...

cdcfdb0

Merge branch 'main' into clear-refresh-block-once-unpromotables-avail…

05aef46

…able

Merge branch 'main' into clear-refresh-block-once-unpromotables-avail…

8667168

…able

Merge branch 'main' into clear-refresh-block-once-unpromotables-avail…

2a70c64

…able

tlrx approved these changes Jan 28, 2025

Merge branch 'main' into clear-refresh-block-once-unpromotables-avail…

8af82f5

…able

henningandersen reviewed Jan 28, 2025

elasticsearchmachine added the serverless-linked label Jan 28, 2025

elasticmachine and others added 3 commits January 29, 2025 14:40

Merge branch 'main' into clear-refresh-block-once-unpromotables-avail…

21be7a4

…able

Fix compilation

f612a75

More compilation failures

82e47b8

fcofdez merged commit ae0f1a6 into elastic:main Jan 29, 2025
16 checks passed
