Description
Consider a primary shard P hosted on node p and its replica shard Q hosted on node q. If p is isolated from the cluster (e.g., through node failure, a flapping NIC, or an excessively long garbage collection pause), indexing operations can continue on q after Q is promoted to primary; these indexing operations will be acknowledged to the requesting clients. If q is subsequently isolated before p rejoins and before a new replica is assigned to another node in the cluster, the subsequent rejoining of p can currently lead to P being promoted to primary again. The indexing operations acknowledged by q will be lost.
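The sequence of events can be traced with a toy model. This is a minimal sketch, not Elasticsearch code: the `Copy` class and the `naive_promote` function are hypothetical names, and "promote whichever copy is reachable" is an assumed simplification of the behavior described here.

```python
from dataclasses import dataclass, field


@dataclass
class Copy:
    """A shard copy on one node (hypothetical model, not an Elasticsearch class)."""
    node: str
    docs: list = field(default_factory=list)
    reachable: bool = True


def naive_promote(copies):
    """Promote the first reachable copy, with no staleness check (assumed behavior)."""
    for copy in copies:
        if copy.reachable:
            return copy
    return None


P = Copy("p", docs=["doc1"])
Q = Copy("q", docs=["doc1"])

# p is isolated; Q is promoted and acknowledges a new write.
P.reachable = False
primary = naive_promote([P, Q])
primary.docs.append("doc2")  # acknowledged to the client
acknowledged = list(primary.docs)

# q is isolated before p rejoins; then p rejoins and P is promoted again.
Q.reachable = False
P.reachable = True
primary = naive_promote([P, Q])

# Writes that were acknowledged but are missing from the new primary.
lost = [d for d in acknowledged if d not in primary.docs]
print(primary.node, lost)
```

In this model the final primary is the copy on p, and `doc2`, which was acknowledged while Q was primary, is absent from it.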
A mechanism needs to be built to prevent the automatic promotion of a stale shard in such a scenario and instead only promote a non-stale shard to primary (if a non-stale shard is available). The only scenario in which a stale shard should be promoted to primary is through manual intervention by a system operator (e.g., in cases when q suffers a total hardware failure).
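One possible shape for such a mechanism is sketched below. It is an illustration under stated assumptions, not a proposed implementation: the `ShardRouting` class, the `in_sync` set, and the `force_stale` parameter are all hypothetical names. The sketch assumes the cluster records which copies hold every acknowledged write, and, as a simplification, treats any copy that is unreachable at promotion time as stale from then on.

```python
class ShardRouting:
    """Tracks which copies hold every acknowledged write (hypothetical sketch)."""

    def __init__(self, nodes):
        self.in_sync = set(nodes)    # copies assumed to hold all acknowledged writes
        self.reachable = set(nodes)  # copies currently part of the cluster
        self.primary = None

    def isolate(self, node):
        self.reachable.discard(node)
        if self.primary == node:
            self.primary = None

    def rejoin(self, node):
        self.reachable.add(node)

    def promote(self, force_stale=None):
        """Pick a primary; a stale copy is only accepted via force_stale."""
        candidates = sorted(self.reachable & self.in_sync)
        if candidates:
            self.primary = candidates[0]
        elif force_stale is not None and force_stale in self.reachable:
            # Manual operator override: knowingly accept losing acknowledged writes.
            self.primary = force_stale
            self.in_sync = {force_stale}
        else:
            # Only stale copies are reachable: refuse to promote automatically.
            self.primary = None
            return None
        # Simplification: unreachable copies will miss later writes, so mark them stale.
        self.in_sync &= self.reachable
        return self.primary


routing = ShardRouting(["p", "q"])
first = routing.promote()                # initial primary
routing.isolate("p")
second = routing.promote()               # Q takes over; p is now stale
routing.isolate("q")
routing.rejoin("p")
automatic = routing.promote()            # refused: only a stale copy is reachable
manual = routing.promote(force_stale="p")  # operator explicitly accepts data loss
```

With this guard, the rejoining copy on p is not promoted automatically, because it left the in-sync set when Q took over. The shard stays without a primary until q returns or an operator forces the stale copy.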
Relates #10933