Description
@aphyr recently discovered this resilience issue [https://github.com/crate/crate/issues/3711] while running the Jepsen test suite against Crate.
I built an integration test (based on current ES master) [https://github.com/crate/elasticsearch/commit/41ed5ebe7304710fda4de4e69479e17081042c38] from the relevant Jepsen code, using your nice network partition simulation helper. With it I was able to reproduce this error not only with Crate but also with plain Elasticsearch.
I've reproduced this issue on ES 2.3, 5.0-alpha3, and master.
The longer the test runs, the more often it fails; with the current default runtime of 180 seconds it fails almost always on my machine. (The relevant Jepsen test runs for 360 seconds.)
Currently I have no real idea why this is happening. My guess is that some reads return a stale version value, but I have not yet figured out how or why.
I've also run this scenario on a single node with one shard, because my first guess was that this might not be related to network partitions, but that test never failed.
I've read the current ES resilience issues and couldn't find anything that seems related to this issue, but I'm not completely sure.