Include clusterApplyListener in long cluster apply warnings #120087
Conversation
public long rawRelativeTimeInMillis() {
    assertThat(Thread.currentThread().getName(), containsString(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME));
    return currentTimeMillis;
}
I notice that we use rawRelativeTimeInMillis in the Recorder, but the threshold is on the order of seconds. I wonder if we could use relativeTimeInMillis instead (to reduce the calls to System.nanoTime()).
I guess if the cluster is under heavy GC or the cachedTimer thread is experiencing starvation, the elapsed calculation will be off if we use the cached timer. This can be detected by the "timer thread slept ..." log message, but it's a bit indirect. Not sure if that was the intention behind using rawRelativeTimeInMillis, or maybe I am just over-explaining it. I noticed we use it in places like MasterService, the In/OutBoundHandlers and HttpTransport etc., but not in places such as PersistedClusterStateService or FsHealthService. Not sure whether those are intentional or accidental either.
The cached timer only updates every 200ms or so, and although the total threshold is many seconds in length, many of the individual steps we record should take much less than 200ms. It's not a huge deal to call System::nanoTime here.
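For illustration, here is a minimal standalone sketch of why the clock granularity matters. The StepRecorder shape and names below are invented for this example (they are not the actual Recorder in ClusterApplierService): with a fresh System.nanoTime() per read each step gets a millisecond-accurate duration, whereas a clock cached at ~200ms resolution would report most sub-200ms steps as 0ms, even though the overall warn threshold is many seconds.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the step recorder discussed above: record(...) reads the
// supplied clock when the step starts and again when the returned handle is run.
class StepRecorder {
    private final LongSupplier relativeTimeMillis;
    private final List<String> steps = new ArrayList<>();

    StepRecorder(LongSupplier relativeTimeMillis) {
        this.relativeTimeMillis = relativeTimeMillis;
    }

    Runnable record(String name) {
        final long start = relativeTimeMillis.getAsLong();
        return () -> steps.add(name + " took [" + (relativeTimeMillis.getAsLong() - start) + "ms]");
    }

    List<String> steps() {
        return steps;
    }
}

class ClockGranularityDemo {
    public static void main(String[] args) throws InterruptedException {
        // Raw clock: a fresh System.nanoTime() on every read, so a 50ms step reads as roughly 50ms.
        StepRecorder raw = new StepRecorder(() -> TimeUnit.NANOSECONDS.toMillis(System.nanoTime()));
        Runnable step = raw.record("running task [demo]");
        Thread.sleep(50); // well under the ~200ms refresh interval of a cached clock
        step.run();
        System.out.println(raw.steps());

        // With a cached clock that only refreshes every ~200ms, both reads would likely
        // return the same value and the step would be reported as [0ms].
    }
}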
final ClusterState newClusterState;
try {
    try (Releasable ignored = stopWatch.record("running task [" + source + ']')) {
        newClusterState = updateFunction.apply(previousClusterState);
    }
} catch (Exception e) {
    timedListener.onFailure(e);
If the listener throws, the following code will not run. That probably shouldn't happen in practice, but maybe it's still worthwhile to wrap it in a try-finally?
I also wonder whether we should add more detail to the log message about the time spent applying the cluster state and calling the listener?
The TimedListener will measure the time spent in the listener, e.g. from the tests:
cluster state applier task [test4] took [36s] which is above the warn threshold of [30s]: [running task [test4]] took [0ms], [listener.onResponse] took [36000ms]
Is that what you meant by "time spent ... calling the listener"?
I will look at using one of the ActionListener.... utils or base classes to make exception handling more robust.
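For reference, a rough self-contained sketch of the wrapping described here. The ApplyListener, Recorder, and Releasable interfaces below are simplified stand-ins invented so the example compiles on its own; the real code wraps Elasticsearch's listener and the Recorder discussed above. Whichever completion path runs, its duration is recorded under a named step so it shows up in the "took [...]" breakdown quoted above.

// Simplified stand-ins; these are not the actual Elasticsearch types.
interface Releasable extends AutoCloseable {
    @Override
    void close(); // no checked exception, so try-with-resources needs no catch clause
}

interface Recorder {
    Releasable record(String step); // closing the handle ends the named step
}

interface ApplyListener {
    void onResponse(Void unused);

    void onFailure(Exception e);
}

// Times whichever completion path runs, so the warning can report lines such as
// "[listener.onResponse] took [36000ms]" next to "[running task [test4]] took [0ms]".
final class TimedListener implements ApplyListener {
    private final ApplyListener delegate;
    private final Recorder recorder;

    TimedListener(ApplyListener delegate, Recorder recorder) {
        this.delegate = delegate;
        this.recorder = recorder;
    }

    @Override
    public void onResponse(Void unused) {
        try (Releasable ignored = recorder.record("listener.onResponse")) {
            delegate.onResponse(unused);
        }
    }

    @Override
    public void onFailure(Exception e) {
        try (Releasable ignored = recorder.record("listener.onFailure")) {
            delegate.onFailure(e);
        }
    }
}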
I think Yang is right, we just need a try/finally here. TBH it's an error for this listener to throw, we should probably assert that too.
Actually no real need to propagate the exception to the caller either, it just bubbles up to the unhandled exception handler which logs it and drops it. We may as well catch and log (and assert) in TimedListener.
The ClusterApplyActionListener already wraps the provided listener to prevent onFailure exceptions from propagating, but I added similar logic to TimedListener#onFailure just in case that changes and we end up with a less safe listener being passed in.
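As a standalone sketch of that defensive handling (the SafeFailureNotifier name, logger setup, and message text below are invented for the example; they are not the actual TimedListener or ClusterApplyActionListener code): a listener handed to the cluster applier shouldn't throw, so if it does we log the failure and trip an assertion rather than letting it reach the thread's unhandled-exception handler.

import java.util.logging.Level;
import java.util.logging.Logger;

// Defensive completion of a failure listener: it is a bug for the listener to throw, so rather
// than letting the exception bubble up to the thread's unhandled-exception handler we log it
// (keeping the original failure attached) and trip an assertion when assertions are enabled.
final class SafeFailureNotifier {
    private static final Logger logger = Logger.getLogger(SafeFailureNotifier.class.getName());

    interface FailureListener {
        void onFailure(Exception e);
    }

    static void notifyFailure(FailureListener listener, Exception e) {
        try {
            listener.onFailure(e);
        } catch (Exception inner) {
            inner.addSuppressed(e); // keep the original failure visible in the log output
            logger.log(Level.WARNING, "listener threw while handling a failure", inner);
            assert false : inner; // listeners passed to the applier must not throw
        }
    }

    public static void main(String[] args) {
        notifyFailure(
            failure -> { throw new IllegalStateException("misbehaving listener"); },
            new Exception("original cluster apply failure")
        );
        // With assertions disabled the throw is logged and swallowed, so we reach this line;
        // with -ea the assert above fails instead, surfacing the bug in tests.
        System.out.println("still running");
    }
}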
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Hi @nicktindall, I've created a changelog YAML for you.
try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    inner.addSuppressed(e);
I think generally we'd suppress the second-thrown exception and propagate the first.
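A tiny illustration of that convention (names invented for the example): the first exception stays primary and the one thrown while handling it is attached via addSuppressed, rather than the other way round as in the snippet above.

// Demonstrates the usual suppression order: keep the first exception as the one that
// propagates and attach the secondary failure to it, instead of inner.addSuppressed(e).
final class SuppressionOrderDemo {
    static void handleFailure(RuntimeException original, Runnable failureHandler) {
        try {
            failureHandler.run();
        } catch (RuntimeException inner) {
            original.addSuppressed(inner); // secondary failure travels along as suppressed
            throw original;                // the first-thrown exception remains primary
        }
    }

    public static void main(String[] args) {
        try {
            handleFailure(
                new RuntimeException("original failure"),
                () -> { throw new IllegalStateException("failure handler also threw"); }
            );
        } catch (RuntimeException e) {
            // Prints "original failure" with "failure handler also threw" under "Suppressed:".
            e.printStackTrace();
        }
    }
}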
Resolved in a71181e
LGTM
Expands the long cluster apply warnings to include execution of clusterApplyListener.(onResponse|onFailure).
Relates: ES-10249