
Include clusterApplyListener in long cluster apply warnings #120087


Merged

Conversation

nicktindall
Contributor

Expands the long cluster apply warnings to include execution of clusterApplyListener.(onResponse|onFailure).

Relates: ES-10249

public long rawRelativeTimeInMillis() {
    assertThat(Thread.currentThread().getName(), containsString(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME));
    return currentTimeMillis;
}
Contributor Author

I notice that we use rawRelativeTimeInMillis in the Recorder, but the threshold is on the order of seconds. I wonder whether we could use relativeTimeInMillis instead (to reduce the calls to System.nanoTime()).

Member

I guess if the cluster is under heavy GC, or the cachedTimer thread is experiencing starvation, the elapsed calculation will be off if we use the cached timer. This can be detected by the log message "timer thread slept ...", but it's a bit indirect. Not sure if that was the intention behind using rawRelativeTimeInMillis, or maybe I am just over-explaining it. I noticed we use it in places like MasterService, In/OutBoundHandlers and HttpTransport etc., but not in places such as PersistedClusterStateService or FsHealthService. Not sure whether those are intentional or accidental either.

Contributor

The cached timer only updates every 200ms or so, and although the total threshold is many seconds in length, many of the individual steps we record should take much less than 200ms. It's not a huge deal to call System::nanoTime here.
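
For illustration only, a minimal standalone sketch (plain Java, not Elasticsearch code; the ~200ms refresh interval and all names here are assumptions for the demo) of why a clock cached at that resolution cannot time sub-200ms steps, while reading System.nanoTime() directly can:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a clock refreshed every ~200ms cannot resolve a ~50ms step,
// whereas System.nanoTime() can. Class and field names are made up for this demo.
public class CachedClockDemo {
    public static void main(String[] args) throws Exception {
        final AtomicLong cachedMillis = new AtomicLong(System.nanoTime() / 1_000_000);

        // Background refresher, analogous in spirit to the cached timer behind relativeTimeInMillis().
        final Thread refresher = new Thread(() -> {
            try {
                while (true) {
                    cachedMillis.set(System.nanoTime() / 1_000_000);
                    Thread.sleep(200);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        refresher.setDaemon(true);
        refresher.start();

        // Time a ~50ms step with both clocks.
        final long cachedStart = cachedMillis.get();
        final long rawStart = System.nanoTime();
        Thread.sleep(50);
        final long cachedElapsed = cachedMillis.get() - cachedStart;
        final long rawElapsed = (System.nanoTime() - rawStart) / 1_000_000;

        // The cached measurement typically reads 0ms or 200ms; the raw one reads roughly 50ms.
        System.out.println("cached: " + cachedElapsed + "ms, raw: " + rawElapsed + "ms");
    }
}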

@nicktindall nicktindall marked this pull request as ready for review January 14, 2025 06:15
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jan 14, 2025
final ClusterState newClusterState;
try {
    try (Releasable ignored = stopWatch.record("running task [" + source + ']')) {
        newClusterState = updateFunction.apply(previousClusterState);
    }
} catch (Exception e) {
    timedListener.onFailure(e);
Member

If the listener throws, the following code will not run. Probably shouldn't happen in practice. But maybe still worthwhile to wrap in try-finally?

I also wonder whether we should add more details to the log message for the time spent on applying the cluster state and calling the listener?

Contributor Author

@nicktindall nicktindall Jan 14, 2025

The TimedListener will measure the time spent in the listener. e.g. from the tests

cluster state applier task [test4] took [36s] which is above the warn threshold of [30s]: [running task [test4]] took [0ms], [listener.onResponse] took [36000ms]

Is that what you meant by "time spent ... calling the listener"?

I will look at using one of the ActionListener.... utils or base classes to make exception handling more robust.

Contributor

I think Yang is right: we just need a try/finally here. TBH it's an error for this listener to throw; we should probably assert that too.

Contributor

Actually there's no real need to propagate the exception to the caller either; it just bubbles up to the unhandled exception handler, which logs it and drops it. We may as well catch and log (and assert) in TimedListener.
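
A rough sketch of that shape (the Recorder/Releasable types and field names follow the surrounding diff; this is an assumption, not the merged code):

@Override
public void onFailure(Exception e) {
    try (Releasable ignored = recorder.record("listener.onFailure")) {
        listener.onFailure(e);
    } catch (Exception inner) {
        // Listeners are not expected to throw: fail loudly in tests, log in production,
        // but don't rethrow to the unhandled exception handler.
        assert false : inner;
        logger.error("exception thrown by listener while notifying of failure", inner);
    }
}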

Contributor Author

@nicktindall nicktindall Jan 14, 2025

The ClusterApplyActionListener already wraps the provided listener to prevent onFailure exceptions from propagating, but I added similar logic to TimedListener#onFailure just in case that changes and we end up with a less safe listener being passed in.

@nicktindall nicktindall added the >enhancement and :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection) labels Jan 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label and removed the needs:triage (Requires assignment of a team area label) label Jan 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    inner.addSuppressed(e);
Contributor

I think generally we'd suppress the second-thrown exception and propagate the first.
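
For example (a sketch only, with the same assumed types and names as above), keeping the original failure e primary and attaching the listener's exception as suppressed:

try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    e.addSuppressed(inner);   // suppress the second-thrown exception on the first
    assert false : e;
    logger.error("exception thrown by listener while notifying of failure", e);
}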

Contributor Author

Resolved in a71181e

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@nicktindall nicktindall merged commit 0a98bf8 into elastic:main Jan 14, 2025
16 checks passed
@nicktindall nicktindall deleted the include_listener_in_long_apply_warning branch January 14, 2025 21:59
Labels
:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), >enhancement, Team:Distributed Coordination (Meta label for Distributed Coordination team), v9.0.0