
Include clusterApplyListener in long cluster apply warnings #120087


Merged

Conversation

nicktindall
Contributor

Expands the long cluster apply warnings to include execution of clusterApplyListener.(onResponse|onFailure).

Relates: ES-10249

public long rawRelativeTimeInMillis() {
    assertThat(Thread.currentThread().getName(), containsString(ClusterApplierService.CLUSTER_UPDATE_THREAD_NAME));
    return currentTimeMillis;
}
Contributor Author

I notice that we use rawRelativeTimeInMillis in the Recorder, but the threshold is on the order of seconds. I wonder whether we could use relativeTimeInMillis instead (to reduce the calls to System.nanoTime()).

Member

I guess if the cluster is under heavy GC, or the cachedTimer thread is experiencing starvation, the elapsed calculation will be off if we use the cached timer. This can be detected by the log message "timer thread slept ...", but it's a bit indirect. Not sure if that was the intention behind using rawRelativeTimeInMillis, or maybe I am just over-explaining it. I noticed we use it in places like MasterService, In/OutBoundHandlers and HttpTransport etc., but not in places such as PersistedClusterStateService or FsHealthService. Not sure whether those are intentional or accidental either.

Contributor

The cached timer only updates every 200ms or so, and although the total threshold is many seconds in length, many of the individual steps we record should take much less than 200ms. It's not a huge deal to call System::nanoTime here.
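
For illustration only, a minimal standalone sketch (plain Java, not Elasticsearch code; the ~200ms refresh interval and all names here are assumptions for the demo) of why a clock cached at that resolution cannot time sub-200ms steps, while reading System.nanoTime() directly can:

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: a clock refreshed every ~200ms cannot resolve a ~50ms step,
// whereas System.nanoTime() can. Class and field names are made up for this demo.
public class CachedClockDemo {
    public static void main(String[] args) throws Exception {
        final AtomicLong cachedMillis = new AtomicLong(System.nanoTime() / 1_000_000);

        // Background refresher, analogous in spirit to the cached timer behind relativeTimeInMillis().
        final Thread refresher = new Thread(() -> {
            try {
                while (true) {
                    cachedMillis.set(System.nanoTime() / 1_000_000);
                    Thread.sleep(200);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        refresher.setDaemon(true);
        refresher.start();

        // Time a ~50ms step with both clocks.
        final long cachedStart = cachedMillis.get();
        final long rawStart = System.nanoTime();
        Thread.sleep(50);
        final long cachedElapsed = cachedMillis.get() - cachedStart;
        final long rawElapsed = (System.nanoTime() - rawStart) / 1_000_000;

        // The cached measurement typically reads 0ms or 200ms; the raw one reads roughly 50ms.
        System.out.println("cached: " + cachedElapsed + "ms, raw: " + rawElapsed + "ms");
    }
}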

@nicktindall nicktindall marked this pull request as ready for review January 14, 2025 06:15
@elasticsearchmachine elasticsearchmachine added the needs:triage (Requires assignment of a team area label) label Jan 14, 2025
final ClusterState newClusterState;
try {
    try (Releasable ignored = stopWatch.record("running task [" + source + ']')) {
        newClusterState = updateFunction.apply(previousClusterState);
    }
} catch (Exception e) {
    timedListener.onFailure(e);
Member

If the listener throws, the following code will not run. Probably shouldn't happen in practice. But maybe still worthwhile to wrap in try-finally?

I also wonder whether we should add more details to the log message for the time spent on applying the cluster state and calling the listener?

Contributor Author

@nicktindall nicktindall Jan 14, 2025

The TimedListener will measure the time spent in the listener. e.g. from the tests

cluster state applier task [test4] took [36s] which is above the warn threshold of [30s]: [running task [test4]] took [0ms], [listener.onResponse] took [36000ms]

Is that what you meant by "time spent ... calling the listener"?

I will look at using one of the ActionListener.... utils or base classes to make exception handling more robust.

Contributor

I think Yang is right: we just need a try/finally here. TBH it's an error for this listener to throw; we should probably assert that too.

Contributor

Actually there's no real need to propagate the exception to the caller either; it just bubbles up to the unhandled exception handler, which logs it and drops it. We may as well catch and log (and assert) in TimedListener.
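
A rough sketch of that shape (the Recorder/Releasable types and field names follow the surrounding diff; this is an assumption, not the merged code):

@Override
public void onFailure(Exception e) {
    try (Releasable ignored = recorder.record("listener.onFailure")) {
        listener.onFailure(e);
    } catch (Exception inner) {
        // Listeners are not expected to throw: fail loudly in tests, log in production,
        // but don't rethrow to the unhandled exception handler.
        assert false : inner;
        logger.error("exception thrown by listener while notifying of failure", inner);
    }
}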

Contributor Author

@nicktindall nicktindall Jan 14, 2025

The ClusterApplyActionListener already wraps the provided listener to prevent onFailure exceptions from propagating, but I added similar logic to TimedListener#onFailure just in case that changes and we end up with a less safe listener being passed in.

@nicktindall nicktindall added the >enhancement and :Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection) labels Jan 14, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (Meta label for Distributed Coordination team) label and removed the needs:triage (Requires assignment of a team area label) label Jan 14, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @nicktindall, I've created a changelog YAML for you.

try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    inner.addSuppressed(e);
Contributor

I think generally we'd suppress the second-thrown exception and propagate the first.
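
For example (a sketch only, with the same assumed types and names as above), keeping the original failure e primary and attaching the listener's exception as suppressed:

try (Releasable ignored = recorder.record("listener.onFailure")) {
    listener.onFailure(e);
} catch (Exception inner) {
    e.addSuppressed(inner);   // suppress the second-thrown exception on the first
    assert false : e;
    logger.error("exception thrown by listener while notifying of failure", e);
}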

Contributor Author

Resolved in a71181e

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@nicktindall nicktindall merged commit 0a98bf8 into elastic:main Jan 14, 2025
16 checks passed
@nicktindall nicktindall deleted the include_listener_in_long_apply_warning branch January 14, 2025 21:59
Labels
:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), >enhancement, Team:Distributed Coordination (Meta label for Distributed Coordination team), v9.0.0