
ConnectTransportException returns retryable BAD_GATEWAY #118681


Conversation

DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Dec 13, 2024

I took a spin through the code using the new ConnectTransportException and saw this error case. That one doesn't seem like it should be a retryable error 🤔 Is this status code change safe, or should we change that code to a different exception type?

Related ES-10214
Closes #118320
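
For context, the kind of mapping under discussion: Elasticsearch exceptions choose the HTTP code returned to REST clients by overriding ElasticsearchException#status(). Below is a minimal, hypothetical sketch of that mechanism (the class name is invented and this is not the actual diff in this PR), assuming the Elasticsearch server classes are on the classpath:

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical exception type used only to illustrate how a REST status is
// attached to an exception; the real change concerns ConnectTransportException.
public class RemoteConnectionFailureExampleException extends ElasticsearchException {

    public RemoteConnectionFailureExampleException(String msg) {
        super(msg);
    }

    @Override
    public RestStatus status() {
        // Previously such failures surfaced as 500 INTERNAL_SERVER_ERROR;
        // returning BAD_GATEWAY maps them to 502 instead.
        return RestStatus.BAD_GATEWAY;
    }
}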

@DiannaHohensee added the >enhancement, :Distributed Coordination/Network (Http and internode communication implementations), and Team:Distributed Coordination (Meta label for Distributed Coordination team) labels on Dec 13, 2024
@DiannaHohensee DiannaHohensee self-assigned this Dec 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DiannaHohensee, I've created a changelog YAML for you.

@mhl-b
Contributor

mhl-b commented Dec 14, 2024

I think that should be IllegalStateException, because we already have a remote connection to another cluster.

Why does it need a REST status attached? Do we send it back to the REST handler and client?

@DiannaHohensee
Contributor Author

Thanks for taking a look and confirming it seems strange, @mhl-b. I'll take a closer look at Proxy* -- I'm not familiar with it, atm.

I forgot to reference the ticket, but this relates to ES-10214 -- I just updated my description to include it. I don't know myself how the HTTP error code propagates back to the caller, but other folks believe that it does. I would think it could reasonably end up in a stack trace of 'caused by' errors, if not directly.

@DiannaHohensee
Contributor Author

Hmm. I think IllegalStateException would indicate ES has done something wrong, whereas ES looks like it throws ConnectTransportException when the information from the user doesn't match up with reality: for example, the user configures a cluster address and a cluster name, ES verifies that the connected cluster's name matches, and it doesn't. I think we need a "user gave us a bad parameter" non-retryable exception?

There is some more code throwing the ConnectTransportException, and I think the idea is that we connected successfully to some server, but the type of server is not what ES expected to find.
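
To make the scenario concrete, here is a hedged sketch of the kind of post-connect validation being described; the class, method, and message are invented for illustration and are not the actual Elasticsearch code:

import org.elasticsearch.cluster.ClusterName;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.transport.ConnectTransportException;

// Illustrative only: check that the cluster we actually connected to is the one
// the user configured, and fail with ConnectTransportException otherwise.
final class RemoteClusterNameValidator {
    static void ensureExpectedClusterName(String expectedClusterName,
                                          DiscoveryNode remoteNode,
                                          ClusterName actualClusterName) {
        if (expectedClusterName.equals(actualClusterName.value()) == false) {
            // The TCP connection itself succeeded; the mismatch is a configuration
            // problem, which is why retrying the request is unlikely to help.
            throw new ConnectTransportException(remoteNode,
                "expected to connect to cluster [" + expectedClusterName
                    + "] but the node belongs to [" + actualClusterName.value() + "]");
        }
    }
}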

@DiannaHohensee
Contributor Author

After reviewing the ConnectTransportException uses, I think we need a new exception type, something like InvalidRemoteNodeException, to replace the ConnectTransportException use cases where the node connection is successful but the node type is unexpected. It doesn't seem like those error cases should be retryable.

@DaveCTurner for thoughts.
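
A hedged sketch of what the proposed InvalidRemoteNodeException might look like; the class does not exist in Elasticsearch, the name comes from the comment above, and the chosen status is only one possibility:

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical: an exception for "we reached a node, but it is not the kind of
// node we expected", which callers should not treat as retryable.
public class InvalidRemoteNodeException extends ElasticsearchException {

    public InvalidRemoteNodeException(String msg, Object... args) {
        super(msg, args);
    }

    @Override
    public RestStatus status() {
        // 500 keeps it out of "retry on 502" client policies; the right code
        // was still under discussion in this thread.
        return RestStatus.INTERNAL_SERVER_ERROR;
    }
}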

@DaveCTurner
Contributor

My reading of the HTTP spec is that we are indeed acting as a gateway in this situation:

A "gateway" (a.k.a. "reverse proxy") is an intermediary that acts as an origin server for the outbound connection but translates received requests and forwards them inbound to another server or servers [...] A gateway communicates with inbound servers using any protocol that it desires [...]

Moreover, if the remote node turns out to belong to a cluster with an unexpected name then 502 Bad Gateway is a valid response code:

The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

There are other ways that a bad in-cluster config might lead to a ConnectTransportException or a subclass, and indeed I can think of ways that this specific failure might go away on a retry. In any case, the effect on client retries is not really relevant here, at least not as much as following the HTTP spec. The whole notion of triggering client behaviour based solely on the HTTP response code is fundamentally doomed anyway; there are just too many different outcomes to map onto the very limited options available to us in the HTTP spec.

It's definitely not a client-side failure so we have to return some kind of 5xx error, and 502 seems more appropriate than 500. If clients choose to retry on all 502s then 🤷 that's not wrong I guess.
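
To make that trade-off concrete, here is a hedged sketch of the coarse, status-code-only retry policy a client might apply; it is illustrative and not part of any Elasticsearch client:

import java.util.Set;

// Illustrative client-side policy: with only the HTTP status to go on, a client
// typically retries "transient-looking" 5xx codes and gives up on the rest.
final class NaiveStatusRetryPolicy {
    private static final Set<Integer> RETRYABLE = Set.of(502, 503, 504);

    static boolean shouldRetry(int httpStatus) {
        // 500 is treated as "unknown server bug, do not retry", while 502/503/504
        // are treated as connectivity or availability problems worth retrying.
        return RETRYABLE.contains(httpStatus);
    }
}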

@mhl-b
Contributor

mhl-b commented Dec 17, 2024

It would be nice to have an integration or YAML/REST test to show REST status propagation. I don't think the status is very important, 500 or 502, since we model REST over HTTP and there is no strict requirement to conform to the error codes; as David said, there are not enough of them. So clients still need to unwrap the 5xx HTTP response to make a decision, including whether to retry.
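
A hedged sketch of what such an end-to-end assertion might look like in a test derived from ESRestTestCase; the endpoint, index name, and remote-cluster setup are hypothetical and omitted, so this is a shape rather than a working test:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.test.rest.ESRestTestCase;

// Hypothetical test: issue a request expected to fail through a broken remote
// connection and assert that the failure surfaces to the REST client as HTTP 502.
public class RemoteStatusPropagationIT extends ESRestTestCase {

    public void testConnectFailurePropagatesAsBadGateway() {
        // "broken-remote" stands in for a misconfigured or unreachable remote
        // cluster; the setup that would make it fail is not shown here.
        Request request = new Request("GET", "/broken-remote:some-index/_search");
        ResponseException e = expectThrows(ResponseException.class,
            () -> client().performRequest(request));
        assertEquals(502, e.getResponse().getStatusLine().getStatusCode());
    }
}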

Contributor

@DaveCTurner DaveCTurner left a comment


LGTM except a couple of nits

@elasticsearchmachine
Collaborator

Hi @DiannaHohensee, I've updated the changelog YAML for you.

@DiannaHohensee
Contributor Author

It would be nice to have an integration or YAML/REST test to show REST status propagation. I don't think the status is very important, 500 or 502, since we model REST over HTTP and there is no strict requirement to conform to the error codes; as David said, there are not enough of them. So clients still need to unwrap the 5xx HTTP response to make a decision, including whether to retry.

In the related SDHE, Kibana was seeing message: "org.elasticsearch.transport.NodeDisconnectedException" and message: INTERNAL_SERVER_ERROR as the symptom. I suspect that's from the generic TransportService#onConnectionClosed method. I think it would be difficult to test this behavior end-to-end: I'm not aware of a related test suite where we set up ES servers and check HTTP codes. I think it would be more expedient to file a ticket to do that work, if we wanted it.

@DiannaHohensee
Contributor Author

The test failures match #118846, so they are not a blocker.

@DiannaHohensee DiannaHohensee merged commit 517abe4 into elastic:main Dec 17, 2024
14 of 16 checks passed
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2024
ConnectTransportException and its subclasses previously translated to an
INTERNAL_SERVER_ERROR HTTP 500 code. We are changing it to 502
BAD_GATEWAY so that users may choose to retry on connectivity
issues.

Related ES-10214
Closes elastic#118320
@pawankartik-elastic
Contributor

💚 All backports created successfully

Branch: 8.x

Questions? Please refer to the Backport tool documentation.

pawankartik-elastic pushed a commit to pawankartik-elastic/elasticsearch that referenced this pull request Dec 19, 2024

(cherry picked from commit 517abe4)
pawankartik-elastic added a commit that referenced this pull request Dec 23, 2024
…19146)


(cherry picked from commit 517abe4)

Co-authored-by: Dianna Hohensee <dianna.hohensee@elastic.co>
Labels
>enhancement, :Distributed Coordination/Network, Team:Distributed Coordination, v9.0.0
Successfully merging this pull request may close these issues: Change ConnectTransportException response code to 502