
ConnectTransportException returns retryable BAD_GATEWAY #118681


Conversation

DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Dec 13, 2024

I took a spin through the code using the new ConnectTransportException and saw this error case. That one doesn't seem like it should be a retryable error 🤔 Is this status code change safe, or should we change that code to a different exception type?

Related ES-10214
Closes #118320
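
For context, the kind of mapping under discussion: Elasticsearch exceptions choose the HTTP code returned to REST clients by overriding ElasticsearchException#status(). Below is a minimal, hypothetical sketch of that mechanism (the class name is invented and this is not the actual diff in this PR), assuming the Elasticsearch server classes are on the classpath:

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical exception type used only to illustrate how a REST status is
// attached to an exception; the real change concerns ConnectTransportException.
public class RemoteConnectionFailureExampleException extends ElasticsearchException {

    public RemoteConnectionFailureExampleException(String msg) {
        super(msg);
    }

    @Override
    public RestStatus status() {
        // Previously such failures surfaced as 500 INTERNAL_SERVER_ERROR;
        // returning BAD_GATEWAY maps them to 502 instead.
        return RestStatus.BAD_GATEWAY;
    }
}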

@DiannaHohensee added the >enhancement, :Distributed Coordination/Network (Http and internode communication implementations), and Team:Distributed Coordination (Meta label for Distributed Coordination team) labels on Dec 13, 2024
@DiannaHohensee DiannaHohensee self-assigned this Dec 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @DiannaHohensee, I've created a changelog YAML for you.

@mhl-b
Contributor

mhl-b commented Dec 14, 2024

I think that should be IllegalStateException, because we already have a remote connection to another cluster.

Why does it need a REST status attached? Do we send it back to the REST handler and client?

@DiannaHohensee
Contributor Author

Thanks for taking a look and confirming it seems strange, @mhl-b. I'll take a closer look at Proxy* -- I'm not familiar with it, atm.

I forgot to reference the ticket, but this relates to ES-10214 -- I just updated my description to include it. I don't know myself how the HTTP error code propagates back to the caller, but other folks believe that it does. I would think it could reasonably end up in a stack trace of 'caused by' errors, if not directly.

@DiannaHohensee
Contributor Author

Hmm. I think IllegalStateException would indicate ES has done something wrong, whereas ES looks like it throws ConnectTransportException when the information from the user doesn't match up with reality: for example, the user configures a cluster address and a cluster name, ES verifies that the connected cluster's name matches, and it doesn't. I think we need a "user gave us a bad parameter" non-retryable exception?

There is some more code throwing the ConnectTransportException, and I think the idea is that we connected successfully to some server, but the type of server is not what ES expected to find.
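
To make the scenario concrete, here is a hedged sketch of the kind of post-connect validation being described; the class, method, and message are invented for illustration and are not the actual Elasticsearch code:

import org.elasticsearch.cluster.ClusterName;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.transport.ConnectTransportException;

// Illustrative only: check that the cluster we actually connected to is the one
// the user configured, and fail with ConnectTransportException otherwise.
final class RemoteClusterNameValidator {
    static void ensureExpectedClusterName(String expectedClusterName,
                                          DiscoveryNode remoteNode,
                                          ClusterName actualClusterName) {
        if (expectedClusterName.equals(actualClusterName.value()) == false) {
            // The TCP connection itself succeeded; the mismatch is a configuration
            // problem, which is why retrying the request is unlikely to help.
            throw new ConnectTransportException(remoteNode,
                "expected to connect to cluster [" + expectedClusterName
                    + "] but the node belongs to [" + actualClusterName.value() + "]");
        }
    }
}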

@DiannaHohensee
Contributor Author

After reviewing the ConnectTransportException uses, I think we need a new exception type, something like InvalidRemoteNodeException, to replace the ConnectTransportException use cases where the node connection is successful but the node type is unexpected. It doesn't seem like those error cases should be retryable.

@DaveCTurner for thoughts.
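
A hedged sketch of what the proposed InvalidRemoteNodeException might look like; the class does not exist in Elasticsearch, the name comes from the comment above, and the chosen status is only one possibility:

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.rest.RestStatus;

// Hypothetical: an exception for "we reached a node, but it is not the kind of
// node we expected", which callers should not treat as retryable.
public class InvalidRemoteNodeException extends ElasticsearchException {

    public InvalidRemoteNodeException(String msg, Object... args) {
        super(msg, args);
    }

    @Override
    public RestStatus status() {
        // 500 keeps it out of "retry on 502" client policies; the right code
        // was still under discussion in this thread.
        return RestStatus.INTERNAL_SERVER_ERROR;
    }
}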

@DaveCTurner
Contributor

My reading of the HTTP spec is that we are indeed acting as a gateway in this situation:

A "gateway" (a.k.a. "reverse proxy") is an intermediary that acts as an origin server for the outbound connection but translates received requests and forwards them inbound to another server or servers [...] A gateway communicates with inbound servers using any protocol that it desires [...]

Moreover, if the remote node turns out to belong to a cluster with an unexpected name then 502 Bad Gateway is a valid response code:

The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.

There are other ways that a bad in-cluster config might lead to a ConnectTransportException or a subclass, and indeed I can think of ways that this specific failure might go away on a retry. In any case, the effect on client retries is not really relevant here, at least not as much as following the HTTP spec. The whole notion of triggering client behaviour based solely on the HTTP response code is fundamentally doomed anyway; there are just too many different outcomes to map onto the very limited options available to us in the HTTP spec.

It's definitely not a client-side failure so we have to return some kind of 5xx error, and 502 seems more appropriate than 500. If clients choose to retry on all 502s then 🤷 that's not wrong I guess.
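
To make that trade-off concrete, here is a hedged sketch of the coarse, status-code-only retry policy a client might apply; it is illustrative and not part of any Elasticsearch client:

import java.util.Set;

// Illustrative client-side policy: with only the HTTP status to go on, a client
// typically retries "transient-looking" 5xx codes and gives up on the rest.
final class NaiveStatusRetryPolicy {
    private static final Set<Integer> RETRYABLE = Set.of(502, 503, 504);

    static boolean shouldRetry(int httpStatus) {
        // 500 is treated as "unknown server bug, do not retry", while 502/503/504
        // are treated as connectivity or availability problems worth retrying.
        return RETRYABLE.contains(httpStatus);
    }
}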

@mhl-b
Contributor

mhl-b commented Dec 17, 2024

It would be nice to have an integration or YAML/REST test to show REST status propagation. I don't think the status is very important, 500 or 502, since we model REST over HTTP and there is no strict requirement to conform to the error codes; as David said, there are not enough of them. So clients still need to unwrap the 5xx HTTP response to make a decision, including whether to retry.
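
A hedged sketch of what such an end-to-end assertion might look like in a test derived from ESRestTestCase; the endpoint, index name, and remote-cluster setup are hypothetical and omitted, so this is a shape rather than a working test:

import org.elasticsearch.client.Request;
import org.elasticsearch.client.ResponseException;
import org.elasticsearch.test.rest.ESRestTestCase;

// Hypothetical test: issue a request expected to fail through a broken remote
// connection and assert that the failure surfaces to the REST client as HTTP 502.
public class RemoteStatusPropagationIT extends ESRestTestCase {

    public void testConnectFailurePropagatesAsBadGateway() {
        // "broken-remote" stands in for a misconfigured or unreachable remote
        // cluster; the setup that would make it fail is not shown here.
        Request request = new Request("GET", "/broken-remote:some-index/_search");
        ResponseException e = expectThrows(ResponseException.class,
            () -> client().performRequest(request));
        assertEquals(502, e.getResponse().getStatusLine().getStatusCode());
    }
}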

Contributor

@DaveCTurner DaveCTurner left a comment


LGTM except a couple of nits

@elasticsearchmachine
Collaborator

Hi @DiannaHohensee, I've updated the changelog YAML for you.

@DiannaHohensee
Contributor Author

It would be nice to have an integration or YAML/REST test to show REST status propagation. I don't think the status is very important, 500 or 502, since we model REST over HTTP and there is no strict requirement to conform to the error codes; as David said, there are not enough of them. So clients still need to unwrap the 5xx HTTP response to make a decision, including whether to retry.

In the related SDHE, Kibana was seeing message: "org.elasticsearch.transport.NodeDisconnectedException" and message: INTERNAL_SERVER_ERROR as the symptom. I suspect that's from the generic TransportService#onConnectionClosed method. I think it would be difficult to test this behavior end-to-end: I'm not aware of a related test suite where we set up ES servers and check HTTP codes. I think it would be more expedient to file a ticket to do that work, if we wanted it.

@DiannaHohensee
Contributor Author

The test failures match #118846, so they are not a blocker.

@DiannaHohensee DiannaHohensee merged commit 517abe4 into elastic:main Dec 17, 2024
14 of 16 checks passed
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Dec 18, 2024
ConnectTransportException and its subclasses previously translated to an
INTERNAL_SERVER_ERROR HTTP 500 code. We are changing it to 502
BAD_GATEWAY so that users may choose to retry on connectivity
issues.

Related ES-10214
Closes elastic#118320
@pawankartik-elastic
Contributor

💚 All backports created successfully

Branch: 8.x

Questions? Please refer to the Backport tool documentation.

pawankartik-elastic pushed a commit to pawankartik-elastic/elasticsearch that referenced this pull request Dec 19, 2024

(cherry picked from commit 517abe4)
pawankartik-elastic added a commit that referenced this pull request Dec 23, 2024
…19146)


(cherry picked from commit 517abe4)

Co-authored-by: Dianna Hohensee <dianna.hohensee@elastic.co>
Labels
>enhancement, :Distributed Coordination/Network, Team:Distributed Coordination, v9.0.0
Successfully merging this pull request may close these issues: Change ConnectTransportException response code to 502