-
Notifications
You must be signed in to change notification settings - Fork 13
Fix auth transition on edge-cases #321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
MCK 1.3.0 Release NotesBug Fixes
Other Changes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion
, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.
Key changes:
- Added authentication transition detection in the automation status checker
- Introduced logic to wait for auth-related moves to complete before marking clusters as ready
- Added comprehensive test coverage for authentication transition scenarios
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
controllers/om/automation_status.go | Added authentication transition detection and isAuthenticationTransitionMove helper function |
controllers/om/automation_status_test.go | Added comprehensive test cases for authentication transition scenarios |
changelog/20250808_fix_fixing_auth_transition_edge_cases.md | Added changelog entry documenting the fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a nit
@@ -113,17 +138,29 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin | |||
} | |||
} | |||
|
|||
// isAuthenticationTransitionMove returns true if the given move is related to authentication transitions | |||
func isAuthenticationTransitionMove(move string) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the approach of moving those phase specific checks outside of the readiness probe where we can only afford to treat them with a very wide brush.
date: 2025-08-08 | ||
--- | ||
|
||
* The readinessProbe always returns ready if the agent is in a wait step. This can be problematic during auth transitions as we can have a period where we invalidate one auth while the other is not activated yet and we try to use the not supported one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a period where we invalidate one auth while the other is not activated yet and we try to use the not supported one.
We seem to have auth in two states, invalidated and inactive. But at the end the note says, we try to use the not supported one? What is not supported one? Is it the invalidated one or the inactive one?
Sorry, it might be just me how is not able to understand this.
The other point that I have is, usually we try to convey what exactly what improved/fixed/changed using the release note. But after reading this note, I am not able to understand what exactly was changed. Was an issue fixed, or a feature added? Or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Summary
Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and we generally return ready for all wait steps, regardless of whether auth is fully transitioned. Example status:
Why implemented in the operator and not readinessProbe:
I didn't fix the readinessProbe but rather the operator
operator does
The core idea:
What happened in our tests:
node-0
completed its auth transition (now uses scram, instead of x509)Config server
hasn't finished its auth transition yetnod e-0
node-0
restarted with the old X509 config (see below comment from the agent code)tl;dr: first
node-0
moved to new auth,config
not yet,node-0
restarted and during the restartconfig
transitioned to the new auth whilenode-0
is again running old authProof of Work
Checklist
skip-changelog
label if not needed