Custom bootstrap and CR lifecycle management #2884

@Donnerbart

Description

This is mostly a reality check on whether our custom bootstrap uses JOSDK as intended. Depending on the feedback, we can close this issue and create more specific follow-up issues if needed. I just wanted to describe the context once.

// configure operator
final var customResourceMetrics = new CustomResourceMetrics(config, meterRegistry);
final var eventSender = new EventSender(config, client);
final var reconciler = new HiveMQPlatformReconciler(config, customResourceMetrics, client, eventSender);

final var dependentResourceFactory = new HiveMQPlatformOperatorDependentResourceFactory<>(config, eventSender);
final var metrics = MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(meterRegistry).build();
final var operator = new Operator(override -> override //
        .withConcurrentReconciliationThreads(config.getConcurrentReconciliationThreads())
        .withConcurrentWorkflowExecutorThreads(config.getConcurrentWorkflowThreads())
        .withReconciliationTerminationTimeout(config.getReconciliationTerminationTimeout())
        .withCacheSyncTimeout(config.getCacheSyncTimeout())
        .withKubernetesClient(client)
        .withCloseClientOnStop(closeClient)
        .withDependentResourceFactory(dependentResourceFactory)
        .withMetrics(metrics)
        .withUseSSAToPatchPrimaryResource(useSSA)
        .withSSABasedCreateUpdateMatchForDependentResources(useSSA));
final var reconcilerConfig = operator.getConfigurationService().getConfigurationFor(reconciler); // (1)
final var overrideConfig = ControllerConfigurationOverrider.override(reconcilerConfig);
if (operatorNamespaces.equals(Constants.WATCH_CURRENT_NAMESPACE)) {
    overrideConfig.watchingOnlyCurrentNamespace();
} else if (operatorNamespaces.equals(Constants.WATCH_ALL_NAMESPACES)) {
    overrideConfig.watchingAllNamespaces();
} else {
    final var namespaces = Arrays.stream(operatorNamespaces.split(",")).collect(Collectors.toSet());
    overrideConfig.settingNamespaces(namespaces);
}
if (!operatorSelector.isBlank()) {
    overrideConfig.withLabelSelector(operatorSelector);
}
overrideConfig.withOnAddFilter(platform -> { // (3)
    customResourceMetrics.register(platform);
    return true;
});
final var controller = (Controller<HiveMQPlatform>) operator.register(reconciler, overrideConfig.build());
customResourceMetrics.setCache(controller.getEventSourceManager().getControllerEventSource()); // (2)
  • HiveMQPlatformOperatorDependentResourceFactory implements DependentResourceFactory (it's the only place where we really had to replace Quarkus Arc; otherwise we don't need CDI at all).
  • closeClient is true in production and only set to false in tests, so we don't close an injected K8s client that is still used by the test.
  • useSSA is true in production and only set to false in tests with the K8s mockserver.
  1. Are we creating the configuration and overrides as intended? I tried to reverse-engineer how Quarkus and the LocallyRunOperatorExtension configure JOSDK, but we still get a warning on the operator.getConfigurationService().getConfigurationFor(reconciler) invocation:
    12:29:37.423 [main] WARN  Default ConfigurationService implementation - Configuration for reconciler
    'hivemq-controller' was not found. Known reconcilers: None.
    
    It feels wrong to get that warning, but I found no other way to create a ControllerConfigurationOverrider.
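    For what it's worth, here is a minimal sketch of what might sidestep that lookup, assuming the register overload that takes a Consumer<ControllerConfigurationOverrider<P>> exists and applies here (the overrider would then be created during registration, so getConfigurationFor() is never called from our side):
    // sketch only: same overrides as above, but passed as a lambda to register()
    final var controller = (Controller<HiveMQPlatform>) operator.register(reconciler, override -> {
        if (operatorNamespaces.equals(Constants.WATCH_CURRENT_NAMESPACE)) {
            override.watchingOnlyCurrentNamespace();
        } else if (operatorNamespaces.equals(Constants.WATCH_ALL_NAMESPACES)) {
            override.watchingAllNamespaces();
        } else {
            override.settingNamespaces(Arrays.stream(operatorNamespaces.split(",")).collect(Collectors.toSet()));
        }
        if (!operatorSelector.isBlank()) {
            override.withLabelSelector(operatorSelector);
        }
        override.withOnAddFilter(platform -> { // (3)
            customResourceMetrics.register(platform);
            return true;
        });
    });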
  2. What is the proper way to implement custom metrics across all custom resources efficiently? For example, in CustomResourceMetrics we have a global Gauge for each state of our operator state machine. To calculate the values we count all custom resources that are in that state:
    cache.list()
        .map(CustomResource::getStatus)
        .filter(Objects::nonNull)
        .filter(status -> state.equals(status.getState()))
        .count();
    That cache is the ControllerEventSource that we set via customResourceMetrics.setCache(controller.getEventSourceManager().getControllerEventSource()). This feels a bit illegal, but it works very well for us.
    In the previous implementation we kept a shadow copy of all custom resources in a CHM, but those were the cloned instances from the reconciliation loop. This impacted the GC and required constant updates of the cached instances to retrieve the current state. Using the original instances from the cache solved this problem very elegantly for us, including the lifecycle management.
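    A minimal sketch of how such a per-state Gauge could be registered from inside CustomResourceMetrics, assuming cache is the field set via setCache(...); the metric name and the State enum are made up for illustration:
    // one Gauge per state machine state, counting matching custom resources from the
    // cache on every scrape; the cache is strongly referenced by this class, so
    // Micrometer's weak reference inside the Gauge stays valid
    public void registerStateGauges() {
        for (final var state : State.values()) {
            Gauge.builder("hivemq.platform.state.current", cache, c -> c.list()
                            .map(CustomResource::getStatus)
                            .filter(Objects::nonNull)
                            .filter(status -> state.equals(status.getState()))
                            .count())
                    .tag("state", state.name())
                    .description("Number of custom resources in this state")
                    .register(meterRegistry);
        }
    }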
  3. The same problem hits us again with a new Gauge for an aggregated health status metric per custom resource. We need to register this Gauge once when the CR is added, de-register it when it's removed, and in between have access to the current CustomResource::getStatus to get the health state. So we're back to a Map<String, GaugeHolder> with <namespace>-<name> as key.
    For the Gauge creation I'm using overrideConfig.withOnAddFilter(), which again feels very wrong, but seems to work fine. The Gauge removal is done in the Cleaner::cleanup method of our reconciler.
    Is there a better way to control this lifecycle, and where should that Gauge live? It feels like we cannot put it into the custom resource itself, both because of the cloning (the Gauge needs to be a singleton) and because we need to update the correct instance. This is how it looks in CustomResourceMetrics right now:
    private final @NotNull Map<String, GaugeHolder> healthMetricGauges = new ConcurrentHashMap<>();
    
    // called from the bootstrap
    public void register(final @NotNull HiveMQPlatform platform) {
        healthMetricGauges.put(getKey(platform), new GaugeHolder(platform));
    }
    
    // called from Cleaner::cleanup; also removes the map entry, so it doesn't leak
    public void deregister(final @NotNull HiveMQPlatform platform) {
        if (healthMetricGauges.remove(getKey(platform)) instanceof GaugeHolder gaugeHolder) {
            meterRegistry.remove(gaugeHolder.gauge);
        }
    }
    
    // called from the main reconciler when the health state has changed
    public void updateHealthMetric(final @NotNull HiveMQPlatform platform) {
        if (healthMetricGauges.get(getKey(platform)) instanceof GaugeHolder gaugeHolder) {
            gaugeHolder.value.set(platform.getStatus().getHealthStatus().getMetricsValue());
        }
    }
    
    private static @NotNull String getKey(final @NotNull HiveMQPlatform platform) {
        return platform.getMetadata().getNamespace() + "-" + platform.getMetadata().getName();
    }
    
    private class GaugeHolder {
    
        private final @NotNull AtomicInteger value = new AtomicInteger(HealthStatus.UNKNOWN.getMetricsValue());
        private final @NotNull Gauge gauge;
    
        public GaugeHolder(final @NotNull HiveMQPlatform platform) {
            this.gauge = Gauge.builder("hivemq.platform.health.system.current", value::get)
                    .tag("namespace", platform.getMetadata().getNamespace())
                    .tag("name", platform.getMetadata().getName())
                    .description("Custom resource health status")
                    .register(meterRegistry);
        }
    }
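    For illustration, a minimal sketch of a cache-backed variant (not what we run today), assuming the map becomes a Map<String, Gauge> and cache is the same ControllerEventSource as in (2): the Gauge reads the health value straight from the cached instance on every scrape, so the AtomicInteger and updateHealthMetric() go away, while registration and removal stay in the add filter and Cleaner::cleanup:
    // called from the bootstrap's onAddFilter; list() is the only cache accessor shown
    // above, so the sketch filters it (a point lookup would avoid the scan if available)
    public void register(final @NotNull HiveMQPlatform platform) {
        final var namespace = platform.getMetadata().getNamespace();
        final var name = platform.getMetadata().getName();
        final var gauge = Gauge.builder("hivemq.platform.health.system.current", cache, c -> c.list()
                        .filter(p -> namespace.equals(p.getMetadata().getNamespace())
                                && name.equals(p.getMetadata().getName()))
                        .findFirst()
                        .map(CustomResource::getStatus)
                        .filter(Objects::nonNull)
                        .map(status -> status.getHealthStatus().getMetricsValue())
                        .orElse(HealthStatus.UNKNOWN.getMetricsValue()))
                .tag("namespace", namespace)
                .tag("name", name)
                .description("Custom resource health status")
                .register(meterRegistry);
        healthMetricGauges.put(getKey(platform), gauge);
    }
    
    // called from Cleaner::cleanup
    public void deregister(final @NotNull HiveMQPlatform platform) {
        final var gauge = healthMetricGauges.remove(getKey(platform));
        if (gauge != null) {
            meterRegistry.remove(gauge);
        }
    }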
