Prometheus Operator: support an auth proxy for Service Discovery
For ackalctld to be deployable to Kubernetes with Prometheus Operator, it is necessary to "Enable ScrapeConfig to use (discovery|target) proxies" (#5966). While I'm familiar with Kubernetes, with Kubernetes operators (Ackal uses one built with the Operator SDK) and with Prometheus Operator, I'm unfamiliar with developing Prometheus Operator itself. This post (and subsequent ones) documents some preliminary work.
- Cloned Prometheus Operator
- Branched scrape-config-url-proxy
I’m unsure how to effect these changes and unsure whether documentation exists.
Clearly, I will need to revise the ScrapeConfig CRD to add proxy_url fields (one proxy_url defines a proxy for the Service Discovery endpoint; the second defines a proxy for the targets themselves), and it would be useful for these to closely mirror the existing Prometheus HTTP Service Discovery configuration, namely <http_sd_config>:
# URL from which the targets are fetched.
url: <string>
# Refresh interval to re-query the endpoint.
[ refresh_interval: <duration> | default = 60s ]
# Authentication information used to authenticate to the API server.
# Note that `basic_auth`, `authorization` and `oauth2` options are
# mutually exclusive.
# `password` and `password_file` are mutually exclusive.
# Optional HTTP basic authentication information.
basic_auth:
[ username: <string> ]
[ password: <secret> ]
[ password_file: <string> ]
# Optional `Authorization` header configuration.
authorization:
# Sets the authentication type.
[ type: <string> | default: Bearer ]
# Sets the credentials. It is mutually exclusive with
# `credentials_file`.
[ credentials: <secret> ]
# Sets the credentials to the credentials read from the configured file.
# It is mutually exclusive with `credentials`.
[ credentials_file: <filename> ]
# Optional OAuth 2.0 configuration.
oauth2:
[ <oauth2> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# Comma-separated string that can contain IPs, CIDR notation, domain names
# that should be excluded from proxying. IP and domain names can
# contain port numbers.
[ no_proxy: <string> ]
# Use proxy URL indicated by environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, and their lowercase variants)
[ proxy_from_environment: <boolean> | default: false ]
# Specifies headers to send to proxies during CONNECT requests.
[ proxy_connect_header:
[ <string>: [<secret>, ...] ] ]
# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <boolean> | default = true ]
# Whether to enable HTTP2.
[ enable_http2: <boolean> | default: true ]
# TLS configuration.
tls_config:
[ <tls_config> ]
Here is the known-working example from ackalctld:
# Healthcheck Services
# HTTP Service Discovery requests must not be proxied
# Services that are discovered must be proxied
- job_name: healthchecks
  scheme: http
  proxy_url: http://localhost:7777
  http_sd_configs:
    - refresh_interval: 1m
      url: http://localhost:8080
NOTE
ackalctld only proxies the (Cloud Run) targets that result from Service Discovery. Service Discovery requests aren't proxied because ackalctld serves the discovery endpoint itself, locally (localhost:8080); only the discovered targets sit behind the auth proxy (localhost:7777).
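For reference, the discovery endpoint returns the standard Prometheus HTTP SD payload, a JSON list of target groups. The target and label values below are illustrative, not actual ackalctld output:
[
  {
    "targets": ["healthcheck-abcde12345-uc.a.run.app:443"],
    "labels": {
      "job": "healthchecks"
    }
  }
]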
So, there are multiple updates that will need to be made to the ScrapeConfig CRD but (initially) I'm going to focus on two (a sketch of the result follows the list):

1. ScrapeConfig's spec: add proxy_url
2. ScrapeConfig's spec's httpSDConfigs: add proxy_url
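Here's a sketch of what a ScrapeConfig resource could then look like. The proxyUrl field names are my assumption (mirroring the CRD property added below), not a settled design, and the second proxy value is hypothetical:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: healthchecks
spec:
  # Proxy for the discovered (Cloud Run) targets
  proxyUrl: http://localhost:7777
  httpSDConfigs:
    - url: http://localhost:8080
      refreshInterval: 1m
      # Proxy for the Service Discovery requests themselves
      # (ackalctld would omit this; see the NOTE above)
      proxyUrl: http://squid.example.com:3128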
Potentially, the httpSDConfigs addition of proxy_url should be rippled through the other *SDConfigs (e.g. consulSDConfigs). Working out how I would test these changes (presumably there are existing tests!?) adds complexity.
Changed jsonnet/prometheus-operator/scrapeconfigs-crd.json, added (twice):
"proxyUrl": {
"description": "Optional proxy URL",
"type": "string"
}
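For context, the two locations sit (roughly; reconstructed from memory rather than copied from the file, and eliding sibling properties) under the spec's properties and under each httpSDConfigs item's properties:
"spec": {
  "properties": {
    "proxyUrl": {
      "description": "Optional proxy URL",
      "type": "string"
    },
    "httpSDConfigs": {
      "items": {
        "properties": {
          "proxyUrl": {
            "description": "Optional proxy URL",
            "type": "string"
          }
        }
      }
    }
  }
}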
Presumably, I can try re-applying prometheus-operator to the (MicroK8s) cluster.
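Presumably (untested here) something like the following, from the repo root; server-side apply sidesteps the annotation-size limit that large CRDs can hit with client-side apply:
kubectl apply --server-side -f bundle.yaml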
The Prometheus Operator README includes a Development section.

make test-unit appears to succeed.
Trying make image (uses docker):
GOOS=linux \
GOARCH=amd64 \
CGO_ENABLED=0 \
go build \
-ldflags="-s -X github.com/prometheus/common/version.Revision=52d1e55af -X github.com/prometheus/common/version.BuildUser={user} -X github.com/prometheus/common/version.BuildDate=20231017-09:39:11 -X github.com/prometheus/common/version.Branch=main -X github.com/prometheus/common/version.Version=0.68.0" \
-o operator cmd/operator/main.go
docker build \
--build-arg ARCH=amd64 \
--build-arg OS=linux \
-t quay.io/prometheus-operator/prometheus-operator:52d1e55af \
.
Successfully tagged quay.io/prometheus-operator/prometheus-operator:52d1e55af
Successfully tagged quay.io/prometheus-operator/prometheus-config-reloader:52d1e55af
Successfully tagged quay.io/prometheus-operator/admission-webhook:52d1e55af
Aha, it’s possible to use podman:
CONTAINER_CLI=podman \
make image
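The resulting images can be confirmed with (filter pattern assumed):
podman images --filter "reference=quay.io/prometheus-operator/*"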
Aside: using MicroK8s’ built-in (addon) registry
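If the registry addon isn't already enabled (it serves on localhost:32000):
microk8s enable registry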
The Prometheus Operator Makefile includes:
CONTAINER_CLI ?= docker
GO_PKG=github.com/prometheus-operator/prometheus-operator
IMAGE_OPERATOR?=quay.io/prometheus-operator/prometheus-operator
IMAGE_RELOADER?=quay.io/prometheus-operator/prometheus-config-reloader
IMAGE_WEBHOOK?=quay.io/prometheus-operator/admission-webhook
TAG?=$(shell git rev-parse --short HEAD)
NOTE
TAG is git rev-parse --short HEAD
So:
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make image
Then (because we know how to capture TAG):
REGISTRY="localhost:32000" # MicroK8s
TAG=$(git rev-parse --short HEAD)
IMAGES=(
"prometheus-operator"
"prometheus-config-reloader"
"admission-webhook"
)
for IMAGE in "${IMAGES[@]}"
do
podman push ${REGISTRY}/${IMAGE}:${TAG}
done
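The pushes can be verified against the registry's HTTP (Docker Registry v2) API:
curl --silent http://localhost:32000/v2/_catalog

# Tags for one of the repositories
curl --silent http://localhost:32000/v2/prometheus-operator/tags/list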
NOTE
The Makefile includes: ‘For now, the v1beta1 CRDs aren’t part of the default bundle because they require to deploy/run the conversion webhook. They are provided in a separate directory (example/prometheus-operator-crd-full) and we generate jsonnet code that can be used to patch the “default” jsonnet CRD.’
The only reference to the conversion webhook in the Makefile is alertmanager-crd-conversion.

So, it appears there are multiple (!?) definitions of ScrapeConfig (I don't understand why the example variants exist), but the correct (!) definition is probably the one in bundle.yaml, constrained per the above proviso (the v1alpha1 versions aren't deployed by default).
Added the YAML version of proxy_url to both locations in bundle.yaml for ScrapeConfig.
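That is, the YAML equivalent of the JSON addition above, in both locations:
proxyUrl:
  description: Optional proxy URL
  type: string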
Question: Should I be bumping the CRD version with the changes?
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make test-e2e
Tests failed due to incorrect parsing of the image URLs (localhost:32000/prometheus-operator).
Tweaked the code per "test-e2e makes incorrect assumption of registry image path".
After making this change, there are still multiple failures:
main_test.go:184: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
prometheus_test.go:2874: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
prometheus_test.go:2945: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
denylist_test.go:58: waiting for 0xc00113739c Prometheus instances timed out (allowed): prometheus denylist-prometheus-s2ot2p-1-d81d6a3f/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
denylist_test.go:153: waiting for 0xc00109c838 Prometheus instances timed out (allowed): prometheus denylist-servicemonitor-s2otc8-1-b9daaae1/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
prometheus_instance_namespaces_test.go:59: waiting for 0xc00109c96c Prometheus instances timed out (instance): prometheus prominstancens-allns-s2ou84-1-52537efc/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:284: waiting for 0xc000a48758 Prometheus instances timed out (instance): prometheus prominstancens-allowlist-s2ouhq-2-34baf117/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:181: waiting for 0xc000a44f70 Prometheus instances timed out (instance): prometheus prominstancens-denylist-s2ourf-2-3a2b5b7b/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:434: waiting for 0xc00086e4a8 Prometheus instances timed out (instance): prometheus prominstancens-namespacenotfound-s2ov1i-2-90889a2e/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
alertmanager_instance_namespaces_test.go:114: alertmanager alertmanagerinstancens-denyns-s2ovb3-1-60d76236/instance failed to become available: context deadline exceeded: expected 3 replicas, got 2
alertmanager_instance_namespaces_test.go:71: alertmanager alertmanagerinstancens-allns-s2ovkp-1-b02c9938/instance failed to become available: context deadline exceeded: expected 3 replicas, got 0
alertmanager_instance_namespaces_test.go:193: alertmanager alertmanagerinstancens-allowlist-s2ovud-1-ef43d8e8/instance failed to become available: context deadline exceeded: expected 3 replicas, got 1
upgradepath_test.go:42: failed to create webhook server: failed to create or update admission webhook server service: waiting for service prometheus-operator-admission-webhook to become ready timed out: requesting endpoints for service prometheus-operator-admission-webhook failed: endpoints "prometheus-operator-admission-webhook" not found
main_test.go:427: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
And one particularly relevant, though possibly red-herring, failure:
scrapeconfig_test.go:256:
Error Trace: /prometheus-operator/test/e2e/scrapeconfig_test.go:256
Error: Should be empty, but was map[prometheus-operator:0]
Test: TestOperatorUpgrade/PromOperatorStartsWithoutScrapeConfigCRD
Messages: expected to get 1 container but got 1