Prometheus Operator: support an auth proxy for Service Discovery
For ackalctld to be deployable to Kubernetes with Prometheus Operator, it is necessary to "Enable ScrapeConfig to use (discovery|target) proxies" (#5966). While I'm familiar with Kubernetes, with Kubernetes operators (Ackal uses one built with the Operator SDK) and with Prometheus Operator, I'm unfamiliar with developing Prometheus Operator itself. This post (and subsequent ones) documents some preliminary work.
- Cloned Prometheus Operator
- Branched scrape-config-url-proxy
I’m unsure how to effect these changes and unsure whether documentation exists.
Clearly, I will need to revise the ScrapeConfig CRD to add proxy_url fields (one proxy_url defines a proxy for the Service Discovery endpoint; the second defines a proxy for the targets themselves), and it would be useful for these to closely mirror the existing Prometheus HTTP Service Discovery configuration, namely <http_sd_config>:
# URL from which the targets are fetched.
url: <string>
# Refresh interval to re-query the endpoint.
[ refresh_interval: <duration> | default = 60s ]
# Authentication information used to authenticate to the API server.
# Note that `basic_auth`, `authorization` and `oauth2` options are
# mutually exclusive.
# `password` and `password_file` are mutually exclusive.
# Optional HTTP basic authentication information.
basic_auth:
[ username: <string> ]
[ password: <secret> ]
[ password_file: <string> ]
# Optional `Authorization` header configuration.
authorization:
# Sets the authentication type.
[ type: <string> | default: Bearer ]
# Sets the credentials. It is mutually exclusive with
# `credentials_file`.
[ credentials: <secret> ]
# Sets the credentials to the credentials read from the configured file.
# It is mutually exclusive with `credentials`.
[ credentials_file: <filename> ]
# Optional OAuth 2.0 configuration.
oauth2:
[ <oauth2> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# Comma-separated string that can contain IPs, CIDR notation, domain names
# that should be excluded from proxying. IP and domain names can
# contain port numbers.
[ no_proxy: <string> ]
# Use proxy URL indicated by environment variables (HTTP_PROXY, HTTPS_PROXY, NO_PROXY, and their lowercase variants)
[ proxy_from_environment: <boolean> | default: false ]
# Specifies headers to send to proxies during CONNECT requests.
[ proxy_connect_header:
[ <string>: [<secret>, ...] ] ]
# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <boolean> | default = true ]
# Whether to enable HTTP2.
[ enable_http2: <boolean> | default: true ]
# TLS configuration.
tls_config:
[ <tls_config> ]
Here is the known-working example from ackalctld:
# Healthcheck Services
# HTTP Service Discovery requests must not be proxied
# Services that are discovered must be proxied
- job_name: healthchecks
  scheme: http
  proxy_url: http://localhost:7777
  http_sd_configs:
    - refresh_interval: 1m
      url: http://localhost:8080
NOTE
ackalctld only proxies the (Cloud Run) targets that result from Service Discovery. Service Discovery requests aren't proxied because ackalctld serves the discovery endpoint itself, locally (localhost:8080); only the discovered targets sit behind the auth proxy (localhost:7777).
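For reference, the discovery endpoint returns the standard Prometheus HTTP SD payload, a JSON list of target groups. The target and label values below are illustrative, not actual ackalctld output:
[
  {
    "targets": ["healthcheck-abcde12345-uc.a.run.app:443"],
    "labels": {
      "job": "healthchecks"
    }
  }
]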
So, there are multiple updates that will need to be made to the ScrapeConfig CRD but (initially) I'm going to focus on two (a sketch of the result follows the list):

1. ScrapeConfig's spec: add proxy_url
2. ScrapeConfig's spec's httpSDConfigs: add proxy_url
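Here's a sketch of what a ScrapeConfig resource could then look like. The proxyUrl field names are my assumption (mirroring the CRD property added below), not a settled design, and the second proxy value is hypothetical:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: healthchecks
spec:
  # Proxy for the discovered (Cloud Run) targets
  proxyUrl: http://localhost:7777
  httpSDConfigs:
    - url: http://localhost:8080
      refreshInterval: 1m
      # Proxy for the Service Discovery requests themselves
      # (ackalctld would omit this; see the NOTE above)
      proxyUrl: http://squid.example.com:3128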
Potentially, the httpSDConfigs addition of proxy_url should be rippled through the other *SDConfigs (e.g. consulSDConfigs). Working out how I would test these changes (presumably there are existing tests!?) adds complexity.
Changed jsonnet/prometheus-operator/scrapeconfigs-crd.json, added (twice):
"proxyUrl": {
"description": "Optional proxy URL",
"type": "string"
}
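For context, the two locations sit (roughly; reconstructed from memory rather than copied from the file, and eliding sibling properties) under the spec's properties and under each httpSDConfigs item's properties:
"spec": {
  "properties": {
    "proxyUrl": {
      "description": "Optional proxy URL",
      "type": "string"
    },
    "httpSDConfigs": {
      "items": {
        "properties": {
          "proxyUrl": {
            "description": "Optional proxy URL",
            "type": "string"
          }
        }
      }
    }
  }
}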
Presumably, I can try re-applying prometheus-operator to the (MicroK8s) cluster.
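Presumably (untested here) something like the following, from the repo root; server-side apply sidesteps the annotation-size limit that large CRDs can hit with client-side apply:
kubectl apply --server-side -f bundle.yaml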
The Prometheus Operator README includes a Development section.

make test-unit appears to succeed.
Trying make image (uses docker):
GOOS=linux \
GOARCH=amd64 \
CGO_ENABLED=0 \
go build \
-ldflags="-s -X github.com/prometheus/common/version.Revision=52d1e55af -X github.com/prometheus/common/version.BuildUser={user} -X github.com/prometheus/common/version.BuildDate=20231017-09:39:11 -X github.com/prometheus/common/version.Branch=main -X github.com/prometheus/common/version.Version=0.68.0" \
-o operator cmd/operator/main.go
docker build \
--build-arg ARCH=amd64 \
--build-arg OS=linux \
-t quay.io/prometheus-operator/prometheus-operator:52d1e55af \
.
Successfully tagged quay.io/prometheus-operator/prometheus-operator:52d1e55af
Successfully tagged quay.io/prometheus-operator/prometheus-config-reloader:52d1e55af
Successfully tagged quay.io/prometheus-operator/admission-webhook:52d1e55af
Aha, it’s possible to use podman:
CONTAINER_CLI=podman \
make image
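The resulting images can be confirmed with (filter pattern assumed):
podman images --filter "reference=quay.io/prometheus-operator/*"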
Aside: using MicroK8s’ built-in (addon) registry
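If the registry addon isn't already enabled (it serves on localhost:32000):
microk8s enable registry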
The Prometheus Operator Makefile includes:
CONTAINER_CLI ?= docker
GO_PKG=github.com/prometheus-operator/prometheus-operator
IMAGE_OPERATOR?=quay.io/prometheus-operator/prometheus-operator
IMAGE_RELOADER?=quay.io/prometheus-operator/prometheus-config-reloader
IMAGE_WEBHOOK?=quay.io/prometheus-operator/admission-webhook
TAG?=$(shell git rev-parse --short HEAD)
NOTE
TAG is git rev-parse --short HEAD
So:
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make image
Then (because we know how to capture TAG):
REGISTRY="localhost:32000" # MicroK8s
TAG=$(git rev-parse --short HEAD)
IMAGES=(
"prometheus-operator"
"prometheus-config-reloader"
"admission-webhook"
)
for IMAGE in "${IMAGES[@]}"
do
podman push ${REGISTRY}/${IMAGE}:${TAG}
done
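The pushes can be verified against the registry's HTTP (Docker Registry v2) API:
curl --silent http://localhost:32000/v2/_catalog

# Tags for one of the repositories
curl --silent http://localhost:32000/v2/prometheus-operator/tags/list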
NOTE
The Makefile includes: ‘For now, the v1beta1 CRDs aren’t part of the default bundle because they require to deploy/run the conversion webhook. They are provided in a separate directory (example/prometheus-operator-crd-full) and we generate jsonnet code that can be used to patch the “default” jsonnet CRD.’
The only reference to the conversion webhook in the Makefile is alertmanager-crd-conversion.

So, it appears there are multiple (!?) definitions of ScrapeConfig (I don't understand why the example variants exist), but the correct (!) definition is probably the one in bundle.yaml, constrained per the above proviso (the v1alpha1 versions aren't deployed by default).
Added the YAML version of proxy_url to both locations in bundle.yaml for ScrapeConfig.
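That is, the YAML equivalent of the JSON addition above, in both locations:
proxyUrl:
  description: Optional proxy URL
  type: string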
Question: Should I be bumping the CRD version with the changes?
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make test-e2e
Tests failed due to incorrect parsing of the image URLs (localhost:32000/prometheus-operator).
Tweaked the code per "test-e2e makes incorrect assumption of registry image path".
After making this change, there are still multiple failures:
main_test.go:184: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
prometheus_test.go:2874: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
prometheus_test.go:2945: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
denylist_test.go:58: waiting for 0xc00113739c Prometheus instances timed out (allowed): prometheus denylist-prometheus-s2ot2p-1-d81d6a3f/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
denylist_test.go:153: waiting for 0xc00109c838 Prometheus instances timed out (allowed): prometheus denylist-servicemonitor-s2otc8-1-b9daaae1/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
prometheus_instance_namespaces_test.go:59: waiting for 0xc00109c96c Prometheus instances timed out (instance): prometheus prominstancens-allns-s2ou84-1-52537efc/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:284: waiting for 0xc000a48758 Prometheus instances timed out (instance): prometheus prominstancens-allowlist-s2ouhq-2-34baf117/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:181: waiting for 0xc000a44f70 Prometheus instances timed out (instance): prometheus prominstancens-denylist-s2ourf-2-3a2b5b7b/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:434: waiting for 0xc00086e4a8 Prometheus instances timed out (instance): prometheus prominstancens-namespacenotfound-s2ov1i-2-90889a2e/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
alertmanager_instance_namespaces_test.go:114: alertmanager alertmanagerinstancens-denyns-s2ovb3-1-60d76236/instance failed to become available: context deadline exceeded: expected 3 replicas, got 2
alertmanager_instance_namespaces_test.go:71: alertmanager alertmanagerinstancens-allns-s2ovkp-1-b02c9938/instance failed to become available: context deadline exceeded: expected 3 replicas, got 0
alertmanager_instance_namespaces_test.go:193: alertmanager alertmanagerinstancens-allowlist-s2ovud-1-ef43d8e8/instance failed to become available: context deadline exceeded: expected 3 replicas, got 1
upgradepath_test.go:42: failed to create webhook server: failed to create or update admission webhook server service: waiting for service prometheus-operator-admission-webhook to become ready timed out: requesting endpoints for service prometheus-operator-admission-webhook failed: endpoints "prometheus-operator-admission-webhook" not found
main_test.go:427: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
And one particularly relevant, though possibly red-herring, failure:
scrapeconfig_test.go:256:
Error Trace: /prometheus-operator/test/e2e/scrapeconfig_test.go:256
Error: Should be empty, but was map[prometheus-operator:0]
Test: TestOperatorUpgrade/PromOperatorStartsWithoutScrapeConfigCRD
Messages: expected to get 1 container but got 1