Prometheus Operator: supporting an auth proxy for Service Discovery
For ackalctld
to be deployable to Kubernetes with Prometheus Operator, it is necessary to implement "Enable ScrapeConfig to use (discovery|target) proxies" (#5966). While I'm familiar with Kubernetes, Kubernetes operators (Ackal uses one built with the Operator SDK) and Prometheus Operator, I'm unfamiliar with developing Prometheus Operator itself. This post (and subsequent ones) will document some preliminary work on this.
Cloned Prometheus Operator
Branched scrape-config-url-proxy
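That is, roughly (upstream repo URL; branch name as above):
git clone https://github.com/prometheus-operator/prometheus-operator
cd prometheus-operator
git checkout -b scrape-config-url-proxy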
I’m unsure how to effect these changes and unsure whether documentation exists.
Clearly, I will need to revise the ScrapeConfig CRD to add the proxy_url fields (one proxy_url defines a proxy for the Service Discovery endpoint; the second defines a proxy for the targets themselves) and it would be useful for this to closely mirror the existing Prometheus HTTP Service Discovery configuration, namely http_sd_config:
# URL from which the targets are fetched.
url: <string>
# Refresh interval to re-query the endpoint.
[ refresh_interval: <duration> | default = 60s ]
# Authentication information used to authenticate to the API server.
# Note that `basic_auth`, `authorization` and `oauth2` options are
# mutually exclusive.
# `password` and `password_file` are mutually exclusive.
# Optional HTTP basic authentication information.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]
# Optional `Authorization` header configuration.
authorization:
  # Sets the authentication type.
  [ type: <string> | default: Bearer ]
  # Sets the credentials. It is mutually exclusive with
  # `credentials_file`.
  [ credentials: <secret> ]
  # Sets the credentials to the credentials read from the configured file.
  # It is mutually exclusive with `credentials`.
  [ credentials_file: <filename> ]
# Optional OAuth 2.0 configuration.
oauth2:
  [ <oauth2> ]
# Optional proxy URL.
[ proxy_url: <string> ]
# Comma-separated string that can contain IPs, CIDR notation, and domain names
# that should be excluded from proxying. IP and domain names can
# contain port numbers.
[ no_proxy: <string> ]
# Use proxy URL indicated by environment variables (HTTP_PROXY, HTTPS_PROXY,
# and NO_PROXY, or their lowercase variants)
[ proxy_from_environment: <boolean> | default: false ]
# Specifies headers to send to proxies during CONNECT requests.
[ proxy_connect_header:
  [ <string>: [<secret>, ...] ] ]
# Configure whether HTTP requests follow HTTP 3xx redirects.
[ follow_redirects: <boolean> | default = true ]
# Whether to enable HTTP2.
[ enable_http2: <boolean> | default = true ]
# TLS configuration.
tls_config:
  [ <tls_config> ]
Here is the known-working example from ackalctld:
# Healthcheck Services
# HTTP Service Discovery requests must not be proxied
# Services that are discovered must be proxied
- job_name: healthchecks
  scheme: http
  proxy_url: http://localhost:7777
  http_sd_configs:
    - refresh_interval: 1m
      url: http://localhost:8080
NOTE ackalctld only proxies the (Cloud Run) targets that result from Service Discovery. Service Discovery requests aren't proxied because the discovery endpoint is served locally by ackalctld; it is the discovered Cloud Run targets that require the authenticating proxy.
So, there are multiple updates that will need to be made to the ScrapeConfig CRD but (initially) I'm going to focus on two (sketched below):
- ScrapeConfig's spec: add proxy_url
- ScrapeConfig's spec's httpSDConfigs: add proxy_url
Potentially, the httpSDConfigs addition of proxy_url should be rippled through the other *SDConfigs (e.g. consulSDConfigs). How I would test these changes (presumably there are existing tests!?) adds complexity.
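To make the intent concrete, here's a sketch of a ScrapeConfig that uses both additions; the camel-cased proxyUrl naming and its placement are my assumptions about the end state (mirroring the CRD's existing conventions), not a merged API:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: healthchecks
spec:
  # Assumed new field: proxy used when scraping the discovered targets
  proxyUrl: http://localhost:7777
  httpSDConfigs:
    - url: http://localhost:8080
      refreshInterval: 1m
      # Assumed new field: would proxy the discovery request itself
      # (unneeded for ackalctld because its discovery endpoint is local)
      # proxyUrl: http://discovery-proxy.example.com:7778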
Changed jsonnet/prometheus-operator/scrapeconfigs-crd.json, adding (twice):
"proxyUrl": {
"description": "Optional proxy URL",
"type": "string"
}
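If I understand the workflow, the derived artifacts (documentation, CRD manifests, bundle.yaml) then need regenerating; my assumption is that this is the relevant target:
make generate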
Presumably, I can try re-applying prometheus-operator to the (MicroK8s) cluster.
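Perhaps something like this, assuming the regenerated bundle.yaml carries the CRD change (server-side apply because the operator's CRDs are too large for client-side apply):
kubectl apply --server-side --filename bundle.yaml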
The Prometheus Operator README includes a Development section.
make test-unit appears to succeed.
Trying make image (which uses docker):
GOOS=linux \
GOARCH=amd64 \
CGO_ENABLED=0 \
go build \
-ldflags="-s -X github.com/prometheus/common/version.Revision=52d1e55af -X github.com/prometheus/common/version.BuildUser={user} -X github.com/prometheus/common/version.BuildDate=20231017-09:39:11 -X github.com/prometheus/common/version.Branch=main -X github.com/prometheus/common/version.Version=0.68.0" \
-o operator cmd/operator/main.go
docker build \
--build-arg ARCH=amd64 \
--build-arg OS=linux \
-t quay.io/prometheus-operator/prometheus-operator:52d1e55af \
.
Successfully tagged quay.io/prometheus-operator/prometheus-operator:52d1e55af
Successfully tagged quay.io/prometheus-operator/prometheus-config-reloader:52d1e55af
Successfully tagged quay.io/prometheus-operator/admission-webhook:52d1e55af
Aha, it's possible to use podman:
CONTAINER_CLI=podman \
make image
Aside: using MicroK8s' built-in (addon) registry
The Prometheus Operator makefile includes:
CONTAINER_CLI ?= docker
GO_PKG=github.com/prometheus-operator/prometheus-operator
IMAGE_OPERATOR?=quay.io/prometheus-operator/prometheus-operator
IMAGE_RELOADER?=quay.io/prometheus-operator/prometheus-config-reloader
IMAGE_WEBHOOK?=quay.io/prometheus-operator/admission-webhook
TAG?=$(shell git rev-parse --short HEAD)
NOTE TAG is git rev-parse --short HEAD
So:
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make image
Then (because we know how to capture TAG):
REGISTRY="localhost:32000" # MicroK8s
TAG=$(git rev-parse --short HEAD)
IMAGES=(
"prometheus-operator"
"prometheus-config-reloader"
"admission-webhook"
)
for IMAGE in "${IMAGES[@]}"
do
podman push "${REGISTRY}/${IMAGE}:${TAG}"
done
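To confirm the pushes landed, the registry's HTTP API can be queried (the MicroK8s addon runs a standard Distribution registry, so the v2 endpoints apply):
curl --silent http://localhost:32000/v2/_catalog
curl --silent http://localhost:32000/v2/prometheus-operator/tags/list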
NOTE The makefile includes: 'For now, the v1beta1 CRDs aren't part of the default bundle because they require to deploy/run the conversion webhook. They are provided in a separate directory (example/prometheus-operator-crd-full) and we generate jsonnet code that can be used to patch the "default" jsonnet CRD.'
The only reference to conversion (webhook) in the makefile is alertmanager-crd-conversion.
So, it appears there are multiple (!?) definitions of ScrapeConfig (I don't understand why the examples variants exist) but the correct (!) definition is probably in bundle.yaml, though it is constrained per the above proviso (the v1beta1 versions aren't deployed by default).
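One way to enumerate the candidates (a hypothetical grep over the repo for the CRD's full name):
grep --recursive --files-with-matches "scrapeconfigs.monitoring.coreos.com" .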
Added the YAML version of proxy_url to both locations in bundle.yaml for ScrapeConfig.
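That is, my YAML rendering of the JSON property shown earlier:
proxyUrl:
  description: Optional proxy URL
  type: string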
Question: Should I be bumping the CRD version with the changes?
REGISTRY="localhost:32000" # MicroK8s
CONTAINER_CLI="podman" \
IMAGE_OPERATOR="${REGISTRY}/prometheus-operator" \
IMAGE_RELOADER="${REGISTRY}/prometheus-config-reloader" \
IMAGE_WEBHOOK="${REGISTRY}/admission-webhook" \
make test-e2e
Tests failed due to incorrect parsing of the image URLs (localhost:32000/prometheus-operator).
Tweaked the code per "test-e2e makes incorrect assumption of registry image path".
After making this change, there are still multiple failures:
main_test.go:184: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
prometheus_test.go:2874: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
prometheus_test.go:2945: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": failed to call webhook: Post "https://prometheus-operator-admission-webhook.operatorupgrade-operatorupgrade-s2orl9-0-b1360a42.svc:443/admission-prometheusrules/mutate?timeout=10s": service "prometheus-operator-admission-webhook" not found
denylist_test.go:58: waiting for 0xc00113739c Prometheus instances timed out (allowed): prometheus denylist-prometheus-s2ot2p-1-d81d6a3f/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
denylist_test.go:153: waiting for 0xc00109c838 Prometheus instances timed out (allowed): prometheus denylist-servicemonitor-s2otc8-1-b9daaae1/allowed failed to become available: context deadline exceeded: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
prometheus_instance_namespaces_test.go:59: waiting for 0xc00109c96c Prometheus instances timed out (instance): prometheus prominstancens-allns-s2ou84-1-52537efc/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:284: waiting for 0xc000a48758 Prometheus instances timed out (instance): prometheus prominstancens-allowlist-s2ouhq-2-34baf117/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:181: waiting for 0xc000a44f70 Prometheus instances timed out (instance): prometheus prominstancens-denylist-s2ourf-2-3a2b5b7b/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
prometheus_instance_namespaces_test.go:434: waiting for 0xc00086e4a8 Prometheus instances timed out (instance): prometheus prominstancens-namespacenotfound-s2ov1i-2-90889a2e/instance failed to become available: context deadline exceeded: client rate limiter Wait returned an error: context deadline exceeded
alertmanager_instance_namespaces_test.go:114: alertmanager alertmanagerinstancens-denyns-s2ovb3-1-60d76236/instance failed to become available: context deadline exceeded: expected 3 replicas, got 2
alertmanager_instance_namespaces_test.go:71: alertmanager alertmanagerinstancens-allns-s2ovkp-1-b02c9938/instance failed to become available: context deadline exceeded: expected 3 replicas, got 0
alertmanager_instance_namespaces_test.go:193: alertmanager alertmanagerinstancens-allowlist-s2ovud-1-ef43d8e8/instance failed to become available: context deadline exceeded: expected 3 replicas, got 1
upgradepath_test.go:42: failed to create webhook server: failed to create or update admission webhook server service: waiting for service prometheus-operator-admission-webhook to become ready timed out: requesting endpoints for service prometheus-operator-admission-webhook failed: endpoints "prometheus-operator-admission-webhook" not found
main_test.go:427: failed to create webhook server: after create, waiting for deployment prometheus-operator-admission-webhook to become ready timed out: client rate limiter Wait returned an error: context deadline exceeded
And one failure that is particularly relevant, though possibly a red herring:
scrapeconfig_test.go:256:
Error Trace: /prometheus-operator/test/e2e/scrapeconfig_test.go:256
Error: Should be empty, but was map[prometheus-operator:0]
Test: TestOperatorUpgrade/PromOperatorStartsWithoutScrapeConfigCRD
Messages: expected to get 1 container but got 1