Tailscale client metrics service discovery to Prometheus
I couldn’t summarize this in a title (even with an LLM’s help):
I wanted to:
- Run a Tailscale service discovery agent
- On a Tailscale node outside of the Kubernetes cluster
- Using Podman Quadlet
- Accessing it from the Kubernetes cluster using Tailscale’s egress proxy
- Accessing the proxy with a kube-prometheus ScrapeConfig
- So that Prometheus would scrape the container for Tailscale client metrics
Long-winded? Yes, but I had an underlying need to run the Tailscale Service Discovery remotely and this configuration helped me achieve that.
Run tailscale-sd on a Tailscale node
One way to run rootless containers indefinitely is to use Quadlet to generate a systemd service to run a container:
I created ${HOME}/.config/containers/systemd/tailscale-sd.container:
[Unit]
Description=Tailscale Service Discovery container
After=network-online.target
Wants=network-online.target
[Container]
ContainerName=tailscale-sd
Image=ghcr.io/.../tailscale-sd:...
Environment=TAILSCALE_API_TOKEN="..."
Environment=TAILSCALE_TAILNET="....ts.net"
PublishPort=8080:8080
[Service]
KillMode=mixed
[Install]
WantedBy=default.target
And I learned that, for debugging purposes, it’s good to run the following to check for errors:
# Run the Quadlet user generator by hand; problems in the .container file are reported here
/usr/lib/systemd/user-generators/podman-user-generator \
  ~/.config/systemd/user \
  /dev/null \
  /dev/null
Before running:
systemctl --user daemon-reload
systemctl --user start tailscale-sd.service
systemctl --user status tailscale-sd.service
Yielding:
● tailscale-sd.service - Tailscale Service Discovery Container
Loaded: loaded (/{HOME}/.config/containers/systemd/tailscale-sd.container; enabled; preset: enabled)
Active: active (running) since Fri 2025-06-20 00:00:00 PDT;
Main PID: 12345 (conmon)
Tasks: 23 (limit: 3853)
Memory: 6.0M (peak: 16.4M)
CPU: 661ms
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/tailscale-sd.service
├─libpod-payload-77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35
│ └─15539 /server --address=:8080
└─runtime
├─15518 /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -r 3 -e 4 --netns-type=path /run/user/1000/netns/netns-9c5a9fa8-4208-8a48-85e5-22795ec4f937 tap0
├─15520 rootlessport
├─15528 rootlessport-child
└─15537 /usr/bin/conmon --api-version 1 -c 77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35 -u 77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35 -r /usr/bin/crun -b /{HOME}/.local/share/containers/storage/overlay-cont>
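It’s also worth inspecting the unit that Quadlet generated (plain systemctl, nothing Quadlet-specific):
systemctl --user cat tailscale-sd.service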
And then I could confirm that the expected number of Tailscale nodes is being discovered with:
curl \
--silent \
--get \
http://{node}.{tailnet}:8080 \
| jq -r '.|length'
Create a proxy-egress Service on Kubernetes
I’m running Tailscale’s excellent Kubernetes Operator on MicroK8s.
I wanted to be able to access the tailscale-sd service running on the remote node.
To do this, the Operator provides a way to expose a Tailnet service to your Kubernetes cluster, aka “cluster egress”.
I’m using Jsonnet but here’s the equivalent Service YAML; externalName is replaced by the Tailscale Operator and so is a placeholder, as the value here indicates:
apiVersion: v1
kind: Service
metadata:
  annotations:
    tailscale.com/tailnet-fqdn: {node}.{tailnet}
  name: tailscale-sd
  namespace: tailscale-sd
spec:
  externalName: placeholder
  type: ExternalName
NOTE Because the Operator patches the Service, if you try to reapply the Service to the cluster, it will be updated because it will differ from the server’s value.
All being well, the Operator should patch the externalName
with the name of the Service (associated with a StatefulSet and Pod) that it’s created in the (default) tailscale
namespace. These are the resources that proxy the traffic to the external service.
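To see what the Operator wrote into the Service’s externalName, one option (a quick sketch, using the names above) is:
kubectl get service tailscale-sd \
  --namespace=tailscale-sd \
  --output=jsonpath='{.spec.externalName}'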
You can verify the Operator’s backing resources, e.g. the Pod, with:
SERVICE="tailscale-sd"
kubectl get pod \
--selector=tailscale.com/parent-resource=${SERVICE} \
--namespace=tailscale
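If the egress proxy misbehaves, its logs are the next place to look (same label selector; a sketch):
kubectl logs \
  --selector=tailscale.com/parent-resource=${SERVICE} \
  --namespace=tailscale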
I wanted to verify that the service is working correctly:
kubectl run busyboxplus \
--stdin --tty --rm \
--image=docker.io/radial/busyboxplus:curl \
--namespace=default \
-- sh
NOTE The only time I “advertently” (!?) use the default namespace.
nslookup tailscale-sd.tailscale-sd.svc.cluster.local
curl http://tailscale-sd.tailscale-sd.svc.cluster.local:8080
[
{
"targets":...
},
...
]
Outputs the Prometheus Service Discovery format as expected. No jq
on the busybox container image.
Configure kube-prometheus to scrape the Service
kube-prometheus (v0.81.0) provides several CRDs, including ScrapeConfig.
This is straightforward and simply points to the HTTP endpoint exposed by the Service:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: tailscale-sd
  namespace: tailscale-sd
spec:
  httpSDConfigs:
  - url: http://tailscale-sd.tailscale-sd.svc.cluster.local:8080/
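Once applied, confirm the resource exists (plain kubectl; nothing specific to this setup):
kubectl get scrapeconfig tailscale-sd \
  --namespace=tailscale-sd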
And then verify that the resulting Service Discovery (kube-prometheus names these scrapeConfig/{namespace}/{name}) exists and is finding the expected number of nodes:
PROMETHEUS="..."
NAME="tailscale-sd"
NAMESPACE="tailscale-sd"
POOL="scrapeConfig/${NAMESPACE}/${NAME}"
FILTER="[
.data.activeTargets[]
|select(.scrapePool==\"${POOL}\")
]
|length"
curl \
--silent \
${PROMETHEUS}/api/v1/targets \
| jq -r "${FILTER}"
And query the number of unique metrics available from the resulting job:
curl \
--silent \
--get \
--data-urlencode "query=count(
count by(__name__) (
{job=\"${POOL}\"}
)
)" \
${PROMETHEUS}/api/v1/query
{
"status":"success",
"data":{
"resultType":"vector",
"result":[
{
"metric":{},
"value":[1750444592.219,"14"]
}
]
}
}
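To spot-check an individual client metric, you can query one of the counters by name; tailscaled_outbound_bytes_total is one of the metrics the Tailscale client exposes, though the exact set varies by client version:
curl \
  --silent \
  --get \
  --data-urlencode "query=tailscaled_outbound_bytes_total{job=\"${POOL}\"}" \
  ${PROMETHEUS}/api/v1/query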
Or interact with it through the Prometheus MCP server ;-)
To delete everything:
Kubernetes resources:
kubectl delete scrapeconfig/tailscale-sd --namespace=tailscale-sd
kubectl delete service/tailscale-sd --namespace=tailscale-sd
Quadlet:
systemctl --user stop tailscale-sd.service
systemctl --user disable tailscale-sd.service
rm ~/.config/containers/systemd/tailscale-sd.container
systemctl --user daemon-reload
systemctl --user status tailscale-sd.service
That’s all!