Tailscale client metrics service discovery to Prometheus
I couldn’t summarize this in a title (even with an LLM’s help):
I wanted to:
- Run a Tailscale service discovery agent
- On a Tailscale node outside of the Kubernetes cluster
- Using Podman Quadlet
- Accessing it from the Kubernetes cluster using Tailscale’s egress proxy
- Accessing the proxy with a kube-prometheus ScrapeConfig
- So that Prometheus would scrape the container for Tailscale client metrics
Long-winded? Yes, but I had an underlying need to run the Tailscale Service Discovery remotely and this configuration helped me achieve that.
Run tailscale-sd on a Tailscale node
One way to run rootless containers indefinitely is to use Quadlet to generate a systemd service to run a container:
I created ${HOME}/.config/containers/systemd/tailscale-sd.container:
[Unit]
Description=Tailscale Service Discovery container
After=network-online.target
Wants=network-online.target
[Container]
ContainerName=tailscale-sd
Image=ghcr.io/.../tailscale-sd:...
Environment=TAILSCALE_API_TOKEN="..."
Environment=TAILSCALE_TAILNET="....ts.net"
PublishPort=8080:8080
[Service]
KillMode=mixed
[Install]
WantedBy=default.target
And I learned that, for debugging purposes, it’s good to run the following to check for errors:
# Run the Quadlet user generator by hand; problems in the .container file are reported here
/usr/lib/systemd/user-generators/podman-user-generator \
  ~/.config/systemd/user \
  /dev/null \
  /dev/null
Before running:
systemctl --user daemon-reload
systemctl --user start tailscale-sd.service
systemctl --user status tailscale-sd.service
Yielding:
● tailscale-sd.service - Tailscale Service Discovery Container
Loaded: loaded (/{HOME}/.config/containers/systemd/tailscale-sd.container; enabled; preset: enabled)
Active: active (running) since Fri 2025-06-20 00:00:00 PDT;
Main PID: 12345 (conmon)
Tasks: 23 (limit: 3853)
Memory: 6.0M (peak: 16.4M)
CPU: 661ms
CGroup: /user.slice/user-1000.slice/user@1000.service/app.slice/tailscale-sd.service
├─libpod-payload-77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35
│ └─15539 /server --address=:8080
└─runtime
├─15518 /usr/bin/slirp4netns --disable-host-loopback --mtu=65520 --enable-sandbox --enable-seccomp --enable-ipv6 -c -r 3 -e 4 --netns-type=path /run/user/1000/netns/netns-9c5a9fa8-4208-8a48-85e5-22795ec4f937 tap0
├─15520 rootlessport
├─15528 rootlessport-child
└─15537 /usr/bin/conmon --api-version 1 -c 77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35 -u 77863e02aa15100a40139398657f6fab1c797a22d2a16b8d0f8700e68c84af35 -r /usr/bin/crun -b /{HOME}/.local/share/containers/storage/overlay-cont>
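It’s also worth inspecting the unit that Quadlet generated (plain systemctl, nothing Quadlet-specific):
systemctl --user cat tailscale-sd.service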
And then I could confirm that the expected number of Tailscale nodes is being discovered with:
curl \
--silent \
--get \
http://{node}.{tailnet}:8080 \
| jq -r '.|length'
Create a proxy-egress Service on Kubernetes
I’m running Tailscale’s excellent Kubernetes Operator on MicroK8s.
I wanted to be able to access the tailscale-sd service running on the remote node.
To do this, the Operator provides a way to expose a Tailnet service to your Kubernetes cluster, aka “cluster egress”.
I’m using Jsonnet but here’s the equivalent Service YAML; externalName is replaced by the Tailscale Operator and so is a placeholder, as the value here indicates:
apiVersion: v1
kind: Service
metadata:
  annotations:
    tailscale.com/tailnet-fqdn: {node}.{tailnet}
  name: tailscale-sd
  namespace: tailscale-sd
spec:
  externalName: placeholder
  type: ExternalName
NOTE Because the Operator patches the Service, if you try to reapply the Service to the cluster, it will be updated because it will differ from the server’s value.
All being well, the Operator should patch the externalName
with the name of the Service (associated with a StatefulSet and Pod) that it’s created in the (default) tailscale
namespace. These are the resources that proxy the traffic to the external service.
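To see what the Operator wrote into the Service’s externalName, one option (a quick sketch, using the names above) is:
kubectl get service tailscale-sd \
  --namespace=tailscale-sd \
  --output=jsonpath='{.spec.externalName}'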
You can verify the Operator’s backing resources, e.g. the Pod, with:
SERVICE="tailscale-sd"
kubectl get pod \
--selector=tailscale.com/parent-resource=${SERVICE} \
--namespace=tailscale
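If the egress proxy misbehaves, its logs are the next place to look (same label selector; a sketch):
kubectl logs \
  --selector=tailscale.com/parent-resource=${SERVICE} \
  --namespace=tailscale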
I wanted to verify that the service is working correctly:
kubectl run busyboxplus \
--stdin --tty --rm \
--image=docker.io/radial/busyboxplus:curl \
--namespace=default \
-- sh
NOTE The only time I “advertently” (!?) use the default namespace.
nslookup tailscale-sd.tailscale-sd.svc.cluster.local
curl http://tailscale-sd.tailscale-sd.svc.cluster.local:8080
[
{
"targets":...
},
...
]
Outputs the Prometheus Service Discovery format as expected. No jq
on the busybox container image.
Configure kube-prometheus to scrape the Service
kube-prometheus (v0.81.0) provides several CRDs, including ScrapeConfig.
This is straightforward and simply points to the HTTP endpoint exposed by the Service:
apiVersion: monitoring.coreos.com/v1alpha1
kind: ScrapeConfig
metadata:
  name: tailscale-sd
  namespace: tailscale-sd
spec:
  httpSDConfigs:
  - url: http://tailscale-sd.tailscale-sd.svc.cluster.local:8080/
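Once applied, confirm the resource exists (plain kubectl; nothing specific to this setup):
kubectl get scrapeconfig tailscale-sd \
  --namespace=tailscale-sd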
And then verify that the resulting Service Discovery (kube-prometheus names these scrapeConfig/{namespace}/{name}) exists and is finding the expected number of nodes:
PROMETHEUS="..."
NAME="tailscale-sd"
NAMESPACE="tailscale-sd"
POOL="scrapeConfig/${NAMESPACE}/${NAME}"
FILTER="[
.data.activeTargets[]
|select(.scrapePool==\"${POOL}\")
]
|length"
curl \
--silent \
${PROMETHEUS}/api/v1/targets \
| jq -r "${FILTER}"
And query the number of unique metrics available from the resulting job:
curl \
--silent \
--get \
--data-urlencode "query=count(
count by(__name__) (
{job=\"${POOL}\"}
)
)" \
${PROMETHEUS}/api/v1/query
{
"status":"success",
"data":{
"resultType":"vector",
"result":[
{
"metric":{},
"value":[1750444592.219,"14"]
}
]
}
}
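To spot-check an individual client metric, you can query one of the counters by name; tailscaled_outbound_bytes_total is one of the metrics the Tailscale client exposes, though the exact set varies by client version:
curl \
  --silent \
  --get \
  --data-urlencode "query=tailscaled_outbound_bytes_total{job=\"${POOL}\"}" \
  ${PROMETHEUS}/api/v1/query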
Or interact with it through the Prometheus MCP server ;-)
To delete everything:
Kubernetes resources:
kubectl delete scrapeconfig/tailscale-sd --namespace=tailscale-sd
kubectl delete service/tailscale-sd --namespace=tailscale-sd
Quadlet:
systemctl --user stop tailscale-sd.service
systemctl --user disable tailscale-sd.service
rm ~/.config/containers/systemd/tailscale-sd.container
systemctl --user daemon-reload
systemctl --user status tailscale-sd.service
That’s all!