Kubernetes metrics, metrics everywhere
I’ve been tinkering with ways to “unit-test” my assumptions when using cloud platforms. I recently wrote about two good posts from Google that describe how to achieve cost savings with Cloud Monitoring and Cloud Logging:
- How to identify and reduce costs of your Google Cloud observability in Cloud Monitoring
- Cloud Logging pricing for Cloud Admins: How to approach it & save cost
With Cloud Monitoring, I’ve restricted the prometheus.googleapis.com metrics that are being ingested but realized I wanted to track the number of Pods (and Containers) deployed to a GKE cluster.
With Cloud Logging, I’ve reduced the number of logs and, of the remaining logs, am filtering the log entries (fluentbit-gke, looking at you) that are being ingested.
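By way of illustration (a sketch rather than my exact configuration), an exclusion on the _Default sink for the noisier fluentbit-gke entries might look something like this; the log filter is an assumption about where GKE’s logging agent writes its entries:

```bash
# Sketch: exclude low-severity entries emitted by GKE's fluentbit-gke DaemonSet
# (kube-system) from the _Default sink
gcloud logging sinks update _Default \
  --add-exclusion='name=fluentbit-gke,filter=resource.type="k8s_container" resource.labels.namespace_name="kube-system" resource.labels.container_name="fluentbit-gke" severity<WARNING' \
  --project=${PROJECT}
```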
Because I’m using Cloud Endpoints with a Cloud Run proxy of the form proxy-xxxxxxxxxx-yy.a.run.app, each Project has a non-inferable name and an accompanying log. For this reason, I’m continuing to use Google’s default but rather “unconventional” inclusion filter of the form not log_id(l₁) and not log_id(l₂) ... and not log_id(lₓ) so as to include (!?) unexpected logs such as the Cloud Endpoints proxy log.
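For reference, the filter in question can be inspected directly (assuming the sink is the default _Default):

```bash
# Show the _Default sink's "not log_id(...) and not log_id(...)" inclusion filter
gcloud logging sinks describe _Default \
  --project=${PROJECT} \
  --format="value(filter)"
```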
I wanted to be able to alert if:
- Google adds new logs
- Google adds new prometheus.googleapis.com metrics
- The number of Pods (and Containers) on the GKE cluster changes
I am spoiled for choice.
Initially, I was unable to determine the best way to monitor these metrics.
For Cloud Logging logs, I’ve updated gcp-exporter to capture the number of Cloud Logging logs per Project. I’ve added a rule that alerts when the number of logs for Projects associated with Ackal exceeds the current (expected) number:
groups:
  - name: ackal
    rules:
      - alert: ackal_cloud_logging_logs
        expr: min_over_time(gcp_cloud_logging_logs{project=~"..."}[15m]) > 39
        for: 6h
        labels:
          severity: page
        annotations:
          summary: "Ackal Project ({{ $labels.project }}) has {{ $value }} logs"
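To corroborate the expected number (39 in the rule above), the current number of logs in a Project can be counted with gcloud; a rough sketch:

```bash
# Count the log names currently present in a Project
# (depending on gcloud's output format, you may need to discount a header line)
gcloud logging logs list \
  --project=${PROJECT} \
| wc -l
```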
I’m unsure how to alert on changes to the number of prometheus.googleapis.com metrics that are being ingested. Google doesn’t appear to support the PromQL that I’d customarily want to use for this. Something of the form:
count by(__name__) ({__name__=~".+"})
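Pending a better option, one workaround (a sketch, using Cloud Monitoring’s Prometheus-compatible HTTP API; the endpoint path is my reading of Google’s HTTP API docs) is to count the distinct metric names via the label-values endpoint rather than PromQL:

```bash
# Count distinct metric names (__name__ values) ingested into Managed Prometheus
curl \
  --silent \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  "https://monitoring.googleapis.com/v1/projects/${PROJECT}/location/global/prometheus/api/v1/label/__name__/values" \
| jq '.data | length'
```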
I also had quite the challenge trying to alert on the number of Pods (and Containers) deployed to the cluster.
Kubernetes is renowned for its abundance of metrics. The Kubernetes documentation (now?) has a useful Kubernetes Metrics Reference page, but the table doesn’t categorize the metrics by the producing component, e.g. the Kubelet.
After Googling (and OpenAI’ing !?) around, I realized I wanted kubelet_running_pods and kubelet_running_containers. A quick check using Google’s excellent Metrics Explorer and the Google-Managed Prometheus (GMP) UI showed that these metrics were missing from my configuration.
The challenge was determining how to enable them.
The challenge was compounded by the fact that GMP uses an operator and so documentation for enabling Kubelet metrics doesn’t necessarily apply, e.g. Monitoring Kubernetes layers: key metrics to know and How to Monitor the Kubelet.
I’ve used cAdvisor (although I don’t use it much these days, absent its support for Podman) and had an inkling that these metrics might be surfaced by cAdvisor.
Google has an interesting document, Available metrics, which I’d hoped would lead me to a solution but, after trawling through each option and the metrics associated with it, I’d still not found kubelet_running_pods.
Eventually, I stumbled upon Kubelet/cAdvisor, which doesn’t list the metrics that it provides, but I had a good inkling that it would include what I wanted (it does).
To enable these metrics, I needed to PATCH GMP’s operatorconfig/config, which I described (with its limitations) in Filtering metrics w/ Google Managed Prometheus. This time:
collection:
  filter:
    matchOneOf:
      - '{__name__=~"kubelet_.+"}'
  kubeletScraping:
    interval: 30s
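For completeness, a sketch of applying that change (assuming the default gmp-public/config OperatorConfig and that the YAML above is saved as patch.yaml):

```bash
# Merge-patch GMP's OperatorConfig with the collection filter and kubeletScraping config
kubectl patch operatorconfig/config \
  --namespace=gmp-public \
  --type=merge \
  --patch-file=patch.yaml
```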
It’s possible that the matchOneOf would be better as {job="kubelet"} rather than including metrics prefixed kubelet_ but, as I wrote, this is all rather poorly documented and so… who knows!
After PATCH’ing the operatorconfig/config with the change, I was able to read kubelet_running_pods and kubelet_running_containers.
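A quick way to confirm (a sketch using Cloud Monitoring’s Prometheus-compatible query API):

```bash
# Query one of the newly-ingested metrics through the Prometheus-compatible HTTP API
curl \
  --silent \
  --get \
  --header "Authorization: Bearer $(gcloud auth print-access-token)" \
  --data-urlencode "query=kubelet_running_pods" \
  "https://monitoring.googleapis.com/v1/projects/${PROJECT}/location/global/prometheus/api/v1/query" \
| jq '.data.result'
```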
These may seem like primitive metrics upon which to alert but my cluster should be relatively stable and I don’t expect there to be any (significant) changes in the number of Pods and Containers, so I’m confident this will be a useful “unit test” with which to monitor changes to the cluster.
I then had to choose whether to create GMP ClusterRules and use GMP’s Managed alerting, but I decided to use Cloud Monitoring Alerting because I’m already using it elsewhere in Ackal.
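For reference, the road not taken would have looked roughly like this (a sketch only; I’m assuming ClusterRules accepts standard Prometheus rule groups under spec.groups):

```bash
# Sketch only: the Pods alert expressed as a GMP ClusterRules resource
kubectl apply --filename=- <<'EOF'
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: gke-pods
spec:
  groups:
    - name: gke-pods
      interval: 30s
      rules:
        - alert: GKEUnexpectedPodCount
          expr: kubelet_running_pods > 10 # replace 10 with the expected number of Pods
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Unexpected number of Pods ({{ $value }}) running on GKE cluster"
EOF
```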
To corroborate my expectations, the following kubectl commands represent mostly (!) what I want to monitor:
# Number of Pods
kubectl get pods \
--all-namespaces \
--output=json \
| jq -r '.items|length'
# Number of Containers
kubectl get pods \
--all-namespaces \
--output=json \
| jq -r '[.items[].spec.containers|length]|add'
NOTE The number of Containers here is the desired (spec’d) number of containers whereas the policy (below) filters on containers that are running (see the running-only variant after this note).
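To match the policy’s running-state filter, here’s a variant (a sketch) that counts only running containers:

```bash
# Number of *running* Containers (to align with container_state="running" below)
kubectl get pods \
  --all-namespaces \
  --output=json \
| jq -r '[.items[].status.containerStatuses[]? | select(.state.running != null)] | length'
```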
I added two Policies:
NAME="gke-pods"
EXPECTED="..." # Number of Pods
gcloud alpha monitoring policies create \
--display-name="${NAME}" \
--documentation="Unexpected number of Pods running on GKE cluster" \
--notification-channels=${CHANNELS} \
--condition-display-name="${NAME}" \
--condition-filter="resource.type=\"prometheus_target\" metric.type=\"prometheus.googleapis.com/kubelet_running_pods/gauge\"" \
--duration="0s" \
--aggregation="{\"alignmentPeriod\":\"600s\", \"crossSeriesReducer\":\"REDUCE_NONE\", \"perSeriesAligner\": \"ALIGN_MIN\"}" \
--if=">${EXPECTED}" \
--trigger-count=1 \
--combiner="OR" \
--enabled \
--project=${PROJECT}
And:
NAME="gke-containers"
EXPECTED="..." # Number of containers
gcloud alpha monitoring policies create \
--display-name="${NAME}" \
--documentation="Unexpected number of Containers running on GKE cluster" \
--notification-channels=${CHANNELS} \
--condition-display-name="${NAME}" \
--condition-filter="resource.type=\"prometheus_target\" metric.type=\"prometheus.googleapis.com/kubelet_running_containers/gauge\" metric.labels.container_state=\"running\"" \
--duration="0s" \
--aggregation="{\"alignmentPeriod\":\"600s\", \"crossSeriesReducer\":\"REDUCE_NONE\", \"perSeriesAligner\": \"ALIGN_MIN\"}" \
--if=">${EXPECTED}" \
--trigger-count=1 \
--combiner="OR" \
--enabled \
--project=${PROJECT}
NOTE I should probably revise my deployment script’s gcloud alpha monitoring policies create commands to use --policy-from-file to specify the Policy entirely as YAML.
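Something like this (a sketch, untested, assuming the --policy-from-file flag and the AlertPolicy YAML schema; notification channels omitted):

```bash
# Define the "gke-pods" policy entirely as YAML and create it from the file
cat <<'EOF' > gke-pods-policy.yaml
displayName: gke-pods
documentation:
  content: Unexpected number of Pods running on GKE cluster
  mimeType: text/markdown
combiner: OR
enabled: true
conditions:
  - displayName: gke-pods
    conditionThreshold:
      filter: >-
        resource.type="prometheus_target"
        metric.type="prometheus.googleapis.com/kubelet_running_pods/gauge"
      comparison: COMPARISON_GT
      thresholdValue: 10 # replace with the expected number of Pods
      duration: 0s
      trigger:
        count: 1
      aggregations:
        - alignmentPeriod: 600s
          perSeriesAligner: ALIGN_MIN
          crossSeriesReducer: REDUCE_NONE
EOF

gcloud alpha monitoring policies create \
  --policy-from-file=gke-pods-policy.yaml \
  --project=${PROJECT}
```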
So, it’s nowhere near as straightforward as it should be to enumerate Kubernetes metrics (by component) and to include|exclude them on any Kubernetes distro, with or without a Prometheus Operator. But, with some persistence, most things are possible.
I continue to appreciate GMP. I’m unsure whether there’s much (any?) utility in having both Managed alerting and Cloud Monitoring Alerting (https://cloud.google.com/monitoring/alerts). Perhaps there would be if different (Alertmanager) receivers or (Cloud Monitoring) notification channels were available (although I think there aren’t).