Robusta KRR w/ GMP
I’ve been spending time recently optimizing Ackal’s use of Google Cloud Logging and Cloud Monitoring, as described in these posts:
- Filtering metrics w/ Google Managed Prometheus
- Kubernetes metrics, metrics everywhere
- Google Metric Diagnostics and Metric Data Ingested
Yesterday, I read that Robusta has a new open-source project, Kubernetes Resource Recommendations (KRR), so I took some time to evaluate it.
This post describes the changes I had to make to get KRR working with Google Managed Prometheus (GMP):
- Enable Kubelet|cAdvisor
- Enable kube-state-metrics
- Create a ClusterRules recording rule for KRR’s PromQL for CPU
- Revise KRR source to tweak KRR’s PromQL for Memory
Enable Kubelet|cAdvisor
The GMP configuration file operatorconfig/config in Namespace gmp-public needs to be revised (per the instructions) to include:
collection:
  kubeletScraping:
    interval: 30s
Additionally, and in anticipation of KRR’s PromQL queries, 3 metrics must be included:
collection:
  filter:
    matchOneOf:
    - '{__name__=~"kube_pod_info"}'
    - '{__name__=~"container_cpu_usage_seconds_total"}'
    - '{__name__=~"container_memory_working_set_bytes"}'
  kubeletScraping:
    interval: 30s
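One way to make this change (a sketch, assuming kubectl is configured against the cluster) is to edit the OperatorConfig in place and add the collection stanzas shown above:
# Opens the OperatorConfig in your editor; add kubeletScraping and filter.matchOneOf
kubectl edit operatorconfig/config \
--namespace=gmp-public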
Enable kube-state-metrics
You will need to apply the configuration described in Install Kube State Metrics.
curl \
--silent \
https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/examples/kube-state-metrics/kube-state-metrics.yaml \
| kubectl apply --filename=-
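As a quick (optional) check that kube-state-metrics is running, without assuming which Namespace the manifest uses:
# Locate the kube-state-metrics Deployment wherever it was installed
kubectl get deployments \
--all-namespaces \
| grep kube-state-metrics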
Create a ClusterRules
KRR uses 2 PromQL queries, as described in the repo’s README under metrics gathering. The first, corresponding to CPU, is defined as:
sum(
  node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{
    namespace="{object.namespace}",
    pod="{pod}",
    container="{object.container}"
  }
)
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is a recording rule that was not present in GMP. It uses 2 metrics (container_cpu_usage_seconds_total and kube_pod_info). I was able to find a reference to its implementation:
sum by (cluster, namespace, pod, container) (
  irate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])
)
* on (cluster, namespace, pod) group_left(node)
topk by (cluster, namespace, pod) (
  1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
)
NOTE I’ve removed the reference to a metrics_path label; that label is not provided by GMP’s kubelet|cAdvisor scraping, which is what exports container_cpu_usage_seconds_total
This must then be applied to the cluster:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: krr
spec:
  groups:
  - name: krr
    interval: 30s
    rules:
    - record: >-
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
      expr: >-
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])
        )
        * on (cluster, namespace, pod) group_left(node)
        topk by (cluster, namespace, pod) (
          1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
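Assuming the manifest is saved locally (krr.clusterrules.yaml is just my filename), apply and confirm it with:
# Apply the ClusterRules resource
kubectl apply \
--filename=krr.clusterrules.yaml

# Confirm the cluster-scoped resource exists
kubectl get clusterrules/krr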
Revise KRR source to tweak the PromQL for Memory
KRR’s second PromQL query, corresponding to Memory, is defined as:
sum(
  container_memory_working_set_bytes{
    job="kubelet",
    image!="",
    namespace="{object.namespace}",
    pod="{pod}",
    container="{object.container}"
  }
)
As hard-coded in the KRR source (robusta_krr/core/integrations/prometheus.py, L#136), the PromQL also uses the metrics_path label which is not provided by GMP. With metrics_path removed, the line becomes:
query=f'sum(container_memory_working_set_bytes{{job="kubelet", image!="", namespace="{object.namespace}", pod="{pod}", container="{object.container}"}})'
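Because this change means editing KRR itself, I’m assuming you’re running from a local clone of the repository rather than an installed package:
# Work from a local checkout so the hard-coded query can be edited
git clone https://github.com/robusta-dev/krr.git
cd krr

# Edit robusta_krr/core/integrations/prometheus.py (around line 136) to drop
# the metrics_path label, then install KRR's dependencies per the repo's README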
Run KRR
Now you should be able to run KRR. Interestingly, KRR utilizes Google’s Application Default Credentials and so, if you’re authenticated, you should be able to run:
PROJECT="..." # Google Cloud Project ID
MONITORING="https://monitoring.googleapis.com/v1"
ENDPOINT="${MONITORING}/projects/${PROJECT}/location/global/prometheus"
python krr.py simple \
--prometheus-url=${ENDPOINT}
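If Application Default Credentials aren’t already available, or you’d like to sanity-check the GMP endpoint first, something along these lines should work (assumes gcloud and jq are installed):
# Establish Application Default Credentials locally
gcloud auth application-default login

# Optional: query the Prometheus-compatible API directly
TOKEN=$(gcloud auth print-access-token)
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--data-urlencode "query=up" \
"${ENDPOINT}/api/v1/query" \
| jq .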
VerticalPodAutoscaler
Robusta KRR is an alternative to Vertical Pod Autoscaler (VPA) and Robusta documents the differences.
Ackal uses GKE and Vertical Pod autoscaling. In Ackal’s case, the VerticalPodAutoscaler definition is:
apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
name: ackal-system-controller-manager
namespace: ackal-system
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: ackal-system-controller-manager
updatePolicy:
updateMode: "Off"
I will compare the results from both tools.
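With updateMode: "Off", the VPA only records recommendations. One way to read them for that comparison (a sketch, assuming kubectl access):
# Show the VPA's current recommendations
kubectl describe verticalpodautoscaler/ackal-system-controller-manager \
--namespace=ackal-system

# Or just the recommendation stanza
kubectl get verticalpodautoscaler/ackal-system-controller-manager \
--namespace=ackal-system \
--output=jsonpath="{.status.recommendation}"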