Robusta KRR w/ GMP
I’ve been spending time recently optimizing Ackal’s use of Google Cloud Logging and Cloud Monitoring, as described in these posts:
- Filtering metrics w/ Google Managed Prometheus
- Kubernetes metrics, metrics everywhere
- Google Metric Diagnostics and Metric Data Ingested
Yesterday, I read that Robusta has a new open-source project, Kubernetes Resource Recommendations (KRR), so I took some time to evaluate it.
This post describes the changes I had to make to get KRR working with Google Managed Prometheus (GMP):
- Enable Kubelet|cAdvisor
- Enable kube-state-metrics
- Create a ClusterRules recording rule for KRR’s PromQL for CPU
- Revise KRR source to tweak KRR’s PromQL for Memory
Enable Kubelet|cAdvisor
The GMP configuration file operatorconfig/config in Namespace gmp-public needs to be revised (per the instructions) to include:
collection:
  kubeletScraping:
    interval: 30s
Additionally, and in anticipation of KRR’s PromQL queries, 3 metrics must be included:
collection:
  filter:
    matchOneOf:
    - '{__name__=~"kube_pod_info"}'
    - '{__name__=~"container_cpu_usage_seconds_total"}'
    - '{__name__=~"container_memory_working_set_bytes"}'
  kubeletScraping:
    interval: 30s
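One way to make this change (a sketch, assuming kubectl is configured against the cluster) is to edit the OperatorConfig in place and add the collection stanzas shown above:
# Opens the OperatorConfig in your editor; add kubeletScraping and filter.matchOneOf
kubectl edit operatorconfig/config \
--namespace=gmp-public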
Enable kube-state-metrics
You will need to apply the configuration described in Install Kube State Metrics.
curl \
--silent \
https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/examples/kube-state-metrics/kube-state-metrics.yaml \
| kubectl apply --filename=-
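As a quick (optional) check that kube-state-metrics is running, without assuming which Namespace the manifest uses:
# Locate the kube-state-metrics Deployment wherever it was installed
kubectl get deployments \
--all-namespaces \
| grep kube-state-metrics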
Create a ClusterRules
KRR uses 2 PromQL queries, as described in the repo’s README under metrics gathering. The first, corresponding to CPU, is defined as:
sum(
  node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{
    namespace="{object.namespace}",
    pod="{pod}",
    container="{object.container}"
  }
)
node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate is a recording rule that was not present in GMP. It uses 2 metrics (container_cpu_usage_seconds_total and kube_pod_info). I was able to find a reference to its implementation:
sum by (cluster, namespace, pod, container) (
  irate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])
)
* on (cluster, namespace, pod) group_left(node)
topk by (cluster, namespace, pod) (
  1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
)
NOTE I’ve removed the reference to a metrics_path label; that label is not provided by GMP’s kubelet|cAdvisor scraping, which is what exports container_cpu_usage_seconds_total
This must then be applied to the cluster:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: krr
spec:
  groups:
  - name: krr
    interval: 30s
    rules:
    - record: >-
        node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate
      expr: >-
        sum by (cluster, namespace, pod, container) (
          irate(container_cpu_usage_seconds_total{job="kubelet", image!=""}[5m])
        )
        * on (cluster, namespace, pod) group_left(node)
        topk by (cluster, namespace, pod) (
          1, max by (cluster, namespace, pod, node) (kube_pod_info{node!=""})
        )
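Assuming the manifest is saved locally (krr.clusterrules.yaml is just my filename), apply and confirm it with:
# Apply the ClusterRules resource
kubectl apply \
--filename=krr.clusterrules.yaml

# Confirm the cluster-scoped resource exists
kubectl get clusterrules/krr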
Revise KRR source to tweak the PromQL for Memory
KRR’s second PromQL query, corresponding to Memory, is defined as:
sum(
  container_memory_working_set_bytes{
    job="kubelet",
    image!="",
    namespace="{object.namespace}",
    pod="{pod}",
    container="{object.container}"
  }
)
As hard-coded in the KRR source (robusta_krr/core/integrations/prometheus.py, L#136), the PromQL also uses the metrics_path label which is not provided by GMP. With metrics_path removed, the line becomes:
query=f'sum(container_memory_working_set_bytes{{job="kubelet", image!="", namespace="{object.namespace}", pod="{pod}", container="{object.container}"}})'
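Because this change means editing KRR itself, I’m assuming you’re running from a local clone of the repository rather than an installed package:
# Work from a local checkout so the hard-coded query can be edited
git clone https://github.com/robusta-dev/krr.git
cd krr

# Edit robusta_krr/core/integrations/prometheus.py (around line 136) to drop
# the metrics_path label, then install KRR's dependencies per the repo's README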
Run KRR
Now you should be able to run KRR. Interestingly, KRR utilizes Google’s Application Default Credentials and so, if you’re authenticated, you should be able to run:
PROJECT="..." # Google Cloud Project ID
MONITORING="https://monitoring.googleapis.com/v1"
ENDPOINT="${MONITORING}/projects/${PROJECT}/location/global/prometheus"
python krr.py simple \
--prometheus-url=${ENDPOINT}
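If Application Default Credentials aren’t already available, or you’d like to sanity-check the GMP endpoint first, something along these lines should work (assumes gcloud and jq are installed):
# Establish Application Default Credentials locally
gcloud auth application-default login

# Optional: query the Prometheus-compatible API directly
TOKEN=$(gcloud auth print-access-token)
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--data-urlencode "query=up" \
"${ENDPOINT}/api/v1/query" \
| jq .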
VerticalPodAutoscaler
Robusta KRR is an alternative to Vertical Pod Autoscaler (VPA) and Robusta documents the differences.
Ackal uses GKE and Vertical Pod autoscaling. In Ackal’s case, the VerticalPodAutoscaler definition is:
apiVersion: "autoscaling.k8s.io/v1"
kind: VerticalPodAutoscaler
metadata:
name: ackal-system-controller-manager
namespace: ackal-system
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: ackal-system-controller-manager
updatePolicy:
updateMode: "Off"
I will compare the results from both tools.
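With updateMode: "Off", the VPA only records recommendations. One way to read them for that comparison (a sketch, assuming kubectl access):
# Show the VPA's current recommendations
kubectl describe verticalpodautoscaler/ackal-system-controller-manager \
--namespace=ackal-system

# Or just the recommendation stanza
kubectl get verticalpodautoscaler/ackal-system-controller-manager \
--namespace=ackal-system \
--output=jsonpath="{.status.recommendation}"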