Google Metric Diagnostics and Metric Data Ingested
I’ve been on an efficiency drive with Cloud Logging and Cloud Monitoring.
With regard to Cloud Logging, I’m contemplating (!) eliminating almost all log storage. As it is, I’ve buzz-cut log storage with a _Default sink that has comprehensive sets of NOT LOG_ID(X) inclusion and exclusion filters. As I was doing so, I began to wonder why I need to pay to store so much logging at all. There’s comfort in knowing that everything you may ever need is being logged (at least for 30 days), but there are also the costs that that entails. I use logs exclusively for debugging, which got me thinking: couldn’t I just capture logs when I’m debugging (rather than all the time)? I’ve not taken that leap yet, but I’m noodling on it.
With regard to Cloud Monitoring, I’ve only eliminated a small set of what I feel are redundant metrics from Google Managed Service for Prometheus (GMP). Metrics contribute to the ongoing value of Ackal by providing Ackal’s customers with Prometheus metrics for their gRPC health checks and by providing me with metrics on Ackal’s customers’ use of the service and on general service health. However, as I described in Filtering metrics w/ Google Managed Prometheus (GMP) and Kubernetes metrics, metrics everywhere, I’ve filtered these prometheus.googleapis.com (chargeable!) metrics down to 4 key metrics (see the query sketch after the list):
- ackal_ (×2)
- kubelet_running_containers
- kubelet_running_pods
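A quick way to confirm which chargeable metric names remain, and how many time-series each contributes, is to count series grouped by metric name. This is only a sketch: the regex assumes my ackal_ prefix and the two kubelet metric names above:
count by (__name__) ({__name__=~"ackal_.*|kubelet_running_containers|kubelet_running_pods"})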
My outstanding question was whether I could reduce the volume of Metric Data Ingested. For a given metric, e.g. kubelet_running_containers, there are 2 factors that contribute to the volume of Metric Data Ingested (both sketched in the queries after the list):
- The number of time-series
- The frequency of measurement
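Both factors can be read with PromQL. As a sketch (the 5-minute window is just an example; I return to these queries below):
count(kubelet_running_containers)
count_over_time(kubelet_running_containers[5m])
The first query returns the number of time-series (the cardinality); the second returns, per time-series, the number of samples ingested during the window. The volume of Metric Data Ingested is, roughly, the product of the two.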
The number of time-series is called the cardinality. In the case of kubelet_running_containers
, there are 3:
Metric | Value |
---|---|
kubelet_running_containers{container_state="created"} | 1 |
kubelet_running_containers{container_state="exited"} | 8 |
kubelet_running_containers{container_state="running"} | 29 |
In the above, I’ve removed the labels cluster, instance, job, location, node and project_id because their values were constant across these time-series. The only label value that changes is container_state; it has 3 values, and it is this that gives the metric set its cardinality. If I were to include a 2nd node, and both nodes surfaced metrics with the 3 container_state values, then the cardinality would be 6. Cardinality is important because, while we often think of kubelet_running_containers as a single metric, in my example above there are 3 time-series being produced and potentially (though not necessarily) 3 measurements being recorded for kubelet_running_containers on each scrape.
We can determine a metric’s cardinality by counting the number of time-series using e.g.:
count({__name__="kubelet_running_containers"})
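To see how a 2nd node would change this (the scenario described above), the same count can be broken down by node. A sketch, assuming the node label that I removed from the table above:
count by (node) ({__name__="kubelet_running_containers"})
With two nodes each exposing the 3 container_state series, this would report 3 per node, for a total cardinality of 6.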
The second important factor that we would like to understand is how many time-series data points have been ingested for each metric. If we issue the query below, we’ll get an instant vector (see table above) for the metrics at a single (absent further qualification, the most recent) point in time:
kubelet_running_containers
You can append @ {UNIX epoch}
to the above query to get the values at that instant in time.
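For example, using one of the Unix timestamps from my data below (and assuming the @ modifier is supported by the PromQL engine you’re querying):
kubelet_running_containers @ 1682460668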
If you instead append a range selector, e.g. [5m], you’ll get a range vector containing the samples recorded for each time-series during the past 5 minutes:
kubelet_running_containers[5m]
Using my data and selecting a single time-series (e.g. container_state="created"), I get 10 measurements. This is to be expected because, in this case, I configured inclusion of this metric using operatorconfig/config
and set a kubeletScraping:interval
of 30 seconds (scraping a measurement every 30 seconds for 5 minutes yields 10 measurements). Each measurement is 30.000
seconds later than its predecessor:
1 @1682460398.000
1 @1682460428.000
1 @1682460458.000
1 @1682460488.000
1 @1682460518.000
1 @1682460548.000
1 @1682460578.000
1 @1682460608.000
1 @1682460638.000
1 @1682460668.000
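These 10 samples can also be counted directly; over the same 5-minute window, the following should return 10:
count_over_time(kubelet_running_containers{container_state="created"}[5m])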
Google’s Metrics Diagnostic (for the past 7 days) displayed 3392 for kubelet_running_containers/gauge and 1132 for kubelet_running_pods/gauge for Metric Data Ingested. To verify these values, I used the following PromQL queries, which broadly corroborate Google’s values (the small differences are unsurprising since the two 7-day windows won’t align exactly):
sum (count_over_time(kubelet_running_containers[7d]))
3438
sum (count_over_time(kubelet_running_pods[7d]))
1147
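As a further sketch, the same approach can total the samples ingested over 7 days across all four retained metrics; the regex assumes my ackal_ prefix:
sum(count_over_time({__name__=~"ackal_.*|kubelet_running_containers|kubelet_running_pods"}[7d]))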