Filtering metrics w/ Google Managed Prometheus

April 13, 2023 - 4 minutes read - 842 words

Google has published two, very good blog posts on cost management:

This post is about my application cost reductions for Cloud Monitoring for Ackal.

I’m pleased with Google Cloud Managed Service for Prometheus (hereinafter GMP). I’ve a strong preference for letting service providers run components of Ackal that I consider important but non-differentiating.

Monitoring (primarily through Prometheus for Ackal’s metrics but also Google Cloud Monitoring directly for Google’s services’ metrics) is critically important and it’s great to delegate this responsibility to Google.

In the case of GMP, there are additional benefits:

Deployment is trivial; GMP is deployed as a flag (--enable-managed-prometheus) when I create GKE clusters
Configuration is straightforward (see below);
Google’s time-series store (Monarch) is used as the Data storage for GMP so I need not worry about storage plus 24-months’ storage is included;
Google surfaces all Cloud Monitoring metrics through PromQL and (seemingly) enables interacting with Cloud Monitoring using PromQL rather than Google’s rocket-scientists’ Monitoring Query Language

NOTE For example Cloud Run’s (run.googleapis.com) revision container instance count metric (run.googleapis.com/container/instance_count) is transparently mapped to the PromQL metric run_googleapis_com:container_instance_count. This is very useful!

I’m left with the responsibility to ensure that I manage the costs of benefitting from this service and, Google’s post How to identify and reduce costs of your Google Cloud observability in Cloud Monitoring (leads to content) that has helped me better understand this. The blog post links to Cost controls and attribution which provides extensive documentation on ways to reduce GMP costs.

Ackal surfaces very few (two) application metrics through GMP but, as I began this journey, I noticed Cloud Console’s excellent (new?) Metrics Diagnostics showed that I had 62 metrics.

The 62 metrics correspond to Google Cloud Monitoring metrics prefixed: prometheus.googleapis.com. Google’s metrics are prefixed with a service-specific URL (e.g. compute.googleapis.com, container.googleapis.com).

Another way to enumerate these metrics, which avoids screenshots is to use Cloud Monitoring API:

PROJECT="..." # Google Cloud Project ID

ENDPOINT="https://monitoring.googleapis.com/v3"

PREFIX="prometheus.googleapis.com"
FILTER="metric.type=starts_with(\"${PREFIX}\")"

QUERY="
  .metricDescriptors[].name
  |sub(\"projects/${PROJECT}/metricDescriptors/${PREFIX}/\";\"\")
"

TOKEN=$(gcloud auth print-access-token) && \
curl \
--silent \
--get \
--header "Authorization: Bearer ${TOKEN}" \
--header "Accept: application/json" \
--data-urlencode "filter=${FILTER}" \
--compressed \
"${ENDPOINT}/projects/${PROJECT}/metricDescriptors"\
| jq -r "${QUERY}"

For my project, this returned the following metrics with Ackal’s metrics redacted:

ackal_...
ackal_...
certwatcher_read_certificate_errors_total/counter
certwatcher_read_certificate_total/counter
controller_runtime_active_workers/gauge
controller_runtime_max_concurrent_reconciles/gauge
controller_runtime_reconcile_errors_total/counter
controller_runtime_reconcile_time_seconds/histogram
controller_runtime_reconcile_total/counter
go_gc_duration_seconds/summary
go_gc_duration_seconds_count/summary
go_gc_duration_seconds_sum/summary:counter
go_goroutines/gauge
go_info/gauge
go_memstats_alloc_bytes/gauge
go_memstats_alloc_bytes_total/counter
go_memstats_buck_hash_sys_bytes/gauge
go_memstats_frees_total/counter
go_memstats_gc_sys_bytes/gauge
go_memstats_heap_alloc_bytes/gauge
go_memstats_heap_idle_bytes/gauge
go_memstats_heap_inuse_bytes/gauge
go_memstats_heap_objects/gauge
go_memstats_heap_released_bytes/gauge
go_memstats_heap_sys_bytes/gauge
go_memstats_last_gc_time_seconds/gauge
go_memstats_lookups_total/counter
go_memstats_mallocs_total/counter
go_memstats_mcache_inuse_bytes/gauge
go_memstats_mcache_sys_bytes/gauge
go_memstats_mspan_inuse_bytes/gauge
go_memstats_mspan_sys_bytes/gauge
go_memstats_next_gc_bytes/gauge
go_memstats_other_sys_bytes/gauge
go_memstats_stack_inuse_bytes/gauge
go_memstats_stack_sys_bytes/gauge
go_memstats_sys_bytes/gauge
go_threads/gauge
leader_election_master_status/gauge
process_cpu_seconds_total/counter
process_max_fds/gauge
process_open_fds/gauge
process_resident_memory_bytes/gauge
process_start_time_seconds/gauge
process_virtual_memory_bytes/gauge
process_virtual_memory_max_bytes/gauge
rest_client_request_duration_seconds/histogram
rest_client_request_size_bytes/histogram
rest_client_requests_total/counter
rest_client_response_size_bytes/histogram
scrape_duration_seconds/gauge
scrape_samples_post_metric_relabeling/gauge
scrape_samples_scraped/gauge
scrape_series_added/gauge
up/gauge
workqueue_adds_total/counter
workqueue_depth/gauge
workqueue_longest_running_processor_seconds/gauge
workqueue_queue_duration_seconds/histogram
workqueue_retries_total/counter
workqueue_unfinished_work_seconds/gauge
workqueue_work_duration_seconds/histogram

None of the metrics in this list are (currently) relevant to me and yet, per the blog post, because they come under prometheus.googleapis.com, I’m paying for them. How can I filter ingestion of the prometheus.googleapis.com prefixed metrics that I don’t (currently) need? Here, Google’s documentation is less good.

I’m using managed collection and I’m familiar with Prometheus Operator CRD PodMonitor and GMP’s variant PodMonitoring

NOTE GMP doesn’t have an equivalent of ServiceMonitor

Google’s Cost controls and attribution page includes Reduce the number of time series

But the metricRelabeling example cited only allows you to filter metrics that are being ingested by your e.g. PodMonitoring Custom Resources. The 60 (extraneous) metrics shown above aren’t the result of PodMonitoring resources. I have a single PodMonitoring Resource and it ingests that 2 Ackal metrics that are of interest to me.

This Google Cloud Skills Boost Reduce Costs for Managed Service for Prometheus is really useful. The solution it describes is to add a filter to GMP’s OperatorConfig.

Specifically I want to only include metrics prefixed “ackal_”:

operatorconfig.patch.yaml:

collection:
  filter:
    matchOneOf:
    - '{__name__=~"ackal_.+"}'

And this can be achieved by kubectl patch‘ing, the existing OperatorConfig (config) in the gmp-public Namespace created when GMP was deployed to the cluster:

kubectl patch operatorconfig/config \
--namespace=gmp-public \
--type=merge \
--patch-file=path/to/operatorconfig.patch.yaml

This begs the question(s) as to how to automate this configuration so that you never ingest metrics that you don’t need.

Unfortunately, it’s not yet possible and you can only currently retroactively apply the above change after GMP has been deployed (and potentially after unecessary metrics have been ingested).

See this issue on the GoogleCloudPlatform/prometheus-engine repo.

After the above change, I can confirm that only ackal_ prefixed metrics are being ingested by reviewing Google’s Metrics Diagnostics dashboard.

The Google documentation states that “The exporters included in the `kube-prometheus project—the kube-state-metrics service in particular—can emit a lot of metric data” and I immediately decided I could reduce or eliminate these and it wasn’t immediately obvious but these aren’t included in the ingested metrics.

There’s a page Infrastructure exporters that lists these exporters where it is clear that they must be specifically enabled for metric collection. By default, they aren’t.

In conclusion, determine which prometheus.googleapis.com metrics are being ingested from your GMP deployments and, if you need to filter some, consider using OperatorConfig to do so.