Monitoring a Kubernetes Cluster from the Outside with Prometheus

When setting up a Kubernetes cluster, a Prometheus and Grafana setup is a great way to start monitoring your cluster’s health: CPU and RAM stats, filesystem usage, even the number and type of requests that your cluster is serving. Most setups, however, tend to assume that you’ll be deploying Prometheus within the cluster itself. To be sure, this is probably the easier way of setting things up: within the cluster, Prometheus will have no issues finding and monitoring the configured pods, endpoints and nodes.

The Problem with an Out-of-Cluster Prometheus Setup

That said, it’s not always feasible to deploy Prometheus within a particular cluster, for example when you want a single Prometheus instance to monitor multiple Kubernetes clusters across multiple service providers like AWS and Azure.

In this scenario, I’ve found it more convenient to host Prometheus separately, outside the Kubernetes clusters, and then set it up to monitor the clusters and their Services and Pods.

However, I ran into yet another problem: properly exposing the services to have them scraped by Prometheus, particularly in the case of Deployments that span more than one Pod. While I could include sidecar containers in each Pod exposing /metrics on a given port, I struggled to find a way to expose them properly to an out-of-cluster Prometheus. I could expose an Ingress and have Prometheus access the endpoint through the cluster’s main IP, but since Kubernetes Services deliver requests to their backing Pods in a round-robin fashion, each successive scrape could end up hitting a different Pod. This would lead to confusing metrics, to say the least. Ideally, we would want Prometheus to be able to distinguish between the metrics scraped from each individual Pod (with a different label for each), so that we could tell if, say, one Pod ended up serving more traffic than the others in the Deployment.

One option to address each Pod individually would be to expose one Ingress per Pod. Of course, this would have to be automated in some form, perhaps with yet another service watching for new Pods being spun up and creating the necessary exporters and Ingresses automatically, but this approach quickly becomes unwieldy, even before we consider what happens once we start scaling.

So now we have two problems: we want Prometheus to live outside the cluster, and we want it to scrape (and individually label) each Pod in a Deployment, without creating an Ingress or Service for every single Pod.

Abusing the API Server

As it turns out, the APIServer does in fact allow us to communicate directly with Pods, without creating an Ingress or a Service beforehand. All that’s needed are the proper credentials and authorisation to make HTTP requests through the APIServer.

It’s not terribly obvious from the outset, but even the very familiar kubectl communicates with the APIServer (and manages the cluster) via plain HTTP calls. (Run any kubectl command with the -v 10 option to see the HTTP calls being made in the background, e.g. kubectl -v 10 version.)

Access Control

In order to communicate with the APIServer through its API though, we’ll first need to set up some form of access control.

If you’re on Kubernetes < 1.6, you’ll have to use Attribute-based Access Control (ABAC), and if you’re running Kubernetes 1.6 or later, you’ll be able to use the more convenient Role-based Access Control (RBAC). A discussion on how to effectively use ABAC vs RBAC is beyond the scope of this post, but essentially, you’ll want to end up with an access token (e.g. a ServiceAccount’s Secret token value) that will allow you to make authenticated and authorised requests to the APIServer.
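On an RBAC-enabled cluster, a minimal setup might look like the sketch below. All names here are hypothetical; the important part is granting the pods and pods/proxy resources, which the proxy requests later in this post will need.

```yaml
# A sketch, assuming RBAC (Kubernetes 1.6+). Names and namespaces are
# hypothetical; scope the permissions to your own policies.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-external
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-external
rules:
  # 'pods' for service discovery, 'pods/proxy' for scraping via the proxy API
  - apiGroups: [""]
    resources: ["pods", "pods/proxy"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-external
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-external
subjects:
  - kind: ServiceAccount
    name: prometheus-external
    namespace: monitoring
```

The token in this ServiceAccount’s Secret is then the value you’d send in the Authorization header (and, later, in the bearer_token fields of the Prometheus configuration).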

If you’d just like to try it out, you could run kubectl -v 10 version, watch the HTTP calls, and simply use the values kubectl is sending in the Authorization HTTP header. For production setups however, I’d recommend setting up a proper ServiceAccount with appropriately scoped permissions.

Accessing Pods through the APIServer

It’s not commonly mentioned in the general Kubernetes documentation, but the APIServer does allow you to make requests directly to the Pods within the cluster.

It's hardly clear what you're supposed to do with this, and the lack of documented examples doesn't help either. In the API reference, these show up under operations with names like "Get Connect Proxy Path".

However, with the handy documentation on this page, we can make HTTP calls directly to each Pod through the Kubernetes APIServer, without needing to create a specific Ingress for each Pod that we’d like Prometheus to scrape. This means we can afford to expose metrics pages only on cluster-local IP addresses, without worrying about those pages leaking out to the public Internet.

From the Kubernetes API documentation, we can refer to the sections on Proxy operations for the various Kubernetes objects. For example, the Pod proxy operations show us how to reach out to a specific Pod through the Kubernetes API.

Assume we have a Prometheus exporter Pod, prom-exporter, in the namespace monitoring, exposing metrics at the path /metrics, that we’d like to scrape.

The general pattern of the request looks like

GET /api/v1/namespaces/{namespace}/pods/{name}/proxy/{path}

We can make a request to the Pod via the call below

GET /api/v1/namespaces/monitoring/pods/prom-exporter/proxy/metrics
# As a curl command, it should look something like
# $ curl "https://<api_server:port>/api/v1/namespaces/monitoring/pods/prom-exporter/proxy/metrics"

which should give us our exported metrics.
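As a quick shell sketch of assembling that request: the <api_server:port> placeholder stands in for your actual APIServer address, and the curl line is left commented out since it needs a live cluster and a valid token.

```shell
# Build the proxy URL for the example pod above.
# <api_server:port> is a placeholder for your actual APIServer address.
APISERVER="https://<api_server:port>"
NAMESPACE="monitoring"
POD="prom-exporter"

url="${APISERVER}/api/v1/namespaces/${NAMESPACE}/pods/${POD}/proxy/metrics"
echo "$url"

# Against a live cluster, with a ServiceAccount token and the APIServer's CA:
# curl -s -H "Authorization: Bearer ${KUBERNETES_TOKEN}" --cacert ca.crt "$url"
```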


Naturally, when you’re setting up Prometheus to perform scraping through the proxy API in this manner, you’ll want to connect over HTTPS to ensure that your metrics are not leaked to third parties on the wire. However, since Kubernetes APIServer SSL certificates are usually self-signed, you’ll also want to include your APIServer’s CA certificate in your Prometheus configuration so that Prometheus can authenticate the server.

In your prometheus.yml,

    scrape_configs:
      - job_name: 'natsd'
        scheme: https
        bearer_token: "$KUBERNETES_TOKEN"
        tls_config:
          ca_file: /etc/prometheus/tls/certs/kubernetes.crt

        kubernetes_sd_configs:
          - api_server: ''  # your APIServer's address
            bearer_token: "$KUBERNETES_TOKEN"
            tls_config:
              ca_file: /etc/prometheus/tls/certs/kubernetes.crt
            role: pod

Extracting the CA certificate from the APIServer is a matter of running (assuming the APIServer is reachable at <api_server:port>):

$ openssl s_client -connect <api_server:port> < /dev/null | openssl x509 -text
<... truncated ...>
    Signature Algorithm: sha256WithRSAEncryption

The important bit is between the BEGIN CERTIFICATE and END CERTIFICATE lines, inclusive. Save that to a file named ca.crt (for example), and point the ca_file settings in your prometheus.yml at it.
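To capture just the PEM block without copy-pasting, openssl x509 can re-emit it directly. A sketch follows, demonstrated against a throwaway self-signed certificate since it needs no live APIServer; against your cluster, you’d pipe from the openssl s_client command above instead.

```shell
# Generate a throwaway self-signed cert to stand in for the APIServer's
# (hypothetical; with a real cluster, use the s_client pipeline above).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/demo.key \
  -subj "/CN=demo-apiserver" -days 1 -out /tmp/demo.crt 2>/dev/null

# -outform PEM emits only the BEGIN/END CERTIFICATE block, ready for ca_file:
openssl x509 -in /tmp/demo.crt -outform PEM > /tmp/ca.crt

grep -c "BEGIN CERTIFICATE" /tmp/ca.crt
```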

Putting it together

Eventually, our configuration ends up looking something like this

    scrape_configs:
      - job_name: 'natsd'
        scheme: https
        bearer_token: "$KUBERNETES_TOKEN"
        tls_config:
          ca_file: /etc/prometheus/tls/certs/kubernetes.crt

        kubernetes_sd_configs:
          - api_server: ''  # your APIServer's address
            bearer_token: "$KUBERNETES_TOKEN"
            tls_config:
              ca_file: /etc/prometheus/tls/certs/kubernetes.crt
            role: pod

        # Tells Prometheus to query the APIServer for all pods whose names match natsd-* below
        # and, for each of the pods, generate a scrape target at the `/metrics` path via the proxy API
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name, __meta_kubernetes_pod_container_port_name]
            action: keep
            regex: default;natsd-.*;metrics  # Remember to use the right `container_port_name` as specified in the Deployment
          - target_label: __address__
            replacement: ''  # API server address
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_pod_name, __meta_kubernetes_pod_container_port_number]
            regex: (.+);(.+);(.+)
            target_label: __metrics_path__
            replacement: /api/v1/namespaces/${1}/pods/http:${2}:${3}/proxy/metrics  # Path after /proxy/ is the pod's metrics path
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(service|tier|type)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod_name
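To see what the __metrics_path__ relabel rule above actually produces, here is a sketch simulating it in shell for a hypothetical pod natsd-0 exposing container port 8222 (Prometheus joins source_labels with ; before matching):

```shell
# Hypothetical source label values: namespace;pod_name;container_port_number
src="default;natsd-0;8222"

# The same (.+);(.+);(.+) capture-and-replace as the relabel rule above:
path="$(printf '%s' "$src" | sed -E 's|(.+);(.+);(.+)|/api/v1/namespaces/\1/pods/http:\2:\3/proxy/metrics|')"
echo "$path"
```

Each discovered Pod thus gets its own scrape target and its own pod_name label, which is exactly the per-Pod distinction we were after.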

With this, we can expose metrics on each individual natsd-* Pod within the Kubernetes cluster, without needing to set up an Ingress or a Service for the sole purpose of allowing an out-of-cluster Prometheus setup to access those metrics.