Matt Horan's Blog

Blackbox Monitoring with Prometheus

Prior to migrating from Cacti to Prometheus for infrastructure monitoring, I’d already been using Prometheus for blackbox monitoring. A couple of years ago I was looking for a way to monitor the health of various services I had deployed across virtual machines and containers running on my home network. I had used Pingdom for this in the past, but they killed their free plan in 2019. I had quite a few services to monitor, including multiple Web servers, a mail server, an IRC server, and more. I surveyed the hosted service landscape, but the available free options didn’t support the variety of services I needed to monitor, and the paid services cost as much as a single VPS at ARP Networks.

Since I was already running some infrastructure on Google Cloud Platform, I first took a look at Stackdriver (now Google Cloud Monitoring). Google Cloud Monitoring has support for uptime checks, which can be used to perform blackbox monitoring of HTTP, HTTPS, and TCP services. However, at the time, uptime checks did not check for SSL certificate expiration, something I considered important given I was using 90-day Let’s Encrypt certificates and wanted to ensure that certificate rotation was working as expected. The TCP checks were also rather limited, simply confirming that a connection could be made. I wanted to check the health of TCP services (e.g. SMTP, IMAP, and IRC) as well.

I stumbled across Cloudprober and tried to get it working with Stackdriver, but I never quite got it working the way I wanted. Finally, I discovered blackbox_exporter. I had no experience with Prometheus, but since everyone was talking about it at the time, I figured it was time to learn.

To deploy blackbox_exporter on Kubernetes I used the following manifest:

apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-exporter
  namespace: prometheus
data:
  config.yml: |
    modules:
      http_2xx:
        prober: http
        http:
          preferred_ip_protocol: ip4
          headers:
            User-Agent: Blackbox-Exporter
      imap_starttls:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: ip4
          query_response:
            - expect: "OK.*STARTTLS"
            - send: ". STARTTLS"
            - expect: "OK"
            - starttls: true
            - send: ". capability"
            - expect: "CAPABILITY IMAP4rev1"
      smtp_starttls:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: ip4
          query_response:
            - expect: "^220 ([^ ]+) ESMTP (.+)$"
            - send: "EHLO prober"
            - expect: "^250-STARTTLS"
            - send: "STARTTLS"
            - expect: "^220"
            - starttls: true
            - send: "EHLO prober"
            - expect: "^250-AUTH"
            - send: "QUIT"
      dns:
        prober: dns
        timeout: 5s
        dns:
          preferred_ip_protocol: ip4
          query_name: "matthoran.com"
          query_type: "A"
          validate_answer_rrs:
            fail_if_not_matches_regexp:
            - "206\\.125\\.170\\.166"    
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: blackbox-exporter
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: blackbox-exporter
    spec:
      containers:
      - image: prom/blackbox-exporter:v0.22.0
        name: blackbox-exporter
        securityContext:
          runAsUser: 65534
        resources:
          requests:
            cpu: "100m"
          limits:
            cpu: "100m"
        volumeMounts:
        - mountPath: /etc/blackbox_exporter
          name: blackbox-exporter
      volumes:
      - configMap:
          defaultMode: 420
          name: blackbox-exporter
        name: blackbox-exporter
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: blackbox-exporter
  name: blackbox-exporter
  namespace: prometheus
spec:
  ports:
  - port: 9115
    protocol: TCP
    targetPort: 9115
  selector:
    app: blackbox-exporter
  type: ClusterIP

The modules in the configuration above define what sort of blackbox testing blackbox_exporter can perform. For testing HTTP(S) targets I’ve defined an http_2xx module, which does what you’d expect: it checks that an HTTP request returns a 2xx response code. The imap_starttls and smtp_starttls modules speak IMAP and SMTP, respectively, and upgrade the connection to TLS via STARTTLS. This lets me verify the certificates presented by my mail server, and that the services are actually up and responding, rather than just confirming that a TCP socket can be opened. Finally, I’ve defined a dns module, which verifies that the target DNS server returns the expected records.
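
One thing not shown above is a module for the IRC server I mentioned earlier. blackbox_exporter’s example configuration includes an irc_banner module for roughly this purpose; a sketch of it, adapted to the style above (I’ve added preferred_ip_protocol to match my other modules), would look something like this:

      irc_banner:
        prober: tcp
        timeout: 5s
        tcp:
          preferred_ip_protocol: ip4
          query_response:
            # register a throwaway nick and user
            - send: "NICK prober"
            - send: "USER prober prober prober :prober"
            # answer the server's PING so it doesn't drop the connection
            - expect: "PING :([^ ]+)"
              send: "PONG ${1}"
            # 001 (RPL_WELCOME) means registration succeeded
            - expect: "^:[^ ]+ 001"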

Standing up Prometheus on Kubernetes is pretty simple as well. Today I’m using Google Managed Prometheus, but at the time I stood up Prometheus using a Deployment, a PersistentVolume, and a PersistentVolumeClaim. I’d strongly recommend using some sort of managed Prometheus solution so that you don’t have to manage persistence yourself. Google Managed Prometheus handles long-term metrics retention automatically, down-sampling as appropriate and retaining data for up to 24 months.

Assuming you’ve deployed Google Managed Prometheus according to my post on the subject, the following ConfigMap tells Prometheus which targets to scrape via blackbox_exporter:

apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-targets
  namespace: prometheus
data:
  dns.yml: |
    - labels:
        module: dns
      targets:
      - matthoran.com:53
      - ns1.he.net:53
      - ns2.he.net:53
      - ns3.he.net:53
      - ns4.he.net:53
      - ns5.he.net:53    
  http.yml: |
    - labels:
        module: http_2xx  # Look for an HTTP 2xx response.
      targets:
      - https://matthoran.com
      - https://blog.matthoran.com    
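
The smtp_starttls and imap_starttls modules defined earlier need targets as well. I haven’t reproduced my mail targets here, but a hypothetical mail.yml entry in the same ConfigMap would follow the same file_sd format (mail.example.com is a placeholder; substitute your own mail server and ports):

  mail.yml: |
    - labels:
        module: smtp_starttls
      targets:
      - mail.example.com:25
    - labels:
        module: imap_starttls
      targets:
      - mail.example.com:143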

The prometheus ConfigMap needs to be modified to load the blackbox targets from the *.yml files defined in the ConfigMap above:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: prometheus
data:
  config.yaml: |
    global:
      scrape_interval: 5m
      scrape_timeout: 30s

    scrape_configs:
    - job_name: 'blackbox'
      metrics_path: /probe
      file_sd_configs:
      - files:
        - '/etc/prometheus/blackbox/targets/*.yml'
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [module]
          target_label: __param_module
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: blackbox-exporter:9115  # The blackbox exporter's real hostname:port.    

Finally, Prometheus itself and the config-reloader sidecar need to be made aware of the blackbox targets configuration. Make the following changes to your prometheus StatefulSet:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: prometheus
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  serviceName: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      automountServiceAccountToken: true
      nodeSelector:
        kubernetes.io/arch: amd64
        kubernetes.io/os: linux
      containers:
      - name: prometheus
        image: gke.gcr.io/prometheus-engine/prometheus:v2.35.0-gmp.2-gke.0
        args:
        - --config.file=/prometheus/config_out/config.yaml
        - --storage.tsdb.path=/prometheus/data
        - --storage.tsdb.retention.time=24h
        - --web.enable-lifecycle
        - --storage.tsdb.no-lockfile
        - --web.route-prefix=/
        ports:
        - name: web
          containerPort: 9090
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
            scheme: HTTP
        resources:
          requests:
            cpu: "100m"
          limits:
            cpu: "100m"
        volumeMounts:
        - name: config-out
          mountPath: /prometheus/config_out
          readOnly: true
        - name: prometheus-db
          mountPath: /prometheus/data
        - mountPath: /etc/prometheus/blackbox/targets
          name: blackbox-targets
      - name: config-reloader
        image: gke.gcr.io/prometheus-engine/config-reloader:v0.4.3-gke.0
        args:
        - --config-file=/prometheus/config/config.yaml
        - --config-file-output=/prometheus/config_out/config.yaml
        - --reload-url=http://localhost:9090/-/reload
        - --listen-address=:19091
        - --watched-dir=/prometheus/blackbox/targets
        ports:
        - name: reloader-web
          containerPort: 8080
        resources:
          requests:
            cpu: "50m"
          limits:
            cpu: "50m"
        volumeMounts:
        - name: config
          mountPath: /prometheus/config
        - name: config-out
          mountPath: /prometheus/config_out
        - mountPath: /prometheus/blackbox/targets
          name: blackbox-targets
      terminationGracePeriodSeconds: 600
      volumes:
      - name: prometheus-db
        emptyDir: {}
      - name: config
        configMap:
          name: prometheus
          defaultMode: 420
      - name: config-out
        emptyDir: {}
      - configMap:
          defaultMode: 420
          name: blackbox-targets
        name: blackbox-targets

With all this in place, Prometheus will probe your blackbox targets every 5 minutes (or your scrape_interval).
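
Once probes are running, you can sanity-check them from the Prometheus (or Grafana) query UI before wiring up any alerting. For example, with the job name used above:

# 1 if the last probe succeeded, 0 if it failed
probe_success{job="blackbox"}

# how long each probe took, in seconds
probe_duration_seconds{job="blackbox"}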

Having these metrics in Prometheus (and Monarch, for Google Managed Prometheus) in and of itself isn’t particularly useful. I needed some alerting! In my first endeavor setting up Prometheus and blackbox_exporter, I used Prometheus alerting along with Alertmanager to send email and SMS notifications for alerts. This worked great, but had a few weaknesses. First, I was using my cell phone provider’s email-to-SMS gateway to send SMS messages, which didn’t always work well. Also, since Alertmanager was running on GKE, where outbound connections on port 25 are blocked, I needed to relay mail through SendGrid. While this worked fine, I wanted something better. Finally, there was a single point of failure in that I was managing the monitoring and alerting platform on GKE.

I’d separately been playing with Grafana Cloud and discovered that Grafana OnCall had recently been rolled out. Grafana OnCall could replace the unreliable email-to-SMS gateway from my cell phone provider. Plus, if I moved to Grafana managed alerts, I could remove the single point of failure for alerting.

Moving from Prometheus alerting to Grafana managed alerts was pretty easy. I now reliably get SMS messages for outages, and the same grouping functionality is available from Grafana managed alerts as with Prometheus and Alertmanager. Plus I can easily check on the status of my blackbox probes, for example by graphing SSL certificate expiry.

Screenshot of SSL Cert Expiry graph
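
The graph above is built on blackbox_exporter’s probe_ssl_earliest_cert_expiry metric, which reports the expiry time of the earliest-expiring certificate in the chain as a Unix timestamp. A query along these lines (not necessarily the exact one behind the screenshot) plots days until expiry per target:

# days until the earliest certificate presented by each target expires
(probe_ssl_earliest_cert_expiry{job="blackbox"} - time()) / 86400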

For alerting on probe failure, I have an alert set up with the following query:

min by(instance) (up{job="blackbox"} == 0 or probe_success{job="blackbox"})

More completely, the alert rule is configured as follows:

Screenshot of Grafana alert rule configuration

The query is a bit convoluted, so I’ll take a second to explain it. First, we check for any blackbox targets whose scrape failed (up == 0), which means Prometheus couldn’t get a response from the blackbox exporter for that target (either the exporter is down or the probe timed out). Second, given the scrape succeeded, we check whether the probe itself succeeded via probe_success. The or is a bit tricky here, but it is simply the union of the two: because of the == 0 filter, a healthy scrape contributes only its probe_success series, and a failed scrape contributes only its up series, so each instance ends up with a single value that is 0 whenever something is wrong.
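
To see how the union behaves, the two branches can be run on their own; for any given instance only one of them returns a series, which is why the or works as a fallback:

# scrape of the exporter failed for this target: returns the series with value 0
up{job="blackbox"} == 0

# scrape succeeded: 1 if the probe passed, 0 if it failed
probe_success{job="blackbox"}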

Grafana notification policies can be used to configure how alerts are routed to contact points. This can include simple email alerts or an on-call rotation through Grafana OnCall or PagerDuty. Below is a screenshot of my notification policy. I’ve configured the “root policy” to send email notifications only, waiting 30 seconds to group alerts prior to the first notification, and 5 minutes for subsequent alerts. Finally, notifications will be resent every 3 hours. For critical systems, I’ve overridden the root policy to use Grafana OnCall, resulting in both SMS and email notifications for those alerts.

Screenshot of Grafana notification policies