Monitoring & Observability¶

Monitoring is a crucial aspect of maintaining the health and performance of your applications.
It involves collecting, analyzing, and using information to ensure that your system is running smoothly and efficiently. Effective monitoring helps in identifying issues early, understanding system behavior, and making informed decisions to improve performance and reliability.

For external-dns, all metrics available for scraping are exposed on the /metrics endpoint. The metrics are in the Prometheus exposition format, which is widely used for monitoring and alerting.

To access the metrics:

curl https://localhost:7979/metrics

In the metrics output, you’ll see the help text, type information, and current value of the external_dns_registry_endpoints_total counter:

# HELP external_dns_registry_endpoints_total Number of Endpoints in the registry
# TYPE external_dns_registry_endpoints_total gauge
external_dns_registry_endpoints_total 11

You can configure a locally running Prometheus instance to scrape metrics from the application. Here’s an example prometheus.yml configuration:

scrape_configs:
- job_name: external-dns
  scrape_interval: 10s
  static_configs:
  - targets:
    - localhost:7979

For more detailed information on how to instrument application with Prometheus, you can refer to the Prometheus Go client library documentation.

What metrics can I get from ExternalDNS and what do they mean?¶

The project maintain a metrics page with a list of supported custom metrics.
Go runtime metrics also available for scraping.

ExternalDNS exposes 3 types of metrics: Sources, Registry errors and Cache hits.

Sources are mostly Kubernetes API objects. Examples of source errors may be connection errors to the Kubernetes API server itself or missing RBAC permissions.
It can also stem from incompatible configuration in the objects itself like invalid characters, processing a broken fqdnTemplate, etc.

Registry errors are mostly Provider errors, unless there’s some coding flaw in the registry package. Provider errors often arise due to accessing their APIs due to network or missing cloud-provider permissions when reading records.
When applying a changeset, errors will arise if the changeset applied is incompatible with the current state.

In case of an increased error count, you could correlate them with the http_request_duration_seconds{handler="instrumented_http"} metric which should show increased numbers for status codes 4xx (permissions, configuration, invalid changeset) or 5xx (apiserver down).

You can use the host label in the metric to figure out if the request was against the Kubernetes API server (Source errors) or the DNS provider API (Registry/Provider errors).