- 16. Sep
The Prometheus Pushgateway exists to allow ephemeral and batch jobs to expose their metrics to Prometheus. Since these kinds of jobs may not exist long enough to be scraped, they can instead push their metrics to a Pushgateway. The Pushgateway then exposes these metrics to Prometheus.
Common pitfalls when using the Pushgateway
Jobs of an ephemeral nature are often not around long enough to have their metrics scraped by Prometheus. In order to remedy this, the Pushgateway was developed to allow for these types of jobs to push their metrics to a metrics cache in order to be scraped by Prometheus long after the original jobs have gone away. This blog post discusses some of the common pitfalls users tend to fall into when adding the Pushgateway to their monitoring stack.
In the majority of cases, the recommended approach to monitoring something with Prometheus is to stick to the normal pull model. There are, however, a small number of cases where the Pushgateway is preferable. Usually, the only valid use case for the Pushgateway is collecting metrics from service-level batch jobs.
One common pitfall of using the Pushgateway with Prometheus is that it becomes a single point of failure. If your Pushgateway is collecting metrics from many different sources and goes down, you will lose monitoring of all of those sources, potentially triggering a lot of needless alerts.
Another important point to remember is that the Pushgateway does not automatically remove any metrics pushed to it. Metrics whose source has disappeared therefore stay in the Pushgateway, and Prometheus keeps scraping them. This is particularly evident with metrics containing an instance label (which should not be going to the Pushgateway in the first place, as they are not service-level). Instances may come and go, but the old metrics for the expired instances will remain in the Pushgateway and thus in Prometheus. To keep the two in sync, you must remember to delete expired metrics from the Pushgateway using its API:
curl -X DELETE http://pushgateway.example.org:9091/metrics/job/some_job/instance/some_instance
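When several instances have expired at once, the same call can be wrapped in a small loop. The following is a minimal sketch rather than a finished tool. It assumes job and instance names contain no `/` characters (those would need the Pushgateway's base64 path encoding, which is not handled here). The function names, the `PUSHGATEWAY_URL` variable, and the `DRY_RUN` switch are made up for illustration.

```shell
# Base URL of the Pushgateway (hostname taken from the example above).
PUSHGATEWAY_URL=${PUSHGATEWAY_URL:-http://pushgateway.example.org:9091}

# Build the grouping-key URL for a job ($1) and an instance ($2).
group_url() {
  printf '%s/metrics/job/%s/instance/%s' "$PUSHGATEWAY_URL" "$1" "$2"
}

# Delete the group of every instance name read from stdin, one per line.
# With DRY_RUN=1 the curl commands are printed instead of executed.
delete_instances() {
  job=$1
  while IFS= read -r instance; do
    [ -n "$instance" ] || continue
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "curl -X DELETE $(group_url "$job" "$instance")"
    else
      curl -fsS -X DELETE "$(group_url "$job" "$instance")"
    fi
  done
}

# Example (not run here):
#   printf 'host-a\nhost-b\n' | delete_instances some_job
```

Running it once with `DRY_RUN=1` shows exactly which groups would be removed before anything is deleted.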
Another typical misuse of the Pushgateway is as a way to circumvent firewall or NAT issues that prevent Prometheus from scraping its desired targets. Rather than having targets push metrics to the Pushgateway for Prometheus to scrape, the recommended approach is to move Prometheus behind that firewall, closer to the targets you want to scrape.
For getting around NAT, you can try Robust Perception's own PushProx.
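Finally, remember that when using the Pushgateway, you lose the up metric that Prometheus generates on every scrape, and with it the automatic health monitoring of each instance.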
WHEN TO USE THE PUSHGATEWAY
The Pushgateway is an intermediary service which allows you to push metrics from jobs which cannot be scraped. For details, see Pushing metrics.
Should I be using the Pushgateway?
We only recommend using the Pushgateway in certain limited cases. There are several pitfalls when blindly using the Pushgateway instead of Prometheus's usual pull model for general metrics collection:
- When monitoring multiple instances through a single Pushgateway, the Pushgateway becomes both a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the up metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
The last point is especially relevant when multiple instances of a job differentiate their metrics in the Pushgateway via an instance label or similar. Metrics for an instance will then remain in the Pushgateway even if the originating instance is renamed or removed. This is because the lifecycle of the Pushgateway as a metrics cache is fundamentally separate from the lifecycle of the processes that push metrics to it. Contrast this with Prometheus's usual pull-style monitoring: when an instance disappears (intentionally or not), its metrics automatically disappear along with it. When using the Pushgateway, this is not the case, and you have to delete any stale metrics manually or automate this lifecycle synchronization yourself.
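One possible starting point for automating that synchronization is the push_time_seconds metric, which the Pushgateway adds to every group with the Unix timestamp of the last successful push. The sketch below reads the Pushgateway's own /metrics output on stdin and prints the job label of each group that has not been pushed to within a given number of seconds. The function name `stale_jobs` and its arguments are hypothetical, and what counts as stale depends entirely on how often your jobs run.

```shell
# Print the job label of every group whose last successful push is older
# than max_age seconds.
#   $1  max_age in seconds
#   $2  current Unix time (optional, defaults to the system clock)
# Input: Pushgateway /metrics text on stdin.
stale_jobs() {
  max_age=$1
  now=${2:-$(date +%s)}
  awk -v max_age="$max_age" -v now="$now" '
    # Only look at the per-group push_time_seconds samples.
    index($0, "push_time_seconds{") == 1 && match($0, /[{,]job="[^"]*"/) {
      # Strip the leading separator and job=" plus the closing quote.
      job = substr($0, RSTART + 6, RLENGTH - 7)
      # The sample value is the last field on the line.
      if (now - $NF > max_age) print job
    }'
}

# Example (not run here): list groups with no push in the last 24 hours.
#   curl -s http://pushgateway.example.org:9091/metrics | stale_jobs 86400
```

Note that this only prints the job label. Groups that carry additional grouping labels, such as instance, would need those extracted as well before the matching DELETE request could be built.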
Usually, the only valid use case for the Pushgateway is for capturing the outcome of a service-level batch job. A "service-level" batch job is one which is not semantically related to a specific machine or job instance (for example, a batch job that deletes a number of users for an entire service). Such a job's metrics should not include a machine or instance label, so that the lifecycle of specific machines or instances is decoupled from the pushed metrics. This decreases the burden of managing stale metrics in the Pushgateway. See also the best practices for monitoring batch jobs.
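To make this concrete, here is a rough sketch of how such a user-deletion job might push its outcome. The grouping key contains only the job name, and no sample carries a machine or instance label. The job name `user_cleanup`, both metric names, and the `PUSHGATEWAY_URL` variable are invented for the example.

```shell
PUSHGATEWAY_URL=${PUSHGATEWAY_URL:-http://pushgateway.example.org:9091}

# Render the job outcome in the Prometheus text exposition format.
#   $1  number of users deleted in this run
#   $2  Unix timestamp at which the run finished
render_outcome() {
  cat <<EOF
# TYPE user_cleanup_deleted_users gauge
user_cleanup_deleted_users $1
# TYPE user_cleanup_last_success_timestamp_seconds gauge
user_cleanup_last_success_timestamp_seconds $2
EOF
}

# Push the outcome under a grouping key that contains only the job name.
push_outcome() {
  render_outcome "$1" "$2" |
    curl -fsS --data-binary @- "$PUSHGATEWAY_URL/metrics/job/user_cleanup"
}

# Example (not run here), called at the very end of a successful run:
#   push_outcome 42 "$(date +%s)"
```

Because every run pushes to the same group, each new push overwrites the previous values, so there is nothing stale to clean up when the machine that ran the job goes away.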
If an inbound firewall or NAT is preventing you from pulling metrics from targets, consider moving the Prometheus server behind the network barrier as well. We generally recommend running Prometheus servers on the same network as the monitored instances.
For batch jobs that are related to a machine (such as automatic security update cronjobs or configuration management client runs), expose the resulting metrics using the Node Exporter's textfile collector instead of the Pushgateway.
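As a minimal sketch of that approach, a cronjob could record its completion time in a .prom file that the Node Exporter picks up on its next scrape. The directory path, the function name, and the metric naming below are assumptions. The directory has to match whatever the Node Exporter's --collector.textfile.directory flag is set to on your machines.

```shell
# Directory the Node Exporter reads *.prom files from (assumed path).
TEXTFILE_DIR=${TEXTFILE_DIR:-/var/lib/node_exporter/textfile_collector}

# Record the completion time of a machine-level batch job.
#   $1  job name, used for both the file name and the metric name
#   $2  Unix timestamp (optional, defaults to the current time)
# The file is written under a temporary name and then renamed, so the
# Node Exporter never reads a half-written file.
write_completion_time() {
  name=$1
  finished=${2:-$(date +%s)}
  tmp="$TEXTFILE_DIR/$name.prom.$$"
  printf '%s_completion_time %s\n' "$name" "$finished" > "$tmp" &&
    mv "$tmp" "$TEXTFILE_DIR/$name.prom"
}

# Example (not run here), as the last step of a security update cronjob:
#   write_completion_time security_updates
```

The metric is then scraped together with the rest of that machine's Node Exporter metrics, so it gets the correct instance label and disappears along with the machine.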