Let's talk about a particularly persistent myth. Whenever there is a discussion about monitoring systems and Prometheus's pull-based metrics collection approach comes up, someone inevitably chimes in about how a pull-based approach just “fundamentally doesn't scale”. The reasons given are often vague or only apply to systems that are fundamentally different from Prometheus. In fact, this claim runs counter to our own operational experience from working with pull-based monitoring at the largest scales.
We already have an FAQ entry about why Prometheus chooses pull over push, but it does not focus specifically on scaling aspects. Let's have a closer look at the usual misconceptions around this claim and analyze whether and how they would apply to Prometheus.
When people think of a monitoring system that actively pulls, they often think of Nagios. Nagios has a reputation for not scaling well, in part due to spawning subprocesses for active checks that can run arbitrary actions on the Nagios host in order to determine the health of a given host or service. This sort of check architecture indeed does not scale well, as the central Nagios host quickly gets overwhelmed. As a result, people usually configure checks to only be executed every couple of minutes, or they run into more serious problems.
However, Prometheus takes a fundamentally different approach. Instead of executing check scripts, it only collects time series data from a set of instrumented targets over the network. For each target, the Prometheus server simply fetches the current state of all of that target's metrics over HTTP (in a highly parallel way, using goroutines) and has no other pull-related execution overhead.
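To make that concrete, here is a minimal sketch in Go of what a pull boils down to: one concurrent HTTP GET per target. This is not Prometheus's actual scrape code (which also handles parsing, relabeling, staleness, and so on), and the target URLs below are placeholders.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

// scrapeAll fetches the /metrics endpoint of every target concurrently.
// The per-target work is essentially a single HTTP GET per scrape.
func scrapeAll(targets []string) {
	client := &http.Client{Timeout: 10 * time.Second}
	var wg sync.WaitGroup
	for _, target := range targets {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			resp, err := client.Get(url)
			if err != nil {
				fmt.Printf("scrape of %s failed: %v\n", url, err)
				return
			}
			defer resp.Body.Close()
			body, _ := io.ReadAll(resp.Body)
			fmt.Printf("scraped %d bytes from %s\n", len(body), url)
		}(target)
	}
	wg.Wait()
}

func main() {
	// Hypothetical targets; in Prometheus these come from the
	// configuration file or from service discovery.
	scrapeAll([]string{
		"http://localhost:9100/metrics",
		"http://localhost:8080/metrics",
	})
}
```

This brings us to the next point: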
For scaling purposes, it doesn't matter who initiates the TCP connection over which metrics are then transferred. Either way you do it, the effort for establishing a connection is small compared to the metrics payload and other required work.
But a push-based approach could use UDP and avoid connection establishment altogether, you say! True, but the TCP/HTTP overhead in Prometheus is still negligible compared to the other work the Prometheus server has to do to ingest data (especially persisting time series data on disk). To put some numbers behind this: a single big Prometheus server can easily store millions of time series, with a record of 800,000 incoming samples per second (as measured with real production metrics data at SoundCloud). Given a 10-second scrape interval and 700 time series per host, this allows you to monitor over 10,000 machines from a single Prometheus server. The scaling bottleneck here has never been pulling metrics, but usually the speed at which the Prometheus server can ingest the data into memory and then sustainably persist and expire data on disk/SSD.
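For the curious, the back-of-the-envelope arithmetic behind that host count looks like this; the figures are the ones quoted above, not a new benchmark:

```go
package main

import "fmt"

// How many hosts can a single server handle at 800,000 samples/s,
// if each host exposes 700 series scraped every 10 seconds?
func main() {
	const (
		ingestRate     = 800000.0 // samples per second (figure quoted above)
		seriesPerHost  = 700.0    // time series exposed per host
		scrapeInterval = 10.0     // seconds between scrapes
	)
	samplesPerHostPerSec := seriesPerHost / scrapeInterval // 70 samples/s per host
	fmt.Printf("~%.0f hosts per server\n", ingestRate/samplesPerHostPerSec) // ~11400
}
```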
Also, although networks are pretty reliable these days, using a TCP-based pull approach makes sure that metrics data arrives reliably, or that the monitoring system at least knows immediately when the metrics transfer fails due to a broken network.
Some monitoring systems are event-based. That is, they report each individual event (an HTTP request, an exception, you name it) to a central monitoring system immediately as it happens. This central system then either aggregates the events into metrics (StatsD is the prime example of this) or stores events individually for later processing (the ELK stack is an example of that). In such a system, pulling would be problematic indeed: the instrumented service would have to buffer events between pulls, and the pulls would have to happen incredibly frequently in order to simulate the same “liveness” of the push-based approach and not overwhelm event buffers.
However, again, Prometheus is not an event-based monitoring system. You do not send raw events to Prometheus, nor can it store them. Prometheus is in the business of collecting aggregated time series data. That means that it's only interested in regularly collecting the current state of a given set of metrics, not the underlying events that led to the generation of those metrics. For example, an instrumented service would not send a message about each HTTP request to Prometheus as it is handled, but would simply count up those requests in memory. This can happen hundreds of thousands of times per second without causing any monitoring traffic. Prometheus then simply asks the service instance every 15 or 30 seconds (or whatever you configure) about the current counter value and stores that value together with the scrape timestamp as a sample. Other metric types, such as gauges, histograms, and summaries, are handled similarly. The resulting monitoring traffic is low, and the pull-based approach also does not create problems in this case.
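To illustrate, here is roughly what such instrumentation looks like with the Go client library; the service, metric name, and port below are made-up examples:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// httpRequestsTotal is bumped in memory on every request; no monitoring
// traffic is generated until Prometheus scrapes the /metrics endpoint.
var httpRequestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "Total number of HTTP requests handled.",
})

func init() {
	prometheus.MustRegister(httpRequestsTotal)
}

func handler(w http.ResponseWriter, r *http.Request) {
	httpRequestsTotal.Inc() // a cheap in-memory increment
	w.Write([]byte("hello\n"))
}

func main() {
	http.HandleFunc("/", handler)
	// Prometheus fetches the current counter value from here at each scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

The handler only increments an in-memory counter, no matter how many requests it serves; the only monitoring traffic is the periodic scrape of /metrics.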
With a pull-based approach, your monitoring system needs to know which service instances exist and how to connect to them. Some people are worried about the extra configuration this requires on the part of the monitoring system and see this as an operational scalability problem.
We would argue that you cannot escape this configuration effort for serious monitoring setups in any case: if your monitoring system doesn't know what the world should look like and which monitored service instances should be there, how would it be able to tell when an instance just never reports in, is down due to an outage, or really is no longer meant to exist? This is only acceptable if you never care about the health of individual instances at all, like when you only run ephemeral workers where it is sufficient for a large-enough number of them to report in some result. Most environments are not exclusively like that.
If the monitoring system needs to know the desired state of the world anyway, then a push-based approach actually requires more configuration in total. Not only does your monitoring system need to know what service instances should exist, but your service instances now also need to know how to reach your monitoring system. A pull approach not only requires less configuration, it also makes your monitoring setup more flexible. With pull, you can just run a copy of production monitoring on your laptop to experiment with it. It also allows you to fetch metrics with some other tool or to inspect metrics endpoints manually. To get high availability, pull allows you to just run two identically configured Prometheus servers in parallel. And lastly, if you have to move the endpoint under which your monitoring is reachable, a pull approach does not require you to reconfigure all of your metrics sources.
On a practical front, Prometheus makes it easy to configure the desired state of the world with its built-in support for a wide variety of service discovery mechanisms for cloud providers and container-scheduling systems: Consul, Marathon, Kubernetes, EC2, DNS-based SD, Azure, Zookeeper Serversets, and more. Prometheus also allows you to plug in your own custom mechanism if needed. In a microservice world or any multi-tiered architecture, it is also fundamentally an advantage if your monitoring system uses the same method to discover targets to monitor as your service instances use to discover their backends. This way you can be sure that you are monitoring the same targets that are serving production traffic and you have only one discovery mechanism to maintain.
Whether you pull or push, any time-series database will fall over if you send it more samples than it can handle. However, in our experience it's slightly more likely for a push-based approach to accidentally bring down your monitoring. If control over which metrics get ingested from which instances is not centralized in your monitoring system, you run the danger of experimental or rogue jobs suddenly pushing lots of garbage data into your production monitoring and bringing it down. There are still plenty of ways in which this can happen with a pull-based approach (which only controls where to pull metrics from, but not the size and nature of the metrics payloads), but the risk is lower. More importantly, such incidents can be mitigated at a central point.
Besides the fact that Prometheus is already being used to monitor very large setups in the real world (such as monitoring millions of machines at DigitalOcean), there are other prominent examples of pull-based monitoring being used successfully in the largest possible environments. Prometheus was inspired by Google's Borgmon, which was (and partially still is) used within Google to monitor all of its critical production services using a pull-based approach. Any scaling issues we encountered with Borgmon at Google were not due to its pull approach either. If a pull-based approach scales to a global environment with many tens of datacenters and millions of machines, you can hardly say that pull doesn't scale.
There are indeed setups that are hard to monitor with a pull-based approach. A prominent example is when you have many endpoints scattered around the world that are not directly reachable due to firewalls or complicated networking setups, and where it's infeasible to run a Prometheus server directly in each of the network segments. This is not quite the environment for which Prometheus was built, although workarounds are often possible (via the Pushgateway or by restructuring your setup). In any case, these remaining concerns about pull-based monitoring are usually not related to scaling, but to the operational difficulty of opening TCP connections across such network boundaries.
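As an aside, pushing to a Pushgateway (which Prometheus then scrapes like any other target) looks roughly like this with the Go client library; the Pushgateway URL, metric name, and job name here are placeholders:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// A gauge recording when this job last completed successfully.
	lastSuccess := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "batch_job_last_success_timestamp_seconds",
		Help: "Unix timestamp of the last successful batch job run.",
	})
	lastSuccess.SetToCurrentTime()

	// Push the metric to a Pushgateway, which Prometheus then scrapes
	// as usual. The URL points at wherever your Pushgateway runs.
	if err := push.New("http://pushgateway.example.org:9091", "batch_job").
		Collector(lastSuccess).
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```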
This article addresses the most common scalability concerns around a pull-based monitoring approach. With Prometheus and other pull-based systems being used successfully in very large environments and the pull aspect not posing a bottleneck in reality, the result should be clear: the “pull doesn't scale” argument is not a real concern. We hope that future debates will focus on aspects that matter more than this red herring.