TIL: Monitoring Docker Memory Limits with Prometheus

One of our CI jobs failed for no apparent reason: as soon as a Celery task opened a new HTTP connection, its container stopped working. Later I noticed the following message from docker-compose:

core_2127_14293_celery_1 exited with code 137

Exit code 137 means the process was killed with SIGKILL (128 + 9), which in a container almost always means it ran out of memory (OOM). In this case the celery container had a limit of 2 GiB, and Docker killed it when it went over that. Here is how we define the memory limit in our docker-compose.yml:
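To confirm that the kill really came from the OOM killer and not from something else sending SIGKILL, Docker records the verdict in the container state; inspecting the exited container from the log line above should print true and 137:

# prints "true 137" when the container was OOM-killed
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' core_2127_14293_celery_1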

version: "3.7"

services:
  celery:
    deploy:
      resources:
        limits:
          memory: "2G"

We want to prevent this from happening again, so we need to know when a container gets close to its limit. Dividing used by limit gives a handy ratio: 1.8 GiB used of a 2 GiB limit is 0.9, i.e. 90%.

Luckily, cAdvisor exports per-container metrics to Prometheus. We add a new Grafana panel with the following query (note that $instance is a dashboard variable):

sum by (name) (container_memory_usage_bytes{image!="",instance=~"$instance"})
/
sum by (name) (container_spec_memory_limit_bytes{image!="",instance=~"$instance"} != 0)
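One subtlety: containers started without a memory limit export container_spec_memory_limit_bytes as 0, which would turn the division into +Inf, so the != 0 filter drops them; summing usage and limit separately before dividing also keeps the ratio meaningful if a name ever maps to more than one series. To list the containers currently running without a cap, a quick check (a sketch against the same cAdvisor metrics) is:

container_spec_memory_limit_bytes{image!="",instance=~"$instance"} == 0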

All that is left is to set up the alert. From the panel (an as-code equivalent is sketched after this list):

  • tab “Alert”: Create alert rule from this panel

  • review query A that was copied from the dashboard

  • query B reduces to Last of A: keep it that way

  • query C: set threshold to 0.9 (i.e. 90%)

  • set the Evaluation group, which controls how often the rule is evaluated, and save
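For reference, the same check can live next to the rest of your monitoring config as a plain Prometheus alerting rule instead of a Grafana panel alert. A sketch, in which the group name, the for: duration, and the annotation text are assumptions, and the $instance dashboard variable is dropped because it only exists inside Grafana:

groups:
  - name: container-memory            # assumed group name
    rules:
      - alert: ContainerMemoryNearLimit
        expr: |
          sum by (name) (container_memory_usage_bytes{image!=""})
            /
          sum by (name) (container_spec_memory_limit_bytes{image!=""} != 0)
          > 0.9
        for: 5m                       # assumed: fire only after 5 minutes above 90%
        annotations:
          summary: "Container {{ $labels.name }} is above 90% of its memory limit"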

From now on we’ll be alerted if any container starts consuming more than 90% of its memory limit. 🤓