# TIL: Monitoring Docker Memory with Limits on Prometheus
One of our CI jobs failed seemingly without reason: whenever a Celery job established a new HTTP connection, its container stopped working. Later on I noticed the following message from docker-compose:
```
core_2127_14293_celery_1 exited with code 137
```
Exit code 137 means the process was killed with SIGKILL (128 + 9), which in Docker usually indicates it ran out of memory (OOM). In this case, the Celery container had a limit of 2 GiB and Docker killed it when it went over that.
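To confirm it really was an OOM kill (and not some other `SIGKILL`), Docker records an `OOMKilled` flag in the container state. A quick check with `docker inspect`, using the container name from the log above (the output shown is what we would expect, not a captured transcript):

```
$ docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' core_2127_14293_celery_1
true 137
```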
Here is how we define the memory limit in our `docker-compose.yml`:
version: "3.7"
services:
celery:
deploy:
resources:
limits:
memory: "2G"
We must prevent this from happening again, so we need to know if we are close to hitting the limit.
Calculating `used / limit` gives us a nice ratio to display as a percentage (e.g. 1.8 GiB used out of a 2 GiB limit is 0.9, i.e. 90%).
Luckily, cAdvisor exposes per-container metrics that Prometheus can scrape.
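If you haven't set cAdvisor up yet, here is a minimal sketch of what that could look like in the same Compose file, with the standard read-only mounts from the cAdvisor README (the service name and port mapping are our choice):

```yaml
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8080:8080"
```

And the matching scrape job in `prometheus.yml` (the job name is likewise made up):

```yaml
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
```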
We add a new Grafana panel with the following query (note that `$instance` is a dashboard variable):
```
sum by (name) (
  container_memory_usage_bytes{image!="",instance=~"$instance"}
    /
  container_spec_memory_limit_bytes{image!="",instance=~"$instance"}
)
```
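One caveat: containers without an explicit limit report either 0 or the enormous cgroup default in `container_spec_memory_limit_bytes` (depending on the cAdvisor version), so the division can yield `+Inf` or a meaninglessly tiny ratio. A variant that filters out the zero-limit case might look like this:

```
sum by (name) (
  container_memory_usage_bytes{image!="",instance=~"$instance"}
    /
  (container_spec_memory_limit_bytes{image!="",instance=~"$instance"} > 0)
)
```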
All that is left to do is to set up the alert (a Prometheus-native alternative is sketched after these steps). From the panel:

- in the “Alert” tab, click “Create alert rule from this panel”
- review query A, which was copied from the dashboard
- query B reduces to “Last” of A; keep it that way
- in query C, set the threshold to 0.9 (i.e. 90%)
- set an evaluation group and save
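For completeness, here is roughly the same alert expressed as a Prometheus alerting rule instead of a Grafana-managed one; the group name, alert name, and `for` duration are made up for this sketch:

```yaml
groups:
  - name: container-memory
    rules:
      - alert: ContainerMemoryNearLimit
        expr: |
          sum by (name) (
            container_memory_usage_bytes{image!=""}
              /
            container_spec_memory_limit_bytes{image!=""}
          ) > 0.9
        for: 5m
        annotations:
          summary: '{{ $labels.name }} is above 90% of its memory limit'
```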
From now on we’ll be alerted if any container starts consuming more than 90% of its memory limit. 🤓