I could not find much information on the interwebs how many containers you can run per host. So here are mine and the issues we ran into along the way.

The Beginning

In the beginning there were virtual machines running with 8 vCPUs and 60GB of RAM. They started to serve around 30 containers per VM. Later on we managed to squeeze around 50 containers per VM.

Initial orchestration was done with swarm, later on we moved to nomad. Access was initially fronted by nginx with consul-template generating the config. When it did not scale anymore nginx was replaced by Traefik. Service discovery is managed by consul. Log shipping was initially handled by logspout in a container, later on we switched to filebeat. Log transformation is handled by logstash. All of this is running on Debian GNU/Linux with docker-ce.

At some point it did not make sense anymore to use VMs. We've no state inside the containerized applications anyway. So we decided to move to dedicated hardware for our production setup. We settled with HPe DL360G10 with 24 physical cores and 128GB of RAM.

THP and Defragmentation

When we moved to the dedicated bare metal hosts we were running Debian/stretch + Linux from stretch-backports. At that time Linux 4.17. These machines were sized to run 95+ containers. Once we were above 55 containers we started to see occasional hiccups. First occurences lasted only for 20s, then 2min, and suddenly some lasted for around 20min. Our system metrics, as collected by prometheus-node-exporter, could only provide vague hints. The metric export did work, so processes were executed. But the CPU usage and subsequently the network throughput went down to close to zero.

I've seen similar hiccups in the past with Postgresql running on a host with THP (Transparent Huge Pages) enabled. So a good bet was to look into that area. By default /sys/kernel/mm/transparent_hugepage/enabled is set to always, so THP are enabled. We stick to that, but changed the defrag mode /sys/kernel/mm/transparent_hugepage/defrag (since Linux 4.12) from the default madavise to defer+madvise.

This moves page reclaims and compaction for pages which were not allocated with madvise to the background, which was enough to get rid of those hiccups. See also the upstream documentation. Since there is no sysctl like facility to adjust sysfs values, we're using the sysfsutils package to adjust this setting after every reboot.

Conntrack Table

Since the default docker networking setup involves a shitload of NAT, it shouldn't be surprising that nf_conntrack will start to drop packets at some point. We're currently fine with setting the sysctl tunable

net.netfilter.nf_conntrack_max = 524288

but that's very much up to your network setup and traffic characteristics.

Inotify Watches and Cadvisor

Along the way cadvisor refused to start at one point. Turned out that the default settings (again sysctl tunables) for

fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192

are too low. We increased to

fs.inotify.max_user_instances = 4096
fs.inotify.max_user_watches = 32768

Ephemeral Ports

We didn't ran into an issue with running out of ephemeral ports directly, but dockerd has a constant issue of keeping track of ports in use and we already see collisions to appear regularly. Very unscientifically we set the sysctl

net.ipv4.ip_local_port_range = 11000 60999

NOFILE limits and Nomad

Initially we restricted nomad (via systemd) with

LimitNOFILE=65536

which apparently is not enough for our setup once we were crossing the 100 container per host limit. Though the error message we saw was hard to understand:

[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=93c6b94b-e122-30ba-7250-1050e0107f4d task=mycontainer error="prestart hook "logmon" failed: Unrecognized remote plugin message:

This was solved by following the official recommendation and setting

LimitNOFILE=infinity
LimitNPROC=infinity
TasksMax=infinity

The main lead here was looking into the "hashicorp/go-plugin" library source, and understanding that they try to read the stdout of some other process, which sounded roughly like someone would have to open at some point a file.

Running out of PIDs

Once we were close to 200 containers per host (test environment with 256GB RAM per host), we started to experience failures of all kinds because processes could no longer be forked. Since that was also true for completely fresh user sessions, it was clear that we're hitting some global limitation and nothing bound to session via a pam module.

It's important to understand that most of our workloads are written in Java, and a lot of the other software we use is written in go. So we've a lot of Threads, which in Linux are presented as "Lightweight Process" (LWP). So every LWP still exists with a distinct PID out of the global PID space.

With /proc/sys/kernel/pid_max defaulting to 32768 we actually ran out of PIDs. We increased that limit vastly, probably way beyond what we currently need, to 500000. Actuall limit on 64bit systems is 222 according to man 5 proc.