
I guess that is just one of those things you have to know, so maybe it helps someone else.

We saw some warnings in our playbook rollouts like

[WARNING]: sftp transfer mechanism failed on [192.168.23.42]. Use
ANSIBLE_DEBUG=1 to see detailed information

They were actually reported for both sftp and scp usage. If you look at the debug output it's not very helpful for the average user, and the same goes for verbose mode with -vvv. The latter at least shows the parameters passed to sftp and scp, but you still see no error message. But if you set

scp_if_ssh: True

or

scp_if_ssh: False

you will suddenly see the real error message

fatal: [docker-023]: FAILED! => {"msg": "failed to transfer file to /home/sven/testme.txt /home/sven/
.ansible/tmp/ansible-tmp-1568643306.1439135-27483534812631/source:\n\nunknown option -- A\r\nusage: scp [-346BCpqrv]
[-c cipher] [-F ssh_config] [-i identity_file]\n           [-l limit] [-o ssh_option] [-P port] [-S program] source
... target\n"}

Lesson learned: as long as ansible is running in "smart" mode it will hide all error messages from the user. Now we could figure out that the culprit is the -A for agent forwarding, which is for obvious reasons not available in sftp and scp. One can move it to ansible_ssh_extra_args in group_vars instead. The best documentation regarding this, besides the --help output, seems to be the commit message of 3ad9b4cba62707777c3a144677e12ccd913c79a8.
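For reference, this is roughly what we ended up with; the file names and group_vars layout are just examples, adjust them to your inventory structure.

# ansible.cfg - force scp so the real error messages surface instead of the "smart" fallback
[ssh_connection]
scp_if_ssh = True

# group_vars/all.yml - keep agent forwarding out of the file transfer invocation
ansible_ssh_extra_args: "-A"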

Posted Mon Sep 16 18:16:25 2019

Update 2019-09-06: we experienced the same naming issue on Debian/stretch hosts using the latest stretch-backports kernel, which is based on the same buster kernel release. Still no idea what's going on. Created https://bugs.debian.org/939328 for further tracking.

For yet unknown reasons some recently installed HPE DL360G10 servers running buster reverted the interface names from the expected "onboard" based names eno5 and eno6 back to ethX after a reboot.

My current workaround is a link file which kind of enforces the onboard scheme.

$ cat /etc/systemd/network/101-onboard-rd.link 
[Link]
NamePolicy=onboard kernel database slot path
MACAddressPolicy=none

The hosts are running the latest buster kernel

Linux foobar 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64 GNU/Linux

A downgrade of the kernel did not change anything. So I currently like to believe this is not related to a kernel change.

I tried to collect a bit of information on one of the broken systems while it was in the broken state:

root@foobar:~# SYSTEMD_LOG_LEVEL=debug udevadm test-builtin net_setup_link /sys/class/net/eth0
=== trie on-disk ===
tool version:          241
file size:         9492053 bytes
header size             80 bytes
strings            2069269 bytes
nodes              7422704 bytes
Load module index
Found container virtualization none.
timestamp of '/etc/systemd/network' changed
Skipping overridden file '/usr/lib/systemd/network/99-default.link'.
Parsed configuration file /etc/systemd/network/99-default.link
Created link configuration context.
ID_NET_DRIVER=i40e
eth0: No matching link configuration found.
Builtin command 'net_setup_link' fails: No such file or directory
Unload module index
Unloaded link configuration context.

root@foobar:~# udevadm test-builtin net_id /sys/class/net/eth0 2>/dev/null
ID_NET_NAMING_SCHEME=v240
ID_NET_NAME_MAC=enx48df37944ab0
ID_OUI_FROM_DATABASE=Hewlett Packard Enterprise
ID_NET_NAME_ONBOARD=eno5
ID_NET_NAME_PATH=enp93s0f0

The most interesting hint right now seems to be that /sys/class/net/eth0/name_assign_type is invalid, while on systems before the breaking reboot, and after applying the .link file fix, it contains a 4.
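For the record, a quick way to check that attribute; the values should map to the NET_NAME_* constants in include/uapi/linux/netdevice.h, 4 being NET_NAME_RENAMED. On the broken systems the same read just errors out.

root@foobar:~# cat /sys/class/net/eno5/name_assign_type
4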

Since those hosts were initially installed with buster there are no remains of any ethX related configuration present. If someone has an idea what is going on write a mail (sven at stormbind dot net), or blog about it on planet.d.o.

I found a vaguely similar bug report for a Dell PE server in #929622, though that was a change from 4.9 (stretch) to the 4.19 stretch-bpo kernel, the device names were not changed back to the ethX scheme, and Ben found a reason for it inside the kernel. The hardware is also different, using bnxt_en, while I've got tg3 and i40e in use.

Posted Tue Aug 13 15:24:17 2019

I could not find much information on the interwebs about how many containers you can run per host. So here are our numbers and the issues we ran into along the way.

The Beginning

In the beginning there were virtual machines running with 8 vCPUs and 60GB of RAM. They started to serve around 30 containers per VM. Later on we managed to squeeze around 50 containers per VM.

Initial orchestration was done with swarm, later on we moved to nomad. Access was initially fronted by nginx with consul-template generating the config. When it did not scale anymore nginx was replaced by Traefik. Service discovery is managed by consul. Log shipping was initially handled by logspout in a container, later on we switched to filebeat. Log transformation is handled by logstash. All of this is running on Debian GNU/Linux with docker-ce.

At some point it did not make sense anymore to use VMs. We have no state inside the containerized applications anyway. So we decided to move to dedicated hardware for our production setup. We settled on HPE DL360G10 servers with 24 physical cores and 128GB of RAM.

THP and Defragmentation

When we moved to the dedicated bare metal hosts we were running Debian/stretch + Linux from stretch-backports, at that time Linux 4.17. These machines were sized to run 95+ containers. Once we were above 55 containers we started to see occasional hiccups. The first occurrences lasted only for 20s, then 2min, and suddenly some lasted for around 20min. Our system metrics, as collected by prometheus-node-exporter, could only provide vague hints. The metric export did work, so processes were executed. But the CPU usage and subsequently the network throughput went down to close to zero.

I've seen similar hiccups in the past with PostgreSQL running on a host with THP (Transparent Huge Pages) enabled. So a good bet was to look into that area. By default /sys/kernel/mm/transparent_hugepage/enabled is set to always, so THP are enabled. We stuck with that, but changed the defrag mode /sys/kernel/mm/transparent_hugepage/defrag (available since Linux 4.12) from the default madvise to defer+madvise.

This moves page reclaim and compaction for pages which were not allocated with madvise to the background, which was enough to get rid of those hiccups. See also the upstream documentation. Since there is no sysctl-like facility to adjust sysfs values, we're using the sysfsutils package to adjust this setting after every reboot.
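A minimal sysfsutils snippet for that could look like this; the file name below /etc/sysfs.d/ is just an example:

$ cat /etc/sysfs.d/thp-defrag.conf
kernel/mm/transparent_hugepage/defrag = defer+madvise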

Conntrack Table

Since the default docker networking setup involves a shitload of NAT, it shouldn't be surprising that nf_conntrack will start to drop packets at some point. We're currently fine with setting the sysctl tunable

net.netfilter.nf_conntrack_max = 524288

but that's very much up to your network setup and traffic characteristics.
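To get an idea how close you are to the limit, compare the live counter with the maximum; once you're over it the kernel logs "nf_conntrack: table full, dropping packet" to dmesg.

$ sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max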

Inotify Watches and Cadvisor

Along the way cadvisor refused to start at one point. It turned out that the default settings (again sysctl tunables) for

fs.inotify.max_user_instances = 128
fs.inotify.max_user_watches = 8192

are too low. We increased them to

fs.inotify.max_user_instances = 4096
fs.inotify.max_user_watches = 32768

Ephemeral Ports

We didn't run into an issue with running out of ephemeral ports directly, but dockerd has a constant issue with keeping track of ports in use, and we already see collisions appear regularly. Very unscientifically we set the sysctl

net.ipv4.ip_local_port_range = 11000 60999
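One way to keep those sysctl adjustments across reboots is a file below /etc/sysctl.d/; the file name is again just an example:

$ cat /etc/sysctl.d/90-docker-host.conf
net.netfilter.nf_conntrack_max = 524288
fs.inotify.max_user_instances = 4096
fs.inotify.max_user_watches = 32768
net.ipv4.ip_local_port_range = 11000 60999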

NOFILE limits and Nomad

Initially we restricted nomad (via systemd) with

LimitNOFILE=65536

which apparently is not enough for our setup once we crossed the 100 containers per host mark. The error message we saw was hard to understand though:

[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=93c6b94b-e122-30ba-7250-1050e0107f4d task=mycontainer error="prestart hook "logmon" failed: Unrecognized remote plugin message:

This was solved by following the official recommendation and setting

LimitNOFILE=infinity
LimitNPROC=infinity
TasksMax=infinity

The main lead here was looking into the "hashicorp/go-plugin" library source and understanding that it tries to read the stdout of some other process, which sounded roughly like something that would have to open a file at some point.
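If you do not want to patch the shipped unit file, a systemd drop-in is a clean way to apply those settings, assuming your unit is called nomad.service; the path is just an example:

$ cat /etc/systemd/system/nomad.service.d/limits.conf
[Service]
LimitNOFILE=infinity
LimitNPROC=infinity
TasksMax=infinity

$ systemctl daemon-reload && systemctl restart nomad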

Running out of PIDs

Once we were close to 200 containers per host (test environment with 256GB RAM per host), we started to experience failures of all kinds because processes could no longer be forked. Since that was also true for completely fresh user sessions, it was clear that we were hitting some global limitation and not something bound to a session via a PAM module.

It's important to understand that most of our workloads are written in Java, and a lot of the other software we use is written in Go. So we have a lot of threads, which in Linux are represented as "Lightweight Processes" (LWP). Every LWP still exists with a distinct PID out of the global PID space.
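To get a rough feeling for how many PIDs are actually in use, counting LWPs with ps and comparing that to the limit works well enough (the first number is off by one due to the ps header line):

$ ps -eLf | wc -l
$ cat /proc/sys/kernel/pid_max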

With /proc/sys/kernel/pid_max defaulting to 32768 we actually ran out of PIDs. We increased that limit vastly, probably way beyond what we currently need, to 500000. The actual limit on 64bit systems is 2^22 according to man 5 proc.

Posted Fri Aug 2 16:44:25 2019

While you can find a lot of information regarding the Java "Project Jigsaw", I could not really find a good example of "assembling" your own JVM. So I took a few minutes to figure that out. My use case here is that someone would like to use Instana (a non-free tracing solution) which requires the java.instrument and jdk.attach modules to be available. From an operations perspective we do not want to ship the whole JDK in our production Docker images, so we have to ship a modified JVM. Currently we base our images on the builds provided by AdoptOpenJDK.net, so my examples are based on those builds. You can just download and untar them to any directory to follow along.

You can check the available modules of your JVM by running:

$ jdk-11.0.3+7-jre/bin/java --list-modules | grep -E '(instrument|attach)'
java.instrument@11.0.3

As you can see only the java.instrument module is available. So let's assemble a custom JVM which includes all the modules provided by the default AdoptOpenJDK.net JRE builds and the missing jdk.attach module:

$ jdk-11.0.3+7/bin/jlink --module-path jdk-11.0.3+7 --add-modules $(jdk-11.0.3+7-jre/bin/java --list-modules|cut -d'@' -f 1|tr '\n' ',')jdk.attach --output myjvm

$ ./myjvm/bin/java --list-modules | grep -E '(instrument|attach)'
java.instrument@11.0.3
jdk.attach@11.0.3

Size wise the increase is, as expected, rather minimal:

$ du -hs myjvm jdk-11.0.3+7-jre jdk-11.0.3+7
141M    myjvm
121M    jdk-11.0.3+7-jre
310M    jdk-11.0.3+7
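If you care about shaving off a few more MB, jlink also has options to strip the image down further; I did not measure the effect on these exact builds, so treat this as a sketch:

$ jdk-11.0.3+7/bin/jlink --module-path jdk-11.0.3+7 --add-modules $(jdk-11.0.3+7-jre/bin/java --list-modules|cut -d'@' -f 1|tr '\n' ',')jdk.attach --strip-debug --no-header-files --no-man-pages --compress=2 --output myjvm-small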

For the fun of it you could also add the compiler so you can execute source files directly:

$ jdk-11.0.3+7/bin/jlink --module-path jdk-11.0.3+7 --add-modules $(jdk-11.0.3+7-jre/bin/java --list-modules|cut -d'@' -f 1|tr '\n' ',')jdk.compiler --output myjvm2
$ ./myjvm2/bin/java HelloWorld.java
Hello World!
Posted Wed Jul 10 16:31:27 2019

If you have a logstash filter that contains a json filter/decoding step like this:

filter {
    json {
      source => "log"
    }
}

and you end up with an error message like the following:

[2019-06-21T09:47:58,243][WARN ][logstash.filters.json    ] Error parsing json {:source=>"log", :raw=>{"file"=>{"path"=>"/var/lib/docker/containers/abdf3db21fca8e1dc17c888d4aa661fe16ae4371355215157cf7c4fc91b8ea4b/abdf3db21fca8e1dc17c888d4aa661fe16ae4371355215157cf7c4fc91b8ea4b-json.log"}}, :exception=>java.lang.ClassCastException}

It might be just telling you that the field log already contains valid, already-decoded json, and no further decoding is required.

Posted Fri Jun 21 10:08:59 2019

tl;dr If you want to run Kafka 2.x use 2.1.1rc1 or later.

So someone started to update from Kafka 1.1.1 to 2.1.0 yesterday and it kept crashing every other hour. It pretty much looks like https://issues.apache.org/jira/browse/KAFKA-7697, so we're now trying out 2.1.1rc1, because we missed the rc2 at http://home.apache.org/~cmccabe/kafka-2.1.1-rc2/. Ideally you go with rc2, which has a few more fixes for unrelated issues.

Besides that, be aware that the update to 2.1.0 is a one way street! Read http://kafka.apache.org/21/documentation.html#upgrade carefully. There is a schema change in the consumer offset topic, which is internally used to track your consumer offsets since those moved out of Zookeeper some time ago.

For us the primary lesson is that we have to put way more synthetic traffic on our testing environments, because 2.1.0 was running in the non-production environments for several days without an issue, while the relevant team hit the deadlock in production within hours.

Posted Tue Feb 12 16:15:04 2019

I guess in the past everyone used CGIs to achieve something similar; it just seemed like a nice detour to use the nginx Lua module instead. Don't expect to read something magic. I'm currently looking into different CDN providers and how they behave regarding the cache-control header, what additional headers they send by default, and what changes when you activate certain features. So I set up two locations inside the nginx configuration using a content_by_lua_block {} for testing purposes.
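In case you want to follow along on Debian, the Lua module is, if I remember correctly, packaged separately as libnginx-mod-http-lua:

$ sudo apt install libnginx-mod-http-lua

The two locations look like this: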

location /header {
  default_type 'text/plain';
  content_by_lua_block {
    local myheads = ngx.req.get_headers()
    for key in pairs(myheads) do
      local outp = "Header '" .. key .. "': " .. myheads[key]
      ngx.say(outp)
    end
  }
}

location /cc {
  default_type 'text/plain';
  content_by_lua_block {
    local cc = ngx.req.get_headers()["cc"]
    if cc ~= nil then
      ngx.header["cache-control"] = cc
      ngx.say(cc)
    else
      ngx.say("moep - no cc header found")
    end
  }
}

The first one is rather boring, it just returns the request headers my origin server received, like this:

$ curl -is https://nocigar.shx0.cf/header
HTTP/2 200 
date: Sun, 02 Dec 2018 13:20:14 GMT
content-type: text/plain
set-cookie: __cfduid=d503ed2d3148923514e3fe86b4e26f5bf1543756814; expires=Mon, 02-Dec-19 13:20:14 GMT; path=/; domain=.shx0.cf; HttpOnly; Secure
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 482e16f7ae1bc2f1-FRA

Header 'x-forwarded-for': 93.131.190.59
Header 'cf-ipcountry': DE
Header 'connection': Keep-Alive
Header 'accept': */*
Header 'accept-encoding': gzip
Header 'host': nocigar.shx0.cf
Header 'x-forwarded-proto': https
Header 'cf-visitor': {"scheme":"https"}
Header 'cf-ray': 482e16f7ae1bc2f1-FRA
Header 'cf-connecting-ip': 93.131.190.59
Header 'user-agent': curl/7.62.0

The second one is more interesting: it copies the content of the "cc" HTTP request header to the "cache-control" response header, to allow convenient evaluation of how different cache-control header settings are handled.

$ curl -H'cc: no-store,no-cache' -is https://nocigar.shx0.cf/cc/foobar42.jpg
HTTP/2 200 
date: Sun, 02 Dec 2018 13:27:46 GMT
content-type: image/jpeg
set-cookie: __cfduid=d971badd257b7c2be831a31d13ccec77f1543757265; expires=Mon, 02-Dec-19 13:27:45 GMT; path=/; domain=.shx0.cf; HttpOnly; Secure
cache-control: no-store,no-cache
cf-cache-status: MISS
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 482e22001f35c26f-FRA

no-store,no-cache

$ curl -H'cc: public' -is https://nocigar.shx0.cf/cc/foobar42.jpg
HTTP/2 200 
date: Sun, 02 Dec 2018 13:28:18 GMT
content-type: image/jpeg
set-cookie: __cfduid=d48a4b571af6374c759c430c91c3223d71543757298; expires=Mon, 02-Dec-19 13:28:18 GMT; path=/; domain=.shx0.cf; HttpOnly; Secure
cache-control: public, max-age=14400
cf-cache-status: MISS
expires: Sun, 02 Dec 2018 17:28:18 GMT
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 482e22c8886627aa-FRA

public

$ curl -H'cc: no-cache,no-store' -is https://nocigar.shx0.cf/cc/foobar42.jpg
HTTP/2 200 
date: Sun, 02 Dec 2018 13:30:33 GMT
content-type: image/jpeg
set-cookie: __cfduid=dbc4758b7bb98d556173a89aa2a8c2d3a1543757433; expires=Mon, 02-Dec-19 13:30:33 GMT; path=/; domain=.shx0.cf; HttpOnly; Secure
cache-control: public, max-age=14400
cf-cache-status: HIT
expires: Sun, 02 Dec 2018 17:30:33 GMT
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
cf-ray: 482e26185d36c29c-FRA

public

As you can see this endpoint is currently fronted by Cloudflare using a default configuration. If you burned one request path below "/cc/" and it's now cached for a long time, you can just use a different random one to continue your test, without any need to flush the CDN caches.

Posted Sun Dec 2 14:40:07 2018
# docker version|grep Version
Version:      18.03.1-ce
Version:      18.03.1-ce

# cat Dockerfile
FROM alpine
RUN addgroup service && adduser -S service -G service
COPY --chown=root:root debug.sh /opt/debug.sh
RUN chmod 544 /opt/debug.sh
USER service
ENTRYPOINT ["/opt/debug.sh"]

# cat debug.sh
#!/bin/sh
ls -l /opt/debug.sh
whoami

# docker build -t foobar:latest .; docker run foobar
Sending build context to Docker daemon   5.12kB
[...]
Successfully built 41c8b99a6371
Successfully tagged foobar:latest
-r-xr--r--    1 root     root            37 Nov 14 22:42 /opt/debug.sh
service

# docker version|grep Version
Version:           18.09.0
Version:          18.09.0

# docker run foobar
standard_init_linux.go:190: exec user process caused "permission denied"

That changed with 18.06 and just uncovered some issues. I was, well, let's say "surprised", that this ever worked at all. Other sets of permissions like 0700 or 644 already failed with a different error message on docker 18.03.1.
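The obvious fix is to grant execute permission to the user actually running the script, e.g. by adjusting the chmod line in the Dockerfile like this:

RUN chmod 0555 /opt/debug.sh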

Posted Wed Nov 14 23:53:28 2018

tl;dr: Yes, you can use Firefox 60 in Debian/stretch with your U2F device to authenticate your Google account, but you have to use Chrome for the registration.

Thanks to Mike, Moritz and probably others there's now Firefox 60 ESR in Debian/stretch. So I took it as a chance to finally activate my work YubiKey Nano as a U2F/2FA device for my work Google account. Turns out it's not so simple. Basically Google told me that this browser is not supported and I should install the trojan horse (Chrome) to use this feature. So I gave in, installed Chrome, logged in to my Google account and added the YubiKey as the default 2FA device. Then I quit Chrome, went back to Firefox and logged in again to my Google account. Bäm, it works! The YubiKey blinks, I can touch it and I'm logged in.

Just in case: you probably want to install "u2f-host" to have "libu2f-host0" available, which ships all the udev rules to detect common U2F devices correctly.
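On Debian/stretch that boils down to:

$ sudo apt install u2f-host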

Posted Mon Sep 10 14:13:06 2018

We're currently migrating from our on-premise HipChat instance to Google Chat (basically a nicer UI for Hangouts). Since our deployments are orchestrated by ansible playbooks and we'd like to write to changelog chat rooms whenever a deployment starts and finishes (either with a success or a failure message), I had to figure out how to write to those Google Chat rooms/conversations via the simple webhook API.
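If you just want to poke at such a webhook from the shell first, a plain text message can be sent with curl; the webhook URL with its key and token is the one you copy from the room configuration, the placeholders below are obviously made up:

$ curl -s -X POST -H 'Content-Type: application/json' -d '{"text": "deployment started"}' 'https://chat.googleapis.com/v1/spaces/XXXX/messages?key=YYYY&token=ZZZZ'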

First of all I learned a few more things about ansible.

  1. The "role_path" variable is no longer available, but "playbook_dir" works.
  2. The lookup() template module has a strange set of lookup paths it tries by default:

    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"

I'm still wondering why it tries everything except the templates directory within the calling role, "/home/sven/deploy/roles/googlechat-notify/templates/card.json.j2".

I ended up with the following handler:

- name: notify google chat changelog channel
  uri:
    url: "{{ googlechat_room }}"
    body: "{{ lookup('template', playbook_dir + '/roles/googlechat-notify/templates/card.json.j2') }}"
    body_format: "json"
    method: "POST"
  ignore_errors: yes
  register: googlechat_result
  when: not (disable_googlechat_notification | default(false))

- set_fact:
    googlechat_conversation: "{{googlechat_result.json.thread.name}}"
  ignore_errors: yes

Sending the following json template:

{
{% if googlechat_conversation is defined %}
  "thread": {
      "name": "{{ googlechat_conversation }}"
  },
{% endif %}
  "cards": [
    {
      "sections": [
        {
          "widgets": [
            {
              "textParagraph": {
              {% if googlechat_msg is defined and googlechat_status is defined and googlechat_status == "SUCCESS" %}
                "text": "<b><font color=\"#0BCA14\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "FAILED" %}
                "text": "<b><font color=\"#E41B2B\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "START" %}
                "text": "<b><font color=\"#3A3CC4\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% else %}
                "text": "<b><font color=\"#F66905\">UNKOWN status</font></b> - {{ googlechat_msg }}"
              {% endif %}
              }
            }
          ]
        }
      ]
    }
  ]
}

The message card documentation is here, in case you would like to look up the details. The advantage of cards compared to simple text messages is that you can colorize the output, so it's visually distinguishable whether you're dealing with a success or a failure.

The calling code in the playbook looks like this:

- hosts: deploy-hosts
  tasks:
    - name: deploy
      block:
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "START"
            googlechat_msg: "foo"
          run_once: true
        [... do the actual deployment ...]
        - meta: flush_handlers
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "SUCCESS"
            googlechat_msg: "bar"
          run_once: true
        - meta: flush_handlers
      rescue:
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "FAILED"
            googlechat_msg: "baz"
          run_once: true
Posted Fri Aug 31 10:43:21 2018