
tl;dr If you want to run Kafka 2.x use 2.1.1rc1 or later.

So someone started to update from Kafka 1.1.1 to 2.1.0 yesterday and it kept crashing every other hour. It pretty much looks like a known deadlock in 2.1.0, so we're now trying out 2.1.1rc1 because we missed that rc2 was already available. Ideally you go straight to rc2, which has a few more fixes for unrelated issues.

Besides that, be aware that the update to 2.1.0 is a one way street! Read the upgrade notes carefully. There is a schema change in the consumer offset topic, which is used internally to track your consumer offsets since those moved out of Zookeeper some time ago.
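The standard guard rail from the Kafka upgrade documentation applies here: pin the protocol and message format versions in server.properties before you swap binaries, and only raise them once you're sure you won't have to roll back. A sketch, with version values assuming a 1.1.x starting point:

```
# server.properties - keep the old versions while rolling out the 2.1 binaries
inter.broker.protocol.version=1.1
log.message.format.version=1.1

# once the cluster is stable, raise both values in a second
# rolling restart - this is the step you can not take back
```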

For us the primary lesson is that we have to put way more synthetic traffic on our testing environments: 2.1.0 was running in the non-production environments for several days without an issue, while the relevant team hit the deadlock in production within hours.

Posted Tue Feb 12 16:15:04 2019

I guess in the past everyone used CGIs to achieve something similar; it just seemed like a nice detour to use the nginx Lua module instead. Don't expect to read anything magic here. I'm currently looking into different CDN providers: how they behave regarding the cache-control header, which additional headers they send by default, and which ones appear when you activate certain features. So I set up two locations inside the nginx configuration using a content_by_lua_block {} for testing purposes.

location /header {
  default_type 'text/plain';
  content_by_lua_block {
   local myheads=ngx.req.get_headers()
   for key in pairs(myheads) do
    local outp="Header '" .. key .. "': " .. myheads[key]
    ngx.say(outp)
   end
  }
 }

location /cc {
  default_type 'text/plain';
  content_by_lua_block {
   local cc=ngx.req.get_headers()["cc"]
   if cc ~= nil then
    ngx.header["cache-control"]=cc
    ngx.say("cache-control set to: " .. cc)
   else
    ngx.say("moep - no cc header found")
   end
  }
 }

The first one is rather boring: it just returns the request headers my origin server received, like this

$ curl -is
HTTP/2 200 
date: Sun, 02 Dec 2018 13:20:14 GMT
content-type: text/plain
set-cookie: __cfduid=d503ed2d3148923514e3fe86b4e26f5bf1543756814; expires=Mon, 02-Dec-19 13:20:14 GMT; path=/;; HttpOnly; Secure
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri=""
server: cloudflare
cf-ray: 482e16f7ae1bc2f1-FRA

Header 'x-forwarded-for':
Header 'cf-ipcountry': DE
Header 'connection': Keep-Alive
Header 'accept': */*
Header 'accept-encoding': gzip
Header 'host':
Header 'x-forwarded-proto': https
Header 'cf-visitor': {"scheme":"https"}
Header 'cf-ray': 482e16f7ae1bc2f1-FRA
Header 'cf-connecting-ip':
Header 'user-agent': curl/7.62.0

The second one is more interesting: it copies the content of the "cc" HTTP request header into the "cache-control" response header, allowing convenient evaluation of how different cache-control header settings are handled.

$ curl -H'cc: no-store,no-cache' -is
HTTP/2 200 
date: Sun, 02 Dec 2018 13:27:46 GMT
content-type: image/jpeg
set-cookie: __cfduid=d971badd257b7c2be831a31d13ccec77f1543757265; expires=Mon, 02-Dec-19 13:27:45 GMT; path=/;; HttpOnly; Secure
cache-control: no-store,no-cache
cf-cache-status: MISS
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri=""
server: cloudflare
cf-ray: 482e22001f35c26f-FRA


$ curl -H'cc: public' -is
HTTP/2 200 
date: Sun, 02 Dec 2018 13:28:18 GMT
content-type: image/jpeg
set-cookie: __cfduid=d48a4b571af6374c759c430c91c3223d71543757298; expires=Mon, 02-Dec-19 13:28:18 GMT; path=/;; HttpOnly; Secure
cache-control: public, max-age=14400
cf-cache-status: MISS
expires: Sun, 02 Dec 2018 17:28:18 GMT
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri=""
server: cloudflare
cf-ray: 482e22c8886627aa-FRA


$ curl -H'cc: no-cache,no-store' -is
HTTP/2 200 
date: Sun, 02 Dec 2018 13:30:33 GMT
content-type: image/jpeg
set-cookie: __cfduid=dbc4758b7bb98d556173a89aa2a8c2d3a1543757433; expires=Mon, 02-Dec-19 13:30:33 GMT; path=/;; HttpOnly; Secure
cache-control: public, max-age=14400
cf-cache-status: HIT
expires: Sun, 02 Dec 2018 17:30:33 GMT
strict-transport-security: max-age=2592000
expect-ct: max-age=604800, report-uri=""
server: cloudflare
cf-ray: 482e26185d36c29c-FRA


As you can see, this endpoint is currently fronted by Cloudflare using a default configuration. If you burned one request path below "/cc/" and it's now cached for a long time, you can just use a random different one to continue your tests, without any need to flush the CDN caches.
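Translated into a small client-side sketch (the base URL is a placeholder for my origin, and `fetch` is my own illustrative helper), burning a fresh random path per test run looks like this:

```python
import uuid
import urllib.request

# Placeholder: the real origin behind the CDN is not part of this post.
BASE = "https://example.org"

def fetch(path, cc=None):
    """Request a path, optionally with a 'cc' header, and return the response headers."""
    req = urllib.request.Request(BASE + path)
    if cc:
        req.add_header("cc", cc)
    with urllib.request.urlopen(req) as resp:
        return {k.lower(): v for k, v in resp.getheaders()}

# Every test run gets its own never-seen-before path, so no CDN cache flush is needed.
path = "/cc/" + str(uuid.uuid4())
print(path)
# Fetching it twice with cc="public" should then show cf-cache-status
# MISS followed by HIT:
#   fetch(path, cc="public"); fetch(path, cc="public")
```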

Posted Sun Dec 2 14:40:07 2018
# docker version|grep Version
Version:      18.03.1-ce
Version:      18.03.1-ce

# cat Dockerfile
FROM alpine
RUN addgroup service && adduser -S service -G service
COPY --chown=root:root /opt/
RUN chmod 544 /opt/
USER service
ENTRYPOINT ["/opt/"]

# cat
ls -l /opt/

# docker build -t foobar:latest .; docker run foobar
Sending build context to Docker daemon   5.12kB
Successfully built 41c8b99a6371
Successfully tagged foobar:latest
-r-xr--r--    1 root     root            37 Nov 14 22:42 /opt/

# docker version|grep Version
Version:           18.09.0
Version:          18.09.0

# docker run foobar
standard_init_linux.go:190: exec user process caused "permission denied"

That changed with 18.06 and just uncovered some issues. I was, well, let's say "surprised" that this ever worked at all. Other permission sets like 0700 or 0644 already failed with a different error message on docker 18.03.1.
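The permission story becomes clearer when you decode the 0544 mode from the ls output above: only the owner (root) holds the execute bit, so once docker actually honors the effective UID of the "service" user, the exec has to fail. A quick check with Python's stat module:

```python
import stat

mode = 0o544  # -r-xr--r-- as shown in the ls -l output above

assert mode & stat.S_IXUSR      # root may execute ...
assert not mode & stat.S_IXGRP  # ... the group may not ...
assert not mode & stat.S_IXOTH  # ... and neither may anyone else

# The non-root 'service' user matches neither the owner nor the group bits,
# which lines up with the "permission denied" docker 18.09 reports.
print(stat.filemode(mode | stat.S_IFREG))  # -r-xr--r--
```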

Posted Wed Nov 14 23:53:28 2018

tl;dr: Yes, you can use Firefox 60 in Debian/stretch with your U2F device to authenticate your Google account, but you have to use Chrome for the registration.

Thanks to Mike, Moritz and probably others there's now Firefox 60 ESR in Debian/stretch. So I took it as a chance to finally activate my work YubiKey Nano as a U2F/2FA device for my work Google account. Turns out it's not so simple. Basically Google told me that this browser is not supported and I should install the trojan horse (Chrome) to use this feature. So I gave in, installed Chrome, logged in to my Google account and added the YubiKey as the default 2FA device. Then I quit Chrome, went back to Firefox and logged in again to my Google account. Bäm, it works! The YubiKey blinks, I can touch it and I'm logged in.

Just in case: you probably want to install "u2f-host" to have "libu2f-host0" available, which ships all the udev rules to detect common U2F devices correctly.

Posted Mon Sep 10 14:13:06 2018

We're currently migrating from our on-premise HipChat instance to Google Chat (basically a nicer UI for Hangouts). Since our deployments are orchestrated by ansible playbooks and we'd like to write to changelog chat rooms whenever a deployment starts and finishes (either with a success or a failure message), I had to figure out how to write to those Google Chat rooms/conversations via the simple webhook API.

First of all I learned a few more things about ansible.

  1. The "role_path" variable is no longer available, but "playbook_dir" works.
  2. The lookup() template plugin tries a strange list of paths by default:

    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"

I'm still wondering why it tries everything except the templates directory within the calling role, "/home/sven/deploy/roles/googlechat-notify/templates/card.json.j2".

I ended up with the following handler:

- name: notify google chat changelog channel
  uri:
    url: "{{ googlechat_room }}"
    body: "{{ lookup('template', playbook_dir + '/roles/googlechat-notify/templates/card.json.j2') }}"
    body_format: "json"
    method: "POST"
  ignore_errors: yes
  register: googlechat_result
  when: not (disable_googlechat_notification | default(false))

- set_fact:
    googlechat_conversation: "{{ googlechat_result.json.thread.name }}"
  ignore_errors: yes

Sending the following json template:

{
{% if googlechat_conversation is defined %}
  "thread": {
      "name": "{{ googlechat_conversation }}"
  },
{% endif %}
  "cards": [
    {
      "sections": [
        {
          "widgets": [
            {
              "textParagraph": {
              {% if googlechat_msg is defined and googlechat_status is defined and googlechat_status == "SUCCESS" %}
                "text": "<b><font color=\"#0BCA14\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "FAILED" %}
                "text": "<b><font color=\"#E41B2B\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "START" %}
                "text": "<b><font color=\"#3A3CC4\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% else %}
                "text": "<b><font color=\"#F66905\">UNKNOWN status</font></b> - {{ googlechat_msg }}"
              {% endif %}
              }
            }
          ]
        }
      ]
    }
  ]
}
The message card documentation is here, in case you'd like to look up the details. The advantage of cards compared to simple text messages is that you can colorize the output, so it's visually distinguishable whether you're dealing with a success or a failure.
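For completeness, the same card can be built and posted without ansible. This is only a sketch: WEBHOOK_URL is a placeholder, build_card and notify are my own illustrative names, and the colors mirror the template above.

```python
import json
import urllib.request

# Placeholder: a real incoming-webhook URL (from the room's webhook dialog)
# points at chat.googleapis.com and carries key/token query parameters.
WEBHOOK_URL = "https://example.invalid/webhook"

def build_card(status, msg, thread_name=None):
    """Mirror the Jinja template: colored status plus message text."""
    colors = {"SUCCESS": "#0BCA14", "FAILED": "#E41B2B", "START": "#3A3CC4"}
    color = colors.get(status, "#F66905")
    text = '<b><font color="%s">%s</font></b> - %s' % (color, status, msg)
    payload = {"cards": [{"sections": [{"widgets": [
        {"textParagraph": {"text": text}}]}]}]}
    if thread_name:
        # replying into an existing thread keeps start/end messages together
        payload["thread"] = {"name": thread_name}
    return payload

def notify(status, msg, thread_name=None):
    """POST the card; only call this with a real webhook URL."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(build_card(status, msg, thread_name)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return urllib.request.urlopen(req)

print(build_card("SUCCESS", "deploy finished")
      ["cards"][0]["sections"][0]["widgets"][0]["textParagraph"]["text"])
```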

The calling code in the playbook looks like this:

- hosts: deploy-hosts
  tasks:
    - name: deploy
      block:
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "START"
            googlechat_msg: "foo"
          run_once: true
        [... do the actual deployment ...]
        - meta: flush_handlers
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "SUCCESS"
            googlechat_msg: "bar"
          run_once: true
        - meta: flush_handlers
      rescue:
        - include: roles/googlechat-notify/handlers/main.yml
          vars:
            googlechat_status: "FAILED"
            googlechat_msg: "baz"
          run_once: true

Posted Fri Aug 31 10:43:21 2018

I gave my first public talk on Saturday at FrOSCon 13. In case you're interested in how we maintain Docker base images (based on Debian-slim) at REWE Digital, the video is already online (German). The slides are also available, as well as a tarball containing the slides and all files, so you do not have to copy the snippets from the slides. The relevant tool, container-diff, is provided by Google on GitHub. In case you're interested in our migration to microservices, you can find the referenced talk given by Paul Puschmann at OSDC 2018 on Youtube (English). If you have any questions regarding the talk, don't hesitate to write me a mail; details on how to reach me are here.

If you're interested in the topic I highly recommend also watching the talk given by Chris Jantz, Unboxing and Building Container Images (English). Chris not only talks about what a container image contains, but also about the rather new Google tool Kaniko, which can build container images from Dockerfiles without root permissions and without dockerd.

Besides that, two of my colleagues gave a talk about Kafka from a developer perspective: Apache Kafka: Lessons learned (German). Judging from the feedback it was well received.

All in all it was a great experience and a huge thank you to all the volunteers keeping this event alive, especially to those who helped to set me up for the talk. You're awesome!

Posted Mon Aug 27 14:47:47 2018

I've just uploaded iptables 1.6.2 to stretch-backports (thanks Arturo for the swift ACK). The relevant new feature here is the --random-fully support for the MASQUERADE target. This release could be relevant to you if you have to deal with a rather large amount of NATed outbound connections, which is likely if you have to deal with the whale. The engineering team at Xing published a great writeup about this issue in February. So the lesson to learn here is that the nf_conntrack layer probably got a bit more robust during the Bittorrent heydays, but NAT is still evil shit we should get rid of.

Posted Sun Aug 12 14:45:00 2018

At work we run nginx in several instances, sometimes on Debian/stretch (Woooh) and sometimes on Debian/jessie (Boooo). To improve our request tracking abilities we set out to add a header with a UUID version 4 if it does not exist yet. We expected this to be a story we could implement in a few hours at most ...

/proc/sys/kernel/random/uuid vs lua uuid module

If you start to look around for how to implement it, you might find out that there is a Lua module to generate a UUID. Since this module is not packaged in Debian we started to think about packaging it, but on second thought we wondered if simply reading from the Linux /proc interface isn't faster after all. So we built a very unscientific test case that we deemed good enough:

$ cat uuid_by_kernel.lua
#!/usr/bin/env lua5.1
local i = 0
repeat
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  i = i + 1
until i == 1000

$ cat uuid_by_lua.lua
#!/usr/bin/env lua5.1
package.path = package.path .. ";/home/sven/uuid.lua"
local i = 0
repeat
  local uuid = require("uuid")
  local content = uuid()
  i = i + 1
until i == 1000

The result is in favour of using the Linux /proc interface:

$ time ./uuid_by_kernel.lua
real    0m0.013s
user    0m0.012s
sys 0m0.000s

$ time ./uuid_by_lua.lua
real    0m0.021s
user    0m0.016s
sys 0m0.004s
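For comparison, the same unscientific benchmark is easy to redo in Python (reading /proc only works on Linux, so the kernel variant is guarded; the absolute numbers are obviously machine dependent):

```python
import os
import time
import uuid

def time_it(fn, n=1000):
    """Run fn n times and return the elapsed wall clock time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return time.perf_counter() - start

def uuid_by_kernel():
    # equivalent of the Lua version, including stripping the trailing newline
    with open("/proc/sys/kernel/random/uuid") as f:
        return f.read().rstrip("\n")

def uuid_by_lib():
    return str(uuid.uuid4())

if os.path.exists("/proc/sys/kernel/random/uuid"):
    print("kernel:  %.4fs" % time_it(uuid_by_kernel))
print("library: %.4fs" % time_it(uuid_by_lib))
```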

nginx in Debian/stretch vs nginx in Debian/jessie

Now that we had settled on the lua code

if (ngx.var.http_correlation_id == nil or ngx.var.http_correlation_id == "") then
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  -- strip the trailing newline
  return content:sub(1, -2)
else
  return ngx.var.http_correlation_id
end

and the nginx configuration

set_by_lua_file $ngx.var.http_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;

we started to roll this one out to our mixed setup of Debian/stretch and Debian/jessie hosts. While we tested this one on Debian/stretch, and it all worked fine, we never gave it a try on Debian/jessie. Within seconds of the rollout all our nginx instances on Debian/jessie started to segfault.

Half an hour later it was clear that the nginx release shipped in Debian/jessie does not yet allow you to write directly into the internal variable $ngx.var.http_correlation_id. To work around this issue we configured nginx to use the add_header option to create the header:

set_by_lua_file $header_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
add_header correlation_id $header_correlation_id;

This configuration works on Debian/stretch and Debian/jessie.

Another possibility we considered was using the backported version of nginx. But this one depends on a newer openssl release. I didn't want to walk down the road of manually tracking potential openssl bugs against a release not supported by the official security team. So we rejected this option. Next item on the todo list is for sure the migration to Debian/stretch, which is overdue now anyway.

and it just stopped

A few hours later we found that the nginx running on Debian/stretch was still running, but no longer responding. Attaching strace revealed that all processes (worker and master) were waiting on a futex() call. The logs showed an assert pointing in the direction of the nchan module. I think the bug we're seeing is #446; I've added the few bits of additional information I could gather. We just moved on and disabled the module on our systems. It has now been running fine in all cases for a few weeks.

Kudos to Martin for walking down this muddy road together on a Friday.

Posted Sat Jun 23 18:08:51 2018

Sounds crazy and nobody would ever do that, but just for a moment imagine you no longer own your infrastructure.

Imagine you just run your container on something like GKE with Kubernetes.

Imagine you build your software with something like Jenkins running in a container, using the GKE provided docker interface to build stuff in another container.

And for a $reason imagine you're not using the Google provided container registry, but your own one hosted somewhere else on the internet.

Of course you access your registry via HTTPS, so your connection is secured at the transport level.

Now imagine your certificate is at the end of its validity period. Like ending the next day.

Imagine you just do what you do every time that happens, and you just order a new certificate from one of the left over CAs like DigiCert.

You receive your certificate within 15 minutes.

You deploy it to your registry.

You validate that your certificate chain validates against different certificate stores.

The one shipped in ca-certificates on various Debian releases you run.

The one in your browser.

Maybe you even test it with Google Chrome.

Everything is cool and validates. I mean, of course it does. DigiCert is a known CA player and the root CA certificate was created five years ago. A lot of time for a CA to be included and shipped in many places.

But still there is one issue. The docker commands you run in your build jobs fail to pull images from your registry because the certificate can not be validated.

You take a look at the underlying OS and indeed it's not shipping the 5 year old root CA certificate that issued your intermediate CA that just issued your new server certificate.

If it were your own infrastructure you would now just ship the missing certificate.

Maybe by including it in your internal ca-certificates build.

Or by just deploying it with ansible to /usr/share/ca-certificates/myfoo/ and adding that to the configuration in /etc/ca-certificates.conf so update-ca-certificates can create the relevant hash links for you in /etc/ssl/certs/.

But this time it's not your infrastructure and you cannot modify the operating system context your docker containers are running in.

Sounds insane, right? Luckily we're just making up a crazy story and something like that would never happen in the real world, because we all insist on owning our infrastructure.

Posted Fri Jun 15 21:04:44 2018

A small followup regarding the replacement of hp-health and hpssacli. Turns out a few more things have to be replaced; lucky you if you're already running on someone else's computer where you do not have to take care of the hardware.


replacing hpssacli

According to the super nice and helpful Craig L. at HPE they're planning an update of the MCP ssacli for Ubuntu 18.04. This one will also support the SmartArray firmware 1.34. If you need it now, you should be able to use the one released for RHEL and SLES. I did not test it.

replacing hp-health

The master plan is to query the iLO. Basically there are two ways: either locally via hponcfg, or remotely via a Perl sample script provided by HPE along with many helpful RIBCL XML file examples. Both approaches are not cool because you have to deal with a lot of XML, so we opted for a third way and used the awesome python-hpilo module (part of Debian/stretch), which abstracts all the RIBCL XML stuff nicely away from you.

If you'd like to have a taste of it: I had to reset a few iLO passwords to something sane (without quotes, double quotes and backticks) and did it like this:

pwfile="ilo-pwlist-$(date +%s)"

for x in $(seq -w 004 006); do
  host="ilo-${x}"   # placeholder, the real hostname pattern is site specific
  pw=$(pwgen -n 24 1)
  echo "${host},${pw}" >> $pwfile
  ssh $host "echo \"<RIBCL VERSION=\\\"2.0\\\"><LOGIN USER_LOGIN=\\\"adminname\\\" PASSWORD=\\\"password\\\"><USER_INFO MODE=\\\"write\\\"><MOD_USER USER_LOGIN=\\\"Administrator\\\"><PASSWORD value=\\\"$pw\\\"/></MOD_USER></USER_INFO></LOGIN></RIBCL>\" > /tmp/setpw.xml"
  ssh $host "sudo hponcfg -f /tmp/setpw.xml && rm /tmp/setpw.xml"
done
After I regained access to all iLO devices I used the hpilo_cli helper to add a monitoring user:

while read -r line; do
  host=$(echo $line|cut -d',' -f 1)
  pw=$(echo $line|cut -d',' -f 2)
  hpilo_cli -l Administrator -p $pw $host add_user user_login="monitoring" user_name="monitoring" password="secret" admin_priv=False remote_cons_priv=False reset_server_priv=False virtual_media_priv=False config_ilo_priv=False
done < ${1}

The helper script to actually query the iLO interfaces from our monitoring is, in comparison to those ad-hoc shell hacks, rather nice:

import hpilo, argparse

# credentials of the monitoring user created above
iloUser = "monitoring"
iloPassword = "secret"

parser = argparse.ArgumentParser()
parser.add_argument("component", help="HW component to query", choices=['battery', 'bios_hardware', 'fans', 'memory', 'network', 'power_supplies', 'processor', 'storage', 'temperature'])
parser.add_argument("host", help="iLO Hostname or IP address to connect to")
args = parser.parse_args()

def askIloHealth(component, host, user, password):
    ilo = hpilo.Ilo(host, user, password)
    health = ilo.get_embedded_health()
    print(health[component])

askIloHealth(args.component, args.host, iloUser, iloPassword)

You can also take a look at a more detailed state if you pprint the complete structure returned by get_embedded_health(). This whole approach of using the iLO should work since iLO 3; I tested versions 4 and 5.
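To illustrate the shape of that data, here's a sketch with a hypothetical, hand-made excerpt of such a health structure; the real dict returned by get_embedded_health() differs between iLO generations, so treat the keys and values as assumptions:

```python
from pprint import pprint

# Hypothetical excerpt, modeled after python-hpilo output; real structure varies.
health = {
    "fans": {
        "Fan 1": {"status": "OK", "speed": (23, "Percentage")},
        "Fan 2": {"status": "OK", "speed": (24, "Percentage")},
    },
    "temperature": {
        "01-Inlet Ambient": {"status": "OK", "currentreading": (18, "Celsius")},
    },
}

def component_status(health, component):
    """Collapse one component's entries into a name -> status mapping."""
    return {name: entry["status"] for name, entry in health[component].items()}

pprint(component_status(health, "fans"))
```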

Posted Fri May 11 17:40:08 2018