tl;dr: Yes, you can use Firefox 60 in Debian/stretch with your U2F device to authenticate your Google account, but you have to use Chrome for the registration.

Thanks to Mike, Moritz and probably others there's now Firefox 60 ESR in Debian/stretch. So I took it as a chance to finally activate my work YubiKey Nano as a U2F/2FA device for my work Google account. Turns out it's not so simple. Basically Google told me that this browser is not supported and that I should install the trojan horse (Chrome) to use this feature. So I gave in, installed Chrome, logged in to my Google account and added the YubiKey as the default 2FA device. Then I quit Chrome, went back to Firefox and logged in again to my Google account. Bam, it works! The YubiKey blinks, I can touch it and I'm logged in.

Just in case: you probably want to install "u2f-host" to have "libu2f-host0" available, which ships the udev rules needed to detect common U2F devices correctly.
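
A minimal way to pull that in on Debian/stretch, assuming the standard package name:

sudo apt install u2f-host    # pulls in libu2f-host0 and its udev rules for common U2F tokens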

Posted Mon Sep 10 14:13:06 2018

We're currently migrating from our on-premise HipChat instance to Google Chat (basically a nicer UI for Hangouts). Since our deployments are orchestrated by ansible playbooks, we'd like to write to the changelog chat rooms whenever a deployment starts and finishes (either with a success or a failure message). So I had to figure out how to write to those Google Chat rooms/conversations via the simple webhook API.
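
As a quick smoke test, independent of ansible, you can post a plain text message to such a room with curl. The URL below is just a placeholder for the webhook URL you get when configuring the incoming webhook of the room:

WEBHOOK_URL='https://chat.googleapis.com/v1/spaces/XXXXXX/messages?key=...&token=...'
curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"text": "deployment started"}' \
  "${WEBHOOK_URL}"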

First of all I learned a few more things about ansible.

  1. The "role_path" variable is no longer available, but "playbook_dir" works.
  2. The lookup() template module tries a strange set of lookup paths by default:

    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/roles/googlechat-notify/handlers/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/templates/card.json.j2"
    looking for "card.json.j2" at "/home/sven/deploy/card.json.j2"

I'm still wondering why it tries everything except the templates directory within the calling role, "/home/sven/deploy/roles/googlechat-notify/templates/card.json.j2".

I ended up with the following handler:

- name: notify google chat changelog channel
  uri:
    url: "{{ googlechat_room }}"
    body: "{{ lookup('template', playbook_dir + '/roles/googlechat-notify/templates/card.json.j2') }}"
    body_format: "json"
    method: "POST"
  ignore_errors: yes
  register: googlechat_result
  when: not (disable_googlechat_notification | default(false))

- set_fact:
    googlechat_conversation: "{{ googlechat_result.json.thread.name }}"
  ignore_errors: yes

It sends the following JSON template:

{
{% if googlechat_conversation is defined %}
  "thread": {
      "name": "{{ googlechat_conversation }}"
  },
{% endif %}
  "cards": [
    {
      "sections": [
        {
          "widgets": [
            {
              "textParagraph": {
              {% if googlechat_msg is defined and googlechat_status is defined and googlechat_status == "SUCCESS" %}
                "text": "<b><font color=\"#0BCA14\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "FAILED" %}
                "text": "<b><font color=\"#E41B2B\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% elif googlechat_msg is defined and googlechat_status is defined and googlechat_status == "START" %}
                "text": "<b><font color=\"#3A3CC4\">{{ googlechat_status }}</font></b> - {{ googlechat_msg }}"
              {% else %}
                "text": "<b><font color=\"#F66905\">UNKOWN status</font></b> - {{ googlechat_msg }}"
              {% endif %}
              }
            }
          ]
        }
      ]
    }
  ]
}

The message card documentation is here, in case you would like to look up the details. The advantage of cards compared to simple text messages is that you can colorize the output, so it's visually distinguishable whether you're dealing with a success or a failure.

The calling code in the playbook looks like this:

- hosts: deploy-hosts
  tasks:
  - name: deploy
    block:
      - include: roles/googlechat-notify/handlers/main.yml
        vars:
          googlechat_status: "START"
          googlechat_msg: "foo"
        run_once: true
        [... do the actual deployment ...]
      - meta: flush_handlers
      - include: roles/googlechat-notify/handlers/main.yml
        vars:
          googlechat_status: "SUCCESS"
          googlechat_msg: "bar"
        run_once: true
      - meta: flush_handlers
    rescue:
      - include: roles/googlechat-notify/handlers/main.yml
        vars:
          googlechat_status: "FAILED"
          googlechat_msg: "baz"
        run_once: true

Posted Fri Aug 31 10:43:21 2018

I gave my first public talk on Saturday at FrOSCon 13. In case you're interested in how we maintain Docker base images (based on Debian-slim) at REWE Digital, the video is already online (German). The slides are also available, as is a tarball containing the slides and all files, so you do not have to copy the snippets from the slides. The relevant tool, container-diff, is provided by Google on GitHub. In case you're interested in our migration to microservices, you can find the referenced talk given by Paul Puschmann at OSDC 2018 on YouTube (English). If you have any questions regarding the talk, don't hesitate to write me a mail; details on how to reach out are here.

If you're interested in the topic I highly recommend also watching the talk given by Chris Jantz, Unboxing and Building Container Images (English). Chris not only talks about what a container image contains, but also about the rather new Google tool Kaniko, which can build container images from Dockerfiles without root permissions and without dockerd.

Besides that, two of my colleagues gave a talk about Kafka from a developer perspective: Apache Kafka: Lessons learned (German). Judging from the feedback it was well received.

All in all it was a great experience and a huge thank you to all the volunteers keeping this event alive, especially to those who helped to set me up for the talk. You're awesome!

Posted Mon Aug 27 14:47:47 2018

I've just uploaded iptables 1.6.2 to stretch-backports (thanks Arturo for the swift ACK). The relevant new feature here is the --random-fully support for the MASQUERADE target. This release could be relevant to you if you have to deal with a rather large amount of NATed outbound connections, which is likely if you have to deal with the whale. The engineering team at Xing published a great writeup about this issue in February. So the lesson to learn here is that the nf_conntrack layer probably got a bit more robust during the BitTorrent heydays, but NAT is still evil shit we should get rid of.
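
For reference, the new flag simply gets appended to a MASQUERADE rule; interface and source network below are made-up placeholders:

iptables -t nat -A POSTROUTING -s 10.0.0.0/8 -o eth0 -j MASQUERADE --random-fully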

Posted Sun Aug 12 14:45:00 2018

At work we're running nginx in several instances, sometimes on Debian/stretch (Woooh) and sometimes on Debian/jessie (Boooo). To improve our request tracking abilities we set out to add a header with a UUID version 4 if it does not exist yet. We expected this to be a story we could implement in a few hours at most ...

/proc/sys/kernel/random/uuid vs lua uuid module

If you start to look around on how to implement this you might find out that there is a lua module to generate a UUID. Since this module is not packaged in Debian we started to think about packaging it, but on second thought we wondered if simply reading from the Linux /proc interface wouldn't be faster after all. So we built a very unscientific test case that we deemed good enough:

$ cat uuid_by_kernel.lua
#!/usr/bin/env lua5.1
local i = 0
repeat
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  i = i + 1
until i == 1000


$ cat uuid_by_lua.lua
#!/usr/bin/env lua5.1
package.path = package.path .. ";/home/sven/uuid.lua"
local i = 0
repeat
  local uuid = require("uuid")
  local content = uuid()
  i = i + 1
until i == 1000

The result is in favour of using the Linux /proc interface:

$ time ./uuid_by_kernel.lua
real    0m0.013s
user    0m0.012s
sys 0m0.000s

$ time ./uuid_by_lua.lua
real    0m0.021s
user    0m0.016s
sys 0m0.004s

nginx in Debian/stretch vs nginx in Debian/jessie

Now that we had settled on the lua code

if (ngx.var.http_correlation_id == nil or ngx.var.http_correlation_id == "") then
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content:sub(1, -2)
else
  return ngx.var.http_correlation_id
end

and the nginx configuration

set_by_lua_file $ngx.var.http_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;

we started to roll this out to our mixed setup of Debian/stretch and Debian/jessie hosts. While we had tested it on Debian/stretch and it all worked fine, we never gave it a try on Debian/jessie. Within seconds of the rollout all our nginx instances on Debian/jessie started to segfault.

Half an hour later it was clear that the nginx release shipped in Debian/jessie does not yet allow you to write directly into the internal variable $ngx.var.http_correlation_id. To work around this issue we configured nginx to use the add_header option to create the header instead:

set_by_lua_file $header_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
add_header correlation_id $header_correlation_id;

This configuration works on Debian/stretch and Debian/jessie.

Another possibility we considered was using the backported version of nginx. But this one depends on a newer openssl release. I didn't want to walk down the road of manually tracking potential openssl bugs against a release not supported by the official security team. So we rejected this option. Next item on the todo list is for sure the migration to Debian/stretch, which is overdue now anyway.

and it just stopped

A few hours later we found that the nginx running on Debian/stretch was still running, but no longer responding. Attaching strace revealed that all processes (worker and master) were waiting on a futex() call. Logs showed an assert pointing in the direction of the nchan module. I think the bug we're seeing is #446; I've added the few bits of additional information I could gather. We just moved on and disabled the module on our systems. It has now been running fine in all cases for a few weeks.

Kudos to Martin for walking down this muddy road together on a Friday.

Posted Sat Jun 23 18:08:51 2018

Sounds crazy and nobody would ever do that, but just for a moment imagine you no longer own your infrastructure.

Imagine you just run your container on something like GKE with Kubernetes.

Imagine you build your software with something like Jenkins running in a container, using the GKE provided docker interface to build stuff in another container.

And for a $reason imagine you're not using the Google provided container registry, but your own one hosted somewhere else on the internet.

Of course you access your registry via HTTPS, so your connection is secured at the transport level.

Now imagine your certificate is at the end of its validity period. Like ending the next day.

Imagine you just do what you do every time that happens, and you just order a new certificate from one of the left over CAs like DigiCert.

You receive your certificate within 15 minutes.

You deploy it to your registry.

You check that your certificate chain validates against different certificate stores.

The one shipped in ca-certificates on various Debian releases you run.

The one in your browser.

Maybe you even test it with Google Chrome.

Everything is cool and validates. I mean, of course it does. DigiCert is a known CA player and the root CA certificate was created five years ago. A lot of time for a CA to be included and shipped in many places.

But still there is one issue. The docker commands you run in your build jobs fail to pull images from your registry because the certificate can not be validated.

You take a look at the underlying OS and indeed it's not shipping the five year old root CA certificate that issued your intermediate CA, which in turn just issued your new server certificate.

If it were your own infrastructure you would now just ship the missing certificate.

Maybe by including it in your internal ca-certificates build.

Or by just deploying it with ansible to /usr/share/ca-certificates/myfoo/ and adding that to the configuration in /etc/ca-certificates.conf so update-ca-certificates can create the relevant hash links for you in /etc/ssl/certs/.
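
The manual variant of that dance is short; "myfoo/myroot.crt" is just a made-up name for the missing root certificate:

sudo mkdir -p /usr/share/ca-certificates/myfoo
sudo cp myroot.crt /usr/share/ca-certificates/myfoo/myroot.crt
echo "myfoo/myroot.crt" | sudo tee -a /etc/ca-certificates.conf
sudo update-ca-certificates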

But this time it's not your infrastructure and you can not modify the operating system context your docker containers are running in.

Sounds insane, right? Luckily we're just making up a crazy story and something like that would never happen in the real world, because we all insist on owning our infrastructure.

Posted Fri Jun 15 21:04:44 2018

A small followup regarding the replacement of hp-health and hpssacli. Turns out a few things have to be replaced. Lucky you, if you're already running on someone else's computer where you do not have to take care of the hardware.

ssacli

According to the super nice and helpful Craig L. at HPE they're planning an update of the MCP ssacli for Ubuntu 18.04. This one will also support the SmartArray firmware 1.34. If you need it now you should be able to use the one released for RHEL and SLES. I did not test it.

replacing hp-health

The master plan is to query the iLO. Basically there are two ways: either locally via hponcfg, or remotely via a Perl script sample provided by HPE along with many helpful RIBCL XML file examples. Both approaches are not cool because you have to deal with a lot of XML, so we opted for a third way and use the awesome python-hpilo module (part of Debian/stretch) which abstracts all the RIBCL XML stuff nicely away from you.

If you'd like to have a taste of it: I had to reset a few iLO passwords to something sane, without quotes, double quotes and backticks, and did it like this:

#!/bin/bash
pwfile="ilo-pwlist-$(date +%s)"

for x in $(seq -w 004 006); do
  pw=$(pwgen -n 24 1)
  host="host-${x}"
  echo "${host},${pw}" >> $pwfile
  ssh $host "echo \"<RIBCL VERSION=\\\"2.0\\\"><LOGIN USER_LOGIN=\\\"adminname\\\" PASSWORD=\\\"password\\\"><USER_INFO MODE=\\\"write\\\"><MOD_USER USER_LOGIN=\\\"Administrator\\\"><PASSWORD value=\\\"$pw\\\"/></MOD_USER></USER_INFO></LOGIN></RIBCL>\" > /tmp/setpw.xml"
  ssh $host "sudo hponcfg -f /tmp/setpw.xml && rm /tmp/setpw.xml"
done

After I regained access to all iLO devices I used the hpilo_cli helper to add a monitoring user:

#!/bin/bash
while read -r line; do
  host=$(echo $line|cut -d',' -f 1)
  pw=$(echo $line|cut -d',' -f 2)
  hpilo_cli -l Administrator -p $pw $host add_user user_login="monitoring" user_name="monitoring" password="secret" admin_priv=False remote_cons_priv=False reset_server_priv=False virtual_media_priv=False config_ilo_priv=False
done < ${1}

The helper script to actually query the iLO interfaces from our monitoring is, in comparison to those ad-hoc shell hacks, rather nice:

#!/usr/bin/python3
import hpilo, argparse
iloUser="monitoring"
iloPassword="secret"

parser = argparse.ArgumentParser()
parser.add_argument("component", help="HW component to query", choices=['battery', 'bios_hardware', 'fans', 'memory', 'network', 'power_supplies', 'processor', 'storage', 'temperature'])
parser.add_argument("host", help="iLO Hostname or IP address to connect to")
args = parser.parse_args()

def askIloHealth(component, host, user, password):
    ilo = hpilo.Ilo(host, user, password)
    health = ilo.get_embedded_health()
    print(health['health_at_a_glance'][component]['status'])

askIloHealth(args.component, args.host, iloUser, iloPassword)

You can also take a look at a more detailed state if you pprint the complete structure returned by "get_embedded_health". This whole approach of using the iLO should work since iLO 3. I tested versions 4 and 5.
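
For ad-hoc digging you do not even need the script above; hpilo_cli can dump the whole structure directly. A sketch, reusing the monitoring user and the host naming from the examples above:

hpilo_cli -l monitoring -p secret host-004 get_embedded_health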

Posted Fri May 11 17:40:08 2018

We received our first HPE gen10 systems, a bunch of DL360, and experienced a few caveats while setting up Debian/stretch.

PXE installation

While our DL120 gen9 systems announced themselves as "X86-64_EFI" (client architecture 00009), the DL360 gen10 systems use "BC_EFI" (client architecture 00007) in BOOTP/DHCP protocol option 93. Since we use dnsmasq as DHCP and tftp server we rely on tags like this:

# new style UEFI PXE
dhcp-boot=bootnetx64.efi
# client arch 00009
pxe-service=tag:s1,X86-64_EFI, "Boot UEFI X86-64_EFI", bootnetx64.efi
# client arch 00007
pxe-service=tag:s2,BC_EFI, "Boot UEFI BC_EFI", bootnetx64.efi

dhcp-host=set:s1,AB:CD:37:3A:2E:FG,192.168.1.5,host-001
dhcp-host=set:s2,AB:CD:37:3A:2H:IJ,192.168.1.6,host-002

This is easy to spot with wireshark once you understand what you're looking for.
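
If you only have console access, a rough tcpdump on the DHCP ports gets you the same information; eth0 stands in for whatever interface faces the provisioning network, and option 93 shows up in the decoded options:

tcpdump -i eth0 -vvv -n port 67 or port 68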

debian-installer

For some reason, and I heard some rumours that this is a known bug, I had to disable USB support and the SD-card reader in the interface formerly known as BIOS. Otherwise the installer detects the first volume of the P408i raid controller as "/dev/sdb" instead of "/dev/sda".

Network interfaces depend highly on your actual setup. Booting from the additional 10G interfaces worked out of the box; they're detected with reliable names as eno5 and eno6.

HPE MCP

So far we relied on the hp-health and ssacli (formerly hpssacli) packages from the HPE MCP. Currently those tools seem to not support Gen10 systems. I'm trying to find out what the alternative is to monitor the health state of the system components. At least for hp-health it's mentioned that only up to Gen9 is supported.

That's what I receive from ssacli:

=> ctrl all show status

HPE P408i-a SR Gen10 in Slot 0 (Embedded)

APPLICATION UPGRADE REQUIRED: This controller has been configured with a more
                          recent version of software.
                          To prevent data loss, configuration changes to
                          this controller are not allowed.
                          Please upgrade to the latest version to be able
                          to continue to configure this controller.

That's what I found in our logs from the failed start of hpasmlited:

hpasmlited[31952]: check_ilo2: BMC Returned Error:  ccode  0x0,  Req. Len:  15, Resp. Len:  21

auto configuration

If you're looking into automatic configuration of those systems you'll have to look at Redfish. API documentation for iLO 5 can be found at https://hewlettpackard.github.io/ilo-rest-api-docs/ilo5. Since it's not clear how many of those systems we will actually set up, I'm so far a bit reluctant to automate the setup further.
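
For a first hedged peek at the Redfish tree, curl against the standard service root is enough; hostname and credentials below are placeholders, and jq is only there for pretty-printing:

curl -sk -u monitoring:secret https://ilo-host-001/redfish/v1/Systems/1/ | jq .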

Posted Tue May 8 18:55:32 2018

While the memory leak is fixed in logstash 5.6.9, the logstash-input-udp plugin is broken. A fixed plugin got released as version 3.3.2.

The code change is https://github.com/logstash-plugins/logstash-input-udp/commit/7ecec49a3f1a0f8b51c77bd9243b8cc0dbebaeb8.

The discussion is at https://discuss.elastic.co/t/udp-input-is-crashing/128485.

So instead of fiddling again with plugin updates and offline bundles we decided to just go down the ugly road of abusing ansible and installing a plain file copy of the fixed udp.rb. This is horrible but works.

- name: check for br0ken logstash udp input plugin version
  shell: /usr/share/logstash/bin/logstash-plugin list --verbose logstash-input-udp | grep -E '3\.3\.1'
  register: logstash_udp_plugin_check
  ignore_errors: True
  tags:
    - "skip_ansible_lint"

- name: install fixed udp input plugin
  copy:
    src: "hacks/udp.rb"
    dest: "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.1/lib/logstash/inputs/udp.rb"
    owner: "root"
    group: "root"
    mode: 0644
  when: logstash_udp_plugin_check.rc == 0
  notify: restart logstash
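
For completeness: on hosts with internet access the supported route would be a plugin update straight from rubygems, along these lines (we did not use it here):

/usr/share/logstash/bin/logstash-plugin install --version 3.3.2 logstash-input-udp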

Kudos to Martin and Paul for handling this one swiftly.

Posted Thu Apr 19 13:00:44 2018

In case you're using logstash 5.6.x from elastic: version 5.6.9 has been released with logstash-filter-grok 4.0.3. This one fixes a bad memory leak that was a cause of frequent logstash crashes since logstash 5.5.6. Reference: https://github.com/logstash-plugins/logstash-filter-grok/issues/135
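
You can verify which plugin version your installation actually ships with the same logstash-plugin helper used in the post above; the path assumes the default package layout:

/usr/share/logstash/bin/logstash-plugin list --verbose logstash-filter-grok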

I hope this is now again a decent logstash 5.x release. I've heard some rumours that the 6.x versions are also a bit plagued by memory leaks. :-/

Posted Wed Apr 18 18:11:08 2018