
At work we're running nginx in several instances. Sometimes running on Debian/stretch (Woooh) and sometimes on Debian/jessie (Boooo). To improve our request tracking abilities we set out to add a header with a version 4 UUID if one does not exist yet. We expected this to be a story we could implement in a few hours at most ...

/proc/sys/kernel/random/uuid vs lua uuid module

If you start to look around on how to implement it you might find out that there is a lua module to generate a UUID. Since this module is not packaged in Debian we started to think about packaging it, but on second thought we wondered whether simply reading from the Linux /proc interface wouldn't be faster after all. So we built a very unscientific test case that we deemed good enough:

$ cat uuid_by_kernel.lua
#!/usr/bin/env lua5.1
local i = 0
repeat
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  i = i + 1
until i == 1000


$ cat uuid_by_lua.lua
#!/usr/bin/env lua5.1
package.path = package.path .. ";/home/sven/uuid.lua"
local i = 0
repeat
  local uuid = require("uuid")
  local content = uuid()
  i = i + 1
until i == 1000

The result is in favour of using the Linux /proc interface:

$ time ./uuid_by_kernel.lua
real    0m0.013s
user    0m0.012s
sys 0m0.000s

$ time ./uuid_by_lua.lua
real    0m0.021s
user    0m0.016s
sys 0m0.004s

nginx in Debian/stretch vs nginx in Debian/jessie

Now that we had settled on the lua code

if (ngx.var.http_correlation_id == nil or ngx.var.http_correlation_id == "") then
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content:sub(1, -2)
else
  return ngx.var.http_correlation_id
end

and the nginx configuration

set_by_lua_file $ngx.var.http_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;

we started to roll this out to our mixed setup of Debian/stretch and Debian/jessie hosts. We had tested it on Debian/stretch, where it all worked fine, but we never gave it a try on Debian/jessie. Within seconds of the rollout all our nginx instances on Debian/jessie started to segfault.

Half an hour later it was clear that the nginx release shipped in Debian/jessie does not yet allow you to write directly into the internal variable $ngx.var.http_correlation_id. To work around this issue we configured nginx to use the add_header configuration option to create the header instead.

set_by_lua_file $header_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
add_header correlation_id $header_correlation_id;

This configuration works on Debian/stretch and Debian/jessie.

Another possibility we considered was using the backported version of nginx, but that one depends on a newer openssl release. I didn't want to walk down the road of manually tracking potential openssl bugs against a release not supported by the official security team, so we rejected this option. The next item on the todo list is for sure the migration to Debian/stretch, which is overdue now anyway.

and it just stopped

A few hours later we found that the nginx on Debian/stretch was still running, but no longer responding. Attaching strace revealed that all processes (worker and master) were waiting on a futex() call. Logs showed an assert pointing in the direction of the nchan module. I think the bug we're seeing is #446; I've added the few bits of additional information I could gather. We just moved on and disabled the module on our systems. It has now been running fine in all cases for a few weeks.
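Something along these lines is enough to spot the stuck futex() calls (a sketch, assuming pgrep -x only matches the affected nginx instance):

# attach strace briefly to every nginx process and show the last syscalls
for pid in $(pgrep -x nginx); do
  echo "=== ${pid} ==="
  timeout 5 strace -p "${pid}" 2>&1 | tail -n 3
done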

Kudos to Martin for walking down this muddy road together on a Friday.

Posted Sat Jun 23 18:08:51 2018

Sounds crazy and nobody would ever do that, but just for a moment imagine you no longer own your infrastructure.

Imagine you just run your container on something like GKE with Kubernetes.

Imagine you build your software with something like Jenkins running in a container, using the GKE provided docker interface to build stuff in another container.

And for a $reason imagine you're not using the Google provided container registry, but your own, hosted somewhere else on the internet.

Of course you access your registry via HTTPS, so your connection is secured at the transport level.

Now imagine your certificate is at the end of its validity period. Like ending the next day.

Imagine you just do what you do every time that happens, and order a new certificate from one of the leftover CAs, like DigiCert.

You receive your certificate within 15 minutes.

You deploy it to your registry.

You validate that your certificate chain validates against different certificate stores.

The one shipped in ca-certificates on various Debian releases you run.

The one in your browser.

Maybe you even test it with Google Chrome.

Everything is cool and validates. I mean, of course it does. DigiCert is a known CA player, and the root CA certificate was created five years ago. That's a lot of time for a root certificate to be included and shipped in many places.
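For the command line part of that check, a minimal sketch against the Debian trust store, with hypothetical file names:

# verify the server certificate against the system trust store, passing
# the intermediate separately; intermediate.pem and server.pem are placeholders
openssl verify -CAfile /etc/ssl/certs/ca-certificates.crt -untrusted intermediate.pem server.pem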

But still there is one issue. The docker commands you run in your build jobs fail to pull images from your registry because the certificate cannot be validated.

You take a look at the underlying OS and indeed it's not shipping the five year old root CA certificate that signed your intermediate CA, which in turn issued your new server certificate.

If it were your own infrastructure you would now just ship the missing certificate.

Maybe by including it in your internal ca-certificates build.

Or by just deploying it with ansible to /usr/share/ca-certificates/myfoo/ and adding that to the configuration in /etc/ca-certificates.conf so update-ca-certificates can create the relevant hash links for you in /etc/ssl/certs/.
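The non-ansible variant is just a few commands; a minimal sketch, assuming the missing root certificate is available as myfoo-root.crt (hypothetical name):

# stage the certificate where update-ca-certificates picks it up
mkdir -p /usr/share/ca-certificates/myfoo
cp myfoo-root.crt /usr/share/ca-certificates/myfoo/
# register it and let update-ca-certificates create the hash links
echo "myfoo/myfoo-root.crt" >> /etc/ca-certificates.conf
update-ca-certificates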

But this time it's not your infrastructure and you cannot modify the operating system context your docker containers are running in.

Sounds insane, right? Luckily we're just making up a crazy story and something like that would never happen in the real world, because we all insist on owning our infrastructure.

Posted Fri Jun 15 21:04:44 2018

A small followup regarding the replacement of hp-health and hpssacli. Turns out a few more things have to be replaced. Lucky you, if you're already running on someone else's computer, where you do not have to take care of the hardware.

ssacli

According to the super nice and helpful Craig L. at HPE they're planning an update of the MCP ssacli for Ubuntu 18.04. This one will also support the SmartArray firmware 1.34. If you need it now, you should be able to use the one released for RHEL and SLES. I did not test it.

replacing hp-health

The master plan is to query the iLO. Basically there are two ways: either locally via hponcfg, or remotely via a Perl script sample provided by HPE along with many helpful RIBCL XML file examples. Neither approach is cool, because you have to deal with a lot of XML, so we opted for a third way and used the awesome python-hpilo module (part of Debian/stretch), which abstracts all the RIBCL XML stuff nicely away from you.

If you'd like to have a taste of it: I had to reset a few iLO passwords to something sane, without quotes, double quotes and backticks, and did it like this:

#!/bin/bash
pwfile="ilo-pwlist-$(date +%s)"

for x in $(seq -w 004 006); do
  pw=$(pwgen -n 24 1)
  host="host-${x}"
  echo "${host},${pw}" >> $pwfile
  ssh $host "echo \"<RIBCL VERSION=\\\"2.0\\\"><LOGIN USER_LOGIN=\\\"adminname\\\" PASSWORD=\\\"password\\\"><USER_INFO MODE=\\\"write\\\"><MOD_USER USER_LOGIN=\\\"Administrator\\\"><PASSWORD value=\\\"$pw\\\"/></MOD_USER></USER_INFO></LOGIN></RIBCL>\" > /tmp/setpw.xml"
  ssh $host "sudo hponcfg -f /tmp/setpw.xml && rm /tmp/setpw.xml"
done

After I regained access to all iLO devices I used the hpilo_cli helper to add a monitoring user:

#!/bin/bash
while read -r line; do
  host=$(echo $line|cut -d',' -f 1)
  pw=$(echo $line|cut -d',' -f 2)
  hpilo_cli -l Administrator -p $pw $host add_user user_login="monitoring" user_name="monitoring" password="secret" admin_priv=False remote_cons_priv=False reset_server_priv=False virtual_media_priv=False config_ilo_priv=False
done < ${1}

The helper script to actually query the iLO interfaces from our monitoring is, in comparison to those ad-hoc shell hacks, rather nice:

#!/usr/bin/python3
import hpilo, argparse
iloUser="monitoring"
iloPassword="secret"

parser = argparse.ArgumentParser()
parser.add_argument("component", help="HW component to query", choices=['battery', 'bios_hardware', 'fans', 'memory', 'network', 'power_supplies', 'processor', 'storage', 'temperature'])
parser.add_argument("host", help="iLO Hostname or IP address to connect to")
args = parser.parse_args()

def askIloHealth(component, host, user, password):
    ilo = hpilo.Ilo(host, user, password)
    health = ilo.get_embedded_health()
    print(health['health_at_a_glance'][component]['status'])

askIloHealth(args.component, args.host, iloUser, iloPassword)

You can also take a look at a more detailed state if you pprint the complete structure returned by "get_embedded_health". This whole approach of using the iLO should work since iLO 3; I tested versions 4 and 5.
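For one-off inspection you can also use the bundled hpilo_cli helper, e.g. with the monitoring user created above (the host name is a placeholder):

# dump the complete health structure for a single host
hpilo_cli -l monitoring -p secret host-004 get_embedded_health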

Posted Fri May 11 17:40:08 2018

We received our first HPE Gen10 systems, a bunch of DL360s, and experienced a few caveats while setting up Debian/stretch.

PXE installation

While our DL120 Gen9 announced themselves as "X86-64_EFI" (client architecture 00009), the DL360 Gen10 use "BC_EFI" (client architecture 00007) in BOOTP/DHCP protocol option 93. Since we use dnsmasq as DHCP and tftp server we rely on tags like this:

# new style UEFI PXE
dhcp-boot=bootnetx64.efi
# client arch 00009
pxe-service=tag:s1,X86-64_EFI, "Boot UEFI X86-64_EFI", bootnetx64.efi
# client arch 00007
pxe-service=tag:s2,BC_EFI, "Boot UEFI BC_EFI", bootnetx64.efi

dhcp-host=set:s1,AB:CD:37:3A:2E:FG,192.168.1.5,host-001
dhcp-host=set:s2,AB:CD:37:3A:2H:IJ,192.168.1.6,host-002

This is easy to spot with wireshark once you understand what you're looking for.
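If you want to watch it happen, something like this shows the fully decoded DHCP options, including option 93 (a sketch; eth0 is a placeholder, and depending on your wireshark release the dissector is named bootp or dhcp):

# capture DHCP traffic and print the full decoded packets
tshark -i eth0 -f "udp port 67 or udp port 68" -O bootp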

debian-installer

For some reason, and I heard some rumours that this is a known bug, I had to disable USB support and the SD-card reader in the interface formerly known as BIOS. Otherwise the installer detects the first volume of the P408i raid controller as "/dev/sdb" instead of "/dev/sda".

Network interface naming depends highly on your actual setup. Booting from the additional 10G interfaces worked out of the box; they're reliably detected as eno5 and eno6.

HPE MCP

So far we relied on the hp-health and ssacli (formerly hpssacli) packages from the HPE MCP. Currently those tools do not seem to support Gen10 systems. I'm trying to find out what the alternative is for monitoring the health state of the system components. At least for hp-health it's documented that only systems up to Gen9 are supported.

That's what I receive from ssacli:

=> ctrl all show status

HPE P408i-a SR Gen10 in Slot 0 (Embedded)

APPLICATION UPGRADE REQUIRED: This controller has been configured with a more
                          recent version of software.
                          To prevent data loss, configuration changes to
                          this controller are not allowed.
                          Please upgrade to the latest version to be able
                          to continue to configure this controller.

That's what I found in our logs from the failed start of hpasmlited:

hpasmlited[31952]: check_ilo2: BMC Returned Error:  ccode  0x0,  Req. Len:  15, Resp. Len:  21

auto configuration

If you're looking into automatic configuration of those systems you'll have to look at Redfish. API documentation for iLO5 can be found at https://hewlettpackard.github.io/ilo-rest-api-docs/ilo5. Since it's not clear how many of those systems we will actually set up, I'm so far a bit reluctant to automate things further.
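A first manual probe of the Redfish endpoint could look like this (a sketch; the host name and the monitoring credentials from above are placeholders, and -k skips certificate verification, so it's fine for a first look only):

# query the system resource of the iLO5 Redfish API and show its health
curl -sk -u monitoring:secret https://ilo-host-001/redfish/v1/Systems/1/ | jq '.Status'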

Posted Tue May 8 18:55:32 2018

While the memory leak is fixed in logstash 5.6.9, the logstash-input-udp plugin it ships is broken. A fixed plugin was released as version 3.3.2.

The code change is https://github.com/logstash-plugins/logstash-input-udp/commit/7ecec49a3f1a0f8b51c77bd9243b8cc0dbebaeb8.

The discussion is at https://discuss.elastic.co/t/udp-input-is-crashing/128485.

So instead of fiddling again with plugin updates and offline bundles we decided to just go down the ugly road of abusing ansible and installing a file copy of the fixed udp.rb. This is horrible, but it works.

- name: check for br0ken logstash udp input plugin version
  shell: /usr/share/logstash/bin/logstash-plugin list --verbose logstash-input-udp | grep -E '3\.3\.1'
  register: logstash_udp_plugin_check
  ignore_errors: True
  tags:
    - "skip_ansible_lint"

- name: install fixed udp input plugin
  copy:
    src: "hacks/udp.rb"
    dest: "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-udp-3.3.1/lib/logstash/inputs/udp.rb"
    owner: "root"
    group: "root"
    mode: 0644
  when: logstash_udp_plugin_check.rc == 0
  notify: restart logstash

Kudos to Martin and Paul for handling this one swiftly.

Posted Thu Apr 19 13:00:44 2018

In case you're using logstash 5.6.x from elastic: version 5.6.9 is released with logstash-filter-grok 4.0.3. This one fixes a bad memory leak that caused frequent logstash crashes since logstash 5.5.6. Reference: https://github.com/logstash-plugins/logstash-filter-grok/issues/135
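To check which grok filter version an installation actually ships, you can ask the logstash-plugin helper:

/usr/share/logstash/bin/logstash-plugin list --verbose logstash-filter-grok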

I hope this is now again a decent logstash 5.x release. I've heard some rumours that the 6.x versions are also a bit plagued by memory leaks. :-/

Posted Wed Apr 18 18:11:08 2018

At work we're using JFrog Artifactory to provide a Debian repository (among other kinds of repositories). Using the WebUI sucks, and uploading by cut&pasting a curl command is annoying too, so I just wrote down a few lines of shell to upload a single Debian binary package.

The expectation is a flat repository, and that you edit the variables at the top to provide the repository URL, name and your API key. So no magic involved.
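The script itself is not reproduced here, but the core of it is a single curl PUT; a minimal sketch, where URL, repository name and API key are hypothetical placeholders:

#!/bin/bash
# upload one .deb to a flat Artifactory Debian repository; edit the
# variables below, they're placeholders matching the description above
ARTIFACTORY_URL="https://artifactory.example.com/artifactory"
REPO="debian-local"
APIKEY="changeme"

deb="${1:?usage: $0 package.deb}"

# Artifactory derives the Debian index metadata from the matrix parameters
curl -H "X-JFrog-Art-Api: ${APIKEY}" -T "${deb}" \
  "${ARTIFACTORY_URL}/${REPO}/$(basename "${deb}");deb.distribution=stretch;deb.component=main;deb.architecture=amd64"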

Posted Wed Mar 14 19:26:37 2018

I spent an hour adding very basic support for the upcoming Java 10 to my fork of java-package. It still has some edges and the list of binary executables managed via the alternatives system requires some major cleanup. I think once Java 8 is EOL in September it's a good point to consolidate and strip everything except for Java 11 support. If someone requires an older release they can still go back to an earlier version, but by then we won't see any new releases of Java 8, 9 or 10 anyway. Not to speak of even older stuff.

[sven@digital lib (master)]$ java -version
java version "10" 2018-03-20
Java(TM) SE Runtime Environment 18.3 (build 10+46)
Java HotSpot(TM) 64-Bit Server VM 18.3 (build 10+46, mixed mode)
Posted Fri Mar 9 19:22:29 2018

Maybe some recent events led to BIOS update releases by various vendors around the end of 2017. So I set out to update (for the first time) the BIOS of my laptops. Searching the interwebs for some hints I found a lot of outdated information involving USB thumb drives, CDs and FreeDOS in several variants, but also some useful stuff. So here is the short list of what actually worked, in case I need to do it again.

Update: I added a Wiki page so it's possible to extend the list. Seems that some of us avoided the update hassle so far, but now with all those Intel ME CVEs and Intel microcode updates it's likely we'll have to do it more often.

Dell Latitude E7470 (UEFI boot setup)

  1. Download the file "Latitude_E7x70_1.18.5.exe" (or whatever is the current release).
  2. Move the file to "/boot/efi/".
  3. Boot into the one time boot menu with F12 during the BIOS/UEFI start.
  4. Select the "Flash BIOS Update" menu option.
  5. Use your mouse to select the update file visually and watch the magic.

So no USB sticks, FreeDOS, SystemRescueCd images or other tricks involved. Whether it's cool that the computer in your computer's computer running Minix (or whatever is involved in this case) updates your firmware is a different topic, but the process is pretty straightforward.

Lenovo ThinkPad P50

  1. Download the BIOS Update bootable CD image from Lenovo "n1eur31w.iso" (Select Windows as OS so it's available for download).
  2. Extract the El Torito boot image from the ISO: "geteltorito -o thinkpad.img Downloads/n1eur31w.iso" (see the combined commands after this list).
  3. Dump it on a USB thumb drive "dd if=thinkpad.img of=/dev/sdX".
  4. Boot from this thumb drive and follow the instructions of the installer.
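Put together, steps 2 and 3 boil down to (replace /dev/sdX with the actual device of your thumb drive):

# extract the El Torito boot image and write it to the USB stick
geteltorito -o thinkpad.img Downloads/n1eur31w.iso
sudo dd if=thinkpad.img of=/dev/sdX bs=4M
sync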

I guess the process is similar for almost all ThinkPads.

Posted Sat Jan 6 23:34:55 2018

Sitting at home in a not so decent state made me finally fiddle with java-package to deal with Oracle Java 9 builds. For now I've added only some half-assed support for JDK 9 amd64 builds. That's what you download as "jdk-9.0.1_linux-x64_bin.tar.gz" from the Oracle Java pages. It's a works-for-me thing, but maybe someone finds it useful; the source is here.

git clone https://git.sven.stormbind.net/java-package.git
cd java-package
sed -i -e 's#lib_dir="/usr/share/java-package"#lib_dir="./lib"#' make-jpkg

and you can just start using it in this directory without creating and installing the java-package Debian package.
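A hypothetical invocation then looks like this, assuming the downloaded JDK archive sits in the same directory:

# build a Debian package from the Oracle tarball
./make-jpkg jdk-9.0.1_linux-x64_bin.tar.gz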

Side note: if you try out Java within a chroot, mount /proc into it. I wasted half an hour this morning finding that out.
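For the record, with /srv/chroot as a hypothetical chroot path:

# a plain proc mount is enough for the JVM to start
mount -t proc proc /srv/chroot/proc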

Posted Sun Nov 26 13:45:19 2017