nginx, lua, uuid and an nchan bug

At work we're running nginx in several instances, sometimes on Debian/stretch (Woooh) and sometimes on Debian/jessie (Boooo). To improve our request tracking abilities we set out to add a header with a UUID version 4 if the request does not carry one yet. We expected this to be a story we could implement in a few hours at most ...

/proc/sys/kernel/random/uuid vs lua uuid module

If you start to look around for how to implement this, you might find that there is a lua module to generate a UUID. Since this module is not packaged in Debian we started to think about packaging it, but on second thought we wondered whether simply reading from the Linux /proc interface might be faster after all. So we built a very unscientific test case that we deemed good enough:

$ cat uuid_by_kernel.lua
#!/usr/bin/env lua5.1
local i = 0
repeat
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  i = i + 1
until i == 1000


$ cat uuid_by_lua.lua
#!/usr/bin/env lua5.1
package.path = package.path .. ";/home/sven/uuid.lua"
local i = 0
repeat
  local uuid = require("uuid")
  local content = uuid()
  i = i + 1
until i == 1000

The result is in favour of using the Linux /proc interface:

$ time ./uuid_by_kernel.lua
real    0m0.013s
user    0m0.012s
sys     0m0.000s

$ time ./uuid_by_lua.lua
real    0m0.021s
user    0m0.016s
sys     0m0.004s
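Keep in mind that `time` on the whole script also counts interpreter start-up and, for the second script, loading the module. A minimal sketch of an in-process measurement follows, assuming Linux and that CPU time from os.clock() is a good enough yardstick. The helper names `bench` and `uuid_by_kernel` are ours and not part of the original test:

```lua
-- Hypothetical helper: run fn `n` times and return the CPU seconds consumed.
local function bench(fn, n)
  local start = os.clock()
  for _ = 1, n do
    fn()
  end
  return os.clock() - start
end

-- Candidate under test: read one UUID from the kernel (Linux only).
local function uuid_by_kernel()
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content
end

print(string.format("kernel: %.6f s", bench(uuid_by_kernel, 1000)))
```

The same `bench` call can wrap the lua module's generator to compare both approaches inside one process.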

nginx in Debian/stretch vs nginx in Debian/jessie

Now that we had settled on the lua code

if (ngx.var.http_correlation_id == nil or ngx.var.http_correlation_id == "") then
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content:sub(1, -2)
else
  return ngx.var.http_correlation_id
end
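The content:sub(1, -2) call strips the trailing newline the kernel appends to the UUID. As a hedged sketch, the same step could be combined with a format check so that an unexpected read never ends up in a header. The function name `normalize_uuid` is hypothetical, and the pattern only checks the 8-4-4-4-12 hex layout plus the version digit 4:

```lua
-- Lua patterns have no {n} quantifier, so build the repeats with string.rep.
local hex = "%x"
local UUID_V4_PATTERN = "^" .. hex:rep(8) .. "%-" .. hex:rep(4) .. "%-4"
  .. hex:rep(3) .. "%-" .. hex:rep(4) .. "%-" .. hex:rep(12) .. "$"

-- Hypothetical helper: trim trailing whitespace and verify the UUID v4 layout.
-- Returns the cleaned UUID, or nil if the input does not look like one.
local function normalize_uuid(raw)
  local uuid = raw:gsub("%s+$", "")
  if uuid:match(UUID_V4_PATTERN) then
    return uuid
  end
  return nil
end
```

A caller can then decide what to do when nil comes back, for example leaving the header unset.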

and the nginx configuration

set_by_lua_file $ngx.var.http_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;

we started to roll this one out to our mixed setup of Debian/stretch and Debian/jessie hosts. While we tested this one on Debian/stretch, and it all worked fine, we never gave it a try on Debian/jessie. Within seconds of the rollout all our nginx instances on Debian/jessie started to segfault.

Half an hour later it was clear that the nginx release shipped in Debian/jessie does not yet allow you to write directly into the internal variable $ngx.var.http_correlation_id. To work around this issue we configured nginx to create the header with the add_header configuration option instead:

set_by_lua_file $header_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
add_header correlation_id $header_correlation_id;

This configuration works on Debian/stretch and Debian/jessie.

Another possibility we considered was using the backported version of nginx. But this one depends on a newer openssl release. I didn't want to walk down the road of manually tracking potential openssl bugs against a release not supported by the official security team. So we rejected this option. Next item on the todo list is for sure the migration to Debian/stretch, which is overdue now anyway.

and it just stopped

A few hours later we found that the nginx on Debian/stretch was still running, but no longer responding. Attaching strace revealed that all processes (worker and master) were waiting on a futex() call. The logs showed an assert pointing in the direction of the nchan module. I think the bug we're seeing is #446, and I've added the few bits of additional information I could gather. We just moved on and disabled the module on our systems. It has now been running fine in all cases for a few weeks.

Kudos to Martin for walking down this muddy road together on a Friday.