At work we're running nginx in several instances. Sometimes running on Debian/stretch (Woooh) and sometimes on Debian/jessie (Boooo). To improve our request tracking abilities we set out to add a header with a version 4 UUID if it does not exist yet. We expected this to be a story we could implement in a few hours at most ...
/proc/sys/kernel/random/uuid vs lua uuid module
If you look around for ways to implement this, you will find that there is a Lua module to generate a UUID. Since this module is not packaged in Debian, we started to think about packaging it, but on second thought we wondered whether simply reading from the Linux /proc interface might be faster after all. So we built a very unscientific test case that we deemed good enough:
$ cat uuid_by_kernel.lua
#!/usr/bin/env lua5.1
local i = 0
repeat
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  i = i + 1
until i == 1000
$ cat uuid_by_lua.lua
#!/usr/bin/env lua5.1
package.path = package.path .. ";/home/sven/uuid.lua"
local i = 0
repeat
  local uuid = require("uuid")
  local content = uuid()
  i = i + 1
until i == 1000
The result is in favour of using the Linux /proc interface:
$ time ./uuid_by_kernel.lua
real 0m0.013s
user 0m0.012s
sys 0m0.000s
$ time ./uuid_by_lua.lua
real 0m0.021s
user 0m0.016s
sys 0m0.004s
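One caveat with these numbers: time measures the whole process, so interpreter startup and, for the second script, loading the module are included. A minimal sketch of timing only the loop with os.clock() could look like the following. Note that uuid_from_lua is a hypothetical stand-in written for this example (it is not the uuid module tested above), and the iteration count is an arbitrary assumption.

```lua
-- Time n calls of fn and return the CPU seconds used (os.clock is CPU time).
local function bench(fn, n)
  local start = os.clock()
  for _ = 1, n do
    fn()
  end
  return os.clock() - start
end

-- Read one UUID from the kernel, as in uuid_by_kernel.lua.
local function uuid_from_kernel()
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content
end

-- Hypothetical stand-in for a pure-Lua generator; NOT the uuid module above.
-- math.random is not a strong random source, so this is for timing only.
local function uuid_from_lua()
  local template = "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx"
  return (template:gsub("[xy]", function(c)
    local v = (c == "x") and math.random(0, 15) or math.random(8, 11)
    return string.format("%x", v)
  end))
end

local n = 10000 -- assumed iteration count
print(string.format("lua:    %.3fs", bench(uuid_from_lua, n)))

-- Only run the kernel variant where the /proc file exists (Linux).
local probe = io.open("/proc/sys/kernel/random/uuid", "rb")
if probe then
  probe:close()
  print(string.format("kernel: %.3fs", bench(uuid_from_kernel, n)))
end
```

Running both variants in one process with the same harness keeps the startup cost out of the comparison, although it is still far from a rigorous benchmark.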
nginx in Debian/stretch vs nginx in Debian/jessie
Now that we had settled on the Lua code
if (ngx.var.http_correlation_id == nil or ngx.var.http_correlation_id == "") then
  local f = assert(io.open("/proc/sys/kernel/random/uuid", "rb"))
  local content = f:read("*all")
  f:close()
  return content:sub(1, -2)
else
  return ngx.var.http_correlation_id
end
and the nginx configuration
set_by_lua_file $ngx.var.http_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
we started to roll this out to our mixed setup of Debian/stretch and Debian/jessie hosts. While we had tested it on Debian/stretch, where it all worked fine, we had never tried it on Debian/jessie. Within seconds of the rollout all our nginx instances on Debian/jessie started to segfault.
Half an hour later it was clear that the nginx release shipped in Debian/jessie does not yet allow you to write directly into the internal variable $ngx.var.http_correlation_id. To work around this issue, we configured nginx to set a variable of our own and use the add_header configuration option to create the header:
set_by_lua_file $header_correlation_id /etc/nginx/lua-scripts/lua_uuid.lua;
add_header correlation_id $header_correlation_id;
This configuration works on Debian/stretch and Debian/jessie.
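The decision made in lua_uuid.lua (keep an incoming ID, otherwise generate one) can also be sketched as plain functions that run without nginx, which makes it easier to try out before a rollout. This is only a sketch: the function names are made up for this example and are not part of any nginx or Lua API, and inside nginx the existing argument would be ngx.var.http_correlation_id.

```lua
local UUID_PATH = "/proc/sys/kernel/random/uuid"

-- Read one UUID from the kernel. "*l" reads a line without the trailing
-- newline, which has the same effect as content:sub(1, -2) in the script.
local function read_kernel_uuid(path)
  local f, err = io.open(path or UUID_PATH, "rb")
  if not f then
    return nil, err
  end
  local line = f:read("*l")
  f:close()
  return line
end

-- Check that s looks like a version 4 UUID (version nibble 4, variant 8-b).
local function is_uuid_v4(s)
  local pattern = "^%x%x%x%x%x%x%x%x%-%x%x%x%x%-4%x%x%x%-[89abAB]%x%x%x%-"
    .. "%x%x%x%x%x%x%x%x%x%x%x%x$"
  return type(s) == "string" and s:match(pattern) ~= nil
end

-- Keep a non-empty incoming ID, otherwise ask generate() for a new one.
local function correlation_id(existing, generate)
  if existing == nil or existing == "" then
    return generate()
  end
  return existing
end

-- Usage outside nginx: no incoming header, so a fresh UUID is read.
local id = correlation_id(nil, read_kernel_uuid)
if id then
  print(id, is_uuid_v4(id))
end
```

Passing the generator in as a function keeps the header logic testable on machines without the /proc file.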
Another possibility we considered was using the backported version of nginx. But this one depends on a newer openssl release. I didn't want to walk down the road of manually tracking potential openssl bugs against a release not supported by the official security team. So we rejected this option. Next item on the todo list is for sure the migration to Debian/stretch, which is overdue now anyway.
and it just stopped
A few hours later we found that the nginx on Debian/stretch was still running, but no longer responding. Attaching strace revealed that all processes (worker and master) were waiting on a futex() call. Logs showed an assert pointing in the direction of the nchan module. I think the bug we're seeing is #446; I've added the few bits of additional information I could gather. We just moved on and disabled the module on our systems. It has now been running fine in all cases for a few weeks.
Kudos to Martin for walking down this muddy road together on a Friday.