tl;dr: OpenSSL 3.0.1 leaks memory in ssl3_setup_write_buffer(); this seems to be
fixed in 3.0.2. The issue manifests at least in stunnel
and keepalived on CentOS 9. In addition I learned the hard way that running a
not so recent VirtualBox version on Debian bullseye led to DH parameter generation
crashing in libcrypto in bn_sqr8x_internal().
A recent rabbit hole I went down. The actual bug in OpenSSL was nailed down and documented by Quentin Armitage on GitHub in keepalived. My bug report, with all the back and forth, is #2128412 in the RedHat Bugzilla.
Act I - Hello stunnel, this is the OOMkiller Calling
We started to use stunnel on Google Cloud compute engine instances running CentOS 9.
The loadbalancer in front of those instances used a TCP health check to validate the
backend availability. A day or so later the stunnel instances got killed by the
OOMkiller. Restarting stunnel and looking into /proc/<pid>/smaps showed a heap
segment growing quite quickly.
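To put numbers on that kind of growth, the heap mapping can be sampled from smaps over time. The following is a minimal sketch, not what was used during the original debugging: the function names `heap_rss_kb` and `watch` are made up for illustration, and it assumes the usual Linux smaps layout, where each mapping header line is followed by `Rss:` and similar fields in kB.

```python
import re
import time


def heap_rss_kb(smaps_text: str) -> int:
    """Sum the Rss (in kB) of all [heap] mappings found in smaps text."""
    total = 0
    in_heap = False
    for line in smaps_text.splitlines():
        # A mapping header starts with an address range such as "55d0a000-55d0b000".
        if re.match(r"^[0-9a-f]+-[0-9a-f]+\s", line):
            in_heap = line.rstrip().endswith("[heap]")
        elif in_heap and line.startswith("Rss:"):
            total += int(line.split()[1])
    return total


def watch(pid: int, interval: float = 5.0) -> None:
    """Print the heap Rss of the given pid every `interval` seconds."""
    while True:
        with open(f"/proc/{pid}/smaps") as fh:
            print(time.strftime("%H:%M:%S"), heap_rss_kb(fh.read()), "kB")
        time.sleep(interval)
```

Calling `watch(<pid of stunnel>)` as root, or as the user that owns the process, prints one sample per interval; a steadily rising number under constant load points at a leak.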
Act II - Reproducing the Issue
While I'm not the biggest fan of VirtualBox and Vagrant, I have to admit it's quite
nice to just fire up a VM image, and give other people a chance to recreate that
setup as well. Since VirtualBox is no longer released with Debian/stable I just
recompiled what was available in unstable at the time of the bullseye release, and
used that. That enabled me to just start a CentOS 9 VM, set up stunnel with a
minimal config, grab netcat and a for loop and watch the memory grow.
E.g. while true; do nc -z localhost 2600; sleep 1; done
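The same kind of probe, a TCP connect that is closed again without sending any data, can be sketched in Python for anyone who prefers a script over the shell loop. This is an illustrative equivalent under stated assumptions, not the original test: `tcp_probe` and `hammer` are hypothetical names, and port 2600 is simply taken from the loop above.

```python
import socket
import time


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Open a TCP connection and close it right away without sending data."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def hammer(host: str = "localhost", port: int = 2600,
           count: int = 100, delay: float = 1.0) -> int:
    """Probe `count` times and return the number of successful connects."""
    ok = 0
    for _ in range(count):
        ok += tcp_probe(host, port)
        time.sleep(delay)
    return ok
```

Running `hammer()` against the stunnel listener while watching the process memory should show the same growth as the netcat loop, since both only complete the TCP handshake and never start TLS.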
To my surprise, in addition to the memory leak, I also observed some crashes but
did not yet care too much about those.
Act III - Wrong Suspect, a Workaround and Bugreporting
Of course the first idea was that something must be wrong in stunnel itself. But
I could not find any recent bug reports. My assumption is that there are
still a few people around using CentOS and stunnel, so someone else should probably
have seen it before. Just to be sure I recompiled the latest stunnel package from
Fedora. Didn't change anything. Next I recompiled it with almost all of the patches
Fedora/RedHat carries removed. Nope, no progress.
Next idea: Maybe this is related to the fact that we do not initiate a TLS context
after connecting? So we changed the test case from nc to openssl s_client, and
the loadbalancer healthcheck from TCP to a TLS based one. Tada, a workaround, no
more memory leaking.
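The difference between the two kinds of check is that the TLS based one completes a handshake before disconnecting. A rough Python sketch of such a probe follows; it is not what the loadbalancer or openssl s_client does internally, `tls_probe` is a hypothetical name, and certificate verification is switched off on the assumption that the probe only needs to know whether a handshake completes.

```python
import socket
import ssl


def tls_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Connect, complete a TLS handshake, then close. True on success."""
    ctx = ssl.create_default_context()
    # Assumption: a health probe only checks that the handshake works,
    # so verification is disabled. Do not do this in a real client.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True
    except OSError:  # ssl.SSLError and timeouts are subclasses of OSError
        return False
```

Against a listener that never answers the ClientHello the probe returns False after the timeout, which is the behaviour one would want from a health check.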
In addition I gave Fedora (they have Vagrant VirtualBox images in the "Cloud"
Spin, e.g. for Fedora 36) and my local Debian installation a try. No leaks
on either of them.
Next I reported #2128412.
Act IV - Crash in libcrypto and a VirtualBox Bug
When I moved the test case from the Google Cloud compute instance to my
local VM I encountered some crashes. That morphed into a real problem when I
started to run stunnel with gdb and valgrind. All crashes happened in libcrypto
in bn_sqr8x_internal() when generating new DH parameters (stunnel does that for
you if you do not use static DH parameters). I quickly worked around that by
generating static DH parameters for stunnel.
After some back and forth I suspected VirtualBox as the culprit. Recompiling
the current VirtualBox version (6.1.38-dfsg-3) from unstable on bullseye works
without any changes. Upgrading actually fixed that issue.
Epilog
I highly appreciate that RedHat, with all the bashing around the future of CentOS, still works on community contributed bug reports. My kudos go to Clemens Lang. Now that the root cause is clear, I guess RedHat will push out a fix for the OpenSSL 3.0.1 based release they have in RHEL/CentOS 9. Until that is available, at least stunnel and keepalived are known to be affected. If you run stunnel on something public it's not that pretty, because even a low rate of TCP connections will result in a DoS condition.