tl;dr; OpenSSL 3.0.1 leaks memory in ssl3_setup_write_buffer(), seems to be fixed in ~~3.0.5~~ 3.0.2. The issue manifests at least in stunnel and keepalived on CentOS 9. In addition I learned the hard way that running a not so recent VirtualBox version on Debian bullseye let to dh parameter generation crashing in libcrypto in bn_sqr8x_internal().

A recent rabbit hole I went down. The actual bug in openssl was nailed down and documented by Quentin Armitage on GitHub in keepalived My bugreport with all back and forth in the RedHat Bugzilla is #2128412.

Act I - Hello stunnel, this is the OOMkiller Calling

We started to use stunnel on Google Cloud compute engine instances running CentOS 9. The loadbalancer in front of those instances used a TCP health check to validate the backend availability. A day or so later the stunnel instances got killed by the OOMkiller. Restarting stunnel and looking into /proc/<pid>/smaps showed a heap segment growing quite quickly.

Act II - Reproducing the Issue

While I'm not the biggest fan of VirtualBox and Vagrant I've to admit it's quite nice to just fire up a VM image, and give other people a chance to recreate that setup as well. Since VirtualBox is no longer released with Debian/stable I just recompiled what was available in unstable at the time of the bullseye release, and used that. That enabled me now to just start a CentOS 9 VM, setup stunnel with a minimal config, grab netcat and a for loop and watch the memory grow. E.g. while true; do nc -z localhost 2600; sleep 1; done To my surprise, in addition to the memory leak, I also observed some crashes but did not yet care too much about those.

Act III - Wrong Suspect, a Workaround and Bugreporting

Of course the first idea was that something must be wrong in stunnel itself. But I could not find any recent bugreports. My assumption is that there are still a few people around using CentOS and stunnel, so someone else should probably have seen it before. Just to be sure I recompiled the latest stunnel package from Fedora. Didn't change anything. Next I recompiled it without almost all the patches Fedora/RedHat carries. Nope, no progress. Next idea: Maybe this is related to the fact that we do not initiate a TLS context after connecting? So we changed the test case from nc to openssl s_client, and the loadbalancer healthcheck from TCP to a TLS based one. Tada, a workaround, no more memory leaking. In addition I gave Fedora a try (they have Vagrant Virtualbox images in the "Cloud" Spin, e.g. here for Fedora 36) and my local Debian installation a try. No leaks experienced on both. Next I reported #2128412.

Act IV - Crash in libcrypto and a VirtualBox Bug

When I moved with the test case from the Google Cloud compute instance to my local VM I encountered some crashes. That morphed into a real problem when I started to run stunnel with gdb and valgrind. All crashes happened in libcrypto bn_sqr8x_internal() when generating new dh parameter (stunnel does that for you if you do not use static dh parameter). I quickly worked around that by generating static dh parameter for stunnel. After some back and forth I suspected VirtualBox as the culprit. Recompiling the current VirtualBox version (6.1.38-dfsg-3) from unstable on bullseye works without any changes. Upgrading actually fixed that issue.

Epilog

I highly appreciate that RedHat, with all the bashing around the future of CentOS, still works on community contributed bugreports. My kudos go to Clemens Lang. Now that the root cause is clear, I guess RedHat will push out a fix for the openssl 3.0.1 based release they have in RHEL/CentOS 9. Until that is available at least stunnel and keepalived are known to be affected. If you run stunnel on something public it's not that pretty, because already a low rate of TCP connections will result in a DoS condition.