Failing with F5: Implementing HTTP session persistence

In the old days people deployed the Apache HTTPD together with mod_jk and mod_balancer in front of Apache Tomcat instances to achieve scalability and resilience. That sooner or later turns out to not provide features you nowadays would like to have to run a service 24/7. So it happened that I had to jump into the sometimes cold sea of managing services fronted by F5 BigIP devices, a somewhat common load balancing solution.

Soon after you start to introduce loadbalancing, because you grow beyond the "one fat webserver" state, you'll have the requirement to implement some way of session persistence. In the old days you might have added a unique jvmRoute value to your mod_balancer pool and to your tomcat configuration to match the connection to the right pool member. If you're clever you've set it on the Tomcat as a variable that automatically builds up the jvmRoute based on the hostname, so you can add new instances without having to touch the Tomcat configuration. In case you were not that clever (like we were when we started) you might have added placeholders in the server.xml to set the jvmRoute value in your deployment script.

I think in a somewhat better world your first step usually is to introduce a shared session store, be it the php session handler backed by a memcached or Hazelcast for the Java afficionados, to avoid the persistence requirement on your balancer. But if you grow even bigger the requirement will soon reappear because you might have to add some form of sharding and now have to sent users to the same cluster serving a specific shard.

In the end all of that is not cool and your developers have to be aware of those issues. For all cases where fault tolerance within the session is not a priority we aimed for something that we can enable or disable on the balancer without application code changes.

The "let the box do its job" solution

The nicest solution from our point of view is letting the F5 inject a cookie that points the device on subsequent requests to the right pool node. If you like you can define the cookie name, a timeout, force the injection on replies and some other tunables. Take a look at the help for ltm persistence cookie if you'd like to have a look at the details.

The "let's do it by hand but on the box" solution

Most Java frameworks provide a session cookie called JSESSIONID. Why not use that one and add a persistence table entry on the BigIP with the cookie as the lookup key? As always if there's no solution provided out of the box you can implement an iRule for it. Kind of an F5 mantra.

There are many examples out there, here is what we ended up with:

when HTTP_REQUEST {
    if { [HTTP::cookie "JSESSIONID"] ne "" }{
        persist uie [string tolower [HTTP::cookie "JSESSIONID"]] 1800 
    }
}

when HTTP_RESPONSE {
    if { [HTTP::cookie "JSESSIONID"] ne "" }{
        persist add uie [string tolower [HTTP::cookie "JSESSIONID"]] 1800
    }
}

And here is where I failed

Now imagine you're running an application that so far handles only machine to machine traffic without session persistence. A few month later you're approached to enable session persistence, because someone introduced a management console for the application as a web user interface. Without further ado you jump on board and provide - showcasing the flexibility of the new balancer - session persistence based on cookies injected by the BigIP. You've done it before, you enable it for the test environment, everything looks fine, you enable it for the production setup. Everything is fine (at least for now ...), the WUI workes fine, sessions are persistent to the node the user hit first.

A month or two later someone from the development team approaches you again, this time asking if the balancing setup has some issues because over 90% of the live traffic hit one of the two machines in the pool until it wents down due to a crash or a deployment. Then about 90% hit the other node, even after the crashed one is resurrected. But every time the developer tests the balancing with a "while true" loop sending several hundred requests via curl the balancing is perfectly equal.

After beeing puzzled by the behaviour myself we started to look at the live traffic with tcpdump and suddenly it was all clear when I saw the BigIP cookie beeing passed around. Actually everything behaved as we configured it. The application used a proper HTTP library and that one implemented cookies: If you receive one you're polite and pass it back to the sender.

The original sender of course also implements HTTP with cookie support and now, due to the natur of the traffic, we have long living connections between two applications passing around the same cookie. On every single HTTP request sent through this connection. And every time the balancer will look at the cookie and happily base the routing decision on the pool member noted in this cookie.

When we tested with curl we did not have the usage of cookies on our mind. Because our false assumption was that the applications do not create cookies for machine to machine traffic. Nobody, including myself, actively remembered the decision to base the session persistence on cookies injected by the balancer.

We went on and switched to the iRule mentioned above. Luckily we already had that one implemented and tested for a different use case with the requirement to not inject additional cookies.

Can we maybe improve the change process for the balancer to handle a mixed traffic scenario?
Do we need monitoring to monitor the balancing of requests?
Is it a wise idea to mix machine2machine and user2machine traffic on the same port?
Should we maybe, if we seperate properly, place management applications on a different system?
Is our documentation sufficient?

Combine 3. and 4. and ask yourself how resilient you are if someone tries to exhaust the memory of the balancer with a lot of connection attempts that have random JSESSIONID cookies? Or is it even possible to exhaust ressource with faked BigIP cookies you inject? Maybe it's even a vector for something like http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf?

Quite some questions left to think about and draft answers for.