The night before my official start date at my newest job, I get a call…
the NFS server crashed!
I’m staying at the Ritz Carlton Sarasota with my girlfriend. About to go out for a romantic/ celebratory dinner. But I said I’ll take a look — SSH isn’t too bad over Verizon EVDO. The sysadmin adds users for me… so before I even had a formal employment agreement, I had root access to critical infrastructure.
The issue wasn’t a big deal — stale NFS file handles. And all told, we were back up within minutes. But I realized I had some architecture work ahead of me.
Note to self: Don’t serve your web root off NFS!@#$ OK? Yeah. I never got this one… Why load balance across 20 web servers which all rely on the same NFS server (OK — it was NFS cluster.) But still — the shared storage choked — and all the web servers were down effectively.
Challenge: Need easy way to distribute code, templates, and images to 20 web servers.
Solution: cron + rsync – was our first take at it. But now have a rsync push mechanism, triggered via web.
Even being down for just a few minutes — we lost a lot of money. Downtime is a serious issue. Because the PHP code was on NFS – we returned 4xx/5xx error messages to our clients. And without any business logic, we couldn’t push any transactions through to our data store.
Looking at the Apache logs – we realized we had almost enough information to replay all of the web transactions. Not quite enough with the standard combined log format. But we settled on an Apache CustomLog Format:
LogFormat “%v %h %l %u %t \”%r\” %>s %b \”%{Referer}i\” \”%{User-Agent}i\” \”Co%{Set-Cookie}o\” \”Ci%{Cookie}i\” \”To%{Timestamp}o\”
%X %s/%>s %I/%O/%B %{pid}P/%{tid}P %T/%D %{outstream}n/%{instream}n/%{ratio}n” combined-cookie
Looks kinda crazy eh? I love data!
As a minimalist, to replay the web transactions… we need:
- the action / URL (%r)
- some unique identifier
- and the state (if you’re not stateless…)
With that — we can replay the log files — and ensure that the transactions have been completed successfully. Our unique identifier is COOKIE + TIMESTAMP. So our CustomLog Format has:
- the incoming cookie %{Cookie}i
- the outgoing cookie (in case the incoming was non-existent) %{Set-Cookie}o
- a timestamp %{Timestamp}o
The application adds a Server header to every response, so the apache log can grab it…
$transaction_time = time();
header( “Timestamp: $transaction_time” );
Wait — so why can’t you use the Apache log timestamp? Umm… even though I grab the current timestamp (in PHP) immediately upon page load…. what if the page is delayed due to compilation? Or if the server is super busy? The Apache log timestamp will many times not perfectly match up with your first call to time(); …
Yeah, yeah – so outputing another server header is wasting an extra 22bytes of bandwidth with every request… for you bit savers out there, you could also set an environment variable and Apache will kindly output to the log with this directive %…{VARIABLE}e
But I’m happy! And my Apache logs are now audit logs!