Published 4 December 2013
Email is a service that has become so ubiquitous in daily life that it is rarely given a second thought. That is, until it stops working. Unfortunately, on November 12th this year, that's exactly what happened for a number of Web Drive email users.
As is common with technology, the issues experienced cannot be attributed to a single failure. Instead, a number of more minor issues, each of which we have contingencies in place to mitigate, aligned to cause a headache for our customers and engineers alike.
At Web Drive, we make use of an ingenious filesystem called ZFS - released initially by Sun (now part of Oracle) as part of the Solaris operating system. One area where we use it is the storage of customer email.
ZFS has a number of useful features baked right in, such as compression and solid-state drive (SSD) caching, that allow us to store more mail, typically with better performance than other Unix-like filesystems.
For redundancy and availability, we split our customer mail storage across several backend "storage servers", with each of those servers being mirrored to ensure we have a complete set of data on standby. Further, each server is configured with multiple data storage disks in a RAID configuration that allows us to seamlessly sustain the loss of one or more drives without any impact to our customers.
Around 10:00am on November 12th, the Systems Administration team were notified of an issue with the backend mail server that caters for approximately 50% of our customers. This was quickly escalated to the Systems Engineering team for investigation.
The affected server was found to be unresponsive. This in itself was nothing too worrying, as technology is never 100% reliable and these kinds of issues are often solved with a little digging and the occasional power-cycle. On gaining access to the server console, however, it quickly became apparent this was not a minor issue.
Each time the server attempted to boot, a "Kernel Panic" (Unix-like speak for a crash) was generated as the system attempted to import the ZFS filesystems containing customer data. The first apparent cause for this was the failure of one of the Solid-State cache drives - not a problem as these drives hold only a copy of the data that is stored safely on the remaining spinning disks in the server.
On eliminating the failed SSD as the cause of the failed boot attempts, a more sinister issue became apparent.
At this point, we return to the world of ZFS - more specifically, to a ZFS error that our Senior Engineers have encountered only once before and that is often thought to be a myth among the ZFS faithful. The error reads: "zfs freeing free segment".
There is very little information available about this error, and it would seem that most ZFS users have been lucky enough to avoid an encounter. Sadly, much of what has been written about this particular ZFS issue is available only to those with access to Oracle technical documentation, and even that offers little by way of troubleshooting or recovery guidance.
Having tried (unsuccessfully) for a couple of hours to recover the lost ZFS dataset using all published (and some unpublished) methods, we made the decision to fail services over to the standby storage server. This in itself was pretty painless and allowed us to restore mail services to customers quickly. However, that's not where the story ends.
Normally, when we perform a failover between storage servers, a complete copy of data remains on the standby system, meaning we only have to synchronise a small number of files and changes between the systems.
We perform these synchronisations regularly throughout the day and night to ensure we always have an up-to-date standby copy of customer email (something we couldn't bear to lose). In this case, however, the copy was gone.
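To illustrate why a routine synchronisation is cheap while a full rebuild is not, here is a minimal Python sketch that compares two file listings and works out what needs copying. The `FileState` type, the `plan_sync` function and the sample paths are hypothetical names invented for this example; this is not the tooling Web Drive actually uses, only one simple way such a comparison could be expressed.

```python
from typing import Dict, List, NamedTuple, Tuple


class FileState(NamedTuple):
    """Hypothetical per-file record used to detect changes."""
    size: int      # size in bytes
    mtime: float   # last-modified time, seconds since the epoch


def plan_sync(active: Dict[str, FileState],
              standby: Dict[str, FileState]) -> Tuple[List[str], List[str]]:
    """Return (paths to copy to the standby, paths to delete from it)."""
    to_copy = sorted(
        path for path, state in active.items()
        if standby.get(path) != state  # missing or changed on the standby
    )
    to_delete = sorted(path for path in standby if path not in active)
    return to_copy, to_delete


# Illustrative data only: three messages on the active server.
active = {
    "alice/cur/1": FileState(812, 1384200000.0),
    "alice/cur/2": FileState(640, 1384203600.0),
    "bob/new/1": FileState(950, 1384207200.0),
}

# A standby that is only slightly behind needs a single file.
recent_standby = {
    "alice/cur/1": FileState(812, 1384200000.0),
    "alice/cur/2": FileState(640, 1384203600.0),
}
print(plan_sync(active, recent_standby))  # (['bob/new/1'], [])

# An empty standby, as in this incident, needs every file copied again.
print(plan_sync(active, {}))
```

With an up-to-date standby the copy list stays short; with an empty one it is the entire mail store.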
Faced with the possibility of having only one copy of our customers' data, we immediately kicked off a synchronisation process to ensure a replica was available ASAP. It's at this point that something inherent to email messages made itself known.
The bulk of email messages we send and receive in the modern age are short, concise, and, most importantly, very, very small. In this instance, we had several terabytes of mail data made up of files that were less than 1 kilobyte in size - this quickly amounts to millions of individual files.
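As a rough illustration of how quickly small files add up, the sketch below divides a store size by an average file size. The `estimate_file_count` function and the figures passed to it are assumptions chosen for the example, not measurements from our mail servers.

```python
def estimate_file_count(total_bytes: int, average_file_bytes: int) -> int:
    """Approximate number of files in a store of a given size."""
    if average_file_bytes <= 0:
        raise ValueError("average_file_bytes must be positive")
    return total_bytes // average_file_bytes


KIB = 1024
GIB = 1024 ** 3

# Assumed figures: just 10 GiB of mail stored as 1 KiB messages.
print(f"{estimate_file_count(10 * GIB, 1 * KIB):,} files")  # 10,485,760 files
```

Even this modest, assumed 10 GiB store already holds over ten million files, and the count scales linearly with the amount of mail.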
The process of reading this data from spinning disks and pushing it across the network to the standby server generates a substantial amount of work for the active server. Alongside this process, the server in question was also working hard to receive and distribute hundreds of thousands of messages per hour to customers over IMAP and POP3.
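A back-of-the-envelope model helps show why tiny files make this so slow: each file carries a fixed cost to locate and open on a spinning disk, regardless of how little data it contains. The Python sketch below separates that per-file cost from the raw transfer time. The function name and every figure in it (10 ms per file, 100 MB/s of usable bandwidth, ten million 1,000-byte files) are assumptions for illustration only, not measurements from the incident.

```python
from typing import Tuple


def estimate_sync_seconds(file_count: int,
                          average_file_bytes: int,
                          per_file_overhead_s: float,
                          bandwidth_bytes_per_s: float) -> Tuple[float, float]:
    """Return (seconds spent on per-file overhead, seconds spent moving bytes)."""
    overhead_s = file_count * per_file_overhead_s
    transfer_s = (file_count * average_file_bytes) / bandwidth_bytes_per_s
    return overhead_s, transfer_s


# Assumed workload: ten million messages of 1,000 bytes each.
overhead_s, transfer_s = estimate_sync_seconds(
    file_count=10_000_000,
    average_file_bytes=1_000,
    per_file_overhead_s=0.010,          # assumed cost to find and open one file
    bandwidth_bytes_per_s=100_000_000,  # assumed usable network throughput
)

total_hours = (overhead_s + transfer_s) / 3600
print(f"per-file overhead: {overhead_s / 3600:.1f} h")
print(f"raw data transfer: {transfer_s / 60:.1f} min")
print(f"total:             {total_hours:.1f} h")
```

Under these assumptions the bytes themselves move in a couple of minutes, while the per-file overhead runs to more than a day, and that is before the server has to share its disks with live IMAP and POP3 traffic.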
The lengthy synchronisation of this huge volume of minute files manifested itself to our users as a seemingly endless series of password prompts and timeout messages. Put simply, these can be attributed to the enormous load under which the active mail server was placed.
Our highest priority was ensuring that no customer data was lost and, importantly, that we had a full copy of all customer mail should the worst happen. It's never an easy decision - cause weeks of frustration for users, or run the serious risk of losing years of customer data. In this instance, we're sure we made the right call and that customers will agree with us.
The initial failures and lengthy synchronisation process highlighted a need for us to re-evaluate how we store email messages and how we handle the failure of systems charged with this task. As such, we've revisited the storage architecture from top to bottom and will be making some changes in the very near future.
The first will be to further separate our customers' mail data across a greater number of storage systems. This will help ensure that, should the worst happen again, fewer customers will be impacted, and for a much shorter period of time.
The second important change will be the replacement of the hardware systems we use. We're in the process of evaluating new-generation enterprise servers from HP - the same hardware our many dedicated server customers trust and that we use throughout our enterprise cloud offering.
We'll be keeping customers updated on the progress of these changes throughout. Keep an eye on the Web Drive RSS feed for up-to-the-minute information, and check back on the Web Drive Blog for more in-depth detail as it becomes available.
- Bradley Scarisbrick, Senior Engineer