
Anatomy of a filesystem crash



In the hopes that someone else will find this useful, or perhaps even report that they've been there before, I thought I'd share an experience we had this week with our pilot Zimbra system.

On Tuesday morning, there was a scheduled reboot of our data centre network switch. This switch provides interconnectivity between many of the hosts in the data centre and is used for NFS and iSCSI traffic as well as inter-application communication. The reboot took about 3-4 minutes to complete. Unfortunately, that was long enough for the Linux iSCSI daemon on our two Zimbra mailbox servers to time out all pending I/O and return errors to the OS. When that happens, the ext3 filesystem driver assumes a disk failure and remounts the filesystem read-only. This has happened a couple of times in the past with a particularly troublesome iSCSI server, and the fix has been to reboot the Linux client. And I must admit that Zimbra has dealt with this filesystem adversity well, issuing temporary LMTP failures for incoming mail and "Network Failure" errors back to web users.
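
(As an aside, for anyone wondering where that read-only behaviour is configured: ext3's response to detected errors is a per-filesystem policy. Here's roughly how to check and set it - the device name and mount point are placeholders for whatever your Zimbra volume is, and as far as I can tell a journal abort forces read-only regardless of this setting.)

    # Superblock setting: what ext3 does when it detects an error
    # (continue, remount-ro, or panic)
    tune2fs -l /dev/sdX | grep -i 'errors behavior'

    # The same policy can also be set per mount in /etc/fstab:
    /dev/sdX   /opt/zimbra   ext3   defaults,errors=remount-ro   1 2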

This time, however, all did not go so smoothly. When we rebooted the Linux clients, one came back up and reported that the ext3 filesystem was corrupt and would need to be fsck'd. So much for journalling! Outside of our research clusters, this is our first 'production' Linux service, so we were a little caught off-guard when the journalling failed us so easily. The automatic fsck wasn't able to correct the problem, so a manual fsck was forced, which took 2 hours. This was for a filesystem with about 30 GB in use and 500,000 files (we later figured out that a large part of this delay was due to an overworked NetApp iSCSI filer, so we don't know how long it would take under optimal conditions, or how long it will take once the filesystem grows to multiple TB and millions of files).
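
(For anyone who hasn't had the pleasure yet, forcing a full manual check looks roughly like this - device name and mount point are placeholders again, and the filesystem has to be unmounted first:)

    # Unmount the volume (or boot from rescue media), then force a
    # full pass and answer 'yes' to the repair prompts
    umount /opt/zimbra
    e2fsck -f -y /dev/sdX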

Our config has two mailbox servers at the moment, and only one of them suffered this error when rebooted on Tuesday. However, yesterday morning, the other server suddenly threw its ext3 filesystem into read-only mode, and later examination of the logs showed that ext3 had found corruption in the journal and disabled it (which is what triggers the read-only mode). Once again, a 2-hour manual fsck ensued.

In retrospect, the presence of this 'silent corruption' should have been more obvious to us - after the crash on Tuesday morning, a couple of users on the server that came up OK started getting very 'weird' mail delivery problems. The first attempt to deliver a message would always fail with a temporary LMTP error, but as soon as more than one message was queued up for the user, the first would fail but the rest would come through. The LMTP failure was caused by a Java exception, but with no stack trace, so it was impossible to determine what was throwing it (I'll be opening a Bugzilla case on that one). They got similar errors (via the web interface) when trying to send messages and when viewing certain calendar entries - but again, only sometimes.

Since one of the objectives of this pilot is to learn about optimal hardware configuration, our users are (somewhat) tolerant of these downtimes while we figure out the best way to run things. A few things are clear from this outage:

- We must have multipathing if we're going to do iSCSI. We have since enabled multipath on one of the mailbox servers and will do the other one shortly. Even if multiple paths aren't available, the multipath daemon can "trap" timeouts from the iSCSI layer and kick them back down as a retry, rather than passing them back up to the filesystem layer. Your system will essentially hang until the iSCSI target comes back, but that's better than corruption (there's a rough multipath.conf sketch at the end of this list).

- We learned that ext3, by default, enforces a full fsck every 20th mount or every 180 days, whichever comes sooner. This is "best practice", presumably to avoid the kind of 'silent corruption' we had on our second mailbox server. You can lengthen this interval or override it completely, but you do so at your own peril (there's a tune2fs example at the end of this list). As such, we've lost a significant amount of faith in ext3 (remember, we're new to enterprise Linux, so we have very little previous experience to fall back on), and we're going to take a serious look at Veritas Storage Foundation for Linux. We're already running VxFS on our database servers for our ERP (on Solaris) and have never had a corrupt filesystem.

- I'm more determined than ever to build up a Solaris-based development environment where we can make use of tools such as DTrace to get a better understanding of what's going on inside Zimbra. The fact that there are parts of the code involved in mail delivery that can fail with no explanation (i.e. no stack trace) means it's important that we have a way of figuring out what's going on internally in such circumstances.

- I will be opening Bugzilla cases with Zimbra to encourage them to better instrument every function involved in mail delivery. LMTP errors need to be logged in as much detail as possible, since the end result of such errors is usually undelivered mail.
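
To follow up on the multipath point above, here's roughly the kind of stanza we're putting in /etc/multipath.conf to get the "queue instead of fail" behaviour. Treat it as a sketch rather than a recipe - option names and defaults vary between multipath-tools versions, and the vendor/product strings are just what NetApp LUNs typically report, so check them against your own hardware:

    defaults {
        # Queue I/O indefinitely when no usable path is left, instead
        # of returning errors up to the filesystem layer
        no_path_retry    queue
    }

    devices {
        device {
            vendor           "NETAPP"
            product          "LUN"
            # Probe paths with SCSI Test Unit Ready so we notice when
            # the target comes back after a switch or filer reboot
            path_checker     tur
            no_path_retry    queue
        }
    }

The trade-off is exactly as described above: processes doing I/O will just sit in uninterruptible sleep until a path comes back, so you still need monitoring to tell you the target is gone.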
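
And here's the tune2fs example promised in the ext3 item - the device name is a placeholder, and to repeat the caveat, relaxing these checks just trades scheduled fsck downtime for a longer window in which corruption can sit undetected:

    # Show the current maximum mount count and check interval
    tune2fs -l /dev/sdX | egrep -i 'mount count|check'

    # Force a full check every 20 mounts or every 180 days,
    # whichever comes first
    tune2fs -c 20 -i 180d /dev/sdX

    # Or turn the periodic checks off entirely - at your own peril
    tune2fs -c 0 -i 0 /dev/sdX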

Thanks for listening!

-- 
Steve Hillman                                IT Architect
hillman@sfu.ca                               IT Infrastructure
778-782-3960                                 Simon Fraser University
Sent from Zimbra