The First Law of Systems Administration
This post details an outage I caused this week by making several poor decisions.
Each point contains lessons I have learned over the past 10 years, and in this instance studiously ignored. Things I am typically very careful to avoid doing. My record for not breaking things is actually pretty decent, but when I do break things it tends to occur under the same set of circumstances (I'm tired and in a hurry).
Even with a decade of experience and a process that mitigates failures, I managed to do something really, really dumb.
A couple months ago I attended Surge in Baltimore, a conference whose focus is on scalability and dealing with failures. The best talks came down to "this is how we broke stuff, and this is how we recovered."
Hopefully illuminating this particular failure will not just help someone else recover from something similar, but remind my fellow sysadmins that sometimes you just need to take a nap.
🔗The First Law
Backups. Never do anything unless you have backups.
🔗Stupidity the First
A few weeks ago I added an OCZ Vertex 2 SSD to a ZFS pool as a write cache. These are low-end devices, without a great MTBF, but my research suggested they would fit our needs.
The pool in question is configured as an array of mirrors. The system was running Solaris 10 U7, which does not support import recovery (-F), import with a missing log device (-m), or removal of log devices.
I had tested the SSD for about a week, burning it in.
The SSD was added without a mirror.
I was quite pleased with myself: The performance increase was obvious and immediate. Good job, me, for making stuff better.
A week after being added to the pool, the SSD died. The exact error from the Solaris SCSI driver was "Device is gone."
The zpool hung, necessitating a hard system reset. The system came back up, with the SSD being seen as UNAVAIL. We lost whatever writes were in-flight through the ZIL, but given the workload, that was going to be either minor or recoverable.
I made myself a bug to RMA the SSD and order a pair of new ones, and stopped thinking about it, annoyed that a brand new device died after less than a month.
The stupid: Adding a single point of failure to a redundant system.
Bonus stupid: Not more than a month ago I argued on a mailing list that you should always have a mirrored ZIL, regardless of whether or not your system supported import -F or -m. Yup. I ignored my own advice, because I wanted an immediate performance increase.
Extra bonus stupid: Not fixing a problem relating to storage immediately. Sysadmins wrangle data. It's what we do and when we do it well, it's why people love us. Leaving a storage system in a hosed, if working, state, is just asking for pain later. Begging for it.
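A minimal sketch of a guard against the unmirrored log device, assuming a hypothetical wrapper named add_mirrored_log and made-up device names. The only real syntax here is `zpool add <pool> log mirror <dev> <dev>`. The wrapper just prints the command so someone else can look at it before it runs.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) the command that adds a mirrored
# log device. Refuses to build a command for a single, unmirrored device.
add_mirrored_log() {
    pool=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "refusing: a log device needs at least two disks to mirror" >&2
        return 1
    fi
    # Standard syntax: zpool add <pool> log mirror <dev> <dev> ...
    echo "zpool add $pool log mirror $*"
}

# Example (device names are made up):
#   add_mirrored_log tank c0t6d0 c0t7d0
```

Printing instead of executing is deliberate: it leaves room for a second pair of eyes.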
🔗The Second Law
You are not a computer.
Sometimes you are just too tired to work.
Never do anything when your judgement is impaired. In particular, never make major decisions without confirmation when you are overtired (and have, perhaps, just gotten a flu shot). It leads to calamities.
As sysadmins we often have to work on little sleep in non-optimal situations or environments. We sometimes take it as a point of pride that we can do incredibly complex things when we're barely functional.
At some point you are going to screw yourself, though.
One thing I know about myself: I get really stupid when I'm too tired. If I get woken up at 0300 by a page, I can muscle-memory and squint my way to a fix. If I've been up for 14-16 hours and I've been getting say, maybe, four hours of sleep a night for the past two months?
I'm going to do something dumb.
🔗Stupidity the Second
I have been upgrading systems to U9 over the last few weeks. The system with the UNAVAIL SSD came up on the rotation. With U9 I'd be able to remove the dead log device. We announced a 30m outage.
And here is where impaired judgement comes in. If the following two thoughts are in your head:
- I am exhausted
- I just want to get this done
Stop whatever it is you're doing. Go take a nap. Wait until a co-worker is around so they can tell you "holy crap, why are you eating live scorpions covered in glass? Stop that stupid thing you are doing!"
My wife is well aware that I do stupid things when I'm tired and tells me "do that later. Go to bed." Listen to my wife.
I decided to go ahead and upgrade the system with the DEGRADED pool. I have rolling backups for everything on the system except the dataset containing our spam indexes (which are required so customers can view spam we have discarded for them, and release false positives).
Rather than wait to sync that dataset off-system (3-4 hours, and why hadn't I just started a rolling sync earlier that day? Or had one for the last two years?) I decided to go ahead and upgrade the system.
The stupid: Why would you ever put unique data at risk like this?
Bonus stupid: Why is the data unique? There is no reason for it to be so. Replicating ZFS is trivial. Oversights happen, but this is still dumb.
(My systems all live in pairs. With very few exceptions there are no snowflake services. I take snapshots of my MySQL master. I replicate them, so I can clone and boot them to restore data quickly. I have MySQL replication set up so I can do hot failovers. I have zones replicated via ZFS, I have backups of /etc and /usr/pkg/etc even though the configs are all in git. I replicate all other big datasets to cross-site failover systems with standby zones. I do backups. So why, in my big table of datasets, does this one thing have a big TODO in the replicate column?)
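To show how little is involved, here is a rough sketch of a single replication pass. The function name, dataset names, snapshot labels, and host are all invented for illustration; only `zfs snapshot`, `zfs send -i`, and `zfs receive -F` are real commands. It prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical rolling-replication step. Prints the commands for review.
# Args: dataset, previous snapshot label (empty on first run),
#       new snapshot label, remote host, target dataset
replicate_cmds() {
    ds=$1
    prev=$2
    new=$3
    host=$4
    target=$5
    echo "zfs snapshot ${ds}@${new}"
    if [ -z "$prev" ]; then
        # First run: send the full stream.
        echo "zfs send ${ds}@${new} | ssh ${host} zfs receive -F ${target}"
    else
        # Later runs: send only the changes between the two snapshots.
        echo "zfs send -i ${ds}@${prev} ${ds}@${new} | ssh ${host} zfs receive -F ${target}"
    fi
}

# Example (all names are made up):
#   replicate_cmds tank/spamindex "" 2010-11-04 backuphost backup/spamindex
```

A real script would also need to remember the last snapshot it sent and prune old ones; this sketch leaves both out.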
Postpone the maintenance window. It's ok. Sometimes scheduling conflicts come up. Sometimes you aren't as prepared as you thought you were. Your customers don't care that they weren't supposed to be able to access something for 30 minutes tonight, but instead can't tomorrow night.
Really. Get some sleep. Wake up tomorrow and feel lucky you didn't totally break something and potentially lose unrecoverable data.
🔗The Third Law
Don't make a problem worse. Especially if you caused it.
Never do anything to disks which contain data you need, even if that data is currently inaccessible. Move the workload somewhere else. Hope you think of something.
You are already eating live scorpions covered in glass, don't go setting them on fire too.
🔗Stupidity the Third
I exported the pool and restarted the system. It Jumpstarted happily. I logged in and...
    # zpool import
      pool: tank
        id: 17954631541182524316
     state: UNAVAIL
    status: One or more devices are missing from the system.
    action: The pool cannot be imported. Attach the missing
            devices and try again.
       see: http://www.sun.com/msg/ZFS-8000-6X
    config:

            tank        UNAVAIL  missing device
              mirror-0  ONLINE
                c0t2d0  ONLINE
                c0t3d0  ONLINE
              mirror-1  ONLINE
                c0t4d0  ONLINE
                c0t5d0  ONLINE

            Additional devices are known to be part of this pool, though their
            exact configuration cannot be determined.

    # zpool import -F tank
    cannot import 'tank': one or more devices is currently unavailable
            Destroy and re-create the pool from
            a backup source.
At this point there was a fair amount of cursing.
The thing is, I knew the pool was fragile. I knew that reinstalling the system was going to blow away /etc/zfs/zpool.cache, which is likely the only reason U7 was happy to import the pool after the SSD died initially and it got rebooted.
But my judgement was impaired: I was making really stupid decisions.
The stupid: Doing something irrevocably destructive to a fragile, unique system.
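A minimal sketch of the step that would have made this less fragile: copying the pool cache somewhere the installer won't touch before reinstalling. The function name, label, and destination path are hypothetical. A saved copy can later be handed to `zpool import -c`.

```shell
#!/bin/sh
# Hypothetical pre-reinstall step: keep a copy of the pool cache file.
# Args: cache file (normally /etc/zfs/zpool.cache),
#       destination directory, label for the copy
save_zpool_cache() {
    cache=$1
    dest=$2
    label=$3
    if [ ! -f "$cache" ]; then
        echo "no cache file at $cache" >&2
        return 1
    fi
    # -p preserves timestamps, which helps when comparing copies later.
    mkdir -p "$dest" && cp -p "$cache" "$dest/zpool.cache.$label"
}

# Example (the destination path is made up):
#   save_zpool_cache /etc/zfs/zpool.cache /net/backup/thishost pre-u9
```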
At this point I was screwed. I couldn't import the pool. I had no backups.
I got critical zones back up on other systems (using data that had been replicating off the now hosed box), so services would not be unduly affected. Everything was back up, but customers couldn't see messages we had discarded for them, and so couldn't release important mail that had been improperly discarded.
After an hour of trying various things (like logfix, and booting newer instances of Solaris) I gave up. At 0430, I woke up my co-worker Rik, and explained I had totally screwed us.
"That does sound pretty bad."
I stood up another zone so we could start importing the last seven days of messages from the message queue (which we keep as a hedge in case something just like this happens, though I doubt anyone expected me to be the cause). In the process of this, he rewrote the reindexing system to make it an order of magnitude faster. We went from the refill taking 2 days to 6 hours.
🔗The Road to Recovery
Once the refill was running my body shut down for five hours.
My brain working slightly better, I started thinking: I had a copy of the old zpool.cache, which contained configuration about the now-defunct tank pool. But how could I turn that into something useful?
Keep in mind: My data was on the disk. No corruption had occurred. It was just my version of ZFS that didn't want to import the pool with a missing log device. How could I force it to?
I had thought about several things before crashing: The logfix tool basically replaces a missing log device with another by walking the ZFS metadata tree, replacing the device path and GUID with another device or a file. Okay, I could try something like that, right? But the code needs Nevada headers or Nevada.
I came back up to James McPherson having built a logfix binary for Solaris 10. Unfortunately it didn't work (but also didn't eat anything, so props to James).
So if logfix wasn't going to work, I was going to have to do something really complicated. Digging around with zdb. Terrifying.
James got me in touch with George Wilson, who had written the zpool import recovery code in the first place. He suggested some things, including:
    # zpool import -V -c /etc/zfs/zpool.cache.log tank
    cannot open 'tank': no such pool
Well, that's not good. zpool import by itself can see the pool, but can't import.
Specifying the secret recovery flag (-V) doesn't help, and with the alternate cache file, which has the configuration for the log device, zpool claims not to even see the pool!

    # zpool import -V -c /etc/zfs/zpool.cache.log
      pool: tank
        id: 17954631541182524316
     state: DEGRADED
    status: One or more devices are missing from the system.
    action: The pool can be imported despite missing or damaged devices. The
            fault tolerance of the pool may be compromised if imported.
       see: http://www.sun.com/msg/ZFS-8000-2Q
    config:

            tank        DEGRADED
              mirror-0  ONLINE
                c0t2d0  ONLINE
                c0t3d0  ONLINE
              mirror-1  ONLINE
                c0t4d0  ONLINE
                c0t5d0  ONLINE
            logs
              c0t6d0p1  UNAVAIL  cannot open
Okay, so I can see the pool using the old configuration data, but I can't import it. And it's seen as DEGRADED, not UNAVAIL. It's importable. That suggests that I don't need to go digging around with zdb or a hex editor. George is also starting with the import command, not a hex editor. That seems to imply he thinks it's recoverable.
(I get that sinking feeling that something you thought was going to be really complicated and dangerous is, in fact, trivial. And you've realized long, long after you should have.)
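The DEGRADED versus UNAVAIL distinction is the whole clue, and it can be read mechanically. A small sketch, assuming hypothetical helper names, that pulls the state line out of `zpool import` output fed to it on stdin:

```shell
#!/bin/sh
# Hypothetical helper: read `zpool import` output on stdin and print the
# value of the first "state:" line (for example DEGRADED or UNAVAIL).
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Succeeds only when the listed pool looks importable.
looks_importable() {
    case $(pool_state) in
        ONLINE|DEGRADED) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   zpool import -c /etc/zfs/zpool.cache.log | looks_importable && echo "try the import"
```

This only looks at the first pool listed; with several exportable pools it would need to match on the pool name as well.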
So: -V is the old import switch. I bet that would work on U7. U9 has an actual recovery mechanism now. Maybe...
    # zpool import -F -c /etc/zfs/zpool.cache.log tank
    Pool tank returned to its state as of Thu Nov 04 01:25:50 2010.
    # zpool list
    NAME    SIZE  ALLOC   FREE    CAP  HEALTH    ALTROOT
    rpool   136G  2.05G   134G     1%  ONLINE    -
    tank    272G   132G   140G    48%  DEGRADED  -
Twelve hours later, there is much more cursing.
🔗Ghost of the Arcane
A lot of UNIX comes down to reading documentation and determining which switches are going to solve your immediate problem. Here, it's two: -F and -c. That's it. Let's assume that twelve hours earlier I was well-rested but still astoundingly dumb, and had managed to get myself into the situation where my pool was UNAVAIL.
Because I was well-rested, I would have read the docs, understood them, and recovered the pool within a few minutes. Instead, I had to recharge my brain, and in the meantime created a lot of work for my co-workers and annoyed my customers. Good job!
Ok. Now I want to get rid of the busted log device. The newly imported degraded pool is on ZFS v10. I need to get it to at least v19, which is when log device removal was added. Thankfully U9 supports v22.
    # zpool upgrade tank
    This system is currently running ZFS pool version 22.

    Successfully upgraded 'tank' from version 10 to version 22
And get rid of the dead log device:
    # zpool remove tank c0t6d0p1
    # zpool status -v tank
      pool: tank
     state: ONLINE
     scrub: none requested
    config:

            NAME        STATE     READ WRITE CKSUM
            tank        ONLINE       0     0     0
              mirror-0  ONLINE       0     0     0
                c0t2d0  ONLINE       0     0     0
                c0t3d0  ONLINE       0     0     0
              mirror-1  ONLINE       0     0     0
                c0t4d0  ONLINE       0     0     0
                c0t5d0  ONLINE       0     0     0

    errors: No known data errors
And the pool is back online in a usable state.
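The version requirement mentioned earlier (log device removal needs pool version 19 or later) can be turned into a quick guard. This is a sketch with a hypothetical function name; the version number would come from something like `zpool get version tank`.

```shell
#!/bin/sh
# Log device removal needs at least this pool version (19, per the text).
LOG_REMOVAL_MIN_VERSION=19

# Hypothetical check. Arg: the pool's current version number.
# Succeeds when the pool is new enough to remove a log device.
can_remove_log() {
    [ "$1" -ge "$LOG_REMOVAL_MIN_VERSION" ]
}

# Example:
#   can_remove_log 10 || echo "run zpool upgrade first"
```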
Before we make any changes to the newly recovered pool I take a snapshot and send it to another system. This takes a few hours. This means that if the new indexer has a bug, or doesn't interact well with the existing index, we'll be able to go back to the pristine data.
I also start up the rolling replication script on the dataset. The first send takes a few hours; incrementals 20-30 minutes.
Both of those things should have already been in place.
🔗That How They Do
Shortly before I got the pool back online, the 7 day import had finished and we had announced to customers they could get back to seeing their discarded messages.
Well, now I had the last 30 days of spam, and all the metadata that went with it. Rebuilding the other 23 days on the new index was going to be both non-trivial and slow. We would have to pull the information off disk for each message (around 2TB of spam), and some data was only stored in the index.
The decision was made to revert to the original dataset. I pointed the new index refiller at it, and 9 minutes later we had the last 12 hours of spam indexed. We swapped around, merged the metadata from the temporary dataset into the original one, and we were back online.
We made the announcement, wrote a blog post, and everything was good again.
Almost as if I had never done anything incredibly stupid.
Maybe you are the lone SA at a small company, but you still have resources to ask for advice. There are certainly people on IRC whose opinion I value. Your boss and co-workers may not know as much about systems as you do, but they can probably recognize a three-legged chair when it's in front of them.
It is easy to do stupid shit in a vacuum. Talking to people about it is probably enough for you to recognize if it's a bad idea.
I'll have another post coming up (with pretty graphs, hopefully) about hybrid storage pools and their impact on performance. Two SSDs just came in to act as a mirror for this host, so it should be interesting.
You have broken something. You feel dumb and defensive, and pissed off at yourself. Don't take it out on the people who are helping you get the system back online.
When you break something and can't fix it, you create work for other people. Make sure you thank them and apologize. Act like a professional, or even just a regular human being.
I can think of a few instances where Ricardo Signes has had to save my bacon in the last few years, but probably nothing so major as this case. I had to wake him up at 0430 to give me a hand, and while he's paid to do it, it's unfortunate how rare it is to find people as pleasant and professional as he is.
Over the years I've worked with lots of smart people, but few as smart and even-tempered as rjbs. Manhug!
A brief tangent.
Sysadmins are admittedly used to other people breaking things and wanting us to fix it. Treat your co-workers, customers, and users with respect. Do not call them lusers, do not make them feel bad. It is extremely aggravating at times, but they are not a puppy who just had an accident on your new carpet. They are adults, and your colleagues.
At some point you may find yourself on the other side of the table: You have done something and now they can't get any work done. Hopefully they will recall that when they screwed up, you did not berate them, and will afford you the same courtesy.
Educate them after you have solved their problem.
Don't be a dick.
Special thanks to James McPherson of Oracle/Sun and George Wilson of Delphix (previously of Oracle/Sun) for giving me a hand. George pointed me to -V and -c, which finally helped me realize just how dumb I was being and got my pool back online.
Once I realized I was screwed and got the immediate booms out of the way, I opened a case with Oracle. P1, at 0700. A rep got back to me around 1900. Nearly 12 hours later. For a "system down" event, affecting many customers, on a paid support contract.
Andre van Eyssen says: If you have a P1 problem, call it in. Don't use SunSolve. Make the call.
A support contract is not a panacea.
Design your systems to be redundant and resilient.