X4500 and ZFS pool configuration

UPDATED: 2010-01-31 2245

RAS guru Richard Elling notes a couple bad assumptions I’ve made in this post:

Much thanks to Richard for correcting me!

I was recently asked to help install and configure a Sun X4500 (“Thumper”). The system has dual Opterons, 16GB RAM, and 48 500GB SATA disks. The largest pool I’d configured before this project was a Sun J4200: 24 disks.

The Thumper controller/disk setup looks like this:

That’s six controllers, with 46 disks available for data.

The mirrored ZFS rpool is on c5t0 and c4t0. Placing the mirror halves on different controllers allows the operating system to survive a controller failure.

ZFS supports two basic redundancy types: Mirroring (RAID1) and RAIDZ (akin to RAID5, but More Gooder). RAIDZ1 is single parity, and RAIDZ2 double. I decided to go with RAIDZ2 as the added redundancy is worth more than capacity: The 500GB disks can trivially be swapped out for 1TB or 2TB disks, but the pool cannot be easily reconfigured after creation.

From the ZFS Best Practices and ZFS Configuration guides, the suggested RAIDZ2 pool configurations are:

  • 4x(9+2), 2 hot spares, 18.0 TB
  • 5x(7+2), 1 hot spare, 17.5 TB
  • 6x(5+2), 4 hot spares, 15.0 TB
  • 7x(4+2), 4 hot spares, 14.0 TB
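
The capacities listed are raw data capacity: vdevs × data disks per vdev × 0.5 TB. A quick sketch of the arithmetic:

```shell
# Raw data capacity per layout: vdevs * data disks * 0.5 TB (500 GB drives).
# The "+2" parity disks and the hot spares contribute no data capacity.
for layout in "4 9" "5 7" "6 5" "7 4"; do
  set -- $layout
  awk -v v="$1" -v d="$2" 'BEGIN { printf "%dx(%d+2): %.1f TB\n", v, d, v * d * 0.5 }'
done
```

(Usable space will be lower once parity and metadata overhead are accounted for, as noted later.)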

ZFS pools consist of virtual devices (vdevs), which can be configured in various ways. In the first configuration above, you make four RAIDZ2 vdevs of 11 disks each (9 data + 2 parity), leaving two hot spares.

(ZFS pools are quite flexible: You could set up mirrors of RAIDZs, three-way mirrors, etc. In addition to single and dual parity RAIDZ, RAIDZ3 was recently released: Triple parity!)
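
For illustration, each redundancy type maps directly to a vdev keyword in zpool create; the pool and device names below are hypothetical:

```shell
# Hypothetical pool and device names; each command shows one vdev type.
zpool create ex1 mirror c0t0d0 c1t0d0 c2t0d0                  # three-way mirror
zpool create ex2 raidz  c0t1d0 c1t1d0 c2t1d0 c3t1d0           # single parity (RAIDZ1)
zpool create ex3 raidz2 c0t2d0 c1t2d0 c2t2d0 c3t2d0 c4t2d0    # double parity
zpool create ex4 raidz3 c0t3d0 c1t3d0 c2t3d0 c3t3d0 c4t3d0 c5t3d0  # triple parity
```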

Distributing load across the controllers is an important performance consideration but limits possible pool configurations. Preferably you want each vdev to have the same number of members. Surviving a single controller failure is also required.

RAIDZ2 is double parity, so you lose the “+2” disks noted above, but each vdev can sustain the loss of two disks. The entire pool can therefore survive the loss of up to two disks per vdev. This is pretty important given the size of the suggested vdevs (6-11 disks). The ZFS man page recommends keeping RAIDZ vdevs to no more than 9 disks, because beyond that you start losing reliability: more disks, and not enough parity to go around. Consider the likelihood of losing more than two disks in a vdev with 30 members, for instance.
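
A back-of-envelope way to see why wide vdevs hurt: with double parity, any three disks failing in one vdev before a resilver completes loses the pool, and the number of possible three-disk combinations grows quickly with vdev width:

```shell
# Distinct 3-disk failure combinations in an n-disk vdev: C(n,3) = n(n-1)(n-2)/6.
for n in 7 9 11 30; do
  awk -v n="$n" 'BEGIN { printf "%2d disks: %4d fatal 3-disk combinations\n", n, n*(n-1)*(n-2)/6 }'
done
```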

The goal is to balance number of vdev members with parity.

A vdev can be grown by replacing every member disk. Once all disks have been replaced, the vdev grows, and the pool’s total capacity increases. If the goal is to incrementally add space by upgrading one vdev at a time, having too many disks per vdev (say, 11) makes for a hassle: every one of them must be replaced before you see any benefit.

The process for growing a vdev is: replace a disk, wait for the new disk to resilver, replace the next disk, wait for it to resilver, and so on. It is time consuming; not something you want to do very often.
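
Sketched as a loop (the device names here are hypothetical, and depending on the release you may also need the pool’s autoexpand property turned on before the new space appears):

```shell
# Sketch: replace each member of one vdev, resilvering between swaps.
# Device names are hypothetical; assumes each new disk goes in the same bay.
for disk in c0t1d0 c1t1d0 c6t1d0 c7t1d0 c4t1d0 c5t1d0 c5t4d0; do
  zpool replace tank "$disk"
  # Block until the resilver finishes before pulling the next disk.
  while zpool status tank | grep -q 'resilver in progress'; do
    sleep 600
  done
done
```

Only once the last member has been replaced does the vdev (and therefore the pool) report the larger size.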

So the tradeoff here is really:

  • More initial space (18 TB), but a less trivial upgrade (11 disks per vdev) and OK performance
  • Less initial space (15 TB), but a more trivial upgrade (7 disks per vdev) and good performance
  • The least initial space (14 TB), but the most trivial upgrade (6 disks per vdev) and the best performance

With the 6x(7) or 7x(6) configurations we have four free disks. One or two can be assigned to the pool as hot spares. The other two or three disks can be used for:

  • A mirror for tasks requiring dedicated I/O
  • Replacement with 3.5” SAS drives or SSDs to act as cache devices

I’ll discuss Hybrid Storage Pools (ZFS pools with cache devices consisting of SSD or SAS drives) in another post. They greatly affect pool behavior and performance. Major game-changers.

Unsurprisingly, the 7x(4+2) configuration has the highest RAS and the best performance. It spreads load across the controllers evenly, has the best write throughput, is the easiest to upgrade, and so on.

Sacrificing 4TB of capacity for better redundancy and performance may not be in line with your vision of the system’s purpose.

The 15TB configuration seems like a good compromise. High RAS: Tolerance to failure, good performance, good flexibility, not an incredibly painful upgrade path, and 15TB isn’t anything to sneer at.

(Note: After parity and metadata, 13.3TB is actually usable.)

The full system configuration looks like this:

pool: rpool
state: ONLINE
scrub: none requested

    rpool         ONLINE       0     0     0
      mirror      ONLINE       0     0     0
        c5t0d0s0  ONLINE       0     0     0
        c4t0d0s0  ONLINE       0     0     0 

errors: No known data errors

pool: tank
state: ONLINE
scrub: none requested

    tank        ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t7d0  ONLINE       0     0     0
        c1t7d0  ONLINE       0     0     0
        c6t7d0  ONLINE       0     0     0
        c7t7d0  ONLINE       0     0     0
        c4t7d0  ONLINE       0     0     0
        c5t7d0  ONLINE       0     0     0
        c0t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t3d0  ONLINE       0     0     0
        c1t3d0  ONLINE       0     0     0
        c6t3d0  ONLINE       0     0     0
        c7t3d0  ONLINE       0     0     0
        c4t3d0  ONLINE       0     0     0
        c5t3d0  ONLINE       0     0     0
        c1t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t6d0  ONLINE       0     0     0
        c1t6d0  ONLINE       0     0     0
        c6t6d0  ONLINE       0     0     0
        c7t6d0  ONLINE       0     0     0
        c4t6d0  ONLINE       0     0     0
        c5t6d0  ONLINE       0     0     0
        c6t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t2d0  ONLINE       0     0     0
        c1t2d0  ONLINE       0     0     0
        c6t2d0  ONLINE       0     0     0
        c7t2d0  ONLINE       0     0     0
        c4t2d0  ONLINE       0     0     0
        c5t2d0  ONLINE       0     0     0
        c7t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t5d0  ONLINE       0     0     0
        c1t5d0  ONLINE       0     0     0
        c6t5d0  ONLINE       0     0     0
        c7t5d0  ONLINE       0     0     0
        c4t5d0  ONLINE       0     0     0
        c5t5d0  ONLINE       0     0     0
        c4t4d0  ONLINE       0     0     0
      raidz2    ONLINE       0     0     0
        c0t1d0  ONLINE       0     0     0
        c1t1d0  ONLINE       0     0     0
        c6t1d0  ONLINE       0     0     0
        c7t1d0  ONLINE       0     0     0
        c4t1d0  ONLINE       0     0     0
        c5t1d0  ONLINE       0     0     0
        c5t4d0  ONLINE       0     0     0
    spares
      c0t0d0    AVAIL
      c1t0d0    AVAIL

errors: No known data errors

The command to create the tank pool:

zpool create tank \
raidz2 c0t7d0 c1t7d0 c6t7d0 c7t7d0 c4t7d0 c5t7d0 c0t4d0 \
raidz2 c0t3d0 c1t3d0 c6t3d0 c7t3d0 c4t3d0 c5t3d0 c1t4d0 \
raidz2 c0t6d0 c1t6d0 c6t6d0 c7t6d0 c4t6d0 c5t6d0 c6t4d0 \
raidz2 c0t2d0 c1t2d0 c6t2d0 c7t2d0 c4t2d0 c5t2d0 c7t4d0 \
raidz2 c0t5d0 c1t5d0 c6t5d0 c7t5d0 c4t5d0 c5t5d0 c4t4d0 \
raidz2 c0t1d0 c1t1d0 c6t1d0 c7t1d0 c4t1d0 c5t1d0 c5t4d0 \
spare c0t0d0 c1t0d0

This leaves c6t0d0 and c7t0d0 available for use as more spares, for another pool or as cache devices.
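
If we later decide to put c6t0d0 and c7t0d0 to use, either role can be added online, after pool creation; a sketch (in the cache case, the bays would get SSDs first):

```shell
# Grow the spare pool with the two remaining disks:
zpool add tank spare c6t0d0 c7t0d0
# Or, after swapping in SSDs, add them as L2ARC cache devices instead:
zpool add tank cache c6t0d0 c7t0d0
```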

I feel the configuration makes for a good compromise. If it doesn’t prove successful or we’ve misjudged the workload for the machine, we have the ability to add cache devices without compromising the pool’s redundancy.

That said, I’ll be quite interested in seeing how it performs!

Posted Sunday, January 31st, at 8:31 PM (∞).
Documenting infrastructure changes over time is useful for spotting trends in your knowledge. It’s also helpful when identifying areas that are lacking in investment. Keep track of the major changes to your platform; this table is nice, but you can see there was a large chunk of time I didn’t update it. Things were still changing, but they weren’t documented. Going back a year and digging through our ticketing system looking for major changes wouldn’t necessarily be trivial, either. Keep your documentation up to date or it’s useless!

In addition to a ticketing system, we utilize a CHANGELOG mailing list, to which summaries of major operations, development, or policy changes are sent, tagged appropriately. We only started doing this in April, though, so populating the missing year is still hard.

On our wiki, I also have brief notes for the content of each release.

Posted Friday, September 25th, at 4:04 PM (∞).

Versioned Will Enforcement (You Can Too!)

If security is a never-ending process, operations is the systematic refutation of entropy.

When I first started out, I did everything by hand. Installs, configuration, every aspect of management. Eventually I started writing scripts. Scripts would install things for me, and copy configuration files, and whatever else. But the scripts were stupid. If they were run more than once, bad things might happen.

The day I did my first automated network install was an epiphany. No more hitting enter or clicking next until my eyes bled a merry pattern on the keyboard.

The weird thing is, my first job involved using Norton Ghost to install entire labs of workstations with an operating system image. But it never occurred to me, until many years later, that a similar thing could be had for servers. A major hole in my experience.

So then I started using images to install new systems. Of course, the problem with using images is that as soon as you build them, they’re out of date. What’s in the image is not actually representative of what you have in production. The image has new stuff the production boxes won’t, or the production systems were changed in some undocumented way that is not reflected in the image, or… Anyway, then you end up writing more scripts. To keep things in sync. Only they aren’t perfect, because by this point every system is just slightly different enough that you can’t find all the edge cases until they cause a boom.

Two years ago I discovered Puppet. I had seen change management before, but in the form of cfengine, and it didn’t really grab me. Its syntax didn’t make my life any easier. It didn’t offer a mental model for how the different pieces of my infrastructure interacted. Puppet did. Maybe Luke just explained it properly in the videos I watched while researching change management tools.

The joy of change management comes from documenting your infrastructure, and then enforcing that singular vision across it with a minimum of effort.

When you install a new host (presumably using Jumpstart/JET, or FAI, or Cobbler), you install Puppet. A few minutes later, that host is now configured with the same base as the rest of your installed hosts. They’re all the same. File permissions, users, directories, services, cron jobs…

If a service needs to be installed on a group of hosts, you write the service class, include it in the service group, and Puppet does the rest.

There’s no more “Oh, right, we changed how that works, but I guess this system we never think about didn’t get updated, and now we’ve totally screwed ourselves in some really unexpected way.”

There’s no more “Hm, someone changed something on this box, and I don’t know why, but I’d better not touch it,” because your Puppet classes are in a versioned repository. You always know who, and why, something was done. (If someone does make a local change, well, too bad for them, because Puppet is bloody well going to change it back until they create an auditable configuration trail.)

I think there’s a threshold: Once you hit a certain number of hosts, you can’t keep them all in your head. I have 20 physical hosts and 87 virtual ones. When I bring up a new Solaris zone, I don’t want to have to run some script that configures it. Heck, I don’t even want to bring it up myself. I just tell Puppet to do it, and then Puppet enables itself in the zone, and then the zoned Puppet configures the zone and suddenly whatever service I wanted to be running is.

I don’t want to have my installation method add a bunch of users. What if I have new users? Now I need to make sure my user-adding scripts, and my post-installation scripts, do the right thing! No, I think I’ll just let Puppet ensure, every 20 minutes, that users who are supposed to exist, do, and those who shouldn’t, don’t. (Not to mention that Puppet makes sure the user’s environment is always set up. No more having to copy your dot-files around, or checking them out from your version control system, or…)

Once you reach a certain amount of platform complexity, you need to abstract management into something you can keep in your head. Otherwise you end up spinning repetitively instead of focusing on newer, more interesting work.

It isn’t even really that much of a paradigm shift. We always end up writing scripts to manage our systems for us. Taking the next step and writing classes and functions in Puppet’s declarative language really isn’t a leap.

Once a codebase reaches a certain amount of complexity, it has to be refactored. It has to be abstracted. Otherwise it becomes unmaintainable. As with development, so too for operations.

If you’ve been at this game for a number of years, and you find yourself performing the same tasks over and over; or like me you are administering a moderate number of hosts; or you have thousands upon thousands of systems, and you aren’t using some form of versioned change management: Consider this an intervention.

Dude. You’re doing it wrong.

Posted Friday, September 25th, at 9:08 AM (∞).

Console Cowboy Wrangles Himself Out of Work

I’ve been at my current job for three years and change. In many respects, it has been the biggest learning experience of my career (which started in 1999). My previous job had been doing network security for a decently sized university, and the experience almost drove me crazy (go find pictures of my mad scientist hair from that year; you will not question my unstable mental state again, I assure you). When a friend mentioned his employer was hiring, and did I know any system admins looking for work, I said, yeah. Me.

The infrastructure was full of legacy: 10 year old code, four year old Linux boxes, crufty hardware… It wasn’t all doom and gloom; lots of new code existed and worked very well. The R&D side was populated with two very smart people, with good plans on how to fix their side of the shop.

The ops side was a bit of a mess. It was the end result of programmers shoved into administration. There had been no dedicated systems administrator in several years; the admin work had been doled out to the programmers, who understandably had little interest in systems.

After a year digging through and learning as much as I could about the setup, I decided the best solution was to rip it all down and build anew. It’s a testament to either my salesmanship (unlikely) or a willingness and trust by both development and management to try something new and, hopefully, better. Given the state of affairs, though, it probably wasn’t much of a leap. My arguments were sound, and the testing I had done backed them up even more. It wasn’t going to be easy, but migrating from Linux to Solaris 10 was definitely where we wanted to go.

Of course, the changes were rolled out incrementally. In February of 2007 I rolled out our first Solaris 10 box, on a Sun Fire X2100. A little entry-level system, but when you’re being disruptive to a complex ecosystem, it’s good to work incrementally. Otherwise people start asking why the frogs have all suddenly died off.

The subsequent two years saw a lot of changes. All our core services moved onto bigger and better Sun systems running Solaris 10 (the biggest currently being four X4170s that I love). We went from 50 Linux boxes, to a dozen or so Solaris systems.

Consolidation was the first order of business, which is sort of amusing. When I started, each MX ran not only an MTA, a lot of Perl dispatching services, and cached RBL data, but also a complete replica of the database. The first thing I did to improve MX performance was to get MySQL off the MXes and onto a dedicated replica, and have each set of site MXes use that. If I remember right, the improvement was something like 50-75%.

So when I started consolidating services into Solaris Zones, the irony didn’t escape me. I had started out separating services onto dedicated hardware, and now I was stuffing a bunch of random toys into the same box again. (Of course, the databases are still on dedicated hardware; and well. New dual CPU quad core Xeons and Nehalems with SAS disks and 32GB of RAM kind of beat the pants off the dual Athlons we had been using…)

After consolidation came change management; Puppet proved to be an excellent choice, and I’ve been happy with it since. Puppet manages almost every aspect of our services. If it isn’t managed, it’s a bug, and a task gets made to fix it.

After consolidation came standardization; in addition to keeping all the systems near the same patch and release level, I rolled out pkgsrc across both our Solaris and Linux platforms. Having the same version of a package on both made life easier in a lot of ways.

We went through several iterations of both installation and management techniques. I had never admin’d Solaris before, so it was a learning experience both for me and (perhaps less so) for our developers. We had to port a lot of code that relied on Linuxisms, and one of our devs built a framework around the CPAN which would keep all our Perl modules in sync across any number of platforms (right now, just two: Debian Linux and Solaris 10, both on x86). We’re a big Perl shop; if you use Perl email modules from the CPAN, you probably use code we developed or maintain.

In addition to the operations turmoil, we went through several changes in how we scheduled and managed our actual work. We finally settled on two week iterations. Each iteration is planned in advance, at the end of the previous iteration. We use Liquid Planner for this, and it has really worked out.

My major regret in rolling out Solaris was not using Live Upgrade until far too late. It wasn’t until two months ago that I actually sat down and took the fifteen minutes to read the documentation and do a test upgrade. For the previous two years I had been patching and upgrading systems stupidly and with as much tedium as was possible. Live Upgrade is one of Solaris’s killer features, right up there with Zones, ZFS, DTrace, mdb, and SMF. I wasted a lot of time I needn’t have if I had been using it.

But… after two years, the infrastructure is stable. We no longer have a monitor that fires when a system boots (uptime.monitor), because systems don’t randomly reboot. If a host does fall offline, the monitors that watch the services the host provides fire instead (and, of course, the ICMP checks). Services live in discrete containers, and it’s easy to tell what’s causing problems at a glance; and if glancing doesn’t work, well, there’s the DTrace Toolkit. Every system’s configuration is enforced by Puppet. Everything from users, to services, to ZFS filesystems, to zones, are versioned and managed (I’ll expand on this in a later post, because I’ve come to believe if you aren’t using change management, You’re Doing It Wrong).

Last week I went away for five days, with no Internet access, and I received no harried phone calls from the developers or support staff. No one even emailed me any questions (not that I would have seen it); the systems just did what they’re meant to: Work.

It’s been percolating for a while, but that really was the clincher. When the lone admin can disappear for a business week and the world doesn’t notice, what becomes of him?

All the basic infrastructural problems have been solved. The foundation is now sound.

For the last two years that was my goal, and it’s been the core focus of every day of my work. All of my plans, from moving our fileservers from mirrored SATA drives in SuperMicros running reiserfs (how many nightmares that filesystem caused me, I try not to think about), to Dell 210S JBODs on ZFS, and finally to Sun J4200s, to… well. To everything. The websites, MX policy servers, spam storage, DNS, SASL, the build system, the development environment, support and billing… Putting out each of those fires was as far as I could see.

There are plenty of things left to do on the operations side, certainly: better monitoring and visualization (Reconnoiter?), refactoring our Puppet classes so they’re not horrible, code instrumentation and log searching that aren’t wrappers around grep, and fixing the build and push systems so they’re not rsync and Makefiles, Rakefiles, and things we call Bakefiles but are, in fact, not.

And that’s all really important stuff. But what we have works. It’s not falling over. It doesn’t cause a crisis. None of it is on fire.

Looking back at the last ten years, when I’m not in crisis mode, tearing stuff down and rebuilding it, I get bored. I get bored and I find another shop that is on fire.

I really like my job. I don’t much want to find another. I’ve come to enjoy going to bed at a reasonable hour and getting a reasonable amount of sleep. I’ve just turned 30. There are white streaks in my beard.

Firefighting is for younger people, with less experience but more energy.

Now I have to figure out what a systems administrator does, when the world isn’t actually on fire. When things are, on the whole, ticking along pretty well, in fact. In many respects this is where sysadmins always say they want to end up. Where their job is to sit around playing Nethack, because the thing they have designed Just Works. That would drive me mad. If I’m not designing and implementing something to improve the things I’m responsible for, I get really unhappy. My joy circuit ceases to fire. I have no aspirations for supreme slack.

My shop is no longer on fire.

So: Now what?

Posted Friday, September 25th, at 9:03 AM (∞).
