Console Cowboy Wrangles Himself Out of Work
I've been at my current job for three years and change. In many respects, it has been the biggest learning experience of my career (which started in 1999). My previous job had been doing network security for a decently sized university, and the experience almost drove me crazy (go find pictures of my mad scientist hair from that year; you will not question my unstable mental state again, I assure you). When a friend mentioned his employer was hiring, and did I know any system admins looking for work, I said, yeah. Me.
The infrastructure was full of legacy: 10-year-old code, four-year-old Linux boxes, crufty hardware... It wasn't all doom and gloom; lots of new code existed and worked very well. The R&D side was populated with two very smart people, with good plans on how to fix their side of the shop.
The ops side was a bit of a mess. It was the end result of programmers shoved into administration. There had been no dedicated systems administrator in several years; the admin work had been doled out to the programmers, who understandably had little interest in systems.
After a year digging through and learning as much as I could about the setup, I decided the best solution was to rip it all down and build anew. That I was allowed to is a testament to either my salesmanship (unlikely) or a willingness and trust by both development and management to try something new and, hopefully, better. Given the state of affairs, though, it probably wasn't much of a leap. My arguments were sound, and the testing I had done backed them up even more. It wasn't going to be easy, but migrating from Linux to Solaris 10 was definitely where we wanted to go.
Of course, the changes were rolled out incrementally. In February of 2007 I rolled out our first Solaris 10 box, on a Sun Fire X2100. A little entry-level system, but when you're being disruptive to a complex ecosystem, it's good to work incrementally. Otherwise people start asking why the frogs have all suddenly died off.
The subsequent two years saw a lot of changes. All our core services moved onto bigger and better Sun systems running Solaris 10 (the biggest currently being four X4170s that I love). We went from 50 Linux boxes to a dozen or so Solaris systems.
Consolidation was the first order of business, which is sort of amusing. When I started, each MX ran not only an MTA, a lot of Perl dispatching services, and cached RBL data; it also ran a complete replica of the database. The first thing I did to improve MX performance was to get MySQL off the MXes and onto a dedicated replica, and have each set of site MXes use that. If I remember right, the improvement was something like 50-75%.
So when I started consolidating services into Solaris Zones, the irony didn't escape me. I had started out separating services onto dedicated hardware, and now I was stuffing a bunch of random toys into the same box again. (Of course, the databases are still on dedicated hardware; and well. New dual CPU quad core Xeons and Nehalems with SAS disks and 32GB of RAM kind of beat the pants off the dual Athlons we had been using...)
After consolidation came change management; Puppet proved to be an excellent choice, and I've been happy with it since. Puppet manages almost every aspect of our services. If it isn't managed, it's a bug, and a task gets made to fix it.
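For anyone who hasn't used a tool like Puppet, the core idea is that you declare what a system should look like and the tool works out what has to change. The snippet below is only a toy Python sketch of that idea, not Puppet's syntax or internals; the function name and the resource names are made up for illustration.

```python
def plan_changes(desired, actual):
    """Compare declared state to observed state.

    Both arguments are dicts mapping a resource name to its properties.
    Returns (actions, unmanaged): the changes needed to converge, plus
    anything found on the system that nobody declared.
    """
    actions = []
    for name, want in sorted(desired.items()):
        have = actual.get(name)
        if have is None:
            actions.append(("create", name, want))
        elif have != want:
            actions.append(("update", name, want))
    # Present on the box but not declared anywhere: by the rule above, a bug.
    unmanaged = sorted(set(actual) - set(desired))
    return actions, unmanaged


if __name__ == "__main__":
    # Hypothetical resources, purely for illustration.
    desired = {
        "user:alice": {"shell": "/bin/bash"},
        "zone:mx1": {"state": "running"},
    }
    actual = {
        "zone:mx1": {"state": "installed"},
        "service:mystery": {"state": "online"},
    }
    actions, unmanaged = plan_changes(desired, actual)
    for action in actions:
        print(action)
    print("unmanaged:", unmanaged)
```

The `unmanaged` list is the interesting part: it is the "if it isn't managed, it's a bug" rule turned into something you can file tasks from.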
Then came standardization; in addition to keeping all the systems near the same patch and release level, I rolled out pkgsrc across both our Solaris and Linux platforms. Having the same version of a package on both made life easier in a lot of ways.
We went through several iterations of both installation and management techniques. I had never admin'd Solaris before, so it was a learning experience both for me and (perhaps less so) for our developers. We had to port a lot of code that relied on Linuxisms, and one of our devs built a framework around the CPAN which would keep all our Perl modules in sync across any number of platforms (right now, just two: Debian Linux and Solaris 10, both on x86). We're a big Perl shop; if you use Perl email modules from the CPAN, you probably use code we developed or maintain.
In addition to the operations turmoil, we went through several changes in how we scheduled and managed our actual work. We finally settled on two week iterations. Each iteration is planned in advance, at the end of the previous iteration. We use Liquid Planner for this, and it has really worked out.
My major regret in rolling out Solaris was not using Live Upgrade until far too late. It wasn't until two months ago that I actually sat down and took the fifteen minutes to read the documentation and do a test upgrade. For the previous two years I had been patching and upgrading systems stupidly and with as much tedium as was possible. Live Upgrade is one of Solaris's killer features, right up there with Zones, ZFS, DTrace, mdb, and SMF. I wasted a lot of time that I wouldn't have if I had been using it from the start.
But... after two years, the infrastructure is stable. We no longer have a monitor that fires when a system boots (uptime.monitor), because systems don't randomly reboot. If a host does fall offline, the monitors that watch the services the host provides fire instead (and, of course, the ICMP checks). Services live in discrete containers, and it's easy to tell what's causing problems at a glance; and if glancing doesn't work, well, there's the DTrace Toolkit. Every system's configuration is enforced by Puppet. Everything from users, to services, to ZFS filesystems, to zones, is versioned and managed (I'll expand on this in a later post, because I've come to believe if you aren't using change management, You're Doing It Wrong).
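To make the monitoring point concrete: watching the services a host provides, rather than the host's uptime, can be as simple as checking whether each service still answers. This is a minimal Python sketch of that idea and not our actual monitors; the function names, the host name, and the port list are assumptions for illustration.

```python
import socket


def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable: all count as "not answering".
        return False


def failing_services(host, services, probe=tcp_probe):
    """Return the names of the services on `host` that do not answer.

    `services` maps a service name to its TCP port. `probe` can be swapped
    out, which keeps the logic testable without touching the network.
    """
    return [name for name, port in sorted(services.items())
            if not probe(host, port)]


if __name__ == "__main__":
    # Hypothetical host and services.
    down = failing_services("mx1.example.com", {"smtp": 25, "dns": 53})
    if down:
        print("ALERT:", ", ".join(down))
```

If the whole host drops off the network, every one of its service checks fails at once, which is why a separate "did it reboot" monitor stops earning its keep.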
Last week I went away for five days, with no Internet access, and I received no harried phone calls from the developers or support staff. No one even emailed me any questions (not that I would have seen it); the systems just did what they're meant to: Work.
It's been percolating for a while, but that really was the clincher. When the lone admin can disappear for a business week and the world doesn't notice, what becomes of him?
All the basic infrastructural problems have been solved. The foundation is now sound.
For the last two years that was my goal, and it's been the core focus of every day I do work. All of my plans, from moving our fileservers from mirrored SATA drives in SuperMicros running reiserfs (how many nightmares did that filesystem cause me, I try not to think about) to Dell 210S JBODs on ZFS, to finally Sun J4200s, to... well. To everything. The websites, MX policy servers, spam storage, DNS, SASL, the build system, the development environment, support and billing... Putting out each of those fires was as far as I could see.
There are plenty of things left to do on the operations side, certainly: Better monitoring and visualization (Reconnoiter?), refactoring our Puppet classes so they're not horrible, code instrumentation and log searching that aren't wrappers around grep, fixing the build and push systems so they're not rsync and Makefiles, Rakefiles, and things we call Bakefiles but are, in fact, not.
And that's all really important stuff. But what we have works. It's not falling over. It doesn't cause a crisis. None of it is on fire.
Looking back at the last ten years, when I'm not in crisis mode, tearing stuff down and rebuilding it, I get bored. I get bored and I find another shop that is on fire.
I really like my job. I don't much want to find another. I've come to enjoy going to bed at a reasonable hour and getting a reasonable amount of sleep. I've just turned 30. There are white streaks in my beard.
Firefighting is for younger people, with less experience but more energy.
Now I have to figure out what a systems administrator does, when the world isn't actually on fire. When things are, on the whole, ticking along pretty well, in fact. In many respects this is where sysadmins always say they want to end up. Where their job is to sit around playing Nethack, because the thing they have designed Just Works. That would drive me mad. If I'm not designing and implementing something to improve the things I'm responsible for, I get really unhappy. My joy circuit ceases to fire. I have no aspirations for supreme slack.
My shop is no longer on fire.
So: Now what?