Installing Chef on Joyent’s SmartOS

This is effectively the same procedure as installing Chef on Solaris with pkgsrc or OpenSolaris/OpenIndiana with IPS. We’ll be using Joyent’s provided pkgsrc setup and pkgin installer.

If you’re using SmartOS on the Joyent Cloud, you’ll have pkgin already available. If you’re running SmartOS yourself, you’ll need to install it.

@benjaminws threw up a bootstrap template for the following as well!

The bits:

pkgin install gcc-compiler gcc-runtime gcc-tools-0 ruby19 scmgit-base scmgit-docs gmake sun-jdk6

wget http://production.cf.rubygems.org/rubygems/rubygems-1.8.10.tgz
tar -xzf rubygems-1.8.10.tgz
cd rubygems-1.8.10
ruby setup.rb --no-format-executable

gem install --no-ri --no-rdoc chef

mkdir /etc/chef

cat <<EOF>> /etc/chef/client.rb
log_level        :info
log_location     STDOUT
chef_server_url  "http://chefserver.example.com:4000"
validation_client_name "chef-validator"
node_name "host.domain.com"
EOF

Drop your validation.pem in /etc/chef, and then run chef-client.
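
For example, assuming you can pull the key straight off the Chef server (the source path below is an assumption; use whatever mechanism you actually distribute the validator with):

scp chefserver.example.com:/etc/chef/validation.pem /etc/chef/validation.pem
chef-client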

Posted Monday, October 10th, at 3:30 PM (∞).

Booting OpenIndiana on Amazon EC2

Since OpenSolaris was axed, we haven’t had an option for running a Solaris-based distribution on EC2. Thanks to the hard work of Andrzej Szeszo, commissioned by Nexenta, now we do.

This should be considered a proof of concept, and is perhaps not ready for production. Use at your own risk!

Spin up ami-4a0df023 as m1.large or bigger. This is an install of OpenIndiana oi_147.

Authentication

The image doesn’t currently import your EC2 keypairs, so you’ll need to log in with a password. root is a role in this image, so log in as the oi user.

The oi user’s password is “oi”.

# ssh oi@1.2.3.4
The authenticity of host '1.2.3.4 (1.2.3.4)' can't be established.
RSA key fingerprint is da:b9:0e:73:20:81:4f:a2:a7:91:0d:7d:3c:4b:cb:80.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '1.2.3.4' (RSA) to the list of known hosts.
Password: 
Last login: Sun Apr 17 06:54:24 2011 from domu-3-4-5-6
OpenIndiana     SunOS 5.11      oi_147  September 2010
oi@openindiana:~$ pfexec su -
Password: 
OpenIndiana     SunOS 5.11      oi_147  September 2010
root@openindiana:~# 

To make root a real user again, use the following commands:

$ pfexec rolemod -K type=normal root
$ pfexec perl -pi -e 's/PermitRootLogin no/PermitRootLogin yes/' /etc/ssh/sshd_config
$ pfexec svcadm restart ssh

You can now log in as root as you’d expect. This behavior is changing all over the place (including Solaris 11 proper), but I don’t mind being a dinosaur.

Caveats

There are some limitations, however.

Boot Environments

  • You have no console access
  • If an upgrade fails, you can’t choose a working BE from grub
  • For the same reason, you won’t be able to boot failsafe

Zones

  • You won’t be able to pull an IP from EC2 for your zones
  • Only one Elastic IP can be assigned to an instance, so you won’t be able to forward a public IP to an internal zone IP
  • You’ll be able to forward ports from the global zone to zones, of course (see the sketch after this list), but this is less useful than zones having unique IPs associated with them
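
A global-zone-to-zone port forward with IPFilter’s NAT would look roughly like this; the interface name and addresses are placeholders (the PV NIC on these instances is likely xnf0). In /etc/ipf/ipnat.conf:

rdr xnf0 0.0.0.0/0 port 8080 -> 192.168.1.10 port 80 tcp

Then enable IPFilter and confirm the rule loaded:

# svcadm enable network/ipfilter
# ipnat -l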

EBS

  • There is a bug in devfsadmd which doesn’t like device IDs over 23. I describe how to deal with this below.

Boot Volumes

There are two EBS volumes associated with this AMI. The 8GB one is the root pool device. The 1GB one is where the boot loader (pv-grub) lives.

Triskelios joked earlier that if your instance got into a hosed state, you could mount the 1GB volume elsewhere, munge your grub config, then assign it back to the busted instance. This theoretically gets around not having console access. It’s also hilarious. But could work.
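
If you ever had to try it, the mechanics with the same aws wrapper used below would be roughly this (the volume and instance IDs are placeholders):

$ ./aws detach-volume vol-grubgrub
$ ./aws attach-volume vol-grubgrub -i i-rescuebox -d /dev/sdg

Fix menu.lst from the rescue instance, then detach the volume and attach it back to the busted instance before rebooting it.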

Upgrading to oi_148

oi_147 has some known bugs, so we want to get up to oi_148. You could also update to the OpenIndiana illumos build (dev-il), but we’ll stick with 148 for now.

The old opensolaris.org publisher is still available, as there is software on it not available in OpenIndiana’s repo. However, we need to set that publisher non-sticky so it doesn’t hold back package upgrades. If you don’t set the repo non-sticky, you won’t get a complete upgrade. You’ll be running a 148 kernel, but lots of 147 packages. One symptom of this is zones won’t install.

root@openindiana:~# pkg publisher
PUBLISHER                             TYPE     STATUS   URI
openindiana.org          (preferred)  origin   online   http://pkg.openindiana.org/dev/
opensolaris.org                       origin   online   http://pkg.openindiana.org/legacy/

root@openindiana:~# pkg set-publisher --non-sticky opensolaris.org

root@openindiana:~# pkg publisher
PUBLISHER                             TYPE     STATUS   URI
openindiana.org          (preferred)  origin   online   http://pkg.openindiana.org/dev/
opensolaris.org          (non-sticky) origin   online   http://pkg.openindiana.org/legacy/

Once that’s done, we update the current image.

root@openindiana:~# pkg image-update
                Packages to remove:     4
               Packages to install:     7
                Packages to update:   531
           Create boot environment:   Yes
DOWNLOAD                                  PKGS       FILES    XFER (MB)
Completed                              542/542 11254/11254  225.5/225.5

PHASE                                        ACTIONS
Removal Phase                              1882/1882
Install Phase                              2382/2382
Update Phase                             19953/19953

PHASE                                          ITEMS
Package State Update Phase                 1073/1073
Package Cache Update Phase                   535/535
Image State Update Phase                         2/2

A clone of openindiana exists and has been updated and activated.
On the next boot the Boot Environment openindiana-1 will be mounted on '/'.
Reboot when ready to switch to this updated BE.


---------------------------------------------------------------------------
NOTE: Please review release notes posted at:

http://docs.sun.com/doc/821-1479
---------------------------------------------------------------------------

A new BE has been created for us, and is slated to be active on reboot.

root@openindiana:~# beadm list
BE            Active Mountpoint Space Policy Created      
--            ------ ---------- ----- ------ -------      
openindiana   N      /          94.0K static 2011-04-04 23:00
openindiana-1 R      -          3.90G static 2011-04-17 07:20

root@openindiana:~# reboot

Once the instance comes back up, log in:

# ssh oi@1.2.3.4
Password:
Last login: Sun Apr 17 06:54:25 2011 from domu-3.4.5.6
OpenIndiana     SunOS 5.11      oi_148  November 2010
oi@openindiana:~$ pfexec su -
Password:
OpenIndiana     SunOS 5.11      oi_148  November 2010

As you can see, we’re running oi_148.

To be sure the upgrade is happy and we don’t have any sticky 147 bits left:

root@openindiana:~# pkg list | grep 147
root@openindiana:~#

Awesome.

ZFS on EBS

Create some EBS volumes and attach them to the instance. You’ll need to specify a Linux-style device path for the volume. The older OpenSolaris AMI required a numeric device, as Solaris expects device IDs to be 0..23, but attaching by number either broke at some point in the last year or doesn’t work with pv-grub. Regardless, we can work around it.

$ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b
$ ./aws create-volume --size 128 --zone us-east-1b

$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdc
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdd
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sde
$ ./aws attach-volume vol-vvvvvvvv -i i-iiiiiiii -d /dev/sdf

Once the volumes are available, you’ll see messages like this in dmesg:

failed to lookup dev name for /xpvd/xdf@2128
disk_link: invalid disk device number (2128)

Which is the devfsadmd bug I mentioned above. Solaris expects device IDs to be 0..23, and devfsadm doesn’t know how to deal with anything higher.

There’s very likely a way to automate this, but I just wrote a stupid script that creates links in /dev/dsk and /dev/rdsk for the devices we’ve attached to the instance. Until the devices have the proper links, you won’t see them in format or iostat. And cfgadm doesn’t work in a Xen guest, so.

The device IDs are consistent, however. The first two disks in the system (the rpool and the pv-grub volumes) are 2048 and 2064. The device IDs increment by 16:

root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskdone


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <drive type unknown>
          /xpvd/xdf@2048
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
          /xpvd/xdf@2064
Specify disk (enter its number):

So now we link in the new devices:

root@openindiana:~# ./links.sh c0t2d0 2080
root@openindiana:~# ./links.sh c0t3d0 2096
root@openindiana:~# ./links.sh c0t4d0 2112
root@openindiana:~# ./links.sh c0t5d0 2128

root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskFailed to inquiry this logical diskdone


AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <drive type unknown>
          /xpvd/xdf@2048
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
          /xpvd/xdf@2064
       2. c0t2d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
          /xpvd/xdf@2080
       3. c0t3d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
          /xpvd/xdf@2096
       4. c0t4d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
          /xpvd/xdf@2112
       5. c0t5d0 <??????HH???X?[??????? cyl 16709 alt 0 hd 255 sec 63>
          /xpvd/xdf@2128
Specify disk (enter its number):

Create our ZFS pool:

root@openindiana:~# zpool create tank mirror c0t2d0 c0t3d0 mirror c0t4d0 c0t5d0
root@openindiana:~# zpool list tank
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank   254G    80K   254G     0%  1.00x  ONLINE  -

Once the EFI labels have been written to the disks, format stops throwing errors on them, as well:

root@openindiana:~# format < /dev/null
Searching for disks...
Failed to inquiry this logical diskdone

AVAILABLE DISK SELECTIONS:
       0. c0t0d0 <drive type unknown>
          /xpvd/xdf@2048
       1. c0t1d0 <??????HH???X?[??????? cyl 4095 alt 0 hd 128 sec 32>
          /xpvd/xdf@2064
       2. c0t2d0 <Unknown-Unknown-0001-128.00GB>
          /xpvd/xdf@2080
       3. c0t3d0 <Unknown-Unknown-0001-128.00GB>
          /xpvd/xdf@2096
       4. c0t4d0 <Unknown-Unknown-0001-128.00GB>
          /xpvd/xdf@2112
       5. c0t5d0 <Unknown-Unknown-0001-128.00GB>
          /xpvd/xdf@2128
Specify disk (enter its number):

And, just for fun:

root@openindiana:~# dd if=/dev/urandom of=/tank/random bs=1024 count=204800
204800+0 records in
204800+0 records out
209715200 bytes (210 MB) copied, 9.14382 s, 22.9 MB/s

So now you have ZFS on EBS, with the ability to do anything ZFS can do. Snapshots will be much, much faster than EBS snapshots (though they aren’t complete copies and will obviously be lost if your pool is lost, whereas EBS snapshots are complete copies of the volume and can be cloned and mounted out of band); you can enable compression, turn on dedup (though that would probably be terrifyingly slow on EC2), and so on.
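
For instance, turning on compression for the new pool is one line (dedup is the same sort of thing, if you’re feeling brave):

root@openindiana:~# zfs set compression=on tank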

links.sh is available here.
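
In case that link ever goes away, the idea is simple enough to sketch. Treat this as a starting point rather than the real script: the slice-letter mapping and the xdf minor names are assumptions based on how normal Solaris disk links look.

#!/usr/bin/ksh
# links.sh DISK DEVID -- e.g. ./links.sh c0t2d0 2080
# Hand-create the /dev/dsk and /dev/rdsk links devfsadmd refused to make.
DISK=$1
DEVID=$2
set -A SLICES a b c d e f g h
i=0
while [ $i -le 7 ]; do
  ln -s ../../devices/xpvd/xdf@${DEVID}:${SLICES[$i]}     /dev/dsk/${DISK}s${i}
  ln -s ../../devices/xpvd/xdf@${DEVID}:${SLICES[$i]},raw /dev/rdsk/${DISK}s${i}
  i=$((i + 1))
done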

Virtual Networking

This is fodder for another post, but something I’ve done elsewhere is to use Crossbow to create a virtual network with zones and VirtualBox VMs. The global zone runs OpenVPN, giving clients access to these private resources. This model seems perfectly suited to EC2, given the IP assignment limitations noted above. Unfortunately I don’t imagine VirtualBox is an option here, but even just a private network of zones would be extremely useful.
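
The Crossbow piece of that is pleasantly small. A rough sketch, with placeholder names, of an etherstub-backed private network and an exclusive-IP zone hanging off it:

# dladm create-etherstub stub0
# dladm create-vnic -l stub0 vnic0
# zonecfg -z zone1
zonecfg:zone1> set ip-type=exclusive
zonecfg:zone1> add net
zonecfg:zone1:net> set physical=vnic0
zonecfg:zone1:net> end
zonecfg:zone1> commit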

And perhaps someday EC2 will let you assign multiple Elastic IPs to an instance.

Conclusion

While there are still a few bugs to be worked out, this proof of concept AMI does work and is ready to have its tires kicked.

I’m pretty stoked to have Solaris available on EC2. Thanks Andrzej and Nexenta both!

Posted Sunday, April 17th, at 7:18 AM (∞).

Building python26@pkgsrc on Solaris

Python 2.6 fails to build its socket and ssl modules on Solaris. This is fixed in 2.7, but won’t be backported to 2.6.

This recent post to pkg@netbsd details the problem and links to the patch:

http://mail-index.netbsd.org/tech-pkg/2011/04/11/msg006984.html

Posted Wednesday, April 13th, at 2:23 AM (∞).

Building DBD::mysql with SUNWspro and pkgsrc on Solaris 11 Express

I use resmon, which is a pretty nice system metric aggregator. It relies on the system Perl, specifically for Solaris::Kstat, so you don’t have to install pieces of CPAN to get it running. Earlier tonight I decided to point its default MySqlStatus module at our MySQL master and ran into a few annoyances.

Historically, getting DBD::mysql installed with the system Perl has proven somewhat painful.

We use the packages from mysql.com, whose libmysqlclient is not built shared, so you can’t build DBD::mysql against them. I could have installed pkg:/database/mysql-5? but I already have a MySQL install via pkgsrc.

So, simply:

Make sure /opt/SUNWspro/bin/cc is first in your path, and:

# /bin/perl Makefile.PL --libs="-L/usr/pkg/lib/mysql -R/usr/pkg/lib/mysql -lmysqlclient -lz" --cflags="-I/usr/pkg/include/mysql -I/usr/include -m32"
# make && make install

And huzzah. DBD::mysql.
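
A quick sanity check that the module loads (and which version you got):

# /bin/perl -MDBD::mysql -e 'print "DBD::mysql $DBD::mysql::VERSION\n"'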

Posted Wednesday, January 5th, at 8:36 PM (∞).

Perl 5.12.2, Solaris, Sun Studio, -m64, -Dvendorprefix woes.

UPDATE Nick Clark dug into this and determined it’s a bug in Sun Studio 12.1. Use 12.2 to build Perl. If anyone at Oracle wants to buy him some beers, send someone from Sun with them. UPDATE

I spent a fair chunk of yesterday afternoon (between diaper changing, swaddling, swinging, singing, and so forth) debugging a weird problem with Perl 5.12.2 on Solaris.

I had been deploying an updated pkgsrc build with ABI=64 and Sun^WSolaris Studio 12.1 for a new project, and ran into perl@pkgsrc segfaulting on certain modules. Extremely weird. I pulled the source and built that without issue, adding only -Dcc=cc -Accflags='-m64' -Aldflags='-m64' to build it 64bit with Studio.
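
The full Configure invocation was along these lines (-des and the default prefix are illustrative; the important bits are the compiler and the -m64 flags):

$ sh Configure -des -Dcc=cc -Accflags='-m64' -Aldflags='-m64'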

This particular project requires deploying Perl modules in tiers, and I thought I would use vendor_perl for stuff I want installed by default that may not necessarily need to live in site_perl. As soon as I rebuilt Perl with -Dvendorprefix the same modules started throwing segv at me.

About five hours of rebuilds later (it works fine with gcc, with Studio at 32bit, and at 64bit on Linux with gcc, etc.), here’s the bug report.

Having narrowed it down to that, I just decided to use APPLLIB_EXP and site_perl.

Very weird.

Posted Thursday, December 16th, at 5:01 AM (∞).

Building nginx@pkgsrc on Solaris/sunpro

Hosed by default. See this post.

For amd64, you’ll want to use

CONFIGURE_ENV+= NGX_AUX=" src/os/unix/ngx_sunpro_amd64.il"

You can also just add that to the Makefile.
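
If you’d rather not touch the package Makefile, the hacks.mk trick from the x86 post further down works here too. A sketch, assuming MACHINE_ARCH reports as x86_64:

.if (${PKGSRC_COMPILER} == sunpro)
.if (${MACHINE_ARCH} == x86_64)
CONFIGURE_ENV+= NGX_AUX=" src/os/unix/ngx_sunpro_amd64.il"
.endif
.endif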

Posted Friday, December 10th, at 2:19 PM (∞).

The First Law of Systems Administration

This post details an outage I caused this week by making several poor decisions.

Each point contains lessons I have learned over the past 10 years and, in this instance, studiously ignored: things I am typically very careful to avoid doing. My record for not breaking things is actually pretty decent, but when I do break things it tends to occur under the same set of circumstances (I’m tired and in a hurry).

Even with a decade of experience and a process that mitigates failures, I managed to do something really, really dumb.

A couple months ago I attended Surge in Baltimore, a conference whose focus is on scalability and dealing with failures. The best talks came down to “this is how we broke stuff, and this is how we recovered.”

Hopefully illuminating this particular failure will not just help someone else recover from something similar, but remind my fellow sysadmins that sometimes you just need to take a nap.

The First Law

Backups. Never do anything unless you have backups.

Stupidity the First

A few weeks ago I added an OCZ Vertex 2 SSD to a ZFS pool as a write cache. These are low-end devices, with not a great MTBF, but my research suggested they would fit our needs.

The pool in question is configured as an array of mirrors. The system was running Solaris 10 U7, which does not have support for either import recovery (-F), import with a missing log device (-m), or removal of log devices.

I had tested the SSD for about a week, burning it in.

The SSD was added without a mirror.

I was quite pleased with myself: The performance increase was obvious and immediate. Good job, me, for making stuff better.

A week after being added to the pool, the SSD died. The exact error from the Solaris SCSI driver was “Device is gone.”

The zpool hung, necessitating a hard system reset. The system came back up, with the SSD being seen as UNAVAIL. We lost whatever writes were in-flight through the ZIL, but given the workload, that was going to be either minor or recoverable.

I made myself a bug to RMA the SSD and order a pair of new ones, and stopped thinking about it, annoyed that a brand new device died after less than a month.

The stupid: Adding a single point of failure to a redundant system.

Bonus stupid: Not more than a month ago I argued on a mailing list that you should always have a mirrored ZIL, regardless of whether or not your system supported import -F or -m. Yup. I ignored my own advice, because I wanted an immediate performance increase.

Extra bonus stupid: Not fixing a problem relating to storage immediately. Sysadmins wrangle data. It’s what we do, and when we do it well, it’s why people love us. Leaving a storage system in a hosed, if working, state is just asking for pain later. Begging for it.

The Second Law

You are not a computer.

Sometimes you are just too tired to work.

Never do anything when your judgement is impaired. In particular, never make major decisions without confirmation when you are overtired (and have, perhaps, just gotten a flu shot). It leads to calamities.

As sysadmins we often have to work on little sleep in non-optimal situations or environments. We sometimes take it as a point of pride that we can do incredibly complex things when we’re barely functional.

At some point you are going to screw yourself, though.

One thing I know about myself: I get really stupid when I’m too tired. If I get woken up at 0300 by a page, I can muscle-memory and squint my way to a fix. If I’ve been up for 14-16 hours and I’ve been getting say, maybe, four hours of sleep a night for the past two months?

I’m going to do something dumb.

Stupidity the Second

I have been upgrading systems to U9 over the last few weeks. The system with the UNAVAIL SSD came up on the rotation. With U9 I’d be able to remove the dead log device. We announced a 30m outage.

And here is where impaired judgement comes in. If the following two thoughts are in your head:

  • I am exhausted
  • I just want to get this done

Stop whatever it is you’re doing. Go take a nap. Wait until a co-worker is around so they can tell you “holy crap, why are you eating live scorpions covered in glass? Stop that stupid thing you are doing!”

My wife is well aware that I do stupid things when I’m tired and tells me “do that later. Go to bed.” Listen to my wife.

I decided to go ahead and upgrade the system with the DEGRADED pool. I have rolling backups for everything on the system except the dataset containing our spam indexes (which are required so customers can view spam we have discarded for them, and release false positives).

Rather than wait to sync that dataset off-system (3-4 hours, and why hadn’t I just started a rolling sync earlier that day? Or had one for the last two years?) I decided to go ahead and upgrade the system.

The stupid: Why would you ever put unique data at risk like this?

Bonus stupid: Why is the data unique? There is no reason for it to be so. Replicating ZFS is trivial. Oversights happen, but this is still dumb.

(My systems all live in pairs. With very few exceptions there are no snowflake services. I take snapshots of my MySQL master. I replicate them, so I can clone and boot them to restore data quickly. I have MySQL replication set up so I can do hot failovers. I have zones replicated via ZFS, I have backups of /etc and /usr/pkg/etc even though the configs are all in git. I replicate all other big datasets to cross-site failover systems with standby zones. I do backups. So why, in my big table of datasets, does this one thing have a big TODO in the replicate column?)

Postpone the maintenance window. It’s ok. Sometimes scheduling conflicts come up. Sometimes you aren’t as prepared as you thought you were. Your customers don’t care whether the thing they weren’t supposed to be able to access for 30 minutes goes away tonight or tomorrow night.

Really. Get some sleep. Wake up tomorrow and feel lucky you didn’t totally break something and potentially lose unrecoverable data.

The Third Law

Don’t make a problem worse. Especially if you caused it.

Never do anything to disks which contain data you need, even if that data is currently inaccessible. Move the workload somewhere else. Hope you think of something.

You are already eating live scorpions covered in glass, don’t go setting them on fire too.

Stupidity the Third

I exported the pool and restarted the system. It Jumpstarted happily. I logged in and…

# zpool import
  pool: tank
    id: 17954631541182524316
 state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-6X
config:

        tank        UNAVAIL  missing device
          mirror-0  ONLINE
            c0t2d0  ONLINE
            c0t3d0  ONLINE
          mirror-1  ONLINE
            c0t4d0  ONLINE
            c0t5d0  ONLINE

        Additional devices are known to be part of this pool, though their
        exact configuration cannot be determined.
# zpool import -F tank
cannot import 'tank': one or more devices is currently unavailable
        Destroy and re-create the pool from
        a backup source.

At this point there was a fair amount of cursing.

The thing is, I knew the pool was fragile. I knew that reinstalling the system was going to blow away /etc/zfs/zpool.cache, which is likely the only reason U7 was happy to import the pool after the SSD died initially and it got rebooted.

But my judgement was impaired: I was making really stupid decisions.

The stupid: Doing something irrevocably destructive to a fragile, unique system.

Regretful Morning

At this point I was screwed. I couldn’t import the pool. I had no backups.

I got critical zones back up on other systems (using data that had been replicating off the now hosed box), so services would not be unduly affected. Everything was back up, but customers couldn’t see messages we had discarded for them, and so couldn’t release important mail that had been improperly discarded.

After an hour of trying various things (like logfix, and booting newer instances of Solaris) I gave up. At 0430, I woke up my co-worker Rik, and explained I had totally screwed us.

"That does sound pretty bad."

I stood up another zone so we could start importing the last seven days of messages from the message queue (which we keep as a hedge in case something just like this happens, though I doubt anyone expected me to be the cause). In the process of this, he rewrote the reindexing system to make it an order of magnitude faster. We went from the refill taking 2 days to 6 hours.

The Road to Recovery

Once the refill was running my body shut down for five hours.

My brain working slightly better, I started thinking: I had a copy of the old zpool.cache, which contained configuration about the now-defunct tank pool. But how could I turn that into something useful?

Keep in mind: My data was on the disk. No corruption had occurred. It was just my version of ZFS that didn’t want to import the pool with a missing log device. How could I force it to?

I had thought about several things before crashing: The logfix tool basically replaces a missing log device with another by walking the ZFS metadata tree, replacing the device path and GUID with another device or a file. Okay, I could try something like that, right? But the code needs Nevada headers, or Nevada itself.

I came back up to James McPherson having built a logfix binary for Solaris 10. Unfortunately it didn’t work (but also didn’t eat anything, so props to James).

So if logfix wasn’t going to work, I was going to have to do something really complicated. Digging around with zdb. Terrifying.

James got me in touch with George Wilson, who had written the zpool import recovery code in the first place. He suggested some things, including:

# zpool import -V -c /etc/zfs/zpool.cache.log tank
cannot open 'tank': no such pool

Well, that’s not good. zpool import by itself can see the pool, but can’t import it.

Specifying the secret recovery flag (-V) doesn’t help; using the alternative cache file that has the configuration for the log device, it claims to not even see the pool!

However:

# zpool import -V -c /etc/zfs/zpool.cache.log

  pool: tank
    id: 17954631541182524316
 state: DEGRADED
status: One or more devices are missing from the system.
action: The pool can be imported despite missing or damaged devices.  The
        fault tolerance of the pool may be compromised if imported.
   see: http://www.sun.com/msg/ZFS-8000-2Q
config:

        tank        DEGRADED
          mirror-0  ONLINE
            c0t2d0  ONLINE
            c0t3d0  ONLINE
          mirror-1  ONLINE
            c0t4d0  ONLINE
            c0t5d0  ONLINE
        logs
          c0t6d0p1  UNAVAIL  cannot open

Okay, so I can see the pool using the old configuration data, but I can’t import it. And it’s seen as DEGRADED, not UNAVAIL. It’s importable. That suggests that I don’t need to go digging around with zdb or a hex editor. George is also starting with the import command, not a hex editor. That seems to imply he thinks it’s recoverable.

(I get that sinking feeling that something you thought was going to be really complicated and dangerous is, in fact, trivial. And you’ve realized long, long after you should have.)

So: -V is the old import switch. I bet that would work on U7. U9 has an actual recovery mechanism now. Maybe…

# zpool import -F -c /etc/zfs/zpool.cache.log tank
Pool tank returned to its state as of Thu Nov 04 01:25:50 2010.
# zpool list
NAME    SIZE  ALLOC   FREE    CAP  HEALTH  ALTROOT
rpool   136G  2.05G   134G     1%  ONLINE  -
tank    272G   132G   140G    48%  DEGRADED  -

Twelve hours later, there is much more cursing.

Ghost of the Arcane

A lot of UNIX comes down to reading documentation and determining which switches are going to solve your immediate problem. Here, it’s two: -F and -c. That’s it. Let’s assume that twelve hours earlier I was well-rested but still astoundingly dumb, and had managed to get myself into the situation where my pool was UNAVAIL.

Because I was well-rested, I would have read the docs, understood them, and recovered the pool within a few minutes. Instead, I had to recharge my brain, I created a lot of work for my co-workers, and I annoyed my customers. Good job!

Ok. Now I want to get rid of the busted log device. The newly imported degraded pool is on ZFS v10. I need to get it to at least v19, which is when log device removal was added. Thankfully U9 supports v22.

# zpool upgrade tank
This system is currently running ZFS pool version 22.

Successfully upgraded 'tank' from version 10 to version 22

And get rid of the dead log device:

# zpool remove tank c0t6d0p1
# zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0

errors: No known data errors

And the pool is back online in a usable state.

Before we make any changes to the newly recovered pool I take a snapshot and send it to another system. This takes a few hours. This means that if the new indexer has a bug that interacts badly with the existing index, we’ll be able to go back to the pristine data.

I also start up the rolling replication script on the dataset. The first send takes a few hours; incrementals 20-30 minutes.

Both of those things should have already been in place.
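
The replication itself is nothing fancy: snapshot, send, repeat. A sketch with placeholder dataset and host names:

# zfs snapshot tank/spam@2010-11-05-0300
# zfs send -i tank/spam@2010-11-04-0300 tank/spam@2010-11-05-0300 | \
    ssh standby-host zfs recv -F tank/spam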

That How They Do

Shortly before I got the pool back online, the 7 day import had finished and we had announced to customers they could get back to seeing their discarded messages.

Well, now I had the last 30 days of spam, and all the metadata that went with it. Rebuilding the other 23 days on the new index was going to be both non-trivial and slow. We would have to pull the information off disk for each message (around 2TB of spam), and some data was only stored in the index.

The decision was made to revert to the original dataset. I pointed the new index refiller at it, and 9 minutes later we had the last 12 hours of spam indexed. We swapped around, merged the metadata from the temporary dataset into the original one, and we were back online.

We made the announcement, wrote a blog post, and everything was good again.

Almost as if I had never done anything incredibly stupid.

Coda

Postmortem

Maybe you are the lone SA at a small company, but you still have resources to ask for advice. There are certainly people on IRC whose opinion I value. Your boss and co-workers may not know as much about systems as you do, but they can probably recognize a three-legged chair when it’s in front of them.

It is easy to do stupid shit in a vacuum. Talking to people about it is probably enough for you to recognize if it’s a bad idea.

I’ll have another post coming up (with pretty graphs, hopefully) about hybrid storage pools and their impact on performance. Two SSDs just came in to act as a mirror for this host, so it should be interesting.

Your Co-workers

You have broken something. You feel dumb and defensive, and pissed off at yourself. Don’t take it out on the people who are helping you get the system back online.

When you break something and can’t fix it, you create work for other people. Make sure you thank them and apologize. Act like a professional, or even just a regular human being.

I can think of a few instances where Ricardo Signes has had to save my bacon in the last few years, but probably nothing so major as this case. I had to wake him up at 0430 to give me a hand, and while he’s paid to do it, it’s unfortunate how rare it is to find people as pleasant and professional as he is.

Over the years I’ve worked with lots of smart people, but few as smart and even-tempered as rjbs. Manhug!

Wheaton’s Law

A brief tangent.

Sysadmins are admittedly used to other people breaking things and wanting us to fix it. Treat your co-workers, customers, and users with respect. Do not call them lusers, do not make them feel bad. It is extremely aggravating at times, but they are not a puppy who just had an accident on your new carpet. They are adults, and your colleagues.

At some point you may find yourself on the other side of the table: You have done something and now they can’t get any work done. Hopefully they will recall that when they screwed up, you did not berate them, and will afford you the same courtesy.

Educate them after you have solved their problem.

Don’t be a dick.

Thanks

Special thanks to James McPherson of Oracle/Sun and George Wilson of Delphix (previously of Oracle/Sun) for giving me a hand. George pointed me to -V and -c which finally helped me realize just how dumb I was being and got my pool back online.

Vendor Support

Once I realized I was screwed and got the immediate booms out of the way, I opened a case with Oracle. P1, at 0700. A rep got back to me around 1900. Nearly 12 hours later. For a “system down” event, affecting many customers, on a paid support contract.

Andre van Eyssen says: If you have a P1 problem, call it in. Don’t use SunSolve. Make the call.

A support contract is not a panacea.

Design your systems to be redundant and resilient.

And don’t do stupid shit when you’re tired.

Posted Friday, November 5th, at 2:05 AM (∞).

textproc/libxslt on pkgsrc/solaris

Requires this patch or you get symbol errors on compile.

Need this for devel/hg-git. Working on getting illumos-gate pushed into github (#105).

Posted Monday, September 6th, at 5:24 AM (∞).

apr bug on pkgsrc/Solaris x86

While working on the illumos infrastructure roll-out I ran into an issue with apache22 segfaulting, but mostly working, when SSL was enabled. Disabling SSL seemed to fix the issue. Other SSL-enabled programs were not affected.

Turns out it’s an apr bug, and kind of deeply weird. Described here.

Posted Sunday, September 5th, at 6:56 PM (∞).

On software.

(grumpy face)

< bdha> I see its value.
< bdha> I just hate it.
< bdha> As a sysadmin I feel I am allowed to feel that way.

Posted Friday, September 3rd, at 1:36 PM (∞).

Segfaults with SSL-enabled packages on pkgsrc2010qN/solaris

Starting with 2010q1 I noticed that anything built with openssl, including, well, openssl, would segfault. To fix this, compile the openssl package with SunPro. Don’t forget to modify your mk.conf:

#PKGSRC_COMPILER=       gcc
PKGSRC_COMPILER=        sunpro

# For sunpro
CC=     cc
CXX=    CC
CPP=    cc -E
CXXCPP= CC -E

I previously tried to compile everything with pkgsrc’s gcc, but perhaps I’ll change that policy now.

Fix defined here.

Posted Friday, September 3rd, at 1:14 PM (∞).

nginx on pkgsrc2010q2/solaris.

To compile the nginx package, you need to remove the patch-aa patch from distfiles. You also need to create hacks.mk and add the following to it:

.if (${PKGSRC_COMPILER} == sunpro)
.if (${MACHINE_ARCH} == i386)
CONFIGURE_ENV+= NGX_AUX=" src/os/unix/ngx_sunpro_x86.il" 
.endif
.endif

Problem report and fix here.

Posted Friday, September 3rd, at 1:11 PM (∞).

Building Postfix on OpenSolaris >=b130

NIS was finally sent into the cornfield around ONNV b130. Postfix does not have a definition for OpenSolaris (arguably 5.11), just Solaris 5.10. When building, it attempts to compile dict_nis and can’t, unsurprisingly.

To build, remove "#define HAS_NIS" from src/util/sys_defs.h in the "#ifdef SUNOS5" section.

With pkgsrc 2010q1, apply this diff to pkgsrc/mail/postfix/patches/patch-ag.

The checksum is 3ea7ecaec06b0ff30fe1a1b2f5197def0219bd6b.

Posted Saturday, August 14th, at 10:58 PM (∞).

Elsechan…

< bda> I guess the X2270 is coming in soon. The power cables are here.

< e^ipi> unless it's the computer columbian drug lords and the
 power cable is a warning

< e^ipi> like a toe

Posted Wednesday, February 3rd, at 4:40 PM (∞).

ZFS and iSCSI

I was asked to share out the pool on the X4500 via NFS and iSCSI. NFS I was familiar with, and have used a fair amount. iSCSI, for all its new hotness factor, I’d never touched.

I was unsurprised, but pleased, by how trivial it is to set up. On the server (the target):

x4500# zfs create tank/iscsi
x4500# zfs set shareiscsi=on tank/iscsi
x4500# zfs create -s -V 25g tank/iscsi/vol001
x4500# zfs create -s -V 25g tank/iscsi/vol002
x4500# zfs create -s -V 25g tank/iscsi/vol003
x4500# zfs create -s -V 25g tank/iscsi/vol004
x4500# zfs create -s -V 25g tank/iscsi/vol005
x4500# zfs create -s -V 25g tank/iscsi/vol006
x4500# zfs list tank
tank                      1.39G  13.3T  53.3K  /tank
tank/iscsi                1.54M  13.3T  44.8K  /tank/iscsi
tank/iscsi/vol001          246K  13.3T   246K  -
tank/iscsi/vol002          246K  13.3T   246K  -
tank/iscsi/vol003          247K  13.3T   247K  -
tank/iscsi/vol004          262K  13.3T   262K  -
tank/iscsi/vol005          263K  13.3T   263K  -
tank/iscsi/vol006          264K  13.3T   264K  -
tank/nfs                  1.39G  13.3T  1.39G  /tank/nfs

This will start the iSCSI target service (iscsitgtd) and share not only the parent volume (tank/iscsi) but all the children as well.
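
You can confirm the targets exist on the server side with iscsitadm:

x4500# iscsitadm list target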

Accessing and using the disks on the client (the initiator) is just as easy:

client# iscsiadm modify discovery --sendtargets enable
client# iscsiadm add discovery-address 10.0.100.40
client# svcadm enable initiator
client# iscsiadm list target
Target: iqn.1986-03.com.sun:02:7ea6450a-4a26-cfe5-d679-fe0dbabe66b9
        Alias: tank/iscsi/vol001
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Target: iqn.1986-03.com.sun:02:61d314ce-f4b3-ed1f-9891-e0c6c52f5601
        Alias: tank/iscsi/vol002
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Target: iqn.1986-03.com.sun:02:eaea5a32-f54a-6d04-a453-888a580504c2
        Alias: tank/iscsi/vol003
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Target: iqn.1986-03.com.sun:02:05272769-bd4a-6b54-8d6f-f525af20ad08
        Alias: tank/iscsi/vol004
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Target: iqn.1986-03.com.sun:02:fb296689-109e-cb6d-9515-a07f581a81ce
        Alias: tank/iscsi/vol005
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
Target: iqn.1986-03.com.sun:02:ec71ac8e-e417-ce3f-891c-ee1febdf9120
        Alias: tank/iscsi/vol006
        TPGT: 1
        ISID: 4000002a0000
        Connections: 1
client# format < /dev/null
Searching for disks...done
AVAILABLE DISK SELECTIONS:
       0. c0d0 <DEFAULT cyl 4174 alt 2 hd 255 sec 63>
          /pci@0,0/pci-ide@1,1/ide@0/cmdk@0,0
       1. c1t600144F04B66BAA30000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa30000144f21056400
       2. c1t600144F04B66BAA40000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa40000144f21056400
       3. c1t600144F04B66BAA50000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa50000144f21056400
       4. c1t600144F04B66BAA60000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa60000144f21056400
       5. c1t600144F04B66BAA80000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa80000144f21056400
       6. c1t600144F04B66BAA90000144F21056400d0 <DEFAULT cyl 3261 alt 2 hd 255 sec 63>
          /scsi_vhci/disk@g600144f04b66baa90000144f21056400

client# zpool create tank \
raidz c1t600144F04B66BAA30000144F21056400d0 c1t600144F04B66BAA40000144F21056400d0 c1t600144F04B66BAA50000144F21056400d0 \
raidz c1t600144F04B66BAA60000144F21056400d0 c1t600144F04B66BAA80000144F21056400d0 c1t600144F04B66BAA90000144F21056400d0
client#  zpool status -v tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME                                       STATE     READ WRITE CKSUM
        tank                                       ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c1t600144F04B66BAA30000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA40000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA50000144F21056400d0  ONLINE       0     0     0
          raidz1                                   ONLINE       0     0     0
            c1t600144F04B66BAA60000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA80000144F21056400d0  ONLINE       0     0     0
            c1t600144F04B66BAA90000144F21056400d0  ONLINE       0     0     0

errors: No known data errors

Very nice.

Posted Monday, February 1st, at 6:42 AM (∞).
