March 17th, 2011

Why debugging is harder than coding.

To test NFS containerization, I need to set up conflicting network routing: something the container can access but the host can't, and vice versa. To do this, I used my three-layer setup (laptop/kvm/container, described here) so I can set up routings on the laptop which are a couple of hops away from the point of view of the containers and the container host (i.e. the kvm system).

Initially, to get a routing the container could access and the host couldn't, I set up an alias on the laptop for the address the KVM system's own eth0 gets from the default QEMU masquerading LAN, and ran an NFS server on that alias. Since the KVM system owns that address itself, it can't see any _other_ copy of it, and life is good.

Then to set up a routing that the KVM system could see but the container couldn't, I ran an NFS server on the laptop's loopback address. The KVM system could access that via the host alias in the virtual LAN its eth0 is plugged into, but the tap device the container uses has no way to route out to the laptop's loopback: the container would see its own loopback interface instead.

Then right before SCALE I changed my test setup so that the mount command in the container and the mount command on the kvm system were identical, both using the same address. On the laptop I set up an alias for that address and ran one NFS server on it, plus another instance on a different address. So the container and the container's host were both mounting NFS from the same address, but should be connecting to DIFFERENT servers when they did this.

The failure mode for the first test setup is "server not found", because if it uses the wrong network context, it'll route to an address that isn't running an NFS server. (The local address hides the remote address.) The failure mode for the second setup is accessing the wrong server: it's always going to route out to a remote address, the question is whether or not it gets the right one. (Side note: unfortunately you can't tell the NFS server "export this path as this path" because NFS servers are primitive horrible things configured via animal sacrifice and the smoke signals from burnt offerings. What I should really do is run one of the two instances in a chroot to get different contents for the different addresses. Generally I just killed one to see if that NFS mount became inaccessible or not.)
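The "kill one server and see what becomes inaccessible" check can be roughed out as a reachability probe run from each network context in turn. This is only a minimal sketch, not what I actually ran: the function name is made up for illustration, and it assumes the NFS server listens on TCP port 2049. It can only detect the first failure mode (nothing answering at that address); telling the two servers apart in the second setup still needs different exported contents.

```python
import socket

NFS_PORT = 2049  # assumption: the server under test listens on the usual NFS TCP port


def can_connect(host: str, port: int = NFS_PORT, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds from this context."""
    try:
        # create_connection resolves the host, connects, and raises OSError
        # (refused, unreachable, timed out) if nothing answers.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical usage: run once on the kvm system and once inside the container,
# with the same address, then kill one NFS server instance and run it again.
#   print(can_connect("<address used in the mount command>"))
```

If the probe flips from True to False in one context but not the other after killing a server, the two contexts really were talking to different servers.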

Over the past couple of months I've made the first test setup work (more by adding lots of "don't do that" options to the mount -o list than by patching the kernel, but at least I got it to _work_). This second test setup spectacularly _did_not_work_, and it failed in WEIRD WAYS. Not only does the NFS infrastructure inappropriately cache data and reuse structures it thought referred to the same address (because its caching comparisons didn't take network context into account), but the network layer itself doesn't seem entirely happy routing to two different versions of the same IPv4 address at the same time. (After doing an NFS mount in the container, the host can't access that address anymore. Can't do an NFS mount, can't do a wget... an ssh server bound to that address can't take incoming connections. UNTIL, that is, the container opens a normal userspace connection to that address, such as running wget in the container. I have no idea what's going wrong there, but it's easily reproducible.)
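The caching half of that is easier to see in a toy model. The sketch below is not the kernel's code and every name in it is hypothetical; it just shows what happens when a cache of per-server structures is keyed on the address alone versus on (network context, address). The address is a documentation placeholder, not one from my setup.

```python
class ClientCache:
    """Toy cache of per-server client structures (illustration only)."""

    def __init__(self, namespace_aware: bool):
        self.namespace_aware = namespace_aware
        self._clients = {}

    def _key(self, netns: str, address: str):
        # The buggy comparison ignores which network context asked.
        return (netns, address) if self.namespace_aware else (address,)

    def lookup(self, netns: str, address: str) -> dict:
        key = self._key(netns, address)
        if key not in self._clients:
            # Record who created the entry so reuse across contexts is visible.
            self._clients[key] = {"address": address, "created_in": netns}
        return self._clients[key]


ADDR = "192.0.2.1"  # placeholder address (TEST-NET-1)

naive = ClientCache(namespace_aware=False)
host_client = naive.lookup("host", ADDR)
container_client = naive.lookup("container", ADDR)
# Same object: the container silently reuses the host's server structure.
print(host_client is container_client)

aware = ClientCache(namespace_aware=True)
# Different objects: each network context gets its own entry.
print(aware.lookup("host", ADDR) is aware.lookup("container", ADDR))
```

With the address-only key, whichever context mounts first decides which server everybody else ends up talking to, which matches the "wrong server" failure mode of the second test setup.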

The problem is, right as I started debugging this second test setup I was pulled away for several intense days of working the OpenVZ booth at the SCALE conference, and then I got sick for most of a week afterwards with the flu, and by the time I got back to working on NFS I'd forgotten I'd changed my test setup.

So now I was dealing with VERY DIFFERENT SYMPTOMS, and all sorts of strange new breakage, and I couldn't reproduce the mostly working setup I'd had before, and I couldn't figure out WHY. At first I blamed my "git pull" and tried to bisect it, but a vanilla 2.6.37 was doing this too and I KNEW that used to work. Sticking lots and lots and lots of printk() statements in the kernel wasn't entirely illuminating either. (Once you're more than a dozen calls deep in the NFS and sunrpc code, it's hard to keep it all in your head and remember what you were trying to _do_ in the first place.)

And of course the merge window opened, so I wanted to submit the patches I'd gotten working so far, but I always retest patches before submitting them and when I did that they DID NOT WORK so obviously I couldn't submit them until I figured out WHY...

It wasn't until today that I worked out why what I had USED to work, and where I'd opened a new can of worms that broke everything again. The code hadn't changed, my test sequence had. (It's a perfectly valid test sequence that _should_ work. The kernel _is_ broken. But it's not _new_ breakage, and my patch does fix something _else_ and make it work where it didn't work before, and thus I can give it a good description to help it get upstream.)

So yeah, I've had a fun week.