I'm debugging NFS, not developing it.
[Feb. 15th, 2011|05:36 pm]
For two months, I would fire up my three-level containers test case with the laptop context, the kvm context, and the container context. I'd run the NFS server on one of two laptop interfaces: either 127.0.0.1 (which showed up to the KVM host as 10.0.2.2 but was inaccessible from the container's tun interface), or an lo:1 alias numbered 10.0.2.15 (the same as the eth0 virtual LAN address provided by kvm, which the container could access but the kvm level would consider a local address and not route packets out to the laptop for).
And I would mount the NFS share from the host to confirm I had the server and routing right, then I'd unmount it from KVM, kill the NFS server on the laptop, re-run it on the other address, and try again. It wouldn't work, and I'd spend another couple hours digging through the code debugging why. This test worked with wget and netcat, so I knew the basic network routing was ok, but it refused to work with NFS. I dug into the standards and the code to understand why. I extended the example server from the getaddrinfo() man page into a UDP packet forwarder that logged each packet going by so I could inspect its contents, read the RFCs to understand the wire protocol, and dug down into the code to make enormous callgraphs a la:
( callgraphCollapse )
(And so on, and so forth...)
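For anyone who wants to watch the packets the same way, here is a rough sketch of the kind of logging UDP forwarder described above. It is in Python rather than the C of the getaddrinfo() man page example, and it is not the code from this post: the names hexdump(), forward(), and serve() are made up for illustration, the 2-second reply timeout is an arbitrary assumption, and it only handles one request and one reply at a time.

```python
import socket

def hexdump(data, width=16):
    """Format bytes as offset, hex, and printable-ASCII columns."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:04x}  {hexpart:<{width * 3}} {text}")
    return "\n".join(lines)

def forward(src, dst, dst_addr, label, log=print):
    """Receive one datagram on src, log it, and send it on through dst."""
    data, sender = src.recvfrom(65535)
    log(f"{label}: {sender} -> {dst_addr}, {len(data)} bytes")
    log(hexdump(data))
    dst.sendto(data, dst_addr)
    return data, sender

def serve(listen_addr, server_addr, log=print):
    """Relay requests to server_addr and replies back, one pair at a time.

    Hypothetical helper: addresses and timeout are placeholders.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.bind(listen_addr)
    upstream = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    upstream.settimeout(2.0)  # assumed value; give up on a reply after this
    while True:
        _, client = forward(listener, upstream, server_addr, "request", log)
        try:
            forward(upstream, listener, client, "reply", log)
        except socket.timeout:
            log("no reply from server")
```

Calling something like serve(("0.0.0.0", 2049), ("10.0.2.2", 2049)) and pointing the client at the forwarder would dump every datagram in both directions. Those addresses are placeholders, 2049 is the usual NFS port, and binding to it normally needs root.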
Yesterday, I finally figured out why it wasn't working. It is _embarrassing_, and I really really hate NFS.
The problem is that unmounting an NFS share does not clear its cached superblock. I knew it had crazy superfluous layers of caching, overriding/reimplementing basic VFS behavior to do things like merge superblocks it thinks are the same, but I _thought_ that since everything's reference counted the cached entries had to GO AWAY when there were no more references to them. When you unmount all NFS instances, the cache has to drain and expire, right? Nothing can be USING those cache entries if there are no NFS mounts.
But no, that's not how it works. Instead it takes several minutes for the dead cached entry to time out, and until it does, new mounts will inherit the old cache entries with the wrong routing. So if I mounted a 10.0.2.2 from the host, I had to wait several minutes before mounting another 10.0.2.2 from a container had any chance of working. EVEN THOUGH IT'S NOT THE SAME 10.0.2.2. (Which you could do by unmounting stuff on the HOST, doing ifdown eth0 and ifup eth1. But apparently nobody ever tried switching between conflicting routings when testing NFS mounts.)
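A toy model makes the failure mode easier to see. This is a sketch and NOT the kernel's actual data structures: the ServerCache class, the string namespace labels, and the 300-second timeout are all invented for illustration (the post only says "several minutes"). It shows a cache keyed on server address alone handing a second mount the first mount's routing context until the entry expires, and how adding the network namespace to the key avoids that.

```python
class ServerCache:
    """Toy cache of per-server state that outlives the mounts using it."""

    def __init__(self, ttl, key_on_namespace):
        self.ttl = ttl                      # seconds an unused entry lingers
        self.key_on_namespace = key_on_namespace
        self.entries = {}                   # key -> (owner netns, expiry or None)

    def _key(self, netns, addr):
        # Address alone makes two different 10.0.2.2 servers collide.
        return (netns, addr) if self.key_on_namespace else addr

    def mount(self, netns, addr, now):
        """Return the namespace whose routing this mount ends up using."""
        key = self._key(netns, addr)
        entry = self.entries.get(key)
        if entry is not None:
            owner, expires = entry
            if expires is None or now < expires:
                self.entries[key] = (owner, None)   # reuse the old entry
                return owner
        self.entries[key] = (netns, None)           # fresh entry
        return netns

    def unmount(self, netns, addr, now):
        """Unmounting only starts the expiry clock; the entry stays cached."""
        key = self._key(netns, addr)
        owner, _ = self.entries[key]
        self.entries[key] = (owner, now + self.ttl)


cache = ServerCache(ttl=300, key_on_namespace=False)
cache.mount("host", "10.0.2.2", now=0)
cache.unmount("host", "10.0.2.2", now=10)
# Ten seconds later the container mounts "the same" address and inherits
# the host's routing context:
stale = cache.mount("container", "10.0.2.2", now=20)
```

In this model stale comes back as "host" rather than "container", which is the test case above in miniature: testing from the host first poisons the container mount until the timeout passes.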
Once I figured out NOT TO DO THAT, to boot a fresh kvm instance and test NFS in the container without first having tested it on the host, fixing it so the darn thing could in fact be mounted from within a container (however unreliably) took an afternoon.
(Oh, and despite not keeping the routing contexts quite separate, the RPC layer already seems to be doing all the network context reference counting properly from an object lifetime point of view. So I could just pass in the current process's net_ns and let the RPC layer handle incrementing its references for the lifetime of the transport object. The funky superblock merging and strange incestuous "server belongs to superblock and client belongs to server but all three have references to each other and you can only figure out what thinks it owns what by reading the init code" stuff makes figuring out object lifetimes a giant headache otherwise.)
The actual patch? A half-dozen one liners. Which I first _tried_ two months ago, but it didn't _work_ because my test case was triggering the bad caching behavior.
No, it's not a complete fix. I have to untangle the caching context leaks, figure out where DNS resolution is happening (please be in userspace, please be in userspace), and confirm that .get_sb() is always called from process context. (I _think_ so, but proving a negative involves a lot of grep. I wouldn't put it past these guys to have all this caching and persistence and NOT properly retain context across transient network resets such as masquerading timeouts, so I don't want to rely on the RPC stuff pinning the network namespace unless I can prove there's no other way to get it _back_ so it must be retaining it or other users would already have noticed breakage. There is a remount function but I think it's for "mount -o remount" which is also process context...)
Except... what happens if you call "mount -o remount" on a filesystem you _inherited_? That's in your filesystem namespace, but not in your network namespace? That can't be allowed to _update_ the network namespace after the initial connection.
Just wait, NFSv4 is up next. With delegations. And those almost certainly _can_ be updated via -o remount. Whimper.
I hate NFS. I really, really, really hate NFS. But today, slightly less of it is my problem. And that's progress.