Wrestling with the curse of NFS.

So, with two patches, mounting NFS from within a container (read only) is working fine... but as soon as I do it, the host context can't route to that address anymore. From an entirely different network context, with wget. I can PING it from the host, and the container still works fine but attempts to actually contact it say "no route to host".

It... how... It's completely unrelated. It's not the same container, it's not the same network interface, the routing tables still look fine, AND I CAN PING. HOW DID THAT BREAK?

So now I've got to rip apart the network stack and see what it's doing. Figure out what exactly is going _wrong_, and backtrack from there to identify which of the 8 gazillion strange and inadvisable things NFS is doing is triggering this behavior.

I despise NFS.


If I do a wget from the container, it works again. Doing an NFS mount horks the host's network context, and opening a normal network socket from the container's userspace fixes it.

This is epic levels of weird here. 100% reproducible. It happened in 2.6.37 and it's happening in current -git (darn near 2.6.38)...

Weeeeird... The funny thing is I'm ssh'ed into the box on the interface that's broken, and it continues to work fine while "broken", but I can't make _new_ network connections (either incoming or outgoing, I can't ssh a fresh session into the box) while it's screwed up. It's opening sockets, not sending packets, that's failing... Hmmm...
  • Current Mood: nauseated


Even when I force different superblocks, it's still merging the network routing. I disabled superblock merging (and printed out the pointers to make sure they do not match), and the network routing is STILL getting cached out of the wrong context and used for any second mount instance unless I unmount the first instance and give it two minutes to time out.

The NFS code really has to be going out of its WAY to screw this up. The levels of unnecessary optimization I'm having to chop through continue to surprise me after MONTHS OF FINDING AND DISABLING BROKEN "OPTIMIZATIONS". THERE ARE STILL MORE.

Ok, yeah, it's my fault for actually thoroughly testing what I'm doing. What I had _seemed_ to work back in February. But still... how did this giant pile of interlocking crap ever work for anybody else?
  • Current Mood: enraged

The flu really hangs on.

So I got about 4 good hours yesterday and today looks like the first time in a while I'll be able to CONCENTRATE for longer than that. This whole being sick thing sucked at great length, but I think it's finally worked its way through. (Famous last words.)

So, what have I done recently: triaged my patch list, got the most recent versions of everything and fixed up all the conflicts to make sure they apply to the current -git. I need to check my last postings of each in the archives to see if anybody replied with comments I haven't addressed yet. (Unaware of any, but if they didn't cc: me I probably didn't see it, so...)

Went through livejournal here and added a "documentation" tag to the entries that I or other people might want to refer back to. Some of that I need to put on the lxc.sf.net website, although the format of wordpress is funky and some things can only be accessed via sftp. (My continuing efforts to write a new news post to announce the 0.7.4 release have met with no success. Emailed Daniel about it: it says it posted, but my changes don't show up.)

Upgraded my test environment to the new lxc 0.7.4, which of course ate my test containers. (Sigh. Building from source vs installing the prebuilt debian stuff.) So I had to set the debian container back up, and found out that my notes on doing that were incomplete. I need to update those and make them into a part 3 of my HOWTO series.

All this finally lets me try to fix the NFS superblock merging. Last message I linked to the Linux Weekly news entry where Al Viro ACTUALLY EXPLAINS get_sb() in the context of saying "that was a horrible API, it was my fault, it's going away, and we've already removed all users of it except NFS which as always is doing something uniquely horrible that needs special case handling".

I hate NFS.

Linus hasn't released 2.6.38 yet, and I'm sort of racing against this looming deadline because I don't want my patches to get lost in the noise. (It's probably inevitable anyway. Most of 'em are the kind of bugfixes I can get in even during/after the deluge, but it's a good habit NOT to do that. Now is the ideal time to drop new patches, it's the quiet before the storm. This is actually when Linus schedules his vacations. Being sick right now has not helped.)
  • Current Music: I Fight Dragons cover of Jonathan Coulton's "Future Soon".

Why is Livejournal trying to become a second-rate Facebook?

Games? Really? If I wanted to use Facebook, I'd be using Facebook. If I wanted to visit armorgames or handdrawngames or any of the 8 gazillion other variants of that on the web, I know where they are.

It occurs to me that when other people split their blogs I tend to stop following both forks... and I've split my blog. The open source hobbyist stuff stays back at landley.net/notes.html, the work stuff here, and any other topics sort of wandering between the two. Not exactly ideal.

The advantage of my notes.html file is it's really easy for me to update, so I do so. It's an html file I edit with vi. Downside: no comments (people just email me), and I started sticking span tags in it _years_ ago but haven't done anything with them yet so it's not really good at fielding multiple topics. One big post per day.

I was pondering moving more stuff here and using tags to distinguish it, but people have been predicting Livejournal Going Away for years now and backing it up is a serious pain. (I'm aware of three solutions, all of which are way too many steps for me to have actually bothered yet, let alone doing it regularly.) So committing more to livejournal... unlikely.

Fade's talked about moving her journal to dreamhost or some such. How do I really know this would be an improvement?

(Yeah, still sick. Not entirely coherent just now. Fade says this is a flu, not a cold. Wheee.)
  • Current Mood: tired

Scale! Plague!

Went to Scale in Los Angeles over the weekend with Kir Kolyshkin (LJ user k001), the OpenVZ maintainer, to run a booth on OpenVZ. He was apparently unaware I do marketing, so we ran through the weekend's worth of flyers on Saturday.

I found out this was happening on Wednesday (Feb 23), flew out Thursday morning (on about 2 hours of sleep: I wasn't on a morning schedule at the time), spent Friday helping Kir update the slides and flyers and learning various details about the technology I'd be promoting, spent 9 hours Saturday and 7 hours Sunday at a booth pulling people out of the passing crowds and telling them how great containers are, flew back Monday, and spent Tuesday huddled on the couch being sick.

It could be standard con crud, although the incubation period for that's usually a bit longer. It could also be that while in LA, I drank the water. Or it could be the ill-advised visit to the Marriott hotel restaurant, where Kir and I split an appetizer that can only be described as a "Crap Cake". (Crab is not supposed to smell like old fish.)

So now it's the following Wednesday. Yesterday I didn't leave the house at all, didn't turn on my laptop once, and probably slept more than 12 hours. Today I'm merely coughing, dizzy, feverish, grouchy, unfocused. So a distinct improvement. (Oh, and I've moved from "coughs that taste like blood" to "coughs that give me a splitting headache by somehow pressurizing the inside of my skull". Sort of a lateral move there.)

I'm not sure if yesterday (and today, to be honest) count as sick days or just getting my weekend back. Either way, I have too much to do so I've fired my laptop back up to at least deal with the email backlog.

When I feel up to a longer drive I need to go collect my visa from Houston, and buy plane tickets to Russia. (Oh, and it turns out Spring Break is the week I was planning to visit Russia, so I may have a friend visiting from out of town that week. Hopefully work will be ok with me visiting the following week. I can't go earlier than that because the visa doesn't kick in until the 15th...)


Sigh. My big breakthrough with NFS wasn't tracking down everything that needed to be fixed, but finally managing to narrow down a test case to the point where I could get a very artificial test to perform correctly with only three changes. (And all three changes were the same trick, which probably isn't the _right_ trick from a merging standpoint, but accomplishes the same thing in a much less intrusive way: replace init_net with current->nsproxy->net_ns, because the code is always executed in a specific user context from which the correct network context can be harvested and cached.) There are still three instances of init_net in fs/nfs but they're all nfsv4 and I don't have to care about them just yet. (Good thing too: they're "callbacks", from who knows where, and I have no idea _what_ process context (if any) they get called in...)
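A toy userspace model of that trick, NOT kernel code: the point is that the mount path always runs in the mounting process's context, so instead of hardwiring the global init_net you can harvest the namespace from the current task. All class and function names here are illustrative stand-ins for the kernel structures.

```python
# Toy model (not kernel code) of replacing init_net with
# current->nsproxy->net_ns in code that always runs in process context.

class NetNS:
    def __init__(self, name):
        self.name = name

INIT_NET = NetNS("init_net")            # the host's global namespace

class NSProxy:
    def __init__(self, net_ns):
        self.net_ns = net_ns

class Task:
    """Stands in for the kernel's current task_struct."""
    def __init__(self, net_ns):
        self.nsproxy = NSProxy(net_ns)

def create_transport_old(current):
    # Broken pattern: always uses the host's namespace,
    # no matter which container asked for the mount.
    return INIT_NET

def create_transport_new(current):
    # Fixed pattern: the mounting process's own namespace
    # is right there in its task structure.
    return current.nsproxy.net_ns

container_ns = NetNS("container")
mounter = Task(container_ns)
print(create_transport_old(mounter).name)   # init_net  (wrong context)
print(create_transport_new(mounter).name)   # container
```

The real change is the same shape: a one-line substitution at each call site, with the RPC layer caching the result for the lifetime of the connection.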

I'm still not _quite_ sure about that "always", but I'm not sure how much of this is NFS specific. (What happens if you do "mount -o remount,rw" from a different network context...) This is the realm of follow-up patches, but I still have to care to make it reliable.

And there are eight more instances of init_net in net/sunrpc (and three false positives), and I'm still laboriously grinding my way through that crap to figure out what context each one is called in and where it should be getting its data from. But I don't think any of those are actually the problem I'm seeing, I think the problem is that it's matching cached RPC entries by address without also comparing network namespace.

Oh, my CIFS patch wasn't actually an init_net instance replacement, it was fixing the implicit init_net buried inside the sock_create_kern() function. I've been asked for a follow-up patch there to make kerberos work, but haven't got a kerberos test environment set up yet, and have yet to find where the darn kerberos code _is_. Probably the CIFS_UPCALL stuff... which is spawning a userspace /sbin/request-key binary as a child of the host's PID 1... And in the DNS resolver:

saved_cred = override_creds(dns_resolver_cache);
rkey = request_key(&key_type_dns_resolver, desc, options);

Yeah, that's not going to work... where is override_creds() defined?

The war continues...
  • Current Mood: determined

Just spent several hours on kernel documentation.

I didn't _mean_ to, but the kernel documentation page is in much better shape than it's been in years. (I fixed the menuconfig parser. I made the htmldocs thing translate more than it has in a while, although there's still some bugs. I fixed the RFC file so that it generates links to kernel source in the format the git repository wants _this_ week. And I fixed the top level index so that the interesting stuff is no longer buried under buckets of todo items.)

Back to NFS...

I'm debugging NFS, not developing it.

For two months, I would fire up my three-level containers test case with the laptop context, the kvm context, and the container context. I'd run the NFS server on one of two laptop addresses: either the one that showed up to the KVM host's network (but was inaccessible from the container's tun interface), or an lo:1 alias numbered the same as the eth0 virtual LAN address provided by kvm (which the container could access, but which the kvm level would consider a local address and not route packets out to the laptop for).

And I would mount the NFS share from the host to confirm I had the server and routing right, and then I'd unmount it from KVM, kill the NFS server on the laptop and re-run it on the other address and try again and it wouldn't work, and I'd spend another couple hours digging through the code debugging why. This test worked with wget and netcat, I knew the basic network routing was ok, but it refused to work with NFS. I dug into the standards and the code to understand why. I extended the example server from the getaddrinfo() man page into a UDP packet forwarder that logged each packet going by so I could inspect its contents, read the RFCs to understand the wire protocol, dug down into the code to make enormous callgraphs ala:


(And so on, and so forth...)
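A minimal sketch of that debugging tool: a one-shot UDP forwarder that logs each datagram before relaying it, so the packet contents can be inspected on the wire. This is my reconstruction, not the actual tool grown from the getaddrinfo() man page example; the demo runs entirely on loopback with made-up payloads.

```python
# One-shot UDP forwarder that logs each datagram it relays.
import socket

def forward_one(listen_sock, dst_addr):
    """Receive one datagram, log it, and relay it to dst_addr."""
    data, src = listen_sock.recvfrom(65535)
    print("%s -> %s: %d bytes: %r" % (src, dst_addr, len(data), data[:32]))
    out = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    out.sendto(data, dst_addr)
    out.close()
    return data

# Demo entirely on loopback: client -> forwarder -> "server".
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # port 0: kernel picks a free port
fwd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fwd.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.sendto(b"RPC?", fwd.getsockname())
forward_one(fwd, server.getsockname())
payload, _ = server.recvfrom(65535)
print(payload)                            # b'RPC?'
```

The real version would sit in a loop between the NFS client and the server address, dumping every mount/portmap exchange as it went by.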

Yesterday, I finally figured out why it wasn't working. It is _embarrassing_, and I really really hate NFS.

The problem is that unmounting an NFS share does not clear its cached superblock. I knew it has crazy superfluous layers of cacheing, overriding/reimplementing basic VFS behavior to do things like merge superblocks it thinks are the same, but I _thought_ that since everything's reference counted the cached entries had to GO AWAY when there were no more references to them. When you unmount all NFS instances, the cache has to drain and expire, right? Nothing can be USING those cache entries if there are no NFS mounts.

But no, that's not how it works. Instead it takes several minutes for the dead cached entry to time out, and until it does so new mounts will inherit the old cache entries with the wrong routing. So if I mounted a share from the host, I had to wait several minutes before mounting another from a container had any chance of working, EVEN THOUGH IT'S NOT THE SAME NETWORK CONTEXT. (You could presumably trigger the same thing entirely on the HOST by unmounting stuff, doing ifdown eth0 and ifup eth1. But apparently nobody ever tried switching between conflicting routings when testing NFS mounts.)
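A toy model of that cacheing behavior, not the real sunrpc code: entries are matched by server address alone, and an entry whose refcount hits zero isn't freed but lingers until a timeout, so a second mount inside the grace period inherits the first mount's network context. The grace period and the server address below are illustrative values.

```python
# Toy model of a refcounted cache whose dead entries linger on a timer.

class Entry:
    def __init__(self, addr, net_ns):
        self.addr, self.net_ns = addr, net_ns
        self.refcount, self.expires = 1, None

class ClientCache:
    GRACE = 120.0          # seconds an unreferenced entry lingers (made up)

    def __init__(self):
        self.entries = []

    def get(self, addr, net_ns, now):
        for e in self.entries:
            if e.expires is not None and e.expires <= now:
                continue                 # dead entry finally timed out
            if e.addr == addr:           # BUG: matches address, ignores net_ns
                e.refcount += 1
                e.expires = None
                return e
        e = Entry(addr, net_ns)
        self.entries.append(e)
        return e

    def put(self, e, now):
        e.refcount -= 1
        if e.refcount == 0:
            e.expires = now + self.GRACE   # linger instead of going away

cache = ClientCache()
host_mount = cache.get("10.0.2.2", "host_ns", now=0.0)
cache.put(host_mount, now=1.0)           # umount on the host: refcount is 0...

# ...but a container mounting the same server address 10 seconds later
# inherits the stale entry, and with it the host's routing:
c = cache.get("10.0.2.2", "container_ns", now=11.0)
print(c.net_ns)                          # host_ns  (wrong!)

# Only after the grace period does a fresh mount get its own entry:
cache.put(c, now=12.0)
c2 = cache.get("10.0.2.2", "container_ns", now=500.0)
print(c2.net_ns)                         # container_ns
```

Which is exactly why testing the host mount first poisoned every container test that followed it.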

Once I figured out NOT TO DO THAT, to boot a fresh kvm instance and test NFS in the container without first having tested it on the host, fixing it so the darn thing could in fact be mounted from within a container (however unreliably) took an afternoon.

(Oh, and despite not keeping the routing contexts quite separate, the RPC layer already seems to be doing all the network context reference counting properly from an object lifetime point of view. So I could just pass in the current process's net_ns and let the RPC layer handle incrementing its references for the lifetime of the transport object. The funky superblock merging and strange incestuous "server belongs to superblock and client belongs to server but all three have references to each other and you can only figure out what thinks it owns what by reading the init code" stuff makes figuring out object lifetimes a giant headache otherwise.)
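The object-lifetime point above, as a toy model (names are illustrative stand-ins, not the kernel's): the transport pins the namespace with get_net() when it's created and drops it with put_net() when it's destroyed, so the caller can hand in current->nsproxy->net_ns and trust the namespace won't die underneath the connection.

```python
# Toy model of get_net()/put_net() reference pinning by a transport object.

class NetNS:
    def __init__(self, name):
        self.name, self.count = name, 1   # one reference: the owning process

def get_net(ns):
    """Take a reference on the namespace."""
    ns.count += 1
    return ns

def put_net(ns):
    """Drop a reference; at zero the kernel would tear the namespace down."""
    ns.count -= 1

class Transport:
    def __init__(self, ns):
        self.ns = get_net(ns)    # allocation pins the namespace
    def destroy(self):
        put_net(self.ns)         # teardown unpins it

ns = NetNS("container")
xprt = Transport(ns)
print(ns.count)    # 2: the process plus the transport
xprt.destroy()
print(ns.count)    # 1: back to just the process
```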

The actual patch? A half-dozen one liners. Which I first _tried_ two months ago, but it didn't _work_ because my test case was triggering the bad cacheing behavior.

No, it's not a complete fix. I have to untangle the cacheing context leaks, figure out where DNS resolution is happening (please be in userspace, please be in userspace), confirm that .get_sb() is always called from process context (I _think_ so but proving a negative involves a lot of grep. I wouldn't put it past these guys to have all this cacheing and persistence and NOT properly retain context across transient network resets such as masquerading timeouts, so I don't want to rely on the RPC stuff pinning the network namespace unless I can prove there's no other way to get it _back_ so it must be retaining it or other users would already have noticed breakage. There is a remount function but I think it's for "mount -o remount" which is also process context...)

Except... what happens if you call "mount -o remount" on a filesystem you _inherited_? One that's in your filesystem namespace, but not in your network namespace? That can't be allowed to _update_ the network namespace after the initial connection.

Just wait, NFSv4 is up next. With delegations. And those almost certainly _can_ be updated via -o remount. Whimper.

I hate NFS. I really, really, really hate NFS. But today, slightly less of it is my problem. And that's progress.
  • Current Mood: relieved

I was oversimplifying.

The NFS mount syscall winds up in nfs_get_sb() which calls nfs_validate_mount_data() which calls nfs_try_mount() which calls nfs_mount() which calls rpc_create() which calls xprt_create_transport() which walks xprt_list to find xs_setup_udp() which calls xs_setup_xprt() which calls xprt_alloc() which does the get_net() that pins the network namespace I supplied to it.

It then loses track of that namespace later on, sending the packet twice (once out of each namespace) and then delivering the return packet to who knows where. (I repeat: this is just to make the unnecessary lookup for mountd to work, so it can find nfsd. It's attempting to locate the correct server to talk to. I haven't asked it to do a DNS lookup yet.)

I have a large pile of energy drinks and am attempting to FIX THIS TONIGHT. We'll see...
  • Current Mood: nauseated