So I got my quick and dirty NFS fix tested enough to post to the NFS list, and I've been focusing on the non-quick-and-dirty fix. Let's start at the beginning. .
One of the namespaces a container can have is a "network namespace", CONFIG_NET_NS, which gives the container its own network interfaces and routing. The global variable "struct net init_net;" is the initial network context, I.E. the one PID 1 starts with when the system boots up with. before any container starts. The current process's network context is (struct net *) "current->nsproxy->net_ns". (When CONFIG_NET_NS=n current->nsproxy->net_ns always equals &init_net.)
Namespaces are reference counted, so to use one safely you have to get_net(net) and put_net(net), and to compare them for equality use net_eq(net, net). (And to be able to dereference current->nsproxy->net_ns you need to #include <linux/user_namespace.h>.)
My first patch replaced three instances of the global net namespace with the current process's net namespace, I.E.
replace &init_net with current->nsproxy->net_ns in:
That samples the current process context three times, with only a loose guarantee that the current process is the mount process at the points it does it. (It currently is, but I dunno what the design goals of NFS are so I don't want to rely on that staying true in future.) It also relies on the rpc layer to do all the get_net() and put_net() lifetime stuff, which works but means the higher layers haven't got a reference to it after mount exits. It _also_ doesn't prevent superblock merging when an IP address gets reused but means different things in different network contexts.
A better fix is to copy the mount's network context into the NFS superblock, and then use that everywhere a network address gets used. That's a much, much bigger patch, which I've been working on for a while.
So when an NFS filesystem is mounted, I need to copy the network namespace out of the mount command's process context, and use that for all the NFS network operations. The network address is currently stored in struct nfs_client, so that's the place to add in include/linux/nfs_fs_sb.h I found struct nfs_client and added a "struct net *cl_net" next to cl_addr. Then grep for cl_addr in fs/nfs to see who is _using_ that address and make sure a corresponding network context is supplied each time. That turns up:
Instances of cl_addr:
compares with nfs_parsed_mount_data
compares with nfs_parsed_mount_data
called from nfs_get_client()
#nfs4_cb_match_client() - nfsv4
Already has .net, need to get from nfs_client instead of
net .net in nlm_init
enabled via CONFIG_NFS_FSCACHE
The commented out entries are hits in NFSv4 code, which I'm not fixing this time around because it's A) insanely complicated, B) I haven't got a test case for. My goal here is just to not BREAK nfsv4 in non-container contexts. (In a couple places I have to make them grab &init_net; to initialize new fields I'm adding, which is easy to grep for later when I get around to fixing them properly.) I'm also not trying to fix lockd, although my changes _might_ make the access control list stuff magically just work since rpc_clone looks like it's already copying network contexts. Not sure yet.
I have not TOUCHED the NFS server code yet. Ponder that each network context would require its own instance of the portmapper daemon, ye mighty, and despair. (Run userspace NFS servers. It's the only way to be sure.)
Of course all the above is only half the puzzle. I need to initialize nfs_client->cl_net, which means I need to find code guaranteed to run in the mount process context and grab the network reference from there, and incrememt the reference count before the mount process goes away.
Filesystems are defined by a struct file_system_type, and nfs_fs_type is defined in fs/nfs/super.c, which says that nfs_fs_mount() and nfs_kill_super() are the mount and unmount functions. So nfs_fs_mount() is the function guaranteed to be called from the mount process context. (Yes, it was different in 2.6.38 than in current -git, thanks to Al Viro's commit 011949811b946bd3b72fca71200f197c6168a5f8 on March 16th. But the changes are mostly cosmetic from this patch's perspective.)
To get the lifetime rules right, it's tempting to just do a net=get_net(current->nsproxy->net_ns) there, and do a corresponding put_net(net) in nfs_kill_super(). Unfortunately NFS is always a little more complicated, in this case because nfs_fs_mount() doesn't create a superblock directly. Instead it allocates a temporary "parsed_mount_data" structure, fills it out with all the options, and then hands it off to nfs_create_server() to copy the data into a persistent data structure the mount will actually use. Then it copies this new structure into a second temporary structure (nfs_sb_mountdata) and then attempts to sget() an existing instance of this superblock, and if there is one it will nfs_free_server() the new one it just created.
So NFS is acting like a C++ program and spending lots of time marshalling data to itself, copying things into and out of structures which it's the only user of, resulting in buckets of glue code to convert data formats that exist only so they can be converted into other data formats. Doing a get_net() and put_net() as each of these temporary structures is created and destroyed would be kind of disgusting.
On the other hand, we know nfs_parsed_mount_data is only ever filled out in mount context because it's reading data copied from userspace. And currently, that structure's lifetime matches the lifetime of the mount process. So we should be able to copy the net reference out of the process context when we initialize that structure (which happens in the insanely named nfs_validate_parsed_mount_data(), which initializes the structure rather than just checking values like the name implies). Then when we actually create the persistent struct nfs_client instances, then we can mess with the reference counts. (I asked on the NFS list if this was a good approach to the lifetime rules, and got no reply. So I'm trying it and waiting to see if they complain when they see the actual patch.)
The other code path that initializes an nfs_parsed_mount_data instance is nfs_remount(), where we want to copy the existing net context out of the superblock than use the process context. (Remount should not change network contexts, there's no way for that to end well.) Again, _not_ incrementing the reference count here but tying the net lifetimes to the nfs_client instance lifetimes works out fairly well, since if the filesystem we're remounting gets umounted out from under us we're already screwed, so we can rely on the existing VFS infrastructure to not let that happen.
This post is probably long enough, so I'll hold off on the gruesome details of drilling the network context down to where it's used tomorrow.