How NFS mounts work.
[Feb. 13th, 2011|09:54 pm]
So the last post was about how normal filesystems work. NFS is so utterly horrible that it has to either punch holes in or reimplement lots of this common infrastructure. (This is not due to having a network connection between server and client: that's a normal server-backed filesystem, and cifs/p9/fuse don't do that. This is due to the designers of NFS being INSANE.) The design of NFS is horrible, the implementation is horrible, and that makes it unclear how I'm supposed to containerize it.
The Virtual File System (VFS) is a bunch of filesystem-independent infrastructure shared by all the filesystem implementations... except NFS, which reimplements huge portions of it because it's that crazy and nonstandard. That's a problem when you're trying to containerize it.
For example, the first network action my test case does is talk to the mount server to figure out where the NFS server lives. (Why are they separate? Because Sun Microsystems was insane. It's a theme here.)
As with all filesystems, the filesystemtype field specifies which driver to use, and "nfs" says to use nfs. The VFS parses the target and mountflags, and NFS gets passed two data blobs: source and data. The source is something like "126.96.36.199:/path" (except when it's "[188.8.131.52]:/path"). Except the options in the data blob can also say where the server lives, because this wouldn't be NFS if it wasn't fragmented and redundant.
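To make the two source formats concrete, here's a rough userspace sketch of splitting a source string into host and path. This is not the kernel's parser: split_nfs_source() is a hypothetical helper made up for illustration, and the reading of the bracketed form as a way to delimit addresses that contain colons is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not kernel code): split "host:/path" or
 * "[host]:/path" into host and path. Returns 0 on success, -1 if the
 * string is malformed or a buffer is too small. */
int split_nfs_source(const char *source,
                     char *host, size_t host_len,
                     char *path, size_t path_len)
{
    const char *host_start, *host_end, *path_start;
    size_t n;

    assert(source && host && path);

    if (source[0] == '[') {
        /* Bracketed form: the host ends at ']' which must be followed by ':'. */
        host_start = source + 1;
        host_end = strchr(host_start, ']');
        if (!host_end || host_end[1] != ':')
            return -1;
        path_start = host_end + 2;
    } else {
        /* Plain form: split at the first colon. */
        host_start = source;
        host_end = strchr(source, ':');
        if (!host_end)
            return -1;
        path_start = host_end + 1;
    }

    n = (size_t)(host_end - host_start);
    if (n == 0 || n >= host_len || strlen(path_start) >= path_len)
        return -1;

    memcpy(host, host_start, n);
    host[n] = '\0';
    strcpy(path, path_start);
    return 0;
}
```

Feeding it "192.0.2.1:/export" (a documentation address, picked arbitrarily) should give host "192.0.2.1" and path "/export". The bracketed form gets its brackets stripped. Remember the options blob can override all of this anyway.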
So in the case of NFS, the mount system call goes into the kernel, which does a lookup on filesystemtype and winds up finding struct file_system_type nfs_fs_type. (It's a global variable that gets added to a linked list when the NFS driver gets loaded; that's how registering a filesystem works.) That structure contains a pointer to the function nfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *raw_data, struct vfsmount *mnt) which "gets a superblock", which is sort of like having an "nfs_mount()" function except more broken. Three of those arguments (flags, dev_name, and raw_data) come straight from the system call arguments (mountflags, source, and data respectively). The other two are provided by the VFS: the first is our nfs_fs_type.
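The register-then-lookup dance can be modeled in userspace in a few lines. This is a toy, not the kernel's code: struct fs_type, register_fs(), find_fs(), do_mount_model() and the toy_nfs_* names are all invented stand-ins, and the get_sb signature drops the vfsmount argument to keep things short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for the kernel's struct file_system_type. */
struct fs_type {
    const char *name;
    int (*get_sb)(struct fs_type *fs_type, int flags,
                  const char *dev_name, void *raw_data);
    struct fs_type *next;
};

static struct fs_type *fs_list;   /* head of the list of registered drivers */

/* Append a driver to the list; refuse duplicates by name. */
int register_fs(struct fs_type *fs)
{
    struct fs_type **p;

    assert(fs && fs->name);
    for (p = &fs_list; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Walk the list looking for a driver with a matching name. */
struct fs_type *find_fs(const char *name)
{
    struct fs_type *fs;

    for (fs = fs_list; fs; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}

int toy_nfs_calls;   /* counts how often the toy driver was invoked */

/* Toy "get a superblock": it only records that it was called. */
int toy_nfs_get_sb(struct fs_type *fs_type, int flags,
                   const char *dev_name, void *raw_data)
{
    (void)fs_type; (void)flags; (void)dev_name; (void)raw_data;
    toy_nfs_calls++;
    return 0;
}

struct fs_type toy_nfs_fs_type = { .name = "nfs", .get_sb = toy_nfs_get_sb };

/* Model of the mount path: look up the driver by filesystemtype, then
 * hand it mountflags, source, and data unchanged. */
int do_mount_model(const char *source, const char *filesystemtype,
                   int mountflags, void *data)
{
    struct fs_type *fs = find_fs(filesystemtype);

    if (!fs)
        return -1;
    return fs->get_sb(fs, mountflags, source, data);
}
```

The point of the model: the driver gets found purely by name, and the three syscall arguments arrive at get_sb untouched, with the fs_type pointer supplied by the lookup.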
Glossing over HUGE REAMS of stupid, nfs_get_sb() calls nfs_validate_mount_data() which calls nfs_try_mount() which calls nfs_mount() which calls rpc_create() which is where the first actual network connection gets opened. (Yeah, there's some digging involved in understanding this stuff.)
Along the way, it parses the raw_data (options), which is generally a comma separated string where the individual entries might be "name=value" pairs. The raw_data blob gets passed to nfs_validate_mount_data() which figures out if it's a comma separated string or instead a binary blob created by the mount.nfs program. This is done by a giant horrible switch statement with 7 cases, which is basically checking to see if the first 4 bytes are integer values 1-6 (since three bytes of that are zero and a null terminated string can't accidentally start with that), and if so that's SIX DIFFERENT VERSIONS of the binary blob.
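Here's roughly what that check and the text option lookup amount to, as a userspace sketch. It is not the kernel's switch statement: classify_mount_data() and find_option() are hypothetical names, the real code's handling of each blob version is left out entirely, and the option names used with it ("addr", "vers") are just illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum mount_data_kind { MOUNT_DATA_TEXT, MOUNT_DATA_BINARY };

/* Hypothetical classifier: treat the blob as binary if its first 4 bytes,
 * read as a native integer, are 1 through 6. Anything else is text. */
enum mount_data_kind classify_mount_data(const void *raw_data, size_t len,
                                         uint32_t *version)
{
    uint32_t first;

    assert(raw_data && version);
    *version = 0;
    if (len < sizeof first)
        return MOUNT_DATA_TEXT;
    memcpy(&first, raw_data, sizeof first);
    if (first >= 1 && first <= 6) {
        *version = first;
        return MOUNT_DATA_BINARY;
    }
    return MOUNT_DATA_TEXT;
}

/* Hypothetical lookup in a comma separated option string. Matches a bare
 * "name" or a "name=value" entry. Returns 1 and copies the value (empty
 * for a bare name) if found, 0 if absent, -1 if the value doesn't fit. */
int find_option(const char *opts, const char *name,
                char *value, size_t value_len)
{
    size_t name_len = strlen(name);
    const char *p = opts;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t entry_len = end ? (size_t)(end - p) : strlen(p);

        if (entry_len >= name_len && strncmp(p, name, name_len) == 0 &&
            (entry_len == name_len || p[name_len] == '=')) {
            size_t vlen = entry_len == name_len ? 0 : entry_len - name_len - 1;

            if (vlen >= value_len)
                return -1;
            memcpy(value, p + entry_len - vlen, vlen);
            value[vlen] = '\0';
            return 1;
        }
        p += entry_len;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

A text blob like "ro,addr=192.0.2.1,vers=3" falls through to the text case, where find_option() pulls out individual entries. A buffer whose first word is 3 gets flagged as version 3 of the binary format.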
Note: there is no need for the binary blob case at all: the default case which parses text options can do everything the binary blob codepath can do. In fact, nfs_try_mount() is _only_ called from the default case, so the text options can do _more_. Sigh. (Yes, THE WHOLE OF NFS IS LIKE THAT.)
The purpose of nfs_validate_mount_data() is to parse options and dev_name to populate two structures: nfs_parsed_mount_data and nfs_fh. In theory, this is where all the network stuff gets determined (although not necessarily where it gets _used_). The obvious thing to do is add a "struct net *net" to nfs_parsed_mount_data, fill it in from nfs_validate_mount_data() (which has to be called from process context, since two of its arguments came from a system call), and then use that context from the appropriate places.
Of course, it's not that easy. First, NFS performs buckets of premature optimizations, such as caching NFS transactions and merging superblocks, which screw up the implementation in ways that aren't obvious in the design. Secondly, this code is a tangled mess of partial compartmentalization, which means that pieces of the code can't see what other pieces of the code are doing, even when they're the only caller/user of something.
Specifically, nfs_parsed_mount_data contains structures named mount_server and nfs_server. It defines these structures in-line, so they have instance names but no struct name. The function nfs_try_mount() copies data out of these into a new struct nfs_mount_request, which it passes to nfs_mount(). Then nfs_mount() copies the data out of _that_ into struct rpc_create_args, which it passes on to
Yes, NFS spends all its time copying data from one structure to another equivalent structure so it can talk to itself. This is what happens when object "orientation" turns into "fixation", you wind up MARSHALLING DATA TO YOURSELF. This is C++ disease, this is a failure mode of Java. Seeing it in the Linux kernel, implemented in C, is just _sad_.
This presents me with a design problem: do I add to all this unnecessary copying to be consistent with what's already there (which produces a huge intrusive patch), or do I just prove that the only users of these functions are calling them in response to the mount system call, and thus it's always going to be called from process context, and therefore I can reach out and touch current->nsproxy->net_ns from places currently using init_net? (For at least mountd, it _is_ only used from mount context.)
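The two options can be put side by side in a userspace model. Everything here is a mock: struct net, struct nsproxy, struct task and current_task only imitate the shape of the kernel's current->nsproxy->net_ns chain, and the three *_threaded functions and their structs are simplified stand-ins for the nfs_try_mount() to nfs_mount() to rpc_create() hand-offs, not the real definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Mock types that only imitate the shape of the kernel structures. */
struct net { int id; };
struct nsproxy { struct net *net_ns; };
struct task { struct nsproxy *nsproxy; };

struct net init_net = { 0 };   /* models the initial network namespace */
struct task *current_task;     /* models the kernel's "current" */

/* Option A: add a net field to every structure in the copy chain. */
struct parsed_mount_data { const char *hostname; struct net *net; };
struct mount_request { const char *hostname; struct net *net; };
struct create_args { const char *servername; struct net *net; };

/* Last step: uses whatever namespace was threaded down to it. */
struct net *create_threaded(const struct create_args *args)
{
    return args->net;
}

/* Middle step: copies the request into create_args, net included. */
struct net *mount_threaded(const struct mount_request *req)
{
    struct create_args args = {
        .servername = req->hostname,
        .net = req->net,
    };

    return create_threaded(&args);
}

/* First step: copies the parsed data into a request, net included. */
struct net *try_mount_threaded(const struct parsed_mount_data *parsed)
{
    struct mount_request req = {
        .hostname = parsed->hostname,
        .net = parsed->net,
    };

    return mount_threaded(&req);
}

/* Option B: skip the plumbing and read the namespace from the calling
 * task at the point of use. Only valid when a task is known to be the
 * caller, i.e. in process context. */
struct net *create_from_current(void)
{
    assert(current_task && current_task->nsproxy);
    return current_task->nsproxy->net_ns;
}
```

Both paths land on the same namespace when the caller is the mounting process. Option A costs a new field and a new copy at every hop. Option B costs a proof that nothing ever calls the bottom function from outside the mount system call.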
And this is still the easy part...