Rob Landley (landley) wrote,

Containerizing a filesystem: design level.

At the design level, the problem with containerizing a network filesystem is contexts and lifetime rules. Each process context has access to a network namespace, but filesystem operations are not tied to a specific process context: children can change their network namespace but inherit an NFS mount which should continue to work for them, and when data comes back from a device it generates an interrupt which is handled without any process context. (The handler code can look up a process context to route the event to, but that's not relevant here because an NFS mount shouldn't go away when the mount command exits.)

So at mount time, we need to copy information from the process context of the mount command into the persistent information attached to the superblock of the new mount. Then we need to adjust all the users of that information to use the context cached in the mount structure, and not the global variable "init_net".
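
To make that concrete, here's roughly what the mount-time half looks like. This is a sketch with made-up struct and function names (not a real filesystem's patch): grab a reference to the mounting process's network namespace while we still have its process context, park it in the per-superblock info, and drop that reference when the superblock goes away.

#include <linux/fs.h>
#include <linux/net.h>
#include <linux/nsproxy.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/socket.h>
#include <net/net_namespace.h>

/* Hypothetical per-mount info: the cached "net" pointer is the new bit,
 * the other fields stand in for whatever the filesystem already tracks. */
struct example_fs_info {
        struct net *net;                        /* netns captured at mount time */
        struct sockaddr_storage server_addr;    /* already there: where the server is */
        struct socket *sock;                    /* already there: the connection itself */
};

static int example_fill_super(struct super_block *sb, void *data, int silent)
{
        struct example_fs_info *info = kzalloc(sizeof(*info), GFP_KERNEL);

        if (!info)
                return -ENOMEM;

        /* We're still running in the mount command's process context here,
         * so current->nsproxy->net_ns is the namespace the admin meant.
         * Take a reference so it outlives the mount command. */
        info->net = get_net(current->nsproxy->net_ns);
        sb->s_fs_info = info;
        return 0;
}

static void example_kill_sb(struct super_block *sb)
{
        struct example_fs_info *info = sb->s_fs_info;

        kill_anon_super(sb);
        put_net(info->net);     /* drop the reference taken at mount time */
        kfree(info);
}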

In the case of sane filesystems, this is trivial. CIFS required changing four places: add a network context entry to the mount struct that was already saving the network address info for the connection, initialize that entry from process context during mount, establish the initial TCP/IP connection using the cached info, and if the connection needs to be re-established (CIFS servers reboot, linksys routers time out inactive masquerading routes) also use that cached info.
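
In code terms those last three pieces collapse into one idiom, which I'll sketch with the same made-up names as above rather than quote the actual CIFS diff: wherever the connection code used to create its socket (implicitly in init_net), hand it the namespace cached at mount time instead. Since reconnect goes through the same helper, it inherits the fix for free.

#include <linux/in.h>
#include <linux/net.h>
#include <net/net_namespace.h>

/* One helper for both the initial connection and any later reconnect,
 * so the cached namespace gets used in both places automatically. */
static int example_connect(struct example_fs_info *info)
{
        struct socket *sock;
        int rc;

        /* Create the socket in the namespace cached at mount time,
         * instead of letting it default to init_net. */
        rc = __sock_create(info->net, PF_INET, SOCK_STREAM, IPPROTO_TCP,
                           &sock, 1);
        if (rc < 0)
                return rc;

        /* Sketch assumes IPv4, matching PF_INET above. */
        rc = kernel_connect(sock, (struct sockaddr *)&info->server_addr,
                            sizeof(struct sockaddr_in), 0);
        if (rc < 0) {
                sock_release(sock);
                return rc;
        }

        info->sock = sock;
        return 0;
}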

The only reason I haven't already done a similar conversion on the p9 filesystem is A) I haven't found a good server to test it with (other than the one built into QEMU, which uses virtio instead of IPv4 as its transport layer -- if I were doing this in a hobbyist context I'd stop and fix QEMU's P9 server to use IPv4, arrange for a good userspace P9 server even if I had to write one, then fix p9's filesystem driver to work in containers, then unstack back to whatever I'd diverged from).

Oh, and B) the p9 code doesn't have the reconnect stuff CIFS does: if its initial network connection goes away due to a server reboot, a masquerading timeout, or one of those transient network errors that drops enough retransmit attempts that the packet sequence numbers get scrambled, then you have to unmount and remount to get it back. Somebody did some work to fix that as a Google Summer of Code project, but honestly it looks pretty straightforward just doing what CIFS does.

P9 is reasonably simple and clean, and begs to be extended and filled out because there's low-hanging fruit everywhere, but despite the appeal of banging on P9 and speeding up the obsolescence of NFS, allowing myself to be extensively distracted from the steaming pile of NFS I have been tasked with shoveling would be unprofessional. So.

At the top of fs/nfs/super.c, we have this comment:
* - superblocks are indexed on server only - all inodes, dentries, etc. associated with a
*   particular server are held in the same superblock
* - NFS superblocks can have several effective roots to the dentry tree
* - directory type roots are spliced into the tree when a path from one root reaches the root
*   of another (see nfs_lookup())

You hear that? Multiple NFS mounts SHARE A SUPERBLOCK if they're exported from the same server. If I'd missed that comment, would I have even thought to test mounting different NFS shares, from the host and from a container, off of the same server?

With sane filesystems, each mount is its own context, which can be attached to a specific network namespace. With NFS, multiple mounts share the same context by _default_, due to premature optimization (the root of all evil) stretching back to the 1980's.
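
The shape of the fix is at least easy to describe, even if the NFS code path isn't: the compare callback NFS hands to sget() to find an existing superblock has to treat the network namespace as part of the superblock's identity, so two mounts only share a superblock when both the server and the netns match. Something like this (again invented names, reusing the example_fs_info sketch from earlier; the real nfs_compare_super() checks a lot more than this):

#include <linux/fs.h>
#include <linux/string.h>

/* The kind of test callback you'd hand to sget(): only reuse an existing
 * superblock if it talks to the same server AND was mounted from the same
 * network namespace.  "data" is whatever the mount path packed up. */
struct example_mount_request {
        struct net *net;                        /* the mounter's netns */
        struct sockaddr_storage server_addr;    /* which server they asked for */
};

static int example_compare_super(struct super_block *sb, void *data)
{
        struct example_mount_request *req = data;
        struct example_fs_info *old = sb->s_fs_info;

        /* Same server but a different namespace is NOT the same context:
         * refuse to share, so each container gets its own superblock. */
        if (old->net != req->net)
                return 0;
        if (memcmp(&old->server_addr, &req->server_addr,
                   sizeof(req->server_addr)) != 0)
                return 0;
        return 1;
}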

Here's another one, from fscache.c:
/*
 * Get the per-client index cookie for an NFS client if the appropriate mount
 * flag was set
 * - We always try and get an index cookie for the client, but get filehandle
 *   cookies on a per-superblock basis, depending on the mount flags
 */

Yup, I have to set a flag to tell this aspect of the cacheing infrastructure NOT to throw multiple mounts together into the same big pot and give it a stir.

Premature optimization is the root of all evil. This mess has been "optimized" within an inch of its life since the 1980's. (That's why all the cacheing. That's why it did UDP instead of TCP in the first place.) I have to stare at the code for days just to understand these "optimizations" well enough to SWITCH THEM OFF, and when it doesn't work it could be ANYTHING. The codepath this goes through is inexcusably long. (My first three attempts to trace my way through the code, from different starting points, NEVER CONNECTED UP.)

The only SANE approach to all this is to THROW IT OUT AND START OVER FROM SCRATCH. (Which is what the p9 guys are doing.) Unfortunately, the pointy-haired are still attached to NFS, "the cobol of filesystems"...

And it's degenerated into a rant again. Time for another break.
Tags: dullboy