How NFS mounts work.

So the last post was about how normal filesystems work. NFS is so utterly horrible that it has to either punch holes in, or reimplement, lots of this common infrastructure. (This is not due to having a network connection between server and client: that's a normal server-backed filesystem, and cifs/p9/fuse don't do this. This is due to the designers of NFS being INSANE.) The design of NFS is horrible, the implementation is horrible, and that makes it unclear how I'm supposed to containerize it.

Current Mood: frustrated

How mount works.

Way back when, I rewrote the busybox mount command three times to beat sane behavior out of it. There is no mount spec I could find, so here's how mount actually works:

The point of the mount command is to call the mount system call, which has five arguments, as you can see on the "man 2 mount" page: int mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data);

When you do "mount -t ext2 /dev/sda1 /path/to/mountpoint -o ro,noatime", that information gets parsed and fed into those five system call fields. In this example, the source argument is "/dev/sda1", the target is "/path/to/mountpoint", and the filesystemtype is "ext2".

The other two (mountflags and data) come from the "-o option,option,option" entries. The mountflags argument to the system call holds options for the VFS; the data argument is used by the filesystem driver.

This options string is a list of comma separated values. If there's more than one -o argument on the mount command line, they get glued together (in order) with a comma. The mount command also checks the file /etc/fstab which can specify default options for filesystems, and the ones you specify on the command line get appended to those defaults (if any). Most other mount flags are just synonyms for adding option flags (for example "mount -o remount -w" is equivalent to "mount -o remount,rw"), and behind the scenes they just get appended to the string.

VFS stands for "Virtual File System" and is the common infrastructure shared by different filesystems. It handles common things like making a filesystem read only. The mount command assembles an option string to supply to the "data" argument of the mount syscall, but first it parses that string for VFS options (ro,noexec,nodev,nosuid,noatime...), each of which corresponds to a flag from #include <sys/mount.h>. The mount command removes those options from the string and sets the corresponding bit in mountflags, then the remaining options (if any) form the data argument for the filesystem driver.
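To make that splitting concrete, here's a Python sketch of the logic (the real mount command is written in C; the flag values below are the actual MS_* constants from Linux's <sys/mount.h>, but the option table is abbreviated and the function name is my own):

```python
# Sketch of how mount separates VFS options from filesystem-specific
# ones. Flag values match the MS_* constants in <sys/mount.h>.
MS_RDONLY, MS_NOSUID, MS_NODEV, MS_NOEXEC = 1, 2, 4, 8
MS_NOATIME = 1024

VFS_OPTIONS = {
    "ro": MS_RDONLY, "nosuid": MS_NOSUID, "nodev": MS_NODEV,
    "noexec": MS_NOEXEC, "noatime": MS_NOATIME,
}

def split_options(option_string):
    """Split a comma separated -o string into (mountflags, data)."""
    mountflags = 0
    data = []
    for opt in option_string.split(","):
        if opt in VFS_OPTIONS:
            mountflags |= VFS_OPTIONS[opt]  # VFS handles this one
        else:
            data.append(opt)                # filesystem driver's problem
    return mountflags, ",".join(data)

flags, data = split_options("ro,noatime,errors=remount-ro")
print(flags)  # 1025, i.e. MS_RDONLY | MS_NOATIME
print(data)   # errors=remount-ro
```

The string that survives the filtering is what the filesystem driver eventually sees as its data argument.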

A few quick implementation details: the mountflag MS_SILENT gets set by default, even if there's nothing in /etc/fstab. Some actions (such as --bind and --move mounts, i.e. -o bind and -o move) are pure VFS operations and don't require any specific filesystem at all. The "-o remount" flag requires looking up the filesystem in /proc/mounts and reassembling the full option string, because you don't _just_ pass in the changed flags: you have to hand the system call the complete new state of the filesystem. Lots of the options in /etc/fstab trigger magic behavior (such as "user", which only does anything if the mount command has the suid bit set).
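Here's a hedged Python sketch of that remount reassembly: the /proc/mounts sample below is made up, and the merge policy is simplified from whatever the real mount command does, but it shows why the complete option state has to be rebuilt:

```python
# Sketch of the "-o remount" bookkeeping: the kernel wants the complete
# new option state, so mount merges the old options (from /proc/mounts,
# standard six-field format) with what you passed on the command line.
SAMPLE_PROC_MOUNTS = """\
/dev/sda1 / ext4 rw,noatime,errors=remount-ro 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

def remount_options(proc_mounts, target, new_opts):
    for line in proc_mounts.splitlines():
        source, mountpoint, fstype, options, *_ = line.split()
        if mountpoint == target:
            merged = options.split(",")
            for opt in new_opts.split(","):
                # "ro" and "rw" are mutually exclusive, later ones win
                if opt in ("ro", "rw"):
                    merged = [o for o in merged if o not in ("ro", "rw")]
                if opt not in merged:
                    merged.append(opt)
            return "remount," + ",".join(merged)
    raise ValueError("not currently mounted: " + target)

print(remount_options(SAMPLE_PROC_MOUNTS, "/", "ro"))
```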

But when mounting a _new_ filesystem, the "filesystemtype" argument to the system call specifies which driver to use. All the loaded drivers are listed in /proc/filesystems. A filesystem driver is responsible for putting files and subdirectories under the mount point: any time you open, close, read, write, truncate, list the contents of a directory, move, or delete a file, you're talking to a filesystem driver to do it. (And there's a few miscellaneous actions like ioctl(), stat(), statvfs(), and utime(). Yes I've implemented "touch" and "df" too.)
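The /proc/filesystems format is two columns: an optional "nodev" tag (meaning the driver doesn't want a block device) and the driver name. When you mount a block device without -t, mount tries the untagged entries in turn. A quick Python sketch of parsing that format, using a made-up sample rather than a live /proc read:

```python
# /proc/filesystems lists the loaded drivers; "nodev" marks drivers
# that don't expect a block device. Sample is typical, not exhaustive.
SAMPLE = """\
nodev\tsysfs
nodev\tproc
nodev\ttmpfs
\text4
\tvfat
"""

def block_backed_filesystems(text):
    """Filesystems mount would try for a block device with no -t."""
    result = []
    for line in text.splitlines():
        flags, _, name = line.partition("\t")
        if "nodev" not in flags:
            result.append(name)
    return result

print(block_backed_filesystems(SAMPLE))  # ['ext4', 'vfat']
```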

Different drivers implement different filesystems, and there are four types:

1) Block device backed filesystems, such as ext2 and vfat.

This kind of filesystem driver acts as a lens to look at the block device through, and there's another driver somewhere that implements the block device. The source argument is a path to a block device, a la "/dev/hda1", which stores the contents of the filesystem in a fixed block of sequential storage.

2) Server backed filesystems, such as cifs/samba or fuse.

These drivers convert filesystem operations into a sequential stream of bytes, which they can send through a pipe to talk to a program. The filesystem server could be a local Filesystem in Userspace process (connected through a pipe filehandle), behind a network socket (CIFS and v9fs), behind a char device, and so on. The common attribute is that there's some program on the other end sending and receiving a sequential bytestream.

Note: a lot of these filesystems want to open their own connection to the server, rather than passing the data through a userspace process. This isn't really for performance reasons: in low memory situations a chicken-and-egg problem can develop, where all of a process's pages have been swapped out, but the filesystem needs that process to write data to its backing store in order to free up memory so it can swap the process's pages back in. If this mechanism is providing the root filesystem, that can freeze the system solid.

The source argument for these filesystems indicates where the filesystem lives. It's often in a URL-like format for network filesystems, but it's really just a blob of data that the filesystem driver understands.

3) Ram backed filesystems, such as ramfs and tmpfs.

These are very simple filesystems that don't implement a backing store. Data written to these gets stored in the disk cache, and the driver ignores requests to flush it to backing store (although tmpfs does swap it out). These drivers essentially mount the VFS's page cache as if it was the filesystem.

Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a chunk of memory to implement a block device, which you can then format and mount with a block device backed filesystem driver. This is significantly less efficient than ramfs, and allocates a fixed amount of memory up front for the block device, instead of dynamically resizing itself as files are written into and deleted from the page cache the way ramfs does.

4) Synthetic filesystems, such as proc, sysfs, devpts...

These filesystems don't have any backing store either, because they don't store arbitrary data the way the first three types of filesystems do.

Instead they present artificial contents, which may represent processes or hardware or anything the driver writer wants them to show. Listing or reading from these files calls a driver function that produces whatever output it likes, and writing to these files submits data to the driver which can do anything it wants with it.

These are often implemented to provide monitoring and control knobs for parts of the operating system. It's an alternative to adding more system calls, providing a more human-friendly interface which programs can use, but which users can also drive directly from the command line with "cat" and by redirecting the output of "echo" into a file.

Those are the four types of filesystems: backing store can be a fixed length block of storage, backing store can be some server the driver connects to, backing store can not exist, or the filesystem driver can just make up its contents programmatically.

And that's how filesystems get mounted, using the mount system call with its five arguments. The "filesystemtype" argument specifies the driver implementing one of those filesystems, and the "source" and "data" arguments get fed to that driver. The "target" and "mountflags" arguments get parsed (and handled) by the generic VFS infrastructure (the filesystem driver can peek at them but generally doesn't need to care).
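If you want to poke at the syscall directly without the mount command in the way, here's a hedged Python/ctypes sketch using the ext2 example from above. The MS_* values are the real <sys/mount.h> constants, but the paths are the example's placeholders, and the code only attempts the actual call when running as root (it needs a real block device and mountpoint to succeed):

```python
import ctypes, ctypes.util, os

MS_RDONLY, MS_NOATIME = 1, 1024  # from <sys/mount.h>

libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6",
                   use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                       ctypes.c_ulong, ctypes.c_char_p)

# The five arguments "mount -t ext2 /dev/sda1 /path/to/mountpoint
# -o ro,noatime" boils down to:
source = b"/dev/sda1"
target = b"/path/to/mountpoint"
fstype = b"ext2"
mountflags = MS_RDONLY | MS_NOATIME  # VFS options, parsed out of -o
data = b""                           # nothing left over for the driver

if os.geteuid() == 0:
    if libc.mount(source, target, fstype, mountflags, data) != 0:
        print("mount failed:", os.strerror(ctypes.get_errno()))
else:
    print("not root, would call: mount(%s, %s, %s, %d, %s)"
          % (source, target, fstype, mountflags, data))
```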

I need to buy a scanner.

I got some paperwork from Russia that's for a business visa (pretty sure it's the invitation thing I need, but they wanted to confirm and I still can't read Russian), so I had to scan it and email it back to them so they could confirm it's the right one.

I don't own a scanner. (We got a combo-everything machine once that could theoretically print, fax, and scan, but it didn't really do any of them reliably, and it's long gone.) So I went out to Kinko's. Then I went out again, remembering to bring the actual form this time.

Kinko's could scan the thing, but weren't allowed to email it to me. I'd forgotten to bring a USB key. I almost had them burn it onto a CD for me ($10 charge) but then I remembered my phone could act as a USB key... but I didn't have the cable with me. But they _did_ have the adapter to write to the SD card when I took it out. (Tiny, fiddly little thing behind the battery.)

So, they'd written a PDF to my phone (which made an alarming crunching noise putting the card back in, but then worked fine). My phone utterly refused to acknowledge the file was there, because the only app I have with a "browse" option is an aftermarket text editor, which can only view files as text. Everything else expects files to be only where it put them, so the podcast app and the browser don't share files, although files both of them save wind up intermingled with the "play audio" thing unless I make playlists to exclude them. The "attach" option to the gmail app will only allow you to attach photos you've taken from the "gallery", because obviously nobody would ever want to attach ANY OTHER KIND OF FILE. And my Linux laptop still won't talk to the phone, presumably because the USB driver for the strange thinkpad chipset is wonky. (Actually, I haven't tried since I upgraded to the 2.6.37 kernel, maybe it works now? Just thought of that...)

Anyway, I took the phone home, plugged it into Fade's mac with the cable I had at home, had her email me the file, and forwarded the email to my boss, who will presumably get the tomato juice to Colonel Potter. (M*A*S*H reference, don't worry about it.)

I need to go buy a scanner, but I may be driving to Houston tomorrow instead, to get there by noon when the consulate's desk thing closes. Assuming I don't need a blood test for the new category of visa. (Really, really, really hoping I don't. Needles. I has a problem with them.) Oh, and I need two passport photos and a money order, and it's a 3 hour drive, so I'd really want to hit the road by 8am at the latest to have time at the consulate to fill out the forms and have them review them...

Friday. Friday I may be driving to the Russian Consulate in Houston. Tomorrow I may be getting passport photos and a money order.

Anyway, back to NFS...

Went with dreamhost.

Even with "unlimited storage" and "unlimited bandwidth", it takes a while to upload 14 gigabytes, but the site is slowly going back up. (Currently grinding through the videos.)

On the bright side, there's now a mailing list for aboriginal linux again. On the downside, I haven't gotten mercurial to work on the new server yet. (Mark had it working for, which is redirecting to now, so I know it's possible. Just haven't figured out how yet...)

Anyway, back to NFS. (Well actually, back to converting the lxc man pages to HTML and posting them on the lxc website, and making proper cgroup documentation out of setting up a cgroup page that links to things like the way the existing one links to the various exec flags, and grouping those together under a "documentation tab", and reading the other pages linked from the top of and seeing what of _that_ should be linked to and maybe writing one definitive document instead of lots of little snippets...)

But _also_ NFS...
Current Mood: busy

Need to find new domain hosting for my website.

Anybody have any particular opinions about web hosts? I've been meaning to move for years, but now it's become an issue (it's down until I find new hosting).

I have the contents backed up (um, 14 gigabytes, 11 gigs of which I could ditch fairly easily, it's a directory on my laptop which I was updating via rsync) but I need a hosting service to put it on. I've looked at a few, but dunno how to choose between them.

I'm thinking on the theory they have billboards all around Austin with the text "Know Linux? We're Hiring", which strikes me as a vaguely good sign. But I dunno a thing about them as a hosting company other than that. (It looks like everything except serving mercurial is pretty easy to do. All the rest is static content...)


NFS: The Cobol of Filesystems.

So I broke down and downloaded the current RFC collection to my laptop, because I keep disconnecting from the internet to avoid distraction, and then needing to reconnect because I need to check some RFC. There are now over 5000 RFCs. Most of them seem either concerned with obsoleting each other, or proposing things nobody ever used, many of which nobody ever actually bothered to implement an example of. Sturgeon's Law applies to the IETF, go figure.

The NFS code is evil. The need for "struct nfs_fh" is conceptually disgusting. Two files with the same i_ino is conceptually disgusting. struct nfs_server starts with a struct nfs_client pointer, and struct nfs_client contains a list of superblock structures (cl_superblocks).

Backing up to that last bit: in include/linux/nfs_fs_sb.h there's a "struct nfs_server" which seems kind of important. (Its first entry is a pointer to struct nfs_client, and that has a linked list of struct nfs_server entries. This is horrible and incestuous, but that's a side issue.) Meanwhile, the function nfs_validate_mount_data also mentions nfs_server. This function's job is actually to fill out an nfs_parsed_mount_data structure (so it's really parse, not validate, although said structure gets _allocated_ elsewhere and passed in to this thing, just to keep the control flow fragmented and incomprehensible)... anyway, one of the fields it's filling out from struct nfs_parsed_mount_data is "nfs_server.address".

Note: struct nfs_server does not have an address field. The parsed->nfs_server entry is not a struct nfs_server. No, that's a locally defined anonymous struct in nfs_parsed_mount_data, which is in fs/nfs/internal.h. It's just _named_ nfs_server. As far as I can tell, none of this has anything to do with fs/nfsd (which is not a filesystem, it's the in-kernel nfs server).

So the NFS developers went out of their way to make sure that if you grep for "nfs_server" in the filesystem (client) code, it shows up a lot for purposes totally unrelated to each other. They went out of their way to sabotage grep.

When people talk about the quality of open source code, and all the review it gets, point them at NFS. And laugh.

I think I'm making progress. It's by PROCESS OF ELIMINATION, but there's a finite amount of this crap...



I was writing up a longish post about tracing the flow of control through the nfs and sunrpc code when my laptop battery died and I lost all my open windows, including some state I apparently hadn't saved in a while. (Oops.) On the other hand, the expedition/writeup was turning into one of those "Dr. Livingston I. Presume was Dr. Presume's full name" safaris where you never make it back to your point of origin and instead get lost in the jungle.

So instead I stepped back and hijacked the test code from the getaddrinfo man page, turning it into a quick and dirty UDP intercept and forward daemon, which lets me monitor the darn wire protocol, like so:

Forward 40 bytes from to 9999
Return 24 bytes to
Forward 108 bytes from to 9999
Return 80 bytes to
Forward 40 bytes from to 9999
Return 24 bytes to
Forward 40 bytes from to 9999
Return 24 bytes to
Forward 112 bytes from to 9999
Return 164 bytes to
Forward 112 bytes from to 9999
Return 140 bytes to
Forward 112 bytes from to 9999
Return 164 bytes to
Forward 116 bytes from to 9999
Return 120 bytes to
Forward 112 bytes from to 9999
Return 112 bytes to
Forward 136 bytes from to 9999
Return 32 bytes to
Forward 132 bytes from to 9999
Return 328 bytes to

That is the complete network traffic for mounting an NFSv3 share, listing its contents, and unmounting again. The actual "ls" is just the last pair of packets; everything before that is the mount. (Note that umount creates no actual network traffic, due to the designers being completely insane.) Oh, and mountd and nfsd are currently on the same port, because otherwise I'd have to run two copies of the interceptor and interleave their log output, but I suspect the move from port 35003 to port 50209 was the handoff there.
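The interceptor itself is nothing fancy: a recvfrom()/sendto() loop that remembers who sent each request, relays it upstream, and relays the reply back. Here's a Python sketch of one iteration (the original hack was C from the getaddrinfo man page), with a loopback stand-in for mountd/nfsd and invented payloads, so it's self-contained; a real daemon would loop forever and handle multiple clients:

```python
import socket

# Stand-in for mountd/nfsd on the far end.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

# The interceptor itself.
fwd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fwd.bind(("127.0.0.1", 0))

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for s in (server, fwd, client):
    s.settimeout(5)  # don't hang forever if a packet goes missing

client.sendto(b"NULL call", fwd.getsockname())

# Interceptor: receive from the client, log, forward upstream.
data, client_addr = fwd.recvfrom(65536)
print("Forward %d bytes from %s to %d"
      % (len(data), client_addr[0], server.getsockname()[1]))
fwd.sendto(data, server.getsockname())

# "Server": reply to whoever sent the request (the interceptor).
request, fwd_addr = server.recvfrom(65536)
server.sendto(b"NULL reply", fwd_addr)

# Interceptor: receive the reply, log, relay it back to the client.
reply, _ = fwd.recvfrom(65536)
print("Return %d bytes to %s" % (len(reply), client_addr[0]))
fwd.sendto(reply, client_addr)

print(client.recvfrom(65536)[0].decode())  # NULL reply
```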

If all I had to deal with was the wire protocol, and not the fact that NFS _REIMPLEMENTS_HALF_THE_VFS_LAYER_ this would be merely unpleasant. As it is, my plan currently looks like this:

1) Track down every single packet in that series and get it to happen in the right network context.

2) Track down every single unnecessary optimization NFS is doing that would interfere with running that in two containers mounting from the same server (like merging superblocks and stuffing all non-idempotent transactions into a common cache) and either disable it (preferred) or glue network namespace information to it (if I can't disable it for non-init network namespaces). This involves working out proper test cases.

3) Add more tests like "df nfsdir" and "cat nfsdir/file".

4) Mount the sucker rw and make writing work. (touch, mkdir, mv...)

5) Scream a lot.
Current Mood: melancholy