How mount works.
[Feb. 12th, 2011|09:48 pm]
Way back when, I rewrote the busybox mount command three times to beat sane behavior out of it. There is no mount spec I could find, so here's how mount actually works:
The point of the mount command is to call the mount system call, which has five arguments as you can see on the "man 2 mount" page: int mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data);
When you do "mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime", that information gets parsed and fed into those five system call fields. In this example, the source argument is "/dev/sda1", the target is "/path/to/mntpoint", and the filesystemtype is "ext2".
The other two (mountflags and data) come from the "-o option,option,option" entries. The mountflags argument to the system call holds options for the VFS; the data argument is used by the filesystem driver.
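To make the mapping concrete, here's a minimal Python sketch of how the example command line lines up with the five arguments. The helper names (syscall_args, sys_mount) are made up for this illustration, the two flag values are copied from the Linux headers, and actually calling sys_mount needs root on a Linux system, so the example only builds the argument list.

```python
import ctypes
import ctypes.util
import os

# Values of two VFS flags as defined in the Linux headers (<sys/mount.h>).
MS_RDONLY = 1
MS_NOATIME = 1024


def syscall_args(source, target, filesystemtype, mountflags=0, data=None):
    """Return the five arguments in the order mount(2) expects them."""
    def enc(s):
        return None if s is None else os.fsencode(s)
    return (enc(source), enc(target), enc(filesystemtype), mountflags, enc(data))


def sys_mount(source, target, filesystemtype, mountflags=0, data=None):
    """Call the mount system call through libc (hypothetical helper, needs root)."""
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_char_p)
    libc.mount.restype = ctypes.c_int
    args = syscall_args(source, target, filesystemtype, mountflags, data)
    if libc.mount(*args) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


# "mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime" boils down to:
example = syscall_args("/dev/sda1", "/path/to/mntpoint", "ext2",
                       MS_RDONLY | MS_NOATIME)
```

Here "ro" and "noatime" both turned into bits in mountflags, so nothing is left over for the data argument and it stays NULL (None).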
This options string is a list of comma separated values. If there's more than one -o argument on the mount command line, they get glued together (in order) with a comma. The mount command also checks the file /etc/fstab, which can specify default options for filesystems, and the ones you specify on the command line get appended to those defaults (if any). Most other mount command line flags are just synonyms for adding options (for example "mount -o remount -w" is equivalent to "mount -o remount,rw"), and behind the scenes they just get appended to the string.
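A rough Python sketch of that gluing step, under some simplifying assumptions: build_option_string is a made-up name, the synonym table only covers -w and -r, and synonym flags are appended after all the -o arguments instead of tracking their exact position on the command line.

```python
def build_option_string(fstab_defaults, o_args, extra_flags=()):
    """Glue fstab defaults, each -o argument, and synonym flags into one string."""
    # Partial, illustrative table of command line flags that stand for options.
    synonyms = {"-w": "rw", "-r": "ro"}
    parts = []
    if fstab_defaults:
        parts.append(fstab_defaults)   # defaults from /etc/fstab come first
    parts.extend(o_args)               # then every -o argument, in order
    parts.extend(synonyms[flag] for flag in extra_flags)
    return ",".join(parts)


# "mount -o remount -w" ends up with the same string as "mount -o remount,rw".
remount_rw = build_option_string("", ["remount"], ["-w"])
```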
VFS stands for "Virtual File System" and is the common infrastructure shared by different filesystems. It handles common things like making the filesystem read only. The mount command assembles an option string to supply to the "data" argument of the mount syscall, but first it parses it for VFS options (ro,noexec,nodev,nosuid,noatime...) each of which corresponds to a flag from #include <sys/mount.h>. The mount command removes those options from the string and sets the corresponding bit in mountflags, then the remaining options (if any) form the data argument for the filesystem driver.
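Here's a minimal sketch of that split in Python. The flag values are the ones from the Linux headers, but the table is deliberately partial, and split_options is a name invented for this example, not how any particular mount implementation spells it.

```python
# Flag values as defined in the Linux headers (<sys/mount.h>).
MS_RDONLY = 1
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_NOATIME = 1024

# Option name -> (bits to set, bits to clear). Partial table for illustration.
VFS_OPTIONS = {
    "ro": (MS_RDONLY, 0),
    "rw": (0, MS_RDONLY),
    "nosuid": (MS_NOSUID, 0),
    "suid": (0, MS_NOSUID),
    "nodev": (MS_NODEV, 0),
    "dev": (0, MS_NODEV),
    "noexec": (MS_NOEXEC, 0),
    "exec": (0, MS_NOEXEC),
    "noatime": (MS_NOATIME, 0),
    "atime": (0, MS_NOATIME),
}


def split_options(option_string, mountflags=0):
    """Split an option string into (mountflags, data) for the system call."""
    data = []
    for opt in option_string.split(","):
        if not opt:
            continue
        if opt in VFS_OPTIONS:
            # VFS option: remove it from the string and adjust the flag bits.
            set_bits, clear_bits = VFS_OPTIONS[opt]
            mountflags = (mountflags | set_bits) & ~clear_bits
        else:
            # Anything the VFS doesn't recognize is left for the driver.
            data.append(opt)
    return mountflags, ",".join(data)
```

Because the options are processed in order, a later "rw" undoes an earlier "ro", which is what makes appending command line options to the /etc/fstab defaults work.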
A few quick implementation details: the mountflag MS_SILENT gets set by default even if there's nothing in /etc/fstab. Some actions (such as --bind and --move mounts, i.e. -o bind and -o move) are just VFS stuff and don't require any specific filesystem at all. The "-o remount" flag requires looking up the filesystem in /proc/mounts and reassembling the full option string, because you don't _just_ pass in the changed flags: you have to reassemble the complete new filesystem state to give the system call. Lots of the options in /etc/fstab trigger magic behavior (such as "user", which only does anything if the mount command has the suid bit set).
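The remount lookup might look something like this in Python. It's a sketch that takes the contents of /proc/mounts as a string (so it can be tried on sample text); remount_options and unescape are hypothetical names, and the only escape handling is the octal form /proc/mounts uses for things like spaces in paths.

```python
import re


def unescape(field):
    """Decode the octal escapes (such as \\040 for a space) used in /proc/mounts."""
    return re.sub(r"\\([0-7]{3})", lambda m: chr(int(m.group(1), 8)), field)


def remount_options(proc_mounts_text, target, changes):
    """Rebuild the full option string for remounting the filesystem at target."""
    current = None
    for line in proc_mounts_text.splitlines():
        # Fields: source, mount point, filesystem type, options, dump, pass.
        fields = line.split()
        if len(fields) >= 4 and unescape(fields[1]) == target:
            current = fields[3]  # keep the last match: later mounts sit on top
    if current is None:
        raise LookupError(f"{target} is not mounted")
    # Existing state first, then the requested changes appended to it.
    return ",".join([current, "remount", changes])
```

For example, if /proc/mounts says /mnt is mounted "rw,noatime", asking to remount it "ro" produces "rw,noatime,remount,ro", which then gets parsed like any other option string.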
But when mounting a _new_ filesystem, the "filesystemtype" argument to the system call specifies which driver to use. All the loaded drivers are listed in /proc/filesystems. A filesystem driver is responsible for putting files and subdirectories under the mount point: any time you open, close, read, write, truncate, list the contents of a directory, move, or delete a file, you're talking to a filesystem driver to do it. (And there are a few miscellaneous actions like ioctl(), stat(), statvfs(), and utime(). Yes I've implemented "touch" and "df" too.)
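To see which drivers are loaded, you can parse /proc/filesystems. Here's a small Python sketch (parse_filesystems is a made-up name). Each line of that file is a driver name, optionally preceded by "nodev", which marks drivers that don't need a block device as their source.

```python
import os


def parse_filesystems(text):
    """Return (name, needs_block_device) pairs from /proc/filesystems contents."""
    drivers = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Lines look like "nodev<TAB>proc" or "<TAB>ext2"; the name comes last.
        needs_block_device = "nodev" not in fields[:-1]
        drivers.append((fields[-1], needs_block_device))
    return drivers


if __name__ == "__main__" and os.path.exists("/proc/filesystems"):
    with open("/proc/filesystems") as f:
        for name, needs_block_device in parse_filesystems(f.read()):
            print(name, "(block device)" if needs_block_device else "(nodev)")
```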
Different drivers implement different filesystems, and there are four types:
1) Block device backed filesystems, such as ext2 and vfat.
This kind of filesystem driver acts as a lens to look at the block device through, and there's another driver somewhere that implements the block device. The source argument is a path to a block device, à la "/dev/hda1", which stores the contents of the filesystem in a fixed block of sequential storage.
2) Server backed filesystems, such as cifs/samba or fuse.
These drivers convert the filesystem operations into a sequential stream of bytes, which they can send through a pipe to talk to a program. The filesystem server could be a local Filesystem in Userspace (connected to a local process through a pipe filehandle), behind a network socket (CIFS and v9fs), behind a char device, and so on. The common attribute is there's some program on the other end sending and receiving a sequential bytestream.
Note: a lot of these filesystems want to open their own connection so they don't need to pass the data through a userspace process. That's not really for performance reasons, but because in low memory situations a chicken-and-egg problem can develop: all the process's pages have been swapped out, but the filesystem needs to write data to its backing store in order to free up memory so it can swap the process's pages back in. If this mechanism is providing the root filesystem, this can freeze the system solid.
The source argument for these filesystems indicates where the filesystem lives. It's often in a URL-like format for network filesystems, but it's really just a blob of data that the filesystem driver understands.
3) Ram backed filesystems, such as ramfs and tmpfs.
These are very simple filesystems that don't implement a backing store. Data written to these gets stored in the page cache (the kernel's disk cache), and the driver ignores requests to flush it to backing store (although tmpfs does swap it out). These drivers essentially mount the VFS's page cache as if it were the filesystem.
Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a chunk of memory to implement a block device, and then you can format that block device and mount it with a block device backed filesystem driver. This is significantly less efficient than ramfs, and allocates a fixed amount of memory up front for the block device instead of dynamically resizing itself as files are written into and deleted from the page cache the way ramfs does.
4) Synthetic filesystems, such as proc, sysfs, devpts...
These filesystems don't have any backing store either, because they don't store arbitrary data the way the first three types of filesystems do.
Instead they present artificial contents, which may represent processes or hardware or anything the driver writer wants them to show. Listing or reading from these files calls a driver function that produces whatever output it likes, and writing to these files submits data to the driver which can do anything it wants with it.
These are often implemented to provide monitoring and control knobs for parts of the operating system. It's an alternative to adding more system calls, providing a more human friendly interface which programs can use, but which users can also use directly from the command line with "cat" and by redirecting the output of "echo" into a file.
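From a program, talking to one of these files is just ordinary file I/O. Here's a tiny Python sketch of the "cat" and "echo >" equivalents; read_knob and write_knob are names invented for this example, and /proc/sys/kernel/ostype is just one file that exists on typical Linux systems.

```python
import os


def read_knob(path):
    """Equivalent of "cat path": ask the driver to produce the file's contents."""
    with open(path) as f:
        return f.read().strip()


def write_knob(path, value):
    """Equivalent of "echo value > path": hand the data to the driver."""
    with open(path, "w") as f:
        f.write(f"{value}\n")


if __name__ == "__main__" and os.path.exists("/proc/sys/kernel/ostype"):
    # Reading calls into the proc driver, which makes up the contents on the spot.
    print(read_knob("/proc/sys/kernel/ostype"))
```

Writing to most of the knobs under /proc and /sys requires root, which is why the example above only reads.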
Those are the four types of filesystems: backing store can be a fixed length block of storage, backing store can be some server the driver connects to, there can be no backing store at all, or the filesystem driver can just make up its contents programmatically.
And that's how filesystems get mounted, using the mount system call which has five arguments. The "filesystemtype" argument specifies the driver implementing one of those filesystems, and the "source" and "data" arguments get fed to that driver. The "target" and "mountflags" arguments get parsed (and handled) by the generic VFS infrastructure (the filesystem driver can peek at them but generally doesn't need to care).