January 8th, 2011

Containers setup HOWTO.

To play around with containers, I chose to use a 3 layer approach:

  • Laptop - the host system running on real hardware (my Ubuntu laptop).

  • KVM - a virtual debian Sid system running under KVM.

  • Container - a simple busybox-based system running in a container.

So "Laptop" hosts "KVM" which hosts "Container".

The advantage of this approach is we can modify and repeatedly reboot the KVM system without interfering with the host laptop. We can also play with things like network routing without disconnecting the laptop from the internet.

Step 1: Create a root filesystem for the KVM system.

Here's how to create a debian "sid" (unstable) root filesystem and package it into an 8 gigabyte ext3 image. The root password is "root". If you prefer a different root filesystem, feel free to use that instead. This procedure requires the "debootstrap", "genext2fs", and "e2fsprogs" packages to be installed.

This creates a smaller image and resizes it because genext2fs is extremely slow at creating large images.

You'll have to run this stage as root, and it requires network access. The remaining stages do not require root access.

sudo debootstrap sid sid

# Set the root password to "root" and configure loopback plus DHCP on eth0.
echo -e "root\nroot" | chroot sid passwd
echo -e "auto lo\niface lo inet loopback\nauto eth0\niface eth0 inet dhcp" \
  > sid/etc/network/interfaces
# Point /etc/vimrc.tiny at the normal vimrc, and remove udev's cached
# MAC-to-interface-name mapping so eth0 comes up on whatever virtual card
# KVM provides.
ln -sf vimrc sid/etc/vimrc.tiny
rm -f sid/etc/udev/rules.d/70-persistent-net.rules
echo kvm > sid/etc/hostname
# Mount the cgroup filesystem at boot (LXC needs it, see step 6).
echo cgroup /mnt/cgroup cgroup defaults >> sid/etc/fstab
mkdir -p sid/mnt/cgroup

# Size the image at 120% of the directory's disk usage: genext2fs is extremely
# slow at creating large images, so create a small one and grow it afterwards.
BLOCKS=$(((1024*$(du -m -s sid | awk '{print $1}')*12)/10))
genext2fs -z -d sid -b $BLOCKS -i 1024 sid.ext3
# Grow the image to 8 gigabytes, add a journal (making it ext3), and disable
# the periodic fsck checks.
resize2fs sid.ext3 8G
tune2fs -j -c 0 -i 0 sid.ext3

Now chown the "sid.ext3" file to your normal (non-root) user, and switch back to that user. (If you forget to chown, the emulated system won't be able to write to the ext3 file and will complain about write errors when you fire up KVM. Use your username instead of mine here.)

chown landley:landley sid.ext3
exit  # Stop being root on Laptop now

Step 2: Build a kernel for KVM, with container support.

The defconfig in 2.6.36 is close to a usable configuration, but needs a few more symbols switched on:

# Start with the default configuration
make defconfig

# Add /dev/hda and more container support.
cat >> .config << EOF
CONFIG_IDE=y
CONFIG_IDE_GD=y
CONFIG_IDE_GD_ATA=y
CONFIG_BLK_DEV_PIIX=y

CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_MEM_RES_CTLR=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED=y
CONFIG_BLK_CGROUP=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
EOF
yes '' | make oldconfig

# Build kernel (counting CPUS to supply appropriate -j to make)

CPUS=$(grep "^processor" /proc/cpuinfo | wc -l)
make -j $CPUS

This builds a (mostly) static kernel, because rebooting kvm with a new kernel image is trivial, but copying modules into a loopback mounted root filesystem image is a multi-step process requiring root access.
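
If you do enable any modules, getting them into the image means loopback mounting it on the Laptop (as root, while KVM isn't running) and pointing the kernel's module install target at the mount; a rough sketch, using /mnt/sid as a temporary mount point:

sudo mkdir -p /mnt/sid
sudo mount -o loop ~/sid.ext3 /mnt/sid
sudo make modules_install INSTALL_MOD_PATH=/mnt/sid
sudo umount /mnt/sid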

Step 3: Boot the result under QEMU or KVM, and add more packages.

This invocation boots the newly built kernel with the sid root filesystem image, configured to exit the emulator when the virtual system shuts down. It allocates 1 gigabyte of memory and provides a virtual gigabit network interface hooked up to a virtual masquerading router (for the 10.0.2.X address range), with port 9876 on the host's loopback interface forwarded to the SSH port on the emulated interface.

kvm -m 1024 -kernel arch/x86/boot/bzImage -no-reboot -hda ~/sid.ext3 \
  -append "root=/dev/hda rw panic=1" -net nic,model=e1000 -net user \
  -redir tcp:9876::22

Log in to the resulting system (user root, password root), and install some more packages to fluff out the sid install a bit.

aptitude update
aptitude install file psmisc less strace bzip2 make gcc libc6-dev dropbear lxc

Step 4: ssh into the KVM instance.

The KVM/QEMU console window is a nice fallback, but awkward for serious use. To get multiple terminal windows, or use cut and paste, we need more.

Redirecting a port from the host's loopback interface to a port on the KVM instance lets us ssh in from the laptop system. In step 3 we installed the dropbear ssh server, and the "-redir tcp:9876::22" argument we used to launch KVM forwards port 9876 on the host's loopback interface to port 22 on KVM's eth0, so we should now be able to ssh in from the laptop via:

ssh root@127.0.0.1 -p 9876

Remember, root's password is "root". (Feel free to change it.)

Step 5: Set up a simple busybox-based container under the KVM system.

The lxc-create command sets up a container directory with a new root filesystem. It takes three arguments: a name for the new container directory, a root filesystem build script, and a configuration file describing things like what network devices to put in the new container.

LXC calls its root filesystem build scripts "templates" (see /usr/lib/lxc/templates), the simplest of which is the "busybox" template.

Unfortunately, the default busybox binary in Debian sid is insufficient. The "busybox" package doesn't include the "init" command, and the "busybox-static" package doesn't have "login". To work around this, we download a prebuilt busybox binary from the busybox website, and add the current directory to the $PATH so lxc-create can find it.

We supply a trivial configuration file defining no network devices, mostly to shut up the "are you really really sure" babysitting lxc-create would spew otherwise.

wget http://busybox.net/downloads/binaries/1.18.0/busybox-i686 -O busybox
chmod +x busybox
echo -e "lxc.utsname = container\nlxc.network.type = empty" > container.conf
PATH=$(pwd):$PATH lxc-create -f container.conf -t busybox -n container

LXC creates the container's directory (including its config file and its root filesystem) under /var/lib/lxc.
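
If you're curious what lxc-create actually produced, you can poke around in that directory from the KVM system (the exact layout may vary a little between LXC versions):

ls /var/lib/lxc/container
cat /var/lib/lxc/container/config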

Step 6: Launch the container

Launching containers requires the "cgroup" filesystem be mounted somewhere. (Doesn't matter where, LXC will check /proc/mounts to find it.) In step 1, we added an fstab entry to the KVM sid system to mount cgroup on /mnt/cgroup.
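
If you want to check whether it's mounted (or mount it by hand after skipping the fstab entry), something along these lines works in the KVM system:

grep cgroup /proc/mounts || mount -t cgroup cgroup /mnt/cgroup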

We also need the LXC command line tools, which we installed in step 3.

Now we get to experience the brittle bugginess that is LXC 0.7.3. The first step to launching an LXC container is:

lxc-start -n container

This starts busybox init in the container, which will tell you "press Enter to activate this console". Unfortunately, LXC's console handling code is buggy, and this console won't actually work. (Feel free to play with it, just don't expect to accomplish much.)

To get a working shell prompt in the container, ssh into the KVM system again and from that window type:

lxc-console -n container

This will connect to one of init's other consoles, which finally lets you log in (as root). Repeat: you have to run lxc-start, leave it running, and run lxc-console in a second terminal in order to get a usable shell prompt.

Step 7: Stop the container, and the KVM system.

To kill the container, run this on the KVM system:

lxc-stop -n container

Note that lxc-stop undoes lxc-start. If you want to undo the lxc-create (delete the container from /var/lib/lxc), the command is:

lxc-destroy -n container

You can exit the KVM system by closing the QEMU console window, by hitting Ctrl-C in the terminal you ran KVM from, or by running "shutdown -r now" in the KVM system.

Summary

You should now be able to get a shell prompt in all three systems:

  • The host laptop.

  • The Debian sid KVM.

  • The busybox container.

Next time, we set up networking in the container.

Part 2: setting up networking in containers.

Last time, we set up a three layer container test environment:

  • Laptop - the host system running on real hardware (my Ubuntu laptop).

  • KVM - a virtual debian Sid system running under KVM.

  • Container - a simple busybox-based system running in a container.

So "Laptop" hosts "KVM" which hosts "Container". This lets us reconfigure and reboot the container host (the KVM system) without screwing up our real host environment (the Laptop system).

We ended with a shell prompt inside a container. Now we're going to set up networking in the container, with different routing than the KVM system so the Container system and KVM system have different views of the outside world.

LXC supports several different virtual network types, listed in the lxc.conf man page: vlan sets up a virtual interface that selects packets by IP address and routes them at the IP level, macvlan sets up a virtual interface that selects packets by mac address, and veth joins interfaces together using Linux's ethernet bridging support (and the ebtables subsystem).

The other two networking options LXC supports are "empty" (just the loopback interface), and "phys" to move one of the host's ethernet interfaces into the container (removing it from the host system).
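
All of these are selected through the container's configuration file. For comparison, a bridged veth setup (which we won't use here) would look something like the following, assuming you had already created a bridge named br0 on the host:

lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0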

We're going to add a second ethernet interface to the KVM system, and use the "phys" option to move it into the container.

Step 1: Add a TAP interface to the Laptop.

The TUN/TAP subsystem creates a virtual ethernet interface attached to a process. (A TUN interface allows a userspace program to read/write IP packets, and a TAP interface works with ethernet frames instead.) For details, see the kernel TUN/TAP documentation.

We're going to attach a TAP interface to KVM, to add a second ethernet interface to the KVM system. Doing so requires root access on the laptop, but we can use the "tunctl" program (from the "uml-utilities" package) to create a new TUN/TAP interface and then hand it over to a non-root user (so we don't have to run KVM as root).

Run this as root:

# Replace "landley" with your username
tunctl -u landley -t kvm0
ifconfig kvm0 192.168.254.1 netmask 255.255.255.0
echo 1 > /proc/sys/net/ipv4/ip_forward

The above commands last until the next time you reboot your Laptop system, at which point you'll have to re-run them. They associate the address 192.168.254.1 with the TAP interface on the Laptop host, and tell the Laptop to route packets between interfaces.

If you want to remove the tun/tap interface from the host (without rebooting), the command is:

tunctl -d kvm0

Step 2: Launch KVM with two ethernet interfaces.

We need to reboot our KVM system, still using the kernel and root filesystem we built last time, but this time specifying two ethernet interfaces. The first is still eth0 masqueraded through a virtual 10.0.2.x LAN (for use by the KVM host), and the other's a TAP device connected directly to the host (for use by the container).

To do this, we append a couple new arguments to the end of the previous KVM command line:

kvm -m 1024 -kernel arch/x86/boot/bzImage -no-reboot -hda ~/sid.ext3 \
  -append "root=/dev/hda rw panic=1"  -net nic,model=e1000 -net user \
  -redir tcp:9876::22 -net nic,model=e1000 -net tap,ifname=kvm0,script=no

The first "-net nic" still creates an e1000 interface as KVM's eth0, the "-net user" plugs that interface into the masqueraded 10.0.2.x LAN, and -redir forwards port 9876 of the laptop's loopback to port 22 on that interface. What's new is the second "-net nic" which adds another e1000 interface (eth1) to KVM, and "-net tap" which connects that interface to the TUN/TAP device we just created on the Laptop.

Step 3: Set up a new container in the KVM system.

To add a network interface to the container, we need a new configuration file in the format described by the "lxc.conf" man page. We're going to move a physical interface (eth1) from the host into the container. This will remove it from the host's namespace, and make it appear only in the container.

In the KVM system, go to the directory containing the static "busybox" binary and, as root, run:

cat > busybox.conf << EOF
lxc.utsname = busybox
lxc.network.type = phys
lxc.network.flags = up
lxc.network.link = eth1
#lxc.network.name = eth0
EOF

PATH=$(pwd):$PATH lxc-create -f busybox.conf -t busybox -n busybox
lxc-start -n busybox

The reason the last line of busybox.conf is commented out is to work around another bug: if the container's interface has the same name as the host interface, the two bleed together. So the host's eth1 interface will still be called "eth1" in the container, even though there's no eth0 there.

Leave that running and SSH into the KVM system again, get a shell prompt in the container and configure the container's new network interface:

lxc-console -n busybox

ifconfig eth1 192.168.254.2 netmask 255.255.255.0
route add default gw 192.168.254.1
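
If that worked, the container should now be able to reach the Laptop's end of the TAP interface, which you can sanity check with:

ping -c 3 192.168.254.1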

Step 4: Fun with routing.

Now let's show that the container can access things the KVM can't. On the Laptop system, set up an alias of the loopback interface with the same IP address assigned to the KVM's eth0 (10.0.2.15). Then download the busybox binary to the Laptop and run busybox netcat in server mode so it prints "hello world" when you connect to port 12345.

sudo ifconfig lo:1 10.0.2.15 netmask 255.255.255.0
wget http://busybox.net/downloads/binaries/1.18.0/busybox-i686 -O busybox
chmod +x busybox
./busybox nc -p 12345 -lle echo hello world

Now from the container, try to connect to it with netcat:

nc 10.0.2.15 12345

It should print "hello world", meaning you connected to the laptop's lo:1 interface rather than the KVM's eth0. If you try the same command from the KVM system (./busybox nc 10.0.2.15 12345), it won't connect.

Making CIFS work in a container, part 1.

The previous two posts were documentation on things I got to work. Some of what I documented was working around bugs, such as the fact that the busybox template installs a broken inittab (that's the lxc-start/lxc-console workaround); the inittab should be:

::sysinit:/etc/init.d/rcS
tty1::respawn:/bin/getty -L tty1 115200 vt100
console::askfirst:/bin/sh


But the point is, I got it to work.

Mounting network filesystems in a container doesn't work, because internally the kernel uses the original network namespace rather than the container's network namespace. This means you can mount using IP addresses and routing visible from the _host_, but not using addresses and routes that should only be visible from inside the container.

Wrapping my head around NFS is enough of a roadblock that I'm taking a break to deal with a network filesystem that's crazy in _different_ ways: Samba. It has the same general issues, but it's just one TCP/IP session per mount (modulo reconnecting if that connection breaks), and then everything else it does goes through that connection. No weird TCP vs UDP stuff, no portmap daemon handing out constant information for historical reasons, no layering violations to handle DNS callbacks (or at least much less obvious ones)...

So let's start out by documenting what _works_ here.
