Old Fashioned Virtualization

If you’re working in the server end of the IT field today, you’d be hard pressed to avoid the topic of virtualization. Everyone from the largest vendors (IBM, HP, Sun, VMware) to the smallest (XenSource) is pushing it. They claim that virtualization will reduce IT costs, simplify your infrastructure, and let you better leverage the computing resources you already own. Are they right? Maybe. Many decision-makers only ask, “Does the product really perform as advertised?” This approach has some pitfalls that may not seem obvious. First, listen carefully to what the vendor is not claiming their product can do. Vendors will often show demos or presentations of amazing capabilities or implementations of their products, with the help of their consulting services. Vendors typically have no problem selling you a solution more advanced than their product can handle out of the box; more consulting services means more dollars. Sometimes it is difficult to determine what their product provides and what they can make it provide for additional money. For this reason, and because vendors will try to sell you more than you may need even in the base product, there might be a better, more fundamental question: “How much of this can we implement ourselves, without vendor help?”

The answer is actually “quite a bit.” This article describes a basic but very practical network environment that provides complete virtualization for every server in your organization. For the purposes of this article, we are going to design the network infrastructure for Search Me, a web site search company.

Search Me is a startup that has just designed a revolutionary new distributed search application, which runs on any number of servers and powers their search engine. Unfortunately, this program is unbelievably huge, and in order to do anything productive, the developers must use distcc to compile the server in a distributed build on as many machines as they can. The server environment is all Linux. Since Search Me is a small company, they don’t have the resources to purchase an unlimited number of servers; they must leverage what they have efficiently.

After several weeks in production, the company realizes that the amount of web traffic varies greatly. Sometimes, all their servers are pegged. Other times, usage drops to below 25%. The load changes gradually over hours. Upper management wants to know – can Search Me use virtualization to make the organization more efficient?

Goals
Let’s use some standard virtualization-industry goals:

  1. Get more done with the same physical resources
  2. Lower the time required for maintenance

And then let’s define some goals for the system administrators:

  3. No custom kernels or programming
  4. Keep It Simple, Stupid

Goals 3 and 4 might surprise you. They don’t usually show up on a vendor’s list of goals, but they should. They are what keep you from having to work nights and weekends. Make them a priority.

Approach

The answer to Search Me management’s question is “yes.” Not only can virtualization allow the company to run more efficiently, but no expensive purchases are necessary. Here’s how: near-diskless servers. These servers will perform as well as they did before, but will let administrators or automated processes re-task them in minutes, even if that means different OS versions, configurations, or Linux distributions. It will let them replace a dead machine in the time it takes to reboot one, and perform company-wide server updates in an “update-once, deploy everywhere” model, either universally or in phases.
Here’s how this is accomplished. Most of each server’s filesystem is hosted on a network file server using NFS. At boot time, a bootp/pxe server instructs each server to look to the NFS server for its boot image. Most of the filesystem is mounted read-only, although each server has a “snapshot” directory on the NFS server, where any writable and machine-specific files are located. Additionally, depending on its task, each server uses its local disk for swap, /tmp, and data-serving tasks. This means that I/O-intensive tasks can still occur on local disk, where they can take advantage of maximum performance. Most system files and executables, which are typically accessed once and then cached by the OS, are loaded over the network, as are infrequently accessed configuration files. Any upgrade to the packages in the root filesystem applies to all the servers that use that network image, a large time savings for OS updates and configuration.
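
To make the read-only root, writable snapshot, and local-disk split concrete, here is a sketch of what such a client’s /etc/fstab might look like. The server name, export paths, and partition layout are illustrative assumptions, not the output of any particular tool:

```
# Read-only root and per-machine snapshot come from the NFS server;
# swap, /tmp, and bulk data stay on local disk for I/O performance.
nfsserver:/export/root/centos42-web    /          nfs    ro,nolock  0 0
nfsserver:/export/snapshot/diskless-0  /snapshot  nfs    rw,nolock  0 0
/dev/hda1                              swap       swap   defaults   0 0
/dev/hda2                              /tmp       ext3   defaults   0 0
/dev/hda3                              /data      ext3   defaults   0 0
```
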
Now is where virtualization comes into play. The system administrators at Search Me are going to create multiple OS images on the NFS server. There will be a “production” web server image and a “production” compile server image, as well as a “prototype” version of each. If they choose, they can easily create additional images. The “prototype” images exist so that new OS updates can be deployed in stages, instead of upgrading all machines simultaneously, as their IT policy dictates. Each time a server boots, the bootp/pxe server tells it what sort of server it should be. Hence, switching a server between a CentOS 4.2 web server and a Mandrake 10 compile server is as simple as rebooting the server and having the bootp/pxe server tell it to become one.

Dispelling Common Myths

The astute reader might be thinking that this configuration looks pretty familiar. In fact, diskless systems are widely used for kiosk workstations in Internet cafés and universities. The diskless server, however, is not so widely used, largely because of a few common objections. Many of these objections are granted as truisms, while in reality they are far from it.

“Network access will reduce performance compared to local disk.” This will be the first argument in a high-performance computing environment. However, notice that we have specifically addressed this issue. Heavy I/O-based tasks, those that require both reads and writes, will still take place on the local disk. Note also that more of the local disk will be available to applications, as the OS itself does not need to be stored there. The OS, as well as the binaries for the server processes, will reside remotely. However, binaries are typically heavily cached. A typical Linux server accesses no more than 500 MB of unique data during a system boot, and very little after that. With typical servers carrying multiple gigabytes of RAM, this data will be cached at both the client and the server. This means, first, that the client will only rarely have to go to the server to access files, and second, that the server will already have the files cached itself and be able to return the data immediately. In this way, the NFS server will be limited mostly by its network interface (typically one or two gigabit Ethernet ports) rather than its disk speed. In real-world usage, a single-CPU NFS server with one gigabit Ethernet port and an IDE RAID-5 array was easily able to serve 20-30 machines without negative impact on the clients.

“NFS is a bad protocol.” Most arguments will sound like this, but what the person usually means is that NFS doesn’t support failover natively, and that server outages can manifest as client hangs. While these issues are real, careful planning can mitigate them. First, active-passive NFS server configurations are common nowadays, although somewhat complex to configure (please see the references at the end of this article). Such configurations completely mitigate server failure with no single point of failure, and will make the environment rock-solid. If uptime requirements are more lenient (or the budget isn’t), a satisfactory solution may be simply to run two mirrored NFS servers, each with half the load. In the rare event of a total NFS server failure, simply reboot its half of the near-diskless servers off the remaining NFS server.

Benefits

The savings in administration time, server recoverability, and flexibility are indisputable. A single system administrator can easily handle several hundred servers with a time commitment comparable to administering a handful. Similar but different server configurations can be accommodated within a single NFS server image by allowing individual machines to have differing rc.d directories. One cold-spare server can be booted to replace any failed server, complete with IP address, within minutes. These are capabilities that only the most expensive commercial virtualization software can provide.

HowTo

First, you will need to set up your bootp/pxe/dhcp server. This needs to be running on the same subnet as the diskless servers. In many cases, this can also be your NFS file server. You will need to make sure the appropriate dhcpd and tftpd packages for your distribution are installed. For Red Hat, you will need to make sure the following entries are in /etc/dhcpd.conf (adjusted for your network settings):

subnet 192.168.1.0 netmask 255.255.255.0 {
    pool {
        option domain-name "yourdomain.com";
        option domain-name-servers 192.168.1.2;
        option routers 192.168.1.254;
        range dynamic-bootp 192.168.1.32 192.168.1.64;
        next-server tftpserver.yourdomain.com;
        filename "linux-install/pxelinux.0";
        group {
            use-host-decl-names on;
            host diskless-0.yourdomain.com {
                # Use the diskless server's MAC address
                hardware ethernet 00:11:22:33:44:55;
                fixed-address 192.168.1.10;
            }
            # You will have one of these entries for each
            # diskless server.
            host diskless-1.yourdomain.com {
                hardware ethernet 00:11:22:33:44:56;
                fixed-address 192.168.1.11;
            }
        }
    }
}
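
The pxelinux.0 boot file named above is delivered over TFTP, which on Red Hat systems of this era typically runs under xinetd. Assuming the stock tftp-server package and the default /tftpboot directory, enabling it amounts to setting “disable = no” in /etc/xinetd.d/tftp, roughly like this:

```
# /etc/xinetd.d/tftp -- serve /tftpboot (including pxelinux.0) to PXE clients
service tftp
{
        disable         = no
        socket_type     = dgram
        protocol        = udp
        wait            = yes
        user            = root
        server          = /usr/sbin/in.tftpd
        server_args     = -s /tftpboot
}
```

Remember to restart xinetd (and dhcpd) after making these changes.
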

For the next step, Red Hat has kindly provided a tool that allows the trivial conversion of a traditionally installed Red Hat system (as well as many other distributions) into a network-bootable image, complete with individual system snapshots. Rather than repeat that documentation here, please refer to Red Hat’s excellent documentation. There are several items that will make your new network-based image more usable. These tips apply to Red Hat systems; adjust accordingly for your distribution. If you perform the actions below, you will not need to use the wizard to create each system snapshot: when a new diskless machine is booted, it will automatically create its own snapshot directory.

  1. Assign a non-existent HWADDR in your ifcfg-eth0 file (like “HWADDR=none”), and set ONBOOT=no. If you don’t, the OS will try to shut down the network too early when it shuts down or restarts.
  2. Copy the /etc/fstab from one of the snapshots created by the wizard to the root network filesystem. The wizard quietly creates a tweaked version, but doesn’t update the main one. This will allow new snapshots to automatically contain the right one.
  3. Softlink /etc/mtab to /proc/mounts.
  4. Create a file called /etc/sysconfig/readonly-root with the contents “READONLY=yes”.
  5. Make sure that the important /dev files are in the network root, and that they are device files, not normal files. Specifically, check for zero, null, and the ttys. If they aren’t there, use MAKEDEV to create them.
  6. Edit the new file in /tftpboot/linux-install/pxelinux.cfg to remove the snapshot argument from the “append” line. Without this argument, the system will default to using its hostname as the snapshot root. Rename the file to something descriptive of this filesystem, and soft-link the old filename to the new one.
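
For reference, after these edits a pxelinux configuration file for one of these images might look roughly like the following. The kernel and initrd names, server address, and export path are illustrative placeholders; the exact append arguments depend on what the wizard generated for your system:

```
default linux

label linux
    kernel vmlinuz-2.6.9-5.EL
    append initrd=initrd-2.6.9-5.EL.img root=/dev/ram0 init=disklessrc NFSROOT=192.168.1.2:/export/root/centos42-web
```
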

OK, now you have a network-bootable system. Where does virtualization come in? You may have noticed that the Red Hat wizard created a file in your pxelinux.cfg directory with a strange name. This filename is actually the server’s IP address converted to hexadecimal. When each diskless server boots, it looks for the configuration file matching its IP address (which it acquires via bootp/pxe/dhcp). For example, if you convert each octet of 192.168.1.10, you get C0.A8.01.0A. A decent web site that can make conversion easier is at http://www.searchlores.org/sonjas33.htm. Strip out the dots, and you get C0A8010A, which is the pxelinux configuration filename that the diskless server at 192.168.1.10 will look for when it boots.
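
The conversion is also easy to script. The sketch below shows the same octet-to-hex transformation; the function name is our own, and pxelinux itself simply looks for the resulting filename in pxelinux.cfg/:

```shell
# Convert a dotted-quad IP address into the hexadecimal filename
# that pxelinux looks for in the pxelinux.cfg directory.
ip_to_pxe() {
    local IFS=.
    set -- $1                          # split the dotted quad into $1..$4
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_pxe 192.168.1.10   # prints C0A8010A
```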

If you follow the procedure above for each OS image you require, you will have several network root directories. As mentioned above, all the IP-based configuration files are actually soft-links to a main pxelinux configuration file for a particular OS image. By simply changing the softlink for one of the IP-based configuration files, the administrator can change which operating system that server will boot. This has no effect on the currently running system; it must be rebooted before it will start up with the newly designated image.
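
The re-tasking step can be sketched as follows. The image names are placeholders, and a temporary directory stands in for what would be /tftpboot/linux-install/pxelinux.cfg on a real server:

```shell
# Sketch: repoint diskless client 192.168.1.10 (hex C0A8010A) from the
# CentOS web image to the Mandrake compile image by swapping its symlink.
CFGDIR=$(mktemp -d)                                       # stand-in for pxelinux.cfg/
touch "$CFGDIR/centos42-web" "$CFGDIR/mandrake10-compile" # per-image config files

# Initially the client boots the web-server image...
ln -s centos42-web "$CFGDIR/C0A8010A"
# ...then the administrator retargets it; -n replaces the link itself
# rather than following it.
ln -sfn mandrake10-compile "$CFGDIR/C0A8010A"
readlink "$CFGDIR/C0A8010A"   # prints mandrake10-compile
```

The change takes effect the next time that client reboots.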

Only three steps are necessary in order to add a new diskless server. This is certainly a lot simpler than any traditional method of provisioning and installing a new server!

  1. Add the new server’s MAC address to your dhcpd.conf file.
  2. Create a softlink in the pxelinux.cfg directory with the hexadecimal IP of the new server, pointing to the OS image configuration file you want it to boot.
  3. Make sure the diskless server is configured to boot from the network in its BIOS.
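
The first two steps can be sketched as a script. The hostname, MAC address, IP, and image name below are placeholders, and step 3 (enabling network boot in the BIOS) remains a manual task at the console:

```shell
# Placeholder identity for the new diskless server.
new_host=diskless-2.yourdomain.com
mac=00:11:22:33:44:57
ip=192.168.1.12
image=centos42-web

# Step 1: the host entry to append to /etc/dhcpd.conf (shown on stdout).
cat <<EOF
host $new_host {
    hardware ethernet $mac;
    fixed-address $ip;
}
EOF

# Step 2: compute the hex-IP filename, then soft-link it to the image's
# pxelinux configuration file, e.g.:
#   ln -s centos42-web /tftpboot/linux-install/pxelinux.cfg/$hexname
hexname=$(printf '%02X%02X%02X%02X' $(echo "$ip" | tr . ' '))
echo "$hexname"   # prints C0A8010C
```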

Upgrading an OS image is also very simple. The administrator can issue the following command on one of the diskless systems: “mount -o remount,rw /”. Afterward, they can install, update, or remove packages the same way they would on a traditional system. The difference is that these changes actually update the network filesystem image, and are applied to every diskless server that uses it.

Creating new versions of OS images is also very simple: the administrator creates a copy of an OS image on the network file server and uses the copy as a new network OS image. In this way, you can copy an OS image, upgrade the packages on the copy, and then configure some or all of the diskless systems to boot to the new version. If something goes wrong, you can configure them to boot back to the original version just as easily. This can be a huge benefit in addressing upgrade risks in complex environments.
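
The cloning step itself is just a recursive copy. In this sketch the image names are placeholders and a temporary directory stands in for the real NFS export (something like /export/root on the file server):

```shell
# Sketch: clone an OS image so the copy can be upgraded and tested
# before any production client boots from it.
EXPORT=$(mktemp -d)                       # stand-in for the NFS export
mkdir -p "$EXPORT/centos42-web/etc"
echo "CentOS release 4.2" > "$EXPORT/centos42-web/etc/redhat-release"

# cp -a preserves permissions, ownership, and symlinks across the tree.
cp -a "$EXPORT/centos42-web" "$EXPORT/centos43-web"

# The copy is now an independent image: upgrade its packages, point a
# "prototype" client's pxelinux.cfg symlink at it, and relink back to
# the original if anything breaks.
cat "$EXPORT/centos43-web/etc/redhat-release"   # prints CentOS release 4.2
```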