KVM Backups – an exercise in screwing up your ext4 filesystem

We needed a way to provide VM level image backups purely for disaster recovery. Our KVM hosts have local disk, so offloading a qcow2 file seemed the best way. I found a script online [1] and started digging.

There were a few roadblocks at first. CentOS 6 ships with qemu-0.12, where ‘qemu-img create’ doesn’t accept a snapshot name and ‘qemu-img convert’ can’t pull it back out. A little bit of compilation later, I had a shiny 2.1.2 qemu-img (the only thing the script needed changed in order to run).

Looked great at first, ramped up with 3-4 VM guests. Nightly runs, blah blah.

Then a few days later the badness happened.

 Mar  5 12:15:49 host.sdsc.edu kernel: EXT4-fs error (device vda2): ext4_lookup: deleted inode referenced: 1178554
 Mar  5 12:15:49 host.sdsc.edu kernel: EXT4-fs error (device vda2): ext4_lookup: deleted inode referenced: 1177551
 Mar  5 12:15:49 host.sdsc.edu sshd[31024]: pam_env(sshd:setcred): Unable to open config file: /etc/security/pam_env.conf: Input/output error
 Mar  5 12:17:32 host.sdsc.edu kernel: EXT4-fs error (device vda2): ext4_lookup: deleted inode referenced: 1177463

Only one of the guests was having a problem, but on top of that the ‘qemu-img convert’ snapshot images were bad too. No amount of fsck’ing could fix those suckers.

The other guinea pig KVM guests were relatively idle, but this system was a postgres server. Firing up bonnie++ against the local disk on a “working” guest caused it to break similarly.

I poked around for a solution that didn’t require installing non-standard packages on the host and found

virsh snapshot-create-as --quiesce

Whipped up a script revolving around that and…

It needs a few extra tweaks. The QEMU Guest Agent (qemu-ga) must be installed and running on the guest, and the guest XML definition needs a new channel device:

<channel type='pty'>
 <target type='virtio' name='org.qemu.guest_agent.0'/>
 <address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>

 
So far I’ve given this a proper run with multiple bonnie++ instances running and have not had any problems with the running base, a reboot on a rebased image, or a copied-off “old” base. It even appears that leaving off --quiesce doesn’t cause a break either. This is handy when you haven’t installed ‘qemu-ga’ and/or can’t restart the guest with modified XML definitions.
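A minimal sketch of how the snapshot command could be assembled, with --quiesce made optional for guests that lack qemu-ga. This is not the script described here: the helper name build_snapshot_cmd is hypothetical, and the --disk-only and --atomic flags are assumptions about how one would ask virsh for an external, disk-only snapshot.

```python
def build_snapshot_cmd(domain, snapshot_name, quiesce=True):
    """Return the argv list for a virsh external snapshot (nothing is executed).

    Assumption: --disk-only and --atomic are the desired snapshot options;
    the original script may use a different set.
    """
    cmd = ['virsh', 'snapshot-create-as', domain, snapshot_name,
           '--disk-only', '--atomic']
    if quiesce:
        # Only works when qemu-ga is running in the guest and the
        # guest agent channel is defined in the guest XML.
        cmd.append('--quiesce')
    return cmd


if __name__ == '__main__':
    # Print the command instead of running it; 'guest1' is a made-up domain name.
    print(' '.join(build_snapshot_cmd('guest1', 'nightly', quiesce=False)))
```

On a real host the list could be handed to subprocess.check_call; the sketch only builds it so it is safe to try anywhere.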

I find this a much better solution so far, but it’s not without limitations.

You must have sufficient disk space to hold two working copies of a guest: the new rebased running image and the old base (a limitation of libvirt-0.10). Depending on how you copy out the old one, that may be the only space you need. With later versions of libvirt you should be able to use ‘virsh blockcommit’ to shove the snapshot delta back into the old base. We’re stuck with ‘virsh blockcopy’, which makes a new base.

When we get our CentOS 7 KVM guest up and going, I will see how things are different with a ‘modern’ libvirt/qemu pair.

References

[1] : http://www.sleepdontexist.com/2014/03/28/kvm_manage-sh-a-script-to-manage-your-kvm-machines/

GMond Python Module Notes

https://github.com/ganglia/monitor-core/wiki/Ganglia-GMond-Python-Modules

A custom GMond Python module requires a config file and a module file with three required functions: metric_init, metric_cleanup, and get_value*. In your config file (/etc/ganglia/conf.d/MODULE_NAME.pyconf), follow this structure:

modules {
  module {
    name = 'MODULE-NAME'
    language = 'python'

    param KEY {
      value = 'SOMETHING'
    }
  }
}

collection_group {
  collect_every = 30
  time_threshold = 30

  metric {
    name_match = "MODULE-NAME_(.+)"
  }
}

In your python module file (/usr/ganglia/lib64/ganglia/python_modules/MODULE_NAME.py) you need three functions:

  • metric_init : This is called once when gmond starts, then not called again. You need to return a description of your checks (more about that below). It is passed a hash of params from the “param” block in the conf file.
  • metric_cleanup : You probably do not need it, but you have to include it or gmond complains. It is called when gmond is shutting down, in case you have any cleanup to do. Just use the block below…
  • get_value : Does not need to be named get_value; use whatever name you pass as the ‘call_back’ value in the descriptors hash. This function is called by gmond for each metric each time a collection runs. After the initial run it is called directly without going through metric_init. It is passed the ‘name’ value from the descriptors hash and must return the metric value.

Each check is described by a descriptor hash along these lines:

Desc_Skel = {
    'name'        : 'METRIC_NAME', 
    'call_back'   : get_value, 
    'time_max'    : 10, # Does not do anything??
    'value_type'  : 'float', 
    'format'      : '%f', # https://docs.python.org/2/library/stdtypes.html#string-formatting
    'units'       : '', 
    'slope'       : 'both', # zero|positive|negative|both, but probably 'both'
    'description' : 'METRIC DESCRIPTION', # Only used when you run 'gmond -m'
    'groups'      : 'MODULE_NAME', # Not used the way we use gmond
}
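Putting the three functions and the descriptor together, a minimal module might look roughly like the sketch below. The metric name example_load_one and the use of os.getloadavg() are placeholders chosen for illustration, not anything gmond requires.

```python
import os


def get_value(name):
    """Callback: gmond passes the descriptor's 'name' and expects the value back."""
    # Placeholder measurement: the 1-minute load average (Unix only).
    return float(os.getloadavg()[0])


def metric_init(params):
    """Called once at gmond startup with the params from the conf file."""
    descriptor = {
        'name'        : 'example_load_one',  # hypothetical metric name
        'call_back'   : get_value,
        'time_max'    : 10,
        'value_type'  : 'float',
        'format'      : '%f',
        'units'       : '',
        'slope'       : 'both',
        'description' : 'Example metric',
        'groups'      : 'example',
    }
    return [descriptor]


def metric_cleanup():
    """Called when gmond shuts down; nothing to clean up here."""
    pass
```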

If you want to have multiple metrics, return them as a list of descriptor hashes from metric_init, one per metric.
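One way to do that is to copy a skeleton per metric and fill in the name, as in this sketch. The suffixes reads and writes are invented for the example; the MODULE-NAME_ prefix is chosen so the names match the name_match pattern in the config above.

```python
import copy


def get_value(name):
    # Placeholder: a real module would look up the value for this metric name.
    return 0.0


SKELETON = {
    'name'        : 'METRIC_NAME',
    'call_back'   : get_value,
    'time_max'    : 10,
    'value_type'  : 'float',
    'format'      : '%f',
    'units'       : '',
    'slope'       : 'both',
    'description' : 'METRIC DESCRIPTION',
    'groups'      : 'MODULE_NAME',
}


def metric_init(params):
    descriptors = []
    for suffix in ('reads', 'writes'):  # hypothetical metric suffixes
        d = copy.copy(SKELETON)         # shallow copy so each metric has its own hash
        d['name'] = 'MODULE-NAME_' + suffix
        d['description'] = 'Example %s metric' % suffix
        descriptors.append(d)
    return descriptors


def metric_cleanup():
    pass
```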

Beyond that you can do whatever you want that is valid python. For debugging, include the following block at the bottom of your code:

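# The following code is for debugging and testing.
if __name__ == '__main__':
    # PARAMS is a placeholder: a dict mirroring the "param" blocks in the conf file.
    descriptors = metric_init(PARAMS)
    for d in descriptors:
        # Fill in the name first, then apply the metric's own format to its value.
        print(('%s = %s' % (d['name'], d['format'])) % d['call_back'](d['name']))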

Random Notes:

  • Global variables are maintained between calls
  • To control the frequency of calls, I find it best to set one global variable with a timestamp and another with the values. Then the get_value function only refreshes the data if more than some time frame has passed since the last refresh.
  • Use ‘gmond -m|grep MODULE_NAME’ to make sure gmond is finding your module.
  • View the system logs for gmond startup output that might tell you why your module is not working.
  • For some reason the first pass of checks seems to run as ‘root’ and not as the user specified in the gmond.conf file. This seems like a bug, but it can be useful if you only need elevated privileges for your metric_init and not for follow-up calls. It can also be the cause of your check initially working and then later failing.