VMware snapshots are an extremely useful feature for saving the state of a VM so that you can roll back to it later. They are nearly instantaneous to create, and reverting to a snapshot is also very fast. They do have one huge drawback, though: when they exist for more than a few days, they can take up more and more space and degrade the performance of your infrastructure. This performance hit is incurred not just by the VM with the snapshots but by other VMs as well, since the presence of snapshots increases the number of disk reads and writes necessary to work with the filesystem, and also increases CPU load on the host as it calculates deltas between data.

This wouldn’t be a problem if people who took snapshots deleted them shortly after creating them, but snapshots have a tendency to be forgotten. In a large IT environment with multiple datacenters, vCenters, clusters, and many VMs, it can be hard to figure out how many snapshots are out there, how old they are, and how much they are affecting performance, and harder still to enforce a policy of periodic expiration and automatic deletion.

We had this exact problem in our labs at CloudBolt (where we have every version of vCenter & ESX since 4.1 running), so we decided to automate a solution for this with a CloudBolt rule.

This rule condition looks at all VMs known to CloudBolt, searches for snapshots on them that were created more than the threshold number of days ago, and reports on them. If the Dry Run flag is set to False, the rule will also initiate a deletion of these snapshots.
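
The selection logic described above can be sketched roughly as follows. This is a minimal illustration and not the rule that ships with CloudBolt: the `Snapshot` record, the `find_expired` and `run_rule` functions, and the `delete` callback are all hypothetical names, and the real rule would get its snapshot list and perform deletions through vCenter.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class Snapshot:
    # Hypothetical record for illustration; not CloudBolt's or VMware's data model.
    vm_name: str
    name: str
    created: datetime


def find_expired(snapshots: List[Snapshot], threshold_days: int,
                 now: datetime) -> List[Snapshot]:
    """Return snapshots created more than threshold_days before now."""
    cutoff = now - timedelta(days=threshold_days)
    return [s for s in snapshots if s.created < cutoff]


def run_rule(snapshots: List[Snapshot], threshold_days: int,
             dry_run: bool = True,
             delete: Optional[Callable[[Snapshot], None]] = None,
             now: Optional[datetime] = None) -> List[Snapshot]:
    """Report expired snapshots; delete them only when dry_run is False."""
    now = now or datetime.now()
    expired = find_expired(snapshots, threshold_days, now)
    for snap in expired:
        age_days = (now - snap.created).days
        print(f"{snap.vm_name}: snapshot '{snap.name}' is {age_days} days old")
        # The delete callback is an assumed hook standing in for the real deletion.
        if not dry_run and delete is not None:
            delete(snap)
    return expired
```

Defaulting `dry_run` to `True` mirrors the behavior described here: nothing is deleted until someone deliberately turns the flag off.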

We ran this rule a week ago against our labs, starting with a large threshold to delete the oldest of our snapshots. We thought we had been doing a good job of cleaning snapshots up as we went, but, as it turned out, there were a lot of snapshots, and some very old ones. Over the course of a day, we decreased this threshold, reran the rule, and repeated.

We immediately noticed a performance improvement in our labs: some of our automated CIT tests that had been failing with timeouts began succeeding, and all our developers, SEs, and other users of our infrastructure became happier.

After all the old snapshots were cleaned up, we set the rule to run nightly and delete any snapshot older than 14 days, ensuring that this problem does not affect us again.

Today, CloudBolt is releasing 6.1-alpha4, which comes with this rule built in. To upgrade to this alpha release, navigate in your CloudBolt UI to Admin > Version & Upgrade Info (or download the upgrader from our support site). After upgrading, go to Admin > Rules, where you can change the inputs to this rule and execute it. The rule starts with Dry Run enabled, so it will not delete any snapshots automatically.