My company offers a number of different "enterprise cloud" products for our customers, and one of them is a Hosted VM offering.  Customers can choose the level of availability and SLA they want (from 99%, Non-HA to 100%, FT-protected), they get an included amount of vCPU, RAM and production storage, and then they can build out the server they need from there.  It's a $0 capital alternative to buying a 1U or 2U server, it can be integrated into an existing co-lo environment, and we can replicate the VMs between three different markets.  Customers love it and the product's growth is great, so everything is peachy, right?

The trouble I'm running into as I go through the exercise of scaling out the infrastructure to support our growth over the next 12 months is that there seems to be a huge delta between the amount of space we provide to the VM (.vmdk) and the raw storage needed to support it.

For example:

Let's say (completely hypothetically, of course) that my customers had an average VM disk size of 85 GB.  If that's true, you'd expect that I'd be able to support 824 VMs out of 70 TB of usable space.

However, you have overhead at every step of the way, don't you?  If you (again, completely hypothetically) use a 512 GB LUN sizing as your standard, each of those LUNs is going to use somewhere between 7% and 10% more actual space on the SAN.  While the VMFS partitioning is going to be pretty efficient, you are going to spend space inside the datastore not only for the individual .vmdk files, but also for the swap file that backs the RAM each VM is using.  You also need to leave some breathing room on the datastores, since you don't want those to fill up, so even if you cut that headroom to 10%, you are losing 10% of your capacity.

By my math, that original 85 GB .vmdk file ends up being more than 133 GB of raw space required, or over 55% more than the actual size of the .vmdk.  That 70 TB of space now only holds 526 VMs, and now I have a whole different business challenge to deal with.  The cost of the 70 TB didn't change, that's for sure.
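For anyone who wants to play with the numbers, here is a rough Python sketch of the arithmetic above. The 85 GB, 133 GB and 70 TB figures (with 1 TB taken as 1,000 GB) come from this post; the function names and the 4 GB swap file in the second example are made up purely for illustration, and the stacking model is an assumption that does not try to reproduce the 133 GB figure, since not every contributing factor is listed here. Note that the code rounds down to whole VMs, so the ideal case comes out as 823 rather than the rounded 824.

```python
def vms_supported(usable_gb: float, per_vm_gb: float) -> int:
    """Whole VMs that fit in usable_gb if each one costs per_vm_gb."""
    return int(usable_gb // per_vm_gb)


def overhead_fraction(vmdk_gb: float, raw_gb: float) -> float:
    """Extra raw space per VM, as a fraction of the .vmdk size."""
    return raw_gb / vmdk_gb - 1


def raw_gb_per_vm(vmdk_gb: float, swap_gb: float,
                  lun_overhead: float, headroom: float) -> float:
    """Assumed model of how the overheads stack up.

    vmdk_gb      -- size of the virtual disk
    swap_gb      -- per-VM swap file in the datastore (hypothetical value)
    lun_overhead -- extra SAN space per LUN, e.g. 0.07 to 0.10
    headroom     -- fraction of each datastore deliberately left free
    """
    in_datastore = vmdk_gb + swap_gb
    datastore_gb = in_datastore / (1 - headroom)
    return datastore_gb * (1 + lun_overhead)


if __name__ == "__main__":
    usable_gb = 70_000  # 70 TB, decimal
    print("ideal VMs:     ", vms_supported(usable_gb, 85))
    print("actual VMs:    ", vms_supported(usable_gb, 133))
    print("overhead:       {:.1%}".format(overhead_fraction(85, 133)))
    # Illustrative only: 4 GB swap, 10% LUN overhead, 10% headroom
    print("modelled GB/VM: {:.1f}".format(raw_gb_per_vm(85, 4, 0.10, 0.10)))
```

Even with only those three factors, the modelled footprint lands well above 85 GB, and anything else in the stack pushes it further.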

There's got to be a better way, right?  Thin provisioning might be part of the answer, but I only start to see real gains there by increasing the standard size of the VMFS (thereby increasing the number of VMs I have in each container…), but then I have to start worrying about how many VMs I have per LUN.  Any ideas out there on how I can be more efficient and bring the actual storage needed close to the storage I'm generating revenue from?

6 Responses to Disk Sizing Dilemma

  1. Paulmon says:

    We offer a similar IaaS/Cloud offering built in-house using VMware. I purposely decided not to use VMFS with FC/iSCSI LUNs: too complicated and too inflexible. I elected to use NetApp NFS. With my aggregates and volumes being virtual, I’m not encumbered by the limitations you’re now facing with your LUNs. Combine this with VMware thin-provisioned disks, NetApp deduplication and NetApp features such as FlexClone and, in my experience, you’re left with a very flexible environment that consumes far less disk than a more traditional FC SAN approach.
    We’re seeing ~79% deduplication, which results in considerable cost savings. We then charge the customer based on disk consumed.
    Moreover, because I can mount these NFS exports on any server, we can run reports on the amount of disk consumed per VM, taking into account every piece of the VM including memory, swap, etc. We charge customers for space used on disk, and a simple job server with the required NFS mounts and an interface to our biller calculates that.
    I really believe NFS is the perfect protocol for VMWare deployments in the Cloud/IaaS space.

  2. Great point. We try to balance our offering between performance and value, and I think we err on the side of performance sometimes. Our customers are looking at this as enterprise-class infrastructure, and we try to live up to that expectation, in this case maybe to a fault.
    On the NFS trail, my biggest fear is multi-pathing. How are you providing the VMware hosts with redundant connections to the storage in case of NIC/path/SAN failure? We are moving everything to 10Gb in the next few months, so I’m not as worried about the throughput as I was in the past, but I know how many times multi-pathing has averted a customer-impacting issue and I’m hesitant to give it up.

  3. Paulmon says:

    Storage is done via HP Virtual Connect modules in the blade chassis, so we have active/passive networks handling the storage. Failover is virtually (no pun intended) instantaneous: the VM can’t tell it has happened, and neither can any user on the system. The storage has redundant interfaces, as do the hosts. It’s a total non-issue. I haven’t seen throughput on the storage network become a problem; there are usually many other bottlenecks long before that, and with 10Gb your performance bottleneck becomes the spindles.
    As for your push to enterprise-class infrastructure, don’t knock NFS, particularly over 10Gb. If done right it’s every bit as good as FC, and in this case many times better.
    I believe NetApp snapshots combined with NFS are a huge advantage. There is no performance hit to the filer when the snapshot is created, unlike most competing snapshot technologies. Those snapshots allow us to offer “self service” restore of customer VMs through our portal. When customers sign up they’re able to choose their snapshot level. Because our portal infrastructure and API systems can “see” the NFS storage, we can manipulate the NetApp snapshots of the VMDKs. Again, just more flexibility. Want single file restore for a customer out of a VMDK? No problem: mount the snapshotted VMDK on a dedicated API server, grab the file, download to customer via portal. Better still, expose it as an SMB share to the customer’s VM and have the customer copy it.

  4. This is great feedback, thank you for taking the time. Being able to deal with files rather than filesystems definitely sounds like it has its upside. We’ve always used NFS for ISO stores and such, but I’ll definitely throw some NFS into the lab for the VMs themselves and see what happens. The dedupe and snapshots, combined with a good development team, seem to have given you something that your customers find valuable, and that’s the whole reason for doing this, right?

  5. Paulmon says:

    I look forward to your next blog post about NFS. ;)
    At the end of the day I firmly believe the traditional server hosting business is doomed. It might take 10-15 years, but the cloud WILL replace this model in all but edge cases. There is much that needs to happen between now and then, and some of that is simply educating the customers. Some of it is additional security layers. Some of it is out of necessity: building large customer-facing data centers just isn’t sustainable.

  6. Amen, brotha. The current capital crunch is certainly moving things in that direction quicker, but I believe it’s also giving customers a taste of the value the “enterprise cloud” provides. No matter what the credit markets do in the future, most of the people who have found this business model aren’t ever going back to owning their own infrastructure!