I think I’ve been pretty clear that I’m a fan of boring hardware. I’ve said it publicly, I’ve said it in presentations, I’ve said it on stage: no one cares about hardware, they care that the applications and workloads are available and performing appropriately.

Of course, just because it’s boring doesn’t make it easy, or unrelated to the success of the platform that IT is supposed to be. There’s not a piece of software that has ever been written that makes shitty hardware magically better.

Which leads us to this: http://www.reddit.com/r/vmware/comments/26zbkb/my_vsan_nightmare/

What are the takeaways here? Is this a VSAN issue? Is it a PERC H310 issue? Is there blame that needs to be placed? All of these are good questions, and from what I’ve seen so far, VMware has responded quickly and well to the situation; I expect they will get to the bottom of things with the customer. I’m still a fan of the tech, regardless of how this early-adopter issue turns out. Besides, if you’ve been around long enough, you know that every vendor has an outage at some point. While it’s VMware’s turn in the spotlight today, this definitely isn’t meant as a slight to the company or the VSAN team, many of whom I’ve shared fantastically good scotch with.

The bigger issue for me is the fact that a customer purchased something that was certified and put on an HCL, and that hardware appears to have been completely unsuitable for the job. The best hardware may be boring, but don’t underestimate the amount of time, expertise and investment that goes into the creation of that kind of gear, especially in a multi-vendor, multi-vector stack (compute and storage, for example). If you are not going to include hardware with a software solution, whether that hardware is purpose-built or commodity, you have to invest even more in making sure that customers get the experience you want when they bring whatever hardware they have to the table.

In this case, VMware doesn’t want to include a hardware component with VSAN, although they do work with partners to deliver “Virtual SAN™ Ready Nodes” that can be purchased from the likes of Dell, Cisco, Fujitsu and others. The nodes presumably go through some sort of testing process before VMware stamps them and puts them on the HCL, and then customers are given some level of reassurance that the hardware will indeed run the software.

Best case, that’s the end of it, and the intersection between hardware and software is done at arm’s length.  Ah, but this wasn’t a best case scenario, was it?

My guess is, given all of the information that has come out about the PERC H310 in the last 24 hours, that the customer would like to go back and ask a few more questions rather than just relying on the HCL. First on that list, I’d imagine, is probably: “Sure, the hardware will work, but is it appropriate for my use case?” From the looks of it, the answer may have been no, but there was no way to tell. The HCL says it works, and that’s all the HCL is good for. There’s no real indication of the workloads a particular configuration is suitable for.

The other method of delivering this is to include hardware, tightly coupled to the software, in order to provide a consistent experience for customers. Even in the storage space, there are a number of companies who do this today. Nutanix and Nimble use SuperMicro servers, SimpliVity and SolidFire use Dell, Pure uses Xyratex. In many of these cases, the actual software that is the core of the IP these companies produce will work just fine on many different kinds of hardware (or in public clouds like AWS), but in order to provide the best user experience possible, hardware is included and often required. This isn’t a start-up company thing either! One could argue that even the mighty Vblock is a prescriptive, included, standardized hardware platform tightly coupled to (most of) the VMware platform.

Overall, I don’t know that there’s a right answer, since both approaches have their merits. In one, the HCL is a public-facing document, and inclusion on it becomes something that is largely driven by demand from partners and customers. In the other, the HCL is an internal document, and is there so that the development and support teams can have a firm foundation to work with.

I’m sure there are LOTS of people who have been following the Reddit thread and who are now looking at the VMware HCL in a whole new light. When 50% (4 of 8) of the Dell “Virtual SAN™ Ready Nodes” that are listed include the same controller that seems to have contributed to the issue in question, maybe the HCL isn’t the reassurance they were looking for after all.

Which probably means your hardware isn’t as boring as it should be.

7 Responses to Hardware is Boring–The HCL Corollary

  1. […] way some parts of the VSAN HCL are probably not the best for a production environment. Please read this post by my friend Jeremiah Dooley that shared his thoughts on this […]

  2. […] Some other blogs about this particular case Jeramiah Dooley Hardware is Boring–The HCL Corollary […]

  3. Excellent post, Jeramiah. Without coming out and saying it explicitly, you’ve put your finger on one of my biggest fears when it comes to storage (my own company included, as it’s not really a “storage-centric” vendor).

    One of the things that scares me about “Software-Defined-X” in general – and I’ve written about this regarding OpenStack, but it certainly isn’t limited to that initiative – is that there is a great deal of focus on “hardware vendor lock-in” that seems to ignore completely the reason why hardware is so important in the decision-making process in the first place. Having spent so much time looking at ASIC and board designs, I find the idea that developers can simply work around hardware nuances alarmingly naive.

    I believe you are 100% correct. Using the HCL or any other vendor’s compatibility matrix has moved from being a “first stop” to the “last checkbox.” That is, unfortunately, the reason why “caveat emptor” is – and will always be – a reality in Data Center solutions.

    • I agree. And you and I know there’s nuance on both sides of that discussion. Software-only products can preserve existing operational processes and capital investment, and are therefore valuable to a number of enterprises who wouldn’t otherwise be able to adopt new tech. On the other side, there are vendors who include hardware, not because of the intrinsic value of the hardware being closely aligned with the software, but simply because it’s a business model they can’t get away from.

      The hard part, as always, is figuring out where tightly aligned hardware makes sense (storage, it seems, might be one of those places) and where it doesn’t. Unfortunately the vendors themselves are the worst source of that information. As expected. :-)

      • They can be, from time to time. :) (insert comment about Evaluator Group here… :P )

        I think that what a lot of people don’t realize is that while they search for the Easy Button Holy Grail, the software abstraction hides more and more complexity that requires more sophisticated planning and troubleshooting considerations, not fewer.

        Once you get to a point where you have a decision tree that says, “sometimes you include hardware because it’s part of a business model” and “sometimes you include hardware because it’s the best-in-class for this software application,” you have just re-inserted the requirement that the end user actually has to do *work* to get to the correct answer.

        (Not that I’m saying that’s a *bad* thing of course!)

        I just mean that it does relegate much of the promise of “Software-Defined X” to the category of hype, when all you’re really doing is shifting the work to re-establishing a confidence level up and down the stack, from the physical layer through the application.

        It seems to me that the software-defined movement will likely settle into an “Extra Value Meal” approach for specific application types, similar in concept to the Vblock but software-driven, albeit requiring specific hardware components. All in all, it is unlikely that the utopia of “any software running on any hardware” is a realistic end-state.

        Then again, if you don’t like these ideas, I have others. :)

        J

  4. Chuck Hollis says:

    Great post, as usual. The resolution is up on Reddit for those who care: http://www.reddit.com/r/vmware/comments/2799p4/root_cause_analysis_of_my_vsan_outage/

    • Thanks Chuck! That was exactly what I expected it to be, and kudos to the whole VMware team for how quickly the RCA was completed, and especially for being transparent about the cause. Thanks for posting the link here; it’s no surprise that I’ve seen far less of the resolution link than I did of the original issue. :-)