CRN recently published this interesting article summarizing 10 hot startups in big data. It got me thinking.
But first let’s take a quick trip back in time to the ‘90s.
Until the 90s, applications and servers had local and direct-attached disks. Performance was great, but these islands of locally-connected storage were efficiency killers. Then SANs came along, replacing the disks tucked into each server with shared, centralized pools of storage. As with pretty much everything in IT, centralization generated operational benefits that quickly overshadowed any performance limitations. The momentum of SANs gave birth to a new storage market. Vendors innovated, finding ways to address the performance, cost, and scalability challenges. SAN solutions remain the predominant form of storage in IT today.
I’ll return to why this is important in a bit.
Now let’s jump back to the present day. Big data is now front and center on your storage, if not overall business, agenda. We’ve seen a myriad of NoSQL platforms like Cassandra and MongoDB – as well as the proliferation of Hadoop variants from vendors like Cloudera, Hortonworks, and MapR – gain traction in pilots and small-scale production deployments. In fact, in this blog post, two Gartner analysts gathered anecdata (love that term!) that shows notable jumps in the size of Hadoop clusters with fewer than 10 nodes and between 11 and 50 nodes (see image, courtesy Gartner).
These NoSQL and Hadoop clusters are not like the applications of yesteryear. They’re a new breed, one built with “webscale” or “hyperscale” in its core DNA. Why? Because a new approach is needed to gather, manipulate, and analyze trillions of pieces of structured and unstructured data. With a new generation of big data apps comes the ability for you to predict customer sentiment, drive new business insights, and even develop new business models.
And what’s really interesting is these applications have their own data management! They scale elastically, automatically replicating and protecting data across local disks spread throughout the cluster.
Sound familiar? It should.
It brings us back to the dilemma of islands of storage. And, just like in the 90s, a new storage approach is needed. You don't need to use locally-attached NoSQL storage or Hadoop storage.
And here’s where software-defined storage (SDS) intersects with big data technology.
Let’s imagine you’re responsible for IT infrastructure at your company. Let’s also imagine your company is all over big data – either in production, or in a test/dev environment. Chances are you need to support multiple flavors of big data apps. Maybe your developers need access to all three Hadoop flavors. Or you’ve grown through M&A and now you’re stuck with multiple NoSQL deployments.
This results in unique big data storage requirements, and traditional functions like RAID aren’t a good fit. But with software-defined storage, you can virtualize these applications and underpin them with a single data management platform. Now you’re no longer dealing with islands of storage managed by each application. Instead, you’re benefiting from a consistent provisioning workflow, a radically simplified storage environment, and whitebox economics that let you store big data without it becoming a big cost. You get the benefits SANs delivered in the 90s, but without the proprietary hardware and inflexibility that constrain SANs today.
So if you’re doing big data, should you rush out and buy software-defined storage today? Short answer: Yes. Longer answer: Yes, but be careful.
Not all SDS solutions are built with big data in mind. Just as I discussed with OpenStack storage, you need a unique set of capabilities. Look for SDS platforms that offer:
- Self-service and automated provisioning. Given big data’s current level of maturity, deployments are often limited to particular business units or development groups. As such, their storage requirements are often in flux and subject to frequent changes. To keep your IT ops team from constantly provisioning storage environments, pick an SDS platform that offers self-service and the ability to plug into the automation or orchestration frameworks you choose.
- Granular replication. Many big data apps have built-in replication. That’s great, but if you deploy them atop a platform that also replicates, the copies multiply. For example, you’ll store nine copies of every block (9x the raw capacity) if both HDFS and your storage layer are doing the de facto standard of three-way replication. Look for a platform that lets you tune replication per application, so the storage layer can back off where the application already protects its data.
- Global, in-line deduplication. As mentioned, the primary value of SDS in a big data context is to provide a single underlying data management platform. This drives operational efficiencies. But the other value is in capacity management. Look for a storage solution with in-line deduplication that can be applied across all big data sets. This results in significantly less raw disk capacity required in the cluster.
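To make the replication point concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the 100 TB figure are made up for illustration; the only thing taken from the discussion above is that replication factors at the application layer and the storage layer multiply rather than add.

```python
def raw_capacity_multiplier(app_replicas: int, storage_replicas: int) -> int:
    """Physical copies kept per logical block when both layers replicate.

    Each copy written by the application (e.g. HDFS) is copied again by
    the storage layer, so the two factors multiply.
    """
    if app_replicas < 1 or storage_replicas < 1:
        raise ValueError("replication factors must be at least 1")
    return app_replicas * storage_replicas


def raw_capacity_tb(logical_tb: float, app_replicas: int, storage_replicas: int) -> float:
    """Raw disk capacity (TB) needed to hold logical_tb of data."""
    return logical_tb * raw_capacity_multiplier(app_replicas, storage_replicas)


if __name__ == "__main__":
    logical_tb = 100  # hypothetical data set size

    # Both layers doing three-way replication.
    print(raw_capacity_tb(logical_tb, app_replicas=3, storage_replicas=3))

    # Storage-layer replication turned off for this application.
    print(raw_capacity_tb(logical_tb, app_replicas=3, storage_replicas=1))
```

With these assumed numbers, letting the storage layer back off for an application that already replicates cuts the raw capacity needed to a third. Deduplication would reduce it further, but by how much depends on the data, so it is left out of this sketch.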
Bottom line: Big data applications require a new architectural approach. Make sure you pick a software-defined storage solution that aligns with your big data strategy.
If you'd like to learn more, download this Storage Switzerland whitepaper describing the pros and cons of using direct-attached storage (DAS) versus shared storage for Hadoop implementations.