Storage systems these days implement data protection in more or less three key ways: RAID, Erasure Coding and Replication Factor. It is important to note that while some systems do this at an object or file level (generally object-based storage), most will split the file or data up into manageable chunks and then apply the data protection method on top of this. I won’t discuss this in detail, as it’s needlessly detailed for this high-level overview. I still believe that RAID, Erasure Coding and Replication each have their place, so I'm trying to give a relatively unbiased view on each of these technologies. Please keep in mind I've generalised quite a lot here, and I'm not going into specific implementations, even for us here at Hedvig.
In the video and in the blog below, I provide a fairly detailed primer on RAID, erasure coding, and replication. Feel free to read on or watch the 30-minute whiteboard video.
Storage data protection: RAID
The data protection technology that everyone loves (to hate). RAID is very similar to Erasure Coding in concept, which doesn’t help much given that I cover Erasure Coding second. RAID is a familiar concept to many and is widely discussed on the internet, with many great articles describing it in detail, so I won’t do it an injustice by trying to re-describe it here. There are some generalised points I’d like to clarify for this discussion though (yes, some implementations are different, but this is generalised):
- RAID is set on a disk group level
- RAID rebuilds must rebuild an entire disk from a mix of data and parity
- RAID10 has no parity, so it is the most efficient; it is simply a replica of the write
- Other RAID types use a form of parity
So let’s talk a bit about parity. It’s actually quite a simple concept. At a very basic level, parity is the even-ness or odd-ness of a number, so a data stripe of 0101 would have a parity of 0 (0+1+0+1 is even, so 0), but 1101 would have a parity of 1 (1+1+0+1 is odd, so 1). The more parity bits we add, the greater the overhead but also the greater the data protection, and the actual amount of overhead can be surprising. These are the generally agreed write penalties in the industry:
- RAID0: 1x write penalty
- RAID10: 2x write penalty
- RAID5: 4x write penalty
- RAID6: 6x write penalty
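To make the parity idea described above a little more concrete, here is a minimal Python sketch. It is not how any particular RAID controller works; the three "disks" and their contents are made up for illustration, and `parity_bit` and `xor_blocks` are hypothetical helper names. It shows the even/odd parity of the two stripes from the text, and then how a single XOR parity block lets you rebuild one lost disk from the survivors.

```python
from functools import reduce


def parity_bit(bits):
    """Return 0 if the number of 1 bits is even, 1 if it is odd."""
    return sum(bits) % 2


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# The two stripes used as examples in the text.
print(parity_bit([0, 1, 0, 1]))  # 0
print(parity_bit([1, 1, 0, 1]))  # 1

# Three hypothetical data disks plus one parity disk (illustrative data only).
data_disks = [b"HEDV", b"IG!!", b"RAID"]
parity_disk = xor_blocks(data_disks)

# Pretend the second disk failed: rebuild it from the survivors and the parity.
survivors = [data_disks[0], data_disks[2], parity_disk]
rebuilt = xor_blocks(survivors)
print(rebuilt)  # b'IG!!'
```

Note that the rebuild has to read every surviving disk in full, which is one reason rebuilds get slower as disks get bigger.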
One of the nice things about parity is that it is generally implemented alongside a CRC check too, so it isn’t just resilient against data loss, but can also be used to detect and correct errors. Doing error correction inline for every read is an unmanageable overhead though, so generally it’s just used as a background housekeeping process.
Most storage vendors get around the whole write penalty issue by sticking a big battery- or flash-backed write cache in front, so that writes can be immediately confirmed before the complicated bit of RAID calculation is done. The biggest modern problem with RAID is the time it takes to do a rebuild. It’s sloooooowww! RAID calculations are expensive to do and we’re limited in how many disks we can target for a rebuild because of the RAID group limitations. This prevents many storage vendors from using drives of more than around 2TB because it can take days to do a rebuild.
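To put a write penalty in context, here is a rough Python sketch of the usual back-of-the-envelope sum: the write IOPS the host sees is the raw back-end IOPS divided by the penalty. The disk count and per-disk IOPS figures are made-up assumptions, `effective_write_iops` is a hypothetical helper, and the sketch ignores the write cache mentioned above.

```python
def effective_write_iops(raw_iops, write_penalty):
    """Raw back-end IOPS divided by the disk operations needed per host write."""
    return raw_iops / write_penalty


# Hypothetical disk group: 8 disks at 150 IOPS each (illustrative numbers only).
raw = 8 * 150

for level, penalty in [("RAID0", 1), ("RAID10", 2), ("RAID5", 4)]:
    print(level, effective_write_iops(raw, penalty))
# RAID0 1200.0
# RAID10 600.0
# RAID5 300.0
```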
Storage data protection: Erasure Coding
The method I found most helpful in explaining Erasure Coding was to think of it in human terms. Take the NATO phonetic alphabet: when a message is delivered in a high-loss environment (say a loud gun fight or over a high-static radio), you can still reconstruct it. If all you hear is “ …TEL, ***, ..LTA, VICT…, ..NDI.., …OLF” you should be able to work this out as Hotel, ___, Delta, Victor, India, Golf and reconstruct the message as HEDVIG. This is due to the additional information included in the message. It is more efficient than just shouting “Hedvig” repeatedly until you confirm receipt (which could also be a lot of repeated shouting; think of this as TCP over a really bad network). This is quite a contrived example, but hopefully it makes the concept easier to understand.
Data is typically Erasure Coded with something like a 10/16 ratio (every 10 bits of data is encoded into 16 bits), which is around a 1.5x hit on raw capacity (1.6 to be precise, but 1.5 is easier to calculate and compare quickly). This ratio gives the ability to lose up to 6 parts of the data and still recover it. All data is generally encoded, meaning that the parity is distributed and that any 10 of the remaining blocks can be used to recover the whole data set. The locality of this data is important: in 10/16 terms, if I have more than 6 bits of the data in the same component, I am putting my data at risk of loss. Erasure Coding is generally implemented with RAIN (Redundant Array of Independent Nodes), which means my 16 bits of data are spread across multiple nodes. As I can lose a maximum of 6 bits, my minimum number of RAIN nodes for node redundancy is 3 (16 bits spread over 3 nodes is at most 6 per node, which is exactly what I can afford to lose; over 2 nodes it would be 8 per node, which is too many).
Erasure Coding has a write penalty that depends on the encoding ratio. 10/16 means that for every 10 bits written I’m encoding 16 bits and writing them to disk. This isn’t as simple as a 1.5x(ish) overhead, because the calculation is done in-line, so it also lands on the CPU and memory (I haven’t come across Erasure Coding offload cards yet). Additionally, as all the data bits are encoded, there is also a read penalty. This is important to note: we commonly see Erasure Coding used in backup and archive systems, as well as object-based storage. These aren’t designed to be high-performance tier-1 systems, so the performance overhead is acceptable given the data protection and increased storage efficiency. It is also important to note that most storage systems that use Erasure Coding take the performance penalty into consideration in their sizing, so the performance overhead may be masked by the capabilities of the platform.
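The arithmetic in the paragraph above can be sketched in a few lines of Python. This is only an illustration under one assumption: the encoded parts are spread across nodes as evenly as possible. The function names `raw_overhead` and `min_nodes` are hypothetical, not taken from any product.

```python
import math


def raw_overhead(data_parts, total_parts):
    """Raw capacity used per unit of data, e.g. 16/10 = 1.6."""
    return total_parts / data_parts


def min_nodes(data_parts, total_parts):
    """Smallest node count where losing any single node is still recoverable.

    Assumes parts are spread as evenly as possible, so the busiest node
    holds ceil(total_parts / nodes) parts.
    """
    tolerable = total_parts - data_parts  # parts we can afford to lose
    if tolerable <= 0:
        raise ValueError("no redundancy: every part is needed")
    nodes = 1
    while math.ceil(total_parts / nodes) > tolerable:
        nodes += 1
    return nodes


print(raw_overhead(10, 16))  # 1.6
print(min_nodes(10, 16))     # 3
```

With 2 nodes, each would hold 8 parts, and losing one node would lose more than the 6 parts the 10/16 ratio can tolerate; with 3 nodes the busiest node holds 6, which is just within the limit.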
A very important note here is that this is resilient to erasure (hence the name), that is, missing data. If any part of the data was changed (an error event), erasure coding may not recover from it. The key difference here is Erasure Coding vs. Error Correction. Error Correction is generally handled by creating a hash/checksum of some sort of the original data, or of the data blocks written to disk, so that when they are restored we can confirm they are valid.
A final note on Erasure Coding: it’s actually a very old technology. The Reed-Solomon codes were invented back in the 1960s and thankfully are available in the public domain, so many people implement them. Although it’s one of the more inefficient for writes, it is the best proven in terms of data protection. Writing an Erasure Coding algorithm is no mean feat and you need to be sure that the chance of checksum collisions (which can mean data corruption) and the robustness of the algorithm for rebuilds are well tested. Due to the performance overhead of Reed-Solomon and the fact that most modern Erasure Coding techniques have been patented, everyone implements this slightly uniquely, although the above concepts generally ring true for everyone.
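As a rough illustration of the checksum idea, the Python sketch below stores a CRC32 next to each fragment and re-checks it on read. A fragment that fails the check can then be treated as missing, which is the case erasure coding is designed for. The fragment contents and the `store` and `is_valid` helpers are assumptions made for the example; real systems may well use a different hash.

```python
import zlib


def store(fragment):
    """Keep a CRC32 checksum alongside the fragment when it is written."""
    return fragment, zlib.crc32(fragment)


def is_valid(fragment, checksum):
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(fragment) == checksum


# Hypothetical fragments written to disk.
fragments = [store(b"HOTEL"), store(b"ECHO"), store(b"DELTA")]

# Simulate silent corruption of the second fragment (data changed, not lost).
_, original_crc = fragments[1]
fragments[1] = (b"ECHX", original_crc)

# Fragments that fail the check are flagged so they can be treated as erased.
bad = [i for i, (frag, crc) in enumerate(fragments) if not is_valid(frag, crc)]
print(bad)  # [1]
```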
Storage data protection: Replication
Just to be clear, this is synchronous to the write, not a background process. This is a fairly easy concept: every write is broken up into parts (as with Erasure Coding and RAID) and then each bit of that data is copied to 2 or more locations. The RF (Replication Factor) will depend on either the vendor implementing this, or the level you choose. The overhead here is simple: it is either the number of writes you’re making, or the quorum of writes that satisfies a majority (what we at Hedvig call a majority set). So for an RF of 3, the majority set is 2; for 4 it’s 3; for 5 it’s also 3; and so on.
Replication has the ability (much like Erasure Coding) to be logically replicated and striped across multiple devices and nodes, in a RAIN architecture. This makes data protection very easy, but you also get data locality. I can pull the entire data set from a single location, and in fact I can get it from multiple locations, so there is both a minimal overhead to the writes and a performance boost for the reads. This is less relevant to single-site solutions (although still key if you want close data locality in hyperconverged solutions), but really comes into its own with multi-site, where you can have any workload, in any location, still access the local storage nodes for both reads and writes.
While replication is generally implemented with the addition of a CRC check (or simple checksum), it also comes with a built-in level of error checking that doesn’t need an expensive checksum check. If the replication factor is 3 or above, then we have a quorum: if 1 data bit is flipped or corrupted, we have a quorum of other data bits which show the error, and it can be repaired with a simple new replica.
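The majority set arithmetic above boils down to "more than half". Here is a small Python sketch of it; `majority_set` and `write_acknowledged` are hypothetical names for illustration and not Hedvig's actual API.

```python
def majority_set(replication_factor):
    """Smallest number of replicas that is more than half of the RF."""
    return replication_factor // 2 + 1


def write_acknowledged(acks_received, replication_factor):
    """A write can be confirmed once a majority of replicas have acknowledged."""
    return acks_received >= majority_set(replication_factor)


for rf in (3, 4, 5):
    print(rf, majority_set(rf))
# 3 2
# 4 3
# 5 3

print(write_acknowledged(2, 3))  # True
print(write_acknowledged(2, 4))  # False
```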
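The quorum repair described above can be sketched as a simple majority vote across replicas. This is a minimal Python illustration with made-up replica contents and a hypothetical `repair` function; a real system would compare and repair at a block level rather than on whole values.

```python
from collections import Counter


def repair(replicas):
    """Return replicas with any minority copy replaced by the majority value."""
    value, votes = Counter(replicas).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise ValueError("no majority: cannot tell which copy is correct")
    return [value] * len(replicas)


# RF of 3, with one copy silently corrupted (illustrative data only).
replicas = [b"HEDVIG", b"HEDVIG", b"HEDVIH"]
print(repair(replicas))  # [b'HEDVIG', b'HEDVIG', b'HEDVIG']
```

With an RF of 2 and two differing copies there is no majority, so the sketch raises an error rather than guessing, which is why the text talks about an RF of 3 or above.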
RAID is useful where I have a small number of disks, maybe with a RAID card which can offload the overhead, and I want simple hardware-implemented data protection. A great use case here is the system OS of a physical machine: I don’t want to lose the data, but maybe I don’t want to boot from SAN, so I choose RAID10 for the best protection with minimal overhead.
Erasure Coding is typically great for single site backup, archive and object data. You don’t need a high performance system, but you want to be very efficient in a single location. While geographically distributed erasure coding exists, the problem here is that reads and writes need to go across the wire so the performance impact is exacerbated. Rebuilds are done across an entire RAIN cluster, so these are also very fast even though they do include a parity rebuild. As I've said earlier, this is a generalisation and different vendors work around this in different ways because Erasure Coding can be implemented in many different flavours.
Replication is the modern replacement for legacy RAID systems and we’re seeing this across the industry as modern storage vendors flock to adopt it (in various and sometimes bizarre ways!). The key advantages of Replication are really as follows:
- Minimal or no impact on writes (due to clever caching mechanisms)
- Accelerated reads due to distributed design
- Rapid rebuilds
- Active workloads across multiple sites
- Stretched clustering
- Minimal impact on failure events
If you're interested in learning more about the pros and cons of various data protection mechanisms, I recommend submitting a request and letting us know how we can help. Alternatively, click below for a presentation my colleague, Abhijith Shenoy, recently gave at SNIA.