RAID as we know it is hitting the wall. The growing capacity of hard disks and the ever-longer time to rebuild arrays of such large disks are making it ever more likely that a RAID array will suffer an additional disk failure which loses data. By some calculations, RAID 5, the most common RAID level, is already becoming marginal and RAID 6, which can stand two drive failures, will be marginal by the end of the decade.
Clearly we need a new storage paradigm for protecting the data center.
And we're getting it. A number of companies, from giants like HP to startups like Pure Storage, are busily developing advanced data protection methods. In another 10 years we may still refer to the storage arrays in the data center as RAID, but they will work very differently from what we call “RAID” today, and none of them will follow the definitions of RAID levels taught in today's schools.
The first problem with conventional RAID is that the size of the available drives continues to increase while drive reliability – as measured by the rate of unrecoverable errors per bits read – has pretty much stalled.
As long as disk capacity is increasing rapidly it is hard – and expensive – to increase reliability at the same time while lowering the price. A classic “pick two of three” situation. Since capacity is an easy sell and reliability isn't, guess which gets the short end of the stick?
As drives exceed 3 terabytes (TB), the odds of encountering an error somewhere on the drive go up. Part of the issue is that the reliability of disks hasn't increased in lockstep with the size increases; that makes failures more likely.
The second part of the problem is the way RAID works. When a RAID system suffers a failure on one of its drives, the entire array has to be rebuilt, reading all the drives. As drives get bigger, rebuild times stretch from a few minutes (on the 8 GB drives of a decade ago) to 24 hours or even more. Over the next few years, the rebuild time will stretch to days or weeks. While the system is rebuilding, not only does performance slow but a second failure will result in data loss – at least for simple RAID levels other than RAID 6 (I’ll get to that in a moment).
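For a rough sense of why rebuild windows keep stretching, here is a back-of-the-envelope sketch (my own illustration, not from any vendor) of the minimum time just to stream a single drive end to end at an assumed sustained throughput; a real rebuild that has to share the array with production I/O runs far slower.

```python
# Back-of-the-envelope rebuild-time floor: the time just to read one drive
# end to end. The 100 MB/s sustained throughput is an assumption, and real
# rebuilds that compete with live I/O take considerably longer.

def rebuild_hours(capacity_tb, throughput_mb_s=100):
    """Hours needed to stream capacity_tb terabytes at throughput_mb_s MB/s."""
    return (capacity_tb * 1_000_000) / throughput_mb_s / 3600

for tb in (1, 3, 10):
    print(f"{tb:>2} TB drive: at least {rebuild_hours(tb):.1f} hours")
# An 8 GB drive at the same rate would take under two minutes.
```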
The capper is that one of the basic assumptions in figuring array reliability (as distinct from drive reliability) is usually not quite correct. (For more in-depth information, see Disk failures in the real world, a paper from 2007.) Reliability calculations for RAID arrays typically assume that drive failures are independent. In day-to-day experience, the drives in a RAID array are all likely to come from the same manufacturer and often from the same lot. That implies that when one drive in the array fails, there’s a good chance that one or more of the remaining drives will fail shortly.
What all this comes down to is that RAID as we know it becomes increasingly prone to data loss: as disk capacities and array sizes grow, rebuild times grow with them, which increases the probability of hitting an unrecoverable error (or a second failed drive) before the rebuild completes. (For a real-life example of the problem, see this web host problem, especially the comments.)
Staying Small, Sticking With Quality
One solution is to stick with smaller drives in data center RAID arrays. It's worth noting that enterprise SCSI-class drives have increased in capacity more slowly than desktop-class SATA drives. Today most of them top out at 500-600 gigabytes, in part because of rebuild and reliability issues.
Another way to improve protection is to use more reliable – and more expensive – "commercial grade" drives rather than desktop SATA. SATA drives typically have unrecoverable read error rates on the order of one error per 10^14 bits read, while enterprise SCSI-class drives are generally in the 10^15 to 10^16 range.
This works, but it's expensive and denies the data center the advantage of high capacity drives.
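Those error rates translate directly into the odds that a rebuild will fail. As a quick, hedged illustration of my own (not from any vendor), the sketch below estimates the chance of hitting at least one unrecoverable read error while re-reading a given amount of data, assuming errors occur independently at the quoted rates.

```python
import math

# Chance of at least one unrecoverable read error (URE) while re-reading an
# array during a rebuild, assuming independent errors at a fixed per-bit rate.
# The amounts of data re-read below are illustrative.

def p_ure_during_rebuild(tb_read, bits_per_error):
    bits_read = tb_read * 1e12 * 8                      # terabytes -> bits
    return 1 - math.exp(-bits_read / bits_per_error)    # approximates 1 - (1 - p)^n

for bits_per_error in (1e14, 1e15):
    for tb in (6, 12, 24):
        p = p_ure_during_rebuild(tb, bits_per_error)
        print(f"1 error per {bits_per_error:.0e} bits, {tb:>2} TB re-read: {p:.0%}")
```

At the 10^14 rate, re-reading 12 TB already carries better-than-even odds of tripping over an unrecoverable error, which is exactly the situation a multi-terabyte RAID 5 rebuild creates.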
Other RAID Levels
The short-term fix is to use RAID levels that recover from more than one drive failure. Of the basic RAID levels, only Level 6, which uses the equivalent of two disks per array for parity information, offers protection against a second failure before the array is rebuilt.
However, several of the so-called nested RAID levels can handle multiple failures. Generally these combine RAID Level 1 (mirrored disks) with another level to get added protection. One example is combining Level 5 with Level 1 to produce what is known as Level 51. In a Level 51 array, the disks are divided into mirrored pairs and the data is striped across the mirrors. In this kind of setup, the array can recover from multiple disk failures as long as one drive of each mirrored pair remains. If both drives in one mirrored pair fail, you will lose data. If not, you can survive five or six failures or more without losing data, depending on the number of drives in the array.
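As a small illustration of that survival rule (my own sketch, not vendor code), the check below models the array as mirrored pairs and reports data loss only when both sides of the same pair have failed.

```python
# Minimal sketch of the Level 51 survival rule described above: the array
# survives any combination of failures as long as at least one drive in every
# mirrored pair is still alive. This ignores the striping layer entirely.

def raid51_survives(num_pairs, failed):
    """`failed` is a set of (pair_index, side) tuples, with side 0 or 1."""
    return all(
        not ((pair, 0) in failed and (pair, 1) in failed)
        for pair in range(num_pairs)
    )

# Five failures spread across different pairs: the data survives.
print(raid51_survives(6, {(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)}))  # True
# Just two failures that hit both sides of one pair: the data is lost.
print(raid51_survives(6, {(2, 0), (2, 1)}))                          # False
```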
Level 6 is available from most RAID vendors today, and nested RAID levels like Level 51 are offered by a number of them. The nested levels require a controller that supports them and cut usable storage capacity to half or less of the raw capacity of the drives. This is less of a problem than it once was because drive capacity is getting cheaper.
However, RAID 6 is by definition limited to two disks' worth of parity, which means that a third simultaneous drive failure will lose data.
Further, these RAID levels don't solve the problem of rebuild times. RAID 6 takes more time because there is a second set of parity to recompute. Nested arrays take even longer.
The increased rebuild times of conventional RAID, Level 6, nested and otherwise, cause an additional problem: degraded performance. Rebuilding an array takes a lot of computing power, and advanced RAID levels take more than regular levels. This leaves the storage manager choosing between keeping the array online while it is rebuilt, which slows the rebuild even more, or taking the array offline, which makes the data unavailable for the duration of the rebuild. Of course, modern advanced controllers have on-board microprocessors or ASICs to handle the parity calculations, but the problem still remains.
Fortunately, solutions are available, but most of them aren't RAID in the sense we know it. In fact, several companies are working on advanced data protection schemes that aren't what most people would recognize as RAID (even though some of the vendors call their systems RAID; we'll get to that in a moment).
One way to protect data in arrays is to move to a system that doesn't require as much rebuilding. Commonly this is a version of erasure coding. In erasure coding, the data is divided into M blocks and encoded into a larger number of blocks, N (where N > M), which are scattered across the array. To reconstruct missing data, the controller only has to read any M of the N blocks.
Potentially this is much more resilient, since by choosing the appropriate value of N you can survive multiple drive failures, and (in theory) you can rebuild the array much faster.
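To make the M-of-N idea concrete, here is a minimal, purely illustrative sketch of my own. It uses polynomial interpolation over a small prime field rather than the Reed-Solomon-style codes and optimized arithmetic real arrays use, but the principle is the same: M data symbols are encoded into N symbols, and any M of them are enough to recover the data.

```python
# Toy M-of-N erasure code using polynomial interpolation over GF(257).
# Real systems use Reed-Solomon codes over GF(2^8); this is only a sketch.

P = 257  # a small prime field; each data symbol is one byte (0-255)

def _poly_at(x, points):
    """Evaluate the unique polynomial through `points` at x, mod P (Lagrange)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    """Treat the M data symbols as points x=1..M and append N-M parity symbols."""
    m = len(data)
    points = list(zip(range(1, m + 1), data))
    return points + [(x, _poly_at(x, points)) for x in range(m + 1, n + 1)]

def decode(surviving, m):
    """Recover the original M data symbols from any M surviving symbols."""
    pts = surviving[:m]
    return [_poly_at(x, pts) for x in range(1, m + 1)]

# M=4 data symbols spread over N=6 drives: any two drives can fail.
data = [10, 200, 33, 47]
shares = encode(data, n=6)
survivors = [shares[0], shares[2], shares[4], shares[5]]  # drives 2 and 4 lost
assert decode(survivors, m=4) == data
```

Here the loss of any two of the six symbols, data or parity, is recoverable, which matches the guarantee RAID 6 makes, but the ratio of N to M can be tuned freely.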
“Erasure coding is a different way of constructing parity,” says Ray Lucchesi of Silverton Consulting, a Denver, CO, consultancy. “At one level it looks just like RAID, but you're not using XOR for parity.”
The advantage of erasure coding, Lucchesi says, is “With extremely large disks, erasure code allows you to read less of the disks to rebuild data.” Another advantage of most erasure coding controllers is that they don't waste time trying to rebuild sectors on the drive that don't contain data. This isn't related to the algorithm itself, but it still speeds things up by an average of 50 percent per disk.
A number of vendors have announced storage schemes that go beyond RAID (no matter what their creators call them).
One example is HP 3Par's Utility Storage, which is a kind of RAID array on steroids. Rather than handling storage and resiliency at the disk level, 3Par breaks each disk into thousands of virtual “chunklets” of 256 MB each. The chunklets, rather than the disks, form the operational unit of the 3Par system. Chunklets are combined into micro-arrays, essentially small virtual disk arrays spread over multiple disks just as data is in RAID.
Further, each disk contains spare chunklets which can be used in the event of a failure to hold data from the failed drive. These are called micro spares.
In case it isn't clear, the whole system is virtualized. The 3Par system is a software version of RAID, supported by extra computing power in the controller in the form of a server-class microprocessor (a Xeon in the current version) and ASICs to speed up all the calculations and translations needed to perform reads and especially writes.
In the event of a failure, the effect is confined to the micro-arrays involved. Since these micro-arrays are much smaller than a hard disk, it is much quicker to rebuild them using conventional RAID algorithms, and the system only rebuilds the micro-arrays containing the failed chunklets. This, along with proprietary algorithms, the ASICs on the controllers, striping across all the disks in the array (broad striping), and some other tricks, means that rebuilds are very, very fast.
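To see why that matters for rebuild times, here is a rough sketch of my own; the micro-array width and drive counts are assumptions, not 3Par's figures. When one drive fails, only the chunklets that lived on it need to be rebuilt, and the peer chunklets needed to rebuild them are spread across every surviving drive, so the read load per drive is a small fraction of a full-drive rebuild and proceeds in parallel.

```python
# Rough sketch of chunklet-style rebuild load. When a drive fails, only its
# chunklets are rebuilt, and the peers needed to rebuild them are spread over
# all surviving drives. The 3-peer micro-array width and the drive counts are
# illustrative assumptions, not 3Par's actual figures.

CHUNKLET_MB = 256

def read_load_per_survivor_gb(total_drives, failed_drive_tb, peers_per_chunklet=3):
    chunklets_lost = failed_drive_tb * 1_000_000 / CHUNKLET_MB
    total_read_mb = chunklets_lost * peers_per_chunklet * CHUNKLET_MB
    return total_read_mb / (total_drives - 1) / 1000   # spread over survivors, in GB

# One 2 TB drive failing in a 40-drive system: each survivor reads ~154 GB,
# in parallel, instead of one spare drive absorbing a 2 TB serial rebuild.
print(f"{read_load_per_survivor_gb(40, 2):.0f} GB per surviving drive")
```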
Another example of advanced storage systems comes from Pure Storage, a Mountain View, CA, startup which is building arrays out of solid state disks (SSDs). (The product is in beta.) Pure Storage calls its system RAID 3D and says it is a RAID array. However it is so different from what we think of as RAID it's questionable whether anyone but the marketing department would recognize it as a version of RAID.
RAID 3D uses three different types of error correction to produce a software system that divides the data into small chunks and scatters it over all the drives in the system. “SSDs fail differently,” says Matt Kixmoeller, vice president of products at Pure Storage. Full disk failure, the most common failure mode on hard disks, is less common with SSDs. However, single-bit failures are more common, and SSDs are subject to what Kixmoeller calls “brownouts” which temporarily degrade performance.
Because SSDs have a different physical structure and failure modes than hard drives, the first line of defense is a parity scheme for working around bit errors. The other two types are concerned with cross SSD parity. Like many of these new schemes, RAID 3D is a software RAID that makes heavy use of virtualization.
The differences in RAID 3D really show up in what happens in the event of a failure. First, rather than being divided into sets of drives like large RAID arrays, RAID 3D uses all the drives available for all storage, which greatly increases the parallelism – and hence the speed – of the rebuild. Second, the system examines the failed drive and gets all the data off it that it can, so it only has to rebuild the sectors that actually failed. Because SSDs are much faster than hard drives, and because there is less actual rebuilding to do (as distinct from copying back the uncorrupted sectors of the SSD), rebuild times are very short.
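In outline, that salvage-then-rebuild approach looks something like the toy sketch below. The read_block and reconstruct_block functions are hypothetical placeholders of mine, not Pure Storage's API; the point is simply that most blocks are copied straight off the failing device and only the genuinely unreadable ones are reconstructed from parity.

```python
# Toy sketch of salvage-then-rebuild: copy every block that still reads
# cleanly off the failing SSD, and reconstruct from parity only the blocks
# that actually failed. read_block/reconstruct_block are hypothetical hooks.

def recover_drive(block_count, read_block, reconstruct_block):
    recovered, rebuilt = {}, []
    for block in range(block_count):
        data = read_block(block)             # returns None if the block is unreadable
        if data is not None:
            recovered[block] = data           # salvaged: a straight copy
        else:
            recovered[block] = reconstruct_block(block)  # rebuilt from parity
            rebuilt.append(block)
    return recovered, rebuilt                 # `rebuilt` is usually a small fraction
```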
“After 20 or 30 minutes, we have a completely perfect array with less than 100% capacity,” Kixmoeller says. “The window of vulnerability [to another failure] is much shorter.”
In spite of the higher cost of SSDs, Kixmoeller claims the Pure Storage arrays are actually less expensive than an equivalent RAID array built on enterprise (non-SATA) hard disks. Part of the reason is that the system does extensive deduplication before storing the data.
Why Call It RAID?
Given the way these advanced systems work, the logical question is why vendors insist on calling their products "RAID," especially as the weaknesses of RAID become more obvious. These new systems use different methods of calculating parity (or no parity at all), divide the disks up differently, and stripe broadly and in parallel across the entire set of disks rather than a subset.
One reason is that most of these systems contain some RAID-type ideas, but the decision is also influenced by the comfort factor for data center managers and CIOs. RAID is a known quantity so it helps to sell the systems, even if the hardware and software don't have much in common with what we normally think of as RAID.
But this isn't the first time a tech term has evolved this way. What Bob Metcalfe and the others at Xerox PARC developed as Ethernet doesn't bear much resemblance to the 10 Gig Ethernet of today.