Parity Data Explained

In computer science, parity data is like a magic trick. When students first learn about RAID, the concept is often glossed over, and only the properties of parity data are explained: combined with the surviving drives, parity data can be used to rebuild any single lost drive in the array, even though the parity data takes up only one drive’s worth of capacity. In other words, if you have 10 hard drives in an array, you can use 9 drives’ worth of space for data storage, and the remaining drive’s worth provides redundancy for all the others. How is this possible?

The power of parity data lies in the binary nature of computer data. Let’s imagine a simplified RAID array of four drives that uses striping with parity. The first three drives hold data, and the fourth drive holds parity. To keep things simple, let’s imagine that the stripe size is only one bit, so the first stripe might look like this:

Drive 1: 1
Drive 2: 0
Drive 3: 0
Drive 4: ? (parity, not yet written)

The RAID controller writes the ones and zeroes to the data drives, and when it comes to the fourth drive, it calculates the parity data. It writes either a one or a zero: whichever makes the total number of ones in the stripe even. In this case, since the data drives hold only a single one, the controller writes another one to the parity drive, bringing the total to two.
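The parity rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular RAID controller is implemented; the function name `parity_bit` and the example stripe values are assumptions made for this sketch.

```python
def parity_bit(data_bits):
    """Return the bit that makes the total number of ones even."""
    # An odd count of ones needs one more one; an even count needs a zero.
    return sum(data_bits) % 2


# Hypothetical one-bit stripe: the bits on drives one, two, and three.
stripe = [1, 0, 0]
print(parity_bit(stripe))  # 1
```

Taking the sum modulo 2 is the same as XOR-ing all the bits together, which is why parity is usually described in terms of XOR.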

Now, if any one of the first three drives fails, its contents can be inferred from the parity data and the surviving drives. For example, let’s say the first drive fails:

Drive 1: (failed)
Drive 2: 0
Drive 3: 0
Drive 4: 1 (parity)

The RAID controller knows that the number of ones needs to be even, but there is only a single one left in the stripe. So it infers that drive one contained a one. If drive two or three had failed instead, the remaining bits would already contain an even number of ones, so the controller would know the failed drive contained a zero. And if the parity drive itself fails, no data is lost; the parity bit can simply be recalculated from the data drives. As long as only one drive fails at a time, its contents can be inferred by looking at the parity data and the data on the other drives.
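The recovery step can be sketched the same way. The hypothetical helper below (`recover_missing_bit` is a name chosen for this example) XORs the surviving bits together, which yields whichever bit restores an even number of ones. The stripe values are the assumed 1, 0, 0 data bits with a parity bit of 1.

```python
from functools import reduce
from operator import xor


def recover_missing_bit(surviving_bits):
    """Infer the lost bit from the surviving data and parity bits."""
    # XOR is 1 when the survivors hold an odd number of ones,
    # which means the lost bit must have been a one.
    return reduce(xor, surviving_bits, 0)


# Drive one fails: drives two and three hold 0 and 0, parity holds 1.
print(recover_missing_bit([0, 0, 1]))  # 1

# Drive two fails instead: drives one and three hold 1 and 0, parity holds 1.
print(recover_missing_bit([1, 0, 1]))  # 0
```

The same function also rebuilds a lost parity bit, because the parity bit is just one more member of a set whose ones must add up to an even number.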

And that’s how parity data works! Of course, in a real RAID 5 array, the stripe size would be much larger, and the parity data would be distributed across all the drives rather than kept on one dedicated drive. But the concept is the same. The storage capacity of a RAID 5 array is (n-1)x, where n = the total number of drives and x = each drive’s capacity. Only a single drive’s worth of capacity is needed for parity data, so RAID 5 offers a nice balance of capacity, performance, and redundancy.
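As a quick check of the (n-1)x formula, here is a small sketch. The function name `raid5_usable_capacity` and the 4 TB drive size are assumptions for illustration; the three-drive minimum reflects the fact that RAID 5 needs at least three drives.

```python
def raid5_usable_capacity(num_drives, drive_capacity):
    """Usable capacity of a RAID 5 array: (n - 1) * x."""
    if num_drives < 3:
        raise ValueError("RAID 5 needs at least three drives")
    # One drive's worth of space is consumed by parity data.
    return (num_drives - 1) * drive_capacity


# Ten drives of 4 TB each (assumed size): nine drives' worth is usable.
print(raid5_usable_capacity(10, 4))  # 36
```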

Note that RAID 5 is falling out of use in many modern production environments, often in favor of RAID 10. Hard drives have gotten larger and cheaper, making it more cost-effective to use mirroring rather than parity. RAID 10 combines the performance benefits of striping with the redundancy of mirroring, at the cost of leaving only 50% of the array’s raw capacity usable.