AWS EBS RAID configurations

First things first: Elastic Block Store (EBS) is the AWS service providing block storage for EC2 instances. As announced by AWS in 2017, it has an SLA of at least 99.99%, which roughly means that volumes will probably fail at a rate of about 1 in every 10,000 EBS volumes per year. It can also be backed up “on-the-fly” using the snapshot feature, which creates an EBS snapshot.

But before going any further, I want to explain two important concepts which are used in architecture when talking about backup and recovery strategies: Recovery Point Objective and Recovery Time Objective.

RPO means how old the data can be, once the system has recovered.

RTO means how long it will take for the service to be back up after a failure.

Defining RPO and RTO

Before architecting a system, RPO and RTO are clearly defined by the business owners. These values are hard requirements and are not decided during the design phase.

Recovery Point Objective can be achieved ONLY when:

(RTO + Backup Frequency) < RPO

Common mistake

Less experienced designers believe that if the backup frequency is, let's say, 6 hours, they can comply with an RPO of 6 hours. Unfortunately, it takes time to restore a failed volume, and the estimate of how long it will take should be provided by the operational teams who are actually responsible for the restoration. Note that the number of snapshots kept also matters: the last few may be corrupted as well, if they were taken from an already corrupted volume.
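To make the inequality above concrete, here is a minimal Python sketch that checks a backup plan against it. The function names are made up for this illustration, and the 2-hour restore time is an assumed figure, not a measured one; the 6-hour values come from the example in the text.

```python
def rpo_is_achievable(rto_hours: float, backup_interval_hours: float, rpo_hours: float) -> bool:
    """Apply the rule from the text: (RTO + backup interval) must stay below the RPO."""
    return rto_hours + backup_interval_hours < rpo_hours


def max_backup_interval(rto_hours: float, rpo_hours: float) -> float:
    """Exclusive upper bound on the backup interval; 0 if the RTO alone already reaches the RPO."""
    return max(rpo_hours - rto_hours, 0.0)


if __name__ == "__main__":
    # Assumed scenario: restores take 2 hours, the business asks for an RPO of 6 hours.
    print(rpo_is_achievable(rto_hours=2, backup_interval_hours=6, rpo_hours=6))  # the common mistake
    print(rpo_is_achievable(rto_hours=2, backup_interval_hours=3, rpo_hours=6))  # a tighter schedule
    print(max_backup_interval(rto_hours=2, rpo_hours=6))
```

Under these assumptions, a 6-hour backup schedule fails the check, while any interval shorter than 4 hours passes it.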

In a typical production scenario, RTO is provided by the operational teams based on what they feel comfortable with. It can be affected by the types of shifts they have, how 24/7 support is handled, internal procedures, etc.

Now, coming back to EBS: if a RAID0 configuration is used, on-the-fly snapshots of the EBS volumes become useless. Each volume is snapshotted at a slightly different moment, so the data across the volumes will not be aligned, and the RAID will not be able to recover from multiple non-aligned volumes. To be able to take snapshots of a RAID0 configuration, the applications requiring write access to the disk should be stopped and all data flushed to disk; only then should the snapshots be taken.

Hard rules for using RAID0 (striped mode):

  1. Use RAID0 when RPO is undefined, because you will have to give up volume snapshots or else use some other, more complex backup mechanism.
  2. Use RAID0 when you want to achieve performance higher than what AWS can provide on a single volume. Basically, if you need more than 32,000 IOPS or 16 TB per volume. Nitro-based instances are capable of 64,000 IOPS.
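As a rough sizing aid for the second rule, the sketch below estimates how many volumes a RAID0 stripe would need to reach a target IOPS figure. It assumes that IOPS add up linearly across stripe members and that the instance itself is not the bottleneck; both are simplifications. The function name is hypothetical, and the 32,000 default is the per-volume figure quoted above.

```python
def volumes_needed(target_iops: int, per_volume_iops: int = 32_000) -> int:
    """Smallest number of striped volumes whose combined IOPS reach the target.

    Assumes linear scaling across RAID0 members, which is an idealisation.
    """
    if target_iops <= 0 or per_volume_iops <= 0:
        raise ValueError("IOPS values must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    return (target_iops + per_volume_iops - 1) // per_volume_iops


if __name__ == "__main__":
    for target in (32_000, 64_000, 100_000):
        print(target, "IOPS ->", volumes_needed(target), "volume(s)")
```

If the result is 1, a single volume already covers the requirement and striping only adds the snapshot constraints described above.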

Hard rules for using RAID1 (mirrored mode):

  1. Use RAID1 if your SLA has to be higher than the 99.99% that AWS offers.
  2. Keep in mind that having multiple EBS volumes in the same Availability Zone gives no guarantee that at most one EBS volume will fail at a time. AWS does not yet allow attaching an EBS volume across Availability Zones.
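To see why mirroring can lift availability above the single-volume figure, here is a small back-of-the-envelope calculation in Python. It treats the 99.99% SLA as a per-volume availability and assumes the mirrors fail independently. As the second rule warns, volumes in the same Availability Zone may not fail independently, so the result is best read as an optimistic upper bound. The function name is made up for this sketch.

```python
def mirrored_availability(single_volume_availability: float, mirrors: int = 2) -> float:
    """Availability of a RAID1 set that survives as long as one mirror is up.

    Assumes independent failures, which same-AZ volumes do not guarantee.
    """
    if not 0.0 <= single_volume_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if mirrors < 1:
        raise ValueError("at least one volume is required")
    # The set is unavailable only when every mirror is unavailable at once.
    return 1.0 - (1.0 - single_volume_availability) ** mirrors


if __name__ == "__main__":
    print(f"{mirrored_availability(0.9999, mirrors=2):.8f}")
```

The gap between this idealised figure and what you actually get depends on how correlated the failures are, which is not something the formula can capture.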

What about RAID10?

RAID10 will have a mix of constraints from both RAID0 and RAID1. Volume snapshots become useless; if RPO is a requirement, another mechanism for backups will need to be created. It will not be the cheapest option.

Volumes using XFS as the journaling file system have the advantage of xfs_freeze, which basically freezes all write access to a mounted volume. Used in combination with EBS snapshots, it makes it possible to snapshot striped RAID levels without the volumes being corrupted in case of a recovery. Again, the application will have to be tolerant of a temporary freeze of storage access.
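The freeze-then-snapshot idea can be sketched in Python as follows. This is a minimal illustration, not a production tool: `frozen_filesystem`, `snapshot_while_frozen` and the `start_snapshot` callable are hypothetical names, and `start_snapshot` stands in for whatever you use to start an EBS snapshot of one volume (for example a wrapper around the EC2 API). The only external command used is `xfs_freeze` with its `-f` (freeze) and `-u` (unfreeze) options. The sketch assumes it is enough to hold the freeze until every snapshot has been started.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List, Sequence

Runner = Callable[[Sequence[str]], None]


def _run_checked(cmd: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(list(cmd), check=True)


@contextmanager
def frozen_filesystem(mount_point: str, run: Runner = _run_checked) -> Iterator[None]:
    """Freeze an XFS mount for the duration of the block, then always unfreeze it."""
    run(["xfs_freeze", "-f", mount_point])
    try:
        yield
    finally:
        # Unfreeze even if starting a snapshot failed, so the application is not left blocked.
        run(["xfs_freeze", "-u", mount_point])


def snapshot_while_frozen(
    mount_point: str,
    volume_ids: Sequence[str],
    start_snapshot: Callable[[str], str],  # hypothetical: starts a snapshot, returns its id
    run: Runner = _run_checked,
) -> List[str]:
    """Start a snapshot of every RAID member while writes to the mount are frozen."""
    with frozen_filesystem(mount_point, run):
        return [start_snapshot(volume_id) for volume_id in volume_ids]
```

Passing the command runner and the snapshot function in as parameters keeps the sketch testable without root access or AWS credentials; in real use the defaults would call `xfs_freeze` directly, which requires appropriate privileges.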

What about other RAID configurations like RAID 5 or RAID 6?

Amazon does not recommend these variations, because the usable IOPS of the resulting volume drop by 20-30%: some of the IOPS are consumed by parity operations. The cost also becomes prohibitive; a better ROI can be achieved with off-the-shelf EBS volumes or a combination of RAID0 and/or RAID1 volumes.