Resiliency and Disaster Recovery in AWS

Dinakaran Sankaranarayanan

As we build applications in the cloud, it is very important to design for resiliency, and Disaster Recovery (DR) is one of the main considerations. Last December, AWS had multiple outages for a few services that impacted entire regions, leaving many applications inaccessible during that period. To avoid similar issues in the future, it is important to treat DR as a first-class design consideration.

Understanding Recovery Time Objective (RTO)

The recovery time objective (RTO) is the maximum acceptable time that an application, computer, network, or system can be down after an unexpected disaster, failure, or comparable event takes place. RTO captures the maximum allowable time between the unexpected failure or disaster and the restoration of normal service levels and resumption of typical operations. RTO defines a turning point, after which the consequences of the interruption become unacceptable.

https://www.druva.com/glossary/what-is-recovery-time-objective-definitions-and-related-faqs/

Understanding Recovery Point Objective (RPO)

The recovery point objective (RPO) is the maximum amount of data, as measured by time, that can be lost after recovery from a disaster, failure, or comparable event before the data loss exceeds what is acceptable to an organization. The RPO therefore determines the maximum age of the data or files in backup storage needed to resume normal operations should a network or computer system failure occur.

https://www.druva.com/glossary/what-is-a-recovery-point-objective-definition-and-related-faqs/

Difference between RTO and RPO

The main difference is in their purpose: RTO is focused on time, specifically the downtime of services, applications, and processes, and helps define the resources to be allocated to business continuity, while RPO is focused on the amount of data and exists solely to define backup frequency.

Another relevant difference is that, relative to the moment of the disruptive incident, RTO looks forward in time (the amount of time you need to resume operations), while RPO looks back (the amount of time, or data, you are willing to lose).

https://advisera.com/27001academy/knowledgebase/what-is-the-difference-between-recovery-time-objective-rto-and-recovery-point-objective-rpo/

https://www.youtube.com/watch?v=PurBJoYkh-I

Understanding the requirements of RTO and RPO will help in planning DR and Resiliency.
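
As a rough, hypothetical illustration of how these two targets drive planning, the Python sketch below compares a backup schedule and a measured restore time against assumed RPO/RTO targets; all numbers are made up for the example.

    # Hypothetical RPO/RTO sanity check: all values are illustrative only.
    RPO_MINUTES = 15          # max tolerable data loss, agreed with the business
    RTO_MINUTES = 60          # max tolerable downtime

    backup_interval_minutes = 30   # how often backups/replication snapshots run
    measured_restore_minutes = 45  # time taken in the last DR drill

    # Worst-case data loss is one full backup interval.
    if backup_interval_minutes > RPO_MINUTES:
        print("RPO at risk: back up (or replicate) more frequently")

    if measured_restore_minutes > RTO_MINUTES:
        print("RTO at risk: automate or pre-provision the recovery path")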

DR strategies

DR strategies can be thought of based on the different types of workloads and resources:

  • Compute/Services - managing applications and stateless services
  • Data/Storage - managing different types of data stores and storage services
  • AWS Managed Services - event streams, queues and related resources, Secrets Manager, Systems Manager, etc.
  • Networking - Route 53, VPC, etc.

For each workload, the required DR strategy needs to be identified, the possible options evaluated, and a decision taken that provides the best value proposition.

Most of the time, the most resilient option may be prohibitively expensive or simply not practical. So understanding the trade-offs is crucial.

Approaches for DR

  • Backup and Restore - data and configuration are backed up and restored during a disaster event
  • Active-Passive - data is replicated actively, but services in the standby region are booted on demand, usually on scaled-down infrastructure compared to the active setup (a DNS failover sketch follows this list)
  • Active-Active - all services run in both regions at the same time; this is the most expensive option, but switching over during an outage is easy and simple
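
As one possible way to implement the Active-Passive switch at the DNS layer, the sketch below uses Route 53 failover routing via boto3. The hosted zone ID, record name, endpoints, and health check ID are placeholders, and this is a sketch of the idea rather than a prescribed setup.

    import boto3

    # Sketch: DNS-level active-passive failover with Route 53 (placeholder IDs/names).
    route53 = boto3.client("route53")

    def upsert_failover_record(role, target_dns, health_check_id=None):
        record = {
            "Name": "api.example.com",           # hypothetical record
            "Type": "CNAME",
            "SetIdentifier": f"api-{role.lower()}",
            "Failover": role,                    # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": target_dns}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000000000000000",  # placeholder hosted zone
            ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
        )

    # Primary region answers while healthy; Route 53 flips to the passive region on failure.
    upsert_failover_record("PRIMARY", "lb-primary.us-east-1.example.com", "hc-primary-id")
    upsert_failover_record("SECONDARY", "lb-passive.us-west-2.example.com")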

Cost considerations

Cost is one of the key considerations for DR planning. Architecture and design decisions have to be made with cost in mind, because the most resilient option can be extremely expensive. A balanced approach needs to be considered.

DR for Compute Services

  • Run applications/services in multiple Availability Zones
  • Bootstrap applications/services in a different region whenever there is an outage in the main region (see the sketch after this list)
  • Managing DR for compute services is relatively easy
  • Compute services need not be running in other regions at the same time; during an outage, they can be launched in a different AZ or region based on the need
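
A minimal sketch of the bootstrap approach, assuming a golden AMI in the primary region and boto3; the AMI ID, regions, and instance type are placeholders.

    import boto3

    # Sketch: bootstrap compute in a recovery region from an AMI copied out of the
    # primary region. AMI IDs, region names and instance type are placeholders.
    PRIMARY_REGION = "us-east-1"
    DR_REGION = "us-west-2"

    ec2_dr = boto3.client("ec2", region_name=DR_REGION)

    # Copy the golden AMI into the DR region ahead of time (safe to re-run periodically).
    copy = ec2_dr.copy_image(
        Name="app-golden-ami-dr",
        SourceImageId="ami-0123456789abcdef0",  # placeholder AMI in the primary region
        SourceRegion=PRIMARY_REGION,
    )
    # Wait until the copied image is usable in the DR region.
    ec2_dr.get_waiter("image_available").wait(ImageIds=[copy["ImageId"]])

    # During an outage, launch instances from the copied AMI in the DR region.
    ec2_dr.run_instances(
        ImageId=copy["ImageId"],
        InstanceType="t3.medium",
        MinCount=1,
        MaxCount=2,
    )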

DR for Data Stores and Managed Services

When it comes to data stores and managed services, the approach and strategies can vary depending on the type of workload for which DR is being planned. Some commonly available options:

  • Auto-backup and restore of data
  • Enabling Multi-AZ within a region
  • Enabling Multi-Region for resiliency across regions
  • Fallback strategies when a service is not available - for example, if a cache datastore is down due to an outage, applications/services can fall back to the underlying source of truth (a sketch of this pattern follows this list)
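
As a rough illustration of the cache fallback, the sketch below assumes a Redis cache accessed through the redis-py client; the host name and the fetch_from_database() helper are hypothetical.

    import redis

    # Sketch: fall back to the source-of-truth store when the cache is unreachable.
    # The host name and fetch_from_database() are hypothetical.
    cache = redis.Redis(host="my-redis.example.internal", port=6379, socket_timeout=0.2)

    def fetch_from_database(key):
        # Placeholder for the real read against the underlying datastore.
        return {"key": key, "source": "database"}

    def get_value(key):
        try:
            cached = cache.get(key)
            if cached is not None:
                return cached
        except redis.exceptions.RedisError:
            # Cache outage: degrade gracefully instead of failing the request.
            pass
        # Miss or outage: read from the source of truth (consider rate limiting,
        # see "Impact on Downstream systems" later in this post).
        return fetch_from_database(key)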

For some of the AWS resources that I have used, the following are the DR approaches available

ElastiCache Redis

  • Enable Multi-AZ
  • Back-Up and Restore option
  • Enable Global Datastore for cross-region replication (see the sketch after this list)
  • Fall back to the underlying source of truth when Redis is not available
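
A sketch of enabling Global Datastore through the ElastiCache API with boto3, assuming an existing replication group; the names are placeholders and the parameters should be verified against the current API documentation.

    import boto3

    # Sketch (hypothetical names): promote an existing Redis replication group into a
    # Global Datastore so a secondary region can hold a replica of the data.
    elasticache = boto3.client("elasticache", region_name="us-east-1")

    elasticache.create_global_replication_group(
        GlobalReplicationGroupIdSuffix="orders-cache-global",  # placeholder suffix
        PrimaryReplicationGroupId="orders-cache",              # existing replication group
        GlobalReplicationGroupDescription="DR replica of the orders cache",
    )
    # A secondary replication group is then attached in the DR region (via the
    # console, infrastructure-as-code, or the corresponding ElastiCache API call).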

Aurora MySQL

  • By default, Aurora replicates storage across multiple Availability Zones
  • Backup and restore options can be considered; ensure a copy of the backup is available in a different region in case the data has to be brought up there (see the sketch after this list)
  • Enable Aurora Global Database for a multi-Region setup with near real-time data replication
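
A sketch of the cross-region snapshot copy and restore path with boto3; the snapshot ARN, identifiers, and regions are placeholders, and encrypted snapshots would additionally need a KMS key in the target region.

    import boto3

    # Sketch: copy an Aurora cluster snapshot into the DR region and restore from it.
    # Identifiers and regions are placeholders.
    rds_dr = boto3.client("rds", region_name="us-west-2")

    # Copy a snapshot out of the primary region into the DR region.
    rds_dr.copy_db_cluster_snapshot(
        SourceDBClusterSnapshotIdentifier=(
            "arn:aws:rds:us-east-1:123456789012:cluster-snapshot:orders-nightly"
        ),
        TargetDBClusterSnapshotIdentifier="orders-nightly-dr",
        SourceRegion="us-east-1",
    )

    # During a regional outage, restore a cluster from the copied snapshot.
    rds_dr.restore_db_cluster_from_snapshot(
        DBClusterIdentifier="orders-cluster-dr",
        SnapshotIdentifier="orders-nightly-dr",
        Engine="aurora-mysql",
    )
    # Note: this restores the cluster; DB instances still need to be created in the
    # restored cluster before it can serve traffic.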

DynamoDB

  • Multi-AZ enabled by default
  • Backup and Restore can be considered (see the sketch after this list)
  • Global DynamoDB tables for Multi-Region setup
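
A minimal sketch of on-demand backup and restore with boto3; the table and backup names are placeholders, and point-in-time recovery is a separate, continuous alternative.

    import boto3

    # Sketch: on-demand backup and restore for a DynamoDB table (placeholder names).
    dynamodb = boto3.client("dynamodb")

    # Take an on-demand backup of the table.
    backup = dynamodb.create_backup(TableName="orders", BackupName="orders-pre-release")

    # During recovery, restore the backup into a new table.
    dynamodb.restore_table_from_backup(
        TargetTableName="orders-restored",
        BackupArn=backup["BackupDetails"]["BackupArn"],
    )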

Event Streams

Kinesis

  • It is Multi-AZ by default
  • No Multi-Region setup is available out of the box
  • If avoiding data loss is the main consideration, the recommendation is to push events to streams in more than one region (see the sketch after this list)
  • Persist some of the event metadata in a datastore for backup and retry purposes
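
A sketch of the dual-region publish with metadata persistence, assuming boto3, a stream of the same name in each region, and a DynamoDB metadata table; all names are placeholders and per-region error handling is omitted for brevity.

    import json
    import uuid
    import boto3

    # Sketch: dual-write events to Kinesis streams in two regions and keep minimal
    # event metadata in DynamoDB for replay. Stream/table names are placeholders.
    kinesis_primary = boto3.client("kinesis", region_name="us-east-1")
    kinesis_dr = boto3.client("kinesis", region_name="us-west-2")
    metadata_table = boto3.resource("dynamodb", region_name="us-west-2").Table("event-metadata")

    def publish(event: dict):
        event_id = str(uuid.uuid4())
        payload = json.dumps(event).encode("utf-8")
        # Write the same event to both regions (per-region error handling omitted).
        for client in (kinesis_primary, kinesis_dr):
            client.put_record(StreamName="orders-events", Data=payload, PartitionKey=event_id)
        # Persist just enough metadata to detect gaps and replay after an outage.
        metadata_table.put_item(Item={"event_id": event_id, "type": event.get("type", "unknown")})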

SQS

  • It is Multi-AZ by default
  • No Multi-Region setup is available out of the box
  • Push messages to queues in more than one region if avoiding data loss is the main consideration (a short sketch follows this list)
  • Persist some of the message metadata in a datastore for backup and retry purposes
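
The same pattern can be sketched for SQS, again with placeholder queue URLs.

    import boto3

    # Sketch: send each message to queues in two regions (placeholder queue URLs).
    sqs_primary = boto3.client("sqs", region_name="us-east-1")
    sqs_dr = boto3.client("sqs", region_name="us-west-2")

    QUEUE_URLS = {
        "us-east-1": "https://sqs.us-east-1.amazonaws.com/123456789012/orders",
        "us-west-2": "https://sqs.us-west-2.amazonaws.com/123456789012/orders-dr",
    }

    def send(message_body: str):
        sqs_primary.send_message(QueueUrl=QUEUE_URLS["us-east-1"], MessageBody=message_body)
        sqs_dr.send_message(QueueUrl=QUEUE_URLS["us-west-2"], MessageBody=message_body)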

Testing Disaster Recovery

Chaos Testing/Chaos Engineering is required to ensure the Multi-AZ/Multi-Region setup works as expected. This approach was pioneered by Netflix’s Chaos Monkey and is now being widely adopted across the industry as part of resiliency engineering.

https://netflix.github.io/chaosmonkey/

https://en.wikipedia.org/wiki/Chaos_engineering
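
As a flavour of what such an experiment can look like, the sketch below uses boto3 to stop a random running instance that has explicitly opted in through a tag; the tag key and region are assumptions made for this illustration.

    import random
    import boto3

    # Sketch of a chaos-monkey-style experiment: stop a random instance that is
    # explicitly opted in via a tag. Tag key/value and region are placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos-opt-in", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]

    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instances:
        victim = random.choice(instances)
        print(f"Stopping {victim} to verify the Multi-AZ setup recovers")
        ec2.stop_instances(InstanceIds=[victim])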

Other key considerations

Latency

Whenever there is an outage, the fallback approaches being considered may increase latency, and this needs to be factored into the design. In practice, it means the backup AZs/regions need to be in relative proximity to the customers so that latency stays well under control.

Impact on Downstream Systems

During an outage, if a cache is not available and the plan is to call the downstream systems or datastores to get the actual data, the extra traffic may overload those downstream systems and services. This aspect needs to be well understood, and the fallback path may need its own backpressure.
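
A minimal sketch of one way to add that protection, capping concurrent fallback reads with a semaphore; the limit and the fetch_from_database helper are hypothetical.

    import threading

    # Sketch: cap concurrent fallback calls so a cache outage does not translate into
    # an overload of the downstream datastore. The limit and helper are hypothetical.
    FALLBACK_LIMIT = threading.BoundedSemaphore(value=20)  # max concurrent fallback reads

    def fetch_with_backpressure(key, fetch_from_database):
        # Non-blocking acquire: shed load instead of queueing unbounded work.
        if not FALLBACK_LIMIT.acquire(blocking=False):
            raise RuntimeError("Fallback capacity exhausted; serve stale data or fail fast")
        try:
            return fetch_from_database(key)
        finally:
            FALLBACK_LIMIT.release()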

Understanding outage of Control Plane vs Data Plane

The data plane is the part of an AWS service that performs its core, day-to-day function, for example Route 53 answering DNS queries or S3 serving existing objects.

The control plane is the management layer used to create and modify resources, whether via the web console, the API, or the CLI. For some global services such as Route 53, S3, and IAM, the control plane is hosted in the US-EAST-1 region only. If that region goes down and the control plane is affected, it becomes very difficult to switch or re-route services through the web console. Hence it makes sense to pre-provision the failover configuration where possible (data-plane mechanisms such as Route 53 health-check based failover keep working during a control-plane outage) and to have scripts ready so that anything still reachable via API or CLI does not depend on console access.

Disaster Recovery is a huge topic, and I have only covered the broad strokes of how it can be planned. It is not possible to implement DR 100% correctly the very first time; it is an ongoing process, and fine-tuning will be required as we learn new ways in which things can fail.