Disaster Recovery (DR) is a critical strategy for ensuring business continuity during unexpected events. This guide will walk you through setting up a DR environment using Amazon Web Services (AWS) and Elastic Kubernetes Service (EKS), mirroring your production environment to ensure seamless operations in case of a disaster.
Before beginning the DR setup process, ensure you have the following:
- An AWS account with administrative permissions
- Terraform (version 1.2.0 or higher) installed on your local machine
- AWS Command Line Interface (CLI) installed and configured
- Access to the following GitHub repositories:
- Basic familiarity with command-line operations
- Terraform configurations for the DR environment are located in the `DRs-Prod` directory of the Terraform repository.
- Helm charts and application manifests for the DR environment are in the `drs-prod-vi` directory of the Helm chart repository.
a. Clone the repositories:
git clone https://github.com/viplatform/aws-terraform.git
git clone https://github.com/viplatform/helmcharts.git
b. Navigate to the DR environment directories:
- For Terraform:
cd aws-terraform/DRs-Prod
- For Helm charts:
cd helmcharts/drs-prod-vi
c. Review and familiarize yourself with the contents of these directories.
d. Ensure you have the necessary permissions to make changes to these repositories and create/modify resources in your AWS account.
e. Set up your local environment with the required tools (AWS CLI, Terraform, kubectl, helm, etc.) as detailed in the following sections.
- Open Terminal on your Mac.
- Install Homebrew if not already installed:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
- Add Hashicorp tap:
brew tap hashicorp/tap
- Install Terraform:
brew install hashicorp/tap/terraform
- Verify the installation:
terraform -version
Ensure the version is 1.2.0 or higher.
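The version requirement can also be checked mechanically rather than by eye. The following is a minimal sketch: the helper names `version_ge` and `terraform_version` are illustrative (they are not part of Terraform), and the parsing assumes the first line of `terraform -version` has the form `Terraform vX.Y.Z`.

```bash
# Returns 0 when version $1 is greater than or equal to version $2.
# Both arguments are expected in MAJOR.MINOR.PATCH form.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    split(have, h, "."); split(want, w, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > w[i] + 0) exit 0
      if (h[i] + 0 < w[i] + 0) exit 1
    }
    exit 0
  }'
}

# Hypothetical helper: extracts "X.Y.Z" from the first line of
# `terraform -version` (assumed to look like "Terraform vX.Y.Z").
terraform_version() {
  terraform -version | head -n 1 |
    sed -E 's/^Terraform v([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# Usage:
#   version_ge "$(terraform_version)" 1.2.0 || echo "Terraform is too old" >&2
```

Comparing the fields numerically avoids the trap of string comparison, where `1.10.0` would sort before `1.2.0`.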
- Open Terminal.
- Run the AWS configure command:
aws configure
- When prompted, enter the following information:
- AWS Access Key ID
- AWS Secret Access Key
- Default region name (e.g., us-east-1)
- Default output format (leave blank for default)
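After running `aws configure`, it can help to confirm that the credentials actually work and that the default region is the one you expect. The sketch below assumes the AWS CLI is on your `PATH`; the function name `check_aws_setup` is illustrative only.

```bash
# Prints the account ID and default region the AWS CLI resolves to,
# and fails if the credentials do not work or the region is unexpected.
check_aws_setup() {
  local expected_region="$1" account region
  # Fails when the access key or secret key is missing or invalid.
  account=$(aws sts get-caller-identity --query Account --output text) || return 1
  # Fails when no default region has been configured.
  region=$(aws configure get region) || return 1
  echo "account=$account region=$region"
  [ "$region" = "$expected_region" ]
}

# Usage:
#   check_aws_setup us-east-1 || echo "AWS CLI is not set up for the DR region" >&2
```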
- Open Terminal.
- Navigate to the directory where you want to clone the repository.
- Clone the repository:
git clone <your_organization's_github_repo_url>
- Navigate into the cloned repository:
cd <repo_directory>
- Open the `backend.tf` file in a text editor.
- Ensure it contains the following configuration:
terraform {
  backend "local" {
    path = "terraform.tfstate"
  }
}
- Open the `provider.tf` file in a text editor.
- Ensure it contains the following configuration:
provider "aws" {
  region = var.region
}

variable "region" {
  description = "AWS Region"
  type        = string
  default     = "us-east-1"
}
Ensure your Terraform configuration includes a VPC setup that mirrors the production environment. This typically involves:
- Defining the VPC CIDR block.
- Creating public and private subnets.
- Setting up an Internet Gateway.
- Configuring route tables.
Configure a Bastion Host in your Terraform files:
- Define an EC2 instance in a public subnet.
- Configure security groups to allow SSH access.
- Add a user data script to install necessary tools (Helm, AWS CLI, kubectl).
Set up an EKS cluster (version 1.28) in your Terraform configuration:
- Define the EKS cluster resource.
- Configure node groups.
- Set up necessary IAM roles and policies.
Cross Region Replication for ECR is already configured and operational:
- The production backend core ECR repository (`prod-be-core`) is replicated to the DR region.
- Repository URL: `832726350816.dkr.ecr.us-east-1.amazonaws.com/prod-be-core`
AWS Secrets Manager replication is fully configured:
- All necessary secrets are being replicated to `us-east-1`.
- The DR environment is set up to use these replicated secrets.
Cross Region Replication for S3 is in place:
- Both internal and external system buckets are being replicated to `us-east-1`.
- Buckets with the prefix "drs" are used for backend applications in the DR environment.
RDS Cross Region Replication is active:
- The production database is being replicated to the DR region.
- The DR environment is configured to use this replicated database.
- Log in to the AWS Management Console.
- Switch to the DR region (e.g., us-east-1).
- Navigate to AWS Backup:
  - Go to the AWS Backup console.
  - In the left navigation pane, choose "Backup vaults".
- Locate the backup:
  - Select the appropriate backup vault.
  - Find the most recent RDS backup for your production database.
- Initiate the restore process:
  - Select the backup you want to restore.
  - Click "Restore" to start the restoration process.
- Configure the restore settings:
  - Choose "Restore to new RDS database" as the restore type.
  - Select the appropriate DB engine version (should match your production version).
  - Choose the DB instance class (match or scale as needed for DR).
- Set up network & security:
  - VPC: Select the VPC created by Terraform for your DR environment.
  - Subnet group: Choose the subnet group for your DR environment.
  - Public accessibility: Typically set to 'No' for security.
  - VPC security group: Select the security group created for RDS in your DR environment.
- Configure instance settings:
  - DB instance identifier: Give it a unique name (e.g., "drs-prod-db").
  - Set the master username and password.
- Additional configuration:
  - Set parameters like backup retention, monitoring, etc., to match your DR requirements.
- Review and initiate restore:
  - Review all settings.
  - Click "Restore DB instance".
- Monitor the restore process:
  - Go to the RDS console.
  - Find your new DR database instance.
  - Wait for the status to change to "Available".
- Update your application configurations:
  - Once the database is available, go to its details page in the RDS console.
  - Copy the endpoint address.
  - Update your application manifests or environment variables with this new endpoint.
- Verify database accessibility:
  - From your bastion host or an application instance, try connecting to the new database to ensure it's accessible.
- (Optional) Set up ongoing replication:
  - If continuous replication from production to DR is required, consider setting up AWS DMS (Database Migration Service) for ongoing replication.
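The "monitor the restore" and "copy the endpoint" steps can also be done from the command line instead of the console. The following is a sketch only: the function name `rds_endpoint` is illustrative, and `drs-prod-db` is simply the example identifier used in the steps above.

```bash
# Blocks until the restored instance is available, then prints its endpoint address.
rds_endpoint() {
  local db_id="$1" region="$2"
  # Polls the instance until its status is "available" (or the waiter times out).
  aws rds wait db-instance-available \
    --db-instance-identifier "$db_id" --region "$region" || return 1
  # Prints only the hostname that applications should connect to.
  aws rds describe-db-instances \
    --db-instance-identifier "$db_id" --region "$region" \
    --query 'DBInstances[0].Endpoint.Address' --output text
}

# Usage:
#   rds_endpoint drs-prod-db us-east-1
```

The printed hostname is the value to place in your application manifests or environment variables.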
Set up an MSK cluster in your Terraform configuration:
- Define the MSK cluster resource.
- Configure topics and partitions to match production.
Configure a Redis ElastiCache cluster in your Terraform files:
- Define the ElastiCache cluster resource.
- Set up subnet groups and security groups.
- Log In to the AWS Management Console:
  - Go to the AWS Management Console and log in.
- Select the DR Region:
  - Choose the appropriate DR region from the region dropdown in the top-right corner.
- Open ACM Console:
  - Navigate to the AWS Certificate Manager (ACM) Console.
- View Certificates:
  - Review the list of certificates to check if there are any for `*.api.virtualinternships.com` or `*.viplatform.net`.
- Check Details:
  - Click on the certificate to verify that the domain names match `*.api.virtualinternships.com` or `*.viplatform.net` and ensure the status is "Issued."
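The same certificate check can be scripted with the AWS CLI. This is a minimal sketch; the function name `issued_cert_arns` is illustrative. It matches only a certificate's primary domain name, so a certificate that lists the wildcard as an additional name would not be found this way.

```bash
# Prints the ARNs of issued ACM certificates whose primary domain is $1 in region $2.
issued_cert_arns() {
  local domain="$1" region="$2"
  aws acm list-certificates --region "$region" \
    --certificate-statuses ISSUED \
    --query "CertificateSummaryList[?DomainName=='${domain}'].CertificateArn" \
    --output text
}

# Usage (quote the wildcard so the shell does not expand it):
#   issued_cert_arns '*.viplatform.net' us-east-1
#   issued_cert_arns '*.api.virtualinternships.com' us-east-1
```

An empty result suggests that no issued certificate with that primary domain exists in the region, in which case a new one needs to be requested.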
- Open ACM Console:
  - Ensure you are in the correct region where you need the certificate.
- Request a Certificate:
  - Click “Request a certificate” in the ACM dashboard.
- Choose Certificate Type:
  - Select “Request a public certificate.”
- Enter Domain Names:
  - For a wildcard certificate, enter `*.api.virtualinternships.com` or `*.viplatform.net` as the domain names.
  - You can enter both patterns if you need certificates for both domains.
- Choose Validation Method:
  - DNS Validation: ACM will provide CNAME records that need to be added to your DNS settings.
  - Email Validation: ACM will send validation emails to the domain’s registered contacts.
- Review and Confirm:
  - Confirm the request by reviewing the details and clicking “Confirm and request.”
- Add DNS Records (if using DNS Validation):
  - For DNS validation, add the provided CNAME records to your DNS provider. Once done, return to the ACM console to continue the validation process.
- Monitor Status:
  - Track the status of the certificate request. It will be marked as “Issued” once validation is successful.
- Use the Certificate:
  - Associate the issued certificate with your resources such as CloudFront distributions or Elastic Load Balancers.
- Wildcard Certificates: Wildcard certificates like `*.api.virtualinternships.com` or `*.viplatform.net` cover every subdomain exactly one level below the named domain. They do not cover the bare domain itself or deeper levels, so make sure the hostnames you serve match the pattern.
- Domain Ownership: Ensure you have control over the DNS records for the domains you are requesting certificates for, as you'll need to complete domain validation.
- Open Terminal.
- Navigate to your Terraform directory:
cd aws-terraform/DRs-Prod
- Initialize Terraform:
terraform init
- Preview the changes:
terraform plan
- Apply the changes:
terraform apply
- When prompted, type 'yes' to confirm the changes.
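One common variation on the plan/apply steps is to save the plan to a file and apply exactly that file, so that what gets applied is what was reviewed. The sketch below assumes it is run from the `DRs-Prod` directory; the function name `apply_reviewed_plan` and the default file name `tfplan` are illustrative choices.

```bash
# Runs init and plan, saves the plan to a file, and applies that exact plan
# only after the operator types "yes".
apply_reviewed_plan() {
  local plan_file="${1:-tfplan}" answer
  terraform init -input=false || return 1
  # -out stores the plan so the later apply cannot drift from what was shown.
  terraform plan -input=false -out="$plan_file" || return 1
  printf 'Apply the plan saved in %s? Type yes to continue: ' "$plan_file"
  read -r answer
  if [ "$answer" != "yes" ]; then
    echo "Aborted; nothing was applied."
    return 1
  fi
  # Applying a saved plan file does not prompt for confirmation again.
  terraform apply -input=false "$plan_file"
}

# Usage:
#   cd aws-terraform/DRs-Prod && apply_reviewed_plan
```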
- Check AWS resources:
- Log into the AWS Console.
- Verify that all expected resources (VPC, EKS, ECR, S3, etc.) are created in the DR region.
- Verify EKS cluster:
aws eks --region us-east-1 describe-cluster --name drs-prod-vi
- Update kubeconfig:
aws eks --region us-east-1 update-kubeconfig --name drs-prod-vi
- Check nodes:
kubectl get nodes
- Verify other AWS services:
- Check ECR repositories.
- Verify S3 bucket creation.
- Ensure Secrets Manager secrets are replicated.
- Review security groups and IAM roles to ensure proper access and permissions.
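The cluster checks above can be combined into one scripted pass. This is a sketch under the assumption that both the AWS CLI and `kubectl` are installed; the function name `verify_dr_cluster` is illustrative, and the cluster name and region in the usage line are the examples used in this guide.

```bash
# Confirms the EKS cluster is ACTIVE, refreshes kubeconfig, and waits for all nodes.
verify_dr_cluster() {
  local cluster="$1" region="$2" status
  status=$(aws eks describe-cluster --name "$cluster" --region "$region" \
    --query 'cluster.status' --output text) || return 1
  if [ "$status" != "ACTIVE" ]; then
    echo "cluster $cluster is $status, expected ACTIVE" >&2
    return 1
  fi
  aws eks update-kubeconfig --name "$cluster" --region "$region" || return 1
  # Fails if any node has not reported Ready within five minutes.
  kubectl wait --for=condition=Ready nodes --all --timeout=300s
}

# Usage:
#   verify_dr_cluster drs-prod-vi us-east-1
```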
- Open Terminal.
- Run the following command:
aws eks update-kubeconfig --name drs-prod-vi
Replace "drs-prod-vi" with your actual EKS cluster name.
- If you encounter connection issues, you may need to update the cluster's security group (listed under the cluster's networking settings in the EKS console) to allow inbound traffic from the CIDR range you are connecting from.
- Add the Argo CD Helm repository:
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
- Install Argo CD (run from the root of the cloned `helmcharts` repository, so that the `drs-prod-vi/argocd-values.yaml` path resolves):
helm upgrade --install argocd argo/argo-cd --version 5.27.1 --namespace argocd --values drs-prod-vi/argocd-values.yaml
- Get the initial Argo CD password:
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
- Set up port forwarding:
kubectl port-forward service/argocd-server -n argocd 8080:443
- SSH to your Bastion host, forwarding local port 8080 to port 8080 on the host:
ssh -i drs-production-bastion.pem ubuntu@<bastion_ip> -L 8080:localhost:8080
- Open a web browser and go to `http://localhost:8080`.
- Log in using the username 'admin' and the password obtained earlier.
- Navigate to 'Settings' > 'Repositories'.
- Add your Helm chart repository.
- Create applications for each of your services.
- Log in to the AWS Management Console and navigate to IAM.
- Review all roles created for the EKS DRs cluster.
- For each relevant role:
  a. Check the permissions to ensure they match your DR requirements.
  b. Update the trust relationships if necessary.
- Specifically, check the drs-AWSSecretManagerAccessProd Role:
  a. Go to the role's "Trust relationships" tab.
  b. Edit the trust relationship.
  c. Ensure the following line is present and correct:
  "oidc.eks.us-east-1.amazonaws.com/id/CE4F7AC97FCC50606C1AF23D17D0C463:sub": "system:serviceaccount:default:sa-eso"
  d. If it's not present or needs updating, add or modify it accordingly.
  e. Save the changes.
- Navigate to your Helm charts repository.
- Locate the `drs-prod-vi` directory.
- Review and update all Helm charts and manifests in this directory:
  a. Ensure all references to IAM roles are correct and match the roles you reviewed.
  b. Update any environment-specific configurations to match your DR setup.
  c. Verify that all necessary changes from the production environment are reflected here.
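When checking IAM role references, note that the `sub` condition in a role's trust relationship embeds the OIDC provider ID of one specific EKS cluster, so a rebuilt DR cluster will generally need a different value. The sketch below derives the expected line from a cluster's OIDC issuer URL. The function name `trust_condition_line` is illustrative, and the namespace `default` and service account `sa-eso` in the usage example are taken from the trust-relationship entry shown in this guide.

```bash
# Builds the trust-policy condition entry for a service account,
# given the cluster's OIDC issuer URL, a namespace, and a service account name.
trust_condition_line() {
  # The condition key is the issuer URL without its "https://" scheme.
  local issuer="${1#https://}" namespace="$2" service_account="$3"
  printf '"%s:sub": "system:serviceaccount:%s:%s"\n' \
    "$issuer" "$namespace" "$service_account"
}

# Usage:
#   issuer=$(aws eks describe-cluster --name drs-prod-vi --region us-east-1 \
#     --query 'cluster.identity.oidc.issuer' --output text)
#   trust_condition_line "$issuer" default sa-eso
```

Comparing the printed line with what the IAM console shows for the role is a quick way to spot a stale provider ID.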
It's crucial to update the Kubernetes manifests for your DR environment. Focus on the following files in the `drs-prod-vi` directory:
- Review Region-Specific Configurations:
  - Open `cluster-autoscaler.yaml`, `external-secret.yaml`, and `aws-ebs-csi-driver.yaml`
  - Locate the `AWS_REGION` or `REGION` environment variable or any region-specific parameters
  - Ensure these are set to the DR region (e.g., `us-east-1`)
- Update Cluster Name in Load Balancer Controller:
  - Edit `aws-load-balancer-controller.yaml`
  - Find the `--cluster-name` flag in the controller arguments
  - Verify it matches your DR EKS cluster name (e.g., `drs-prod-vi`)
- Verify SSL Certificate ARN:
  - In all ingress resource definitions, check the annotation: `alb.ingress.kubernetes.io/certificate-arn`
  - Confirm this ARN corresponds to a valid ACM certificate in the DR region
- Validate IAM Role ARNs:
  - For each manifest using IAM roles (e.g., `external-secret.yaml`, `aws-load-balancer-controller.yaml`), verify the role ARNs point to the roles created for the DR environment
- Update Endpoint References:
  - In application manifests, ensure any hardcoded endpoints (e.g., for RDS, ElastiCache, MSK) are updated to use the DR region resources
- Check Resource Requests and Limits:
  - Review and adjust CPU and memory requests/limits if your DR environment has different capacity constraints
- Verify ConfigMaps and Secrets:
  - Ensure any environment-specific ConfigMaps or Secrets are updated for the DR context
- Update Ingress Hostnames:
  - If using different domain names in DR, update the `host` fields in Ingress resources
- Adjust HPA (Horizontal Pod Autoscaler) Settings:
  - Review and modify HPA configurations if your DR scaling strategy differs
- Validate PersistentVolumeClaim Configurations:
  - Ensure storage class names and other storage-related configurations are valid for the DR region
After making these changes:
- Commit the updated manifests to your version control system.
- Use a diff tool to double-check all changes between production and DR manifests.
- Consider using Helm's template rendering (`helm template`) to verify the final Kubernetes resources that will be applied.
- Plan a dry-run deployment in the DR environment to catch any configuration issues before an actual DR scenario.
Remember, maintaining parity between production and DR environments is crucial. Establish a process to review and update these manifests whenever changes are made to the production environment.
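Part of that review can be automated. As a rough sketch, the function below scans a manifest directory for AWS region names other than the DR region, which helps catch values copied over from production. The function name `find_foreign_regions` is illustrative, and the pattern covers only the common region prefixes (it does not match, for example, GovCloud region names).

```bash
# Prints file:line:region for every AWS region name under directory $1
# that differs from the DR region $2. Exits non-zero when none are found.
find_foreign_regions() {
  local dir="$1" dr_region="$2"
  grep -rnoE '(us|eu|ap|sa|ca|me|af)-(north|south|east|west|central|northeast|southeast|northwest|southwest)-[0-9]' "$dir" |
    grep -v ":${dr_region}\$"
}

# Usage:
#   find_foreign_regions helmcharts/drs-prod-vi us-east-1
```

Any output is a candidate for review; some hits may be intentional, such as a reference to the production region as a replication source.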
- Log in to your Cloudflare account.
- Navigate to the DNS settings for your domain.
- Update CNAME records to point to your new DR environment ingresses.
- Check the ingresses for all apps to find the addresses the CNAME records should point to:
kubectl get ing -A
- Remove the CNAME from the existing production CloudFront distribution.
- Create a new CloudFront distribution for your DR environment.
- Update the S3 bucket for your front-end build.
- Adjust IAM policies to allow access to the DR S3 buckets.
- Log into the AWS Management Console
- Navigate to CloudFront:
  - From the AWS services menu, select "CloudFront" under the "Networking & Content Delivery" section
- Locate Your Distribution:
  - In the CloudFront dashboard, you'll see a list of your distributions
  - Find the distribution you want to modify
- Edit the Distribution:
  - Click on the ID of the distribution you want to edit
  - This will take you to the distribution's detail page
  - Click the "Edit" button at the top of the page
- Navigate to the CNAME Section:
  - Scroll down to the "Alternate Domain Names (CNAMEs)" section
- Modify CNAME Records:
  - In the text box, you'll see your current CNAME records
  - To add a new CNAME:
    - Type the new domain name on a new line
  - To remove a CNAME:
    - Delete the line containing the CNAME you want to remove
  - To modify a CNAME:
    - Edit the existing domain name as needed
- SSL Certificate:
  - If you're adding new CNAMEs, ensure that your SSL certificate covers these new domain names
  - You may need to request a new certificate in AWS Certificate Manager if your current one doesn't include the new domains
- Review Changes:
  - Scroll through the other settings to ensure no unintended changes were made
- Save Changes:
  - At the bottom of the page, click "Save Changes"
- Wait for Deployment:
  - CloudFront will now deploy your changes across its network
  - This can take up to 15 minutes to complete
- Verify Changes:
  - After deployment is complete, test your new or modified CNAMEs to ensure they're working correctly
- Update DNS Records:
  - If you've added new CNAMEs, don't forget to update your DNS records with your domain registrar
  - Add or modify CNAME records to point to your CloudFront distribution's domain name (e.g., d1234abcd.cloudfront.net)
- For DR Considerations:
  - If this is part of your DR setup, ensure that your DR plan is updated to reflect these changes
  - Document the process and any new CNAMEs in your DR documentation
Important Notes:
- Removing a CNAME from CloudFront doesn't automatically remove it from your DNS settings. Be sure to update your DNS records accordingly.
- Changes to CNAMEs can affect your site's availability. It's best to perform these changes during a maintenance window.
- Always test thoroughly after making changes to ensure your site is accessible via all intended domain names.
- If you're using this process as part of switching to a DR environment, ensure that you have a plan to quickly update DNS records to point to your DR CloudFront distribution if needed.
- Log In to Cloudflare:
  - Go to the Cloudflare login page and log in with your credentials.
- Select Your Domain:
  - On the Cloudflare dashboard, select the domain (viplatform.net or virtualinternships.com) you want to configure DNS records for.
- Go to DNS Settings:
  - Click on the “DNS” tab in the top navigation bar. This will take you to the DNS management page for your domain.
- Add a DNS Record:
  - Click on the “Add record” button.
- Choose the Record Type:
  - From the dropdown menu, select the type of DNS record you want to add. Common types include:
    - A: Maps a domain to an IPv4 address.
    - AAAA: Maps a domain to an IPv6 address.
    - CNAME: Maps a domain to another domain (canonical name).
    - MX: Defines mail exchange servers for your domain.
    - TXT: Allows you to add arbitrary text to a domain.
    - SRV: Specifies services available at specific ports.
- Enter Record Details:
  - Name: Enter the subdomain or domain name for this record (e.g., `www`, `mail`, or `@` for the root domain).
  - Value: Enter the value for the record, such as an IP address or hostname.
  - TTL (Time To Live): Choose how long the DNS record should be cached by DNS resolvers. You can usually leave this as "Auto".
  - Proxy Status: Choose whether to proxy traffic through Cloudflare. “Proxied” enables Cloudflare’s services like CDN and security, while “DNS only” does not use Cloudflare’s proxy.
- Save the Record:
  - Click the “Save” or “Add Record” button to save your changes.
- Verify the Record:
  - Once added, the new record will appear in the list of DNS records. Ensure it is correctly configured.
- Check Propagation:
  - DNS changes can take some time to propagate. Use tools like DNS Checker to verify that the changes are reflected globally.
- TTL Settings: Lower TTL values (e.g., 300 seconds) can be useful during testing, but increase them for production.
- Cloudflare Features: Explore additional Cloudflare features such as SSL/TLS settings, Firewall rules (WAF), and Performance optimizations that can enhance your domain's security and performance.
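Propagation can also be spot-checked from a terminal. The sketch below assumes `dig` is installed; the function name `cname_target` and the default resolver `1.1.1.1` are illustrative choices. Records set to “Proxied” in Cloudflare are typically answered with Cloudflare addresses rather than the underlying CNAME, so this check is mainly useful for “DNS only” records.

```bash
# Prints the CNAME target for host $1 as seen by resolver $2 (default 1.1.1.1),
# with the trailing dot removed.
cname_target() {
  local host="$1" resolver="${2:-1.1.1.1}"
  dig +short CNAME "$host" "@$resolver" | sed 's/\.$//'
}

# Usage (replace the placeholder with one of your own hostnames):
#   cname_target <your_hostname>
#   cname_target <your_hostname> 8.8.8.8
```

Querying more than one public resolver gives a rough view of whether the change has spread.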
- Regularly test your DR environment:
  - Simulate failover scenarios.
  - Verify data integrity.
  - Ensure all applications are functioning correctly.
- Keep your DR environment up-to-date:
  - Sync any changes from production to DR.
  - Regularly update and patch systems.
- Document and review your DR plan:
  - Keep detailed documentation of the DR setup and processes.
  - Regularly review and update the DR plan as business needs change.
- Train your team:
  - Ensure all relevant team members understand the DR process.
  - Conduct regular drills to familiarize the team with failover procedures.
By following this detailed guide, you will have set up a comprehensive Disaster Recovery environment that mirrors your production setup. Remember, the key to an effective DR strategy is regular testing, updating, and continuous improvement to ensure your business can quickly recover from any unforeseen events.
Happy disaster-proofing! 🚀