How Vaults, Plans, Jobs, Copy Jobs, and Restore Jobs are orchestrated — with diagrams and real-world examples.
AWS Backup can feel intimidating at first. Here's the mental model that makes everything click:
A Vault is your secure safe-deposit box. Your backups (called recovery points) sit inside it. You can have multiple vaults — one for prod, one for DR (disaster recovery), one for compliance.
A Backup Plan is like a recurring calendar event: "every night at 2 AM, back up everything tagged Backup=true and keep it for 30 days." You write the plan once; AWS does the work automatically.
Each time a backup runs, it creates a Recovery Point — a frozen copy of your resource at that exact moment. You can restore from any of these points later. Think of them like iPhone backups: you can roll back to yesterday's or last week's.
You'll see ARNs (Amazon Resource Names) everywhere in AWS. They're just unique identifiers for any resource, like arn:aws:rds:us-east-1:123456789:db:my-database. When the docs say "Recovery Point ARN", they mean the unique ID of a specific backup snapshot.
AWS has data centers worldwide (us-east-1 = N. Virginia, eu-west-1 = Ireland, etc.). "Cross-region copy" means sending a backup to a different geography so that if an entire region fails, you still have your data elsewhere.
AWS Backup needs permission to access your databases, EC2 instances, etc. An IAM Role grants those permissions. AWS provides a default one called AWSBackupDefaultServiceRole that works for most cases — just use that to start.
Putting it together: you tag your resources with Backup=true. You create a Backup Plan that says "back these up nightly". AWS Backup runs automatically, stores snapshots in a Vault, optionally copies them to another region for safety, and you can restore any snapshot to a brand-new resource whenever needed.
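The first two steps of that loop take just two CLI calls. A minimal sketch, assuming a placeholder instance ID (i-0abc123) and an illustrative vault name:

```shell
# Create a vault to hold recovery points (name is illustrative)
aws backup create-backup-vault \
  --backup-vault-name "prod-primary-vault"

# Tag a resource so a plan's tag-based selection will pick it up
# (i-0abc123 is a placeholder instance ID)
aws ec2 create-tags \
  --resources "i-0abc123" \
  --tags Key=Backup,Value=true
```

From here, a Backup Plan with a tag-based selection does the rest on schedule.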
AWS Backup is a fully managed service that centralizes and automates data protection across AWS services. Before diving into flows, understand each primitive.
An encrypted container that stores recovery points (backups). Each vault has an access policy and an optional vault lock (WORM). You can have multiple vaults per region/account.
A policy document that defines when to back up (schedule), how long to retain (lifecycle), and where to send copies (copy rules). Assigned to resources via tags or ARNs.
The actual execution: takes a snapshot or continuous backup of a resource and stores a recovery point in the target vault. Triggered by a plan rule or manually.
Copies an existing recovery point from one vault to another — either in the same region, a different region, or even a different AWS account. Useful for DR and compliance.
Recreates a resource from a recovery point stored in a vault. You specify the target configuration; AWS Backup handles provisioning the new resource.
The diagram below shows the top-level relationships between AWS Backup components and the protected resources.
A Backup Plan contains one or more rules, and is associated with resources via selections. Here's a real-world example plan JSON:
```json
// Example: Production Backup Plan
{
  "BackupPlanName": "prod-daily-backup-plan",
  "Rules": [
    {
      "RuleName": "DailyToUsEast1",
      "TargetBackupVaultName": "prod-primary-vault",
      "ScheduleExpression": "cron(0 2 * * ? *)",   // 2 AM UTC daily
      "StartWindowMinutes": 60,
      "CompletionWindowMinutes": 180,
      "Lifecycle": {
        "MoveToColdStorageAfterDays": 30,
        "DeleteAfterDays": 365
      },
      "CopyActions": [   // triggers a Copy Job after backup
        {
          "DestinationBackupVaultArn": "arn:aws:backup:eu-west-1:DR_ACCOUNT_ID:backup-vault:dr-vault",
          "Lifecycle": { "DeleteAfterDays": 90 }
        }
      ]
    },
    {
      "RuleName": "WeeklyToUsEast1",
      "TargetBackupVaultName": "prod-primary-vault",
      "ScheduleExpression": "cron(0 3 ? * SUN *)",  // Sunday 3 AM
      "Lifecycle": { "DeleteAfterDays": 1825 }      // 5 years
    }
  ],
  "Selections": [   // what gets backed up by this plan
    {
      "SelectionName": "all-tagged-resources",
      "IamRoleArn": "arn:aws:iam::ACCOUNT:role/service-role/AWSBackupDefaultServiceRole",
      "ListOfTags": [
        { "ConditionType": "STRINGEQUALS", "ConditionKey": "Backup", "ConditionValue": "true" }
      ]
    }
  ]
}
```
| Field | Purpose | Example |
|---|---|---|
| ScheduleExpression | Cron expression for when jobs fire | cron(0 2 * * ? *) = 2 AM UTC daily |
| StartWindowMinutes | Window during which job must start, or it becomes EXPIRED (min 60 min, default 8 hrs) | 60 = job must start within 1 hour |
| CompletionWindowMinutes | Time from scheduled start by which job must complete, or it is cancelled (default 7 days) | 180 = 3 hours max runtime |
| MoveToColdStorageAfterDays | Auto-transition to cold storage (cheaper) | 30 = after 30 days → cold tier |
| DeleteAfterDays | Auto-delete the recovery point | 365 = deleted after 1 year |
| CopyActions | Cross-region/account copy after backup completes | Copy to EU DR vault |
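Registering a plan like the example takes two separate calls, because selections are attached after the plan exists. A sketch, with illustrative file names (the selection JSON must be split out of the combined example document):

```shell
# Create the plan from a JSON file containing BackupPlanName + Rules
aws backup create-backup-plan \
  --backup-plan file://prod-daily-backup-plan.json

# Attach a selection (which resources get backed up) to the
# BackupPlanId returned by the previous call
aws backup create-backup-selection \
  --backup-plan-id "PLAN_ID_FROM_PREVIOUS_CALL" \
  --backup-selection file://all-tagged-resources.json
```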
When a plan rule fires, AWS Backup creates a Job for each matching resource. Each job goes through these states:
EventBridge rule triggers at the cron time. AWS Backup evaluates which resources match the plan's selection criteria (tags or ARNs).
One Backup Job per resource is created. The job enters PENDING while AWS Backup coordinates with the service (e.g. creates an EBS snapshot or RDS snapshot).
The actual backup data is written to the vault. For EBS this is a snapshot. For EFS/DynamoDB it uses AWS Backup's native transfer. Progress is trackable via the DescribeBackupJob API.
On COMPLETED, a recovery point ARN is generated and stored in the vault. Metadata (creation time, resource type, encryption) is attached.
If the rule has CopyActions, a Copy Job is automatically spawned to replicate the recovery point to the target vault.
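These job states can be watched from the CLI. A small sketch, assuming a job ID from your own account:

```shell
# List failed backup jobs (any resource type)
aws backup list-backup-jobs \
  --by-state FAILED \
  --query 'BackupJobs[*].{Id:BackupJobId,Resource:ResourceArn,State:State}' \
  --output table

# Drill into a single job for percent done and any error message
aws backup describe-backup-job \
  --backup-job-id "YOUR_BACKUP_JOB_ID"
```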
Copy Jobs replicate recovery points between vaults. They are the backbone of multi-region DR strategies and compliance isolation.
| Scenario | Requirement | Notes |
|---|---|---|
| Same-region copy | Source & dest vault in same region | Good for compliance vault isolation |
| Cross-region copy | IAM role with cross-region permissions | Both regions must be enabled in account |
| Cross-account copy | Dest vault must add source account to access policy | Use AWS Organizations for easier setup |
| Cross-account + cross-region | Both org policies and vault access policies updated | Recommended for DR isolation |
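Besides automatic CopyActions, a copy can also be started on demand. A sketch with placeholder ARNs and vault names:

```shell
# Copy an existing recovery point into the DR vault on demand
aws backup start-copy-job \
  --recovery-point-arn "RECOVERY_POINT_ARN" \
  --source-backup-vault-name "prod-primary-vault" \
  --destination-backup-vault-arn "arn:aws:backup:eu-west-1:DR_ACCOUNT_ID:backup-vault:dr-vault" \
  --iam-role-arn "arn:aws:iam::SOURCE_ACCOUNT_ID:role/service-role/AWSBackupDefaultServiceRole" \
  --lifecycle DeleteAfterDays=90
```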
```json
// Destination vault access policy (allows Account A to copy in)
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::SOURCE_ACCOUNT_ID:root" },
    "Action": [ "backup:CopyIntoBackupVault" ],
    "Resource": "*"
  }]
}
```
A Restore Job recreates an AWS resource from a recovery point stored in a vault. Think of it like pulling a saved game — AWS rebuilds the resource from that snapshot. For most services, it creates a brand-new resource and never touches the original. (Exceptions: S3 can restore to an existing bucket; EFS supports item-level restore into an existing file system.)
| Service | What's restored | New resource? | Cutover needed? |
|---|---|---|---|
| RDS / Aurora | New DB instance / cluster from snapshot | New endpoint + ARN | Yes — update connection string |
| EC2 (EBS) | New EC2 instance from AMI created from snapshot | New Instance ID + Volume IDs | Update target groups / DNS |
| EFS | New EFS file system | New FS ID + DNS | Yes — remount or update mount target |
| DynamoDB | New table (point-in-time) | New table name | Yes — update app table reference |
| S3 | Objects to same or different bucket | Same or new bucket | Only if new bucket name |
Prerequisites for the walkthrough below: the AWS CLI installed and configured (aws configure), and your IAM user must have backup:* and rds:* permissions.
```shell
# ─────────────────────────────────────────────────────────────────────
# STEP 1 — Find available recovery points (backups) in your vault.
#          This lists all RDS backups. Look at "Created" to find the
#          one from the date/time you want to restore from.
# ─────────────────────────────────────────────────────────────────────
aws backup list-recovery-points-by-backup-vault \
  --backup-vault-name "prod-primary-vault" \
  --by-resource-type "RDS" \
  --query 'RecoveryPoints[*].{ARN:RecoveryPointArn,Created:CreationDate,Status:Status}' \
  --output table

# Example output:
# +------------------------------+------------+------------+
# | ARN                          | Created    | Status     |
# +------------------------------+------------+------------+
# | arn:aws:rds:...:awsbackup-.. | 2024-01-17 | COMPLETED  |
# | arn:aws:rds:...:awsbackup-.. | 2024-01-16 | COMPLETED  |
# +------------------------------+------------+------------+
# ↑ Copy the ARN of the backup you want to restore from

# ─────────────────────────────────────────────────────────────────────
# STEP 2 — Ask AWS what parameters are needed to restore this backup.
#          AWS Backup returns a JSON object with all the config of the
#          original DB (instance class, engine, subnet group, etc.).
#          You'll use this in Step 3 — just change the DB name.
# ─────────────────────────────────────────────────────────────────────
aws backup get-recovery-point-restore-metadata \
  --backup-vault-name "prod-primary-vault" \
  --recovery-point-arn "arn:aws:rds:us-east-1:123456789:snapshot:awsbackup-2024-01-17-02-30"

# Returns something like:
# {
#   "DBInstanceIdentifier": "prod-mysql",       ← original DB name
#   "DBInstanceClass": "db.t3.medium",          ← instance size
#   "Engine": "mysql",                          ← database engine
#   "MultiAZ": "false",                         ← high-availability setting
#   "DBSubnetGroupName": "prod-subnet-group",
#   "VpcSecurityGroupIds": "sg-0abc123"
# }
# ↑ Copy this output. You'll paste it into Step 3, changing only
#   DBInstanceIdentifier to a new unique name.

# ─────────────────────────────────────────────────────────────────────
# STEP 3 — Start the Restore Job.
#          IMPORTANT: Change "DBInstanceIdentifier" to a NEW name.
#          If you use the same name as the original, it will FAIL
#          because a DB with that name already exists.
# ─────────────────────────────────────────────────────────────────────
aws backup start-restore-job \
  --recovery-point-arn "arn:aws:rds:us-east-1:123456789:snapshot:awsbackup-2024-01-17-02-30" \
  --iam-role-arn "arn:aws:iam::123456789:role/service-role/AWSBackupDefaultServiceRole" \
  --resource-type "RDS" \
  --metadata '{
    "DBInstanceIdentifier": "prod-mysql-restored-20240117",
    "DBInstanceClass": "db.t3.medium",
    "Engine": "mysql",
    "MultiAZ": "false",
    "DBSubnetGroupName": "prod-subnet-group",
    "VpcSecurityGroupIds": "sg-0abc123"
  }'

# Returns: { "RestoreJobId": "ABCDEF123456" }
# ↑ Save this ID — you need it to check progress in Step 4

# ─────────────────────────────────────────────────────────────────────
# STEP 4 — Monitor restore progress (takes 15-30 min for most DBs).
#          Run this every few minutes. Status goes:
#          PENDING → RUNNING → COMPLETED (or FAILED)
# ─────────────────────────────────────────────────────────────────────
aws backup describe-restore-job \
  --restore-job-id "ABCDEF123456"

# When COMPLETED, you'll see:
# { "Status": "COMPLETED",
#   "CreatedResourceArn": "arn:aws:rds:...:db:prod-mysql-restored-20240117" }

# ─────────────────────────────────────────────────────────────────────
# STEP 5 — Get the hostname of the new database.
#          This is the address your app needs to connect to.
# ─────────────────────────────────────────────────────────────────────
aws rds describe-db-instances \
  --db-instance-identifier "prod-mysql-restored-20240117" \
  --query 'DBInstances[0].Endpoint.Address'

# Output: "prod-mysql-restored-20240117.abc123.us-east-1.rds.amazonaws.com"
# ↑ This is your new database hostname

# ─────────────────────────────────────────────────────────────────────
# STEP 6 — Update Secrets Manager so your app picks up the new host
#          (if you store DB credentials in Secrets Manager — recommended).
#          After this, restart your app containers/servers to reconnect.
# ─────────────────────────────────────────────────────────────────────
aws secretsmanager update-secret \
  --secret-id "prod/db/connection" \
  --secret-string '{"host":"prod-mysql-restored-20240117.abc123.us-east-1.rds.amazonaws.com","port":3306,"username":"admin","password":"your-password"}'
```
Here is how all components interact in a complete backup + DR + restore scenario, from schedule fire to successful restore.
A production web app with EC2, RDS, and EFS — backed up daily with cross-region DR copies. Here's the full setup and what happens each night:
EventBridge triggers the DailyBackup rule. AWS Backup evaluates all resources tagged Backup=true — finds EC2, RDS, EFS. Creates 3 Backup Jobs.
Each job begins. RDS creates a native snapshot; EBS snapshot taken for EC2 volumes; EFS backup streamed to vault. These run in parallel.
3 recovery points now stored in prod-primary-vault in us-east-1.
Retention lifecycle: warm for 30 days, then transitioned to cold storage, and auto-deleted after 365 days.
Because CopyActions is defined, 3 Copy Jobs are automatically created. They stream the recovery points to dr-vault in eu-west-1 under the DR account.
All 3 recovery points are now replicated to EU. DR account has 90-day retention. Primary backups remain independent in us-east-1.
Ops team decides to restore RDS from the 02:30 recovery point in the DR vault in eu-west-1.
Restore Job provisions a new RDS instance in eu-west-1 from the copied recovery point. Parameters: new DB identifier, same instance class, target VPC.
New RDS instance available at new endpoint. Ops validates data integrity, then updates application config / Route 53 to point to new DB. RPO: ~7 hours. RTO: ~20 minutes.
AWS Backup supports a wide range of services. Not all features are available for every service — check the matrix below.
| Category | Service | Continuous / PITR | Cold Storage | Cross-Region Copy | Cross-Account Copy |
|---|---|---|---|---|---|
| Compute | Amazon EC2 (incl. VSS-enabled Windows) | No | Yes | Yes | Yes |
| Block Storage | Amazon EBS | No | Yes | Yes | Yes |
| File Storage | Amazon EFS | No | Yes | Yes | Yes |
| File Storage | Amazon FSx (ONTAP: no cross-region/account) | No | No | Yes | Yes |
| Object Storage | Amazon S3 | Yes | No | Yes | Yes |
| Relational DB | Amazon RDS (all engines) | Yes | No | Yes | Yes |
| Relational DB | Amazon Aurora | Yes | No | Yes | Yes |
| NoSQL DB | Amazon DynamoDB (advanced features required) | Yes | Yes | Yes | Yes |
| Document DB | Amazon DocumentDB | No | No | Yes | Yes |
| Graph DB | Amazon Neptune | No | No | Yes | Yes |
| Data Warehouse | Amazon Redshift | No | No | No | No |
| Time Series | Amazon Timestream | No | Yes | Yes | Yes |
| Containers | Amazon EKS | No | No | Yes | Yes |
| Hybrid | AWS Storage Gateway (Volume) | No | No | Yes | Yes |
| Hybrid | VMware VMs (via Backup Gateway) | No | Yes | Yes | Yes |
| SAP | SAP HANA on EC2 | Yes | Yes | Yes | Yes |
| IaC | AWS CloudFormation (stacks) | No | No | Yes | Yes |
AWS Backup provides multiple layers of protection for your recovery points — from encryption and access policies to immutable WORM locks and legal holds.
Every vault is tied to an AWS KMS key. Recovery points are encrypted at rest and in transit. You can use the AWS-managed key or your own customer-managed key (CMK) for full control.
Resource-based policies on each vault control who can create, copy, or delete recovery points. Use these to restrict cross-account access or prevent unauthorized deletion.
All AWS Backup API calls are logged to CloudTrail. Every backup, copy, restore, and deletion is recorded with who did it, when, and from where — critical for compliance audits.
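For example, recent AWS Backup API activity can be pulled from CloudTrail's 90-day event history with a query like this:

```shell
# List recent AWS Backup API calls recorded by CloudTrail
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventSource,AttributeValue=backup.amazonaws.com \
  --max-results 20 \
  --query 'Events[*].{Time:EventTime,Who:Username,What:EventName}' \
  --output table
```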
Vault Lock enforces a Write-Once, Read-Many (WORM) model on a vault. Once locked, recovery points cannot be deleted before their retention period expires — not even by the root account.
| Feature | Governance Mode | Compliance Mode |
|---|---|---|
| Removable? | Yes, by users with sufficient IAM permissions | No — immutable after minimum 72-hour cooling-off period |
| Delete recovery points early? | No (unless lock is removed first) | No — never, by anyone |
| Change retention? | Only to extend (not shorten) | Only to extend (not shorten) |
| Regulatory compliance | Partial — good for internal policies | SEC 17a-4, CFTC, FINRA compliant |
| Use case | Test before committing to compliance | Production compliance vaults |
A Legal Hold preserves specific recovery points from deletion regardless of their lifecycle or retention policies. Unlike Vault Lock (which protects all backups in a vault), Legal Hold is applied to individual recovery points. Use it when regulatory or legal proceedings require you to preserve specific backups indefinitely until the hold is released.
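A hold can be created from the CLI. A sketch, assuming the selection JSON shape shown below (verify with aws backup create-legal-hold help before use):

```shell
# Place a legal hold on recovery points created in a date range
# (selection shape is an assumption — confirm against the CLI help)
aws backup create-legal-hold \
  --title "litigation-2024-001" \
  --description "Preserve January backups pending legal review" \
  --recovery-point-selection '{
    "VaultNames": ["prod-primary-vault"],
    "DateRange": { "FromDate": "2024-01-01T00:00:00Z", "ToDate": "2024-01-31T23:59:59Z" }
  }'
```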
Air-gapped vaults provide enhanced ransomware resilience. They are isolated from source accounts, automatically locked in compliance mode, and require multi-party approval for critical recovery operations. Recovery points inside cannot be manually deleted. These vaults can be shared across accounts via AWS Resource Access Manager (RAM).
```shell
# ── Enable Vault Lock in Governance mode ──────────────────────────────
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name "prod-compliance-vault" \
  --min-retention-days 30 \
  --max-retention-days 365

# ── Switch to Compliance mode (IRREVERSIBLE after 72h) ────────────────
aws backup put-backup-vault-lock-configuration \
  --backup-vault-name "prod-compliance-vault" \
  --min-retention-days 30 \
  --max-retention-days 365 \
  --changeable-for-days 3

# After 3 days, this lock becomes permanent and cannot be removed.
```
AWS Backup can automatically transition recovery points between storage tiers to optimize costs. Older backups that you rarely access can move to cold storage at a fraction of the price.
| Tier | Cost | Retrieval | Minimum Duration |
|---|---|---|---|
| Warm Storage | Standard pricing per GB/month | Immediate — restore anytime | None |
| Cold Storage | Up to ~80% cheaper than warm | Slower retrieval, higher restore cost | 90 days minimum (charged even if deleted earlier) |
Note: if DeleteAfterDays is set to less than 90 days after the cold transition, you still pay for the full 90 days. Plan your lifecycle rules accordingly.
Not all services support cold storage transitions. Currently supported: EC2 (EBS), EFS, DynamoDB, Timestream, SAP HANA, and VMware VMs. Services like RDS, Aurora, S3, and Storage Gateway do not support cold storage — their backups remain in warm storage for the entire retention period.
```json
// Example lifecycle in a backup rule
"Lifecycle": {
  "MoveToColdStorageAfterDays": 30,   // warm for 30 days
  "DeleteAfterDays": 365              // deleted after 1 year
}
// Cold storage duration: 365 - 30 = 335 days in cold
// Total retention: 365 days (30 warm + 335 cold)
```
Backups are only useful if they actually succeed. AWS Backup integrates with EventBridge, CloudWatch, and SNS so you always know when something goes wrong.
Real-time event stream for all backup state changes — job started, completed, failed, copy jobs, restore jobs. Route events to Lambda, SNS, SQS, or any EventBridge target.
AWS Backup emits metrics every 5 minutes. Set CloudWatch Alarms on backup job failure counts, restore durations, or recovery point creation rates to get alerted proactively.
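A minimal alarm sketch on the failure-count metric in the AWS/Backup namespace (the SNS topic ARN is a placeholder):

```shell
# Alarm when any backup job fails within a 24-hour window
aws cloudwatch put-metric-alarm \
  --alarm-name "backup-jobs-failed" \
  --namespace "AWS/Backup" \
  --metric-name "NumberOfBackupJobsFailed" \
  --statistic Sum \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:123456789:backup-failure-alerts"
```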
Configure per-vault SNS notifications for backup job completed, failed, or expired events. Delivers alerts directly to email, Slack (via Lambda), PagerDuty, or any SNS subscriber.
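Wiring one vault's events to an SNS topic looks roughly like this (topic ARN is a placeholder; event names follow the PutBackupVaultNotifications API):

```shell
# Send this vault's job events to an SNS topic
aws backup put-backup-vault-notifications \
  --backup-vault-name "prod-primary-vault" \
  --sns-topic-arn "arn:aws:sns:us-east-1:123456789:backup-failure-alerts" \
  --backup-vault-events BACKUP_JOB_COMPLETED BACKUP_JOB_FAILED COPY_JOB_FAILED
```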
The most common monitoring setup: an EventBridge rule that triggers an SNS notification whenever a backup job fails.
# ── Create SNS topic for backup alerts ──────────────────────────────── aws sns create-topic --name "backup-failure-alerts" aws sns subscribe \ --topic-arn "arn:aws:sns:us-east-1:123456789:backup-failure-alerts" \ --protocol "email" \ --notification-endpoint "ops-team@company.com" # ── Create EventBridge rule to catch backup job failures ────────────── aws events put-rule \ --name "backup-job-failed" \ --event-pattern '{ "source": ["aws.backup"], "detail-type": ["Backup Job State Change"], "detail": { "state": ["FAILED", "ABORTED", "EXPIRED"] } }' # ── Connect the rule to the SNS topic ───────────────────────────────── aws events put-targets \ --rule "backup-job-failed" \ --targets '[{ "Id": "sns-target", "Arn": "arn:aws:sns:us-east-1:123456789:backup-failure-alerts" }]'
For compliance-heavy environments, Backup Audit Manager continuously monitors your backup activity against a set of controls. You define a framework (e.g., "all resources must have a backup plan", "recovery points must be encrypted", "backups must run at least daily") and AWS generates daily compliance reports to S3. Integrates with AWS Audit Manager for SOC, HIPAA, and PCI audits.
Backups you never test are backups you can't trust. AWS Backup Restore Testing lets you create automated restore testing plans that periodically restore recovery points, run a validation Lambda function (e.g., connectivity check, data integrity query), and clean up afterward. This ensures your backups are actually restorable and tracks restore duration for SLA reporting.
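A hedged sketch of creating such a plan (the JSON shape is an assumption; verify with aws backup create-restore-testing-plan help before use):

```shell
# Weekly restore test: pick the latest recovery point and restore it
# (plan JSON shape assumed — confirm against the CLI help)
aws backup create-restore-testing-plan \
  --restore-testing-plan '{
    "RestoreTestingPlanName": "weekly-rds-restore-test",
    "ScheduleExpression": "cron(0 4 ? * MON *)",
    "RecoveryPointSelection": {
      "Algorithm": "LATEST_WITHIN_WINDOW",
      "IncludeVaults": ["*"],
      "RecoveryPointTypes": ["SNAPSHOT"]
    }
  }'
```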