3-2-1 Backup Strategy: Implementation Guide for DevOps Teams

The 3-2-1 backup rule is deceptively simple: keep three copies of your data, on two different media types, with one copy offsite. It was coined in the era of tape drives and offsite vaults, but the principle is more relevant than ever — just applied differently in a world of Kubernetes clusters, managed databases, and infrastructure as code.

I learned to respect backups the hard way. A staging database got accidentally promoted to production during a migration script gone wrong. We had "backups" — which turned out to be daily snapshots from three weeks ago because nobody had verified the backup job was actually running. Three weeks of customer data, gone. That's the kind of experience that makes you religious about backup verification.

Why DevOps Teams Get Backups Wrong

Traditional IT treats backups as a scheduled task: back up the server nightly, test restore quarterly, file the compliance report. DevOps teams face a different reality. Infrastructure is ephemeral. Databases are managed services. Application state lives across dozens of microservices. The "server" you'd back up doesn't exist in the traditional sense.

The result? Many DevOps teams have excellent infrastructure automation and terrible data protection. They can rebuild an entire Kubernetes cluster from scratch in 30 minutes but can't recover last Tuesday's database state because nobody configured point-in-time recovery on the RDS instance.

Here's a reality check: if your disaster recovery plan starts with "we'll just redeploy from Git," you're only protecting one category of data. What about database state? User uploads? Configuration that lives in environment variables? Secrets in your vault? The classic 3-2-1 framework needs adaptation for modern infrastructure, not abandonment.

Applying 3-2-1 to Modern Infrastructure

Copy 1: Production Data (Live)

Your production system is the first copy. This seems obvious, but in cloud environments, you need to understand what "production data" actually encompasses:

Databases: RDS/Cloud SQL instances, DynamoDB tables, Redis caches (if persistent), MongoDB Atlas clusters. Each has different backup mechanisms and RPO (Recovery Point Objective) capabilities. RDS supports continuous backup with point-in-time recovery to within 5 minutes. DynamoDB's point-in-time recovery covers the last 35 days. Know your RPO per data store.

Object Storage: S3 buckets, GCS buckets, Azure Blob containers. Enable versioning — it's cheap insurance against accidental deletion or corruption. Cross-region replication for critical buckets adds geographic redundancy.
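Versioning is a one-line change per bucket. A minimal sketch with the AWS CLI, where the bucket name is a placeholder:

```shell
# Hypothetical bucket name; enable versioning as cheap insurance
# against accidental deletes and overwrites
BUCKET="example-user-uploads"

aws s3api put-bucket-versioning \
  --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled

# Confirm the change actually took effect
aws s3api get-bucket-versioning --bucket "$BUCKET"
```

Pair this with a lifecycle rule that expires old versions after a retention window, or the version history will grow without bound.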

Infrastructure State: Terraform state files, Kubernetes etcd backups, Vault data. These are the blueprints for rebuilding everything. Lose them and you're reverse-engineering your own infrastructure.

Secrets and Configuration: HashiCorp Vault snapshots, AWS Secrets Manager (automatically backed up by AWS), environment variables in your deployment platform. Rotating secrets without backing up the old ones first is a common source of outages.
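For Vault running on integrated (Raft) storage, a snapshot is a single command. A sketch, assuming Raft storage; the address, paths, and bucket are placeholders:

```shell
# Assumes Vault with integrated (Raft) storage and an authenticated token;
# address and bucket are illustrative
export VAULT_ADDR="https://vault.internal.example:8200"

SNAP="vault-$(date -u +%Y%m%dT%H%M%SZ).snap"

# Write a point-in-time snapshot of the entire Vault state
vault operator raft snapshot save "/var/backups/vault/${SNAP}"

# Push it to object storage so it survives the host
aws s3 cp "/var/backups/vault/${SNAP}" "s3://example-vault-backups/${SNAP}"
```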

Copy 2: Automated Snapshots (Same Region, Different Storage)

Automated snapshots provide near-instant recovery for the most common failure scenarios: accidental deletion, corruption, failed deployments. Key considerations:

Database snapshots: Configure automated daily snapshots with appropriate retention (30 days minimum for production). AWS RDS, Cloud SQL, and Azure SQL all support this natively. For self-managed databases, use pg_dump (PostgreSQL), mysqldump (MySQL), or mongodump (MongoDB) on a cron schedule, pushing to object storage.
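For the self-managed case, the cron job can be as small as this sketch; the database name, bucket, and schedule are placeholders:

```shell
#!/usr/bin/env bash
# Nightly logical backup for a self-managed PostgreSQL instance.
# Database name and bucket are illustrative placeholders.

backup_name() {
  # One timestamped archive per run, e.g. appdb-20240101T0200Z.dump
  printf 'appdb-%s.dump' "$(date -u +%Y%m%dT%H%MZ)"
}

run_backup() {
  local name tmp
  name=$(backup_name)
  tmp="/tmp/${name}"
  # Custom format is compressed and supports selective restore
  pg_dump --format=custom --file="$tmp" appdb
  aws s3 cp "$tmp" "s3://example-db-backups/daily/${name}"
  rm -f "$tmp"
}

run_backup
```

Wire it up with a crontab entry like `0 2 * * * /usr/local/bin/pg-backup.sh`, and add an S3 lifecycle rule to enforce the retention window.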

Volume snapshots: EBS snapshots, Persistent Disk snapshots. Useful for stateful workloads that aren't in managed databases. Schedule via AWS Backup, GCP backup plans, or Velero for Kubernetes persistent volumes.
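With Velero installed in the cluster, a daily volume-snapshot schedule is one command. The schedule name, namespace, and retention below are examples:

```shell
# Daily snapshot of resources and persistent volumes in the
# production namespace; name, namespace, and TTL are illustrative
CRON="0 2 * * *"   # 02:00 UTC daily

velero schedule create daily-volumes \
  --schedule="$CRON" \
  --include-namespaces production \
  --snapshot-volumes \
  --ttl 720h
```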

Terraform state: If using Terraform Cloud or an S3 backend with versioning, you already have this. If using local state (please don't), at least push it to version-controlled object storage after every apply.

Copy 3: Cross-Region or Cross-Provider (Offsite)

The "offsite" copy protects against regional outages, cloud provider incidents, and — increasingly relevant — ransomware that targets cloud resources within a single account.

Cross-region replication: S3 cross-region replication, Cloud SQL cross-region replicas, Azure Geo-redundant storage. The simplest offsite option, staying within your cloud provider. Protects against regional outages but not against account-level compromise.

Cross-account backup: AWS Backup can copy backups to a separate AWS account. This is critical for ransomware protection — if an attacker compromises your production account, they can delete your backups unless those backups live in a separate account with independent credentials.

Cross-provider backup: The nuclear option. Back up critical data to a different cloud provider entirely. Tools like Restic, Duplicity, or cloud-native solutions make this feasible. Most teams only do this for their most critical datasets.
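A restic-based cross-provider sketch might look like the following; the repository URL, password file, and source path are placeholders for whatever second provider you choose:

```shell
# Repository lives at a different provider (here an S3-compatible
# endpoint outside your primary cloud — URL is illustrative)
export RESTIC_REPOSITORY="s3:https://s3.us-west-002.backblazeb2.com/example-offsite"
export RESTIC_PASSWORD_FILE="/etc/restic/password"

restic init                               # once, to create the repository
restic backup /srv/uploads --tag daily    # encrypted, deduplicated upload
restic forget --keep-daily 30 --keep-weekly 8 --prune   # enforce retention
```

Because restic encrypts client-side, the second provider never sees plaintext, which also simplifies the compliance story.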

Implementation Checklist

| Data Category | Copy 1 (Live) | Copy 2 (Snapshots) | Copy 3 (Offsite) | RPO Target |
|---|---|---|---|---|
| Primary Database | RDS Multi-AZ | Automated daily snapshots | Cross-region read replica | 5 minutes |
| Object Storage | S3 (versioned) | Same-region lifecycle | Cross-region replication | 15 minutes |
| Terraform State | S3 + DynamoDB lock | S3 versioning | Cross-account copy | Per-apply |
| Kubernetes Config | etcd (managed) | Velero daily | S3 cross-region | 24 hours |
| Secrets Vault | Vault HA cluster | Vault snapshots (4hr) | Cross-region snapshot | 4 hours |
| User Uploads | S3 | S3 versioning | Cross-region replication | Near-realtime |

Automating Backup Verification

Backups that aren't tested are wishes, not backups. Automate verification with these approaches:

Restore testing: Weekly automated job that restores the latest database backup to a temporary instance, runs a health check query, and destroys the instance. If the restore fails, alert immediately. This can run in your CI/CD pipeline as a scheduled workflow.
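A sketch of such a job against RDS, with placeholder instance identifiers; a real pipeline would feed the health-check result into your alerting and handle credentials via `.pgpass` or IAM auth:

```shell
#!/usr/bin/env bash
# Weekly restore drill: spin up a throwaway instance from the latest
# snapshot, run a health check, then tear it down. IDs are placeholders.
SOURCE_DB="prod-appdb"
TEST_DB="restore-test-$(date -u +%Y%m%d)"

# Most recent snapshot of the source instance
SNAPSHOT=$(aws rds describe-db-snapshots \
  --db-instance-identifier "$SOURCE_DB" \
  --query 'sort_by(DBSnapshots,&SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
  --output text)

aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_DB" \
  --db-snapshot-identifier "$SNAPSHOT"
aws rds wait db-instance-available --db-instance-identifier "$TEST_DB"

ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "$TEST_DB" \
  --query 'DBInstances[0].Endpoint.Address' --output text)

# Health check: fail loudly if the restored data is unreadable
psql "host=${ENDPOINT} user=app dbname=appdb" -c "SELECT count(*) FROM users;" \
  || echo "RESTORE CHECK FAILED: ${TEST_DB}" >&2

aws rds delete-db-instance \
  --db-instance-identifier "$TEST_DB" \
  --skip-final-snapshot
```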

Checksum validation: For file-based backups, compute and compare checksums between source and backup. Corrupted backups are worse than no backups because they give false confidence.
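A minimal checksum comparison for file-tree backups; the paths are illustrative:

```shell
# Compare checksums between a source tree and its backup copy.
# A non-empty diff (and non-zero exit) means the copies have drifted.
verify_backup() {
  local src=$1 dst=$2
  ( cd "$src" && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/src.sum
  ( cd "$dst" && find . -type f -exec sha256sum {} + | sort -k2 ) > /tmp/dst.sum
  diff -u /tmp/src.sum /tmp/dst.sum
}

# Example invocation with placeholder paths
verify_backup /srv/uploads /mnt/backup/uploads
```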

Disaster recovery drills: Quarterly exercise where you attempt to rebuild production from backups alone. Document every step, every missing piece, every assumption that was wrong. The first drill is always humbling. By the third, you'll have a genuinely reliable process.

Tools for DevOps Backup Automation

Velero: The standard for Kubernetes backup. Backs up cluster resources and persistent volumes, supports scheduled backups, and can restore to a different cluster. Essential for any production Kubernetes deployment.

AWS Backup: Centralized backup management across AWS services. Supports cross-account and cross-region copies, compliance policies, and lifecycle management. The single-pane-of-glass approach is genuinely useful for organizations with dozens of AWS resources to protect.

Restic: Open-source, encrypted, deduplicated backup tool. Supports S3, GCS, Azure Blob, and many other backends. Excellent for custom backup workflows and cross-provider backups. The deduplication alone can cut storage costs by 50-70%.

pgBackRest: Purpose-built for PostgreSQL. Supports incremental backups, parallel backup/restore, and point-in-time recovery. If PostgreSQL is your primary database and you're self-managing it, pgBackRest is non-negotiable.
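A typical pgBackRest cycle, assuming a stanza already configured in `/etc/pgbackrest.conf`; the stanza name "main" is illustrative:

```shell
STANZA="main"   # illustrative stanza name

pgbackrest --stanza="$STANZA" stanza-create        # once, to initialize the repo
pgbackrest --stanza="$STANZA" --type=full backup   # weekly full backup
pgbackrest --stanza="$STANZA" --type=incr backup   # cheap daily increments
pgbackrest --stanza="$STANZA" info                 # verify backups exist
```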

FAQ

Is the 3-2-1 rule still relevant for cloud-native teams?

The principle is absolutely relevant — the implementation just looks different. "Three copies" means live data plus snapshots plus offsite. "Two media types" translates to different storage classes or providers. "One offsite" means cross-region or cross-account. The underlying wisdom — don't put all your eggs in one failure domain — never goes out of style.

How do I back up a Kubernetes cluster?

Use Velero for cluster resources and persistent volumes. For managed Kubernetes (EKS, GKE, AKS), the provider runs and backs up the control plane, but you're still responsible for your workload definitions (what lives in etcd) and for persistent data. Don't forget ConfigMaps and Secrets, which often contain configuration not captured in your Git repo.

What RPO should I target?

It depends on the cost of data loss. For a SaaS product: 5 minutes or less for the primary database, 1 hour for supporting services. For internal tools: 24 hours is usually acceptable. For compliance-regulated data: whatever the regulation specifies, which is often 1 hour or less.

Should I use my cloud provider's backup tools or third-party solutions?

Use native tools for convenience within a single provider (AWS Backup, GCP backup plans). Use third-party tools (Restic, Veeam, Commvault) when you need cross-provider backups, more granular control, or compliance reporting that native tools don't provide. Most teams end up using both.