AWS DMS for Live Database Replication into Your Lakehouse
Learn how AWS DMS replicates operational databases into lakehouses in real time. A step-by-step guide for engineers building analytics infrastructure.
Understanding AWS DMS and the Lakehouse Architecture
AWS Database Migration Service (DMS) is a managed service that replicates data from operational databases into cloud storage—typically Amazon S3—where it forms the foundation of a data lakehouse. For engineering teams building analytics infrastructure, DMS solves a critical problem: how to move live transactional data from production systems into analytics without slowing down your database or building brittle custom ETL pipelines.
The lakehouse architecture itself is the convergence of data lakes and data warehouses. A data lake stores raw, unstructured data at scale and low cost. A data warehouse provides structured schemas, strong consistency, and query optimization. A lakehouse combines both: you get the cost efficiency and flexibility of a lake with the query performance and governance of a warehouse. AWS DMS feeds the raw layer of this architecture, capturing every change from your operational database and landing it in S3 in near-real-time.
Why does this matter for analytics teams? Traditional approaches—full database exports, weekly batch jobs, or custom Kafka producers—create latency, operational burden, and risk. DMS handles the heavy lifting: connection pooling, schema inference, error recovery, and continuous replication. For mid-market companies and scale-ups adopting managed platforms like D23 for embedded analytics and self-serve BI, having a reliable, real-time data pipeline into your lakehouse is non-negotiable.
How AWS DMS Works: The Replication Engine
DMS operates in two phases: full load and change data capture (CDC). Understanding both is essential for production deployments.
Full Load Phase
During full load, DMS connects to your source database (RDS, Aurora, on-premises MySQL, PostgreSQL, SQL Server, Oracle, etc.), scans the tables you’ve specified, and writes the data to S3 in batches. This phase runs once and creates your initial dataset. DMS parallelizes this operation across multiple threads, so even large tables (millions or billions of rows) complete in reasonable time.
Key parameters you control:
- Table mappings: Which tables and columns to replicate
- Batch size: How many rows to write per S3 object
- Parallel load settings: How many threads to use (higher = faster, but more database load)
- LOB handling: How to deal with large objects (BLOBs, CLOBs) — inline, limited, or separate files
During full load, your source database experiences increased read load. For production systems, you’ll want to run this during low-traffic windows or use read replicas to avoid impacting user-facing queries.
Change Data Capture (CDC) Phase
After full load completes, DMS switches to CDC mode. It continuously monitors your source database’s transaction logs (or equivalent) and replicates every INSERT, UPDATE, and DELETE to S3. This is where “live” replication happens.
How CDC works depends on your source engine:
- Aurora/RDS MySQL: DMS reads the binary log (binlog)
- PostgreSQL: DMS uses logical replication slots
- SQL Server: DMS reads the transaction log
- Oracle: DMS uses LogMiner or native log reader
Each source has different retention requirements. For example, MySQL’s binlog might rotate every 24 hours, so if DMS falls behind past the retention window, those changes are lost for good. That’s why monitoring replication lag is critical.
CDC writes changes to S3 in Parquet or CSV format, typically partitioned by table and timestamp. This creates a continuous stream of change events that downstream processes (Apache Spark jobs, Databricks Delta Live Tables, or AWS Glue) consume to keep your lakehouse in sync.
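Conceptually, each CDC row carries an operation marker alongside the row data (DMS's S3 target prepends an "Op" column with I, U, or D; exact columns depend on your task settings, e.g. TimestampColumnName). A minimal Python sketch of replaying such events into current table state, with illustrative field names:

```python
# Sketch: folding DMS-style CDC events into a current-state table.
# Field names ("Op", "ts", "id", "row") are illustrative, not the
# exact columns DMS emits for your task configuration.

def apply_cdc(events):
    """Apply insert/update/delete events in commit order; return {pk: row}."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["Op"] in ("I", "U"):
            state[e["id"]] = e["row"]
        elif e["Op"] == "D":
            state.pop(e["id"], None)
    return state

events = [
    {"Op": "I", "ts": 1, "id": 7, "row": {"status": "trial"}},
    {"Op": "U", "ts": 2, "id": 7, "row": {"status": "active"}},
    {"Op": "I", "ts": 3, "id": 9, "row": {"status": "trial"}},
    {"Op": "D", "ts": 4, "id": 9, "row": None},
]
print(apply_cdc(events))  # {7: {'status': 'active'}}
```

This is exactly what your downstream Spark or Glue job does at scale: order the events, then materialize the latest state per primary key.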
Setting Up DMS: Source, Target, and Task Configuration
A DMS replication task requires three components: a source endpoint, a target endpoint, and a task definition.
Source Endpoint Configuration
Your source is your operational database. DMS needs:
- Hostname, port, username, password
- Network connectivity (security groups, VPN, or public internet)
- Appropriate database permissions (SELECT on source tables, access to logs)
For production databases, create a dedicated read-only user with minimal privileges. This isolates DMS from your main application credentials and limits blast radius if credentials leak.
Example for MySQL:
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT ON mydb.* TO 'dms_user'@'%';
GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'dms_user'@'%';
The REPLICATION CLIENT and REPLICATION SLAVE privileges (which are global, hence *.*) allow DMS to read binlogs for CDC. SELECT can be scoped to just the schemas you replicate; mydb here is a placeholder.
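Beyond permissions, MySQL CDC also requires row-based binary logging (binlog_format = ROW, binlog_row_image = FULL) and sufficient binlog retention. A hedged helper that validates the output of SHOW VARIABLES before you start a task (the variable names are real MySQL/RDS settings; the thresholds are illustrative):

```python
def check_cdc_prereqs(variables, min_retention_hours=24):
    """Return a list of problems given a dict of MySQL server variables."""
    problems = []
    if variables.get("binlog_format") != "ROW":
        problems.append("binlog_format must be ROW for DMS CDC")
    if variables.get("binlog_row_image", "FULL") != "FULL":
        problems.append("binlog_row_image should be FULL")
    # RDS-specific retention setting, set via mysql.rds_set_configuration
    retention = variables.get("binlog retention hours")
    if retention is None or int(retention) < min_retention_hours:
        problems.append(f"set binlog retention to >= {min_retention_hours}h")
    return problems

print(check_cdc_prereqs({"binlog_format": "STATEMENT"}))
```

Running this check as a pre-flight step catches the most common reason CDC silently fails to start.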
Target Endpoint Configuration
Your target is Amazon S3. DMS needs:
- S3 bucket name and prefix (e.g., s3://my-data-lake/dms-replication/)
- IAM role with permissions to write to that bucket
- Optional: KMS key for encryption at rest
Create an S3 bucket with versioning disabled (unnecessary for analytics) and a lifecycle policy to transition old objects to Glacier if you need long-term retention.
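The lifecycle policy mentioned above can be expressed as the structure that boto3's s3.put_bucket_lifecycle_configuration accepts. A minimal sketch; the prefix and day thresholds are illustrative:

```python
import json

def lifecycle_rules(prefix="dms-replication/", glacier_after_days=30,
                    expire_after_days=365):
    """Build an S3 lifecycle configuration that transitions old DMS output
    to Glacier, then expires it. Pass as LifecycleConfiguration to
    s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...)."""
    return {
        "Rules": [{
            "ID": "tier-dms-output",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": glacier_after_days,
                             "StorageClass": "GLACIER"}],
            "Expiration": {"Days": expire_after_days},
        }]
    }

print(json.dumps(lifecycle_rules(), indent=2))
```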
IAM policy example:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-data-lake",
"arn:aws:s3:::my-data-lake/*"
]
}
]
}
Task Definition
A DMS task specifies:
- Which tables to replicate (via table mappings)
- Full load settings (parallelism, batch size)
- CDC settings (enable/disable, start position)
- Data transformation rules (optional)
- Target metadata (table names, column mappings)
Table mappings use JSON schema. A basic example:
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-all-tables",
"object-locator": {
"schema-name": "public",
"table-name": "%"
},
"rule-action": "include"
}
]
}
For large deployments, you might exclude certain tables (logs, temporary tables) or apply transformation rules (rename columns, filter rows, change data types).
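Exclusions and transformations are additional rules in the same mappings document. A sketch that builds the JSON programmatically, with one exclusion and one column rename (schema, table, and column names are placeholders; verify the rule schema against the DMS table-mapping documentation for your engine version):

```python
import json

def build_mappings():
    """Selection rules plus one transformation rule; names are illustrative."""
    return {"rules": [
        {"rule-type": "selection", "rule-id": "1", "rule-name": "include-all",
         "object-locator": {"schema-name": "public", "table-name": "%"},
         "rule-action": "include"},
        {"rule-type": "selection", "rule-id": "2", "rule-name": "skip-audit-log",
         "object-locator": {"schema-name": "public", "table-name": "audit_log"},
         "rule-action": "exclude"},
        {"rule-type": "transformation", "rule-id": "3", "rule-name": "rename-col",
         "rule-target": "column", "rule-action": "rename", "value": "account_id",
         "object-locator": {"schema-name": "public", "table-name": "accounts",
                            "column-name": "id"}},
    ]}

print(json.dumps(build_mappings(), indent=2))
```

Generating mappings in code rather than hand-editing JSON pays off once you manage more than a handful of tables.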
Real-World Architecture: From Database to Lakehouse Analytics
Let’s walk through a concrete example. Imagine you’re a SaaS company with an Aurora MySQL database holding customer accounts, subscriptions, usage metrics, and transactions. You want to replicate this into a lakehouse for analytics without impacting production.
Architecture Flow:
- Source: Aurora MySQL (prod database, 500 GB, 10,000 writes/sec)
- DMS Replication: Full load to S3 (12 hours), then CDC continuous
- S3 Raw Layer: Parquet files, partitioned by table and date
- AWS Glue / Spark Jobs: Transform raw data, apply business logic, deduplicate CDC changes
- S3 Curated Layer: Delta Lake tables (via Delta Lake or Iceberg), optimized for analytics
- Query Engine: Athena, Redshift Spectrum, or Databricks SQL
- BI Platform: D23 dashboards and embedded analytics connected to the curated layer
As outlined in the AWS blog on creating source-to-lakehouse replication pipes using Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift, this serverless approach scales automatically, and you pay only for the data you scan and process.
DMS writes changes to S3 in the CDC folder:
s3://my-data-lake/dms-replication/
├── accounts/
│ ├── 20240101/
│ │ ├── FULL_LOAD_COMPLETE
│ │ ├── data-00001.parquet
│ │ └── data-00002.parquet
│ ├── 20240102/
│ │ ├── cdc-00001.parquet (CDC changes)
│ │ └── cdc-00002.parquet
├── subscriptions/
├── transactions/
Your Glue or Spark job processes these files, handles deduplication (CDC can produce duplicates if DMS restarts), applies business logic (e.g., calculate customer lifetime value), and writes to Delta Lake:
s3://my-data-lake/curated/
├── customers/ (Delta Lake table)
├── subscriptions/ (Delta Lake table)
├── transactions/ (Delta Lake table)
Once in Delta Lake, your data is queryable via Athena, Redshift, or Databricks SQL. From there, D23 can connect directly to query your lakehouse data and power dashboards, embedded analytics, or self-serve BI for your teams.
Change Data Capture: The Engine of Live Analytics
CDC is where DMS earns its keep for analytics. Without CDC, you’d need to re-export your entire database every night—wasteful and slow. CDC captures only changes, so your lakehouse stays in sync with your operational database in near-real-time (typically 1–10 seconds latency, depending on transaction volume).
CDC Challenges and Solutions
Duplicates: When DMS restarts or network hiccups occur, it may re-send the same change. Your downstream processing must be idempotent (applying the same change twice produces the same result). Use a deduplication key (source table + primary key + timestamp) in your Spark job.
Out-of-order changes: If you have high transaction volume, changes might arrive out of order. Use timestamps or sequence numbers to reorder before applying to your curated layer.
Large transactions: If your source database runs a bulk update (e.g., UPDATE accounts SET status='active' WHERE created_at < '2024-01-01'), DMS may write millions of change records. This can overwhelm your CDC pipeline. Consider running bulk operations during maintenance windows or filtering them at the DMS task level.
Log retention: Your source database’s transaction log has finite retention. If DMS falls behind (e.g., due to S3 throttling or network issues), it might lose changes. Monitor replication lag and set up CloudWatch alarms to alert if lag exceeds your SLA (e.g., 5 minutes).
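The first three concerns above reduce to one pattern: key each event by (table, primary key), order by a monotonically increasing marker, and keep only the latest event per key, so replaying the same file twice is harmless. A pure-Python sketch with illustrative field names:

```python
def latest_per_key(events):
    """Deduplicate CDC events: keep only the newest event per (table, pk),
    so re-sent batches are idempotent; return them in timestamp order."""
    latest = {}
    for e in events:
        key = (e["table"], e["pk"])
        if key not in latest or e["ts"] > latest[key]["ts"]:
            latest[key] = e
    return sorted(latest.values(), key=lambda e: e["ts"])

batch = [
    {"table": "accounts", "pk": 1, "ts": 5, "Op": "U"},
    {"table": "accounts", "pk": 1, "ts": 5, "Op": "U"},  # duplicate resend
    {"table": "accounts", "pk": 1, "ts": 9, "Op": "U"},  # later change wins
    {"table": "orders",   "pk": 3, "ts": 2, "Op": "I"},
]
print(latest_per_key(batch))  # 2 events: orders@2, then accounts@9
```

The Spark version later in this article expresses the same idea with a ROW_NUMBER() window.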
As detailed in the Databricks blog on using Streaming Delta Live Tables and AWS DMS for Change Data Capture from MySQL, integrating DMS CDC with Delta Live Tables creates a reliable, auto-scaling pipeline that handles these challenges gracefully.
Hands-On: Replicating an RDS Database to S3
Let’s build a concrete example. We’ll replicate an RDS MySQL database to S3 and then query it.
Step 1: Create DMS Replication Instance
The replication instance is the compute engine that runs your DMS tasks. Create one in the AWS Console:
- Open AWS DMS
- Click “Create replication instance”
- Configure:
- Instance class: dms.c5.xlarge (a good balance of CPU, memory, and network)
- Allocated storage: 100 GB (adjust based on task complexity)
- VPC: Same VPC as your RDS instance (or use VPC peering)
- Multi-AZ: Yes (for production)
- Publicly accessible: No (access via VPC)
- Click “Create”
Wait 5–10 minutes for the instance to provision.
Step 2: Create Source Endpoint (RDS MySQL)
- Click “Endpoints” → “Create endpoint”
- Configure:
- Endpoint type: Source
- Engine: MySQL
- Server name: Your RDS endpoint (e.g., mydb.c9akciq32.us-east-1.rds.amazonaws.com)
- Port: 3306
- Username: dms_user (create this user in RDS first)
- Password: (strong password)
- Click “Test connection” (should succeed if networking is correct)
- Click “Create endpoint”
Step 3: Create Target Endpoint (S3)
- Click “Endpoints” → “Create endpoint”
- Configure:
- Endpoint type: Target
- Engine: Amazon S3
- Service access role ARN: (IAM role with S3 write permissions)
- Bucket name: my-data-lake
- Bucket folder: dms-replication/
- Click “Create endpoint”
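The same endpoint can be created with boto3 instead of the console. A hedged sketch: the identifier, bucket, and role ARN are placeholders, and the actual API call is commented out since it needs AWS credentials:

```python
# Sketch: the S3 target endpoint via the DMS API rather than the console.
# EndpointIdentifier, BucketName, and the role ARN are placeholders.

def s3_endpoint_params():
    return {
        "EndpointIdentifier": "s3-target",
        "EndpointType": "target",
        "EngineName": "s3",
        "S3Settings": {
            "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
            "BucketName": "my-data-lake",
            "BucketFolder": "dms-replication",
            "DataFormat": "parquet",   # "csv" is the other option
            "CompressionType": "GZIP",
        },
    }

# import boto3
# dms = boto3.client("dms")
# dms.create_endpoint(**s3_endpoint_params())
print(s3_endpoint_params()["S3Settings"]["DataFormat"])
```

Scripting endpoints keeps your pipeline reproducible and reviewable, which matters once you manage multiple environments.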
Step 4: Create Replication Task
- Click “Database migration tasks” → “Create task”
- Configure:
- Task identifier: mysql-to-s3-full-load-cdc
- Replication instance: (select the one you created)
- Source endpoint: (RDS MySQL)
- Target endpoint: (S3)
- Migration type: “Migrate existing data and replicate ongoing changes” (this enables CDC)
- Configure table mappings:
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-all",
      "object-locator": {
        "schema-name": "%",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}
- Configure task settings (JSON):
{
  "TargetMetadata": {
    "ParallelLoadThreads": 4,
    "BatchApplyEnabled": true
  },
  "FullLoadSettings": {
    "TargetSchema": "",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedChangesNotApplied": false,
    "StopTaskCachedChangesApplied": false,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 50000
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  },
  "ValidationSettings": {
    "EnableValidation": false
  }
}
- Click “Create task”
The task will start immediately. Monitor progress in the task details page. Full load typically takes hours for large tables; CDC begins automatically once full load completes.
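Progress can also be polled via the API with describe_replication_tasks. A hedged sketch that parses the response shape into a compact summary; the sample response below is fabricated for illustration:

```python
def summarize_task(response):
    """Pull status and progress out of a describe_replication_tasks-style
    response (boto3 dms client)."""
    task = response["ReplicationTasks"][0]
    stats = task.get("ReplicationTaskStats", {})
    return {
        "status": task["Status"],
        "full_load_pct": stats.get("FullLoadProgressPercent", 0),
        "tables_errored": stats.get("TablesErrored", 0),
    }

# In practice: response = boto3.client("dms").describe_replication_tasks(
#     Filters=[{"Name": "replication-task-id", "Values": ["mysql-to-s3"]}])
sample = {"ReplicationTasks": [{
    "Status": "running",
    "ReplicationTaskStats": {"FullLoadProgressPercent": 42, "TablesErrored": 0},
}]}
print(summarize_task(sample))
```

A scheduled check like this is a lightweight complement to the CloudWatch alarms covered in the next section.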
Monitoring and Troubleshooting DMS Replication
Once your task is running, you need visibility into its health and performance.
Key Metrics to Monitor
- Replication lag (CDC): Time between a change on the source and its appearance in S3. Aim for < 1 minute for most use cases. If lag grows, investigate bottlenecks (database log retention, S3 throttling, network bandwidth).
- Full load progress: Percentage of tables completed. If stuck on one table, the table might be too large or the DMS instance under-resourced.
- Network transmit/receive: High utilization suggests you need a larger replication instance or should parallelize across multiple tasks.
- Source database CPU/connections: DMS is reading from your prod database. Monitor impact and adjust parallelism if needed.
- S3 PutObject errors: If DMS can’t write to S3, check IAM permissions and bucket policies.
CloudWatch Alarms
Set up alarms for:
- Replication lag > 5 minutes
- Task status = “failed”
- Source database connections > 80% of max
- S3 errors in CloudTrail
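The lag alarm above maps directly to put_metric_alarm parameters: DMS publishes CDC latency metrics (in seconds) under the AWS/DMS namespace. A sketch with placeholder identifiers; verify dimension names against the DMS CloudWatch metrics documentation:

```python
def lag_alarm_params(instance_id="dms-prod", task_id="mysql-to-s3",
                     threshold_s=300):
    """Alarm parameters: fire when CDC target latency stays above
    threshold_s for five consecutive minutes. Pass to
    cloudwatch.put_metric_alarm(**lag_alarm_params())."""
    return {
        "AlarmName": f"{task_id}-cdc-lag",
        "Namespace": "AWS/DMS",
        "MetricName": "CDCLatencyTarget",  # seconds behind the source
        "Dimensions": [
            {"Name": "ReplicationInstanceIdentifier", "Value": instance_id},
            {"Name": "ReplicationTaskIdentifier", "Value": task_id},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_s,
        "ComparisonOperator": "GreaterThanThreshold",
    }

print(lag_alarm_params()["AlarmName"])
```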
Common Issues and Fixes
“Cannot connect to source database”: Check security group rules. DMS replication instance needs inbound access to RDS on port 3306 (or your database port). Verify the security group rule:
Source: DMS replication instance security group
Port: 3306
Protocol: TCP
“Replication lag is growing”: DMS can’t keep up with changes. Options:
- Increase replication instance size
- Reduce parallelism (fewer threads = less source database load, but slower)
- Check S3 throttling (S3 supports roughly 3,500 PutObject requests/sec per prefix; if you exceed it, spread writes across more prefixes, e.g. per-table and per-date partitions)
- Increase source database log retention
“CDC stopped after full load”: Binary log might have rotated and DMS lost its position. Enable binlog retention on RDS:
CALL mysql.rds_set_configuration('binlog retention hours', 24);
“Duplicate rows in S3”: DMS restarts can re-send changes. Your downstream processing must deduplicate. Use Spark SQL with ROW_NUMBER() partitioned by primary key and ordered by timestamp.
Integrating with Your Analytics Stack
Once data lands in S3, you need to process it and make it queryable. This is where your lakehouse truly comes alive.
AWS Glue and Spark for Transformation
Write a Glue job (Python or Scala) that:
- Reads CDC files from S3
- Deduplicates based on primary key + timestamp
- Applies transformations (business logic, aggregations)
- Writes to Delta Lake or Iceberg
Example PySpark job:
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("DMS-CDC-Process").getOrCreate()
# Read CDC changes
cdc_df = spark.read.parquet("s3://my-data-lake/dms-replication/accounts/*/cdc-*.parquet")
# Deduplicate: keep latest version of each row
window = Window.partitionBy("pk").orderBy(col("timestamp").desc())
deduped = cdc_df.withColumn("rn", row_number().over(window)).filter(col("rn") == 1).drop("rn")
# Write to Delta Lake. A full overwrite is shown for simplicity; in
# production you would typically MERGE these changes into the existing
# table so rows absent from this CDC batch are preserved.
deduped.write.mode("overwrite").option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/accounts")
Schedule this job to run every 5 minutes (or your desired refresh cadence) using Glue triggers.
Querying with Athena or Redshift Spectrum
Once in Delta Lake, register the table in the AWS Glue Data Catalog and query it with Athena (Athena addresses Delta tables by catalog name rather than by S3 path; curated below is an illustrative Glue database name):
SELECT
  subscription_status,
  COUNT(*) AS account_count
FROM curated.accounts
GROUP BY 1;
Or use Redshift Spectrum for faster queries on larger datasets:
-- Assumes an external schema named "spectrum" has already been created
-- and mapped to your Glue catalog; column list abbreviated for illustration
CREATE EXTERNAL TABLE spectrum.accounts_ext (
  account_id BIGINT,
  customer_name VARCHAR(256),
  created_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/accounts';
SELECT * FROM spectrum.accounts_ext WHERE created_at > CURRENT_DATE - 30;
Connecting to D23 for BI and Dashboards
Once your lakehouse is queryable, D23 connects directly to your data to power dashboards, embedded analytics, and self-serve BI. D23 works with Athena, Redshift, Databricks, and other query engines, so you can build charts and dashboards on top of your DMS-replicated data without additional ETL.
For engineering teams embedding analytics into products, this is powerful: your product can query the lakehouse directly via D23’s API-first architecture to display real-time metrics, customer data, or operational KPIs.
Cost Optimization and Best Practices
DMS and S3 can become expensive if not managed carefully. Here are strategies to optimize.
DMS Costs
DMS charges per replication instance hour. A dms.c5.xlarge costs roughly $1.50/hour. A 24/7 instance runs ~$1,000/month. To reduce:
- Right-size your instance: Start with dms.c5.large ($0.75/hour) and upgrade only if needed
- Schedule tasks: If you don’t need real-time CDC, run full loads on a schedule (daily/weekly) and stop the instance between runs
- Use table filtering: Replicate only tables you need for analytics
- Optimize parallelism: More threads = faster but higher CPU cost. Find the sweet spot
S3 Costs
S3 storage costs $0.023/GB/month for standard storage. For 1 TB of replicated data, that’s ~$23/month. But if you’re writing CDC changes continuously, you might accumulate data quickly. To optimize:
- Partition by date: Old partitions can be moved to Glacier ($0.004/GB/month) after 30 days
- Enable S3 Intelligent-Tiering: Automatically moves objects between access tiers
- Compress data: Parquet compression (snappy, gzip) reduces storage 50–80%
- Clean up old CDC files: After processing, delete raw CDC files to keep only curated data
Best Practices
- Test on non-production data first: Run DMS on a read replica or staging database before touching production
- Monitor replication lag continuously: Set CloudWatch alarms to catch issues before they impact analytics
- Version your transformation logic: Store Glue jobs in version control; track which version produced which data
- Document schema mappings: Keep a record of how source tables map to lakehouse tables
- Use IAM roles, not access keys: Secure DMS endpoints with IAM roles instead of hardcoded credentials
- Enable encryption: Use KMS for S3 encryption at rest; use SSL/TLS for DMS connections
- Plan for growth: As your database grows, replication lag may increase. Monitor and scale proactively
As referenced in the AWS documentation on migrating RDS SQL Server databases to S3 data lakes, AWS provides detailed guidance on security, performance, and cost optimization for DMS-based pipelines.
Advanced Patterns: Multi-Source and Heterogeneous Replication
For larger organizations, a single DMS task might not be enough. You might need to replicate from multiple source databases (MySQL, PostgreSQL, Oracle) into a unified lakehouse.
Multi-Source Architecture
Create separate DMS tasks for each source database:
- Task 1: RDS MySQL → S3 dms-replication/mysql/
- Task 2: RDS PostgreSQL → S3 dms-replication/postgres/
- Task 3: On-premises Oracle → S3 dms-replication/oracle/
Your transformation layer (Glue/Spark) reads from all three locations, applies consistent business logic, and writes to a unified curated layer.
Handling Schema Evolution
When your source database schema changes (new columns, renamed tables), DMS propagates those changes to S3. Your transformation job should handle this gracefully:
# Read with mergeSchema so files written before and after the schema change
# resolve to one unified schema (Parquet embeds its own schema; inferSchema
# is a CSV option and has no effect here)
df = spark.read.option("mergeSchema", "true").parquet("s3://my-data-lake/dms-replication/*/")
# Write with mergeSchema so the Delta table absorbs new columns
df.write.mode("append").option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/")
Cross-Database Joins
With data from multiple sources in a unified lakehouse, you can join across databases:
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM delta.`s3://my-data-lake/curated/customers` c
JOIN delta.`s3://my-data-lake/curated/orders` o
ON c.customer_id = o.customer_id
WHERE o.created_at > CURRENT_DATE - 30;
Joins like this are awkward at best with traditional BI tools querying separate databases; the lakehouse makes them seamless.
Comparing DMS to Alternatives
DMS isn’t the only way to replicate data to a lakehouse. Here’s how it compares:
DMS vs. Kafka + Connectors
- DMS: Managed, simpler setup, CDC built-in, no operational overhead
- Kafka: More flexible, enables real-time streaming to multiple targets, higher operational complexity
For most analytics teams, DMS is sufficient. Kafka shines if you need to stream to multiple destinations (analytics, search, ML pipelines) simultaneously.
DMS vs. Custom Python/Pandas Scripts
- DMS: Reliable, handles failures, scales automatically
- Custom scripts: Flexible, but fragile, requires monitoring and maintenance
Custom scripts are fine for one-off migrations; DMS is better for ongoing replication.
DMS vs. Native Cloud Tools (Fivetran, Stitch)
- DMS: AWS-native, lower cost, but requires more setup
- Fivetran/Stitch: Easier setup, pre-built connectors, higher cost ($1,000+/month)
Choose based on your engineering bandwidth and budget.
Real-World Example: E-Commerce Company
Let’s walk through a realistic scenario. TechStore is an e-commerce company with:
- Production database: RDS Aurora MySQL, 200 GB, 50,000 writes/sec
- Analytics needs: Real-time customer dashboards, order analytics, inventory reporting
- Current pain: Weekly batch exports, 2-day lag between order and visibility
Solution with DMS:
- DMS Task: Replicate the entire Aurora database to S3 with CDC
  - Full load: 4 hours
  - CDC lag: 30 seconds
  - Cost: $1.50/hour (dms.c5.xlarge instance)
- Glue Job: Every 5 minutes, process CDC changes
  - Deduplicate orders, customers, inventory
  - Calculate derived metrics (daily revenue, top products)
  - Write to Delta Lake
  - Cost: ~$0.40 per run (5 minutes of Glue compute)
- Query Layer: Athena queries Delta Lake
  - Cost: $5 per TB scanned
  - Queries complete in < 5 seconds
- BI Layer: D23 dashboards connect to Athena
  - Real-time order dashboard
  - Customer lifetime value analysis
  - Inventory alerts
  - Embedded analytics in customer portal
Before DMS:
- Weekly exports: 1 hour manual work
- Data freshness: 2 days
- Cost: $2,000/month (database engineer time)
After DMS:
- Fully automated
- Data freshness: 30 seconds
- Cost: ~$1,200/month (DMS + Glue + Athena + D23)
The result: faster insights, better customer experience, lower total cost of ownership.
Conclusion: Building Modern Analytics Infrastructure
AWS DMS is a foundational tool for modern analytics infrastructure. It solves the hard problem of moving live transactional data into a lakehouse without custom engineering or operational overhead.
For data leaders at scale-ups and mid-market companies, DMS enables:
- Real-time analytics: Data freshness in seconds, not days
- Cost efficiency: Pay only for compute and storage, not for proprietary BI licenses
- Flexibility: Query the lakehouse with any tool (Athena, Redshift, Databricks, D23)
- Scalability: Automatically handles growing data volumes
The lakehouse architecture—raw data in S3, transformation with Glue/Spark, query with Athena, BI with D23 embedded analytics—is becoming the standard for modern companies. DMS is the reliable plumbing that makes it work.
If you’re evaluating managed analytics platforms like D23, ensure your data pipeline is solid first. DMS + S3 + Delta Lake + D23 dashboards create a powerful, cost-effective analytics stack that rivals Looker, Tableau, and Power BI without the licensing overhead or platform lock-in.
For engineering teams embedding analytics into products, this architecture is especially powerful. Your product can query the lakehouse directly, display real-time metrics, and provide self-serve BI capabilities without building custom reporting infrastructure.
Start with a single DMS task on a non-production database. Measure replication lag, transformation time, and query performance. Scale from there. The foundation is simple; the possibilities are limitless.