Amazon S3 + Iceberg + Trino: A Cost-Effective Lakehouse
Build a production-grade lakehouse on S3 with Iceberg and Trino. Learn architecture, cost optimization, and how to query petabyte-scale data efficiently.
Why Organizations Are Moving to S3 + Iceberg + Trino
Data infrastructure at scale has a problem: traditional data warehouses like Redshift or Snowflake charge you per query, per compute hour, and per data scanned. As your analytics workload grows—dashboards multiplying, teams spinning up ad-hoc queries, machine learning pipelines running hourly—your bill doesn’t just grow; it accelerates.
That’s where the S3 + Iceberg + Trino stack comes in. This combination gives you a lakehouse architecture: the flexibility of a data lake (cheap object storage, schema evolution, open formats) combined with the query performance and ACID semantics of a data warehouse. You store data in Amazon S3 at pennies per gigabyte, use Apache Iceberg to manage metadata and transactions, and query everything with Trino—an open-source distributed SQL engine that handles petabyte-scale analytics without vendor lock-in.
For organizations managing Apache Superset or building embedded analytics, this stack is particularly compelling. You get a cost-effective, scalable foundation for dashboards and self-serve BI without the overhead of proprietary platforms. As explained in D23’s approach to managed Apache Superset, production-grade analytics requires both the right infrastructure and the right tools—and this lakehouse pattern delivers both.
This post walks through the architecture, explains why each component matters, and shows you how to build and optimize it.
Understanding the Lakehouse: Lake Meets Warehouse
Before diving into implementation, it helps to understand what a lakehouse actually is and why it matters.
Traditional data lakes store raw data in object storage (like S3) using open formats (Parquet, ORC). The problem: no transactions, no schema enforcement, no way to handle updates or deletes efficiently. You end up with data quality issues, duplicate rows, and slow queries because the engine has to scan everything.
Traditional data warehouses (Redshift, Snowflake, BigQuery) solve those problems with proprietary formats, strict schemas, and ACID guarantees. The trade-off: you’re locked into their pricing model and compute architecture.
A lakehouse splits the difference. You keep data in open formats on cheap object storage, but you add a metadata layer (Iceberg) that enforces schemas, manages transactions, and tracks changes. Then you query with a distributed SQL engine (Trino) that understands both the metadata and the data format.
The result: lake economics with warehouse semantics.
Amazon S3: The Foundation
Amazon S3 is the backbone of this architecture. At its core, S3 is a simple key-value store optimized for durability and availability—not for analytics. But that simplicity is its strength.
Why S3 for analytics data:
- Cost: S3 Standard costs roughly $0.023 per GB per month. A terabyte of data costs about $23 per month. Compare that to Redshift (which charges per node, minimum ~$1/hour) or Snowflake (per credit, typically $2–4 per credit), and the math is obvious.
- Scalability: S3 has no size limits and no performance degradation as your data grows. Store a gigabyte or a petabyte; the API behaves the same.
- Durability: S3 provides 11 nines of durability (99.999999999%) through automatic replication across availability zones. Your data is safer on S3 than on most on-premises systems.
- Ecosystem: Virtually every data tool (Trino, Spark, Athena, Presto, dbt) can read from S3. It’s the de facto standard for data lakes.
The downside: S3 doesn’t understand data semantics. It doesn’t know about schemas, transactions, or data quality. It just stores bytes. That’s where Iceberg comes in.
Apache Iceberg: Metadata Management and ACID Guarantees
Apache Iceberg is an open-table format that layers metadata and transaction semantics on top of object storage. Think of it as a filing system for your S3 data.
What Iceberg does:
- Schema Management: Iceberg tables have explicit schemas (column names, types, nullability). When you evolve your schema—add a column, rename a field—Iceberg tracks the change without rewriting existing data.
- Partitioning and Hidden Partitioning: Traditional partitioning (splitting data into folders by date or region) requires you to know your access patterns upfront. Iceberg supports “hidden partitioning,” where the system manages partitions transparently. Query engines can prune partitions automatically without scanning unnecessary data.
- ACID Transactions: Iceberg guarantees atomicity, consistency, isolation, and durability. Concurrent writes don’t corrupt data. Reads see consistent snapshots. Updates and deletes work correctly, even at scale.
- Metadata Efficiency: Instead of scanning all files to understand table structure, Iceberg maintains a metadata layer (stored as JSON files in S3) that engines can read instantly. This dramatically speeds up query planning.
- Time Travel: Iceberg snapshots let you query data as it existed at any point in the past. Useful for audits, debugging, and incremental processing.
As detailed in the Apache Iceberg Trino Integration Documentation, the Trino connector fully supports Iceberg’s feature set. When you query S3-backed Iceberg tables from open-source Trino (through a Glue, Hive, or Iceberg REST catalog), you get all of these benefits without proprietary tooling.
Iceberg file layout:
Under the hood, an Iceberg table on S3 looks like this:
s3://my-bucket/warehouse/my_table/
├── data/
│ ├── 00000-1-a1b2c3d4.parquet
│ ├── 00001-2-e5f6g7h8.parquet
│ └── ...
├── metadata/
│ ├── 00000.json (version 0 metadata)
│ ├── 00001.json (version 1 metadata)
│ └── ...
└── metadata.json (current version pointer)
When you write to the table, Iceberg:
- Writes data files (Parquet) to the data/ directory.
- Creates a new metadata file in metadata/ that references those data files.
- Updates the metadata.json pointer to point to the latest version.
This design ensures atomicity: if the metadata file write fails, the transaction rolls back. Readers always see a consistent snapshot.
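The pointer swap is what makes commits atomic. A quick Python sketch of the idea (a toy model, not Iceberg’s actual implementation, which delegates this compare-and-swap to the catalog):

```python
import threading

class IcebergCommitSketch:
    """Toy model of Iceberg's atomic commit: writers race to advance
    the current-metadata pointer; losers must retry on fresh metadata."""

    def __init__(self):
        self.current_version = 0          # points at metadata/00000.json
        self._lock = threading.Lock()     # stands in for the catalog's CAS

    def commit(self, expected_version: int, new_version: int) -> bool:
        # Swing the pointer only if nobody else committed in between.
        with self._lock:
            if self.current_version != expected_version:
                return False              # conflict: caller re-reads and retries
            self.current_version = new_version
            return True

table = IcebergCommitSketch()
assert table.commit(expected_version=0, new_version=1) is True
assert table.commit(expected_version=0, new_version=2) is False  # stale writer loses
assert table.current_version == 1
```

A writer that loses the race simply re-reads the latest metadata and retries its commit, which is why readers never observe a half-applied write.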
Trino: Distributed Query Engine
Trino (formerly Presto) is an open-source distributed SQL query engine designed for analytics across heterogeneous data sources. In this stack, Trino is your query layer.
Why Trino for this architecture:
- Distributed Execution: Trino parallelizes queries across multiple worker nodes. A single query can leverage 10, 100, or 1,000 machines, letting you scan terabytes in seconds.
- Connector Architecture: Trino has connectors for S3, Iceberg, Parquet, PostgreSQL, MySQL, and dozens of other sources. You can write a single SQL query that joins data across multiple systems.
- Cost-Efficient Querying: Trino doesn’t require you to load data into a proprietary format. It reads Parquet files directly from S3, applying filters and projections early to minimize data scanned.
- Open Source: No licensing fees, no vendor lock-in. You control the deployment, scaling, and configuration.
- SQL Standard: Trino speaks standard SQL. If you know SQL, you can query your lakehouse.
The Trino Iceberg Connector Documentation provides comprehensive guidance on setup and configuration. The broader AWS ecosystem (Amazon Athena, AWS Glue, and AWS’s lakehouse offerings) has converged on the same Iceberg table format, which is a good sign for the longevity of this stack.
Trino architecture in your lakehouse:
Trino runs as a cluster with one coordinator and multiple workers:
- Coordinator: Receives SQL queries, parses them, optimizes the execution plan, and distributes work to workers.
- Workers: Execute fragments of the query in parallel, reading from S3 and Iceberg metadata.
- Connectors: Each worker has a connector (e.g., Iceberg connector) that knows how to read Iceberg tables from S3.
When you run a query like SELECT * FROM orders WHERE date > '2024-01-01', here’s what happens:
- The coordinator parses the query and checks the Iceberg metadata (stored on S3) to understand the table schema and partitions.
- It identifies which data files are needed (using partition pruning and Iceberg’s metadata).
- It creates a plan that assigns reading those files to workers.
- Workers read Parquet files from S3 in parallel, apply the WHERE clause, and stream results back to the coordinator.
- The coordinator aggregates results and returns them to the client.
Because workers read directly from S3 and prune data early, you only pay for the data you actually scan—not the entire table.
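Steps 1–3 amount to filtering a file list against per-file statistics. A simplified Python sketch, with hypothetical min/max stats standing in for the much richer information Iceberg manifests actually record:

```python
from datetime import date

# Hypothetical per-file stats, as Iceberg manifests record them (simplified).
data_files = [
    {"path": "00000-1.parquet", "min_date": date(2023, 11, 1), "max_date": date(2023, 12, 31)},
    {"path": "00001-2.parquet", "min_date": date(2024, 1, 1),  "max_date": date(2024, 2, 29)},
    {"path": "00002-3.parquet", "min_date": date(2024, 3, 1),  "max_date": date(2024, 4, 30)},
]

def prune(files, cutoff):
    # Keep only files whose value range can contain rows with date > cutoff.
    return [f["path"] for f in files if f["max_date"] > cutoff]

# WHERE date > '2024-01-01' skips the 2023 file entirely.
assert prune(data_files, date(2024, 1, 1)) == ["00001-2.parquet", "00002-3.parquet"]
```

The planner does this against metadata alone, before any worker touches S3, which is why well-partitioned tables keep scan costs proportional to the query rather than the table.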
Building Your First Lakehouse: Architecture Overview
Here’s a practical architecture for a mid-market organization:
Components:
- Data Source: Your operational databases (PostgreSQL, MySQL), data warehouses (Redshift), or event streams (Kafka).
- Ingestion Layer: Tools like Apache Spark, dbt, or AWS Glue jobs that read from sources and write Iceberg tables to S3.
- S3 Bucket: Stores all Iceberg table data and metadata.
- Trino Cluster: Deployed on Kubernetes or EC2. Queries Iceberg tables on S3.
- Analytics Layer: Tools like D23’s managed Apache Superset that connect to Trino and serve dashboards and self-serve BI.
Data flow:
Operational DB → Ingestion Job → Iceberg Table (S3) → Trino → Superset → Dashboards
Your ingestion job (running hourly or daily) reads from your source, applies transformations, and writes to Iceberg. Trino queries the Iceberg table. Superset connects to Trino, letting analysts and business users explore data without touching SQL.
This architecture scales from millions to billions of rows without architectural changes. You add more Trino workers as query volume grows. You add more data to S3; Iceberg partitioning and Trino’s parallelism keep query performance constant.
Cost Optimization: Why This Stack Is Cheaper
Let’s quantify the cost advantage. Assume you have 10 TB of analytics data and run 1,000 queries per day.
Snowflake (on-demand):
- Compute: 1 large warehouse (8 credits/hour) × 24 hours × 30 days = 5,760 credits/month
- Credits cost ~$4 each = $23,040/month
- Storage: 10 TB × $40/TB/month = $400/month
- Total: ~$23,440/month
S3 + Iceberg + Trino (self-managed):
- S3 storage: 10 TB × $0.023/GB/month = $230/month
- Trino cluster: 10 m5.2xlarge workers (8 vCPU, 32 GB RAM each) + 1 coordinator = ~$2,000/month (on-demand)
- Data transfer: ~500 GB scanned per day × 30 days = 15 TB. S3-to-EC2 transfer within the same region is free (cross-region would cost around $0.02/GB), so this adds $0.
- Total: ~$2,230/month
That’s roughly a 90% cost reduction. And if you use reserved or spot instances for the Trino workers, you can cut the total further, to roughly $800–1,000/month.
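If you want to sanity-check the arithmetic, the figures above reduce to a few lines of Python (all prices here are this post’s illustrative estimates, not quotes):

```python
# Reproduces the article's back-of-the-envelope cost figures.
snowflake_compute = 8 * 24 * 30 * 4.00    # large warehouse, always on, ~$4/credit
snowflake_storage = 10 * 40               # 10 TB at $40/TB/month
snowflake_total = snowflake_compute + snowflake_storage

s3_storage = 10 * 1024 * 0.023            # 10 TB at $0.023/GB/month
trino_compute = 2000                      # rough estimate for an 11-node cluster
lakehouse_total = s3_storage + trino_compute

assert snowflake_compute == 23040
assert snowflake_total == 23440
assert round(lakehouse_total) == 2236
assert 1 - lakehouse_total / snowflake_total > 0.90   # ~90% cheaper
```

Swap in your own data volume, scan patterns, and instance pricing; the shape of the comparison (always-on warehouse credits vs. cheap storage plus right-sized compute) is what matters.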
As documented in Why Your S3 Bill Jumped After You Started Doing Data Engineering, the key to cost efficiency is avoiding redundant scans. Iceberg’s metadata layer and Trino’s intelligent query planning ensure you only scan the data you need.
Setting Up Iceberg Tables: Practical Steps
Let’s walk through creating your first Iceberg table.
Prerequisites:
- An S3 bucket (e.g., s3://my-data-lake)
- A running Trino cluster (instructions below)
- Data to ingest (e.g., a CSV file or database table)
Step 1: Create an Iceberg table and load existing Parquet data
Assuming you’ve already written Parquet files to S3 and registered them as a table in a Hive catalog (here called hive.raw.customers):
CREATE TABLE iceberg.default.customers (
    customer_id INT,
    name VARCHAR,
    email VARCHAR,
    created_at TIMESTAMP
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(created_at)']
);

INSERT INTO iceberg.default.customers
SELECT customer_id, name, email, created_at
FROM hive.raw.customers;
This creates an Iceberg table with hidden partitioning by month. Trino derives partition values from the created_at column automatically; you never pre-create partitions. (Trino can’t query loose Parquet files by S3 path, which is why the raw data is staged through a Hive catalog table; note also that Trino’s CREATE TABLE ... AS SELECT accepts column aliases but not typed column definitions.)
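Under the hood, hidden partitioning means Iceberg derives partition values from column values with a transform function. A sketch of the month transform, which the Iceberg spec defines as whole months since the Unix epoch:

```python
from datetime import datetime

def month_transform(ts: datetime) -> int:
    # Iceberg's month transform: whole months since the Unix epoch (1970-01).
    return (ts.year - 1970) * 12 + (ts.month - 1)

rows = [datetime(2024, 1, 15), datetime(2024, 1, 31), datetime(2024, 2, 1)]
partitions = {month_transform(ts) for ts in rows}

# Both January rows land in one partition; the February row in another.
assert month_transform(datetime(2024, 1, 15)) == 648
assert len(partitions) == 2
```

Because the transform is part of table metadata rather than your data, a query filtering on created_at gets partition pruning for free, with no year/month columns to maintain.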
Step 2: Insert new data
INSERT INTO iceberg.default.customers
VALUES (1001, 'Alice', 'alice@example.com', CURRENT_TIMESTAMP);
Iceberg handles the write atomically. If the insert fails partway through, the table remains in a consistent state.
Step 3: Update and delete (with ACID guarantees)
UPDATE iceberg.default.customers
SET email = 'newemail@example.com'
WHERE customer_id = 1001;
DELETE FROM iceberg.default.customers
WHERE created_at < CURRENT_DATE - INTERVAL '2' YEAR;
These operations are transactional. No corruption, no data loss.
Step 4: Time travel (query historical data)
SELECT * FROM iceberg.default.customers
FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC';
This returns the table as it existed on January 1, 2024—useful for debugging or auditing.
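Resolving a FOR TIMESTAMP AS OF query comes down to picking the newest snapshot committed at or before the requested time. A Python sketch against a hypothetical snapshot log:

```python
from datetime import datetime

# Hypothetical snapshot log, as Iceberg table metadata records it (simplified).
snapshots = [
    {"id": 101, "committed_at": datetime(2023, 12, 15)},
    {"id": 102, "committed_at": datetime(2023, 12, 30)},
    {"id": 103, "committed_at": datetime(2024, 1, 10)},
]

def snapshot_as_of(snaps, ts):
    # Latest snapshot committed at or before the requested timestamp.
    eligible = [s for s in snaps if s["committed_at"] <= ts]
    return max(eligible, key=lambda s: s["committed_at"])["id"] if eligible else None

# A query AS OF 2024-01-01 resolves to the December 30 snapshot.
assert snapshot_as_of(snapshots, datetime(2024, 1, 1)) == 102
```

Old snapshots stick around until you expire them, which is also why unbounded snapshot retention slowly inflates metadata size (see the operations section below).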
Deploying Trino: A Minimal Setup
For a production deployment, you’ll want Trino running on Kubernetes or as an auto-scaling group on EC2. Here’s a minimal single-node setup for testing:
Docker-based setup (the Trino image reads catalog configuration from files under /etc/trino/catalog, not from environment variables, so mount a local catalog directory containing your iceberg.properties):
docker run -d \
  --name trino \
  -p 8080:8080 \
  -v "$(pwd)/catalog:/etc/trino/catalog" \
  -e AWS_ACCESS_KEY_ID=<your-key> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret> \
  trinodb/trino:latest
Trino will start on http://localhost:8080. You can then connect via the web UI or via JDBC/ODBC.
Trino configuration (catalog file):
Create a file iceberg.properties in Trino’s etc/catalog/ directory:
connector.name=iceberg
iceberg.catalog.type=glue
fs.native-s3.enabled=true
s3.region=us-east-1
This configures Trino to use AWS Glue as the Iceberg metadata store (Glue tracks which Iceberg tables exist and where they’re located). Property names have changed across Trino releases, so check the connector documentation for the version you deploy. Alternatively, you can use a Hive metastore or Iceberg’s REST catalog.
For a deeper dive on configuration, the Trino Iceberg Connector Documentation covers all options.
Connecting Superset to Your Lakehouse
Once Trino is running, connecting D23’s Apache Superset platform is straightforward:
1. Add Trino as a database in Superset:
   - Connection string: trino://user:password@trino-host:8080/iceberg/default
   - Test the connection.
2. Create datasets from Iceberg tables:
   - Select the Trino database.
   - Choose a table (e.g., customers).
   - Superset introspects the schema automatically.
3. Build dashboards using Superset’s drag-and-drop interface:
   - Create charts (bar charts, line graphs, heatmaps).
   - Combine charts into dashboards.
   - Share with your team.
Because Trino is SQL-based and Superset speaks SQL, the integration is seamless. Analysts can also write custom SQL queries in Superset, which get executed against your lakehouse.
For teams building embedded analytics (embedding dashboards in products or customer-facing apps), Superset’s API and embedding capabilities integrate directly with this stack. As D23 specializes in embedded analytics on Apache Superset, the platform handles authentication, row-level security, and performance optimization out of the box.
Real-World Optimization Patterns
Once your lakehouse is running, these patterns will improve performance and reduce costs:
1. Partition Pruning
Iceberg’s hidden partitioning means Trino automatically skips irrelevant data files. A query like SELECT * FROM orders WHERE date > '2024-01-01' only scans files from January 2024 onward, not the entire table.
2. Projection Pushdown
If you query only three columns from a 50-column table, Trino tells the S3 connector to read only those columns. Parquet’s columnar format makes this efficient.
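A toy illustration of why columnar formats make projection cheap: when each column lives in its own chunk, a projected read never touches the other columns’ bytes.

```python
# Toy columnar "file": each column stored separately, as in Parquet.
columnar_file = {
    "customer_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "email": ["a@x.com", "b@x.com", "c@x.com"],
    "created_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
}

def read_projected(file, columns):
    # Touch only the requested column chunks; the rest are never read.
    return {col: file[col] for col in columns}

result = read_projected(columnar_file, ["customer_id", "email"])
assert set(result) == {"customer_id", "email"}   # 2 of 4 columns scanned
```

In a row-oriented file, the same query would still have to read every byte of every row; with Parquet on S3, Trino can issue ranged GETs for just the needed column chunks.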
3. Caching
Trino can cache Parquet files locally on worker nodes. Frequently accessed tables (e.g., a customer dimension table) can be cached in memory, eliminating S3 reads.
4. Data Compaction
As you insert and update data, Iceberg creates many small files. Periodically rewrite these into larger files using Spark or Trino’s compaction commands:
ALTER TABLE iceberg.default.customers EXECUTE OPTIMIZE;
This reduces metadata overhead and improves query performance.
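Compaction planning is essentially bin-packing small files into target-sized rewrite groups. A greedy sketch (the real OPTIMIZE procedure also considers partitions, delete files, and more):

```python
def plan_compaction(file_sizes_mb, target_mb=512):
    # Greedy bin-packing: group small files into rewrite tasks near target size.
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Eight small files become two rewrite groups instead of eight separate reads.
groups = plan_compaction([64, 64, 64, 64, 128, 128, 128, 128])
assert len(groups) == 2
assert sum(map(sum, groups)) == 768
```

Fewer, larger files mean fewer S3 GET requests and fewer manifest entries for the planner to process, which is where the speedup comes from.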
5. Z-order Clustering
For frequently filtered columns, Z-order clustering co-locates related rows so fewer files match a given filter. Trino’s OPTIMIZE command rewrites files but does not currently support Z-ordering; the usual route is Spark’s Iceberg rewrite procedure:
CALL iceberg.system.rewrite_data_files(
  table => 'default.customers',
  strategy => 'sort',
  sort_order => 'zorder(customer_id)'
);
Queries filtering by customer_id will then touch fewer files.
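Z-ordering works by interleaving the bits of several key columns into a single sort key (a Morton code), so rows close in any keyed dimension end up in the same files. A minimal sketch for two small integer keys:

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    # Z-order (Morton) code: interleave the bits of two keys so that
    # rows close in either dimension get nearby sort positions.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Sorting rows by their Z-code clusters data for filters on either key.
assert interleave_bits(0, 0) == 0
assert interleave_bits(1, 0) == 1
assert interleave_bits(0, 1) == 2
assert interleave_bits(3, 3) == 15
```

Real implementations hash or normalize each column value into a fixed-width integer first; the interleaving step is the same idea.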
As explained in Apache Iceberg Trino: Modern Data Lakehouse Explained, these optimizations are essential for maintaining sub-second query latency at scale.
Handling Common Challenges
Challenge 1: Schema Evolution
Your data schema changes over time. A new column is added, a field is renamed, a type changes. Traditional data warehouses struggle with this.
Iceberg handles it gracefully:
ALTER TABLE iceberg.default.customers ADD COLUMN phone VARCHAR;
ALTER TABLE iceberg.default.customers RENAME COLUMN email TO contact_email;
ALTER TABLE iceberg.default.customers ALTER COLUMN customer_id SET DATA TYPE BIGINT;
Old data files retain their original schema. New files use the updated schema. Trino reconciles the differences transparently. (Type changes are limited to safe widenings, such as INT to BIGINT; Iceberg won’t let you narrow or reinterpret a column.)
Challenge 2: Concurrent Writes
Multiple ingestion jobs writing to the same table simultaneously can corrupt data in traditional systems. Iceberg’s ACID guarantees prevent this:
INSERT INTO iceberg.default.customers SELECT * FROM new_batch_1;
-- Simultaneously:
INSERT INTO iceberg.default.customers SELECT * FROM new_batch_2;
Both inserts complete successfully. The table remains consistent. No data loss.
Challenge 3: Query Performance Degradation
As your table grows to billions of rows, queries slow down. The fix: ensure proper partitioning and use Iceberg’s compaction and optimization features. As detailed in Create a Data Lakehouse with Trino, Iceberg, S3, Parquet, regular maintenance keeps performance predictable.
Challenge 4: S3 Cost Surprises
Unoptimized queries can scan enormous amounts of data, inflating your S3 bill. The fix: use Iceberg’s partitioning, enable Trino’s caching, and monitor query patterns. Querying Apache Iceberg Tables in Amazon Athena documents best practices for cost-conscious querying.
Comparing to Alternatives
How does S3 + Iceberg + Trino stack up against other approaches?
vs. Snowflake
- Snowflake: Fully managed, but expensive ($20–30K/month for mid-market). Proprietary format.
- S3 + Iceberg + Trino: DIY management, but 90% cheaper. Open formats, no lock-in.
- Winner for cost: S3 + Iceberg + Trino. Winner for simplicity: Snowflake.
vs. Redshift
- Redshift: AWS-native, good performance, but requires capacity planning. Expensive for ad-hoc queries.
- S3 + Iceberg + Trino: Elastic scaling, compute sized to your actual workload, open source.
- Winner: Tie. Choose Redshift if you’re already deep in the AWS ecosystem and value managed simplicity. Choose Trino if you want flexibility and cost control.
vs. Databricks
- Databricks: Managed Iceberg + Spark, excellent for ML workloads, but proprietary and expensive.
- S3 + Iceberg + Trino: Open source, lower cost, but requires more ops work.
- Winner: Databricks if you’re doing heavy ML. Trino if you prioritize cost and control.
vs. Preset (Superset SaaS)
- Preset: Managed Superset, takes infrastructure off your plate, but charges per user/dashboard.
- D23 + Trino: Self-managed Superset, lower per-user cost, but you manage the stack.
- Winner: Preset if you want zero ops. D23 if you want control and cost efficiency.
For data and analytics leaders evaluating managed Apache Superset alternatives to Looker, Tableau, and Power BI, the S3 + Iceberg + Trino foundation is uniquely cost-effective when paired with a platform like D23 that handles the operational complexity.
Advanced: Multi-Tenant Analytics with Row-Level Security
For SaaS companies and private equity firms standardizing analytics across portfolio companies, you can build a multi-tenant lakehouse:
Architecture:
- One Iceberg table with a tenant_id column.
- Row-level security (RLS) enforced via Superset or Trino’s native access-control rules.
- Each tenant’s Superset dashboard queries only their data.
CREATE TABLE iceberg.default.orders (
order_id INT,
tenant_id INT,
amount DECIMAL(10, 2),
created_at TIMESTAMP
);
When a user from tenant 123 logs into Superset, Superset injects WHERE tenant_id = 123 into every query. Trino enforces this at the engine level, ensuring no data leakage.
This pattern scales to thousands of tenants with a single lakehouse.
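The predicate-injection idea can be sketched in a few lines. This is a deliberately naive, string-based illustration; Superset’s RLS filters and Trino’s row-filter rules apply the predicate at the semantic layer rather than by string concatenation:

```python
def apply_rls(sql: str, tenant_id: int) -> str:
    # Naive sketch of row-level security: append a tenant predicate
    # to every query. Real systems do this safely at the semantic layer.
    clause = f"tenant_id = {int(tenant_id)}"   # int() guards against injection
    if " where " in sql.lower():
        return f"{sql} AND {clause}"
    return f"{sql} WHERE {clause}"

q = apply_rls("SELECT order_id, amount FROM orders", 123)
assert q == "SELECT order_id, amount FROM orders WHERE tenant_id = 123"
```

The important property is that the filter is applied by infrastructure the user can’t bypass, not by dashboard authors remembering to add WHERE clauses.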
Monitoring and Operations
Production lakehouses require monitoring:
Key metrics:
- Query latency: P50, P95, P99 query times. Alert if P95 > 30 seconds.
- Query volume: Queries per second. Helps you size your cluster.
- S3 API calls: Monitor LIST and GET operations. Excessive calls indicate inefficient queries.
- Iceberg metadata size: As metadata accumulates, it can slow down query planning. Periodically clean up old snapshots.
- Data freshness: How recently was data updated? Critical for operational dashboards.
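For the latency metrics above, a dependency-free percentile helper is enough to drive alerting. A nearest-rank sketch:

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: small, dependency-free monitoring helper.
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[rank - 1]

# Query latencies in seconds over some window (illustrative values).
latencies_s = [0.4, 0.6, 0.8, 1.1, 1.3, 2.0, 2.4, 3.1, 9.5, 42.0]
assert percentile(latencies_s, 50) == 1.3
assert percentile(latencies_s, 95) == 42.0   # the tail is what pages you
```

With the 30-second P95 threshold suggested above, this window would fire an alert; in practice you would pull these samples from Trino’s query history or its Prometheus metrics.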
Tools:
- Trino UI: http://trino-host:8080 shows running queries, query history, and worker status.
- CloudWatch: Monitor S3 costs and API call patterns.
- Prometheus + Grafana: Scrape Trino metrics and visualize cluster health.
Getting Started: Step-by-Step Checklist
- Create an S3 bucket for your lakehouse data.
- Set up IAM roles granting Trino and your ingestion tools access to S3.
- Deploy Trino (Docker, Kubernetes, or EC2).
- Configure the Iceberg catalog (Glue, Hive, or REST).
- Create your first Iceberg table from existing data (Parquet, CSV, or database).
- Test queries in the Trino UI.
- Connect Superset to Trino.
- Build your first dashboard in Superset.
- Set up ingestion jobs to populate Iceberg tables regularly.
- Monitor costs and performance using CloudWatch and Trino’s built-in tools.
For teams already using D23’s managed Apache Superset platform, steps 7–8 are simplified—D23 handles Superset configuration and optimization, letting you focus on data and analytics strategy.
Conclusion: The Economics and Flexibility of Open Lakehouses
The S3 + Iceberg + Trino stack represents a fundamental shift in how organizations approach analytics infrastructure. Instead of paying premium prices for managed warehouses or proprietary BI platforms, you get production-grade analytics on open standards at a fraction of the cost.
The trade-off is operational responsibility: you manage Trino deployments, monitor cluster health, and handle Iceberg maintenance. But for engineering teams and data leaders with in-house ops capabilities, this is a worthwhile exchange.
The stack is particularly powerful when combined with a modern BI layer. D23’s approach to managing Apache Superset on top of this lakehouse architecture gives you the best of both worlds: open-source flexibility below, managed simplicity above. You get cost-effective storage and querying, ACID-compliant data management, and a polished analytics interface—without vendor lock-in.
For CTOs evaluating managed open-source BI as an alternative to Looker, Tableau, and Power BI, this architecture proves that open source doesn’t mean compromising on performance, scalability, or user experience. It means taking control of your data destiny—and your budget.
Review the D23 Privacy Policy and Terms of Service if you’re considering a managed Superset deployment. Start small with a test lakehouse, measure your cost savings, and scale from there.