The Last Mile to Apache Iceberg - Building a Basement Data Platform (8 minute read)
A developer shows how to build a basement-scale Apache Iceberg data lake for under $5/month using Cloudflare R2 and a simple HTTP ingestion proxy.
What: The article walks through building a minimal analytics platform combining Cloudflare R2 storage (no egress fees), R2 Data Catalog (managed Iceberg metadata), and a custom ~500-line Rust HTTP proxy that converts POSTed NDJSON events into atomic Iceberg commits, queryable via Trino or DuckDB.
Why it matters: Most data pipeline solutions (Kafka, Fivetran, Snowplow) cost hundreds of dollars per month and are designed for enterprise scale. This approach strips the problem down to its essentials: for side projects generating a steady trickle of events, you only need to build the last mile between your app and object storage, not an entire platform.
Takeaway: The repo at github.com/etra/stateless-anchor includes a complete test harness with Trino config, a synthetic event generator, and the Rust proxy - or implement the same pattern in about 20 lines of Python with PyIceberg.
Deep dive
- Traditional data pipeline solutions (Kafka on EC2, managed services like Fivetran starting at $500/month) are prohibitively expensive for hobby-scale analytics tracking thousands rather than millions of events
- Cloudflare R2's zero egress fees change the economics for laptop-scale queries compared to AWS S3, where pulling data back for ad-hoc analysis incurs charges
- R2 Data Catalog provides managed Apache Iceberg REST catalog metadata sitting directly on R2 buckets, eliminating the need to run your own catalog service
- The key architectural insight is that if upstream logging infrastructure already handles at-least-once delivery, you only need to build the final hop: HTTP endpoint to Iceberg commits
- The ingestion contract is deliberately minimal: one POST request equals one atomic Iceberg transaction - parse NDJSON to Arrow, commit to Iceberg, return 200 with row count or error
- The author's Rust implementation (stateless-anchor) is ~500 lines handling POST /api/schema/{ns}/table/{table}/push endpoints, with each request taking ~4 seconds for the full round trip including catalog calls and S3 PUTs
- A functionally equivalent implementation in PyIceberg takes about 20 lines using FastAPI and pyarrow.json.read_json() - the pattern matters more than the implementation language (a sketch follows this list)
- Intentionally missing features include retries (delegated to upstream Vector/fluent-bit/Datadog agents), authentication (put an nginx proxy in front), deduplication (waiting on row-delta commits landing in iceberg-rust), and rate limiting
- The setup creates real Parquet files in R2 buckets organized by partition keys, with Iceberg metadata stored separately under __r2_data_catalog/, making the data durable independently of the ingestion service
- Performance characteristics show 100-record batches (~45KB) committing in ~4 seconds, which is appropriate for analytical ingestion but not suitable for low-latency transactional workloads
- Common gotcha: R2 requires s3.region=auto in client configuration because it doesn't pin buckets to regions the way AWS does - setting us-east-1 instead triggers cryptic 403 errors during SigV4 signing (reflected in the catalog config sketched below)
- The complete test harness in the repo includes server config templates, synthetic event generators, and pre-configured Trino containers to replicate the entire setup
- Alternative architectures include async logger → SQLite buffer → proxy for at-least-once delivery from applications, or skipping the proxy entirely for Python workloads that can call pyiceberg.Table.append() directly
- The pattern demonstrates that once cheap object storage and managed catalog are in place, the ingestion layer doesn't need to be a platform - just a small HTTP service doing one thing well
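A minimal Python sketch of the ~20-line PyIceberg equivalent described above might look like the following. The catalog URI, token, warehouse, and S3 credentials are placeholders (the article does not publish them), the target table is assumed to already exist with a schema matching the incoming events, and the route shape mirrors the Rust proxy's push endpoint:

```python
# Sketch of the "one POST = one atomic Iceberg commit" pattern with
# FastAPI + PyIceberg. Placeholder values are marked; assumes the target
# table already exists with a schema compatible with the incoming events.
import io

import pyarrow.json as pa_json
from fastapi import FastAPI, HTTPException, Request
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "r2",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/v1",  # placeholder: R2 Data Catalog REST endpoint
        "token": "<api-token>",                   # placeholder
        "warehouse": "<warehouse>",               # placeholder
        # R2 is region-less: "auto" is required; us-east-1 yields 403s (see gotcha above)
        "s3.region": "auto",
        "s3.endpoint": "https://<account-id>.r2.cloudflarestorage.com",  # placeholder
        "s3.access-key-id": "<key>",              # placeholder
        "s3.secret-access-key": "<secret>",       # placeholder
    },
)

app = FastAPI()

@app.post("/api/schema/{ns}/table/{table}/push")  # route shape borrowed from the Rust proxy
async def push(ns: str, table: str, request: Request):
    body = await request.body()
    try:
        # NDJSON -> Arrow: one JSON object per line becomes one row
        batch = pa_json.read_json(io.BytesIO(body))
    except Exception as exc:
        raise HTTPException(status_code=400, detail=f"bad NDJSON: {exc}")
    tbl = catalog.load_table(f"{ns}.{table}")
    tbl.append(batch)  # single atomic Iceberg commit per request
    return {"rows": batch.num_rows}
```

As the last Deep dive bullet on alternatives notes, a Python application can skip the HTTP layer entirely and call table.append() on an Arrow table directly; the endpoint above is only needed when events arrive over the network.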
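A correspondingly minimal client, again with illustrative field names and a local URL (not from the article):

```python
# Hypothetical client: POST two NDJSON events to the sketch above.
import urllib.request

events = (
    b'{"user_id": "u1", "event": "page_view", "ts": "2026-01-01T00:00:00Z"}\n'
    b'{"user_id": "u2", "event": "click", "ts": "2026-01-01T00:00:05Z"}\n'
)
req = urllib.request.Request(
    "http://localhost:8000/api/schema/analytics/table/events/push",
    data=events,
    method="POST",
    headers={"Content-Type": "application/x-ndjson"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read())  # e.g. {"rows": 2}
```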
Decoder
- Apache Iceberg: Open table format for large analytic datasets that provides ACID transactions, schema evolution, and time travel on data stored in object storage like S3
- NDJSON: Newline Delimited JSON, a format where each line is a separate JSON object, commonly used for streaming event data (see the short example after this list)
- Egress fees: Charges cloud providers impose when data is transferred out of their network, which can make frequent querying expensive on services like AWS S3
- REST catalog: HTTP API that stores and serves Apache Iceberg table metadata (schema, partition info, file locations) separately from the actual data files
- Parquet: Columnar storage file format optimized for analytics, widely used in data lakes because it compresses well and allows efficient column-based queries
- R2: Cloudflare's S3-compatible object storage service that notably charges zero egress fees, unlike AWS S3
- Trino: Distributed SQL query engine (formerly PrestoSQL) that can query data across multiple sources including Iceberg tables
- DuckDB: In-process analytical database engine designed for fast queries on local or remote data files
- Arrow: Apache Arrow, an in-memory columnar data format that Iceberg libraries use as an intermediate representation when writing data
- Optimistic concurrency: Transaction strategy where commits assume no conflicts and retry if metadata has changed, rather than locking resources upfront
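To make the NDJSON and Arrow entries concrete, here is a tiny illustration (not from the article) of the conversion the proxy performs on every request:

```python
# NDJSON in, Arrow table out: the intermediate step before an Iceberg commit.
import io

import pyarrow.json as pa_json

ndjson = b'{"user_id": "u1", "event": "page_view"}\n{"user_id": "u2", "event": "click"}\n'
tbl = pa_json.read_json(io.BytesIO(ndjson))
print(tbl.num_rows)  # 2 (one row per line)
print(tbl.schema)    # inferred: user_id: string, event: string
```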
Original article
Cloudflare R2 plus R2 Data Catalog makes a cheap, laptop-scale Iceberg lake practical: no egress fees, S3-compatible storage, and managed catalog metadata for Trino/DuckDB. The missing piece is ingestion, solved here with a ~500-line Rust HTTP proxy that converts POSTed NDJSON into a single atomic Iceberg commit.