The Last Mile to Apache Iceberg - Building a Basement Data Platform (8 minute read)
A developer shows how to build a basement-scale Apache Iceberg data lake for under $5/month using Cloudflare R2 and a simple HTTP ingestion proxy.
What: The article walks through building a minimal analytics platform combining Cloudflare R2 storage (no egress fees), R2 Data Catalog (managed Iceberg metadata), and a custom ~500-line Rust HTTP proxy that converts POSTed NDJSON events into atomic Iceberg commits, queryable via Trino or DuckDB.
Why it matters: Most data pipeline solutions (Kafka, Fivetran, Snowplow) cost hundreds of dollars per month and are designed for enterprise scale. This approach strips the problem down to its essentials: for side projects generating a steady trickle of events, you only need to build the last mile between your app and object storage, not an entire platform.
Takeaway: The repo at github.com/etra/stateless-anchor includes a complete test harness with Trino config, a synthetic event generator, and the Rust proxy - or implement the same pattern in about 20 lines of Python with PyIceberg.
Deep dive
- Traditional data pipeline solutions (Kafka on EC2, managed services like Fivetran starting at $500/month) are prohibitively expensive for hobby-scale analytics tracking thousands rather than millions of events
- Cloudflare R2's zero egress fees change the economics for laptop-scale queries compared to AWS S3, where pulling data back for ad-hoc analysis incurs charges
- R2 Data Catalog provides managed Apache Iceberg REST catalog metadata sitting directly on R2 buckets, eliminating the need to run your own catalog service
- The key architectural insight is that if upstream logging infrastructure already handles at-least-once delivery, you only need to build the final hop: HTTP endpoint to Iceberg commits
- The ingestion contract is deliberately minimal: one POST request equals one atomic Iceberg transaction - parse NDJSON to Arrow, commit to Iceberg, return 200 with row count or error
- The author's Rust implementation (stateless-anchor) is ~500 lines handling POST /api/schema/{ns}/table/{table}/push endpoints, with each request taking ~4 seconds for the full round trip including catalog calls and S3 PUTs
- A functionally equivalent implementation in PyIceberg takes about 20 lines using FastAPI and pyarrow.json.read_json() - the pattern matters more than the implementation language (a sketch follows this list)
- Intentionally missing features include retries (delegated to upstream Vector/fluent-bit/Datadog agents), authentication (put an nginx proxy in front), deduplication (waiting on row-delta commits landing in iceberg-rust), and rate limiting
- The setup creates real Parquet files in R2 buckets organized by partition keys, with Iceberg metadata stored separately under __r2_data_catalog/, making the data durable independently of the ingestion service
- Performance characteristics show 100-record batches (~45KB) committing in ~4 seconds, which is appropriate for analytical ingestion but not suitable for low-latency transactional workloads
- Common gotcha: R2 requires s3.region=auto in client configuration because it doesn't pin buckets to regions the way AWS does - setting us-east-1 instead triggers cryptic 403 errors during SigV4 signing (reflected in the catalog config sketched below)
- The complete test harness in the repo includes server config templates, synthetic event generators, and pre-configured Trino containers to replicate the entire setup
- Alternative architectures include async logger → SQLite buffer → proxy for at-least-once delivery from applications, or skipping the proxy entirely for Python workloads that can call pyiceberg.Table.append() directly
- The pattern demonstrates that once cheap object storage and managed catalog are in place, the ingestion layer doesn't need to be a platform - just a small HTTP service doing one thing well
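A minimal Python sketch of the ~20-line PyIceberg equivalent described above might look like the following. The catalog URI, token, warehouse, and S3 credentials are placeholders (the article does not publish them), the target table is assumed to already exist with a schema matching the incoming events, and the route shape mirrors the Rust proxy's push endpoint:

```python
# Sketch of the "one POST = one atomic Iceberg commit" pattern with
# FastAPI + PyIceberg. Placeholder values are marked; assumes the target
# table already exists with a schema compatible with the incoming events.
import io

import pyarrow.json as pa_json
from fastapi import FastAPI, HTTPException, Request
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "r2",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/v1",  # placeholder: R2 Data Catalog REST endpoint
        "token": "<api-token>",                   # placeholder
        "warehouse": "<warehouse>",               # placeholder
        # R2 is region-less: "auto" is required; us-east-1 yields 403s (see gotcha above)
        "s3.region": "auto",
        "s3.endpoint": "https://<account-id>.r2.cloudflarestorage.com",  # placeholder
        "s3.access-key-id": "<key>",              # placeholder
        "s3.secret-access-key": "<secret>",       # placeholder
    },
)

app = FastAPI()

@app.post("/api/schema/{ns}/table/{table}/push")  # route shape borrowed from the Rust proxy
async def push(ns: str, table: str, request: Request):
    body = await request.body()
    try:
        # NDJSON -> Arrow: one JSON object per line becomes one row
        batch = pa_json.read_json(io.BytesIO(body))
    except Exception as exc:
        raise HTTPException(status_code=400, detail=f"bad NDJSON: {exc}")
    tbl = catalog.load_table(f"{ns}.{table}")
    tbl.append(batch)  # single atomic Iceberg commit per request
    return {"rows": batch.num_rows}
```

As the last Deep dive bullet on alternatives notes, a Python application can skip the HTTP layer entirely and call table.append() on an Arrow table directly; the endpoint above is only needed when events arrive over the network.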
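A correspondingly minimal client, again with illustrative field names and a local URL (not from the article):

```python
# Hypothetical client: POST two NDJSON events to the sketch above.
import urllib.request

events = (
    b'{"user_id": "u1", "event": "page_view", "ts": "2026-01-01T00:00:00Z"}\n'
    b'{"user_id": "u2", "event": "click", "ts": "2026-01-01T00:00:05Z"}\n'
)
req = urllib.request.Request(
    "http://localhost:8000/api/schema/analytics/table/events/push",
    data=events,
    method="POST",
    headers={"Content-Type": "application/x-ndjson"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read())  # e.g. {"rows": 2}
```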
Decoder
- Apache Iceberg: Open table format for large analytic datasets that provides ACID transactions, schema evolution, and time travel on data stored in object storage like S3
- NDJSON: Newline Delimited JSON, a format where each line is a separate JSON object, commonly used for streaming event data (see the short example after this list)
- Egress fees: Charges cloud providers impose when data is transferred out of their network, which can make frequent querying expensive on services like AWS S3
- REST catalog: HTTP API that stores and serves Apache Iceberg table metadata (schema, partition info, file locations) separately from the actual data files
- Parquet: Columnar storage file format optimized for analytics, widely used in data lakes because it compresses well and allows efficient column-based queries
- R2: Cloudflare's S3-compatible object storage service that notably charges zero egress fees, unlike AWS S3
- Trino: Distributed SQL query engine (formerly PrestoSQL) that can query data across multiple sources including Iceberg tables
- DuckDB: In-process analytical database engine designed for fast queries on local or remote data files
- Arrow: Apache Arrow, an in-memory columnar data format that Iceberg libraries use as an intermediate representation when writing data
- Optimistic concurrency: Transaction strategy where commits assume no conflicts and retry if metadata has changed, rather than locking resources upfront
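To make the NDJSON and Arrow entries concrete, here is a tiny illustration (not from the article) of the conversion the proxy performs on every request:

```python
# NDJSON in, Arrow table out: the intermediate step before an Iceberg commit.
import io

import pyarrow.json as pa_json

ndjson = b'{"user_id": "u1", "event": "page_view"}\n{"user_id": "u2", "event": "click"}\n'
tbl = pa_json.read_json(io.BytesIO(ndjson))
print(tbl.num_rows)  # 2 (one row per line)
print(tbl.schema)    # inferred: user_id: string, event: string
```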
Original article
Cloudflare R2 plus R2 Data Catalog makes a cheap, laptop-scale Iceberg lake practical: no egress fees, S3-compatible storage, and managed catalog metadata for Trino/DuckDB. The missing piece is ingestion, solved here with a ~500-line Rust HTTP proxy that converts POSTed NDJSON into a single atomic Iceberg commit.