Devoured - April 30, 2026
Rocky (GitHub Repo)

Data

A Rust-based control plane for data warehouses that adds compile-time safety, branch testing, and column-level lineage to pipelines running on Databricks or Snowflake.

What: Rocky is an open-source layer that sits on top of existing data warehouses to provide features like schema drift detection, data contracts enforced at compile time, isolated branches for testing changes, and column-level lineage tracking that shows exactly which downstream models depend on specific columns.
Why it matters: Data pipelines traditionally break at runtime, or silently corrupt data, when schemas change or contracts are violated; Rocky brings software engineering practices like compile-time checks and git-style branches to data warehousing, catching errors before they reach production data.
Takeaway: Install Rocky locally with a single curl command and run the 60-second playground tutorial on DuckDB to test features like schema drift recovery and branch isolation without needing cloud credentials.
Deep dive
  • Automatically detects schema drift by diffing source versus target schemas on each run and recreating tables when upstream column types change, preventing silent data corruption that tools like dbt allow
  • Enforces data contracts at compile time by surfacing diagnostic codes for missing required columns, removed protected columns, or unsafe type changes before any data is written
  • Supports named branches that run against isolated schemas, allowing developers to test changes and inspect results before promoting to production
  • Provides column-level lineage that traces individual columns from downstream facts back through aggregations to source seeds, enabling precise blast-radius analysis when changing models
  • Includes AI model generation that takes a plain-English description of a transformation, generates Rocky DSL code, compiles it, and automatically retries on parse failures
  • Offers PR-time blast-radius analysis via rocky lineage-diff that compares git refs and generates per-changed-column reports of downstream consumers as Markdown for GitHub PR comments
  • Handles PII classification and masking by tagging columns in model sidecars, binding tags to environment-specific mask strategies, and failing CI builds when classified columns lack masking rules
  • Implements incremental loads with persistent watermark state by tracking high-water marks in an embedded state store and only inserting rows with timestamps beyond the watermark
  • Built as a multi-component system with a Rust CLI core, Python Dagster integration, TypeScript VS Code extension, and adapter SDK for adding new warehouse backends
  • Runs locally on DuckDB for testing without cloud credentials, making it easy to try all features in self-contained proof-of-concept demos
  • Released as open source under Apache 2.0 with independent versioning for each component (CLI, Dagster wheel, VS Code extension) using tag-namespaced releases
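
The schema drift and contract bullets above boil down to comparing two column maps before any data is written. The sketch below illustrates that idea in plain Python. It is not Rocky's implementation or API: the `Contract` class, the `SAFE_WIDENINGS` table, the function names, and the diagnostic strings are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical types for illustration; none of these names come from Rocky.
Schema = dict[str, str]  # column name -> type name


@dataclass
class Contract:
    required: set[str] = field(default_factory=set)   # must exist upstream
    protected: set[str] = field(default_factory=set)  # must never be dropped


# Type changes treated as safe widenings in this sketch (an assumption).
SAFE_WIDENINGS = {("INTEGER", "BIGINT"), ("FLOAT", "DOUBLE")}


def diff_schemas(source: Schema, target: Schema) -> dict[str, list]:
    """Compare upstream (source) columns against the materialized target."""
    added = sorted(c for c in source if c not in target)
    removed = sorted(c for c in target if c not in source)
    retyped = sorted(
        (c, target[c], source[c])  # (column, old type, new type)
        for c in source
        if c in target and source[c] != target[c]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


def check_contract(source: Schema, target: Schema, contract: Contract) -> list[str]:
    """Return diagnostics; an empty list means the change may proceed."""
    drift = diff_schemas(source, target)
    diagnostics = []
    for col in sorted(contract.required - set(source)):
        diagnostics.append(f"missing required column: {col}")
    for col in drift["removed"]:
        if col in contract.protected:
            diagnostics.append(f"removed protected column: {col}")
    for col, old, new in drift["retyped"]:
        if (old, new) not in SAFE_WIDENINGS:
            diagnostics.append(f"unsafe type change: {col} {old} -> {new}")
    return diagnostics


# Example: upstream widened `id`, retyped `amount`, and dropped `email`.
source = {"id": "BIGINT", "amount": "VARCHAR"}
target = {"id": "INTEGER", "amount": "DOUBLE", "email": "VARCHAR"}
contract = Contract(required={"id", "amount"}, protected={"email"})
diagnostics = check_contract(source, target, contract)
for line in diagnostics:
    print(line)
```

In this example the widening of `id` passes, while the dropped `email` column and the `amount` retype are reported. Running the check before any write is what turns a silent runtime problem into an up-front failure.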
Decoder
  • DAG: Directed Acyclic Graph, the standard way to represent data pipeline dependencies where each node is a transformation and edges show the flow of data
  • dbt: Data Build Tool, a popular SQL-based transformation framework for data warehouses that Rocky positions itself as an alternative to
  • DuckDB: An embedded analytical database similar to SQLite but optimized for analytics queries, used here for local testing without cloud setup
  • Schema drift: When the structure of data tables changes over time (columns added/removed, types changed) causing pipeline failures or incorrect results
  • Data contracts: Explicit agreements about the structure and quality of data, including required columns, allowed types, and constraints
  • Lineage: Tracking how data flows from sources through transformations to final outputs, showing dependencies between datasets
  • Watermark: A timestamp marking the last successfully processed record in incremental data loads, used to avoid reprocessing old data
  • PII: Personally Identifiable Information, sensitive data like names or emails that requires special handling and masking
  • Blast radius: The scope of downstream systems affected by a change, used in impact analysis before deploying modifications
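
To make the lineage and blast-radius terms above concrete, here is a minimal Python sketch that treats column-level lineage as a graph and walks it downstream from a changed column. It is an illustration under stated assumptions, not Rocky's data model: the `UPSTREAM` mapping, the model and column names, and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each (model, column) maps to the columns it reads.
# Model and column names are made up for illustration.
UPSTREAM = {
    ("stg_orders", "amount"): [("seed_orders", "amount")],
    ("stg_orders", "customer_id"): [("seed_orders", "customer_id")],
    ("agg_revenue", "total"): [("stg_orders", "amount")],
    ("fct_customer", "revenue"): [("agg_revenue", "total")],
    ("fct_customer", "customer_id"): [("stg_orders", "customer_id")],
}


def invert(upstream):
    """Build a column -> direct consumers map from a column -> inputs map."""
    downstream = {}
    for col, inputs in upstream.items():
        for src in inputs:
            downstream.setdefault(src, []).append(col)
    return downstream


def blast_radius(changed, upstream):
    """Breadth-first walk over every column that transitively reads `changed`."""
    downstream = invert(upstream)
    seen, queue = set(), deque([changed])
    while queue:
        for consumer in downstream.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


# Which columns are affected if seed_orders.amount changes?
affected = blast_radius(("seed_orders", "amount"), UPSTREAM)
for model, column in affected:
    print(f"{model}.{column}")
```

Because the graph is keyed by column rather than by table, a change to `seed_orders.amount` leaves `fct_customer.customer_id` outside the reported radius even though both live in the same models.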
Original article

Rocky is a Rust-based tool that adds a control layer on top of data warehouses, helping teams manage pipelines with features like data contracts, lineage tracking, and safe testing through branches. It focuses on catching errors early, preventing data issues, and making data workflows more reliable and easier to understand.