Materialized Tables in Apache Flink (14 minute read)
Apache Flink's Materialized Tables embed query definitions within table metadata, simplifying pipeline management and schema evolution for streaming ETL.
What: Materialized Tables in Apache Flink (added in version 1.20, released 2024) let you define both a table's schema and its population query together in the catalog, making the refresh logic part of the table definition rather than a separate job to manage.
Why it matters: Traditional Flink SQL separates table definitions from the INSERT queries that populate them. That split creates lifecycle headaches: jobs don't automatically restart after cluster restarts, and schema changes require manually stopping jobs, altering tables, and recreating INSERT statements. Materialized Tables solve this by persisting the query definition with the table metadata, enabling automatic job resurrection and streamlined schema evolution via simple ALTER statements.
Takeaway: Try Materialized Tables with Apache Paimon (currently the only production-ready catalog that supports the feature) if you're running Flink SQL pipelines that need simpler lifecycle management.
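As a rough sketch of what this looks like in practice (table and column names here are hypothetical), a Materialized Table bundles the schema, the freshness target, and the defining query into a single statement:

```sql
-- Hypothetical pipeline: continuously roll up raw orders into daily totals.
-- FRESHNESS sets the refresh cadence; Flink derives the refresh mode
-- (continuous streaming vs. scheduled batch) from it unless overridden
-- with an explicit REFRESH_MODE clause.
CREATE MATERIALIZED TABLE daily_order_totals
  (PRIMARY KEY (order_date) NOT ENFORCED)
  FRESHNESS = INTERVAL '30' SECOND
  AS SELECT
       DATE_FORMAT(order_time, 'yyyy-MM-dd') AS order_date,
       SUM(amount) AS total_amount
     FROM orders
     GROUP BY DATE_FORMAT(order_time, 'yyyy-MM-dd');
```

Because the query lives in the catalog alongside the schema, the background refresh job can be recreated from the stored definition rather than managed as a separate INSERT job.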
Deep dive
- Traditional Flink SQL requires either CREATE TABLE + INSERT or CREATE TABLE AS SELECT (CTAS), both of which spawn separate jobs that have no persistent association with the table definition
- When task managers restart, INSERT jobs spawned by CTAS or standalone INSERTs are killed and not automatically restarted, while Materialized Table jobs resurrect automatically because the query definition is persisted in the catalog
- Catalog metadata for Materialized Tables includes the definition query, refresh mode (continuous or scheduled), execution details, and job ID—all stored alongside the standard schema information
- Schema evolution with traditional approaches requires stopping the INSERT job, altering the table, recreating the INSERT with updated columns, and potentially dealing with data type mismatches and NULL constraint violations from existing data
- Materialized Tables support schema evolution via ALTER MATERIALIZED TABLE with a new AS SELECT clause, which automatically stops the old job and starts a new one with the updated schema, though it starts from the beginning rather than restoring from previous state
- The feature requires a catalog that supports Materialized Tables (currently Apache Paimon or test-filesystem for testing) plus a scheduler for automated refreshes
- Materialized Tables can be paused with SUSPEND and resumed with RESUME, allowing you to temporarily halt processing without losing the job definition
- Flink's streaming nature means aggregate queries show changelog updates (insertions, updates, deletions) rather than final results, and queries over unbounded sources continue running indefinitely
- The test-filesystem catalog used in examples stores both catalog metadata and table data to disk, making it possible to inspect the internal representation of table definitions
- When a Materialized Table is resumed after being suspended, it picks up new data that arrived during the suspension period, demonstrating proper state management
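The lifecycle operations described above can be sketched as follows (the table name is hypothetical, and the ALTER ... AS SELECT form requires a Flink release that ships it):

```sql
-- Pause the background refresh job; the definition stays in the catalog.
ALTER MATERIALIZED TABLE daily_order_totals SUSPEND;

-- Resume: the job is recreated from the stored definition and catches up
-- on data that arrived while suspended.
ALTER MATERIALIZED TABLE daily_order_totals RESUME;

-- Schema evolution: replace the defining query in place. Flink stops the
-- old job and starts a new one with the updated schema; note that it
-- recomputes from the beginning rather than restoring previous state.
ALTER MATERIALIZED TABLE daily_order_totals
  AS SELECT
       DATE_FORMAT(order_time, 'yyyy-MM-dd') AS order_date,
       SUM(amount) AS total_amount,
       COUNT(*) AS order_count  -- newly added column
     FROM orders
     GROUP BY DATE_FORMAT(order_time, 'yyyy-MM-dd');
```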
Decoder
- Apache Flink: Distributed stream processing framework that can handle both batch and real-time data processing with SQL and Java/Scala APIs
- Materialized Table: A table object that includes both its schema definition and the query used to populate/refresh it, stored together in the catalog
- ETL: Extract, Transform, Load—the process of moving and transforming data from sources to destinations
- CTAS: CREATE TABLE AS SELECT—SQL syntax that creates a table and populates it with query results in a single statement
- Catalog: Metadata store in Flink that holds information about databases, tables, and other objects
- Changelog: Stream of data changes showing operations like inserts (+I), updates (-U/+U), and deletes (-D) rather than just final values
- Unbounded stream: Data source that continuously produces records without a defined end, like a Kafka topic, as opposed to finite batch data
- Upsert: Update-or-insert operation that updates a row if it exists or inserts it if it doesn't, based on a primary key
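To illustrate the Changelog entry above: a streaming aggregate over an unbounded source emits change events rather than a final table. A hypothetical example (table and values invented):

```sql
SELECT user_id, COUNT(*) AS clicks
FROM click_stream  -- unbounded source, e.g. a Kafka-backed table
GROUP BY user_id;

-- Example changelog output as events arrive:
--   +I[alice, 1]   first click inserts a row
--   -U[alice, 1]   second click retracts the old count...
--   +U[alice, 2]   ...and emits the updated one
```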
Original article
Materialized Tables in Apache Flink allow users to define a table directly with its population query, embedding both the schema and the continuous or scheduled refresh logic inside the catalog. This simplifies ETL pipelines by automatically handling job lifecycle, schema evolution, and refreshes.