
The Schema Evolution Gap No Data Tool Is Solving

Every data transformation tool forces a choice: either you own the table or you're on your own with DDL. Nobody helps when you write SQL against existing tables and the schema doesn't match. Here's why that's a massive gap — and how BoltPipeline closes it.

Mar 28, 2026 | Ashok Dheeravath | 9 min read

Every data team hits the same wall. You write a SQL transformation. You deploy it. It fails at runtime because the target table is missing three columns your SQL expects. The schema doesn't match — and nobody told you before production broke.

This is the schema evolution problem — and every data tool on the market either ignores it or forces you into a tradeoff that doesn't work for real teams.

The Tradeoff Every Tool Forces

Today's transformation tools give you two options:

Option A: "I own the table." Tools like dbt and SQLMesh take full control of the target. CREATE TABLE AS SELECT. Rebuild every time. If you need schema evolution, dbt gives you `on_schema_change` with four modes: `ignore`, `append_new_columns`, `sync_all_columns`, or `fail`. SQLMesh gives you `on_destructive_change` policies per model with schema diffs at plan time.

This works — until your table is shared. Multiple pipelines writing to the same target. Other teams with their own processes. A reporting layer that can't tolerate a full rebuild. The "I own the table" model falls apart the moment someone else also writes to it.

Option B: "You're on your own." Tools like Matillion and legacy ETL platforms leave DDL entirely to you. You write transforms in a visual canvas. If the target schema doesn't match, you manually create DDL components and wire them into the orchestration flow. No parsing, no gap detection, no suggestions.

Neither option solves the actual problem: given the SQL you wrote and the table that exists, what's missing?

How Each Tool Handles Schema Evolution

Let's be specific about what's available today.

[dbt](https://docs.getdbt.com/docs/build/incremental-models) — dbt's model is CREATE TABLE AS SELECT or CREATE VIEW AS SELECT. For incremental models, `on_schema_change` controls what happens when columns change — but it's table-level, not column-level. And SCD is handled through snapshot macros: you configure strategy, unique_key, and updated_at per table, manually. Schema changes mean updating the macro. Key changes mean rewriting it. The dbt community has asked for years about handling DDL for existing tables. The answer is still: "dbt manages tables it creates."

[SQLMesh](https://sqlmesh.readthedocs.io/en/latest/concepts/plans/) — More sophisticated. `sqlmesh plan` shows schema diffs before applying changes, and its SCD Type 2 support is declarative (`MODEL (kind SCD_TYPE_2, unique_key [...])`). But it operates at model-level granularity — not column-level. And you must define models in SQLMesh's format. It doesn't analyze arbitrary SQL you bring from elsewhere. No DDL suggestions for existing tables it doesn't own.

[Dataform](https://cloud.google.com/dataform/docs/create-tables) (Google) — Compiles SQL for BigQuery with automatic schema detection for incremental models. But it's BigQuery-only. No multi-warehouse support. No SCD automation. No DDL suggestions.

[Fivetran Transformations](https://fivetran.com/docs/transformations) — Handles schema evolution on the ingestion side — connectors auto-propagate new columns from sources. But the transformation layer is outsourced to dbt Core, with all of dbt's limitations.

[Coalesce](https://docs.coalesce.io/docs/deploy-and-refresh/troubleshooting-deployments-and-refreshes/data-type-changes-snowflake-workflow) — Detects type changes and handles them via CLONE → ALTER → SWAP automatically. SCD is configured visually per node. But it's Snowflake-native (Databricks support is new), and it's metadata-driven — it doesn't parse your SQL to understand what columns you're referencing.

Database migration tools ([Atlas](https://atlasgo.io/), [Flyway](https://documentation.red-gate.com/fd/flyway-cherry-pick-setting-277578980.html), Liquibase) — Excellent at DDL governance — versioned migrations, selective apply, safety checks. But they don't understand business SQL. They don't generate SCD logic. They don't link DDL changes to transformation logic. They're schema management tools, not transformation tools.

The Five Things Nobody Does Together

Here's what it actually takes to solve the schema evolution problem end-to-end:

1. SQL Intelligence — Parse arbitrary user SQL to extract column references, target tables, and source-target mappings
2. Live Schema Introspection — Read what actually exists on the target warehouse: columns, types, defaults, identity columns
3. Gap Analysis — Diff what the SQL needs versus what the target has. Detect missing columns, type mismatches, missing audit columns
4. SCD-Aware DML Generation — Generate MERGE, INSERT, UPDATE with audit columns (hash, is_active, effective dates) based on SCD type
5. DDL Generation for the Gaps — ALTER TABLE ADD COLUMN, CREATE TABLE for missing targets — generated as suggestions, deployed with user approval

No tool on the market does all five. Most do zero or one. SQLMesh gets closest with plan-time diffs and model-level schema detection — roughly two to three of the five. But none of them start from your arbitrary SQL and work backward to figure out what's missing.

How BoltPipeline Solves This

BoltPipeline does all five — and each capability enables the next: SQL Intelligence (parse) → Schema Introspection (describe) → Gap Analysis (diff) → SCD DML Generation (generate) → DDL Generation (suggest).

Step 1: You write SQL. A standard SELECT statement — your business rule. You tag the target with SCD type (0, 1, or 2) and natural key. That's it.

Step 2: BoltPipeline parses it. SQL Intelligence (powered by SQLGLOT) extracts every column reference, every target table, every source-target mapping. This isn't regex pattern matching — it's full AST analysis of arbitrary SQL including CTEs, subqueries, and multi-table joins.

Step 3: The agent describes the target. A lightweight agent connected to your Snowflake instance runs DESCRIBE TABLE on every target. BoltPipeline now knows the exact current schema — columns, data types, defaults, identity columns — without ever reading your data.

Step 4: Gap analysis runs automatically. The platform diffs what your SQL expects against what the target actually has. Missing columns, type mismatches, columns that exist but need widening — all detected. For SCD targets, it also checks for missing audit columns (bolt_hash, bolt_is_active, bolt_effective_start, bolt_effective_end).
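Once both sides are reduced to column-to-type maps, the diff itself is a set comparison. A simplified Python sketch, where the function name and type strings are illustrative and not BoltPipeline's internal representation:

```python
# Audit columns an SCD Type 2 target is expected to carry
AUDIT_COLUMNS = {
    "bolt_hash": "VARCHAR",
    "bolt_is_active": "BOOLEAN",
    "bolt_effective_start": "TIMESTAMP_NTZ",
    "bolt_effective_end": "TIMESTAMP_NTZ",
}

def gap_analysis(sql_columns, target_columns, scd_type=2):
    """Column-level diff: what the SQL needs vs. what the target has."""
    expected = dict(sql_columns)
    if scd_type == 2:
        expected.update(AUDIT_COLUMNS)  # SCD 2 targets also need audit columns
    missing = {c: t for c, t in expected.items() if c not in target_columns}
    mismatched = {c: (target_columns[c], t) for c, t in expected.items()
                  if c in target_columns and target_columns[c] != t}
    return {"missing": missing, "type_mismatches": mismatched}

# The article's scenario: target exists but lacks mrr, region, and audit columns
sql_cols = {"customer_id": "NUMBER", "email": "VARCHAR", "plan_type": "VARCHAR",
            "mrr": "NUMBER", "region": "VARCHAR"}
target_cols = {"customer_id": "NUMBER", "email": "VARCHAR", "plan_type": "VARCHAR"}

gaps = gap_analysis(sql_cols, target_cols)
print(sorted(gaps["missing"]))
```

The same structure extends naturally to type-widening checks: a mismatch entry carries both the current and the expected type, so the next stage can decide whether an ALTER is safe.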

Step 5: DML is generated. Based on the SCD type tag, BoltPipeline generates the complete MERGE statement — change detection, hash computation, historical closures, effective date management. No macros. No boilerplate. Standard SQL that you can inspect, version-control, and run anywhere.
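The shape of the generated statement can be sketched as string assembly over the parsed column list. A simplified Python sketch; the function name is illustrative, and a real SCD Type 2 flow also re-inserts the new version of a changed row (typically a follow-up INSERT), which is omitted here for brevity:

```python
def generate_scd2_merge(target, source, natural_key, columns):
    """Build an SCD Type 2 MERGE: hash-based change detection plus closure."""
    col_list = ", ".join(columns)
    # Hash the tracked columns so WHEN MATCHED can cheaply detect changes
    hash_expr = "MD5(" + " || '|' || ".join(
        f"COALESCE(CAST({c} AS VARCHAR), '')" for c in columns) + ")"
    values_list = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} t\n"
        f"USING (SELECT {col_list}, {hash_expr} AS bolt_hash FROM {source}) s\n"
        f"ON t.{natural_key} = s.{natural_key} AND t.bolt_is_active = TRUE\n"
        f"WHEN MATCHED AND t.bolt_hash <> s.bolt_hash THEN UPDATE SET\n"
        f"    t.bolt_is_active = FALSE, t.bolt_effective_end = CURRENT_TIMESTAMP()\n"
        f"WHEN NOT MATCHED THEN INSERT ({col_list}, bolt_hash, bolt_is_active,\n"
        f"    bolt_effective_start, bolt_effective_end)\n"
        f"VALUES ({values_list}, s.bolt_hash, TRUE, CURRENT_TIMESTAMP(), NULL)"
    )

merge_sql = generate_scd2_merge("analytics.dim_customer", "raw.customers",
                                "customer_id", ["customer_id", "email", "mrr"])
print(merge_sql)
```

The point of the sketch is the inputs: everything the generator needs (target, natural key, column list) comes from the parse and the SCD tag, not from per-table configuration.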

Step 6: DDL is suggested. ALTER TABLE ADD COLUMN for missing columns. CREATE TABLE for missing targets. These are suggestions — you review and approve them before anything runs. In dev environments, you can approve and deploy immediately. In production, the same DDL goes through your existing governance process.
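The additive DDL follows directly from the gap-analysis output. A minimal sketch, with the function name and signature illustrative:

```python
def suggest_ddl(target, missing, defaults=None):
    """Turn gap-analysis output into one additive (non-destructive) ALTER TABLE."""
    defaults = defaults or {}
    clauses = []
    for col, col_type in missing.items():
        clause = f"ADD COLUMN {col} {col_type}"
        if col in defaults:
            clause += f" DEFAULT {defaults[col]}"
        clauses.append(clause)
    return f"ALTER TABLE {target} " + ", ".join(clauses)

ddl = suggest_ddl("analytics.dim_customer",
                  {"mrr": "NUMBER", "region": "VARCHAR"})
print(ddl)  # ALTER TABLE analytics.dim_customer ADD COLUMN mrr NUMBER, ADD COLUMN region VARCHAR
```

Keeping the generator strictly additive is what makes the review step tractable: an approver only ever sees ADD COLUMN and CREATE TABLE, never a DROP.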

A Real Scenario

You're building an SCD Type 2 pipeline for customer data. You write a SELECT — `SELECT customer_id, email, plan_type, mrr, region FROM raw.customers` — and tag the target `analytics.dim_customer` as SCD Type 2 with natural key `customer_id`.

BoltPipeline describes the target and finds it exists but is missing `mrr` and `region`. It also detects that the SCD audit columns (`bolt_hash`, `bolt_is_active`, `bolt_effective_start`, `bolt_effective_end`) don't exist yet.

The platform generates:

  • DDL: ALTER TABLE analytics.dim_customer ADD COLUMN mrr NUMBER, ADD COLUMN region VARCHAR, ADD COLUMN bolt_hash VARCHAR, ADD COLUMN bolt_is_active BOOLEAN DEFAULT TRUE, ADD COLUMN bolt_effective_start TIMESTAMP_NTZ, ADD COLUMN bolt_effective_end TIMESTAMP_NTZ
  • DML: A complete MERGE statement that handles inserts for new customers, updates for changed records (with hash comparison), and historical closures for SCD Type 2 — closing old records and inserting new versions with correct effective dates

You approve the DDL. It deploys. The DML runs. Done.

No macro configuration. No visual wiring. No manual ALTER statements. No 2am failures because someone added a column to the source and forgot to update the target.

Five Pain Points This Eliminates

1. "I can't use dbt because I don't own the table." Multiple pipelines, multiple teams, shared targets. dbt's CREATE-OR-REPLACE model doesn't work here. BoltPipeline works with tables it doesn't own — it suggests DDL, it doesn't take control.

2. "I wrote the SQL, now I need to figure out what DDL to run." User writes a transform. Target doesn't have the right columns. Normally you'd manually compare schemas and write ALTER statements. BoltPipeline does that comparison automatically.

3. "SCD means writing the same boilerplate for every table." MERGE logic, hash computation, effective date management, audit columns. It's the same pattern every time — just with different columns. BoltPipeline generates it from a tag.

4. "Schema drift breaks my pipeline at 2am." Source adds a column, target doesn't have it, pipeline fails. BoltPipeline detects the gap before deployment and suggests the fix — proactively, not reactively.

5. "I need governance for DDL in prod but speed in dev." All-or-nothing tools don't differentiate. BoltPipeline separates the two: instant deploy in dev, governed approval in prod. Same DDL, different controls.

Why This Is Hard to Build

The idea isn't novel. Every data engineer has wished for this. The reason it doesn't exist elsewhere is that it requires three capabilities that are independently hard and rarely co-located:

1. A SQL parser that handles arbitrary SQL — not just template-driven models, but any SELECT with CTEs, subqueries, window functions, and multi-table joins. This is why BoltPipeline built on SQLGLOT rather than regex or template engines.

2. Live warehouse connectivity with column-level metadata — you need an agent that can describe any table on demand, including data types, defaults, and identity columns. Not cached metadata from a catalog — the live schema.

3. A generation engine that understands SCD semantics — the DML generator needs to know that SCD Type 2 requires hash comparison, historical closures, and effective date management. It needs to produce clean SQL (not wrapped in BEGIN/END blocks) that works with the target's actual schema.

Building any one of these is a significant engineering effort. Building all three and connecting them into a single pipeline lifecycle — parse, introspect, diff, generate DML, suggest DDL — is why this gap has persisted.
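The change-detection piece of the third capability is the easiest to illustrate: a deterministic hash over the tracked columns, so a single comparison decides whether a row changed. A simplified Python sketch (BoltPipeline's actual hash logic may differ):

```python
import hashlib

def row_hash(row, tracked_columns):
    """Deterministic change-detection hash over the tracked columns only."""
    payload = "|".join(str(row.get(c, "")) for c in tracked_columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

tracked = ["email", "plan_type", "mrr", "region"]
v1 = {"customer_id": 42, "email": "a@example.com",
      "plan_type": "pro", "mrr": 99, "region": "EU"}
v2 = dict(v1, mrr=149)  # the customer upgraded: a tracked column changed

print(row_hash(v1, tracked) != row_hash(v2, tracked))  # True: change detected
print(row_hash(v1, tracked) == row_hash(dict(v1), tracked))  # True: stable
```

Hashing only the tracked columns matters: a change to an untracked column (or to the surrogate key) must not open a new historical version.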

Strengths and Considerations — Honest Assessment

Strengths:

- Starts from arbitrary SQL — not templates, not models, not visual configs
- Works with tables you don't own — suggests DDL, doesn't take control
- SCD DML generation eliminates per-table boilerplate across your entire warehouse
- DDL suggestions are reviewable and governed — nothing runs without approval
- Column-level precision — gap analysis operates at individual column granularity, not table-level
- All generated SQL is standard ANSI — inspectable, version-controllable, portable

Considerations:

- Schema introspection requires a connected agent — offline environments need alternative workflows
- Gap analysis accuracy depends on SQL parse coverage — highly exotic SQL dialects may need manual review
- SCD generation currently supports Type 0, 1, and 2 — Type 3, 4, and 6 are not yet automated
- DDL suggestions are non-destructive (ADD COLUMN, CREATE TABLE) — destructive DDL (DROP, ALTER TYPE) requires manual authoring for safety
- Initial release targets Snowflake — multi-warehouse DDL generation is on the roadmap

The Bottom Line

Schema evolution shouldn't be your problem to solve manually. You wrote the SQL. You told the platform what type of change tracking you need. The platform should figure out the rest — what's missing, what to generate, and what to suggest.

That's what BoltPipeline does. Five capabilities, one pipeline lifecycle, zero macros.

See how BoltPipeline handles schema evolution and SCD automation →

Ready to see BoltPipeline in action?

SQL in. Governed pipelines out. Your data never leaves.
