Data Modeling

The script is not your enemy

It doesn’t matter if it’s a stored procedure, a pandas pipeline, or a DAG full of Python operators. The real problem is the same across all of them: the whole thing was built to run, not to be looked inside.

Scripts work. That’s the thing. Most teams have been running them for years without serious issues. The pipeline runs every night, the data lands, dashboards update. The standard argument for moving away is usually about the code. It’s messy, not in version control, hard to test.

All of that’s true. But it frames the problem as a code cleanliness thing. Like if you just reorganized the logic and added some Git commits, you’d be fine.

There’s another dimension to it. If you were leaving a list of chores for someone, you’d write “do the dishes” and “do the laundry,” not a 50-step breakdown of every micro-action involved. The high-level tasks are readable; someone new can follow them. Now hand them a dense multi-page list that breaks every action down to its individual components, and multiply that across a large house, or a whole maid service running dozens of houses. The overhead of understanding what’s happening swamps the actual work. That’s what a script ends up feeling like at scale: not just messy code, but a wall of sequential instructions where you have to read all of it before any of it makes sense.

The actual problem is different. It’s not in the code at all.

The real problem shows up when something breaks. More specifically: what happens right after it breaks.

Whatever form your script takes — SQL procedure, Python file, DAG — it runs a series of steps, produces a final output, and ends. The intermediate state lives only as long as the process does. When the session ends or the process crashes, it’s gone. There’s nothing left to look at. To debug, you start the whole thing over from the top.

Think about household chores. You’re doing laundry, dishes, vacuuming, and cooking. You open the dryer at the end and your clothes are still damp. You wouldn’t respond by going back to recook dinner, rewash the dishes, and revacuum. A problem in one place doesn’t erase everything else. But that’s exactly what a script forces you to do. The process ends, the intermediate state vanishes, and a failure at step 4 wipes out steps 1, 2, and 3 right along with it. Every debug attempt is a full restart.

🧺 laundry ✓ done · 🍽️ dishes ✓ done · 🧹 vacuum ✓ done · 🍳 cook ✗ failed · ↺ restart everything from scratch

One failure wipes out every completed step, including the ones that worked fine.

The same pipeline takes a different shape depending on your stack: a stored procedure, a pandas script, or an Airflow DAG. Here it is in all three.

build_customer_order_summary — stored procedure · 1 file · 5 steps
CREATE OR REPLACE PROCEDURE build_customer_order_summary()
AS $$ BEGIN

  CREATE TEMP TABLE tmp_customers AS  -- ⚠ temp — gone when session ends
  SELECT customer_id, TRIM(LOWER(email)) AS email,
         COALESCE(first_name,'Unknown') AS first_name,
         COALESCE(last_name, 'Unknown') AS last_name
  FROM raw.customers;  -- ⚠ hardcoded schema — rename breaks silently

  CREATE TEMP TABLE tmp_orders AS  -- ⚠ temp — same problem
  SELECT order_id, customer_id, order_date, total_amount
  FROM raw.orders WHERE status != 'cancelled';  -- ⚠ hardcoded schema

  CREATE TEMP TABLE tmp_payments AS  -- ⚠ temp — same problem
  SELECT payment_id, order_id, amount AS payment_amount
  FROM raw.payments WHERE amount > 0;  -- ⚠ hardcoded schema

  CREATE TEMP TABLE tmp_orders_with_payments AS  -- ⚡ must wait for steps 2 & 3 — serial only
  SELECT o.*, SUM(p.payment_amount) AS total_paid
  FROM tmp_orders o LEFT JOIN tmp_payments p ON o.order_id = p.order_id
  GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount;

  CREATE OR REPLACE TABLE analytics.fct_customer_order_summary AS  -- ⚠ full overwrite every run
  SELECT c.customer_id, c.email,
         COUNT(op.order_id) AS lifetime_orders,
         COALESCE(SUM(op.total_amount),0) AS lifetime_revenue
  FROM tmp_customers c
  LEFT JOIN tmp_orders_with_payments op ON c.customer_id = op.customer_id
  GROUP BY c.customer_id, c.email;

  -- Session ends. All 4 temp tables dropped. Nothing to inspect.  ⚠ debug from scratch
END; $$ LANGUAGE plpgsql;

One file. Five steps. Whether it succeeds or fails, the session ends and there’s nothing left to look at.

build_customer_order_summary.py — pandas pipeline · 1 file · 6 steps
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://...")

# Step 1 — pull data to this machine  ⚠ data transfer — grows with table size
raw_customers = pd.read_sql("SELECT * FROM raw.customers", engine)  # ⚠ hardcoded schema
raw_orders    = pd.read_sql("SELECT * FROM raw.orders",    engine)  # ⚠ hardcoded schema
raw_payments  = pd.read_sql("SELECT * FROM raw.payments",  engine)  # ⚠ hardcoded schema

# Step 2 — clean in memory  ⚡ DataFrame — gone if process crashes
clean_customers = raw_customers.copy()
clean_customers["email"]      = clean_customers["email"].str.strip().str.lower()
clean_customers["first_name"] = clean_customers["first_name"].fillna("Unknown")
clean_customers["last_name"]  = clean_customers["last_name"].fillna("Unknown")

# Step 3 — filter orders in memory  ⚡ not queryable after crash
clean_orders = (raw_orders
    .loc[raw_orders["status"] != "cancelled",
         ["order_id", "customer_id", "order_date", "total_amount"]])

# Step 4 — filter payments in memory  ⚡ serial — must wait for step 3
clean_payments = (raw_payments
    .loc[raw_payments["amount"] > 0]
    .rename(columns={"amount": "payment_amount"})
    [["payment_id", "order_id", "payment_amount"]])

# Step 5 — join
pmt_by_order  = clean_payments.groupby("order_id", as_index=False)["payment_amount"].sum()
orders_w_pmts = clean_orders.merge(pmt_by_order, on="order_id", how="left")

# Step 6 — aggregate and write back
summary = (clean_customers  # ⚠ full replace every run
    .merge(orders_w_pmts, on="customer_id", how="left")
    .groupby(["customer_id", "email"], as_index=False)
    .agg(lifetime_orders=("order_id", "count"),
         lifetime_revenue=("total_amount", "sum")))
summary.to_sql("fct_customer_order_summary", engine,  # ⚠ data travels back to warehouse
               schema="analytics", if_exists="replace", index=False)

# Process ends. All DataFrames dropped. Nothing survives a crash.

One file. Data travels to your machine, gets transformed in memory, travels back. Nothing survives if the process dies mid-run.

customer_order_summary_dag.py — Airflow · 1 DAG · 5 tasks
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook
from datetime import datetime

def run_customers(**ctx):
    PostgresHook("warehouse").run("""
        CREATE OR REPLACE TABLE staging.customers AS
        SELECT customer_id, TRIM(LOWER(email)) AS email,
               COALESCE(first_name,'Unknown') AS first_name  -- ⚠ hardcoded schema in each function
        FROM raw.customers;
    """)

def run_orders(**ctx):
    PostgresHook("warehouse").run("""
        CREATE OR REPLACE TABLE staging.orders AS
        SELECT order_id, customer_id, order_date, total_amount
        FROM raw.orders WHERE status != 'cancelled';
    """)

def run_payments(**ctx):
    PostgresHook("warehouse").run("""
        CREATE OR REPLACE TABLE staging.payments AS
        SELECT payment_id, order_id, amount AS payment_amount
        FROM raw.payments WHERE amount > 0;
    """)

def run_orders_with_payments(**ctx):
    PostgresHook("warehouse").run("""
        CREATE OR REPLACE TABLE intermediate.orders_with_payments AS
        SELECT o.*, SUM(p.payment_amount) AS total_paid  -- ⚡ deps declared in DAG, not in this SQL
        FROM staging.orders o
        LEFT JOIN staging.payments p ON o.order_id = p.order_id
        GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount;
    """)

def run_summary(**ctx):
    PostgresHook("warehouse").run("""
        CREATE OR REPLACE TABLE analytics.fct_customer_order_summary AS
        SELECT c.customer_id, c.email,
               COUNT(op.order_id) AS lifetime_orders,
               COALESCE(SUM(op.total_amount),0) AS lifetime_revenue
        FROM staging.customers c
        LEFT JOIN intermediate.orders_with_payments op ON c.customer_id = op.customer_id
        GROUP BY c.customer_id, c.email;
    """)

with DAG("customer_order_summary", start_date=datetime(2024,1,1), schedule="@daily") as dag:
    t1 = PythonOperator(task_id="stg_customers",        python_callable=run_customers)
    t2 = PythonOperator(task_id="stg_orders",           python_callable=run_orders)
    t3 = PythonOperator(task_id="stg_payments",         python_callable=run_payments)
    t4 = PythonOperator(task_id="int_orders_w_payments", python_callable=run_orders_with_payments)
    t5 = PythonOperator(task_id="fct_order_summary",     python_callable=run_summary)

    [t1, t2, t3] >> t4 >> t5  # ⚠ lineage lives here, not inside the SQL

Looks structured. But the dependencies live in the DAG definition, not the SQL. Add a new model and you update two places. And the SQL still uses hardcoded table names that a schema rename can silently break.

This isn’t bad code. It does exactly what it was built to do: run a transformation and produce an output. The problem is that’s all it was ever designed to do. Nobody built it to be inspected. So when it breaks, you’re starting from scratch.

The natural response depends on where you’re starting. If you’re on stored procedures, you break them into smaller procedures and chain them together — each one writes a real table. If you’re on a Python script, you add Airflow and turn each step into a task. If you’re already on Airflow, you add more structure: better retry logic, SLAs, sensors.

All of those help. But they leave the harder problems untouched.

Think about dividing chores among housemates. Person 1 does laundry, Person 2 folds, Person 3 cooks, Person 4 does dishes. Each person finishes their job and leaves the result. That’s an improvement over one person doing everything in one uninterruptible session. But someone still has to text the group chat: “Person 2, don’t start folding until Person 1 is done with the washer.” That coordination — the order, the dependencies — lives in the group chat. Not in the work.

That group chat is your orchestration layer — whether it’s a shell script chaining stored procs, an Airflow DAG, or a [t1, t2, t3] >> t4 >> t5 definition at the bottom of your DAG file. The dependencies between steps live outside the SQL. If you add a new model, you update two places. Nobody looking at any individual query can tell what it depends on or what depends on it.

There’s also still no shared definition. If two teams both need clean orders data, they each write their own version. Over time the logic drifts. And there’s still no testing — each step assumes the previous one’s output was correct.

With dbt, the folding task itself says “I need the washer to finish before I start” — that’s what ref() does. The dependency is declared in the code. There’s no group chat to maintain. The order emerges from the work itself.

The SQL barely changes. That’s not really what dbt does. What changes is what’s still there after the pipeline finishes running.

Instead of temp tables or in-memory DataFrames that disappear, each step becomes a real table in your warehouse. Hardcoded schema names become source(). Implicit ordering becomes explicit dependencies with ref().
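
For source() to resolve, dbt needs to know where those raw tables actually live. That mapping sits in a small YAML file in the project; a minimal sketch (the file name is just a convention, and any .yml under models/ works):

models/staging/_sources.yml
version: 2

sources:
  - name: raw                # the first argument you pass to source('raw', ...)
    schema: raw              # the schema the tables actually live in
    tables:
      - name: customers
      - name: orders
      - name: payments

If the raw schema ever moves or gets renamed, this file changes once and every source('raw', ...) call picks it up.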

Here’s what the actual change looks like, block by block.

CREATE TEMP TABLE tmp_customers (temp table)
  CREATE TEMP TABLE tmp_customers AS  -- ⚠ temp — gone on session end
  SELECT customer_id, TRIM(LOWER(email)) AS email,
         COALESCE(first_name,'Unknown') AS first_name,
         COALESCE(last_name, 'Unknown') AS last_name
  FROM raw.customers;  -- ⚠ hardcoded schema name
becomes → models/staging/stg_customers.sql
-- No CREATE. No TEMP. Just SELECT.
SELECT
    customer_id,
    TRIM(LOWER(email))              AS email,
    COALESCE(first_name, 'Unknown') AS first_name,
    COALESCE(last_name,  'Unknown') AS last_name
FROM {{ source('raw', 'customers') }}
What changed: The SELECT itself is identical. No more CREATE TEMP TABLE. dbt handles materialization separately. And raw.customers becomes source('raw','customers'), which tells dbt this is an external entry point it can track and test. The result is a real table in your warehouse, not session-scoped. It’s there after the run finishes, whether it succeeded or not.
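
The “dbt handles materialization separately” part is one config setting, either {{ config(materialized='table') }} at the top of a model or a folder-level default in dbt_project.yml. A sketch of the folder-level version, assuming a project named my_project:

dbt_project.yml (excerpt)
models:
  my_project:                # assumption: whatever name your dbt_project.yml declares
    staging:
      +materialized: table   # every model under models/staging/ becomes a real table
    marts:
      +materialized: table   # switch to 'incremental' later without touching the SQL
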
CREATE TEMP TABLE tmp_orders (temp table)
  SELECT order_id, customer_id, order_date, total_amount
  FROM raw.orders WHERE status != 'cancelled';  -- ⚠ hardcoded schema
becomes → models/staging/stg_orders.sql
SELECT order_id, customer_id, order_date, total_amount
FROM {{ source('raw', 'orders') }}
WHERE status != 'cancelled'
What changed: Now it’s its own file. When another team needs clean orders, they use ref('stg_orders') instead of copying this filter logic. “What counts as a valid order” lives in one place. It doesn’t drift.
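
Concretely, the other team’s model is just another file with a ref() in it. A hypothetical example (the model name and the metric are invented to show the shape):

models/marts/fct_weekly_orders.sql (hypothetical)
-- Reuses the shared definition of a valid order instead of re-filtering raw.orders.
SELECT
    DATE_TRUNC('week', order_date) AS order_week,
    COUNT(order_id)                AS orders,
    SUM(total_amount)              AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY 1
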
CREATE TEMP TABLE tmp_payments (temp table)
  SELECT payment_id, order_id, amount AS payment_amount
  FROM raw.payments WHERE amount > 0;  -- ⚠ hardcoded schema
becomes → models/staging/stg_payments.sql
SELECT payment_id, order_id, amount AS payment_amount
FROM {{ source('raw', 'payments') }}
WHERE amount > 0
What changed: Steps 1, 2, and 3 are now independent. None of them reference each other. dbt sees that and runs all three at the same time.
CREATE TEMP TABLE tmp_orders_with_payments (join step)
  SELECT o.*, SUM(p.payment_amount) AS total_paid  -- ⚡ waits for steps 2 & 3 — but that’s implicit
  FROM tmp_orders o
  LEFT JOIN tmp_payments p ON o.order_id = p.order_id
  GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount
becomes → models/intermediate/int_orders_with_payments.sql
SELECT
    o.order_id, o.customer_id, o.order_date, o.total_amount,
    SUM(p.payment_amount) AS total_paid
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_payments') }} p ON o.order_id = p.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount
What changed: Those ref() calls tell dbt what this model needs before it can run. You don’t write any orchestration logic. The ref() calls are the orchestration.
CREATE OR REPLACE TABLE fct_customer_order_summary (full overwrite)
  CREATE OR REPLACE TABLE analytics.fct_customer_order_summary AS  -- ⚠ full overwrite every run
  SELECT c.customer_id, c.email,
         COUNT(op.order_id) AS lifetime_orders,
         COALESCE(SUM(op.total_amount),0) AS lifetime_revenue
  FROM tmp_customers c
  LEFT JOIN tmp_orders_with_payments op ON c.customer_id = op.customer_id
  GROUP BY c.customer_id, c.email;
becomes → models/marts/fct_customer_order_summary.sql
-- Materialization (table, incremental, view) is config, not SQL.
SELECT
    c.customer_id, c.email,
    COUNT(op.order_id)                 AS lifetime_orders,
    COALESCE(SUM(op.total_amount), 0) AS lifetime_revenue
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_orders_with_payments') }} op
    ON c.customer_id = op.customer_id
GROUP BY c.customer_id, c.email
The pattern: CREATE TEMP TABLE → SQL file. raw.schema → source(). tmp_x → ref(). That’s the whole migration. The logic is the same. What changes is what the pipeline leaves behind when it runs.

The script had this exact structure inside it the whole time. The same dependencies existed — they just weren’t visible anywhere. Every ref() call you write is a dependency declaration. dbt reads them all and figures out the execution order. You don’t define the graph. It comes out of the code.

SOURCES → STAGING → INTERMEDIATE → MARTS: raw.customers, raw.orders, and raw.payments feed stg_customers, stg_orders, and stg_payments (all run in parallel); stg_orders and stg_payments feed int_orders_with_payments (which waits for both); stg_customers and int_orders_with_payments feed fct_customer_order_summary.

All of this existed inside the script. It just wasn’t visible.

Once the dependencies are explicit, dbt knows which models don’t depend on each other. So it runs them at the same time. The script couldn’t do that. It ran everything in order, one step after another, because there was no other option.

Think of it like chores. You wouldn’t start the washing machine, wait for it to finish, move the clothes to the dryer, wait for that to finish, and only then start cooking. Some of those things are related, but most aren’t. You’d run the independent ones in parallel. That’s exactly what the lineage makes visible.

LAUNDRY: 🧺 wash → 🌀 dry → 👕 fold · COOKING: 🍳 cook → 🍽️ dishes

The dryer waits on the washer, but nothing about cooking waits on laundry. Both chains run at the same time.

❌ script: serial only
step 1: stg_cust → step 2: stg_orders → step 3: stg_pay → step 4: int_orders → step 5: fct_summary
Total time: 5 units

✅ dbt: parallel across threads
threads 1–3: stg_cust, stg_orders, stg_pay (at the same time) → thread 1: int_orders → thread 1: fct_summary
Total time: 3 units

At real scale (hundreds of models), this compounds. Given enough threads, a batch of 40 independent staging models finishes in roughly the time of the slowest one, not in 40 back-to-back runs.
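
How much runs at once is capped by the threads setting on your connection profile, not by anything in the models, and you can override it per run:

dbt run                # parallelism defaults to the threads value in profiles.yml
dbt run --threads 8    # or raise it for this invocation only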

This is where it actually matters. Not when things go well. When they break.

Pipeline fails at the join step
script

Step 4 failed. The temp tables or DataFrames from steps 1–3 died with the process. To figure out what happened you have to re-run the whole thing from the top. Every single debug attempt is a full re-run.

dbt

stg_orders and stg_payments are still in the warehouse. Run dbt run -s int_orders_with_payments to re-execute just the thing that failed.
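
The selection syntax covers the follow-up too: once the fix is in, a trailing + picks up everything downstream of the repaired model.

dbt run -s int_orders_with_payments     # just the model that failed
dbt run -s int_orders_with_payments+    # that model plus everything downstream of it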

The difference: dbt models are real warehouse tables. Script intermediates aren’t. That’s really the whole thing.
Wrong row count in the final output
script

Add debug statements, re-run everything, check the counts. Multiple cycles before you find where the row counts went wrong. The intermediate state is gone — you have to recreate it every time.

dbt

Query stg_orders. How many rows? Query int_orders_with_payments. How many? The difference tells you exactly where the problem is. No re-run needed.
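
The queries are nothing special; the point is that the tables are sitting there to query (this sketch assumes the models landed in your default target schema):

-- Walk the layers until the counts stop making sense.
SELECT COUNT(*) FROM stg_orders;                   -- orders that survived the 'cancelled' filter
SELECT COUNT(*) FROM int_orders_with_payments;     -- one row per order, should match the count above
SELECT COUNT(*) FROM fct_customer_order_summary;   -- one row per customer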

Same approach you’d use to bisect a code bug. Except you don’t have to re-run anything.
Upstream schema change: a column renamed
script

The script fails or silently returns NULLs. You trace back through the file to figure out which step got hit. If you have Airflow, you check three files instead of one. Then you hope nothing downstream was depending on it.

dbt

stg_payments is the only model that touches raw.payments. Fix it there. Everything downstream picks up the change automatically.
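
The fix is a couple of aliased references in that one file. A hypothetical example, assuming raw.payments renamed amount to amount_usd:

models/staging/stg_payments.sql
-- The renamed column gets aliased back to the name downstream models expect.
SELECT payment_id, order_id, amount_usd AS payment_amount
FROM {{ source('raw', 'payments') }}
WHERE amount_usd > 0

Downstream models keep selecting payment_amount and never see the rename.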

The stg_ layer is an isolation boundary. Raw schema changes only affect one file.
3am on-call alert fires
script

Nothing survived from the last run. You have to re-run the whole thing just to reproduce the failure. Then add logging. Re-run again. You’re 45 minutes in before you’ve even started fixing anything.

dbt

The run logs tell you exactly which model failed and why. The intermediate models are still in the warehouse. Fix the issue, re-run just that model and what depends on it.
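
On newer dbt-core versions there’s an even shorter path at 3am: rerun from the point of failure without working out the selection by hand.

dbt retry    # re-runs the previous invocation from its point of failure, skipping what already succeeded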

The engineer on-call at 3am cares a lot about whether the intermediate state survived.

All the scenarios above start the same way: someone discovers a problem after the fact. A dashboard is broken. The row count is off. An alert fires at 3am. That’s the equivalent of opening the dryer and finding your clothes still wet. You only know because you went to use the output.

Most data workflows don’t have testing built in. So problems surface downstream, at the wrong time, in ways that affect real decisions. You come home hungry and don’t know the refrigerator broke, so you spend money on takeout, and then discover you also have to fix the fridge. Two problems instead of one, because the first one wasn’t visible when it would have been cheapest to catch.

dbt tests are that earlier detection point. A freshness check, a not-null assertion, a referential integrity test. These run before the data reaches the dashboard. If something upstream is wrong, it surfaces at the source, not in the numbers your stakeholders are using to make decisions.
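
Those assertions live in YAML next to the models. A minimal sketch using the example above (which columns deserve which tests is a judgment call, so treat this as illustrative):

models/marts/_marts.yml
version: 2

models:
  - name: fct_customer_order_summary
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
          - relationships:            # referential integrity back to staging
              to: ref('stg_customers')
              field: customer_id
      - name: lifetime_revenue
        tests:
          - not_null

dbt build runs a model’s tests right after the model itself, so a failure surfaces at the model that produced it, not in the dashboard built on top of it. Freshness checks are declared on the source (a loaded_at_field plus warn and error thresholds) and run with dbt source freshness.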

✕ without tests: 🧊 fridge breaks → 🍎 food spoils → 😤 home hungry → 💸 takeout $$$ → 🔧 fix fridge too · ✓ with tests: 🧊 fridge breaks → 🔔 alert fires → handled early

One broken fridge cascades into five problems without visibility. Tests catch it at the source, before anything downstream is affected.

A broken dashboard and wet laundry are the same thing: discovery after the fact. Tests move the detection point upstream — before the run completes, before the stakeholder sees the output, before the problem compounds.

The script is not your enemy.

The invisible pipeline is. The one that runs and disappears. Whether it’s a stored procedure that drops its temp tables, a Python script that loses its DataFrames, or an Airflow DAG whose tasks know their order but not their lineage — the problem is the same: debugging means starting over.

dbt fixes that not by being a better query runner. The SQL is basically the same. It fixes it by making every step a real table that sticks around. The pipeline runs. The tables survive. When something breaks, you start from evidence, not from memory.

When you’re evaluating any data pipeline, there’s one question worth asking: what does it leave behind when it runs?

If the answer is just the final output: you’re one failure away from starting from scratch.
If the answer is every intermediate step as a real table you can query: you’re debugging from evidence.
That’s what the migration is actually about. The SQL is almost beside the point.