dbt 101

From stored procedures to dbt

An interactive walkthrough. Follow the steps; each one builds on the last, so you can see exactly what changes and why.

💡
This is a complete stored procedure, the kind you'd find in production today. It builds a customer order summary. Highlighted lines are the things that create problems at scale. Read the inline labels, then move to the next tab to see how dbt solves each one.
build_customer_order_summary - full stored procedure (1 file · ~60 lines)
-- ================================================================
-- Stored Procedure: build_customer_order_summary
-- Runs nightly. Touch with care.
-- ================================================================

CREATE OR REPLACE PROCEDURE build_customer_order_summary()
AS $$
BEGIN

  -- Step 1: clean customers
  CREATE TEMP TABLE tmp_customers AS  -- ⚠ temp: gone when session ends, nothing to inspect on failure
  SELECT
      customer_id,
      TRIM(LOWER(email))               AS email,
      COALESCE(first_name, 'Unknown')  AS first_name,
      COALESCE(last_name,  'Unknown')  AS last_name,
      CASE WHEN region IS NULL THEN 'Unassigned' ELSE UPPER(region) END AS region,
      created_at
  FROM raw.customers;  -- ⚠ hardcoded schema: a rename silently breaks this

  -- Step 2: clean orders
  CREATE TEMP TABLE tmp_orders AS  -- ⚠ temp: same problem
  SELECT
      order_id, customer_id, order_date, status, total_amount
  FROM raw.orders
  WHERE order_id IS NOT NULL AND status != 'cancelled';

  -- Step 3: clean payments
  CREATE TEMP TABLE tmp_payments AS  -- ⚠ temp: same problem
  SELECT
      payment_id, order_id, payment_method,
      amount AS payment_amount, created_at AS payment_date
  FROM raw.payments
  WHERE payment_id IS NOT NULL AND amount > 0;

  -- Step 4: join orders + payments
  CREATE TEMP TABLE tmp_orders_with_payments AS  -- ⚡ must wait for steps 2 & 3: no parallelism possible
  SELECT
      o.order_id, o.customer_id, o.order_date, o.status, o.total_amount,
      SUM(p.payment_amount)                            AS total_paid,
      COUNT(p.payment_id)                               AS payment_count,
      o.total_amount - COALESCE(SUM(p.payment_amount),0) AS outstanding_balance
  FROM tmp_orders o
  LEFT JOIN tmp_payments p ON o.order_id = p.order_id
  GROUP BY o.order_id, o.customer_id, o.order_date, o.status, o.total_amount;

  -- Step 5: final output
  DROP TABLE IF EXISTS analytics.fct_customer_order_summary;  -- ⚠ full overwrite every run: no incremental option
  CREATE TABLE analytics.fct_customer_order_summary AS
  SELECT
      c.customer_id, c.email,
      c.first_name || ' ' || c.last_name AS full_name, c.region,
      COUNT(op.order_id)                   AS lifetime_orders,
      COALESCE(SUM(op.total_amount),0)   AS lifetime_revenue,
      COALESCE(SUM(op.total_paid),0)     AS lifetime_payments,
      MAX(op.order_date)                   AS most_recent_order_date
  FROM tmp_customers c
  LEFT JOIN tmp_orders_with_payments op ON c.customer_id = op.customer_id
  GROUP BY c.customer_id, c.email, full_name, c.region;

  -- Proc ends. All 4 temp tables dropped automatically.
  -- If step 5 failed: no intermediate state survives to inspect.  ⚠ debug from scratch

END;
$$ LANGUAGE plpgsql;
What you're looking at: one file, 5 logical steps, no visibility into intermediate state, serial execution only, hardcoded schema names, and a full overwrite on every run. It works until it doesn't. The next tab shows what dbt does differently.
🧩
The stored proc is one file. dbt breaks it into 5 files, one per logical step. Click each block to see how that step translates into dbt, and exactly what changes.
CREATE TEMP TABLE tmp_customers (temp table)
  CREATE TEMP TABLE tmp_customers AS  -- ⚠ temp: gone on session end
  SELECT
      customer_id, TRIM(LOWER(email)) AS email,
      COALESCE(first_name,'Unknown') AS first_name,
      COALESCE(last_name, 'Unknown') AS last_name,
      CASE WHEN region IS NULL THEN 'Unassigned' ELSE UPPER(region) END AS region,
      created_at
  FROM raw.customers;  -- ⚠ hardcoded schema name
↓ becomes → models/staging/stg_customers.sql
-- No CREATE. No TEMP. Just SELECT.
SELECT
    customer_id, TRIM(LOWER(email)) AS email,
    COALESCE(first_name,'Unknown') AS first_name,
    COALESCE(last_name, 'Unknown') AS last_name,
    CASE WHEN region IS NULL THEN 'Unassigned' ELSE UPPER(region) END AS region,
    created_at
FROM {{ source('raw', 'customers') }}
What changed: The SQL logic is identical. CREATE TEMP TABLE is gone; dbt decides whether to build a table or a view based on config. raw.customers becomes source('raw','customers'), so dbt knows this is where raw data enters the pipeline and can test it, monitor freshness, and alert you if it stops arriving. The result is a real, persistent, queryable table in your warehouse, not a session-scoped temp that vanishes.
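For source() to resolve, the raw tables have to be declared once in YAML. A minimal sketch, assuming the physical schema is literally named raw (the file path is a convention, not a requirement):

```yaml
# models/staging/sources.yml
version: 2

sources:
  - name: raw            # the name used in source('raw', ...)
    schema: raw          # the actual schema in the warehouse
    tables:
      - name: customers
      - name: orders
      - name: payments
```

If the physical schema is ever renamed, only the `schema:` value here changes; every model that calls source('raw', ...) picks up the new name on the next run.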
CREATE TEMP TABLE tmp_orders (temp table)
  CREATE TEMP TABLE tmp_orders AS  -- ⚠ temp: gone on session end
  SELECT order_id, customer_id, order_date, status, total_amount
  FROM raw.orders  -- ⚠ hardcoded schema name
  WHERE order_id IS NOT NULL AND status != 'cancelled';
↓ becomes → models/staging/stg_orders.sql
SELECT order_id, customer_id, order_date, status, total_amount
FROM {{ source('raw', 'orders') }}
WHERE order_id IS NOT NULL AND status != 'cancelled'
What changed: Same logic, now a standalone file. If another team builds a revenue pipeline and needs clean orders, they reference ref('stg_orders') directly. In the stored proc world, they copy-paste tmp_orders into their own proc, and now "what is a valid order" is defined in two places.
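To make the reuse concrete, here is what a second consumer might look like. This model is invented for illustration (fct_daily_revenue does not exist in the walkthrough's project):

```sql
-- models/marts/fct_daily_revenue.sql (hypothetical second consumer)
-- Inherits the same definition of "valid order" as everyone else.
SELECT
    order_date,
    SUM(total_amount) AS daily_revenue
FROM {{ ref('stg_orders') }}
GROUP BY order_date
```

If the definition of a valid order changes, it changes in stg_orders once, and both pipelines pick up the fix on their next run.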
CREATE TEMP TABLE tmp_payments (temp table)
  CREATE TEMP TABLE tmp_payments AS  -- ⚠ temp: gone on session end
  SELECT payment_id, order_id, payment_method,
      amount AS payment_amount, created_at AS payment_date
  FROM raw.payments  -- ⚠ hardcoded schema name
  WHERE payment_id IS NOT NULL AND amount > 0;
↓ becomes → models/staging/stg_payments.sql
SELECT payment_id, order_id, payment_method,
    amount AS payment_amount, created_at AS payment_date
FROM {{ source('raw', 'payments') }}
WHERE payment_id IS NOT NULL AND amount > 0
What changed: Same transformation. The three staging models now have no dependencies on each other, so dbt can run all three at the same time. The proc had to run them one after another, simply because that's how a procedure works.
CREATE TEMP TABLE tmp_orders_with_payments (join step)
  CREATE TEMP TABLE tmp_orders_with_payments AS  -- ⚡ must wait for steps 2 & 3: enforced by you, not the system
  SELECT
      o.order_id, o.customer_id, o.order_date, o.total_amount,
      SUM(p.payment_amount) AS total_paid,
      o.total_amount - COALESCE(SUM(p.payment_amount),0) AS outstanding_balance
  FROM tmp_orders o
  LEFT JOIN tmp_payments p ON o.order_id = p.order_id
  GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount;
↓ becomes → models/intermediate/int_orders_with_payments.sql
SELECT
    o.order_id, o.customer_id, o.order_date, o.total_amount,
    SUM(p.payment_amount) AS total_paid,
    o.total_amount - COALESCE(SUM(p.payment_amount),0) AS outstanding_balance
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('stg_payments') }} p ON o.order_id = p.order_id
GROUP BY o.order_id, o.customer_id, o.order_date, o.total_amount
What changed: tmp_orders and tmp_payments become ref('stg_orders') and ref('stg_payments'). Each ref() declares a dependency, so dbt knows to run staging first, and it resolves the correct table name in any environment (dev, staging, or prod) automatically.
CREATE TABLE fct_customer_order_summary (full overwrite)
  DROP TABLE IF EXISTS analytics.fct_customer_order_summary;
  CREATE TABLE analytics.fct_customer_order_summary AS  -- ⚠ full overwrite every run
  SELECT
      c.customer_id, c.email,
      c.first_name || ' ' || c.last_name AS full_name, c.region,
      COUNT(op.order_id) AS lifetime_orders,
      COALESCE(SUM(op.total_amount),0) AS lifetime_revenue,
      MAX(op.order_date) AS most_recent_order_date
  FROM tmp_customers c
  LEFT JOIN tmp_orders_with_payments op ON c.customer_id = op.customer_id
  GROUP BY c.customer_id, c.email, full_name, c.region;

  -- Session ends. All 4 temp tables dropped. Nothing to inspect.  ⚠ debug from scratch
↓ becomes → models/marts/fct_customer_order_summary.sql
-- Materialization strategy (table, incremental, view) is config, not SQL.
SELECT
    c.customer_id, c.email,
    c.first_name || ' ' || c.last_name AS full_name, c.region,
    COUNT(op.order_id) AS lifetime_orders,
    COALESCE(SUM(op.total_amount),0) AS lifetime_revenue,
    MAX(op.order_date) AS most_recent_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('int_orders_with_payments') }} op
    ON c.customer_id = op.customer_id
GROUP BY c.customer_id, c.email, full_name, c.region
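The "config, not SQL" comment above refers to dbt's per-model config() macro or project-level settings. A sketch of the project-level form (my_project is a placeholder for the real project name):

```yaml
# dbt_project.yml - materialization set per folder, not per statement
models:
  my_project:            # placeholder project name
    staging:
      +materialized: view
    marts:
      +materialized: table   # could become incremental once full rebuilds get slow
```

A single model can override this by putting `{{ config(materialized='incremental') }}` at the top of its .sql file; the SELECT below it stays untouched.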
You've now seen all 5 steps. The SQL didn't change meaningfully. What changed is the structure: each step is isolated, independently runnable, and inspectable. The next tab explains what the naming conventions mean and how dbt tracks where data comes from.
The pattern: Same SQL logic throughout. What dbt changes is the wrapper: CREATE TEMP TABLE becomes a file, raw.x becomes source(), tmp_x becomes ref(). Those three substitutions unlock everything else.
📂
Before the lineage graph makes sense, it helps to understand what the prefixes mean and how dbt tracks where data enters and flows through the pipeline.
stg_
Staging: one file per source table. Cleans, renames, casts. Raw in, usable columns out. Nothing is joined here.
int_
Intermediate: joins and transforms that aren't the final output. Shared building blocks between mart models.
fct_ / dim_
Marts: what BI tools and business users actually query. One table per business question.

source(): where raw data enters

raw.customers → {{ source('raw', 'customers') }}
raw.orders → {{ source('raw', 'orders') }}
raw.payments → {{ source('raw', 'payments') }}
The stored proc hardcoded raw.customers. If that schema is renamed, the proc breaks silently. source() registers these tables as official entry points: dbt can test them, check freshness, and alert you before a bad data day becomes a bad dashboard day.
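The freshness check mentioned above is declared on the source itself. A sketch, assuming the raw tables carry a load timestamp (the _loaded_at column name and the thresholds are invented for illustration):

```yaml
# In the source declaration (e.g. models/staging/sources.yml)
sources:
  - name: raw
    loaded_at_field: _loaded_at    # hypothetical load-timestamp column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders
```

Running `dbt source freshness` then compares the newest _loaded_at against these thresholds and fails loudly, instead of letting stale data flow quietly into a dashboard.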

ref(): how models depend on each other

FROM tmp_orders → FROM {{ ref('stg_orders') }}
FROM tmp_payments → FROM {{ ref('stg_payments') }}
FROM tmp_customers, tmp_orders_with_payments → ref('stg_customers'), ref('int_orders_with_payments')
Every ref() is a declared dependency. dbt reads them all and builds the execution order automatically. You don't write orchestration code; the ref() calls are the orchestration.
[Lineage graph] SOURCES: raw.customers, raw.orders, raw.payments → STAGING: stg_customers, stg_orders, stg_payments (all run in parallel) → INTERMEDIATE: int_orders_with_payments (waits for stg_orders + stg_payments) → MARTS: fct_customer_order_summary (final output)
This graph is the stored proc made visible. The stored procedure had this same structure all along; it just wasn't visible. You had to read 60 lines of SQL to understand that step 4 depended on steps 2 and 3. dbt makes it explicit, queryable, and documented automatically.
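dbt's scheduler is more elaborate than this, but the core mechanism, deriving execution order from declared dependencies, fits in a few lines of Python. The dependency map below is hand-written to mirror this project's ref()/source() edges; dbt builds the same structure automatically by parsing each model's SQL:

```python
from graphlib import TopologicalSorter

# Hand-written dependency map mirroring this project's ref() edges.
# Key = model, value = set of models it depends on.
deps = {
    "stg_customers": set(),
    "stg_orders": set(),
    "stg_payments": set(),
    "int_orders_with_payments": {"stg_orders", "stg_payments"},
    "fct_customer_order_summary": {"stg_customers", "int_orders_with_payments"},
}

ts = TopologicalSorter(deps)
ts.prepare()

waves = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # every model whose dependencies are satisfied
    waves.append(ready)             # ...can run in the same parallel wave
    ts.done(*ready)

for i, wave in enumerate(waves, 1):
    print(f"wave {i}: {', '.join(wave)}")
# wave 1: stg_customers, stg_orders, stg_payments
# wave 2: int_orders_with_payments
# wave 3: fct_customer_order_summary
```

The thread count from the next tab simply caps how many members of a wave execute at once.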
⚡
The stored proc runs all 5 steps serially; that's the only option. dbt reads the dependency graph and runs independent models simultaneously across threads.

โŒ stored proc โ€” serial only

step 1
stg_cust
step 2
stg_orders
step 3
stg_pay
step 4
int_orders
step 5
fct_summary
Execution time: 5 units

โœ… dbt โ€” parallel (3 threads)

thread 1
stg_cust
thread 2
stg_orders
thread 3
stg_pay
thread 1
int_orders
thread 1
fct_summary
Execution time: 3 units
How does dbt know it can parallelize? Because stg_customers, stg_orders, and stg_payments each reference only source(); there are no ref() edges between them. dbt reads the graph, sees no dependencies, and dispatches all three at once. The ref() declarations from tab 2 are the orchestration: you didn't write any extra config to make this happen.
At scale: This example has 5 models. Real projects have hundreds. A pipeline with 40 independent staging models runs in roughly the time it takes to run one, given enough threads.
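The degree of parallelism is a connection setting, not code. A sketch of where it lives (the profile name, adapter, and value are all illustrative):

```yaml
# profiles.yml - 'threads' caps how many models run at once
my_profile:              # placeholder profile name
  target: dev
  outputs:
    dev:
      type: postgres     # adapter is illustrative
      threads: 8         # up to 8 independent models execute simultaneously
      # ...connection details elided...
```

It can also be overridden per invocation with `dbt run --threads 8`.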
🔬
The stored proc works until it doesn't. When it breaks, the debugging experience is fundamentally different. Click each scenario to compare.
Pipeline fails at the join step (step 4)
stored procedure

Step 4 failed. The temp tables from steps 1-3 are gone; they died with the session. Re-run the entire procedure from the top to reproduce the failure. Every debug cycle is a full re-run.

dbt

Step 4 failed. stg_orders and stg_payments are still in the warehouse. Run dbt run -s int_orders_with_payments to re-execute only that model. Upstream work stays intact.

The difference comes from one thing: dbt models are real tables in your warehouse. Stored proc temp tables aren't. That's the whole game.
Wrong row count in the final output
stored procedure

Fan-out in the join? Bad filter in step 2? You add temp debug tables mid-proc, re-run everything, and manually query each one. Multiple cycles to isolate.

dbt

Query stg_orders: how many rows? Query int_orders_with_payments: how many? The difference shows exactly where the fan-out happened. No re-run needed.

Because every intermediate model is a real table, you can bisect a row count problem the same way you'd bisect a bug in code, without re-running anything.
Upstream schema change: a column got renamed
stored procedure

The proc either silently returns NULLs or throws an error. You find out when a BI dashboard breaks. Trace back through the proc to find which temp table was affected.

dbt

stg_payments is the only model that touches raw.payments. Fix it there. Every downstream model gets the fix automatically on the next run. One change point, not five.

The stg_ layer acts as an isolation boundary. Raw schema changes only affect one file: the staging model for that source.
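Catching the break before a dashboard does is what schema tests are for. A sketch using dbt's built-in generic tests (the YAML file name is just a convention):

```yaml
# models/staging/stg_orders.yml
version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:          # dbt built-ins: unique, not_null
          - unique
          - not_null
```

`dbt test` runs these as queries against the built model, so a renamed or dropped upstream column fails here, in the staging layer, instead of surfacing as NULLs in a mart.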
3am on-call alert fires
stored procedure

Connect to the DB. No intermediate state anywhere. Re-run the full proc to reproduce. Add RAISE NOTICE statements. Re-run again. 45 minutes of archaeology before you even start fixing.

dbt

dbt run logs show the exact model that failed and the exact error. Intermediate models are still in the warehouse. Run only the failed model, fix, run downstream only.

This is where the value becomes visceral. The engineer on-call at 3am cares a lot about whether the intermediate state survived.
What else depends on orders data?
stored procedure

Grep the codebase and hope nothing uses tmp_orders in another proc you don't know about. No system-of-record for dependencies.

dbt

The lineage graph shows every downstream model that references stg_orders. Click on it in the dbt DAG and see the full impact of any change instantly.

A new engineer can answer "what depends on this?" in 5 seconds with the DAG instead of 30 minutes of codebase archaeology.
The root cause of every dbt debugging advantage: Intermediate models are real, persistent, queryable tables. Stored proc temp tables aren't. Every other difference flows from that one fact.