Lakehouse - Warehouse + dbt Norms & Standards

This document defines enterprise-grade, scalable, and cost-efficient standards for a PySpark + S3 Iceberg + Snowflake External Volume + dbt architecture.


Architecture Overview

```mermaid
flowchart LR
    A[PySpark<br/>Data Processing] -->|Write| B[Iceberg RAW<br/>Append-only]
    B -->|Transform| C[dbt STG<br/>Incremental MERGE]
    C -->|Model| D[dbt BASE / INT<br/>Views]
    D -->|Serve| E[dbt MART<br/>Incremental Tables]
```

Reference: How we made dbt runs 30% faster – dbt Labs


Layer Definitions

The mandatory and recommended dbt layer order is:

raw → stg → base → int → mart

Rationale

  • stg: technical normalization of raw data (type casting, deduplication, latest state)
  • base: preparation layer for business logic
  • int: combines multiple base models for reuse

Some legacy dbt materials use raw → base → stg. In this standard, stg and base have strictly separated responsibilities.


Naming Conventions

File names. Creating a consistent pattern of file naming is crucial in dbt. File names must be unique and correspond to the name of the model when selected and created in the warehouse. We recommend putting as much clear information into the file name as possible, including a prefix for the layer the model exists in, important grouping information, and specific information about the entity or transformation in the model.

| Layer | Prefix | Example | Explanation |
|---|---|---|---|
| Raw (source) | raw_[source]__[entity]s.sql | raw_dce__orders.sql | The double underscore between source system and entity helps visually distinguish the separate parts when a source name has multiple words. For instance, google_analytics__campaigns is always understandable, whereas to somebody unfamiliar, google_analytics_campaigns could be analytics_campaigns from the google source system as easily as campaigns from the google_analytics source system. Think of it like an Oxford comma: the extra clarity is very much worth the extra punctuation. |
| Staging | stg_[source]__[entity]s.sql | stg_i2i__payments.sql | - |
| Base | base_[source]__[entity]s.sql | base_dce__customer.sql | - |
| Intermediate | int_[entity]s_[verb]s.sql | int_customer_orders.sql | - |
| Mart | fct_, dim_ | fct_sales.sql, dim_customer.sql | - |

Layer Responsibilities and Standards

| Layer | Purpose | Characteristics |
|---|---|---|
| raw | Immutable source history | Append-only, incremental |
| stg | Technical cleanup & current state | Incremental + merge |
| base | Business-ready entities | Mostly views |
| int | Reusable joins & enrichments | Views (preferred) |
| mart | Analytics & reporting | Tables / incremental |

RAW Layer Standards (Iceberg)

Loading Strategy

  • Append-only incremental ingestion
  • UPDATE / DELETE is not allowed
  • Every change creates a new record (full history preserved)

Mandatory Technical Columns

| Column | Type | Description |
|---|---|---|
| etl_date | DATE | Partition key |
| etl_time | TIMESTAMP | Load timestamp |
| row_hash | BIGINT | Hash generated with xxhash64 |
| _batch_id | STRING | Optional batch identifier |
| _op | STRING | Optional operation flag (I / U / D) |

Hashing Standard

  • Hash is generated on the Spark side
  • Algorithm: xxhash64
  • Input: business key columns (null-safe)

Purpose

  • Merge key in stg
  • Change detection
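The null-safe requirement means NULL key parts must hash deterministically and must not collide with empty strings. The sketch below illustrates that normalization pattern in plain Python; xxhash64 itself is not in the standard library, so `zlib.crc32` stands in purely to demonstrate the input handling, not the production algorithm (which is Spark's `xxhash64`):

```python
# Sketch: null-safe business-key hashing pattern.
# zlib.crc32 is a stdlib stand-in for Spark's xxhash64; the point is the
# NULL normalization, not the hash function itself.
import zlib

NULL_SENTINEL = "\x00"  # arbitrary placeholder so NULL != empty string
FIELD_SEP = "\x1f"      # unit separator between key parts

def row_hash(*business_keys):
    """Join business keys null-safely, then hash the joined string."""
    normalized = FIELD_SEP.join(
        NULL_SENTINEL if k is None else str(k) for k in business_keys
    )
    return zlib.crc32(normalized.encode("utf-8"))

# NULL and empty string hash differently; equal keys hash identically.
assert row_hash("c1", None) != row_hash("c1", "")
assert row_hash("c1", "o1") == row_hash("c1", "o1")
```

On the Spark side, the equivalent would be `F.xxhash64(...)` over `coalesce`-wrapped key columns, using the same sentinel idea.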

Partitioning Standard

```sql
PARTITION BY etl_date
```

  • Daily partitioning is mandatory
  • etl_time must not be used as a partition key
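Putting the mandatory columns and partitioning together, a RAW table DDL might look like the following Spark SQL sketch (catalog, table, and business-column names are illustrative assumptions):

```sql
-- Sketch: RAW Iceberg table with the mandatory technical columns.
-- lake.raw.raw_dce__orders and the business columns are illustrative.
create table lake.raw.raw_dce__orders (
    order_id    string,      -- business key (example)
    payload     string,      -- source attributes (example)
    etl_date    date,        -- partition key
    etl_time    timestamp,   -- load timestamp
    row_hash    bigint,      -- xxhash64 over business keys
    _batch_id   string,      -- optional batch identifier
    _op         string       -- optional operation flag (I / U / D)
)
using iceberg
partitioned by (etl_date)    -- daily partitioning; never etl_time
```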

STG Layer Standards (dbt)

Materialization

```yaml
materialized: incremental
incremental_strategy: merge
unique_key: row_hash
```

Incremental Filter (Sliding Window)

  • A 2-day buffer is mandatory for late-arriving data
```sql
{% if is_incremental() %}
where etl_date >= (
    select dateadd(day, -2, max(etl_date)) from {{ this }}
)
{% endif %}
```

Responsibilities of STG Models

  • Latest record selection
  • Deduplication
  • Type casting
  • Soft-delete filtering

No business logic is allowed in the stg layer.
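Taken together, a stg model under these standards might look like the following sketch (source, table, and column names are illustrative; `qualify` assumes a Snowflake target):

```sql
-- Sketch: stg model combining merge config, sliding window, casting,
-- and latest-record selection. All names are illustrative.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='row_hash',
    tags=['layer:stg', 'incremental']
) }}

select
    row_hash,
    cast(order_id as bigint)    as order_id,
    cast(order_ts as timestamp) as order_ts,
    etl_date
from {{ source('dce', 'raw_dce__orders') }}
where coalesce(_op, 'I') != 'D'   -- soft-delete filtering
{% if is_incremental() %}
  and etl_date >= (
      select dateadd(day, -2, max(etl_date)) from {{ this }}
  )
{% endif %}
qualify row_number() over (
    partition by row_hash order by etl_time desc
) = 1                             -- latest record per business key
```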


BASE & INT Layer Standards

Default Materialization

| Layer | Default |
|---|---|
| base | view |
| int | view |

When to Use TABLE Instead of VIEW

  • Large or expensive join chains
  • Heavily reused intermediate datasets
  • Proven performance bottlenecks

Views are preferred unless there is a strong reason.
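When one of these conditions is proven, the override belongs in the model itself rather than project-wide. A minimal sketch (model content and tags illustrative):

```sql
-- Sketch: promote a heavy, widely reused int model from view to table.
{{ config(materialized='table', tags=['layer:int', 'heavy']) }}

-- expensive, frequently reused join chain goes here
select ...
```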


MART Layer Standards

Materialization Strategy

| Scenario | Recommendation |
|---|---|
| Fact tables | incremental |
| Small dimensions | table |
| Large dimensions | incremental |

Should NOT Be Incremental

  • Reference / lookup tables
  • Small full-snapshot tables

dbt Project Standards

Folder Structure

```text
dbt_project/
├── models/
│   ├── raw/       # source.yml only
│   ├── stg/       # incremental + merge
│   ├── base/      # business-ready views
│   ├── int/       # joins / enrichments
│   └── mart/
│       ├── fct/
│       └── dim/
├── macros/
├── tests/
└── snapshots/     # optional
```

Default Materializations (dbt_project.yml)

```yaml
models:
  project_name:
    stg:
      +materialized: incremental
    base:
      +materialized: view
    int:
      +materialized: view
    mart:
      +materialized: table
```

Incremental mart models must explicitly override this.
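For example, an incremental fact model would carry the override in its own config block (model and key names illustrative):

```sql
-- Sketch: fct model overriding the mart-level table default.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='sale_id',          -- illustrative business key
    tags=['layer:mart', 'incremental']
) }}
```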


dbt Variables Standard

```yaml
vars:
  raw_incremental_days: 2
  default_timezone: 'UTC'
```

Usage example:

```sql
where etl_date >= (
    select dateadd(day, -{{ var('raw_incremental_days') }}, max(etl_date))
    from {{ this }}
)
```

dbt Tagging Standard

| Tag | Purpose | Mandatory |
|---|---|---|
| layer:raw | Source definition | yes |
| layer:stg | Staging models | yes |
| layer:base | Base models | yes |
| layer:int | Intermediate models | yes |
| layer:mart | Mart models | yes |
| incremental | Incremental model | optional |
| full_refresh_ok | Full refresh allowed | optional |
| heavy | Expensive model | optional |
| pii | Sensitive data | optional |

Example:

```sql
{{ config(tags=['layer:stg', 'incremental']) }}
```

Testing & Data Quality Standards

Mandatory Tests

  • row_hash → unique (stg)
  • Business key → not null
  • etl_date → not null
  • Accepted values
  • Freshness checks (raw)
  • Volume anomaly detection
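Most of the mandatory tests map directly onto a model-level schema file; a sketch (model and column names illustrative):

```yaml
# Sketch: mandatory tests for a stg model (names illustrative).
version: 2
models:
  - name: stg_dce__orders
    columns:
      - name: row_hash
        tests: [unique, not_null]
      - name: order_id          # business key (example)
        tests: [not_null]
      - name: etl_date
        tests: [not_null]
      - name: _op
        tests:
          - accepted_values:
              values: ['I', 'U', 'D']
```

Freshness checks on raw belong on the source definition (a `loaded_at_field` such as `etl_time` plus `freshness:` thresholds), and volume anomaly detection typically requires a package or a custom test.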

Performance & Cost Rules

  • Full table scans are not allowed
  • Incremental models must use partition filters
  • row_number() only during stg build
  • Dashboards must not query raw or stg layers

Golden Rules

  • Raw is never a state table
  • History lives in raw, current state in stg
  • The most expensive transformation runs only once
  • Incremental models must be a conscious decision
  • Fix models before scaling warehouses