
dbt Integration

Related Topics: Configuration (enable dbt) | SQL Endpoints (query dbt models) | DuckDB Integration (database engine) | Common Tasks (quick how-to)

Official Documentation: dbt Docs | dbt-duckdb Adapter

dbt (data build tool) is the data quality layer in MXCP’s production methodology. It transforms raw data into well-structured models that AI can consume reliably.

dbt serves as your data quality foundation:

  1. Data Modeling: Transform raw data into clean, structured models
  2. Quality Testing: Validate data before AI consumption
  3. Performance: Materialize views for fast query response
  4. Documentation: Generate clear documentation for data models
  5. Version Control: Track all transformations in Git

This ensures:

  • AI tools work with clean, validated data
  • Performance is optimized through materialization
  • Changes are tracked and tested before deployment
  • Data quality issues are caught early

Enable dbt in your mxcp-site.yml:

mxcp: 1
project: my-project
profile: dev
dbt:
  enabled: true
  model_paths: ["models"]  # Paths to dbt model directories

Generate dbt configuration files:

# Generate dbt config files
mxcp dbt-config
# Preview without writing
mxcp dbt-config --dry-run
# Embed secrets (use with caution)
mxcp dbt-config --embed-secrets --force

This creates:

  • dbt_project.yml - Project configuration
  • profiles.yml - Connection profiles
  • models/ - Model directory structure
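For the dbt-duckdb adapter, the generated profiles.yml typically looks something like the sketch below; the profile name and database path are illustrative, and the exact contents depend on your MXCP configuration:

```yaml
my-project:
  target: dev
  outputs:
    dev:
      type: duckdb            # dbt-duckdb adapter
      path: data/my-project.duckdb
```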

The mxcp dbt-config command manages dbt configuration:

# Generate config files
mxcp dbt-config
# Show what would be written
mxcp dbt-config --dry-run
# Embed secrets in profiles.yml
mxcp dbt-config --embed-secrets --force

The mxcp dbt command is a wrapper around the dbt CLI with secret injection:

# Run all models
mxcp dbt run
# Run specific models
mxcp dbt run --select my_model
# Run models and dependencies
mxcp dbt run --select +my_model
# Run tests
mxcp dbt test
# Run specific tests
mxcp dbt test --select my_model
# Generate documentation
mxcp dbt docs generate
# Serve documentation
mxcp dbt docs serve

Recommended structure for dbt with MXCP:

my-project/
├── mxcp-site.yml
├── dbt_project.yml          # Generated
├── profiles.yml             # Generated
├── models/
│   ├── staging/             # Raw data cleanup
│   │   ├── stg_customers.sql
│   │   └── stg_orders.sql
│   ├── intermediate/        # Business logic
│   │   └── int_customer_orders.sql
│   └── marts/               # Final models for AI
│       ├── customer_summary.sql
│       └── order_analytics.sql
├── tests/                   # Data tests
│   └── assert_valid_emails.sql
└── tools/                   # MXCP endpoints
    ├── get_customer.yml
    └── search_orders.yml

Staging models clean raw data:

models/staging/stg_customers.sql
SELECT
    id AS customer_id,
    TRIM(name) AS customer_name,
    LOWER(email) AS email,
    created_at
FROM raw_customers
WHERE email IS NOT NULL
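The staging transformation above can be sketched in plain Python for intuition; clean_customer is a hypothetical helper, not part of MXCP or dbt:

```python
def clean_customer(row):
    """Mirror the staging SQL: trim the name, lowercase the email,
    and drop rows with a missing email."""
    if row.get("email") is None:
        return None  # corresponds to WHERE email IS NOT NULL
    return {
        "customer_id": row["id"],
        "customer_name": row["name"].strip(),
        "email": row["email"].lower(),
        "created_at": row["created_at"],
    }

raw = [
    {"id": 1, "name": "  Ada ", "email": "Ada@Example.com", "created_at": "2024-01-01"},
    {"id": 2, "name": "Bob", "email": None, "created_at": "2024-01-02"},
]
cleaned = [c for r in raw if (c := clean_customer(r)) is not None]
```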

Intermediate models apply business logic:

models/intermediate/int_customer_orders.sql
SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    COUNT(o.order_id) AS order_count,
    SUM(o.total) AS total_spent,
    MAX(o.created_at) AS last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.email
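The left join plus per-customer aggregation can be sketched in pure Python (summarize is a hypothetical helper for illustration only):

```python
from collections import defaultdict

def summarize(customers, orders):
    """LEFT JOIN customers to orders, then aggregate per customer."""
    by_customer = defaultdict(list)
    for o in orders:
        by_customer[o["customer_id"]].append(o)
    out = []
    for c in customers:
        rows = by_customer.get(c["customer_id"], [])
        out.append({
            "customer_id": c["customer_id"],
            "order_count": len(rows),
            "total_spent": sum(o["total"] for o in rows),
            "last_order_date": max((o["created_at"] for o in rows), default=None),
        })
    return out

summary = summarize(
    [{"customer_id": 1}, {"customer_id": 2}],
    [{"customer_id": 1, "total": 50.0, "created_at": "2024-03-01"},
     {"customer_id": 1, "total": 25.0, "created_at": "2024-04-01"}],
)
```

Note how the LEFT JOIN semantics survive: customer 2 has no orders but still gets a summary row with zero counts.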

Final models for MXCP endpoints:

models/marts/customer_summary.sql
{{ config(materialized='table') }}
SELECT
    customer_id,
    customer_name,
    email,
    order_count,
    total_spent,
    last_order_date,
    CASE
        WHEN total_spent > 1000 THEN 'high_value'
        WHEN total_spent > 100 THEN 'medium_value'
        ELSE 'low_value'
    END AS customer_tier,
    -- Format for LLM consumption
    json_object(
        'id', customer_id,
        'name', customer_name,
        'orders', order_count,
        'value', customer_tier
    ) AS customer_context
FROM {{ ref('int_customer_orders') }}
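The CASE expression above is just a threshold function. Sketched as a hypothetical Python helper for reference:

```python
def customer_tier(total_spent: float) -> str:
    """Mirror the CASE expression in customer_summary.sql."""
    if total_spent > 1000:
        return "high_value"
    if total_spent > 100:
        return "medium_value"
    return "low_value"
```

Note the boundaries are strict: a customer at exactly 100 falls into low_value, matching the SQL's > comparisons.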

Control how models are stored:

The view materialization creates a database view:

models/marts/live_orders.sql
{{ config(materialized='view') }}
SELECT * FROM orders WHERE status = 'pending'

Use for: Real-time data, small datasets

The table materialization creates a physical table:

models/marts/customer_summary.sql
{{ config(materialized='table') }}
SELECT * FROM {{ ref('int_customer_orders') }}

Use for: Frequently accessed data, complex calculations

The incremental materialization updates only new rows:

models/marts/order_history.sql
{{ config(materialized='incremental', unique_key='order_id') }}
SELECT * FROM orders
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}

Use for: Large datasets, append-only data
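Conceptually, an incremental run filters the source to rows newer than the table's current maximum timestamp, then upserts them on the unique key. A rough pure-Python sketch of that behavior (not how dbt is actually implemented):

```python
def incremental_run(existing, source, unique_key="order_id", ts="created_at"):
    """Filter source rows newer than the table's max timestamp,
    then upsert them into the existing rows on the unique key."""
    cutoff = max((r[ts] for r in existing), default=None)
    new_rows = [r for r in source if cutoff is None or r[ts] > cutoff]
    merged = {r[unique_key]: r for r in existing}
    for r in new_rows:
        merged[r[unique_key]] = r
    return list(merged.values())

table = [{"order_id": 1, "created_at": "2024-01-01"}]
table = incremental_run(table, [
    {"order_id": 1, "created_at": "2024-01-01"},  # not newer than cutoff, skipped
    {"order_id": 2, "created_at": "2024-02-01"},  # newer, inserted
])
```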

DuckDB automatically creates zonemaps (min-max indexes) for all columns. For highly selective queries, you can create ART indexes using post-hooks:

models/marts/customer_summary.sql
{{ config(
    materialized='table',
    post_hook="CREATE INDEX IF NOT EXISTS idx_customer_id ON {{ this }} (customer_id)"
) }}
SELECT * FROM {{ ref('int_customer_orders') }}

Note: DuckDB recommends avoiding explicit indexes unless you have highly selective queries. Indexes add overhead during inserts/updates and consume memory.

Run SQL after model creation:

models/marts/optimized_table.sql
{{ config(
    materialized='table',
    post_hook="PRAGMA optimize"
) }}
SELECT * FROM {{ ref('source_data') }}

Enforce schema contracts for critical models:

models/schema.yml
version: 2
models:
  - name: customer_summary
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: email
        data_type: string
      - name: total_spent
        data_type: decimal

Note: Contracts are supported for table and view materializations. They validate schema before data is written.

Use dbt packages for extended functionality:

packages.yml
packages:
  - package: dbt-labs/dbt_utils
    version: "1.3.0"

Install packages:

mxcp dbt deps

Use package tests:

models/schema.yml
version: 2
models:
  - name: customer_summary
    columns:
      - name: total_spent
        data_tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

Note: As of dbt 1.8+, use data_tests instead of tests (the old syntax still works but is deprecated).

dbt provides four built-in generic tests: unique, not_null, accepted_values, and relationships.
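What these generic tests assert can be expressed in a few lines of Python; this is illustrative only, since dbt compiles each test to SQL against the warehouse:

```python
def is_unique(rows, col):
    """unique: no duplicate non-null values (dbt's unique test ignores nulls)."""
    vals = [r[col] for r in rows if r[col] is not None]
    return len(vals) == len(set(vals))

def is_not_null(rows, col):
    """not_null: every row has a value."""
    return all(r[col] is not None for r in rows)

def accepted_values(rows, col, allowed):
    """accepted_values: every value is in the allowed set."""
    return all(r[col] in allowed for r in rows)

rows = [{"tier": "high_value"}, {"tier": "low_value"}]
ok = accepted_values(rows, "tier", {"high_value", "medium_value", "low_value"})
```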

models/schema.yml
version: 2
models:
  - name: customer_summary
    columns:
      - name: customer_id
        data_tests:
          - unique
          - not_null
      - name: email
        data_tests:
          - unique
          - not_null
      - name: customer_tier
        data_tests:
          - accepted_values:
              values: ['high_value', 'medium_value', 'low_value']

tests/assert_valid_emails.sql
SELECT *
FROM {{ ref('customer_summary') }}
WHERE email NOT LIKE '%@%.%'
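A singular test passes when its query returns zero rows: the SELECT defines the failures. The same check, sketched in Python with a rough email pattern (the pattern is illustrative, not dbt's):

```python
import re

def failing_emails(rows):
    """Return the rows the singular test would flag: emails not shaped like '%@%.%'."""
    pat = re.compile(r".+@.+\..+")
    return [r for r in rows if not pat.fullmatch(r["email"])]

failures = failing_emails([
    {"email": "ada@example.com"},
    {"email": "not-an-email"},
])
```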

models:
  - name: orders
    columns:
      - name: customer_id
        data_tests:
          - relationships:
              arguments:
                to: ref('customer_summary')
                field: customer_id

Note: As of dbt 1.10+, test arguments should be nested under the arguments property.

models/schema.yml
version: 2
models:
  - name: customer_summary
    description: >
      Customer profiles with order summaries.
      Used by customer service AI tools.
    columns:
      - name: customer_id
        description: Unique customer identifier
      - name: customer_tier
        description: >
          Value tier based on total spending:
          - high_value: > $1000
          - medium_value: $100 - $1000
          - low_value: < $100
# Generate documentation
mxcp dbt docs generate
# Serve locally
mxcp dbt docs serve

Reference dbt models in MXCP endpoints:

tools/get_customer.yml
mxcp: 1
tool:
  name: get_customer
  description: Get customer profile with order summary
  parameters:
    - name: customer_id
      type: integer
      description: Customer ID to lookup
  return:
    type: object
    properties:
      customer_id:
        type: integer
      customer_name:
        type: string
      email:
        type: string
      customer_tier:
        type: string
  source:
    code: |
      SELECT *
      FROM customer_summary
      WHERE customer_id = $customer_id

A typical development workflow:
# 1. Run dbt models
mxcp dbt run
# 2. Test data quality
mxcp dbt test
# 3. Validate MXCP endpoints
mxcp validate
# 4. Test MXCP functionality
mxcp test
# 5. Start server
mxcp serve

dbt supports Python models (dbt 1.3+) for complex transformations that are difficult in SQL. This is useful for ML preprocessing, statistical analysis, or leveraging Python libraries.

models/marts/customer_segments.py
import pandas as pd

def model(dbt, session):
    """Segment customers using Python logic."""
    # Reference upstream dbt model (returns a DuckDB Relation);
    # convert to a pandas DataFrame with .df()
    customer_df = dbt.ref("customer_summary").df()

    # Complex Python logic
    def assign_segment(row):
        if row['total_spent'] > 1000 and row['order_count'] > 10:
            return 'champion'
        elif row['total_spent'] > 500:
            return 'loyal'
        elif row['order_count'] > 5:
            return 'potential'
        else:
            return 'new'

    customer_df['segment'] = customer_df.apply(assign_segment, axis=1)
    return customer_df

Note: dbt.ref() returns a DuckDB Relation. Use .df() to convert to pandas DataFrame, .pl() for Polars, or .arrow() for Arrow Table.

models/marts/customer_features.py
import pandas as pd
from sklearn.preprocessing import StandardScaler

def model(dbt, session):
    """Create ML-ready feature vectors."""
    df = dbt.ref("customer_summary").df()

    # Select numeric features
    features = ['order_count', 'total_spent', 'days_since_first_order']

    # Normalize features
    scaler = StandardScaler()
    df[features] = scaler.fit_transform(df[features])
    return df
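StandardScaler z-scores each column: subtract the mean, divide by the population standard deviation (ddof=0). A stdlib-only sketch of the same math, for intuition:

```python
import statistics

def standardize(values):
    """Z-score a column the way StandardScaler does (population std, ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

z = standardize([10.0, 20.0, 30.0])  # mean 20, symmetric around 0
```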
models/marts/product_embeddings.py
import pandas as pd

def model(dbt, session):
    """Prepare product text for search and embedding."""
    products_df = dbt.ref("stg_products").df()

    # Clean and combine text fields
    products_df['search_text'] = (
        products_df['name'].str.lower() + ' ' +
        products_df['description'].str.lower().fillna('')
    )

    # Remove special characters
    products_df['search_text'] = products_df['search_text'].str.replace(
        r'[^\w\s]', '', regex=True
    )
    return products_df

Configure Python models in schema.yml:

models/schema.yml
version: 2
models:
  - name: customer_segments
    description: Customer segmentation using Python logic
    config:
      materialized: table
    columns:
      - name: customer_id
        data_tests:
          - unique
          - not_null
      - name: segment
        data_tests:
          - accepted_values:
              values: ['champion', 'loyal', 'potential', 'new']

Use Case                          SQL or Python?
Simple joins and aggregations     SQL
Window functions                  SQL
Complex conditional logic         Python
ML preprocessing                  Python
Text processing                   Python
Statistical calculations          Python
Using external libraries          Python

Install required packages in your environment:

pip install dbt-duckdb pandas scikit-learn

Chain SQL and Python models:

Reference Python models in SQL:

models/marts/customer_summary.sql
SELECT
    cs.*,
    seg.segment,
    seg.segment_score
FROM {{ ref('customer_summary_base') }} cs
LEFT JOIN {{ ref('customer_segments') }} seg  -- Python model
    ON cs.customer_id = seg.customer_id

models/marts/search_index.sql
{{ config(materialized='table') }}
SELECT
    customer_id,
    customer_name,
    email,
    -- Create search-friendly text
    LOWER(customer_name || ' ' || email) AS search_text
FROM {{ ref('customer_summary') }}
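The search_text column is just lowercased concatenation; in Python terms (a hypothetical helper, shown only to make the SQL concrete):

```python
def search_text(customer_name: str, email: str) -> str:
    """Mirror LOWER(customer_name || ' ' || email)."""
    return f"{customer_name} {email}".lower()

s = search_text("Ada Lovelace", "Ada@Example.com")
```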

models/staging/remote_data_cache.sql
{{ config(materialized='table') }}
SELECT *
FROM read_parquet('https://example.com/data/*.parquet')

models/marts/event_log.sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}
SELECT *
FROM {{ source('events', 'raw_events') }}
{% if is_incremental() %}
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}

A GitHub Actions workflow that builds and tests dbt models, then validates MXCP:

name: dbt + MXCP
on: [push, pull_request]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install mxcp dbt-duckdb
      - name: Run dbt
        run: |
          mxcp dbt-config
          mxcp dbt run
          mxcp dbt test
      - name: Validate MXCP
        run: |
          mxcp validate
          mxcp test

# Run tests after every model change
mxcp dbt test

Document every model and column:

models:
  - name: customer_summary
    description: Clear, detailed description
    columns:
      - name: each_column
        description: What this column contains
Choose the right materialization:

  • view: Development, small data
  • table: Production, frequent queries
  • incremental: Large, append-only data

# Track all dbt changes
git add models/
git commit -m "Add customer summary model"

If an endpoint can't find a model:

# Ensure dbt models are built
mxcp dbt run
# Check model exists
mxcp dbt ls

If dbt tests fail:

# Run with verbose output
mxcp dbt test --select model_name --debug
# Check test SQL
mxcp dbt compile --select model_name

If secrets aren't available to dbt:

# Regenerate config with secrets
mxcp dbt-config
# Or embed secrets (development only)
mxcp dbt-config --embed-secrets --force