# dbt Integration
Related Topics: Configuration (enable dbt) | SQL Endpoints (query dbt models) | DuckDB Integration (database engine) | Common Tasks (quick how-to)
Official Documentation: dbt Docs | dbt-duckdb Adapter
dbt (data build tool) is the data quality layer in MXCP’s production methodology. It transforms raw data into well-structured models that AI can consume reliably.
## Why dbt?

dbt serves as your data quality foundation:
- Data Modeling: Transform raw data into clean, structured models
- Quality Testing: Validate data before AI consumption
- Performance: Materialize views for fast query response
- Documentation: Generate clear documentation for data models
- Version Control: Track all transformations in Git
## The dbt + MXCP Workflow

The workflow pairs dbt transformations with MXCP serving: build and test models with dbt first, then expose them through MXCP endpoints. This ensures:
- AI tools work with clean, validated data
- Performance is optimized through materialization
- Changes are tracked and tested before deployment
- Data quality issues are caught early
## Configuration

### Enable dbt

In your `mxcp-site.yml`:
```yaml
mxcp: 1
project: my-project
profile: dev

dbt:
  enabled: true
  model_paths: ["models"]  # Paths to dbt model directories
```

### Generate Configuration

Generate dbt configuration files:
```bash
# Generate dbt config files
mxcp dbt-config

# Preview without writing
mxcp dbt-config --dry-run

# Embed secrets (use with caution)
mxcp dbt-config --embed-secrets --force
```

This creates:

- `dbt_project.yml` - Project configuration
- `profiles.yml` - Connection profiles
- `models/` - Model directory structure
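For orientation, a `profiles.yml` for the dbt-duckdb adapter generally follows this shape. The profile name and database path below are illustrative, not the exact output of `mxcp dbt-config`:

```yaml
my-project:
  target: dev
  outputs:
    dev:
      type: duckdb     # dbt-duckdb adapter
      path: db.duckdb  # DuckDB database file
```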
## Commands

### mxcp dbt-config

Manages dbt configuration:
```bash
# Generate config files
mxcp dbt-config

# Show what would be written
mxcp dbt-config --dry-run

# Embed secrets in profiles.yml
mxcp dbt-config --embed-secrets --force
```

### mxcp dbt

Wrapper around the dbt CLI with secret injection:
```bash
# Run all models
mxcp dbt run

# Run specific models
mxcp dbt run --select my_model

# Run models and dependencies
mxcp dbt run --select +my_model

# Run tests
mxcp dbt test

# Run specific tests
mxcp dbt test --select my_model

# Generate documentation
mxcp dbt docs generate

# Serve documentation
mxcp dbt docs serve
```

## Project Structure

Recommended structure for dbt with MXCP:
```text
my-project/
├── mxcp-site.yml
├── dbt_project.yml          # Generated
├── profiles.yml             # Generated
├── models/
│   ├── staging/             # Raw data cleanup
│   │   ├── stg_customers.sql
│   │   └── stg_orders.sql
│   ├── intermediate/        # Business logic
│   │   └── int_customer_orders.sql
│   └── marts/               # Final models for AI
│       ├── customer_summary.sql
│       └── order_analytics.sql
├── tests/                   # Data tests
│   └── assert_valid_emails.sql
└── tools/                   # MXCP endpoints
    ├── get_customer.yml
    └── search_orders.yml
```

## Creating Models

### Staging Models

Clean raw data:
```sql
SELECT
    id as customer_id,
    TRIM(name) as customer_name,
    LOWER(email) as email,
    created_at
FROM raw_customers
WHERE email IS NOT NULL
```

### Intermediate Models

Apply business logic:
```sql
SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    COUNT(o.order_id) as order_count,
    SUM(o.total) as total_spent,
    MAX(o.created_at) as last_order_date
FROM {{ ref('stg_customers') }} c
LEFT JOIN {{ ref('stg_orders') }} o
    ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.email
```

### Mart Models (AI-Ready)

Final models for MXCP endpoints:
```sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    customer_name,
    email,
    order_count,
    total_spent,
    last_order_date,
    CASE
        WHEN total_spent > 1000 THEN 'high_value'
        WHEN total_spent > 100 THEN 'medium_value'
        ELSE 'low_value'
    END as customer_tier,
    -- Format for LLM consumption
    json_object(
        'id', customer_id,
        'name', customer_name,
        'orders', order_count,
        'value', customer_tier
    ) as customer_context
FROM {{ ref('int_customer_orders') }}
```
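The `customer_tier` CASE expression evaluates branches top-down, so the thresholds are strict (`>`, not `>=`). A plain-Python restatement of that logic, for illustration only:

```python
def customer_tier(total_spent: float) -> str:
    # Branches are checked in order, mirroring the SQL CASE expression:
    # exactly 1000 falls through to 'medium_value', exactly 100 to 'low_value'.
    if total_spent > 1000:
        return "high_value"
    if total_spent > 100:
        return "medium_value"
    return "low_value"

print(customer_tier(1500))  # high_value
print(customer_tier(1000))  # medium_value (boundary is strict)
print(customer_tier(100))   # low_value
```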
## Materialization

Control how models are stored:

### View (Default)

Creates a database view:
```sql
{{ config(materialized='view') }}

SELECT * FROM orders WHERE status = 'pending'
```

Use for: Real-time data, small datasets

### Table

Creates a physical table:
```sql
{{ config(materialized='table') }}

SELECT * FROM {{ ref('int_customer_orders') }}
```

Use for: Frequently accessed data, complex calculations
### Incremental

Updates only new rows:
```sql
{{ config(materialized='incremental', unique_key='order_id') }}

SELECT * FROM orders

{% if is_incremental() %}
WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```

Use for: Large datasets, append-only data
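Conceptually, an incremental run computes the current high-water mark and appends only rows beyond it. A plain-Python sketch of that behavior with hypothetical sample data (simplified: real incremental models also use `unique_key` to deduplicate):

```python
# Rows already materialized by a previous run
existing = [
    {"order_id": 1, "created_at": "2024-01-01"},
    {"order_id": 2, "created_at": "2024-01-02"},
]
# Rows in the source table on the next run
incoming = [
    {"order_id": 2, "created_at": "2024-01-02"},  # already loaded
    {"order_id": 3, "created_at": "2024-01-03"},  # new since last run
]

# Equivalent of: WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
high_water_mark = max(row["created_at"] for row in existing)
new_rows = [row for row in incoming if row["created_at"] > high_water_mark]
existing.extend(new_rows)

print([row["order_id"] for row in existing])  # [1, 2, 3]
```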
## Advanced Configuration

### Indexes (DuckDB)

DuckDB automatically creates zonemaps (min-max indexes) for all columns. For highly selective queries, you can create ART indexes using post-hooks:
```sql
{{ config(
    materialized='table',
    post_hook="CREATE INDEX IF NOT EXISTS idx_customer_id ON {{ this }} (customer_id)"
) }}

SELECT * FROM {{ ref('int_customer_orders') }}
```

Note: DuckDB recommends avoiding explicit indexes unless you have highly selective queries. Indexes add overhead during inserts/updates and consume memory.
### Post-Hooks

Run SQL after model creation:
```sql
{{ config(
    materialized='table',
    post_hook="PRAGMA optimize"
) }}

SELECT * FROM {{ ref('source_data') }}
```

### Data Contracts

Enforce schema contracts for critical models:
```yaml
version: 2

models:
  - name: customer_summary
    config:
      materialized: table
      contract:
        enforced: true
    columns:
      - name: customer_id
        data_type: int
        constraints:
          - type: not_null
      - name: email
        data_type: string
      - name: total_spent
        data_type: decimal
```

Note: Contracts are supported for table and view materializations. They validate the schema before data is written.
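Conceptually, an enforced contract checks each declared column's presence and type before the model is written. This is not dbt's implementation, just an illustrative plain-Python sketch of the idea; the contract dict and sample rows are hypothetical:

```python
# Declared contract: column name -> expected Python type (illustrative)
CONTRACT = {"customer_id": int, "email": str, "total_spent": float}

def check_row(row: dict) -> list:
    """Return a list of contract violations for one row."""
    errors = []
    for col, expected in CONTRACT.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif row[col] is not None and not isinstance(row[col], expected):
            errors.append(f"{col}: expected {expected.__name__}")
    return errors

ok_row = {"customer_id": 1, "email": "a@b.com", "total_spent": 12.5}
bad_row = {"customer_id": "one", "email": "a@b.com"}  # wrong type, missing column

print(check_row(ok_row))   # []
print(check_row(bad_row))  # ['customer_id: expected int', 'missing column: total_spent']
```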
## dbt Packages

Use dbt packages for extended functionality:
In `packages.yml`:

```yaml
packages:
  - package: dbt-labs/dbt_utils
    version: "1.3.0"
```

Install packages:

```bash
mxcp dbt deps
```

Use package tests:
```yaml
version: 2

models:
  - name: customer_summary
    columns:
      - name: total_spent
        data_tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"
```

Note: As of dbt 1.8+, use `data_tests` instead of `tests` (the old syntax still works but is deprecated).
## Data Quality Tests

### Built-in Tests

dbt provides four built-in generic tests: `unique`, `not_null`, `accepted_values`, and `relationships`.
```yaml
version: 2

models:
  - name: customer_summary
    columns:
      - name: customer_id
        data_tests:
          - unique
          - not_null
      - name: email
        data_tests:
          - unique
          - not_null
      - name: customer_tier
        data_tests:
          - accepted_values:
              values: ['high_value', 'medium_value', 'low_value']
```

### Custom Tests
Write singular tests as SQL files in the `tests/` directory; the test fails if the query returns any rows:

```sql
SELECT *
FROM {{ ref('customer_summary') }}
WHERE email NOT LIKE '%@%.%'
```

### Relationship Tests

```yaml
models:
  - name: orders
    columns:
      - name: customer_id
        data_tests:
          - relationships:
              arguments:
                to: ref('customer_summary')
                field: customer_id
```

Note: As of dbt 1.10+, test arguments should be nested under the `arguments` property.
## Documentation

### Model Documentation

```yaml
version: 2

models:
  - name: customer_summary
    description: >
      Customer profiles with order summaries.
      Used by customer service AI tools.
    columns:
      - name: customer_id
        description: Unique customer identifier
      - name: customer_tier
        description: >
          Value tier based on total spending:
          - high_value: > $1000
          - medium_value: $100 - $1000
          - low_value: < $100
```

### Generate Docs

```bash
# Generate documentation
mxcp dbt docs generate

# Serve locally
mxcp dbt docs serve
```

## MXCP Integration

### Query dbt Models

Reference dbt models in MXCP endpoints:
```yaml
mxcp: 1
tool:
  name: get_customer
  description: Get customer profile with order summary
  parameters:
    - name: customer_id
      type: integer
      description: Customer ID to lookup
  return:
    type: object
    properties:
      customer_id:
        type: integer
      customer_name:
        type: string
      email:
        type: string
      customer_tier:
        type: string
  source:
    code: |
      SELECT * FROM customer_summary
      WHERE customer_id = $customer_id
```

## Development Workflow
```bash
# 1. Run dbt models
mxcp dbt run

# 2. Test data quality
mxcp dbt test

# 3. Validate MXCP endpoints
mxcp validate

# 4. Test MXCP functionality
mxcp test

# 5. Start server
mxcp serve
```

## Python Models

dbt supports Python models (dbt 1.3+) for complex transformations that are difficult in SQL. This is useful for ML preprocessing, statistical analysis, or leveraging Python libraries.
### Basic Python Model

```python
import pandas as pd

def model(dbt, session):
    """Segment customers using Python logic."""

    # Reference upstream dbt model (returns a DuckDB Relation)
    # Convert to a pandas DataFrame with .df()
    customer_df = dbt.ref("customer_summary").df()

    # Complex Python logic
    def assign_segment(row):
        if row['total_spent'] > 1000 and row['order_count'] > 10:
            return 'champion'
        elif row['total_spent'] > 500:
            return 'loyal'
        elif row['order_count'] > 5:
            return 'potential'
        else:
            return 'new'

    customer_df['segment'] = customer_df.apply(assign_segment, axis=1)

    return customer_df
```

Note: `dbt.ref()` returns a DuckDB Relation. Use `.df()` to convert to a pandas DataFrame, `.pl()` for Polars, or `.arrow()` for an Arrow Table.
### ML Preprocessing

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def model(dbt, session):
    """Create ML-ready feature vectors."""

    df = dbt.ref("customer_summary").df()

    # Select numeric features
    features = ['order_count', 'total_spent', 'days_since_first_order']

    # Normalize features
    scaler = StandardScaler()
    df[features] = scaler.fit_transform(df[features])

    return df
```

### Text Processing

```python
import pandas as pd

def model(dbt, session):
    """Build normalized search text for products."""

    products_df = dbt.ref("stg_products").df()

    # Clean and combine text fields
    products_df['search_text'] = (
        products_df['name'].str.lower() + ' ' +
        products_df['description'].str.lower().fillna('')
    )

    # Remove special characters
    products_df['search_text'] = products_df['search_text'].str.replace(
        r'[^\w\s]', '', regex=True
    )

    return products_df
```
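The cleanup applied in the text-processing model can be exercised standalone with Python's built-in `re` module. The function name and sample strings below are hypothetical:

```python
import re

def make_search_text(name: str, description) -> str:
    """Lowercase, concatenate, then drop everything except word chars and whitespace."""
    text = name.lower() + " " + (description or "").lower()  # None acts like fillna('')
    return re.sub(r"[^\w\s]", "", text)

result = make_search_text("Wid-Get 3000!", "Best seller, 50% off")
print(result)  # widget 3000 best seller 50 off
```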
### Configuration

Configure Python models in `schema.yml`:
```yaml
version: 2

models:
  - name: customer_segments
    description: Customer segmentation using Python logic
    config:
      materialized: table
    columns:
      - name: customer_id
        data_tests:
          - unique
          - not_null
      - name: segment
        data_tests:
          - accepted_values:
              values: ['champion', 'loyal', 'potential', 'new']
```

### When to Use Python Models

| Use Case | SQL or Python? |
|---|---|
| Simple joins and aggregations | SQL |
| Window functions | SQL |
| Complex conditional logic | Python |
| ML preprocessing | Python |
| Text processing | Python |
| Statistical calculations | Python |
| Using external libraries | Python |
### Dependencies

Install required packages in your environment:

```bash
pip install dbt-duckdb pandas scikit-learn
```

### Combining SQL and Python

Chain SQL and Python models:
Reference Python models in SQL:

```sql
SELECT
    cs.*,
    seg.segment,
    seg.segment_score
FROM {{ ref('customer_summary_base') }} cs
LEFT JOIN {{ ref('customer_segments') }} seg  -- Python model
    ON cs.customer_id = seg.customer_id
```

## Performance Optimization

### Materialize Frequently Used Data

```sql
{{ config(materialized='table') }}

SELECT
    customer_id,
    customer_name,
    email,
    -- Create search-friendly text
    LOWER(customer_name || ' ' || email) as search_text
FROM {{ ref('customer_summary') }}
```

### Cache Remote Data

```sql
{{ config(materialized='table') }}

SELECT *
FROM read_parquet('https://example.com/data/*.parquet')
```

### Use Incremental Models

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

SELECT *
FROM {{ source('events', 'raw_events') }}

{% if is_incremental() %}
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}
```

## CI/CD Integration

### GitHub Actions

```yaml
name: dbt + MXCP

on: [push, pull_request]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v2
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install mxcp dbt-duckdb

      - name: Run dbt
        run: |
          mxcp dbt-config
          mxcp dbt run
          mxcp dbt test

      - name: Validate MXCP
        run: |
          mxcp validate
          mxcp test
```

## Best Practices
### 1. Layer Your Models

Keep staging, intermediate, and mart layers separate (see Project Structure above).

### 2. Test Early

```bash
# Run tests after every model change
mxcp dbt test
```

### 3. Document Everything
```yaml
models:
  - name: customer_summary
    description: Clear, detailed description
    columns:
      - name: each_column
        description: What this column contains
```

### 4. Use Materialization Wisely
- `view`: Development, small data
- `table`: Production, frequent queries
- `incremental`: Large, append-only data
### 5. Version Control

```bash
# Track all dbt changes
git add models/
git commit -m "Add customer summary model"
```

## Troubleshooting
### "Model not found"

```bash
# Ensure dbt models are built
mxcp dbt run

# Check model exists
mxcp dbt ls
```

### "Test failed"
Section titled “"Test failed"”# Run with verbose outputmxcp dbt test --select model_name --debug
# Check test SQLmxcp dbt compile --select model_name"Secret not found”
Section titled “"Secret not found””# Regenerate config with secretsmxcp dbt-config
# Or embed secrets (development only)mxcp dbt-config --embed-secrets --forceNext Steps
Section titled “Next Steps”- DuckDB Integration - SQL engine
- Testing - MXCP tests
- Configuration - Secrets management