DuckDB Integration
Related Topics: SQL Endpoints (tutorial) | SQL Reference (built-in functions) | Configuration (extensions setup) | Common Tasks (quick how-to)
Official Documentation: DuckDB Docs | Extensions | SQL Reference
DuckDB serves as MXCP’s SQL execution engine, providing fast, local-first data access with extensive connectivity options.
Overview
DuckDB enables:
- Fast Analytics: In-process OLAP database
- Multi-Source: Connect to S3, HTTP, PostgreSQL, MySQL, and more
- Extensions: Rich ecosystem for additional functionality
- Zero Setup: No external database server required
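As a sketch of what this looks like in practice, a single query can aggregate a remote file directly, with no server or load phase (the URL and column names here are placeholders, and the httpfs extension must be enabled):

```sql
-- Aggregate a remote CSV in one step
-- (hypothetical URL and columns; httpfs extension required)
SELECT category, COUNT(*) AS events
FROM read_csv('https://example.com/events.csv')
GROUP BY category
ORDER BY events DESC;
```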
Extensions
Core Extensions
Built-in extensions for common functionality:
```yaml
extensions:
  - httpfs   # HTTP/HTTPS file system
  - parquet  # Parquet file support
  - json     # JSON file support
```
Community Extensions
Community-maintained extensions:
```yaml
extensions:
  - name: extension_name
    repo: community
```
Nightly Extensions
Latest development versions:
```yaml
extensions:
  - name: extension_name
    repo: core_nightly
```
Common Extensions
| Extension | Purpose |
|---|---|
| httpfs | Read from HTTP/HTTPS URLs |
| parquet | Parquet file support |
| json | JSON file handling |
| postgres | PostgreSQL connectivity |
| mysql | MySQL connectivity |
| sqlite | SQLite file access |
| spatial | Geospatial functions |
| fts | Full-text search |
For a complete list of available extensions, see the DuckDB Extensions documentation.
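Extensions declared in MXCP configuration are loaded for you, but if you are experimenting in a plain DuckDB session the same can be done in SQL (standard DuckDB syntax; `h3` is just one example of a community extension):

```sql
-- Install and load a core extension manually
INSTALL httpfs;
LOAD httpfs;

-- Community extensions install from the community repository
INSTALL h3 FROM community;
LOAD h3;
```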
Data Sources
DuckDB extensions enable access to various data sources. Enable the appropriate extension before using its functions.
Local Files
Read from the local filesystem:
```sql
-- Parquet files
SELECT * FROM read_parquet('data/*.parquet');

-- CSV files
SELECT * FROM read_csv('data/users.csv');

-- JSON files
SELECT * FROM read_json('data/events.json');

-- Glob patterns
SELECT * FROM read_parquet('data/year=*/month=*/*.parquet');
```
Remote Files (httpfs)
Read from HTTP/HTTPS URLs:
```sql
-- HTTP CSV
SELECT * FROM read_csv('https://example.com/data.csv');

-- HTTPS Parquet
SELECT * FROM read_parquet('https://bucket.s3.amazonaws.com/data.parquet');
```
S3 (httpfs)
Read from Amazon S3:
```sql
-- S3 with public access
SELECT * FROM read_parquet('s3://public-bucket/data.parquet');

-- S3 with credentials (via secrets)
SELECT * FROM read_parquet('s3://private-bucket/data.parquet');
```
PostgreSQL
Connect to PostgreSQL databases:
```sql
-- Attach PostgreSQL database (connection string format)
ATTACH 'dbname=mydb host=localhost port=5432 user=postgres password=secret' AS pg (TYPE postgres);

-- Or use PostgreSQL URI format
ATTACH 'postgresql://user:pass@host:5432/db' AS pg (TYPE postgres);

-- Query tables
SELECT * FROM pg.public.users;

-- Read-only connection
ATTACH 'dbname=mydb host=localhost' AS pg_ro (TYPE postgres, READ_ONLY);
```
Note: Avoid credentials in connection strings for production. Use DuckDB secrets instead.
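One way to follow that advice, sketched here from DuckDB's secrets mechanism, is to create a postgres-type secret and attach with an empty connection string (all values below are placeholders):

```sql
-- Store connection details in a DuckDB secret instead of the ATTACH string
CREATE SECRET pg_secret (
    TYPE postgres,
    HOST 'localhost',
    PORT 5432,
    DATABASE 'mydb',
    USER 'postgres',
    PASSWORD 'secret'
);

-- The empty string makes ATTACH read connection details from the secret
ATTACH '' AS pg (TYPE postgres, SECRET pg_secret);
```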
MySQL
Connect to MySQL databases:
```sql
-- Attach MySQL database (key=value format, NOT URI)
ATTACH 'host=localhost user=root password=secret port=3306 database=mydb' AS mysql (TYPE mysql);

-- Query tables
SELECT * FROM mysql.users;

-- Read-only connection
ATTACH 'host=localhost database=mydb' AS mysql_ro (TYPE mysql, READ_ONLY);
```
Note: MySQL uses key=value connection strings, not URI format.
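As with PostgreSQL, connection details can be kept out of the ATTACH string via a DuckDB secret (a sketch; values are placeholders):

```sql
-- mysql-type secret holding the connection parameters
CREATE SECRET mysql_secret (
    TYPE mysql,
    HOST 'localhost',
    PORT 3306,
    DATABASE 'mydb',
    USER 'root',
    PASSWORD 'secret'
);

ATTACH '' AS mysql_db (TYPE mysql, SECRET mysql_secret);
```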
SQLite
Read SQLite databases:
```sql
-- Attach SQLite database
ATTACH 'path/to/database.sqlite' AS sqlite_db;

-- Query tables
SELECT * FROM sqlite_db.users;
```
Secret Management
Configuration
Configure secrets in ~/.mxcp/config.yml:
```yaml
mxcp: 1
projects:
  my_project:
    profiles:
      dev:
        secrets:
          # S3 credentials
          - name: s3_creds
            type: s3
            parameters:
              KEY_ID: "${AWS_ACCESS_KEY_ID}"
              SECRET: "${AWS_SECRET_ACCESS_KEY}"
              REGION: "us-east-1"

          # HTTP authentication
          - name: http_auth
            type: http
            parameters:
              BEARER_TOKEN: "${API_TOKEN}"

          # PostgreSQL connection
          - name: pg_connection
            type: postgres
            parameters:
              HOST: "db.example.com"
              PORT: 5432
              USER: "${DB_USER}"
              PASSWORD: "${DB_PASSWORD}"
              DATABASE: "production"
```
Reference Secrets
In mxcp-site.yml:
```yaml
mxcp: 1
project: my_project
profile: dev

secrets:
  - s3_creds
  - http_auth
  - pg_connection
```
Secret Types
| Type | Use Case | Parameters |
|---|---|---|
| s3 | AWS S3 access | KEY_ID, SECRET, REGION, ENDPOINT |
| gcs | Google Cloud Storage | BUCKET, KEY_FILE |
| azure | Azure Blob Storage | CONNECTION_STRING |
| http | HTTP authentication | BEARER_TOKEN, EXTRA_HTTP_HEADERS |
| postgres | PostgreSQL | HOST, PORT, USER, PASSWORD, DATABASE |
| mysql | MySQL | HOST, PORT, USER, PASSWORD, DATABASE |
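For instance, a Google Cloud Storage secret in ~/.mxcp/config.yml would follow the same shape as the earlier examples, using the parameters from the table above (the bucket name and key-file variable are placeholders):

```yaml
mxcp: 1
projects:
  my_project:
    profiles:
      dev:
        secrets:
          - name: gcs_creds
            type: gcs
            parameters:
              BUCKET: "my-bucket"
              KEY_FILE: "${GCP_KEY_FILE}"
```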
HTTP Headers Example
```yaml
secrets:
  - name: custom_api
    type: http
    parameters:
      EXTRA_HTTP_HEADERS:
        Authorization: "Bearer ${API_TOKEN}"
        X-API-Key: "${API_KEY}"
        X-Custom-Header: "custom_value"
```
MXCP Integration
SQL Endpoints
Use DuckDB features in MXCP endpoints:
```yaml
mxcp: 1
tool:
  name: query_remote_data
  description: Query remote Parquet files
  parameters:
    - name: year
      type: integer
      description: Year to query
  return:
    type: array
    items:
      type: object
  source:
    code: |
      SELECT * FROM read_parquet(
        's3://data-bucket/year=' || $year || '/*.parquet'
      ) LIMIT 100
```
Multi-Source Joins
Join data from multiple sources:
```yaml
mxcp: 1
tool:
  name: cross_source_report
  description: Join local and remote data
  return:
    type: array
    items:
      type: object
  source:
    code: |
      WITH local_customers AS (
        SELECT * FROM customers
      ),
      remote_orders AS (
        SELECT * FROM read_parquet('s3://orders/*.parquet')
      )
      SELECT
        c.customer_id,
        c.name,
        COUNT(o.order_id) AS order_count
      FROM local_customers c
      LEFT JOIN remote_orders o ON c.customer_id = o.customer_id
      GROUP BY c.customer_id, c.name
```
Performance Optimization
Materialized Views
Pre-compute expensive queries:
```sql
-- DuckDB has no materialized views; pre-compute with CREATE TABLE AS
CREATE TABLE customer_stats AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(total) AS total_spent
FROM orders
GROUP BY customer_id;
```
Partitioned Data
Query partitioned data efficiently:
```sql
-- Only read relevant partitions
SELECT *
FROM read_parquet('s3://bucket/year=*/month=*/*.parquet')
WHERE year = 2024 AND month = 6;
```
Parallel Processing
DuckDB automatically parallelizes queries; thread count and memory use are tunable:
```sql
-- Configure thread count
SET threads TO 8;

-- Configure memory limit
SET memory_limit TO '4GB';
```
Query Hints
DuckDB's optimizer chooses join order and algorithms automatically; it does not support Oracle-style hint comments (a /*+ ... */ block is parsed as an ordinary comment and ignored). To tune a specific query, inspect its plan instead:

```sql
-- Inspect the plan DuckDB chooses
EXPLAIN
SELECT *
FROM table_a a
JOIN table_b b ON a.id = b.a_id;
```
Common Patterns
Caching Remote Data
Cache expensive remote queries:
```yaml
mxcp: 1
tool:
  name: cached_analytics
  description: Query cached remote data
  return:
    type: array
    items:
      type: object
  source:
    code: |
      -- Use dbt to materialize this
      SELECT * FROM cached_remote_data
```
With dbt model:
```sql
{{ config(materialized='table') }}

SELECT *
FROM read_parquet('https://example.com/large_dataset.parquet')
```
Query by time range:
```sql
SELECT *
FROM read_parquet('s3://logs/*.parquet')
WHERE event_time >= CURRENT_DATE - INTERVAL '7 days';
```
JSON Processing
Work with JSON data:
```sql
-- Extract string values (use ->> or json_extract_string)
SELECT
  id,
  data ->> '$.name' AS name,
  data ->> '$.nested.field' AS nested_value
FROM read_json('data.json');

-- Extract JSON values (use -> or json_extract, returns quoted strings)
SELECT
  id,
  data -> '$.config' AS config_json
FROM read_json('data.json');

-- Unnest arrays
SELECT
  id,
  unnest(data -> '$.items') AS item
FROM events;
```
Full-Text Search
Search text columns using the FTS extension:
```sql
-- Create FTS index (parameters: table, id_column, text_columns...)
PRAGMA create_fts_index('documents', 'doc_id', 'content');

-- Search using BM25 scoring
SELECT doc_id, content, score
FROM (
  SELECT *, fts_main_documents.match_bm25(doc_id, 'search query') AS score
  FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;

-- Drop and recreate index after data changes
PRAGMA drop_fts_index('documents');
PRAGMA create_fts_index('documents', 'doc_id', 'content');
```
Note: FTS indexes don't auto-update. Rebuild after data changes.
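The index-creation pragma also accepts optional named parameters; for example, DuckDB's FTS extension lets you index several columns and choose a stemmer (a sketch, assuming a `title` column exists alongside `content`):

```sql
-- Index two text columns; overwrite replaces any existing index for the table
PRAGMA create_fts_index(
    'documents', 'doc_id', 'title', 'content',
    stemmer = 'porter', overwrite = 1
);
```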
Troubleshooting
"Extension not found"
```yaml
# Ensure extension is listed
extensions:
  - httpfs  # Must be listed to use HTTP URLs
```
"Access denied"
Check secrets configuration:
```bash
# Verify secrets are loaded
mxcp validate

# Check secret values
mxcp dbt-config --embed-secrets --dry-run
```
"Connection timeout"
```yaml
# Increase timeout
extensions:
  - name: httpfs
    config:
      http_timeout: 60000  # milliseconds
```
"Out of memory"
```sql
-- Reduce memory usage
SET memory_limit TO '2GB';

-- Or process in chunks
SELECT * FROM large_table LIMIT 1000 OFFSET 0;
```
Configuration Reference
Extension Configuration
```yaml
extensions:
  # Simple extension
  - httpfs

  # Extension with configuration
  - name: httpfs
    config:
      http_timeout: 30000

  # Community extension
  - name: spatial
    repo: community

  # Nightly extension
  - name: experimental_feature
    repo: core_nightly
```
Database Settings
```yaml
profiles:
  default:
    duckdb:
      path: data/app.duckdb  # Persistent (default: data/db-{profile}.duckdb)
      readonly: false        # Set true for read-only access
```
DuckDB settings like threads, memory_limit, and temp_directory can be configured via SQL pragmas in your queries or setup scripts:
```sql
-- In sql/setup.sql or at query time
SET threads = 4;
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb';
```
Best Practices
Section titled “Best Practices”1. Enable Only Needed Extensions
```yaml
extensions:
  - httpfs   # Only if using HTTP
  - parquet  # Only if using Parquet
```
2. Use Secrets for Credentials
Never hardcode credentials. See Configuration for details.
```yaml
# Good: Use secrets
secrets:
  - s3_creds

# Bad: Hardcoded values
# Never do this!
```
3. Materialize Frequent Queries
Use dbt to materialize:
```sql
{{ config(materialized='table') }}

SELECT * FROM expensive_query
```
4. Partition Large Datasets
Organize data by query patterns:
```
s3://bucket/
  year=2024/
    month=01/
      data.parquet
    month=02/
      data.parquet
```
5. Monitor Query Performance
Section titled “5. Monitor Query Performance”-- Enable profilingPRAGMA enable_progress_bar;EXPLAIN ANALYZE SELECT * FROM large_table;6. Monitor Schema Changes
Use drift detection to catch unexpected schema changes:
```bash
# Create baseline
mxcp drift-snapshot

# Check for changes
mxcp drift-check
```
See Drift Detection for details.
Next Steps
- dbt Integration - Data transformation
- Configuration - Secrets management
- Monitoring - Performance tracking