Skip to content
Star -

DuckDB Integration

Related Topics: SQL Endpoints (tutorial) | SQL Reference (built-in functions) | Configuration (extensions setup) | Common Tasks (quick how-to)

Official Documentation: DuckDB Docs | Extensions | SQL Reference

DuckDB serves as MXCP’s SQL execution engine, providing fast, local-first data access with extensive connectivity options.

DuckDB enables:

  • Fast Analytics: In-process OLAP database
  • Multi-Source: Connect to S3, HTTP, PostgreSQL, MySQL, and more
  • Extensions: Rich ecosystem for additional functionality
  • Zero Setup: No external database server required

Built-in extensions for common functionality:

mxcp-site.yml
extensions:
- httpfs # HTTP/HTTPS file system
- parquet # Parquet file support
- json # JSON file support

Community-maintained extensions:

extensions:
- name: extension_name
repo: community

Latest development versions:

extensions:
- name: extension_name
repo: core_nightly
ExtensionPurpose
httpfsRead from HTTP/HTTPS URLs
parquetParquet file support
jsonJSON file handling
postgresPostgreSQL connectivity
mysqlMySQL connectivity
sqliteSQLite file access
spatialGeospatial functions
ftsFull-text search

For a complete list of available extensions, see the DuckDB Extensions documentation.

DuckDB extensions enable access to various data sources. Enable the appropriate extension before using its functions.

Read from local filesystem:

-- Parquet files
SELECT * FROM read_parquet('data/*.parquet');
-- CSV files
SELECT * FROM read_csv('data/users.csv');
-- JSON files
SELECT * FROM read_json('data/events.json');
-- Glob patterns
SELECT * FROM read_parquet('data/year=*/month=*/*.parquet');

Read from HTTP/HTTPS URLs:

-- HTTP CSV
SELECT * FROM read_csv('https://example.com/data.csv');
-- HTTPS Parquet
SELECT * FROM read_parquet('https://bucket.s3.amazonaws.com/data.parquet');

Read from Amazon S3:

-- S3 with public access
SELECT * FROM read_parquet('s3://public-bucket/data.parquet');
-- S3 with credentials (via secrets)
SELECT * FROM read_parquet('s3://private-bucket/data.parquet');

Connect to PostgreSQL databases:

-- Attach PostgreSQL database (connection string format)
ATTACH 'dbname=mydb host=localhost port=5432 user=postgres password=secret' AS pg (TYPE postgres);
-- Or use PostgreSQL URI format
ATTACH 'postgresql://user:pass@host:5432/db' AS pg (TYPE postgres);
-- Query tables
SELECT * FROM pg.public.users;
-- Read-only connection
ATTACH 'dbname=mydb host=localhost' AS pg_ro (TYPE postgres, READ_ONLY);

Note: Avoid credentials in connection strings for production. Use DuckDB secrets instead.

Connect to MySQL databases:

-- Attach MySQL database (key=value format, NOT URI)
ATTACH 'host=localhost user=root password=secret port=3306 database=mydb' AS mysql (TYPE mysql);
-- Query tables
SELECT * FROM mysql.users;
-- Read-only connection
ATTACH 'host=localhost database=mydb' AS mysql_ro (TYPE mysql, READ_ONLY);

Note: MySQL uses key=value connection strings, not URI format.

Read SQLite databases:

-- Attach SQLite database
ATTACH 'path/to/database.sqlite' AS sqlite_db;
-- Query tables
SELECT * FROM sqlite_db.users;

Configure secrets in ~/.mxcp/config.yml:

mxcp: 1
projects:
my_project:
profiles:
dev:
secrets:
# S3 credentials
- name: s3_creds
type: s3
parameters:
KEY_ID: "${AWS_ACCESS_KEY_ID}"
SECRET: "${AWS_SECRET_ACCESS_KEY}"
REGION: "us-east-1"
# HTTP authentication
- name: http_auth
type: http
parameters:
BEARER_TOKEN: "${API_TOKEN}"
# PostgreSQL connection
- name: pg_connection
type: postgres
parameters:
HOST: "db.example.com"
PORT: 5432
USER: "${DB_USER}"
PASSWORD: "${DB_PASSWORD}"
DATABASE: "production"

In mxcp-site.yml:

mxcp: 1
project: my_project
profile: dev
secrets:
- s3_creds
- http_auth
- pg_connection
TypeUse CaseParameters
s3AWS S3 accessKEY_ID, SECRET, REGION, ENDPOINT
gcsGoogle Cloud StorageBUCKET, KEY_FILE
azureAzure Blob StorageCONNECTION_STRING
httpHTTP authenticationBEARER_TOKEN, EXTRA_HTTP_HEADERS
postgresPostgreSQLHOST, PORT, USER, PASSWORD, DATABASE
mysqlMySQLHOST, PORT, USER, PASSWORD, DATABASE
secrets:
- name: custom_api
type: http
parameters:
EXTRA_HTTP_HEADERS:
Authorization: "Bearer ${API_TOKEN}"
X-API-Key: "${API_KEY}"
X-Custom-Header: "custom_value"

Use DuckDB features in MXCP endpoints:

tools/query_remote.yml
mxcp: 1
tool:
name: query_remote_data
description: Query remote Parquet files
parameters:
- name: year
type: integer
description: Year to query
return:
type: array
items:
type: object
source:
code: |
SELECT *
FROM read_parquet(
's3://data-bucket/year=' || $year || '/*.parquet'
)
LIMIT 100

Join data from multiple sources:

tools/cross_source_report.yml
mxcp: 1
tool:
name: cross_source_report
description: Join local and remote data
return:
type: array
items:
type: object
source:
code: |
WITH local_customers AS (
SELECT * FROM customers
),
remote_orders AS (
SELECT * FROM read_parquet('s3://orders/*.parquet')
)
SELECT
c.customer_id,
c.name,
COUNT(o.order_id) as order_count
FROM local_customers c
LEFT JOIN remote_orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name

Pre-compute expensive queries:

-- Create materialized view
CREATE TABLE customer_stats AS
SELECT
customer_id,
COUNT(*) as order_count,
SUM(total) as total_spent
FROM orders
GROUP BY customer_id;

Query partitioned data efficiently:

-- Only read relevant partitions
SELECT *
FROM read_parquet('s3://bucket/year=*/month=*/*.parquet')
WHERE year = 2024 AND month = 6;

DuckDB automatically parallelizes queries:

-- Configure thread count
SET threads TO 8;
-- Configure memory limit
SET memory_limit TO '4GB';

Optimize specific queries:

-- Force specific join order
SELECT /*+ MERGE_JOIN(a, b) */ *
FROM table_a a
JOIN table_b b ON a.id = b.a_id;

Cache expensive remote queries:

tools/cached_query.yml
mxcp: 1
tool:
name: cached_analytics
description: Query cached remote data
return:
type: array
items:
type: object
source:
code: |
-- Use dbt to materialize this
SELECT * FROM cached_remote_data

With dbt model:

models/marts/cached_remote_data.sql
{{ config(materialized='table') }}
SELECT *
FROM read_parquet('https://example.com/large_dataset.parquet')

Query by time range:

SELECT *
FROM read_parquet('s3://logs/*.parquet')
WHERE event_time >= CURRENT_DATE - INTERVAL '7 days';

Work with JSON data:

-- Extract string values (use ->> or json_extract_string)
SELECT
id,
data ->> '$.name' as name,
data ->> '$.nested.field' as nested_value
FROM read_json('data.json');
-- Extract JSON values (use -> or json_extract, returns quoted strings)
SELECT
id,
data -> '$.config' as config_json
FROM read_json('data.json');
-- Unnest arrays
SELECT
id,
unnest(data -> '$.items') as item
FROM events;

Search text columns using the FTS extension:

-- Create FTS index (parameters: table, id_column, text_columns...)
PRAGMA create_fts_index('documents', 'doc_id', 'content');
-- Search using BM25 scoring
SELECT doc_id, content, score
FROM (
SELECT *, fts_main_documents.match_bm25(doc_id, 'search query') AS score
FROM documents
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;
-- Drop and recreate index after data changes
PRAGMA drop_fts_index('documents');
PRAGMA create_fts_index('documents', 'doc_id', 'content');

Note: FTS indexes don’t auto-update. Rebuild after data changes.

# Ensure extension is listed
extensions:
- httpfs # Must be listed to use HTTP URLs

Check secrets configuration:

Terminal window
# Verify secrets are loaded
mxcp validate
# Check secret values
mxcp dbt-config --embed-secrets --dry-run
# Increase timeout
extensions:
- name: httpfs
config:
http_timeout: 60000 # milliseconds
-- Reduce memory usage
SET memory_limit TO '2GB';
-- Or process in chunks
SELECT * FROM large_table LIMIT 1000 OFFSET 0;
mxcp-site.yml
extensions:
# Simple extension
- httpfs
# Extension with configuration
- name: httpfs
config:
http_timeout: 30000
# Community extension
- name: spatial
repo: community
# Nightly extension
- name: experimental_feature
repo: core_nightly
mxcp-site.yml
profiles:
default:
duckdb:
path: data/app.duckdb # Persistent (default: data/db-{profile}.duckdb)
readonly: false # Set true for read-only access

DuckDB settings like threads, memory_limit, and temp_directory can be configured via SQL pragmas in your queries or setup scripts:

-- In sql/setup.sql or at query time
SET threads = 4;
SET memory_limit = '4GB';
SET temp_directory = '/tmp/duckdb';
extensions:
- httpfs # Only if using HTTP
- parquet # Only if using Parquet

Never hardcode credentials. See Configuration for details.

# Good: Use secrets
secrets:
- s3_creds
# Bad: Hardcoded values
# Never do this!

Use dbt to materialize:

{{ config(materialized='table') }}
SELECT * FROM expensive_query

Organize data by query patterns:

s3://bucket/
year=2024/
month=01/
data.parquet
month=02/
data.parquet
-- Enable profiling
PRAGMA enable_progress_bar;
EXPLAIN ANALYZE SELECT * FROM large_table;

Use drift detection to catch unexpected schema changes:

Terminal window
# Create baseline
mxcp drift-snapshot
# Check for changes
mxcp drift-check

See Drift Detection for details.