Alembic & Materialized View Workflow

Overview

This document describes the architecture and workflow for managing materialized views in the ca-biositing project. The key principle is immutability: view definitions are frozen in Alembic migrations as raw SQL strings, never imported dynamically at upgrade time.

Architecture

Two-Part System

The project uses a dual-definition system for materialized views:

Python View Modules (src/ca_biositing/datamodels/data_portal_views/mv_*.py)
Pure SQLAlchemy select() expressions
Used for development, testing, and documentation
NOT used during migration/deployment
Can be freely modified and tested locally
Alembic Migrations (alembic/versions/*.py)
Immutable raw SQL strings frozen at the time of creation
Used during deployment and schema evolution
Define the actual database schema
Are the single source of truth for the live database

Why Two Definitions?

This separation prevents a critical class of deployment failures:

Problem: If migrations imported Python view definitions directly, upgrading would require running the entire ORM layer during deployment
Risk: Large imports can hang, timeout, or introduce unexpected behavior
Solution: Migrations contain the compiled SQL only, making them fast and deterministic

Current Materialized Views

The project has 10 data portal materialized views managed under this pattern:

View Name	Purpose	Key Columns
`mv_biomass_search`	Full-text search on resources	id, resource_id, search_vector
`mv_biomass_availability`	Seasonal availability data	resource_id, from_month, to_month
`mv_biomass_composition`	Analysis data aggregated by type	id, resource_id, geoid, county, analysis_type
`mv_biomass_county_production`	County-level production estimates	id, resource_id, geoid, scenario_name
`mv_biomass_end_uses`	Product end uses and trends	resource_id, use_case
`mv_biomass_fermentation`	Fermentation experiment results	id, resource_id, geoid, county, strain_name
`mv_biomass_gasification`	Gasification experiment results	id, resource_id, geoid, parameter_name
`mv_biomass_pricing`	Historical commodity pricing	id, resource_id, geoid, county
`mv_biomass_sample_stats`	Sample aggregation statistics	resource_id, sample_count
`mv_usda_county_production`	USDA census data aggregation	id, resource_id, geoid

Bold columns = Added during PR f989683 consolidation (geographic grouping with county)

File Organization

Python View Modules

src/ca_biositing/datamodels/ca_biositing/datamodels/data_portal_views/
├── __init__.py                    # Exports all view objects for backward compatibility
├── mv_biomass_search.py           # SQLAlchemy select() for search view
├── mv_biomass_availability.py     # SQLAlchemy select() for availability view
├── mv_biomass_composition.py      # SQLAlchemy select() for composition view
├── mv_biomass_county_production.py
├── mv_biomass_end_uses.py
├── mv_biomass_fermentation.py
├── mv_biomass_gasification.py
├── mv_biomass_pricing.py
├── mv_biomass_sample_stats.py
└── mv_usda_county_production.py

Each module contains:

SQLAlchemy select() expression (pure Python)
Comments documenting required indexes
Comments documenting geographic/temporal columns

Example structure:

# mv_biomass_composition.py
"""
mv_biomass_composition.py

Compositional analysis data aggregated across different analysis types
(compositional, proximate, ultimate, xrf, icp, calorimetry, xrd, ftnir, pretreatment).

Grouped by resource_id, analysis_type, parameter_name, unit, and geoid from field sample.

Required indexes:
    CREATE INDEX idx_mv_biomass_composition_resource_id ON data_portal.mv_biomass_composition (resource_id)
    CREATE INDEX idx_mv_biomass_composition_geoid ON data_portal.mv_biomass_composition (geoid)
    CREATE INDEX idx_mv_biomass_composition_county ON data_portal.mv_biomass_composition (county)
    CREATE INDEX idx_mv_biomass_composition_analysis_type ON data_portal.mv_biomass_composition (analysis_type)
    ... etc
"""

from sqlalchemy import select, func, union_all, literal
from ca_biositing.datamodels.models.resource_information.resource import Resource
# ... other imports ...

def get_composition_query(model, analysis_type):
    """Generate a select statement for a specific analysis record type with geoid from field sample."""
    return select(
        model.resource_id,
        literal(analysis_type).label("analysis_type"),
        Parameter.name.label("parameter_name"),
        Observation.value.label("value"),
        Unit.name.label("unit"),
        LocationAddress.geography_id.label("geoid")
    ).join(Observation, Observation.record_id == model.record_id)\
     .join(Parameter, Observation.parameter_id == Parameter.id)\
     # ... more joins ...

# ... view definition ...
mv_biomass_composition = select(
    func.row_number().over(...).label("id"),
    all_measurements.c.resource_id,
    # ... columns ...
).select_from(all_measurements)\
 .join(Resource, ...)\
 .group_by(...)

Alembic Migrations

alembic/versions/
├── 9e8f7a6b5c54_consolidated_pr_f989683_views_with_geoid.py  # Creates all 10 views with immutable SQL
├── 9e8f7a6b5c52_integrate_pr_f989683_indexes.py              # Creates 27 indexes
└── ... (other migrations)

Key migration: 9e8f7a6b5c54_consolidated_pr_f989683_views_with_geoid.py

Contains complete SQL for all 10 materialized views
Uses raw SQL strings (op.execute("""..."""))
Includes DROP statements for safe re-creation
Never imports Python view modules

Workflow: When You Need to Update a View

Scenario 1: Updating a View Definition

If you need to change a view's logic (e.g., add a column, change filters, fix a join):

Step 1: Edit the Python Module (For Development)

# src/ca_biositing/datamodels/data_portal_views/mv_biomass_composition.py
# Make changes to the SQLAlchemy select() expression

Step 2: Test Locally

# Test the view definition works
pixi run python3 << 'EOF'
from ca_biositing.datamodels.data_portal_views import mv_biomass_composition
from sqlalchemy.dialects import postgresql

# Compile to SQL for inspection
sql = str(mv_biomass_composition.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={'literal_binds': True}
))
print(sql)
EOF

Step 3: Compile to PostgreSQL SQL

# Generate the compiled SQL string
pixi run python3 << 'EOF'
from ca_biositing.datamodels.data_portal_views import mv_biomass_composition
from sqlalchemy.dialects import postgresql

sql = str(mv_biomass_composition.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={'literal_binds': True}
))

# Copy this output for use in the migration file
print(sql)
EOF

Step 4: Create a New Alembic Migration

pixi run alembic revision -m "Update mv_biomass_composition with [description of changes]"

This creates: alembic/versions/[new_id]_update_mv_biomass_composition_with_[description].py

Step 5: Fill in the Migration

Edit the migration file:

def upgrade() -> None:
    """Drop and recreate mv_biomass_composition with updated logic."""

    # Drop the old view
    op.execute("DROP MATERIALIZED VIEW IF EXISTS data_portal.mv_biomass_composition CASCADE")

    # Recreate with new SQL (copied from step 3)
    op.execute("""
        CREATE MATERIALIZED VIEW data_portal.mv_biomass_composition AS
        SELECT ... (paste the compiled SQL here) ...
    """)

    # Recreate indexes if columns changed
    op.execute("""CREATE INDEX idx_mv_biomass_composition_resource_id ON data_portal.mv_biomass_composition (resource_id)""")
    op.execute("""CREATE INDEX idx_mv_biomass_composition_geoid ON data_portal.mv_biomass_composition (geoid)""")
    # ... etc for all indexes ...

def downgrade() -> None:
    """Drop and restore previous version of mv_biomass_composition."""

    op.execute("DROP MATERIALIZED VIEW IF EXISTS data_portal.mv_biomass_composition CASCADE")

    # Recreate with previous SQL (keep this from git history or manual backup)
    op.execute("""
        CREATE MATERIALIZED VIEW data_portal.mv_biomass_composition AS
        SELECT ... (previous SQL) ...
    """)

    # Recreate previous indexes
    # ... etc ...

Step 6: Test the Migration

# Run migrations
POSTGRES_HOST=localhost pixi run migrate

# Verify view exists and has correct columns
POSTGRES_HOST=localhost pixi run access-db << 'EOF'
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'data_portal'
  AND table_name = 'mv_biomass_composition'
ORDER BY ordinal_position;
EOF

# Verify data is correct
POSTGRES_HOST=localhost pixi run access-db << 'EOF'
SELECT * FROM data_portal.mv_biomass_composition LIMIT 5;
EOF

Step 7: Commit and Push

git add alembic/versions/[new_migration_file]
git add src/ca_biositing/datamodels/data_portal_views/mv_biomass_composition.py
git commit -m "Update mv_biomass_composition: [description]"
git push origin [branch]

Scenario 2: Adding a New Materialized View

Step 1: Create a Python Module

# src/ca_biositing/datamodels/data_portal_views/mv_new_view.py
"""
mv_new_view.py

Description of the view's purpose and use case.

Required indexes:
    CREATE UNIQUE INDEX idx_mv_new_view_id ON data_portal.mv_new_view (id)
    ... etc
"""

from sqlalchemy import select, func
from ca_biositing.datamodels.models import ...

mv_new_view = select(
    func.row_number().over(order_by=(...)).label("id"),
    # ... columns ...
).select_from(...)\
 .join(...)\
 .group_by(...)

Step 2: Update `init.py`

# src/ca_biositing/datamodels/data_portal_views/__init__.py
from .mv_new_view import mv_new_view

__all__ = [
    'mv_biomass_search',
    # ... existing views ...
    'mv_new_view',  # Add here
]

Step 3: Compile to SQL

pixi run python3 << 'EOF'
from ca_biositing.datamodels.data_portal_views import mv_new_view
from sqlalchemy.dialects import postgresql

sql = str(mv_new_view.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={'literal_binds': True}
))
print(sql)
EOF

Step 4: Create Migration

pixi run alembic revision -m "Add mv_new_view materialized view"

Step 5: Fill in Migration

def upgrade() -> None:
    """Create mv_new_view materialized view."""

    op.execute("""
        CREATE MATERIALIZED VIEW data_portal.mv_new_view AS
        SELECT ... (compiled SQL) ...
    """)

    # Create indexes
    op.execute("""CREATE UNIQUE INDEX idx_mv_new_view_id ON data_portal.mv_new_view (id)""")
    op.execute("""CREATE INDEX idx_mv_new_view_resource_id ON data_portal.mv_new_view (resource_id)""")
    # ... etc ...

def downgrade() -> None:
    """Drop mv_new_view materialized view."""
    op.execute("DROP MATERIALIZED VIEW IF EXISTS data_portal.mv_new_view CASCADE")

Step 6: Test and Commit (as above)

Important Rules

✓ DO

Edit Python view modules freely - they are for development and testing
Compile to SQL before creating migrations - ensures the SQL is what you tested
Use raw SQL strings in migrations - immutability is the goal
Include DROP statements - allows safe re-creation during migration
Create separate migrations for view changes - one view per migration for clarity
Document required indexes in Python modules - helps future developers
Test migrations locally - run pixi run migrate before pushing

✗ DON'T

Do NOT import Python view modules in migrations - defeats the immutability purpose
Do NOT embed Python code in migrations - migrations must be deterministic
Do NOT modify migrations after they've been deployed - immutability is the contract
Do NOT manually craft SQL without testing - compile from Python first
Do NOT forget to test migrations locally - migrations are permanent

Example: The PR f989683 Consolidation

The recent migration consolidation (PR f989683) exemplifies this workflow:

Before:

3 separate migration files with broken/incomplete SQL
Syntax errors and truncated view definitions
Missing geographic (county) columns in some views

Solution:

Read all 10 Python view modules
Compiled each to PostgreSQL SQL
Created consolidated migration 9e8f7a6b5c54 with all 10 views as raw SQL
Fixed errors identified during compilation
Added missing columns (county) by extending the SQL
Created index migration 9e8f7a6b5c52 to handle all 27 indexes
Tested end-to-end: pixi run migrate
Verified all views exist and have correct data

This approach ensures:

All SQL is reviewed and tested before deployment
No dynamic imports during upgrade
Easy rollback via downgrade migrations
Clear audit trail of schema changes

Refreshing Materialized Views (Post-Migration)

After views are created or updated, refresh their data:

# Refresh all data portal views
pixi run refresh-views

# Or refresh manually
POSTGRES_HOST=localhost pixi run access-db << 'EOF'
REFRESH MATERIALIZED VIEW CONCURRENTLY data_portal.mv_biomass_search;
REFRESH MATERIALIZED VIEW CONCURRENTLY data_portal.mv_biomass_composition;
-- ... etc for all views ...
EOF

Note: Use CONCURRENTLY only if the view has a UNIQUE index (supports concurrent refresh without locking).

Migration Consolidation Summary: docs/pr/PR_f989683_migration_consolidation.md
Detailed Handoff Document: plans/migration_consolidation_handoff_phase6.md
Initial Refactor Plan: plans/data_portal_view_refactor_simple.md
Alembic Documentation: https://alembic.sqlalchemy.org/
SQLAlchemy Compilation: https://docs.sqlalchemy.org/en/20/faq/sql_expressions.html#how-do-i-construct-a-textual-sql-fragment-that-is-database-specific

FAQ

Q: Why can't I just modify the Alembic migration file to import the Python view? A: Because migrations run during deployment when imports can hang. Raw SQL is fast and deterministic.

Q: What if I make a mistake in the Python module? A: That's fine! Test it, fix it, then compile again and create a new migration. The Python module is for development.

Q: Do I have to manually compile to SQL every time? A: Yes, currently. This ensures you review the generated SQL before committing. Future enhancements could automate this.

Q: What if I forget to update the Python module when creating a migration? A: That's okay if you only changed the SQL. But for clarity, update both. The Python module documents the view's intended structure.

Q: How do I rollback a view change? A: Run pixi run alembic downgrade -1 to revert to the previous migration, which recreates the old view.

Q: Can I have two versions of a view? A: No, but you can create a new view with a new name and deprecate the old one over time.

Q: Do I need to refresh views after every migration? A: Not after creation/alteration (schema changes). But yes if the underlying data has changed and you need fresh results.

Summary

The dual-definition system (Python modules + Alembic migrations) provides:

Safety: Immutable migrations prevent runtime surprises
Clarity: Raw SQL is explicit and reviewable
Flexibility: Python modules let developers experiment locally
Maintainability: Clear separation of concerns
Scalability: Easy to add new views or update existing ones

Always remember: The Alembic migration is the source of truth for the live database.