CA Biositing Pipeline

ETL pipeline for the CA Biositing project. It extracts biomass feedstock data from Google Sheets and external sources, transforms it with pandas, and loads it into PostgreSQL.

Workflows are orchestrated with Prefect and share database models from the companion ca-biositing-datamodels package.

Installation

pip install ca-biositing-pipeline

Quick Start

from ca_biositing.pipeline.flows.primary_ag_product import primary_ag_product_flow

# Run the primary agricultural product ETL flow
primary_ag_product_flow()

What's Included

Extract — Pull data from Google Sheets, shapefiles, and public datasets (USDA Census/Survey, LandIQ, Billion Ton)
Transform — Clean and reshape with pandas and pyjanitor
Load — Upsert into PostgreSQL with foreign-key resolution
Flows — Prefect flows combining extract/transform/load steps
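
The Load step's "upsert with foreign-key resolution" can be illustrated with a minimal sketch. This is not the package's actual loader: it uses the standard-library sqlite3 module in place of PostgreSQL so that it runs without a database server, and the table names, column names, and the helpers resolve_crop_id and upsert_yields are hypothetical. The ON CONFLICT clauses shown here are written the same way in PostgreSQL.

```python
import sqlite3

# Hypothetical schema for illustration only; the real models live in
# ca-biositing-datamodels and are not reproduced here.
SCHEMA = """
CREATE TABLE crop (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE yield_record (
    crop_id INTEGER NOT NULL REFERENCES crop(id),
    county  TEXT NOT NULL,
    tons    REAL NOT NULL,
    PRIMARY KEY (crop_id, county)
);
"""


def resolve_crop_id(conn, name):
    """Map a crop name to its primary key, creating the parent row if missing."""
    conn.execute(
        "INSERT INTO crop (name) VALUES (?) ON CONFLICT (name) DO NOTHING",
        (name,),
    )
    row = conn.execute("SELECT id FROM crop WHERE name = ?", (name,)).fetchone()
    return row[0]


def upsert_yields(conn, records):
    """Insert new yield rows, or update tons when (crop_id, county) already exists."""
    for rec in records:
        crop_id = resolve_crop_id(conn, rec["crop"])
        conn.execute(
            """
            INSERT INTO yield_record (crop_id, county, tons)
            VALUES (?, ?, ?)
            ON CONFLICT (crop_id, county) DO UPDATE SET tons = excluded.tons
            """,
            (crop_id, rec["county"], rec["tons"]),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# First load: two new rows that share one parent crop.
upsert_yields(conn, [
    {"crop": "almond", "county": "Kern", "tons": 10.0},
    {"crop": "almond", "county": "Fresno", "tons": 5.0},
])
# Second load: the same natural key arrives again with a revised value.
upsert_yields(conn, [{"crop": "almond", "county": "Kern", "tons": 12.5}])

rows = conn.execute(
    """
    SELECT c.name, y.county, y.tons
    FROM yield_record AS y
    JOIN crop AS c ON c.id = y.crop_id
    ORDER BY y.county
    """
).fetchall()
print(rows)
```

Resolving the parent key before writing the child row means incoming records can carry human-readable names rather than database ids, and re-running a load updates existing rows instead of duplicating them.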

Key Dependencies

ca-biositing-datamodels — shared database models
Prefect — workflow orchestration
pandas — data manipulation
gspread — Google Sheets integration
GeoPandas — geospatial data handling

Acknowledgement

We acknowledge software engineering support from the University of Washington Scientific Software Engineering Center (SSEC), as part of the Schmidt Sciences Virtual Institute for Scientific Software (VISS).

License

CA Biositing Pipeline is licensed under the open-source BSD 3-Clause License.