ETL Pipeline Workflow

This guide explains the structure of the ETL (Extract, Transform, Load) pipeline and how to use Prefect to execute it.

Core Concepts

The ETL pipeline extracts data from Google Sheets, transforms it using pandas, and loads it into the PostgreSQL database.

Structure: The core logic is organized into three main packages within src/ca_biositing/pipeline/ca_biositing/pipeline/etl/:
extract: Functions to pull raw data from sources (e.g., Google Sheets).
transform: Functions to clean, process, and structure the raw data.
load: Functions to insert the transformed data into the database using SQLAlchemy.
Hierarchical Pipelines: Individual pipelines are nested within subdirectories reflecting the data they handle (e.g., products, biomass).

Running the ETL Pipelines

The ETL system runs in a containerized Prefect environment.

Step 1: Start Services

pixi run start-services

Step 2: Deploy Flows

pixi run deploy

Step 3: Run the Master Pipeline

pixi run run-etl

Step 4: Monitor Access the Prefect UI at http://localhost:4200.

How to Add a New ETL Flow

Step 1: Create the Task Files Create the three Python files for your extract, transform, and load logic in the appropriate subdirectories under src/ca_biositing/pipeline/ca_biositing/pipeline/etl/. Decorate each function with @task.

Step 2: Create the Pipeline Flow Create a new file in src/ca_biositing/pipeline/ca_biositing/pipeline/flows/ to define the flow.

from prefect import flow
from ca_biositing.pipeline.etl.extract.samples.new_type import extract
from ca_biositing.pipeline.etl.transform.samples.new_type import transform
from ca_biositing.pipeline.etl.load.samples.new_type import load

@flow
def new_type_flow():
    raw_data = extract()
    transformed_data = transform(raw_data)
    load(transformed_data)

Step 3: Register the New Flow Add your flow to the AVAILABLE_FLOWS dictionary in resources/prefect/run_prefect_flow.py.

Step 4: Deploy and Run

pixi run deploy
pixi run run-etl

Using Templates

Template files are available in src/ca_biositing/pipeline/ca_biositing/pipeline/etl/templates/ to help you get started with new tasks.