Prefect Workflow for BioCirV ETL

This document outlines the steps to start, run, and monitor the ETL pipeline using Prefect and Docker.

1. Start the Prefect Environment

All services (Postgres database, Prefect server, and Prefect worker) are managed via Docker Compose through Pixi tasks.

To start all services in detached mode:

pixi run start-services

This command will:

  • Build the Docker images if they don't exist.
  • Start the containers for the database, the Prefect server, and the Prefect worker.
  • Connect the Prefect worker to the server and its work pool automatically.

You can check the status of the containers with pixi run service-status.
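A container that is listed as running may still be initializing, so it can help to wait until the Prefect API responds before deploying. The following is a minimal sketch using only the Python standard library. It assumes the Compose setup publishes the Prefect API on localhost:4200 (Prefect's default port) and polls the server's /api/health endpoint; the helper names is_healthy and wait_until_healthy are illustrative and not part of this project.

```python
import time
import urllib.error
import urllib.request

# Assumption: the Prefect API is reachable on the default port 4200.
# Adjust this if the Compose file maps a different host or port.
HEALTH_URL = "http://localhost:4200/api/health"


def is_healthy(url: str = HEALTH_URL, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_healthy(check=is_healthy, attempts: int = 30, delay: float = 2.0) -> bool:
    """Call `check` until it succeeds or `attempts` calls have been made."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next attempt
    return False
```

Calling wait_until_healthy() after pixi run start-services would block for up to about a minute with these defaults and return False if the server never answers. The check argument is injectable so the retry logic can be exercised without a running server.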

2. Deploy the ETL Flow(s)

Before you can run the pipeline, you need to deploy your flow(s) to the Prefect server. This registers the flow and makes it available to be run by a worker.

To deploy the flow(s):

pixi run deploy

This command uses the configuration defined in resources/prefect/prefect.yaml.

3. Run the ETL Pipeline

Once the environment is running and the flow is deployed, you can trigger a pipeline run.

Running the Master Pipeline

To run the entire ETL process, which orchestrates all the sub-flows:

pixi run run-etl

This command instructs the Prefect server to create a new flow run for the master deployment. The Prefect worker will pick up this run and execute the entire pipeline.
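For illustration, a flow run can also be requested directly from the Prefect server's REST API. The sketch below is not how the run-etl task is implemented (this document does not show that); it assumes the API is published at http://localhost:4200/api and uses the server's deployment lookup-by-name and create_flow_run endpoints. The flow and deployment names passed in are placeholders, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Prefect API base URL for a default local setup.
API_URL = "http://localhost:4200/api"


def deployment_lookup_url(flow_name: str, deployment_name: str, api_url: str = API_URL) -> str:
    """Build the URL that looks up a deployment by flow and deployment name."""
    flow = urllib.parse.quote(flow_name, safe="")
    deployment = urllib.parse.quote(deployment_name, safe="")
    return f"{api_url}/deployments/name/{flow}/{deployment}"


def create_flow_run_request(deployment_id: str, parameters=None, api_url: str = API_URL):
    """Build (but do not send) the POST request that creates a flow run."""
    body = json.dumps({"parameters": parameters or {}}).encode("utf-8")
    return urllib.request.Request(
        f"{api_url}/deployments/{deployment_id}/create_flow_run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_flow_run(flow_name: str, deployment_name: str, parameters=None, api_url: str = API_URL) -> str:
    """Resolve the deployment ID, create a flow run, and return the run's ID."""
    lookup = deployment_lookup_url(flow_name, deployment_name, api_url)
    with urllib.request.urlopen(lookup) as response:
        deployment_id = json.load(response)["id"]
    request = create_flow_run_request(deployment_id, parameters, api_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

With the services running, trigger_flow_run("<flow name>", "<deployment name>") would return the ID of the new flow run, which the worker then picks up from its work pool. The actual names are defined in resources/prefect/prefect.yaml.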

4. Monitor the Pipeline

You can monitor progress and view the logs of your pipeline runs in real time through the Prefect UI.

From the UI, you can see:

  • The status of each task (running, completed, failed).
  • Detailed logs for each task.
  • A history of all flow runs.
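The same run history is available programmatically, which can be useful for scripted checks. The sketch below assumes the Prefect API is published at http://localhost:4200/api and queries the server's flow_runs/filter endpoint; it relies on each returned flow run carrying a state_type field such as COMPLETED or FAILED. The function names are illustrative and not part of this project.

```python
import json
import urllib.request
from collections import Counter

# Assumption: the Prefect API base URL for a default local setup.
API_URL = "http://localhost:4200/api"


def fetch_recent_flow_runs(limit: int = 20, api_url: str = API_URL) -> list:
    """Fetch the most recent flow runs as a list of dictionaries."""
    body = json.dumps({"limit": limit, "sort": "EXPECTED_START_TIME_DESC"}).encode("utf-8")
    request = urllib.request.Request(
        f"{api_url}/flow_runs/filter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_states(flow_runs: list) -> Counter:
    """Count flow runs per state type; runs without a state count as UNKNOWN."""
    return Counter(run.get("state_type") or "UNKNOWN" for run in flow_runs)
```

For example, summarize_states(fetch_recent_flow_runs()) gives a quick tally of how many recent runs completed or failed. The UI remains the better place to read per-task logs.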

5. Stop the Environment

To stop all running Docker containers:

pixi run teardown-services

To also remove the PostgreSQL data volume, which permanently deletes all stored data:

pixi run teardown-services-volumes