DBT / How-to

How to Create CI/CD Pipelines for dbt Core

Paul Fry
7 min readAug 15, 2023

This post dives into the use of CI/CD for dbt Core, providing insights on dbt Cloud’s ‘Slim CI’ CICD job pattern and how to implement this using dbt Core. We’ll also touch on other common options available to enhance your dbt CI/CD pipelines.

Agenda

  1. Using CI/CD to Verify Your dbt Changes
  2. What is the dbt ’Slim CI’ Job Pattern?
  3. How to Recreate the ‘Slim CI’ Pattern in dbt Core
  4. Enhancing Your dbt CI/CD Pipeline
  5. Summary
  6. Useful Resources

1. Using CI/CD to Verify Your dbt Changes

In the context of dbt, CI/CD can be used to streamline the testing and deployment processes for files within your dbt project repo.

For instance, you can ensure the reliability and deployment readiness of your data transformations by triggering a CI/CD pipeline upon code changes, thereby automating the testing of your data models.

2. What is the dbt Cloud ‘Slim CI’ Job Pattern?

The ‘Slim CI’ job pattern, native to dbt Cloud, is designed to test only modified dbt models when a pull request is created in your dbt Git repository. This is described in detail here.

Source: docs.getdbt.com/docs/deploy/continuous-integration

While it’s available in dbt Cloud, this pattern can also be implemented in dbt Core. To achieve this, you can leverage the dbt state:modified and defer functionality. These concepts allow dbt to identify and test only the models that have been changed. This is described further below.

What is dbt ‘State’ Processing?

We use the state:modified and deferflags to instruct our CI/CD pipeline to only run dbt models that have been modified since the last successful build. Here’s a simple example showing how the state:modified and defer flags can be used within a slim-ci job:

version: 2
jobs:
dbt_slim_ci:
docker:
- image: your_dbt_image:latest
steps:
- checkout
- run: dbt deps # Install dependencies
- run: |

OUTPUT=$(dbt ls --selector state:modified+ --defer --state=prod-run-artifacts) # Get a list of modified dbt models
if [[ -n "$OUTPUT" ]]; then
dbt run --models "$OUTPUT" # Run dbt only on modified models
else
echo "No modified models found. Skipping dbt run."
fi

In the above code block, we’ve used the dbt state:modified flag in combination with the ls command to list all modified dbt models. We can use the same idea to only run/test modified dbt models, for example, using commands similar to:

dbt run --select state:modified+ --defer --state=prod-run-artifacts

However, the key detail that is missing in the above is how does dbt determine what models have been modified?

How Does dbt Know What Has Been Modified?

For dbt to understand the ‘state’ in the above code example, it needs to have access to two versions of the dbt manifest file (i.e., manifest.json):

  1. One used to describe the dbt project on the main/ masterbranch
  2. Another to describe the dbt project on the incoming feature branch

Only if we have two versions of the manifest file can dbt make a comparison to understand/use state processing. Which begs the question, how/where should we put this second manifest file?

dbt ‘Defer’

As described here, the defer flag is commonly used in combination with state to "enable 'Slim CI'".

Source: https://docs.getdbt.com/reference/node-selection/defer

In the example in the above screenshot, the defer arg is used to point to/reference a second dbt manifest file. Where it uses this, in combination with the state flag, to understand what dbt models have changed — i.e., enable state processing!

3. How to Recreate the ‘Slim CI’ Pattern in dbt Core

This section provides a high-level, step-by-step guide to replicating the ‘Slim CI’ pattern in dbt Core using state processing:

Step 1: Generate dbt Manifests

To create the manifest for your main/master branch, you’ll need to run dbt compile. In addition, as mentioned on dbt's defer documentation, you'll need to specify an output location for these 'prod run artefacts' using the arg target-path. For example:

dbt compile --target-path=prod-run-artifacts

dbt commonly gives the example directory name ‘prod-run-artefacts’, but you can use whatever name you wish.

Step 2 — Push dbt Manifest

We’ll need this manifest.json file to be available within our CI/CD job. As such, it makes sense to push (only) this manifest.json file to your Git main or master branch.

Step 3 — Generate Feature Branch Manifest

Within your CI/CD job, we’ll need to generate a manifest file for your feature branch as a basis for comparison. For example:

version: 2
jobs:
dbt_slim_ci:
docker:
- image: your_dbt_image:latest
steps:
- checkout

- run: |
dbt compile # Run dbt only on modified models

Note: in the dbt compile command used within the CI/CD job, we don't use the --target-path= arg. That's only used for the main/master branch.

Step 4 — Applying ‘State Processing’

Continuing on from the above CI/CD code, we then use the defer and state flags to determine what models have been modified:

version: 2
jobs:
dbt_slim_ci:
docker:
- image: your_dbt_image:latest
steps:
- checkout # on our feature branch

- run: |

dbt compile # compile our feature branch changes

dbt ls --select state:modified+ --defer --state=prod-run-artifacts # list the modified dbt models

dbt run --select state:modified+ --defer --state=prod-run-artifacts # run modified dbt models only

The above code is just a simple example, but you can see why/how two versions of the dbt manifest file are needed for dbt’ state processing’.

Establishing a dbt ‘slim CI’ job provides an excellent foundation for implementing further CI/CD checks against your dbt project since you’re targeting only modified dbt models to test…rather than an entire dbt project.

4. Enhancing Your dbt CI/CD Pipeline

I recommend first implementing this dbt’ slim CI’ pattern as the first/foundation of your CI/CD job. Once you have this in place, there are several great next-candidate tests to add. I’ve tried to loosely list these in order of complexity/effort to implement:

Candidate 1: SQLFluff

Source: https://docs.sqlfluff.com/en/stable/production.html#adding-diff-quality-to-your-builds

For example:

$ diff-quality --violations sqlfluff --compare-branch origin/master --fail-under 100

Using diff-quality ensures that SQLFluff lint is only performed on modified dbt models (like Slim CI above).

Warning: If you’re not currently using SQLFluff, use this with caution

If SQLFluff isn’t currently used within your team… you’ll need to uplift your existing SQL code to comply with SQLFluff. Otherwise, any SQLFluff linting performed within a CI/CD job will identify SQL linting issues with existing/legacy code.

Candidate 2: Git Standards Enforcement

Very trivial one (and little effort), I recommend using CI/CD to validate your Git standards. For example, I use CI/CD to validate the Git branch names to ensure it follows our naming conventions. For example:

# the regex for valid git branch names (i.e., snake_case)
VALID_GIT_BRANCH_NAME="^(feature|hotfix|release)\/[a-z0-9_]+$"

# Validate the Git branch name used
if [[ ! $SOURCE_GIT_BRANCH_NAME =~ $VALID_GIT_BRANCH_NAME ]]
then
echo
echo -e "${RED}#############################################################################################"
echo -e "${RED}# ERROR: Invalid Git branch name."
echo -e "${RED}#############################################################################################${COLOUR_OFF}"
echo -e "Local git branch name = $SOURCE_GIT_BRANCH_NAME"

...

fi

This isn’t the complete code used, but it gives a flavour of what it’s doing.

Candidate 3 (High-Value): dbt-checkpoint

I strongly recommend using this. dbt-checkpoint provides a variety of pre-commit hooks that can be used within a CI/CD job to verify whether data quality checks have been set (a test to ensure (data quality) tests have been set.)

Using dbt-checkpoint to Verify DQ Checks Have Been Set

I won’t digress into the details of Data Quality dimensions, but dbt’s testing features that are shipped ‘out of the box’ can be used for many of the key dimensions of data quality, for example:

  • Data Uniqueness, using dbt’s unique test.
  • Data Completeness, using not null test.
  • Data Validity, using accepted value test.
  • Data Timeliness, using dbt ‘freshness’ checks.
  • Data Relevance, using dbt’s ‘singular tests’ to implement business rules.

As such, dbt-checkpoint can be used to ensure the dbt DQ tests have been set.

How dbt-checkpoint can be used to address DQ Dimensions

Candidate 4: data-diff

data-diff for dbt can be used to compare row counts between two tables, where you'd typically do this by comparing an original table version against a version containing your proposed table revision. To do this, you need to specify the 'development' database (and schema) that's to be used to store your proposed table changes.

For a good overview of using data-diff with dbt, see Automate dbt development testing in Snowflake with data-diff | datafold.com

5. Summary

There’s a lot of information listed above, but in summary:

  • If you want to use CI/CD against your dbt Core project, I recommend that you first replicate dbt Cloud’s ‘Slim CI’ pattern.
  • The above instructions provide a breakdown of the steps required to create this ‘Slim CI’ pattern.
  • Only once you’ve created this ‘Slim CI’ job should you investigate further dbt CI/CD checks.

--

--

Paul Fry

Welsh data architect, based in Dublin. Certified in dbt, Airflow, Snowflake & AWS