How to Test in Data & Analytics (dbt focused)
Before starting: This is how we currently test our Data and Analytics models at The Fabulous. This is internal documentation that sums up the Data Team’s learnings since I joined the company two years ago. Take it with a grain of salt, as we do (we are always improving on this).
Testing in Software Engineering
Testing is a foundational aspect of Software Engineering. As the Data Team is, at the end of the day, an engineering team (Analytics Engineering), testing is a highly critical aspect of what we do.
If testing is new as a concept to you, please read up on it online. This article has relevant and concise information around this: https://www.guru99.com/software-testing-introduction-importance.html
Testing in Data & Analytics
As Engineering around Data & Analytics is relatively new and not as mature as the other Engineering domains (Apps, Backend, Front-End, DevOps, Infra, etc…), Data Teams are generally not good at testing. Most functional, business-focused teams are not even aware of the topic to begin with. For instance, historical BI (Business Intelligence) teams don’t apply methodical testing beyond double-checking a query/chart/dashboard. This is highly fragile, as it makes the whole system prone to errors and ultimately unreliable (broken queries, broken KPIs, undetected bugs in code, wrong KPIs leading to ‘risky’ business decisions, an unmaintainable system, etc…).
Long story short, we don’t want to be like this, even though when the team was first created at the end of 2020 we still had many bad practices. Our goal is to keep improving and level up to at least the same standards as the other Engineering teams (Dev Teams) at Fabulous.
Luckily, the global data community is also maturing and improving on aspects related to Software Engineering. dbt, as the best Analytics Modelling/Engineering tool for the modern Data Stack currently (strong personal opinion), has solved (and is still solving) many pain points for data and analytics practitioners. One of the key points dbt definitely helped solve is Testing. The next sections will focus on how we test using dbt.
Note: The Data Team doesn’t do Data Engineering per se, as we use tools for that (Fivetran), and when needed we get help from the backend team, who is responsible for the heavier lifting. What we focus on is driving value using Analytics and Data Science (the data is already present for us with minimal investment from us).
0. dbt Testing
dbt offers 3 types of Testing:
- source freshness: https://docs.getdbt.com/reference/commands/source
- generic tests: https://docs.getdbt.com/docs/building-a-dbt-project/tests#generic-tests (also known, previously, as schema tests)
- singular tests: https://docs.getdbt.com/docs/building-a-dbt-project/tests#singular-tests (also known, previously, as data or bespoke tests)
Please read about these (links above) before continuing, as the next sections build on them.
The next sections cover how we use dbt’s features and other strategies to test effectively in our context!
1. Sources Testing: Freshness & Recency
This one is probably the easiest. The purpose of these tests is to make sure that we have fresh data in our data sources (listed in the models/sources.yml file), and hence that our Analytics are reliable and up to date. It also happens every now and then that issues with data ingestion lead to no data getting uploaded to BQ in a timely way:
- For instance, in May/June 2022, new iTunes tables had some major unreliable delays in uploads → we asked Fivetran to check and fix what seemed to be a time-out/API-authentication issue due to a recent update on Apple’s iTunes side.
In those cases, the freshness tests help identify these quickly.
The standard for these is 24 hours to throw an error
Some additional notes:
- The command to run freshness tests is: dbt source freshness
- We also define 13 hours to throw a warning, which we are not currently (as of June 2022) really using. The difference is that an error breaks the pipeline and a warning doesn’t (the output of a warning gets saved in one of the run output files and needs to be processed to be useful, which we are not currently doing).
- If we switch at some point to the paid Cloud version of dbt, we will get both errors and warnings properly logged and accessible in their UI (we don’t plan to do this for now — as of June 2022).
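As a concrete illustration, a freshness block in models/sources.yml might look like the sketch below. The source name, table name, and `loaded_at_field` column are placeholders (the actual names in our sources.yml will differ):

```yaml
sources:
  - name: itunes                       # illustrative source name
    loaded_at_field: _fivetran_synced  # assumes a Fivetran-style sync timestamp column
    freshness:
      warn_after: {count: 13, period: hour}   # warning threshold (currently unused)
      error_after: {count: 24, period: hour}  # our 24-hour error standard
    tables:
      - name: sales_report             # placeholder table
```

Running `dbt source freshness` then checks the max of `loaded_at_field` against these thresholds.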
While this is the main freshness test we use, there is a specific use case where we additionally use another one: the dbt_utils.recency test. We apply this test to important Amplitude event models (staging models) to make sure that for those specific events we are still getting regular data as expected (even if the Amplitude data source as a whole is still uploading properly). This helps catch any tracking errors or bugs with those events. The standard is 1 day.
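A minimal sketch of how such a recency test could be attached to an event staging model (the model and column names here are hypothetical):

```yaml
models:
  - name: stg_amplitude__app_opened   # hypothetical Amplitude event staging model
    tests:
      - dbt_utils.recency:
          datepart: day
          field: event_time           # assumed event timestamp column
          interval: 1                 # fail if no data in the last 1 day
```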
2. Models Unit Testing: Generic Tests in yml files
These tests are what dbt calls generic tests. Why ‘generic’? Basically because we can use them in multiple places, as they don’t depend on specific models and their data. They only depend on a use case!
We use dbt built-in tests as much as possible (the most used ones are unique, not_null, accepted_values), then we use tests from trustworthy packages that are listed on dbt hub (like dbt_utils and dbt_expectations), and when needed we create our own custom tests (those live under macros/custom_tests).
In practice, those tests should populate the yml file attached to any model we create.
Note: We always create a .yml file for any model (always!) even if we don’t intend to test its columns/data at the time of creation
Those tests focus on 2 aspects:
- Testing columns and fields separately: if we expect a column to be unique, to have a specific set of values, to not be null, to fulfil some statistical expectation like not having more than 10% of nulls, etc… => These could be called column tests
- Testing relationships between columns: a uniqueness test over multiple columns is an easy example. Tests where we compare 2 columns (one is greater than the other, for instance) are another example. Usually we want to put these at the end of the model’s yml file, under the model-level tests key, but we don’t apply this rigidly. We might set this as a new convention later if needed.
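The two kinds of tests could look like the sketch below in a model’s yml file. The model is one of ours (ua_performance_spend), but the columns and thresholds are illustrative:

```yaml
models:
  - name: ua_performance_spend        # example model; columns below are illustrative
    columns:
      # column tests: each checks one column in isolation
      - name: spend
        tests:
          - not_null
      - name: platform
        tests:
          - accepted_values:
              values: ['ios', 'android']
    # relationship tests: placed at the model level, at the end
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - date
            - network
            - platform
      - dbt_utils.expression_is_true:
          expression: "clicks <= impressions"   # hypothetical cross-column check
```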
Special care/attention should be paid to tests that have a statistical nature (a column is allowed to have outliers, but not more than 10%; 90% of the values should be between x and y; etc…). This is important because data is not perfect and we need to be flexible with it on a case-by-case basis.
We need to test ‘smartly’ and avoid the following 2 pitfalls:
- Testing too rigidly: not allowing for ‘expected’ outliers when, say, we know that the tracking is faulty every now and then
- Testing too loosely: allowing for so many outliers that our Analytics and stats become unreliable and untrustworthy, or that we fail to detect important bugs in tracking
We ensure this by:
- Understanding why we are testing → catch a potential tracking bug, make sure our KPIs are solid, etc… => Business is the focus here
- Testing to prevent what went wrong before
- Projecting to test for what could go wrong in the future
How to set a threshold to tolerate expected outliers:
A good rule of thumb is to check historical patterns/distributions and set the threshold to catch values that go beyond mean ± 3 standard deviations.
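One way to encode such a threshold is via a dbt_expectations test, with bounds computed offline from the historical distribution. Everything here (model, column, bounds, date filter) is a hypothetical example:

```yaml
models:
  - name: stg_itunes__sales           # hypothetical model
    columns:
      - name: units
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0            # lower bound, from historical mean - 3 std
              max_value: 5000         # upper bound, from historical mean + 3 std
              row_condition: "date_day >= date_sub(current_date(), interval 30 day)"
```

The bounds are static here, so they should be revisited when the underlying distribution shifts.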
Apply filters methodically
- For partitioned models, we usually apply a date filter in all tests to make them more efficient. This is the case, for instance, for staging models (as those are usually views over partitioned data sources) and Amplitude staging models (incremental, partitioned, heavy models focused on 1 event at a time).
- Splitting by important dimensions like platform: since patterns across platforms can differ, setting a test per platform, or one that groups by platform if possible, is in many cases important!
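The date filter can be applied to any generic test through its `where` config. A sketch, with a hypothetical model and column, and a BigQuery-style date expression:

```yaml
models:
  - name: stg_amplitude__app_opened   # hypothetical partitioned staging model
    columns:
      - name: event_id
        tests:
          - unique:
              config:
                # limits the test to recent partitions, keeping it cheap
                where: "date_day >= date_sub(current_date(), interval 3 day)"
```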
3. Models Unit Testing: Singular Functional Tests under the tests folder
These are what dbt calls singular tests (previously bespoke or data tests). They reside under the tests folder.
They are usually more complex, but important, tests that we couldn’t easily handle using generic tests in the model’s yml file. An example is comparing a staging model to its source (usually the number of records): if we filter out some corrupted data and expect that filtering to be minimal (say, no more than 1% of the data), this gets tested automatically.
- This happened with an iTunes model where we were filtering some older corrupted data: because of a new bug in their system, we started filtering considerably more records, which led to unreliable analytics. After finding a fix for this, we added a singular test to catch it if it happens again.
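A singular test is just a SQL file that fails when it returns rows. A sketch of the record-count comparison described above, with illustrative source/model names and a 1% threshold:

```sql
-- tests/assert_stg_itunes_filters_less_than_1_percent_of_source.sql
-- Fails (returns a row) if the staging model dropped more than 1% of source records.
-- The source and model names here are illustrative.
with source_count as (
    select count(*) as n from {{ source('itunes', 'sales_report') }}
),
staging_count as (
    select count(*) as n from {{ ref('stg_itunes__sales') }}
)
select source_count.n as source_records, staging_count.n as staging_records
from source_count
cross join staging_count
where staging_count.n < source_count.n * 0.99
```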
How to name singular tests
Following naming standards from software engineering, we name these as follows: assert_something_expressive_that_describes_the_expected_behavior
- start with assert_
- then describe in plain English (but concisely) what the expectation of the test is
- Example: assert_revenue_is_positive or assert_dim_user_coverage_above_95_percent
This is also how dbt names tests: https://docs.getdbt.com/docs/building-a-dbt-project/tests#singular-tests
Todo: Rename our current tests to fit the new convention
4. Integration & Mock Testing
We still don’t do this kind of testing often, but we want to incorporate it more and more. Its purpose is to simulate the functioning of a whole block of logic. This could be:
- A macro
- A model
- Part of the pipeline with a bunch of linked models
The main idea here is that we want to ensure that given a specific predefined input, we will get the expected output once the data goes through the block we want to test.
Mock Testing for Macros
In the case of a macro, for instance, we usually define a seed dataset that has both the input columns needed by the macro’s logic and the expected output. Then we build a model under the mock_tests folder where we compute the output using the macro. Then, in the model’s yml, we use dbt_utils.equality to test that the output of the macro (a column in the newly added model) matches the expected output defined in the seed file.
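A sketch of the whole setup, assuming a hypothetical macro `cents_to_dollars` and a seed `seed_cents_to_dollars` with columns `input_cents` and `expected_dollars`. The mock model recomputes the expected column with the macro so that dbt_utils.equality can compare the two relations column by column:

```sql
-- models/mock_tests/mock_cents_to_dollars.sql
-- Applies the macro under test to the seeded inputs, reusing the seed's column name.
select
    input_cents,
    {{ cents_to_dollars('input_cents') }} as expected_dollars
from {{ ref('seed_cents_to_dollars') }}
```

```yaml
models:
  - name: mock_cents_to_dollars
    tests:
      - dbt_utils.equality:
          compare_model: ref('seed_cents_to_dollars')
          compare_columns:
            - input_cents
            - expected_dollars   # macro output must match the seeded expectation
```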
The importance of this is that we ensure:
- During development: we are building what we actually want
- After development: any breaking change to the macro will be caught → if we want to change the logic of the macro, we need to change our mock tests!
Currently, we only have mock tests for macros. But as explained earlier in this section, we can apply similar logic to a whole model or to part of the pipeline with chained models. As long as we can define some input data and the output that should match it, we can do this kind of testing.
5. Regression Testing
Regression testing is about making sure that any changes we are about to merge to the production system won’t break any functionalities of the system. In other words, before merging any PR, we want to make sure that this PR won’t cause a ‘regression’ in the current state of the project.
A simple example of this is refactoring with no functional changes: we clean up some code and models without changing anything in the logic (this is what ‘no functional changes’ means). This is quite often the case when we decide to refactor some older code that will be needed for a new feature. We usually open a first refactoring PR that handles the refactoring without any changes to the logic, and then a second PR for the new feature(s). Long story short, since the first PR is not expected to cause any changes to the data of the prod models, we want to test that this is indeed the case => nothing in the dbt models’ data will change after we merge the PR.
Generally, we want to apply non-regression testing all the time by default to be on the safer side.
Currently, we are still doing regression testing manually by comparing dev and testing models (dev models are our local ones; testing models are the PR ones) to production. We do this by comparing statistics related to the model that should capture whether a regression (an unexpected change) happens. An example is comparing daily spend per network and platform for ua_performance_spend over the last 30 days. The compared stats should relate to the function of the model (spend, revenue, installs, ratios, etc…).
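The manual comparison boils down to an ad hoc query like the sketch below, run in the BQ console. The project and dataset names are placeholders; only ua_performance_spend and its dimensions come from our actual setup:

```sql
-- Compare daily spend per network/platform between dev and prod over the last 30 days.
-- Any row returned is a candidate regression to investigate.
with dev as (
    select date, network, platform, sum(spend) as spend
    from `our-project.dbt_dev.ua_performance_spend`    -- placeholder dev dataset
    where date >= date_sub(current_date(), interval 30 day)
    group by 1, 2, 3
),
prod as (
    select date, network, platform, sum(spend) as spend
    from `our-project.dbt_prod.ua_performance_spend`   -- placeholder prod dataset
    where date >= date_sub(current_date(), interval 30 day)
    group by 1, 2, 3
)
select
    coalesce(dev.date, prod.date) as date,
    coalesce(dev.network, prod.network) as network,
    coalesce(dev.platform, prod.platform) as platform,
    dev.spend as dev_spend,
    prod.spend as prod_spend
from dev
full outer join prod
    on dev.date = prod.date
    and dev.network = prod.network
    and dev.platform = prod.platform
where dev.spend is null          -- row missing in dev
    or prod.spend is null        -- row missing in prod
    or dev.spend != prod.spend   -- spend value changed
```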
The audit_helper package
We also experimented a bit with the audit_helper package from dbt itself, but using it was tedious as it requires a lot of manual (not easily automated) steps.
The difference between this approach and the previous one is that here we compare all records one by one using a primary key. The primary key should be defined for the model: either one column that is already unique, or a surrogate key that captures uniqueness by combining a couple of columns → dbt_utils.surrogate_key helps here. The package’s functions help accomplish this, and then we need to check the results ourselves in the BQ console or Metabase and investigate further as needed.
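A sketch of a record-by-record comparison with audit_helper’s compare_relations macro. The project/dataset names and the primary key column are placeholders:

```sql
-- analyses/compare_ua_performance_spend.sql (illustrative)
{% set old_relation = adapter.get_relation(
      database="our-project",          -- placeholder project
      schema="dbt_prod",               -- placeholder prod dataset
      identifier="ua_performance_spend"
) %}

{% set new_relation = ref('ua_performance_spend') %}

{{ audit_helper.compare_relations(
    a_relation=old_relation,
    b_relation=new_relation,
    primary_key="spend_id"             -- placeholder; could be a surrogate key
) }}
```

Compiling and running this produces match/mismatch statistics between the two relations, which then need to be inspected manually.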
In their course on SQL refactoring, dbt shows how they use the package: https://courses.getdbt.com/courses/take/refactoring-sql-for-modularity/lessons/27780472-6-auditing => Worth taking a look at the course, or at least the lesson on auditing, especially if you want to use this approach.
We applied this when switching from an older source to a new one.
Note: As long as we still don’t have a more automated approach (see next short section), it is up to the author to choose whichever of the 2 current approaches works more smoothly in their workflow. What’s important is to use one, and not to skip regression testing, at least for the models that changed but whose outcome isn’t expected to change. And inform your reviewer to get their thoughts as well! → This will be improved on.
Automated Testing (Experimental)
Ultimately we want to automate regression testing as much as possible and integrate it more smoothly into our workflow.
Two possible approaches will be considered:
- Using a tool like DataFold (especially their DataDiff feature)
- Adding, to each functional model we build, a stats model that computes relevant summary stats from the model, and using those stats to do regression tests (comparing the stats from dev/PR models to production ones)