On Data/Analytics Architecture: How to design your data model for scalable, accurate analytics (dbt focused)

Ahmed Omrane
6 min read · May 7, 2023


Context

Over the last 3 years I started the dbt data modelling @ Fabulous and scaled it up with my Data Team to what I now consider a healthy, mature, scalable architecture that respects good system design, agility of implementation and accuracy of analytics.

As of May 2023, we have converged to a layered architecture with more or less solid data modelling principles that govern each layer and how data flows between layers.

This is how the DAG looks (focused on the most important part, namely UA performance).

The following sections capture these principles.

When reading, keep in mind that:

A well-designed architecture is like a well-designed building: it is easy and intuitive to navigate. In a sense, its beauty is captured by its simplicity.

Sources

All sources have to be explicitly declared in the sources.yml file. Freshness parameters have to be properly defined depending on the importance of the source and on what the business consequences would be if its data stopped refreshing as expected.
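
As an illustration, a minimal source declaration with freshness thresholds might look like the following sketch (the source, table and column names, and the thresholds, are hypothetical):

```yaml
version: 2

sources:
  - name: apple_search_ads              # hypothetical source
    schema: apple_search_ads            # the BQ dataset the connector writes to
    loaded_at_field: _fivetran_synced   # column dbt uses to compute freshness
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: campaign_report
      - name: campaign_history
        # less critical helper table: looser thresholds
        freshness:
          warn_after: {count: 7, period: day}
```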

Ensure that any source table is partitioned on its main date column. This makes filtering on that field, and therefore querying, more performant in the next layer.

Some particularities:

  • history tables don’t need partitioning as they are generally small in size (like the history of campaign setups)
  • BQ (BigQuery) metadata tables and views can’t be partitioned. Those are only used for DataOps

Staging

1–1 relationship with the source table, to do basic column selection (only needed columns are to be selected) and data processing (we apply the DRY principle and push any repeatedly needed transformation as far upstream as possible). No complex joins or aggregations are allowed. If joining with ‘helper’ sources is needed (like getting the most recent campaign naming from a `campaign_history` table), then use a work model (previously called a base model) and join with it. Proper documentation for the model and its columns is mandatory as well.
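
A minimal sketch of such a staging model (all names are hypothetical) could look like this:

```sql
-- models/staging/asa/stg_asa__campaign_report.sql
-- 1-1 with the source: select only the needed columns, rename, light casting.
-- No joins, no aggregations.

select
    report_date,
    campaign_id,
    campaign_name,
    cast(spend as numeric) as spend_usd,
    impressions,
    taps                   as clicks
from {{ source('apple_search_ads', 'campaign_report') }}
```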

Testing is highly important

The involvement of a senior team member is highly recommended (either as author or reviewer) before rolling out any changes to this layer. Solid engineering principles need to be respected to effectively manage tech debt and avoid being forced into frequent changes to this layer. If you see yourself applying frequent changes to this layer, it means one of the following:

  • The design was poor to begin with and some basic design thinking was left out (hence my emphasis on having engineers, not just analysts, working on this)
  • The data model and its use cases are still immature and many pieces are still unstable. In this case, it is probably best not to commit to a mature architecture yet. This would have been the case for my team 1–2 years ago, when the whole data function was still stabilising itself within the company and its already existing business units

This is the only layer that should read data from sources (the tests folder can also read data from sources for testing purposes, but it is not a layer)

Staging shouldn’t read data from other layers

For each source, create a subdirectory and name it after the source (we can use an abbreviation for the source if its name is long and the abbreviation is well known, like using `asa` instead of `apple_search_ads`). Specific to my context: most sources have their Fivetran connectors and BQ schemas sharing their name, or something close to it.

Remark on Filtering

Always filter on the partition field whenever you are querying the source table.
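
Continuing the hypothetical staging sketch above, such a filter could look like this (the partition column name and the cut-off are assumptions):

```sql
select
    report_date,
    campaign_id,
    spend as spend_usd
from {{ source('apple_search_ads', 'campaign_report') }}
-- `report_date` is the (hypothetical) partition column of the source table;
-- filtering on it prunes partitions and keeps scanned bytes (and cost) down
where report_date >= date_sub(current_date(), interval 2 year)
```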

Core

This layer contains the main business logic that serves the business mart use cases in downstream layers/models. It is as important as staging and requires strong engineering practices to contribute to, review, scale and maintain.

Core is the only layer that can read from Staging, and it shouldn’t read from marts or sources

The involvement of a senior team member is highly recommended (either as author or reviewer) before rolling out any changes to this layer.

Test solidly in this layer. Remember that many business cases will be built on top of the models here; build your test cases with this in mind.
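
As an illustration, a core model’s properties file could combine basic and package-provided tests. The model, columns and thresholds below are a sketch, and the `dbt_utils` tests assume that package is installed:

```yaml
version: 2

models:
  - name: core_ua__campaign_performances   # hypothetical core model
    description: One row per campaign per day with consolidated UA performance
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - campaign_id
            - report_date
    columns:
      - name: campaign_id
        tests: [not_null]
      - name: report_date
        tests: [not_null]
      - name: spend_usd
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
```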

Always move logic to Core models when (a sketch follows this list):

  • A piece of functional logic gets consolidated beyond an MVP/PoC and we are fairly convinced that it is here to stay, with potential to expand
  • A piece of logic starts expanding to additional use cases beyond the one that it was initially applied to
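
A minimal sketch of such a core model (refs and column names are hypothetical, continuing the staging example above); it reads only from staging and consolidates logic that several marts can reuse:

```sql
-- models/core/ua/core_ua__campaign_performances.sql
-- consolidates daily spend and installs per campaign; reads ONLY from staging

with spend as (

    select report_date, campaign_id, spend_usd, clicks
    from {{ ref('stg_asa__campaign_report') }}

),

installs as (

    select date(install_timestamp) as report_date, campaign_id, count(*) as installs
    from {{ ref('stg_amplitude__installs') }}   -- hypothetical staging model
    group by 1, 2

)

select
    spend.report_date,
    spend.campaign_id,
    spend.spend_usd,
    spend.clicks,
    coalesce(installs.installs, 0) as installs
from spend
left join installs
    on  installs.report_date = spend.report_date
    and installs.campaign_id = spend.campaign_id
```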

Marts

Marts are the most functional, business-focused models of the architecture. Generally it makes sense to build a mart for each business unit. Models within a mart focus on building the needed analytics logic and metrics and on exposing this data in a way that is easy for end users (business colleagues) to utilise and manipulate. Data flows from Core to marts, and NEVER directly from Staging or Sources.
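
A mart model then stays thin and business-facing, reading only from Core. A sketch (names continue the hypothetical examples above):

```sql
-- models/marts/ua/ua_reporting__daily_spend.sql
-- business-facing mart model: reads from Core only

select
    report_date,
    campaign_id,
    spend_usd,
    clicks,
    installs,
    -- cost per install; safe_divide avoids division-by-zero errors in BQ
    safe_divide(spend_usd, installs) as cpi
from {{ ref('core_ua__campaign_performances') }}
```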

Also avoid flowing data between marts. If you see that you might need that, refactor your architecture to abstract the logic needed by both marts into Core model(s), and then create more targeted functional use cases within each mart as needed. If the use cases are quite similar (ex: monthly spend reporting for Finance and daily reporting for UA), it is highly recommended to build an inter-mart integrity test that compares the same metrics at a highly aggregated level.
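
Such an inter-mart integrity test can be written as a dbt singular test in the tests folder. A sketch, with hypothetical refs and a hypothetical 1% tolerance:

```sql
-- tests/assert_finance_and_ua_monthly_spend_match.sql
-- singular test: returns a row (i.e. fails) for every month where the Finance
-- and UA marts report spend figures that differ by more than 1%

with finance as (

    select month, spend_usd
    from {{ ref('finance_reporting__monthly_spend') }}   -- hypothetical Finance mart

),

ua as (

    select date_trunc(report_date, month) as month, sum(spend_usd) as spend_usd
    from {{ ref('ua_reporting__daily_spend') }}
    group by 1

)

select
    finance.month,
    finance.spend_usd as finance_spend_usd,
    ua.spend_usd      as ua_spend_usd
from finance
inner join ua
    on ua.month = finance.month
where abs(finance.spend_usd - ua.spend_usd) > 0.01 * finance.spend_usd
```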

Exposure models are naturally part of Marts and NEVER of another layer

Folder Structure and Model Naming Conventions (apply to all layers)

Create sub-folders under each layer’s folder in order to add another level of ontological structure (don’t throw models into a layer’s folder without thinking about how to categorise that model and the next similar ones)

Don’t define materialisation locally in the model files; rather, define it globally in the dbt_project.yml as much as possible
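
For example, in `dbt_project.yml` (the project and folder names are hypothetical, and the per-layer materialisations are assumptions to adapt to your own defaults):

```yaml
# dbt_project.yml (excerpt): materialisation set per layer/folder, not per model
models:
  fabulous_dbt:                 # hypothetical project name
    staging:
      +materialized: view
    core:
      +materialized: table      # core models as tables
    marts:
      +materialized: table
      ua:
        work:
          +materialized: view   # work models default to views
```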

Name models as follows:

Source:

Don’t hesitate to use an alias to give a more convenient name to sources (datasets as well as tables) that have non-expressive names. For example, instead of appxxxxx (the x’s represent the project number as defined by Amplitude in our case), we use amplitude_android_fabulous.
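
One way to do this in dbt is via the source’s `schema` and the table’s `identifier` properties, so that models reference the expressive name. A sketch (the names below are placeholders, not the real ones):

```yaml
# sources.yml (excerpt)
sources:
  - name: amplitude_android_fabulous   # expressive name used in source() calls
    schema: appxxxxx                   # the actual, non-expressive BQ dataset name
    tables:
      - name: events
        identifier: events_raw         # hypothetical: the actual table name, if also non-expressive
```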

Staging:

`[stg]_[subdirectory]__[bq_source_table_name]`

Core:

  • `[core]_[subdirectory_name]__[relevant_expressive_model_name]`
  • work models: `[work_core]_[subdirectory_name]__[relevant_expressive_model_name]`
  • Use plural for the entities (call it core_users__… instead of core_user__…). This is a common practice in engineering and the one recommended by dbt Labs as well
  • Materialise all models as tables. Work models can be materialised as views. Work models are not expected to be called from outside of their subdirectory

Marts:

  • `[mart_name_or_abbreviation]_[subdirectory_name]__[relevant_expressive_model_name]`
  • work models: `[work_mart_name_or_abbreviation]_[subdirectory_name_or_abbreviation]__[relevant_expressive_model_name]`

Note on ‘work’ models

  • Work models are usually meant to serve more important downstream models. They are generally materialised as views
  • We can decide to create such a model (instead of just keeping it as a CTE) to have better control over its data, e.g. to test it, and to simplify the code of the main functional models → Think modularity and testing! (see the sketch after this list)
  • [Rule of thumb] If a piece of logic is important to test, create a work model for it
  • [Rule of thumb] If the code of the main model becomes too long and complex (not for the author but for other readers!), think about creating work models
  • [Rule of thumb] If a work model starts getting used by other models from other business marts (basically other business cases), it is no longer a work model and a proper refactoring is needed! Consider moving some logic to Core
  • [Convention] A work model is hence better put close to where it serves (not where its source data is coming from)
  • [Convention] We put these under a sub-folder called work
  • [Convention] All work models start with the prefix work_ to make identifying them in BQ easy
  • [Convention] Work models are generally not meant to be used for analyses in BQ. If you see that you are using a work model a lot for ad hoc analyses, consider doing some architectural thinking and refactoring
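
As a sketch (all names are hypothetical), a work model that isolates the “latest campaign name” logic for a UA mart could look like this:

```sql
-- models/marts/ua/work/work_ua_reporting__latest_campaign_names.sql
-- work model (materialised as a view): isolates a small, testable piece of logic
-- that the main mart models join against

select
    campaign_id,
    campaign_name
from {{ ref('core_ua__campaign_histories') }}   -- hypothetical core model
-- keep only the most recent name per campaign
qualify row_number() over (partition by campaign_id order by updated_at desc) = 1
```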

A couple of final (intentionally redundant) thoughts:

  • Explicitly define all sources and exposures
  • No source should be called outside of staging
  • No core model gets exposed to business users without going through a business mart
  • No interdependency between different marts (or at least reduce it)
  • No more than 3 nested sub-folders within a specific layer
  • Use work models tactfully and avoid cascading them often
  • Test all over the place: diversify the tests, don’t just stick to the basics. Categorise tests to make them easy to apply: try integration, proportion and non-regression tests… Keep refining…

Final Note: dbt has lots of interesting material on Data Modelling and Analytics Architecture, like these ones here. Take a look at them for a more complete picture.
