BQ+DBT: 5 proven practices to scale you analytics infrastructure effectively without exploding your cloud costs

At , we have been using BQ and DBT as the core of our Data Analytics for the past 2 years. As of end of 2022, we have fully operationalised User Acquisition reporting and attribution, successfully transferring the SoT (Source of Truth) from scattered tools (Ad-Networks UIs, Amplitude UI, Adjust UI) to the data team. In terms of numbers, we have 150+ dbt models and 1500+ unit and integration tests that ensure a strong level of observability and reinforces our model’s accuracy.

All of this was achieved with the lowest cost one can spend on those 2 tools:

  • For dbt we are still using dbt core (no dbt Cloud costs)
  • For BQ, we are still using the flat rate pricing with only 100 slots that cost less than 2000 dollars a month.

In this article, I will be sharing the learnings and distilled techniques we are currently using to manage our Analytics Infra efficiently without falling into the common trap of: Let’s by more tools/cloud resources!

Invest in Learnings/Knowledge more than in ‘unneeded’ SaaS/Cloud

Before I start sharing my team’s practices, I’d like to state my positioning quite clearly on the topic of: Should we invest in learnings/knowledge or SaaS/Cloud ?

My experience-based call here is: Invest in mature controllable Tools (SaaS/Cloud) and then invest the rest in consolidating the internal knowledge/learnings how to best manage these and scale both as you grow hand-in-hand.

The main error I see many businesses doing is that they will over-invest in SaaS/Cloud and ignore the learning/knowledge. What happens when you do this is that your costs will keep increasing while your ability as a team/business to manage these tools will stagnate resulting in exponentially unmanageable Tech-Debt and Spending. And unless you have some major funding supporting you (don’t think you should count on this too much these days with the looming recession that it is hitting the world and the extremely over-valued Tech world in particular), this tactic can be disastrous later on.

PS: Our strategy at Fabulous was: Invest in mature SaaS tools (Fivetran, Amplitude), ‘needed’ manageable Cloud (BQ + S3), free low cost SaaS that easily integrates with our stack (dbt, data studio, metabase, google sheets); and then persist on leveling up the knowledge of the team and scale everything linearly as the business and the use cases mature and grow.

I won’t go too deep down this rabbit-hole to stay focused on the topic of this article. So after establishing this premise, let’s dive deeper.

BQ Tactics

1. Switch to BQ Flat Pricing

BQ has 2 main pricing plans: On-Demand vs Flat pricing.

End of 2020, after migrating our data warehouse from Redshift to BQ for analytics performance reasons, we started with the On-Demand pricing. Things went smooth at the beginning as we set quotas for users to not go beyond, but then within 2 months, the expected ‘unexpected’ happened: Some crazy queries evaded the quotas control and ended up costing us in 3 days what we usually get billed in 30 days (fyi: Google bills 5$ for 1TB of processing).

This called for a change in tactics, we switched to the flat pricing and I started building the practices to fit this restriction on resources. As I mentioned earlier, we opted for the lowest number of slots (Slots is how Google manages BQ processing resources, and we bought 100 only — the absolute minimum).

2. Partition all heavy data sources before using them in dbt

The huge advantage of flat pricing is that you know exactly how much you will get billed each month. No surprises that will get Finance to go crazy and call for some explanation why the cost exploded!

The inconvenience is that your queries need to be more efficient.

Concretely, the main tactic that we applied here was partitioning: More specifically any source table that we are using in dbt workflows and that had a couple of GB to a couple of TB size got parititioned using an adequate date column. As most of our use case cases were and still are based on some date based cohorting (basically aggregating over days of installs or sale for instance), partitioning over the event dates made the most sense.

For this tactic to work, the team needed to get educated around the need to use the partition fields by default even for simple analyses to not over-stretch BQ resource. This helped consolidating a new solid practice with the engineering learnings behind it leading to easier maintaince of queries and models. More on this in the next tactic…

Bonus: Clustering

Clustering can also be used if the business use case for it exists. In our case, we additionally clustered the events tables (3 tables with up to 10TB each) using the field ‘event_type’. This helped tremendously as only a dozen of events is needed on a regular basis for analyses and only a handful is needed for dbt production workflows. More practically, whenever someone queried these tables, they had to use the partitioning `event_time` field and the clustering `event_type` field together in the where clause. This helped tame those beastly tables quite well without the need to double/triple/… our BQ costs.

Note: BQ has the feature of enforcing the need to use the partitioning field in queries (in the where cause), but this is not compatible with Fivetran connections (they break when setting this as a requirement). So, we couldn’t use this tactic.

DBT Tactics

The previous 2 tactics were BQ focused and they basically setup the frame for the analytics engineering tactics to use in dbt, as everything has to work in sync for our cost-managing strategy to work.

3. Set env-dependent look-back windows

We are a 4–5 individuals team and we version control everything we do. We are by the end of the day highly engineering driven. As such, we had 3 types of environments:

  • Dev: Local development
  • Test: Girhub PRs run here (we use github actions wokflows for our CI/CD)
  • Prod: 3 types of production workflows (also github actions). The 3 workflows fulfil different purposes: regular scheduled production runs every 6 hours. Post PR merge workflows to update the production data right after a change is rolled out without waiting for the next scheduled run to happen. And a weekly full refresh workflow (this was added when we introduced dbt incremental models — see 5th tactic)

Note: All environments access the production source tables. In other words, as for now we didn’t create a staging project on BQ with a duplicate of the sources (potentially with filtered and/or sampled data). We might apply this tactic later in the future as we mature further!

As such, it could easily happen that 3–5 full dbt workflows would be concurring for BQ resources. This caused lots of bottlenecks at the beginning. The solution, was to methodically limit the lookback window in all github workflows as follows:

  • Test workflows: Only last 3 months of data are to be processed for testing PRs. This is more than enough to create an environment similar to production and test the analytics/modelling logic there consistently before allowing any PR to get merged. (fyi: Later down the line, we had some use cases of models needing more historical data to be tested properly. When this happened we introduced rule-based exceptions to allow dbt to collect more data from the sources needed for these models)
  • Prod: ~ Last 13 months of data is collected. 13 months are usually enough for business people to have the metrics/KPIs they need and for analysts to do exploratory analyses on current and recent patterns. Concretely, 13 months allow to have 1 full year of data, plus the possibility to do a year-over-year comparison for the last month or 30 days

Now, you might be asking how we did this in dbt. The answer is in the next tactic!

4. Introduce Staging Layer: Control access to data sources and filter on partition fields

This tactic here is basically the glue that got all of the previous ones to fit together in our dbt project quite neatly. For the filtering to work properly and methodically, we decided to level up the architecture and introduce a solid staging layer.

In addition to being the first quality control layer where data processing, cleaning, testing and documentation are an absolute MUST for all columns used in downstream models, this layer ensured 2 additional things:

  • Raw sources (tables uploaded by Fivetran connectors) are only accessed through staging. In other words, no other models have the right to access sources
  • The date filter mentioned in the previous tactic are implemented in all staging models. The staging_date_filter_months is a project variable that depends on the environment as explained earlier. This also ensured that the look-back window will apply to all downstream models!
Staging layer date filter applied to the partition field of a source table

Remark: Don’t user the datediff function from dbt as when compiled the code won’t actually make use of the partitioning field properly. In other words, don’t use the following syntax

This syntax doesn‘t use paritioning properly!

5. Use incremental model for heavier sources

The previous 4 tactics got us moving for 1 whole year without the need to buy new BQ slots, and then as our use cases started expanding around more events data (remember that those tables are the heaviest with 10+ TB each), we hit a new bottle neck. The bottleneck was the processing of those tables, which even with the date partitions and the clustering, having 3–5 workflows building staging models for around 20 events from those tables over-stretched the 100 slots of BQ beyond the limit. Also, the staging of these events require lots of json extractions which is not the strongest suit of BQ (Snowflake is better at this)

The solution for this was the famous incremental models form dbt. We resisted adding these to the mix till it became really useful and to some extend necessary. We only applied this to events staging models as those where the heaviest to reprocess with each dbt run as we were doing till that point (normal dbt models reprocess all the data that the queries are asked to process).

The main benefit of this tactic was that now, we can build the bulk of heavy events staging model once. Examples:

  • When PR is opened and it testing workflow is triggered for the 1st time
  • The first prod run after a dbt run with the full-refresh flag and then just keep updating them with the new data as it comes

Another benefit is that: if the incremental models are paritioned, we can filter the dbt tests to only test recent data and not all the collected data. This become the standard for staging incremental models as those models are heavily tested and using partitioning there makes the tests more manageable

The main inconvenience was to manage the full-refreshing for the incremental models when and as needed:

  • Manual github action to allow developers deleting PR data themself: To manage access control to different kinds of data, PR data can only be manipulated/altered by the PR testing service account. Hence, a developer (analyst of analytics engineer) can only delete a PR data by running this action that uses the proper service account. This helps managing the risk of giving everyone the rights to alter PR/testing data. Why would a developer need to delete such data, you might ask: This actually happens quite often when the PR is about changing the data or the schema for an incremental model and hence the need to drop it and recreate it!
  • Weekly full-refresh product workflows: We introduced the weekly full-refresh production workflow to manage the risk of some corrupted data sneaking into the incremental models without our tests catching that. When would this happen, you might ask: When the logic of processing for a column changes (schema doesn’t change) and no test fails, the change gets merged and the incremental model ends up with outdated logic for data before the change and updated logic for data after the change. This puts at risk the accuracy/reliability of the data (remember Data Team’s models are the Source of Truth for UA KPIs and performance measurements)
  • The need for an admin to drop any production staging model that went through some schema/logic changes: We didn’t automate this. We still forget to do this one every now and then till we get notified that the workflow that should have updated production data after a PR merged failed due to a change in the schema of an incremental

As you see, incremental models are not just magic solutions. They come with a hidden cost as any advanced technique/tactic/tool. That’s why, we only introduced them this year, after we consolidated lots of learnings and knowledge around our data, stack and practices.

Conclusion: Invest in Learning and Knowledge

The tactics I listed here are the distilled learnings from 2 years of a committed journey to add value to the business as best as we could. For us, this meant that needed to commit to the strategy of investing in Learning and Tools hand-in-hand linearly. This helped me personally and the team level up tremendously in terms of skills and competence.

SaaS Tools and Cloud Resources will eat you (your budget) alive if you let them run loose with no sense of responsibility and ownership to tame them and learn more about how to use them effectively in your context. I saw many get devoured by their cloud costs due to insisting on duct-taping clumsy -and sometimes plain ugly- engineering practices with more tools and more cloud resources.

As I said earlier, my standpoint is that both factors (learning/practices & Tools) should evolve and mature hand-in-hand for sustainable growth for data teams and business use cases cover by these teams. You can play it differently and run your analytics team and infra more hedonistically and wildly, but be ready to pay the costs of such risky endeavour later when your TechDebt catches up with you. Maybe that the best strategy for you…

PS: I think under-investing in resources is also risky if it ends up slowing up the business! Be careful of that as well. A well-thought through flexible/agile balance is needed IMO. But again, this depends highly on your context.

--

--

Head of Data & Analytics @ Fabulous. On Data, Analytics, Tech, Business and Life…

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ahmed Omrane

Head of Data & Analytics @ Fabulous. On Data, Analytics, Tech, Business and Life…