Sources of truth when working with Data

Ahmed Omrane

3 min read · Jun 24, 2022

Context

Before we delve into this subtle topic, we need to list some guiding pointers:

Data captures some aspects of ‘reality’ that we are tracking/measuring for different business objectives (monitoring, optimization, decision making, experimentation, etc…). Hence data is just a representation of the real aspects/patterns we are tracking.
Data is created, instrumented and shared by different systems. The tools we use as well as the raw data sources are ‘reflections’ of what those systems collected and are sharing with us.
In the process of all of this (the previous 2 points), some inefficiencies occur. Always! Divergence from reality is one (think tracking quality). Another one, which is the focus of this page, is discrepancies: ‘seemingly’ the same data is tracked by multiple systems, but the results differ! This is why the concept of a main source of truth is important.

Main source of truth

The 2 main points to take from the previous section in order to navigate the rest of this documentation are:

Seemingly the same data being tracked by multiple systems → Systems are diverse: event instrumentation (like Amplitude), MMPs (like Adjust), UA networks (Facebook, TikTok…), app stores, payment/transaction data, backend databases, etc.
Discrepancies are expected and normal → Expecting otherwise (that everything is perfect) is a fallacy.

A data discrepancy is when two or more sources/systems are supposed to be tracking and reporting the same thing (metrics, records, events, etc.), but their results differ (example: when Adjust raw data differs from the data reported in the Adjust UI). If the tracking is not the same, we don’t talk about a discrepancy (example: iTunes Connect tracks and logs stats about downloads, while Adjust tracks installs. These do not represent the same tracking, so they are expected to be different).
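To make the definition above concrete, here is a minimal sketch of how two views of the same metric could be compared. Everything in it is an assumption for illustration: the function name find_discrepancies, the 2% tolerance, and the daily install counts are made up and do not come from Adjust or any other tool.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    key: str         # e.g. a date or a campaign id
    value_a: float   # value reported by the first source
    value_b: float   # value reported by the second source
    rel_diff: float  # absolute difference divided by the larger value


def find_discrepancies(source_a, source_b, tolerance=0.02):
    """Compare two {key: value} mappings that are meant to track the same metric.

    A key missing from one source is treated as 0 there. The tolerance is a
    hypothetical threshold; pick one that fits the metric being compared.
    """
    found = []
    for key in sorted(set(source_a) | set(source_b)):
        a = source_a.get(key, 0.0)
        b = source_b.get(key, 0.0)
        largest = max(abs(a), abs(b))
        rel_diff = 0.0 if largest == 0 else abs(a - b) / largest
        if rel_diff > tolerance:
            found.append(Discrepancy(key, a, b, rel_diff))
    return found


# Made-up daily install counts for the same metric, seen through two views.
raw_export = {"2022-06-01": 1000, "2022-06-02": 980, "2022-06-03": 1010}
ui_report = {"2022-06-01": 1000, "2022-06-02": 940, "2022-06-03": 1005}

for d in find_discrepancies(raw_export, ui_report):
    print(f"{d.key}: {d.value_a} vs {d.value_b} ({d.rel_diff:.1%})")
```

Note that this comparison only makes sense when both inputs really track the same thing. Running it on downloads versus installs would flag differences that are expected, not discrepancies.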

Data teams need to be aware of the first point and, building on it, master the second one to ensure the reliability of their analytics and algorithms.

Reliability is a data team’s currency. Teams need to guard it all the way through!

Practically, we need to master the following aspects and strive to keep improving on them:

Know what source is the main source of truth (outlined in the next section)
Identify discrepancies, solve what can be solved (not all of them can be), stay informed (and inform our stakeholders) about the others, and definitely document! (Check my last blog post on documentation.)

While the second point ensures that we are taming the discrepancies that we identify during our journey (this is a continuous process and a mindset!), the first one ensures that we do this in an effective, data-informed way. What this means is: when we have multiple sources of some data (say Sales Events) with (potential) discrepancies between them, which is a certainty in my experience, we need to decide which one should be considered the source of truth for us and our stakeholders.

This topic is actually subtler than this. For the sake of staying pragmatic and practical, the next section lists some of the main sources of truth of the data we work with at The Fabulous.

Sources of truth listing (example)

Whenever you are facing a case where you can get the same data from 2 different sources, always ask:

Which one is the closest to the data creation?

Thinking in terms of interconnected systems helps with this (what system collects what data, etc…)

The answer to this will give you the source of truth.

If you can’t find the answer, don’t assume (my golden rule is: Assume Less. Ask More) → Ask others.
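The question above can be sketched as a small lookup. This is only an illustration: the data kinds, the system names, and their ordering are hypothetical placeholders, not the actual listing used at The Fabulous. The important part is the behaviour when no ranking is documented: the function refuses to guess.

```python
# Hypothetical ranking: for each kind of data, systems are ordered from the
# closest to where the data is created to the furthest away from it.
PROXIMITY_TO_CREATION = {
    "sales_events": ["payment_provider", "backend_database", "event_tracking"],
    "installs": ["mmp", "ua_network"],
}


def source_of_truth(data_kind, available_sources):
    """Return the available source that sits closest to data creation."""
    ranking = PROXIMITY_TO_CREATION.get(data_kind)
    if ranking is None:
        # Assume less, ask more: an undocumented case is a question for people.
        raise LookupError(
            f"No documented ranking for {data_kind!r}; ask the data owners."
        )
    for source in ranking:
        if source in available_sources:
            return source
    raise LookupError(
        f"None of {sorted(available_sources)} is ranked for {data_kind!r}; "
        "ask the data owners."
    )


# Two sources expose the same sales events; the higher-ranked one wins.
print(source_of_truth("sales_events", {"event_tracking", "backend_database"}))
```

Keeping the ranking in one documented place means the answer is written down once instead of being re-derived, or assumed, each time two sources disagree.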

This is the listing for the most important data that we currently work with (grouped functionally):
