Sources of truth when working with Data
Context
Before we delve into this subtle topic, we need to list some guiding pointers:
- Data captures some aspects of ‘reality’ that we are tracking/measuring for different business objectives (monitoring, optimization, decision making, experimentation, etc…). Hence data is just a representation of the real aspects/patterns we are tracking.
- Data is created, instrumented and shared by different systems. The tools we use as well as the raw data sources are ‘reflections’ of what those systems collected and are sharing with us.
- In the process of all of this (the previous two points), some inefficiencies occur. Always! Divergence from reality is one (think tracking quality). Another one, which is the focus of this page, is discrepancies: ‘seemingly’ the same data is tracked by multiple systems, but the results differ! This is why the concept of a main source of truth is important.
Main source of truth
The main 2 points to take from the previous section in order to navigate the rest of this documentation are:
- Seemingly the same data being tracked by multiple systems → Systems are diverse: event instrumentation (like Amplitude), MMPs (like Adjust), UA networks (Facebook, TikTok…), app stores, payment/transactions data, backend databases, etc.
- Discrepancies are expected and normal → Expecting otherwise (that everything is perfect) is a fallacy
A data discrepancy is when two or more sources/systems are supposed to be tracking and reporting the same thing (metrics, records, events, etc.), but their results differ (example: when Adjust raw data differs from Adjust UI reporting data). If the tracking is not the same, we don’t talk about a discrepancy (example: iTunes Connect tracks and logs stats about downloads, while Adjust tracks installs => these do not represent the same tracking, and they are expected to be different).
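To make the definition concrete, here is a minimal sketch of how two sources that are supposed to report the same metric could be compared. Everything in it is an assumption for illustration: the function names, the 2% tolerance, and the daily counts are made up, and they do not come from any real Adjust export or report.

```python
def relative_discrepancy(a: float, b: float) -> float:
    """Gap between two reported values, as a fraction of the larger one."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 0.0  # both sources report zero, so there is no gap
    return abs(a - b) / largest


def find_discrepancies(source_a: dict, source_b: dict, tolerance: float = 0.02) -> dict:
    """Return {key: (value_a, value_b, gap)} for keys where the sources disagree.

    A key missing from one side is always flagged, with None for the
    missing value and for the gap.
    """
    flagged = {}
    for key in sorted(set(source_a) | set(source_b)):
        a = source_a.get(key)
        b = source_b.get(key)
        if a is None or b is None:
            flagged[key] = (a, b, None)
            continue
        gap = relative_discrepancy(a, b)
        if gap > tolerance:
            flagged[key] = (a, b, gap)
    return flagged


# Hypothetical daily install counts (made-up numbers, for illustration only).
raw_export = {"2024-01-01": 1000, "2024-01-02": 1200, "2024-01-03": 900}
ui_report = {"2024-01-01": 1000, "2024-01-02": 1150, "2024-01-04": 50}

flagged = find_discrepancies(raw_export, ui_report, tolerance=0.02)
for day, (a, b, gap) in flagged.items():
    print(day, a, b, gap)
```

The tolerance is the important design choice here: since discrepancies are expected, the goal is not to flag every difference but to flag the ones large enough to deserve investigation or documentation.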
Data teams need to be aware of the first point and, building on it, master the second one to ensure the reliability of their analytics and algorithms.
Reliability is a data team’s currency. Teams need to guard it all the way through!
Practically, we need to master the following aspects and strive to keep improving on them:
- Know what source is the main source of truth (outlined in the next section)
- Identify discrepancies, solve what can be solved (not all of them can be), stay informed about the rest (and inform our stakeholders), and definitely document! (Check my last blogpost on documentation.)
While the second point ensures that we are taming the discrepancies that we identify during our journey (this is a continuous process and a mindset!), the first one ensures that we do this in an effective, data-informed way. What this means is: when we have multiple sources for some data (say Sales Events) with potential discrepancies (a certainty, in my experience), which one should be considered the source of truth for us and our stakeholders?
This topic is actually subtler than this. For the sake of staying pragmatic and practical, the next section lists some of the main sources of truth for the data we work with @ The Fabulous.
Sources of truth listing (example)
Whenever you are facing a case where you can get the same data from 2 different sources, always ask:
- Which one is the closest to the data creation?
Thinking in terms of interconnected systems helps with this (what system collects what data, etc…)
The answer to this will give you the source of truth.
If you can’t find the answer, don’t assume (My golden rule is: Assume Less. Ask More) → Ask others
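The rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Source` class, the `hops_from_creation` field, and the hop counts are hypothetical, and the example sources are not the actual listing for any team. The point is the shape of the decision: pick the source closest to where the data is created, and refuse to pick when the lineage of a candidate is unknown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    name: str
    # Number of systems the data passes through after it is created.
    # None means we do not know yet (hypothetical field, for illustration).
    hops_from_creation: Optional[int]


def pick_source_of_truth(candidates: list) -> Source:
    """Return the candidate closest to data creation.

    Raises ValueError when any candidate has unknown lineage:
    assume less, ask more.
    """
    unknown = [s.name for s in candidates if s.hops_from_creation is None]
    if unknown:
        raise ValueError(f"Unknown lineage for {unknown}: ask the owners before choosing")
    return min(candidates, key=lambda s: s.hops_from_creation)


# Hypothetical candidates for sales events; the hop counts are assumptions.
candidates = [
    Source("event_instrumentation", hops_from_creation=2),
    Source("payment_provider", hops_from_creation=0),
    Source("backend_database", hops_from_creation=1),
]

print(pick_source_of_truth(candidates).name)
```

Raising an error for unknown lineage, rather than guessing a default, is deliberate: it turns a silent assumption into a question for whoever owns the system.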
This is the listing for the most important data that we currently work with (grouped functionally):
