Documentation in Data & Analytics
Documentation is something that gets misunderstood a lot IMO.
These are my macro-level conceptual thoughts on the topic (they apply to Data and beyond).
Its Objective: Context & Clarity
In more detail, you want to make sure that the future readers/users (yourself included!) of the models/transformations/queries/code/etc. will understand what has been done without needing to come back and ask you about the work you did and why.
Its flavours: More diverse than we might think
A few that I consciously consider:
- Your Code!!! This is where everything starts. The ultimate goal is clarity: don’t write clever code or models, write something that everyone can understand. Your code should document itself.
- Naming: This is basically a corollary of the 1st point, as your code has both the logic it implements and the language/wording/naming that expresses it to a human reader. Use helpful, intuitive names (for files, datasets, data models, columns…). Names like tmp, tmp_1, test_data, etc. are definitely a bad practice. If you are using them, make sure they never get exposed to the outside world! Seriously…
- Comments: If the code (logic and naming) is not enough (for instance, for more complex use cases that require adding historical context and pointing to some external documentation), additional comments in the code are highly recommended. To avoid distracting the next developer/reader, keep them concise and clear, and definitely not confusing. No documentation is better than confusing/misleading documentation.
- ToDos: A ToDo in the code is basically a comment++ (a comment that calls for a future action). Use it strategically, for instance to manage TechDebt smartly: if you see that a refactoring is needed for some older pieces and your deadlines for the current project won’t let you tackle it yourself, definitely add a ToDo (with context). The next reader will revisit this and decide what to do. That’s responsible collaborative development!
- Testing: Tests help the user/reader understand the structure of the model and the data it processes. An example: uniqueness tests help in understanding the granularity of the data (the first thing I check when reading dbt tests).
- Documentation files in codebase (as needed): Almost all ‘self-respecting’ languages offer a conventional way to document your code methodically. For dbt, a yml file attached to the model (sql) you are building is the place where you can do this. Python, for instance, offers docstrings. Use these well. They should be used in conjunction with comments in the code. The right balance between both is an art that gets polished with practice.
- Documentation outside codebase (like on Confluence): Mostly for more complex contextual subjects like documenting the business use case of an ML model, a reporting pipeline, an architecture, a complex data source, etc… My recommendation to my Team at The Fabulous is to document anything related to the code (the concrete logic around it) close to the code, and anything closer to the business (functional documentation) on Confluence.
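To make the in-code flavours above concrete, here is a minimal Python sketch that combines intuitive naming, a docstring, a comment carrying context, and a ToDo. Everything in it is hypothetical and only for illustration: the function name, the `started_on`/`ended_on` fields, and the refactoring mentioned in the ToDo are assumptions, not part of any real codebase.

```python
from datetime import date


def count_active_subscriptions(subscriptions, as_of):
    """Count the subscriptions that are active on a given day.

    Granularity: one input row per subscription (hypothetical schema).
    A subscription is active when it started on or before `as_of` and
    has not ended before `as_of`.

    Args:
        subscriptions: list of dicts with `started_on` (date) and
            `ended_on` (date, or None while still running).
        as_of: the day to evaluate.

    Returns:
        The number of active subscriptions, as an int.
    """
    # Context: in this sketch an open-ended subscription is stored with
    # ended_on=None, so None has to count as "still active".
    # TODO(refactor): hypothetical example of a ToDo with context. This
    # date logic should move into a shared helper once the current
    # deadline has passed, so that reports cannot drift apart.
    return sum(
        1
        for subscription in subscriptions
        if subscription["started_on"] <= as_of
        and (subscription["ended_on"] is None or subscription["ended_on"] >= as_of)
    )
```

The names carry most of the meaning, the docstring states the granularity and the contract, and the comment and ToDo only add what the code cannot say by itself.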
Ultimately, one needs to ask oneself:
‘Will the next reader face friction with what I am creating?’
If the answer is yes, simplify further and document better!
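One cheap way to remove friction is to let a test state the granularity of a dataset, as mentioned under Testing. In a dbt project the built-in `unique` test plays this role. The plain-Python sketch below only illustrates the idea: the helper `find_duplicate_keys` and the `daily_revenue` rows are made up for this example.

```python
from collections import Counter


def find_duplicate_keys(rows, key_columns):
    """Return the key tuples that appear more than once in `rows`.

    An empty result means `key_columns` uniquely identify a row, i.e.
    they describe the granularity of the dataset.
    """
    key_counts = Counter(
        tuple(row[column] for column in key_columns) for row in rows
    )
    return sorted(key for key, count in key_counts.items() if count > 1)


# Hypothetical sample data: one row per (day, country).
daily_revenue = [
    {"day": "2022-06-01", "country": "DE", "revenue": 120.0},
    {"day": "2022-06-01", "country": "FR", "revenue": 80.0},
    {"day": "2022-06-02", "country": "DE", "revenue": 95.0},
]

# Passes: (day, country) is the granularity of this table.
assert find_duplicate_keys(daily_revenue, ["day", "country"]) == []
# `day` alone repeats, which tells the reader the table is finer than daily.
assert find_duplicate_keys(daily_revenue, ["day"]) == [("2022-06-01",)]
```

A reader who sees the first assertion knows what one row means without opening the data.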
Some final guiding thoughts
- Documentation is a tool: It should be considered a major tool in your daily workflow, like your IDE. Treat it well and it will save you and your colleagues/team lots of time later.
- Documentation should be a habit: At Fabulous, we apply Documentation First to reinforce this. I personally learned it the hard way after becoming a bottleneck in some projects because I didn’t put enough effort into the functional documentation (around the business use case, why and how I tackled it).
- Documentation is a skill: Once you start building the habit around documentation, you will start seeing how you can refine it, make it better, make it easier to maintain, use its different flavours (the list from the previous section) more smoothly, etc… I personally started liking this aspect once I got into it (I used to hate documentation, like I used to hate testing…).
- Documentation has to be reviewed: Like code, definitely ask someone from your team to review your documentation. This is, for instance, rarely done for documentation outside the codebase. In my opinion, this is one of the main reasons why documentation gets outdated quickly (because it was poorly done to begin with). Remember: no documentation is better than misleading/confusing documentation.
- Documentation is collaborative: Apply the Boy Scout Rule. Like code, documentation can and should be improved on, especially when it starts getting outdated. I generally state clearly when I am updating a section, especially when I know that things will change in a couple of months. Concrete example: ‘As of June 2022, the way we decided to tackle this problem is … because of ….’.
And one last bonus: once you become smooth and good at documenting important pieces of experience/knowledge, writing blog posts around them becomes easy! This new post of mine is basically repurposed internal documentation I wrote for the Data Team @ The Fabulous.